During testing, our error rate suddenly increased.
After analyzing the platform and application behavior, we noticed:
> the CPU of the Postgres RDS instance was at 100%
> freeable memory dropped on the same server
In the Postgres logs, we saw:
2018-08-21 08:19:48 UTC::@:[XXXXX]:LOG: server process (PID XXXX) was terminated by signal 9: Killed
After investigating and reading the documentation, it seems one possibility is that the Linux OOM killer ran and killed the process.
But since we are using RDS, we cannot access the system logs in /var/log/messages to confirm this.
Could someone:
> confirm that the OOM killer can really run on AWS RDS for Postgres
> give us a way to check for this?
> give us a way to compute the maximum memory used by Postgres based on the number of connections?
I could not find the answer here:
> http://postgresql.freeideas.cz/server-process-was-terminated-by-signal-9-killed/
> https://www.postgresql.org/message-id/CAOR%3Dd%3D25iOzXpZFY%3DSjL%3DWD0noBL2Fio9LwpvO2%3DSTnjTW%3DMqQ%40mail.gmail.com
> https://www.postgresql.org/message-id/04e301d1fee9%24537ab200%24fa701600%24%40JetBrains.com
Regarding memory allocation, here is what AWS recommends:
An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. To tell if your working set is almost all in memory, check the ReadIOPS metric (using Amazon CloudWatch) while the DB instance is under load. The value of ReadIOPS should be small and stable. If scaling up the DB instance class (to a class with more RAM) results in a dramatic drop in ReadIOPS, your working set was not almost completely in memory. Continue to scale up until ReadIOPS no longer drops dramatically after a scaling operation, or ReadIOPS is reduced to a very small amount. For information on monitoring a DB instance's metrics, see 07001.
Additionally, here is what they suggest for troubleshooting possible OS issues:
Amazon RDS provides metrics in real time for the operating system (OS) that your DB instance runs on. You can view the metrics for your DB instance using the console, or consume the Enhanced Monitoring JSON output from Amazon CloudWatch Logs in a monitoring system of your choice. For more information about Enhanced Monitoring, see 07002.
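Since RDS hides /var/log/messages, Enhanced Monitoring is the closest substitute: it publishes OS-level metrics as JSON documents into the RDSOSMetrics CloudWatch Logs group. A minimal sketch of tailing those events with the AWS SDK for Java v2, assuming Enhanced Monitoring is enabled on the instance:

import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;
import software.amazon.awssdk.services.cloudwatchlogs.model.FilteredLogEvent;
import software.amazon.awssdk.services.cloudwatchlogs.model.FilterLogEventsRequest;
import software.amazon.awssdk.services.cloudwatchlogs.model.FilterLogEventsResponse;

public class EnhancedMonitoringTail {
    public static void main(String[] args) {
        try (CloudWatchLogsClient logs = CloudWatchLogsClient.create()) {
            // Enhanced Monitoring writes one JSON document per sample into
            // the RDSOSMetrics log group; each instance gets its own stream
            FilterLogEventsRequest request = FilterLogEventsRequest.builder()
                    .logGroupName("RDSOSMetrics")
                    .limit(5)
                    .build();

            FilterLogEventsResponse response = logs.filterLogEvents(request);
            for (FilteredLogEvent event : response.events()) {
                // Each message is a JSON blob with memory, cpuUtilization, etc.
                System.out.println(event.message());
            }
        }
    }
}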
Note that, as a last resort, you could switch to Aurora, which is PostgreSQL-compatible:
Aurora features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance. Aurora delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones.
EDIT: regarding your problem specifically with PostgreSQL, check this Stack Exchange thread – they had a long-running connection with autocommit set to false:
We had a long connection with auto commit set to false:
connection.setAutoCommit(false)
During that time we were doing a lot of small queries and a few queries with a cursor:
statement.setFetchSize(SOME_FETCH_SIZE)
In JDBC you create a connection object, and from that connection you create statements. When you execute the statements you get a result set. Now, every one of these objects needs to be closed, but if you close the statement, the result set is closed, and if you close the connection, all the statements are closed along with their result sets.

We were used to short-lived queries with connections of their own, so we never closed statements, assuming the connection would handle these things once it was closed.

The problem was now with this long transaction (~24 hours) which never closed the connection. The statements were never closed. Apparently, the statement object holds resources both on the server that runs the code and on the PostgreSQL database.

My best guess as to what resources are left in the DB is the things related to the cursor. The statements that used the cursor were never closed, so the result set they returned was never closed as well. This meant the database didn't free the relevant cursor resources in the DB, and since it was over a huge table, it took a lot of RAM.
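To illustrate the fix the thread implies, here is a minimal JDBC sketch, with a hypothetical table and fetch size, that uses try-with-resources so the statement and result set (and therefore the server-side cursor) are released as soon as the work is done, rather than when the connection finally closes:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CursorCleanupSketch {
    public static void main(String[] args) throws SQLException {
        // Placeholder JDBC URL and credentials
        try (Connection connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            connection.setAutoCommit(false); // cursors require an open transaction

            // try-with-resources guarantees the statement and result set are
            // closed, so the server-side cursor resources are freed promptly
            try (PreparedStatement statement = connection.prepareStatement(
                    "SELECT * FROM huge_table")) {
                statement.setFetchSize(1000); // stream rows instead of loading them all
                try (ResultSet rs = statement.executeQuery()) {
                    while (rs.next()) {
                        // process each row here
                    }
                }
            } // the statement and its cursor are released here, not at connection close

            connection.commit();
        }
    }
}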
Hope this helps!