我们的弹性beanstalk实例在error_log中报告以下消息
[Mon Jun 26 22:01:01.878892 2017] [mpm_prefork:notice] [pid 8595] AH00173: SIGHUP received. Attempting to restart *** Error in (wsgi:wsgi) ': double free or corruption (out): 0x00007f564cced560 ***
有时错误序列看起来更像是这样:
[Tue Jun 27 00:01:01.215260 2017] [:error] [pid 6429] [remote XX.XXX.XX.195:29773] mod_wsgi (pid=6429): Exception occurred processing WSGI script '/opt/python/current/app/site/settings/wsgi/__init__.py'. [Tue Jun 27 00:01:01.215320 2017] [:error] [pid 6429] [remote XX.XXX.XX.195:29773] OSError: Failed to write data [Tue Jun 27 00:01:01.222407 2017] [:error] [pid 6430] [remote XX.XXX.XX.60:53313] mod_wsgi (pid=6430): Exception occurred processing WSGI script '/opt/python/current/app/site/settings/wsgi/__init__.py'. [Tue Jun 27 00:01:01.222460 2017] [:error] [pid 6430] [remote XX.XXX.XX.60:53313] OSError: Failed to write data [Tue Jun 27 00:01:04.554810 2017] [core:warn] [pid 8595] AH00045: child process 7614 still did not exit,sending a SIGTERM [Tue Jun 27 00:01:04.554850 2017] [core:warn] [pid 8595] AH00045: child process 7615 still did not exit,sending a SIGTERM [Tue Jun 27 00:01:05.555958 2017] [mpm_prefork:notice] [pid 8595] AH00173: SIGHUP received. Attempting to restart *** Error in (wsgi:wsgi) ': double free or corruption (out): 0x00007f5640cae900 *** *** Error in (wsgi:wsgi) ': double free or corruption (out): 0x00007f78649b7970 ***
几乎每个小时都会这样.常见的信息是:
[Mon Jun 26 22:01:01.878892 2017] [mpm_prefork:notice] [pid 8595] AH00173: SIGHUP received. Attempting to restart
我查找了mpm_prefork模块conf块…并且没有一个,所以所有默认值都在使用.
我查找了由弹性beanstalk执行的logrotation命令
/var/log/httpd/* { size 10M missingok notifempty rotate 5 sharedscripts compress dateext dateformat -%s create postrotate /sbin/service httpd reload > /dev/null 2>/dev/null || true endscript olddir /var/log/httpd/rotated }
很标准的东西.我对重装的理解是尝试优雅重启……
我可以通过执行sudo apachectl -k restart手动触发错误消息,虽然我无法找到在日志轮换期间运行的位置.
我们有下游服务似乎在此服务器挂起它的所有连接时抛出异常.
所以我的问题是,在logrotate期间还有什么可能导致mpm_prefork中的SIGHUP?据我所知,这不应该在错误条件之外发生.
Apache / 2.4.18(亚马逊)mod_wsgi / 3.5 Python / 3.4.3
解决方法
再生产
我们创建最简单的Python WSGI应用程序.
application.py
import time def application(environ,start_response): # somewhat realistic response duration time.sleep(0.5) status = '200 OK' response_headers = [('Content-type','text/plain')] start_response(status,response_headers) return [b'Hello world!\n']
将其压缩到application.zip.然后创建Elastic Beanstalk Python应用程序和环境,上传档案.确保使用您拥有的密钥对.保留其他设置默认值.等到它完成(几分钟).
ssh到底层的EC2实例中(参见EB日志中的实例标识符).输入(httpd的logrotate post-action,见下文):
sudo /sbin/service httpd reload
然后在你的机器上运行:
siege -v -b -c 10 -t 10S http://your-test-eb.you-aws-region.elasticbeanstalk.com/
当它运行时,重复重载命令几次.
然后你会看到如下内容:
** SIEGE 3.0.8 ** Preparing 10 concurrent users for battle. The server is now under siege... HTTP/1.1 200 0.63 secs: 13 bytes ==> GET / HTTP/1.1 200 0.65 secs: 13 bytes ==> GET / HTTP/1.1 200 0.64 secs: 13 bytes ==> GET / HTTP/1.1 200 0.60 secs: 13 bytes ==> GET / ...
这是重装时会发生什么.
HTTP/1.1 504 0.06 secs: 0 bytes ==> GET / HTTP/1.1 504 0.07 secs: 0 bytes ==> GET / HTTP/1.1 504 0.08 secs: 0 bytes ==> GET / HTTP/1.1 504 0.10 secs: 0 bytes ==> GET / HTTP/1.1 504 0.11 secs: 0 bytes ==> GET / HTTP/1.1 504 0.66 secs: 0 bytes ==> GET / HTTP/1.1 504 0.19 secs: 0 bytes ==> GET / HTTP/1.1 504 0.20 secs: 0 bytes ==> GET / HTTP/1.1 504 0.09 secs: 0 bytes ==> GET /
然后它恢复了.
HTTP/1.1 200 1.25 secs: 13 bytes ==> GET / HTTP/1.1 200 1.24 secs: 13 bytes ==> GET / HTTP/1.1 200 1.26 secs: 13 bytes ==> GET / ... Lifting the server siege.. done. Transactions: 75 hits Availability: 81.52 % Elapsed time: 9.40 secs Data transferred: 0.00 MB Response time: 1.21 secs Transaction rate: 7.98 trans/sec Throughput: 0.00 MB/sec Concurrency: 9.68 Successful transactions: 75 Failed transactions: 17 Longest transaction: 4.27 Shortest transaction: 0.06
请注意,ELB似乎对问题没有任何影响,并且可以通过两个SSH会话复制到底层EC2并且(Amazon AMI没有围攻):
ab -v 4 -c 10 -t 10 http://your-test-eb.you-aws-region.elasticbeanstalk.com/
原因
/etc/cron.hourly/cron.logrotate.elasticbeanstalk.httpd.conf
#!/bin/sh test -x /usr/sbin/logrotate || exit 0 /usr/sbin/logrotate /etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf
/etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf
/var/log/httpd/* { size 10M missingok notifempty rotate 5 sharedscripts compress dateext dateformat -%s create postrotate /sbin/service httpd reload > /dev/null 2>/dev/null || true endscript olddir /var/log/httpd/rotated }
请注意postrotate. / sbin / service只是/etc/init.d/中脚本的System V包装器.它的手册页说:
service
runs a System V init script in as predictable environment as possible,removing most environment variables and with
current working directory set to/
.
请注意,重新加载不是标准的Apache maintenance command.这是发行版的下游添加.让我们看一下init脚本/etc/init.d/httpd.相关部分如下:
reload() { echo -n $"Reloading $prog: " check13 || exit 1 killproc -p ${pidfile} $httpd -HUP RETVAL=$? echo }
如您所见,它向Apache发送HUP信号,解释为Restart Now:
Sending the
HUP
orrestart
signal to the parent causes it to kill off its children like inTERM
,but the parent doesn’t exit. It re-reads its configuration files,and re-opens any log files. Then it spawns a new set of children and continues serving hits.
TERM很好地解释了504s.但它可能应该如何完成是Graceful Restart,因为它也重新打开日志但不终止正在提供的请求:
The
USR1
orgraceful
signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they’re not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration,which begins serving new requests immediately.…
The code was written to both minimize the time in which the server is unable to serve new requests (they will be queued up by the operating system,so they’re not lost in any event) and to respect your tuning parameters.
可以使用.ebextensions
替换/etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf.在根目录中创建
.ebextensions / 10_logs.config包含以下内容(基本上将“reload”替换为“graceful”):
files: "/etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf": mode: "000644" owner: root group: root content: | /var/log/httpd/* { size 10M missingok notifempty rotate 5 sharedscripts compress dateext dateformat -%s create postrotate /sbin/service httpd graceful > /dev/null 2>/dev/null || true endscript olddir /var/log/httpd/rotated }
并重新部署您的Elastic Beanstalk环境.请注意,然而随后的亚秒优雅重启,我能够(零星地)产生503服务不可用,但是,对于日志轮换不是这样,因为均匀间隔的优雅重启没有错误.