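What puzzles me is that I use the xymon monitoring system and got no advance warning. I have graphs of CPU, network and RAM utilization, but none of them shows a big "spike" that would indicate a problem. I would post them, but I don't have enough reputation yet.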
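I then discovered atop, which also shows a red line for vmcom and vmlim. I changed /proc/sys/vm/overcommit_ratio from 50 to 90 and the red line went away. As you can see, I have 500 MB of free RAM, 2 GB of free swap and 1.2 GB of cache.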
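Did I fix the problem, or did I just tell atop/Linux to ignore it?
I want a stable system. Going forward, should I:
> Tune down the maximum-children parameters of Apache, Sendmail, etc.? Use ulimit?
> Tune the oom-killer sysctl values so I can use all of the available RAM & swap?
> Tune swappiness or other kernel values?
I'm looking for good ways to work out the answers to the questions above.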
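One common way to approach the first question is a rule-of-thumb sizing calculation: divide the RAM you can afford to give the web server by the typical resident size of one worker. The sketch below assumes that heuristic, which is not an Apache-documented formula. The function name, the reserved amount and the ~55 MiB per-worker RSS are hypothetical inputs. The per-worker figure is roughly in line with the apache2 RSIZE values in the atop output below. Because RSS counts shared pages once per process, the result errs on the conservative side.

```python
import math


def max_workers(total_mib: float, reserved_mib: float, avg_worker_rss_mib: float) -> int:
    """Rule-of-thumb cap on worker processes (a heuristic, not an Apache formula).

    total_mib:          physical RAM in MiB
    reserved_mib:       RAM set aside for everything that is not a worker
                        (database, DNS, mail filters, kernel, page cache)
    avg_worker_rss_mib: typical resident set size of one worker
    """
    budget = total_mib - reserved_mib
    if budget <= 0 or avg_worker_rss_mib <= 0:
        return 0
    # Round down so the combined RSS of all workers stays within the budget.
    return math.floor(budget / avg_worker_rss_mib)


if __name__ == "__main__":
    # Hypothetical inputs: 3.9 GiB of RAM, 1.5 GiB reserved for mysqld, named,
    # spamd and the kernel, about 55 MiB RSS per apache2 worker.
    print(max_workers(3.9 * 1024, 1.5 * 1024, 55))
```

The reserved figure should be measured under real load. The resulting number is the kind of value you would then enforce through the server's own child limit.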
Thanks.
atop output
ATOP - www1    2013/06/20  10:32:14    10 seconds elapsed
PRC | sys 1.34s | user 7.48s | #proc 347 | #zombie 2 | #exit 53 |
CPU | sys 11% | user 63% | irq 1% | idle 106% | wait 19% |
cpu | sys 7% | user 45% | irq 1% | idle 44% | cpu000 w 3% |
cpu | sys 4% | user 18% | irq 0% | idle 62% | cpu001 w 16% |
CPL | avg1 0.90 | avg5 1.20 | avg15 1.53 | csw 13548 | intr 5667 |
MEM | tot 3.9G | free 504.7M | cache 1.2G | buff 124.8M | slab 445.3M |
SWP | tot 2.0G | free 2.0G | | vmcom 4.0G | vmlim 5.5G |
DSK | sda | busy 28% | read 1 | write 237 | avio 11 ms |
NET | transport | tcpi 1753 | tcpo 1682 | udpi 2105 | udpo 2120 |
NET | network | ipi 3918 | ipo 3832 | ipfrw 0 | deliv 3858 |
NET | eth0 0% | pcki 1303 | pcko 1474 | si 372 Kbps | so 650 Kbps |
NET | eth1 0% | pcki 1996 | pcko 2039 | si 369 Kbps | so 398 Kbps |
NET | lo ---- | pcki 619 | pcko 619 | si 118 Kbps | so 118 Kbps |

  PID MINFLT MAJFLT VSTEXT  VSIZE  RSIZE VGROW RGROW MEM CMD        1/3
 3163      0      0   462K 356.1M 138.5M    0K    0K  3% named
 3256    413      0  7432K 387.4M 126.0M    0K    0K  3% mysqld
  579     18      0     3K 179.9M 73548K    0K    0K  2% spamd
  784     10      0     3K 176.8M 70436K    0K    0K  2% spamd
 8053    137      0   394K 330.2M 62928K    0K   20K  2% apache2
 7807    122      0   394K 329.4M 62064K    0K    0K  2% apache2
 7158      1      0   394K 328.3M 60004K    0K    0K  1% apache2
 8305     41      0   394K 326.9M 59096K    0K    4K  1% apache2
17712      0      0     3K 153.2M 51384K    0K    0K  1% spamd
 8057      0      0   394K 319.4M 50600K    0K    0K  1% apache2
 7994    127      0   394K 319.4M 50376K  332K  244K  1% apache2
 8068     38      0   394K 319.2M 49636K    0K    0K  1% apache2
 8164    117      0   394K 319.3M 49544K    0K  100K  1% apache2
 8286     79      0   394K 319.3M 49332K    0K    0K  1% apache2
 8393    457      0   394K 319.2M 49216K    0K   12K  1% apache2
 8222     52      0   394K 318.9M 48852K    0K   52K  1% apache2
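The vmlim figure above can be checked by hand. This minimal sketch assumes the formula the kernel uses for its commit limit, CommitLimit = RAM * overcommit_ratio / 100 + swap, with huge pages ignored. It also assumes that atop's vmcom and vmlim mirror Committed_AS and CommitLimit from /proc/meminfo. With 3.9G of RAM and 2.0G of swap, a ratio of 50 gives a limit just below the 4.0G vmcom, which explains the red line. A ratio of 90 gives about 5.5G, which matches the screen. The kernel only enforces this limit when vm.overcommit_memory is 2. In the default heuristic mode, raising the ratio changes the reported vmlim but not how allocations are handled.

```python
def commit_limit_gib(ram_gib: float, swap_gib: float, overcommit_ratio: int) -> float:
    """CommitLimit = RAM * overcommit_ratio / 100 + swap (huge pages ignored)."""
    return ram_gib * overcommit_ratio / 100 + swap_gib


if __name__ == "__main__":
    # Figures read off the atop screen: MEM tot 3.9G, SWP tot 2.0G, vmcom 4.0G.
    ram, swap, vmcom = 3.9, 2.0, 4.0
    for ratio in (50, 90):
        vmlim = commit_limit_gib(ram, swap, ratio)
        state = "vmcom > vmlim (red line)" if vmcom > vmlim else "vmcom <= vmlim"
        print(f"overcommit_ratio={ratio}: vmlim={vmlim:.2f}G, {state}")
```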
Typical oom-killer log output
Jun 15 10:21:26 mail kernel: [142707.434078] php5-cgi invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
Jun 15 10:21:26 mail kernel: [142707.434083] Pid: 18323, comm: php5-cgi Not tainted 2.6.26-2-amd64 #1
Jun 15 10:21:26 mail kernel: [142707.434085]
Jun 15 10:21:26 mail kernel: [142707.434085] Call Trace:
Jun 15 10:21:26 mail kernel: [142707.434101]  [<ffffffff80273994>] oom_kill_process+0x57/0x1dc
Jun 15 10:21:26 mail kernel: [142707.434104]  [<ffffffff8023b519>] __capable+0x9/0x1c
Jun 15 10:21:26 mail kernel: [142707.434106]  [<ffffffff80273cbf>] badness+0x188/0x1c7
Jun 15 10:21:26 mail kernel: [142707.434109]  [<ffffffff80273ef3>] out_of_memory+0x1f5/0x28e
Jun 15 10:21:26 mail kernel: [142707.434114]  [<ffffffff80276c44>] __alloc_pages_internal+0x31d/0x3bf
Jun 15 10:21:26 mail kernel: [142707.434119]  [<ffffffff802788fa>] __do_page_cache_readahead+0x79/0x183
Jun 15 10:21:26 mail kernel: [142707.434123]  [<ffffffff802731a9>] filemap_fault+0x15d/0x33c
Jun 15 10:21:26 mail kernel: [142707.434127]  [<ffffffff8027e728>] __do_fault+0x50/0x3e6
Jun 15 10:21:26 mail kernel: [142707.434132]  [<ffffffff80281ae9>] handle_mm_fault+0x452/0x8de
Jun 15 10:21:26 mail kernel: [142707.434136]  [<ffffffff80246229>] autoremove_wake_function+0x0/0x2e
Jun 15 10:21:26 mail kernel: [142707.434139]  [<ffffffff80284bb1>] vma_merge+0x141/0x1ee
Jun 15 10:21:26 mail kernel: [142707.434144]  [<ffffffff80221fbc>] do_page_fault+0x5d8/0x9c8
Jun 15 10:21:26 mail kernel: [142707.434149]  [<ffffffff8042aaf9>] error_exit+0x0/0x60
Jun 15 10:21:26 mail kernel: [142707.434154]
Jun 15 10:21:26 mail kernel: [142707.434155] Mem-info:
Jun 15 10:21:26 mail kernel: [142707.434156] Node 0 DMA per-cpu:
Jun 15 10:21:26 mail kernel: [142707.434158] cpu 0: hi: 0, btch: 1 usd: 0
Jun 15 10:21:26 mail kernel: [142707.434159] cpu 1: hi: 0, btch: 1 usd: 0
Jun 15 10:21:26 mail kernel: [142707.434160] Node 0 DMA32 per-cpu:
Jun 15 10:21:26 mail kernel: [142707.434162] cpu 0: hi: 186, btch: 31 usd: 153
Jun 15 10:21:26 mail kernel: [142707.434163] cpu 1: hi: 186, btch: 31 usd: 168
Jun 15 10:21:26 mail kernel: [142707.434164] Node 0 Normal per-cpu:
Jun 15 10:21:26 mail kernel: [142707.434165] cpu 0: hi: 186, btch: 31 usd: 140
Jun 15 10:21:26 mail kernel: [142707.434167] cpu 1: hi: 186, btch: 31 usd: 81
Jun 15 10:21:26 mail kernel: [142707.434169] Active:118998 inactive:818611 dirty:1 writeback:1267 unstable:0
Jun 15 10:21:26 mail kernel: [142707.434170]  free:5922 slab:17276 mapped:191 pagetables:32145 bounce:0
Jun 15 10:21:26 mail kernel: [142707.434172] Node 0 DMA free:11692kB min:20kB low:24kB high:28kB active:0kB inactive:0kB present:10772kB pages_scanned:0 all_unreclaimable? yes
Jun 15 10:21:26 mail kernel: [142707.434175] lowmem_reserve[]: 0 3000 4010 4010
Jun 15 10:21:26 mail kernel: [142707.434177] Node 0 DMA32 free:10396kB min:6056kB low:7568kB high:9084kB active:152kB inactive:2812380kB present:3072160kB pages_scanned:64 all_unreclaimable? no
Jun 15 10:21:26 mail kernel: [142707.434180] lowmem_reserve[]: 0 0 1010 1010
Jun 15 10:21:26 mail kernel: [142707.434182] Node 0 Normal free:1600kB min:2036kB low:2544kB high:3052kB active:475840kB inactive:462064kB present:1034240kB pages_scanned:148770 all_unreclaimable? no
Jun 15 10:21:26 mail kernel: [142707.434185] lowmem_reserve[]: 0 0 0 0
Jun 15 10:21:26 mail kernel: [142707.434187] Node 0 DMA: 3*4kB 6*8kB 3*16kB 6*32kB 4*64kB 1*128kB 1*256kB 1*512kB 2*1024kB 0*2048kB 2*4096kB = 11692kB
Jun 15 10:21:26 mail kernel: [142707.434192] Node 0 DMA32: 1130*4kB 0*8kB 2*16kB 2*32kB 0*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 10376kB
Jun 15 10:21:26 mail kernel: [142707.434197] Node 0 Normal: 153*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1612kB
Jun 15 10:21:26 mail kernel: [142707.434202] 233119 total pagecache pages
Jun 15 10:21:26 mail kernel: [142707.434204] Swap cache: add 639299, delete 621816, find 41527/51044
Jun 15 10:21:26 mail kernel: [142707.434205] Free swap = 0kB
Jun 15 10:21:26 mail kernel: [142707.434206] Total swap = 2097144kB
Jun 15 10:21:26 mail kernel: [142707.444828] 1048576 pages of RAM
Jun 15 10:21:26 mail kernel: [142707.444828] 32137 reserved pages
Jun 15 10:21:26 mail kernel: [142707.444828] 183748 pages shared
Jun 15 10:21:26 mail kernel: [142707.444828] 17483 pages swap cached
Jun 15 10:21:26 mail kernel: [142707.444828] Out of memory: kill process 3907 (perdition.imaps) score 179546 or a child
Jun 15 10:21:26 mail kernel: [142707.444828] Killed process 29401 (perdition.imaps)
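In this log, each "Node 0 <zone>:" line lists that zone's free blocks by size, and the total after "=" is the sum of count times block size. The small sketch below checks that arithmetic; the function name is illustrative and not part of any tool. The failing request was order=0, a single 4 kB page, so any free block could have served it. That suggests the Normal zone's problem was simply how little free memory it had (1612kB), not fragmentation.

```python
import re

# Matches "<count>*<size>kB" terms, e.g. "153*4kB".
_BLOCK = re.compile(r"(\d+)\*(\d+)kB")


def buddy_free_kb(line: str) -> int:
    """Sum count * block size over a 'Node N <zone>: ... = TOTALkB' line."""
    terms = line.split("=", 1)[0]  # ignore the kernel's own total after '='
    return sum(int(count) * int(size) for count, size in _BLOCK.findall(terms))


if __name__ == "__main__":
    normal = ("Node 0 Normal: 153*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB "
              "1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1612kB")
    print(buddy_free_kb(normal))  # 1612
```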
References:
> OOM Killer
> Taming the OOM killer
> How to diagnose causes of oom-killer killing processes
> Understanding the Linux oom-killer’s logs
> What is the best way to prevent out of memory (OOM) freezes on Linux?
> ATOP shows red line vmcom and vmlim. What does it mean?
Solution
Jun 15 10:21:26 mail kernel: [142707.434172] Node 0 DMA free:11692kB min:20kB low:24kB high:28kB active:0kB inactive:0kB present:10772kB pages_scanned:0 all_unreclaimable? yes
Jun 15 10:21:26 mail kernel: [142707.434177] Node 0 DMA32 free:10396kB min:6056kB low:7568kB high:9084kB active:152kB inactive:2812380kB present:3072160kB pages_scanned:64 all_unreclaimable? no
Jun 15 10:21:26 mail kernel: [142707.434182] Node 0 Normal free:1600kB min:2036kB low:2544kB high:3052kB active:475840kB inactive:462064kB present:1034240kB pages_scanned:148770 all_unreclaimable? no
[...]
Jun 15 10:21:26 mail kernel: [142707.434205] Free swap = 0kB
Jun 15 10:21:26 mail kernel: [142707.434206] Total swap = 2097144kB
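Note that the free memory in the "Normal" zone (free:1600kB) is below its "min" watermark (min:2036kB). This means user-space processes can no longer allocate memory from it.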
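The same check can be applied mechanically to other OOM reports of this vintage. This minimal sketch assumes the 2.6-era "Node N <zone> free:...kB min:...kB" line format quoted above; newer kernels print these lines differently. The function name is illustrative. It returns every zone whose free memory is below its min watermark.

```python
import re

# Per-zone summary line of a 2.6-era OOM report, e.g.
# "Node 0 Normal free:1600kB min:2036kB low:2544kB high:3052kB ..."
_ZONE = re.compile(r"Node (\d+) (\w+) free:(\d+)kB min:(\d+)kB")


def zones_below_min(log_text: str) -> list[tuple[str, int, int]]:
    """Return (zone, free_kB, min_kB) for each zone whose free memory is below min."""
    hits = []
    for m in _ZONE.finditer(log_text):
        zone, free_kb, min_kb = m.group(2), int(m.group(3)), int(m.group(4))
        if free_kb < min_kb:
            hits.append((zone, free_kb, min_kb))
    return hits


if __name__ == "__main__":
    excerpt = """
Node 0 DMA free:11692kB min:20kB low:24kB high:28kB
Node 0 DMA32 free:10396kB min:6056kB low:7568kB high:9084kB
Node 0 Normal free:1600kB min:2036kB low:2544kB high:3052kB
"""
    print(zones_below_min(excerpt))  # [('Normal', 1600, 2036)]
```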
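Your DMA and DMA32 zones do have some free memory. However, the OOM killer was triggered because the memory request targeted the "HIGHMEM" (or "Normal") zone: the lower nibble of gfp_mask is 2h.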
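Here is how to decode the mask by hand. The lowest four bits of gfp_mask carry the zone modifiers. This sketch assumes the bit values from the kernel's include/linux/gfp.h (__GFP_DMA = 0x01, __GFP_HIGHMEM = 0x02, __GFP_DMA32 = 0x04); the function name is illustrative. On x86_64 there is no separate HIGHMEM zone, so a __GFP_HIGHMEM request is directed at ZONE_NORMAL. That is why "HIGHMEM" and "Normal" are interchangeable here.

```python
# Zone-modifier bits in the low nibble of gfp_mask (values from include/linux/gfp.h).
ZONE_MODIFIERS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
}


def zone_modifiers(gfp_mask: int) -> list[str]:
    """Names of the zone-modifier flags set in the lowest nibble of gfp_mask."""
    low_nibble = gfp_mask & 0xF
    return [name for bit, name in ZONE_MODIFIERS.items() if low_nibble & bit]


if __name__ == "__main__":
    mask = 0x1201D2  # "gfp_mask=0x1201d2" from the log
    print(hex(mask & 0xF), zone_modifiers(mask))  # 0x2 ['__GFP_HIGHMEM']
```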
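Quite possibly the memory was consumed fast enough to fall between two polls of your monitoring system, so you never saw a spike; the system simply became unusable.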
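One way to catch a spike that falls between monitoring polls is to sample memory counters much more often around the time the incidents happen. This minimal sketch assumes a Linux /proc/meminfo. The one-second interval and the chosen fields are arbitrary choices, not xymon settings, and the function names are illustrative. It prints MemFree, Cached and SwapFree on each tick.

```python
import time


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value} (kB where a unit is shown)."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def sample(path: str = "/proc/meminfo", interval_s: float = 1.0, count: int = 60) -> None:
    """Print a timestamped MemFree/Cached/SwapFree line every interval_s seconds."""
    for _ in range(count):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print(
            time.strftime("%H:%M:%S"),
            f"MemFree={info.get('MemFree')}kB",
            f"Cached={info.get('Cached')}kB",
            f"SwapFree={info.get('SwapFree')}kB",
            flush=True,  # keep output current when redirected to a file
        )
        time.sleep(interval_s)
```

For example, calling `sample(interval_s=1.0, count=3600)` with output redirected to a file records an hour at one-second resolution.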
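Disabling overcommit by setting vm.overcommit_memory = 2 and/or adjusting vm.overcommit_ratio should stop the OOM killer from being invoked. The memory shortage itself will persist, though. Processes that request memory while the system is in this "memory full" state may terminate abnormally.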
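Under vm.overcommit_memory = 2, the kernel refuses new allocations once Committed_AS would exceed CommitLimit: malloc() returns NULL or mmap() fails. In most cases this happens instead of the OOM killer stepping in later. Programs that do not handle that failure are the ones at risk of terminating abnormally. The sketch below watches the remaining headroom. It assumes /proc/meminfo-style input, and the function name is illustrative.

```python
def commit_headroom_kb(meminfo_text: str) -> int:
    """CommitLimit - Committed_AS from /proc/meminfo-style text, in kB.

    Under vm.overcommit_memory = 2, new commitments larger than roughly this
    value are refused. A negative result is only possible when overcommit is
    not strict (or the limit was lowered after the memory was committed).
    """
    values: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[name.strip()] = int(parts[0])
    return values["CommitLimit"] - values["Committed_AS"]
```

On a live system, you would pass it the contents of /proc/meminfo, for example `commit_headroom_kb(open("/proc/meminfo").read())`.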
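To really understand what is happening, find out what is consuming all the memory. Apache workers are likely candidates. Try enabling vm.oom_dump_tasks to get more information about the processes and their memory usage at the moment the OOM killer is invoked. Also take a look at this question, which describes something very similar to your situation.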
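To see which programs hold the memory, one rough approach is to total resident memory per command name. This sketch assumes input in the format printed by `ps -eo rss=,comm=` (RSS in KiB, then the command name); the function name is illustrative. Summing RSS double-counts shared pages, so treat the totals as upper bounds. With vm.oom_dump_tasks enabled, the kernel itself adds a per-task table, including each task's rss, to the OOM report.

```python
from collections import defaultdict


def rss_by_command(ps_output: str) -> list[tuple[str, int]]:
    """Total RSS per command from 'RSS COMMAND' lines, largest first.

    Expects the format of `ps -eo rss=,comm=`: RSS in KiB, whitespace, command.
    Lines that do not start with a number (e.g. a header) are skipped.
    """
    totals: dict[str, int] = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            totals[parts[1].strip()] += int(parts[0])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Feeding it `subprocess.run(["ps", "-eo", "rss=,comm="], capture_output=True, text=True).stdout` during a busy period would show whether apache2 workers dominate.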