我的服务器在不同的工具中将频繁的分段错误记录到/var/log/kern.log.到目前为止,我已经在Perl,PHP和rsync中看到过它们.所有安装的软件都是最新的Debian软件包.这是日志文件中的一个exerpt:
Mar 2 01:07:54 gaz kernel: [ 5316.246303] imapsync[4533]: segfault at 8b ip 00007fb448c98fe6 sp 00007ffff571dd68 error 4 in libperl.so.5.10.1[7fb448bd7000+164000] Mar 2 01:17:42 gaz kernel: [ 5904.354307] PHP5-cgi[4441]: segfault at 2bb3dc8 ip 0000000002bb3dc8 sp 00007fffbeeaae48 error 15 Mar 2 02:54:05 gaz kernel: [11687.922316] PHP5-cgi[4495]: segfault at 2d7acf9 ip 0000000002d7acf9 sp 00007fff60c6eb18 error 15 Mar 2 10:50:08 gaz kernel: [40250.390322] BUG: unable to handle kernel paging request at 00000000024b03f0 Mar 2 10:50:08 gaz kernel: [40250.390341] IP: [<00000000024b03f0>] 0x24b03f0 Mar 2 10:50:08 gaz kernel: [40250.390353] PGD 208c71067 PUD 21c811067 PMD 209329067 PTE 8000000211c88067 Mar 2 10:50:08 gaz kernel: [40250.390365] Oops: 0011 [#1] SMP Mar 2 10:50:08 gaz kernel: [40250.390373] last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host4/target4:0:0/4:0:0:0/block/sdb/stat Mar 2 10:50:08 gaz kernel: [40250.390386] cpu 1 Mar 2 10:50:08 gaz kernel: [40250.390392] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative xt_recent xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ ipv4 ip6table_filter ip6_tables xt_DSCP xt_TCPMSS ipt_LOG ipt_REJECT iptable_mangle iptable_filter xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip_tables x_tables loop snd _hda_codec_atihdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm radeon snd_timer ttm snd drm_kms_helper soundcore drm snd_page_alloc i2c_algo_bit shpchp i2c_piix4 edac_core pcspkr k8temp evdev edac_m ce_amd pci_hotplug i2c_core button ext3 jbd mbcache dm_mod powernow_k8 aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod sata_nv sata_sil sata_via sd_mod crc_t10dif ata_generic ahci pata_atiixp ohci_hcd libata r8169 mii thermal ehci_hcd processor thermal_sys scsi_mod usbcore nls_base [last unloaded: scsi_wait_scan] Mar 2 10:50:08 gaz kernel: [40250.390566] Pid: 11482,comm: munin-limits Not tainted 2.6.32-5-amd64 #1 MS-7368 Mar 2 10:50:08 gaz kernel: [40250.390576] RIP: 0010:[<00000000024b03f0>] [<00000000024b03f0>] 0x24b03f0 Mar 2 10:50:08 gaz kernel: [40250.390586] RSP: 0018:ffff88021cc8dec0 EFLAGS: 00010286 Mar 2 10:50:08 gaz kernel: [40250.390593] RAX: 000000001ddc1000 RBX: 0000000000000010 RCX: ffffffff810f9904 Mar 2 10:50:08 gaz kernel: [40250.390600] RDX: 0000000000000000 RSI: ffffea0007688200 RDI: 0000000000000286 Mar 2 10:50:08 gaz kernel: [40250.390608] RBP: 00000000ffffffea R08: 0000000000000025 R09: 7865542f30312e35 Mar 2 10:50:08 gaz kernel: [40250.390615] R10: 000000d01cc8ddf8 R11: 0000000000000246 R12: ffff88021cc8def8 Mar 2 10:50:08 gaz kernel: [40250.390622] R13: 0000000002295010 R14: 00000000022c9db0 R15: 0000000002488d78 Mar 2 10:50:08 gaz kernel: [40250.390630] FS: 00007f3b3c8b2700(0000) GS:ffff880008d00000(0000) knlGS:0000000000000000 Mar 2 10:50:08 gaz kernel: [40250.390641] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 2 10:50:08 gaz kernel: [40250.390648] CR2: 00000000024b03f0 CR3: 000000021c5d1000 CR4: 00000000000006e0 Mar 2 10:50:08 gaz kernel: [40250.390656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Mar 2 10:50:08 gaz kernel: [40250.390663] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Mar 2 10:50:08 gaz kernel: [40250.390671] Process munin-limits (pid: 11482,threadinfo ffff88021cc8c000,task ffff88021bf59530) Mar 2 10:50:08 gaz kernel: [40250.390681] Stack: Mar 2 10:50:08 gaz kernel: [40250.390687] ffffffff810f1d4a ffff880208c63228 0000000000000000 00007fffc2dcecc0 Mar 2 10:50:08 gaz kernel: [40250.390697] <0> 00000000024ba2b0 0000000002295010 ffffffff810f1e3d 0000000000000004 Mar 2 10:50:08 gaz kernel: [40250.390712] <0> ffff88021bf59530 ffff88021c4edc00 ffffffff812fe0b6 ffff88021c4edc60 Mar 2 10:50:08 gaz kernel: [40250.390732] Call Trace: Mar 2 10:50:08 gaz kernel: [40250.390742] [<ffffffff810f1d4a>] ? vfs_fstatat+0x2c/0x57 Mar 2 10:50:08 gaz kernel: [40250.390750] [<ffffffff810f1e3d>] ? sys_newstat+0x11/0x30 Mar 2 10:50:08 gaz kernel: [40250.390760] [<ffffffff812fe0b6>] ? do_page_fault+0x2e0/0x2fc Mar 2 10:50:08 gaz kernel: [40250.390768] [<ffffffff812fbf55>] ? page_fault+0x25/0x30 Mar 2 10:50:08 gaz kernel: [40250.390777] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b Mar 2 10:50:08 gaz kernel: [40250.390783] Code: Bad RIP value. Mar 2 10:50:08 gaz kernel: [40250.390791] RIP [<00000000024b03f0>] 0x24b03f0 Mar 2 10:50:08 gaz kernel: [40250.390799] RSP <ffff88021cc8dec0> Mar 2 10:50:08 gaz kernel: [40250.390805] CR2: 00000000024b03f0 Mar 2 10:50:08 gaz kernel: [40250.391051] ---[ end trace 1cc1473b539c7f6e ]--- Mar 2 11:42:20 gaz kernel: [43382.242301] PHP5-cgi[10963]: segfault at d81160 ip 0000000000d81160 sp 00007fff3adcb058 error 15 Mar 2 21:51:14 gaz kernel: [79916.418302] PHP5-cgi[20089]: segfault at 1c59dc8 ip 0000000001c59dc8 sp 00007fff9b877fb8 error 15 Mar 3 03:45:01 gaz kernel: [101143.334305] munin-update[22519] general protection ip:7f516dce204c sp:7fff6049a978 error:0 in libperl.so.5.10.1[7f516dc7d000+164000] Mar 3 11:22:37 gaz kernel: [128599.570307] PHP5-cgi[22888]: segfault at 36485a8 ip 00000000036485a8 sp 00007fff2d56e1c8 error 15 Mar 4 08:32:17 gaz kernel: [204779.842304] PHP5-cgi[22090]: segfault at 18 ip 0000000000689e5e sp 00007fff677a6a48 error 6 in PHP5-cgi[400000+6f9000] Mar 4 10:01:02 gaz kernel: [210104.434706] rsync[22236] general protection ip:7f14a07137f9 sp:7fff88f940b8 error:0 in libc-2.11.2.so[7f14a069d000+158000] Mar 4 11:32:22 gaz kernel: [215584.262316] BUG: unable to handle kernel paging request at 00000000ffffff9c Mar 4 11:32:22 gaz kernel: [215584.262331] IP: [<00000000ffffff9c>] 0xffffff9c Mar 4 11:32:22 gaz kernel: [215584.262343] PGD 0 Mar 4 11:32:22 gaz kernel: [215584.262350] Oops: 0010 [#2] SMP Mar 4 11:32:22 gaz kernel: [215584.262359] last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host4/target4:0:0/4:0:0:0/block/sdb/stat Mar 4 11:32:22 gaz kernel: [215584.262371] cpu 1 Mar 4 11:32:22 gaz kernel: [215584.262378] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative xt_recent xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter ip6_tables xt_DSCP xt_TCPMSS ipt_LOG ipt_REJECT iptable_mangle iptable_filter xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip_tables x_tables loop snd_hda_codec_atihdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm radeon snd_timer ttm snd drm_kms_helper soundcore drm snd_page_alloc i2c_algo_bit shpchp i2c_piix4 edac_core pcspkr k8temp evdev edac_mce_amd pci_hotplug i2c_core button ext3 jbd mbcache dm_mod powernow_k8 aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod sata_nv sata_sil sata_via sd_mod crc_t10dif ata_generic ahci pata_atiixp ohci_hcd libata r8169 mii thermal ehci_hcd processor thermal_sys scsi_mod usbcore nls_base [last unloaded: scsi_wait_scan] Mar 4 11:32:22 gaz kernel: [215584.262552] Pid: 1960,comm: proxymap Tainted: G D 2.6.32-5-amd64 #1 MS-7368 Mar 4 11:32:22 gaz kernel: [215584.262563] RIP: 0010:[<00000000ffffff9c>] [<00000000ffffff9c>] 0xffffff9c Mar 4 11:32:22 gaz kernel: [215584.262573] RSP: 0018:ffff880209257e00 EFLAGS: 00010212 Mar 4 11:32:22 gaz kernel: [215584.262580] RAX: ffff8801514eb780 RBX: ffffffff810efb2d RCX: 0000000000000000 Mar 4 11:32:22 gaz kernel: [215584.262590] RDX: 0000000000000020 RSI: 0000000000000001 RDI: ffff8801514eb780 Mar 4 11:32:22 gaz kernel: [215584.262600] RBP: 00000000ffffffe9 R08: 0000000000000000 R09: 0000000000000000 Mar 4 11:32:22 gaz kernel: [215584.262611] R10: ffff880209257e78 R11: ffffffff81152c7c R12: 0000000000000001 Mar 4 11:32:22 gaz kernel: [215584.262622] R13: 0000000000008001 R14: 0000000000000024 R15: 00000000ffffff9c Mar 4 11:32:22 gaz kernel: [215584.262633] FS: 00007fca4de35700(0000) GS:ffff880008d00000(0000) knlGS:0000000000000000 Mar 4 11:32:22 gaz kernel: [215584.262644] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 4 11:32:22 gaz kernel: [215584.262650] CR2: 00000000ffffff9c CR3: 00000001c9cbb000 CR4: 00000000000006e0 Mar 4 11:32:22 gaz kernel: [215584.262661] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Mar 4 11:32:22 gaz kernel: [215584.262671] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Mar 4 11:32:22 gaz kernel: [215584.262682] Process proxymap (pid: 1960,threadinfo ffff880209256000,task ffff88021c4b1c40) Mar 4 11:32:22 gaz kernel: [215584.262693] Stack: Mar 4 11:32:22 gaz kernel: [215584.262698] ffffffff810f8566 ffff880209257e78 ffff88021c7bf000 ffff88021c7bf0c8 Mar 4 11:32:22 gaz kernel: [215584.262709] <0> 0000800000000000 ffff88021fc0f000 ffff880209257e78 00000000fffffffe Mar 4 11:32:22 gaz kernel: [215584.262724] <0> ffffffff810e5881 ffff880209257f48 0000000000000286 ffff88021fc0f000 Mar 4 11:32:22 gaz kernel: [215584.262743] Call Trace: Mar 4 11:32:22 gaz kernel: [215584.262753] [<ffffffff810f8566>] ? do_filp_open+0xa7/0x94b Mar 4 11:32:22 gaz kernel: [215584.262763] [<ffffffff810e5881>] ? virt_to_head_page+0x9/0x2a Mar 4 11:32:22 gaz kernel: [215584.262771] [<ffffffff810f9904>] ? user_path_at+0x52/0x79 Mar 4 11:32:22 gaz kernel: [215584.262779] [<ffffffff810cfec1>] ? get_unmapped_area+0xd7/0x139 Mar 4 11:32:22 gaz kernel: [215584.262787] [<ffffffff811019d5>] ? alloc_fd+0x67/0x10c Mar 4 11:32:22 gaz kernel: [215584.262795] [<ffffffff810eceaf>] ? do_sys_open+0x55/0xfc Mar 4 11:32:22 gaz kernel: [215584.262804] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b Mar 4 11:32:22 gaz kernel: [215584.262811] Code: Bad RIP value. Mar 4 11:32:22 gaz kernel: [215584.262819] RIP [<00000000ffffff9c>] 0xffffff9c Mar 4 11:32:22 gaz kernel: [215584.262828] RSP <ffff880209257e00> Mar 4 11:32:22 gaz kernel: [215584.262833] CR2: 00000000ffffff9c Mar 4 11:32:22 gaz kernel: [215584.263077] ---[ end trace 1cc1473b539c7f6f ]---
如您所见,存在段错误,一般保护错误和内核错误.我的第一个猜测是存在某种类型的硬件问题,我问我的Hoster(这是一个租用的根服务器)进行完整的硬件检查 – 他们做了,但找不到任何问题.
我不知道他们检查了什么以及如何检查,但他们的支持团队通常都非常好.我自己运行了memtester和cpuburn,但也找不到任何错误.
不幸的是,我没有可靠的方法来重现这些段错误,它们似乎或多或少是随机的.在预感中,我禁用了系统的防火墙并运行了一个定期分段的程序(imapsync),并且它似乎需要比以前更长的段错误,因此问题可能与网络堆栈有关.或者只是随机的事情.
以下是内核规范:
# uname -a Linux gaz 2.6.32-5-amd64 #1 SMP Wed Jan 12 03:40:32 UTC 2011 x86_64 GNU/Linux # cat /etc/debian_version 6.0 # lsmod Module Size Used by cpufreq_userspace 1992 0 cpufreq_stats 2659 0 cpufreq_powersave 902 0 cpufreq_conservative 5162 0 xt_recent 5977 0 xt_tcpudp 2319 0 iptable_nat 4299 0 nf_nat 13388 1 iptable_nat nf_conntrack_ipv4 9833 3 iptable_nat,nf_nat nf_defrag_ipv4 1139 1 nf_conntrack_ipv4 ip6table_filter 2384 0 ip6_tables 15075 1 ip6table_filter xt_DSCP 1995 0 xt_TCPMSS 2919 0 ipt_LOG 4518 0 ipt_REJECT 1953 0 iptable_mangle 2817 0 iptable_filter 2258 0 xt_multiport 2267 0 xt_state 1303 0 xt_limit 1782 0 xt_conntrack 2407 0 nf_conntrack_ftp 5537 0 nf_conntrack 46535 6 iptable_nat,nf_nat,nf_conntrack_ipv4,xt_state,xt_conntrack,nf_conntrack_ftp ip_tables 13899 3 iptable_nat,iptable_mangle,iptable_filter x_tables 12845 13 xt_recent,xt_tcpudp,iptable_nat,ip6_tables,xt_DSCP,xt_TCPMSS,ipt_LOG,ipt_REJECT,xt_multiport,xt_limit,ip_tables loop 11799 0 radeon 573996 0 ttm 39986 1 radeon drm_kms_helper 20065 1 radeon snd_hda_codec_atihdmi 2251 1 drm 142359 3 radeon,ttm,drm_kms_helper snd_hda_intel 20019 0 i2c_algo_bit 4225 1 radeon pcspkr 1699 0 i2c_piix4 8328 0 snd_hda_codec 54244 2 snd_hda_codec_atihdmi,snd_hda_intel i2c_core 15712 5 radeon,drm_kms_helper,drm,i2c_algo_bit,i2c_piix4 snd_hwdep 5380 1 snd_hda_codec snd_pcm 60503 2 snd_hda_intel,snd_hda_codec snd_timer 15582 1 snd_pcm snd 46446 5 snd_hda_intel,snd_hda_codec,snd_hwdep,snd_pcm,snd_timer soundcore 4598 1 snd evdev 7352 3 snd_page_alloc 6249 2 snd_hda_intel,snd_pcm k8temp 3283 0 edac_core 29261 0 edac_mce_amd 6433 0 shpchp 26264 0 pci_hotplug 21203 1 shpchp button 4650 0 ext3 106518 2 jbd 37085 1 ext3 mbcache 5050 1 ext3 dm_mod 53754 0 powernow_k8 10978 1 aacraid 59779 0 3w_9xxx 28684 0 3w_xxxx 20569 0 raid10 17809 0 raid456 44500 0 async_raid6_recov 5170 1 raid456 async_pq 3479 2 raid456,async_raid6_recov raid6_pq 77179 2 async_raid6_recov,async_pq async_xor 2478 3 raid456,async_raid6_recov,async_pq xor 4380 1 async_xor async_memcpy 1198 2 raid456,async_raid6_recov async_tx 1734 5 raid456,async_pq,async_xor,async_memcpy raid1 18431 3 raid0 5517 0 md_mod 73824 7 raid10,raid456,raid1,raid0 sata_nv 19166 0 sata_sil 7412 0 sata_via 7928 0 sd_mod 29889 8 crc_t10dif 1276 1 sd_mod ata_generic 3047 0 ahci 32374 6 r8169 29229 0 mii 3210 1 r8169 thermal 11674 0 pata_atiixp 3489 0 libata 133632 6 sata_nv,sata_sil,sata_via,ata_generic,ahci,pata_atiixp ohci_hcd 19212 0 ehci_hcd 31151 0 processor 29935 1 powernow_k8 thermal_sys 11942 2 thermal,processor scsi_mod 122149 5 aacraid,3w_9xxx,3w_xxxx,sd_mod,libata usbcore 122034 3 ohci_hcd,ehci_hcd nls_base 6377 1 usbcore # free total used free shared buffers cached Mem: 8166128 1228036 6938092 0 140412 782060 -/+ buffers/cache: 305564 7860564 Swap: 2102456 0 2102456
所以,基本上我的问题是:
>我如何进一步诊断?
>上面的日志中是否有任何数据可以帮助我隔离麻烦制造者?
>在Google上搜索时,我忽略了上述硬件/软件是否存在任何已知问题?
>有没有办法阻止内核自动加载模块(我可能不需要所有这些模块,其中一个可能是罪魁祸首)
解决方法
检查你的记忆!
像这样随机段错误的最常见原因是内存不好.抓住内存检查器(例如memtest86+)并进行测试.