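I am trying to understand an MCE message in order to find out which memory module on a server is bad. The message appeared in /var/log/kern.log on one of my servers, which froze twice today.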
Apr 13 22:39:22 mBox kernel: [36247975.116860] sbridge: HANDLING MCE MEMORY ERROR
Apr 13 22:39:22 mBox kernel: [36247975.116867] cpu 0: Machine Check Exception: 0 Bank 5: 8c00004000010090
Apr 13 22:39:22 mBox kernel: [36247975.116869] TSC 0 ADDR 4a0d75900 MISC 21405cdc86 PROCESSOR 0:206d7 TIME 1428957562 SOCKET 0 APIC 0
Apr 13 22:39:22 mBox kernel: [36247975.951013] EDAC MC0: 1 CE memory read error
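I suspect a bad memory module. The server has 2x Xeon E5-2650 with 8 x 8 GB memory modules (each CPU has 8 memory slots).
Here is the list of memory modules as reported by lshw: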
*-memory:0 description: System Memory physical id: 2d slot: System board or motherboard
  *-bank:0 description: DIMM DDR3 1333 MHz (0,8 ns) product: 9965516-197.A vendor: Kingston physical id: 0 serial: B83AE5C2 slot: P1_DIMMA1 size: 8GiB width: 64 bits clock: 1333MHz (0.8ns)
  *-bank:1 description: DIMM Synchronous [empty] product: Dimm1_PartNum vendor: Dimm1_Manufacturer physical id: 1 serial: Dimm1_SerNum slot: P1_DIMMA2 width: 64 bits
  *-bank:2 description: DIMM DDR3 1333 MHz (0,8 ns) product: 9965516-048.A vendor: Kingston physical id: 2 serial: EC309238 slot: P1_DIMMB1 size: 8GiB width: 64 bits clock: 1333MHz (0.8ns)
  *-bank:3 description: DIMM Synchronous [empty] product: Dimm4_PartNum vendor: Dimm4_Manufacturer physical id: 3 serial: Dimm4_SerNum slot: P1_DIMMB2 width: 64 bits
  *-bank:4 description: DIMM DDR3 1333 MHz (0,8 ns) product: 9965516-048.A vendor: Kingston physical id: 4 serial: E9305438 slot: P1_DIMMC1 size: 8GiB width: 64 bits clock: 1333MHz (0.8ns)
  *-bank:5 description: DIMM Synchronous [empty] product: Dimm7_PartNum vendor: Dimm7_Manufacturer physical id: 5 serial: Dimm7_SerNum slot: P1_DIMMC2 width: 64 bits
  *-bank:6 description: DIMM DDR3 1333 MHz (0,8 ns) product: 9965516-048.A vendor: Kingston physical id: 6 serial: E7305738 slot: P1_DIMMD1 size: 8GiB width: 64 bits clock: 1333MHz (0.8ns)
  *-bank:7 description: DIMM Synchronous [empty] product: Dimm10_PartNum vendor: Dimm10_Manufacturer physical id: 7 serial: Dimm10_SerNum slot: P1_DIMMD2 width: 64 bits
*-memory:1 description: System Memory physical id: 3f slot: System board or motherboard
  *-bank:0 description: DIMM DDR3 1333 MHz (0,8 ns) product: 9965516-197.A vendor: Kingston physical id: 0 serial: B63A08C3 slot: P2_DIMME1 size: 8GiB width: 64 bits clock: 1333MHz (0.8ns)
  *-bank:1 description: DIMM Synchronous [empty] product: Dimm1_PartNum vendor: Dimm1_Manufacturer physical id: 1 serial: Dimm1_SerNum slot: P2_DIMME2 width: 64 bits
  *-bank:2 description: DIMM DDR3 1333 MHz (0,8 ns) product: 9965516-048.A vendor: Kingston physical id: 2 serial: EA309638 slot: P2_DIMMF1 size: 8GiB width: 64 bits clock: 1333MHz (0.8ns)
  *-bank:3 description: DIMM Synchronous [empty] product: Dimm4_PartNum vendor: Dimm4_Manufacturer physical id: 3 serial: Dimm4_SerNum slot: P2_DIMMF2 width: 64 bits
  *-bank:4 description: DIMM DDR3 1333 MHz (0,8 ns) product: 9965516-048.A vendor: Kingston physical id: 4 serial: E7305938 slot: P2_DIMMG1 size: 8GiB width: 64 bits clock: 1333MHz (0.8ns)
  *-bank:5 description: DIMM Synchronous [empty] product: Dimm7_PartNum vendor: Dimm7_Manufacturer physical id: 5 serial: Dimm7_SerNum slot: P2_DIMMG2 width: 64 bits
  *-bank:6 description: DIMM DDR3 1333 MHz (0,8 ns) product: 9965516-048.A vendor: Kingston physical id: 6 serial: E7305B38 slot: P2_DIMMH1 size: 8GiB width: 64 bits clock: 1333MHz (0.8ns)
  *-bank:7 description: DIMM Synchronous [empty] product: Dimm10_PartNum vendor: Dimm10_Manufacturer physical id: 7 serial: Dimm10_SerNum slot: P2_DIMMH2 width: 64 bits
*-memory:2 UNCLAIMED physical id: 7
*-memory:3 UNCLAIMED physical id: 9
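As you can see, bank #5 has no memory module installed. So my questions are: do you agree that this message is about a memory failure? And if so, how can I find the module to replace?
Solution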
These errors come from EDAC (Error Detection And Correction), specifically the
edac_mc device class.
The events you are getting are CE events (Correctable Errors). They are an indication that a DIMM is starting to fail.
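To check the "correctable" reading against the raw log line, the status word 8c00004000010090 can be decoded by hand. The sketch below assumes the IA32_MCi_STATUS bit layout and the memory-controller error-code format (0000 0000 1MMM CCCC) described in the Intel SDM; the function name decode_mci_status is made up for illustration. Note that "Bank 5" in the MCE line is the index of the CPU's machine-check register bank, not the lshw bank:5 slot, so the empty slot is not what the message refers to.

```python
# Minimal sketch: decode a machine-check status word.
# Assumption: bit positions follow the IA32_MCi_STATUS layout from the Intel SDM.

MEMORY_OPS = {
    0: "generic",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrub",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main fields."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        # Bits 52:38 hold the corrected error count on CPUs that support it.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory controller errors use the pattern 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        fields["channel"] = mca_code & 0xF  # 0xF means "channel not specified"
    return fields


if __name__ == "__main__":
    for name, value in decode_mci_status(0x8C00004000010090).items():
        print(f"{name}: {value}")
```

For the value in the log, the uncorrected bit is clear, the corrected error count is 1, and the error code decodes to a memory read, which agrees with the "EDAC MC0: 1 CE memory read error" line.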
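EDAC did not report anything specific about which memory row or channel the error refers to, so it is hard to tell which module to replace until one actually fails.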
But take a look at /sys/devices/system/edac/mc/mc*, which may tell you more about which row/DIMM is the faulty one.
For example:
ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count 0 csrow1 0 csrow4 0 csrow7 0 reset_counters 0 size_mb
0 ce_noinfo_count 0 csrow2 0 csrow5 0 device 0 sdram_scrub_rate 0 ue_count
0 csrow0 0 csrow3 0 csrow6 0 mc_name 0 seconds_since_reset 0 ue_noinfo_count
Look at the ce_count field; it holds the number of correctable errors counted for that memory controller.
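Checking every counter by hand gets tedious with several controllers, so here is a minimal sketch that collects them in one pass. It assumes the legacy EDAC sysfs layout shown in the listing above, and additionally assumes that each csrowN directory carries its own ce_count file, which may not hold on every kernel or driver. The function name read_ce_counts is hypothetical.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {relative sysfs directory: correctable error count}.

    Assumption: counters live in mc*/ce_count and mc*/csrow*/ce_count.
    """
    root = Path(edac_root)
    counts = {}
    for pattern in ("mc*/ce_count", "mc*/csrow*/ce_count"):
        for path in sorted(root.glob(pattern)):
            try:
                value = int(path.read_text().strip())
            except (OSError, ValueError):
                continue  # unreadable or non-numeric entry: skip it
            counts[str(path.parent.relative_to(root))] = value
    return counts


if __name__ == "__main__":
    for location, count in read_ce_counts().items():
        marker = "  <-- errors seen" if count else ""
        print(f"{location}: {count}{marker}")
```

A csrow with a non-zero count narrows the search to the DIMMs behind that row; mapping a csrow to a physical slot still depends on the motherboard.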
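On a side note:
The system can still keep running, but with reduced reliability. Preventive maintenance and proactive replacement of memory DIMMs that exhibit CEs can reduce the likelihood of the dreaded UE (Uncorrectable Error) events and system panics.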
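For proactive replacement it helps to know whether corrected errors are piling up over time. The sketch below tallies them from kernel log lines; it assumes the message format seen in the log above ("EDAC MC0: 1 CE ..."), which may differ between kernel versions, and the function name count_corrected_errors is hypothetical.

```python
import re
from collections import Counter

# Assumption: CE reports look like "EDAC MC0: 1 CE memory read error".
CE_LINE = re.compile(r"EDAC MC(\d+): (\d+) CE\b")


def count_corrected_errors(lines):
    """Sum the reported correctable errors per memory controller."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            totals[f"MC{match.group(1)}"] += int(match.group(2))
    return totals


if __name__ == "__main__":
    try:
        with open("/var/log/kern.log", errors="replace") as log:
            print(dict(count_corrected_errors(log)))
    except OSError as exc:
        print(f"could not read kernel log: {exc}")
```

A controller whose total keeps growing between checks is a reasonable candidate for scheduling a DIMM swap before an uncorrectable error occurs.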
More information about EDAC: