Mat*_*atg 12 linux ubuntu memory
我试图了解 MCE 消息以查找服务器上的哪个内存模块坏了。此消息出现在/var/log/kern.log今天冻结两次的一台服务器中。
Apr 13 22:39:22 mbox kernel: [36247975.116860] sbridge: HANDLING MCE MEMORY ERROR
Apr 13 22:39:22 mbox kernel: [36247975.116867] CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090
Apr 13 22:39:22 mbox kernel: [36247975.116869] TSC 0 ADDR 4a0d75900 MISC 21405cdc86 PROCESSOR 0:206d7 TIME 1428957562 SOCKET 0 APIC 0
Apr 13 22:39:22 mbox kernel: [36247975.951013] EDAC MC0: 1 CE memory read error
Run Code Online (Sandbox Code Playgroud)
我怀疑内存模块坏了。服务器是 2x Xeon E5-2650,带有 8x8Go 内存模块(每个 CPU 8 个内存插槽)
这是来自lshw:的内存模块数量:
*-memory:0
description: System Memory
physical id: 2d
slot: System board or motherboard
*-bank:0
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-197.A
vendor: Kingston
physical id: 0
serial: B83AE5C2
slot: P1_DIMMA1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:1
description: DIMM Synchronous [empty]
product: Dimm1_PartNum
vendor: Dimm1_Manufacturer
physical id: 1
serial: Dimm1_SerNum
slot: P1_DIMMA2
width: 64 bits
*-bank:2
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 2
serial: EC309238
slot: P1_DIMMB1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:3
description: DIMM Synchronous [empty]
product: Dimm4_PartNum
vendor: Dimm4_Manufacturer
physical id: 3
serial: Dimm4_SerNum
slot: P1_DIMMB2
width: 64 bits
*-bank:4
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 4
serial: E9305438
slot: P1_DIMMC1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:5
description: DIMM Synchronous [empty]
product: Dimm7_PartNum
vendor: Dimm7_Manufacturer
physical id: 5
serial: Dimm7_SerNum
slot: P1_DIMMC2
width: 64 bits
*-bank:6
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 6
serial: E7305738
slot: P1_DIMMD1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:7
description: DIMM Synchronous [empty]
product: Dimm10_PartNum
vendor: Dimm10_Manufacturer
physical id: 7
serial: Dimm10_SerNum
slot: P1_DIMMD2
width: 64 bits
*-memory:1
description: System Memory
physical id: 3f
slot: System board or motherboard
*-bank:0
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-197.A
vendor: Kingston
physical id: 0
serial: B63A08C3
slot: P2_DIMME1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:1
description: DIMM Synchronous [empty]
product: Dimm1_PartNum
vendor: Dimm1_Manufacturer
physical id: 1
serial: Dimm1_SerNum
slot: P2_DIMME2
width: 64 bits
*-bank:2
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 2
serial: EA309638
slot: P2_DIMMF1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:3
description: DIMM Synchronous [empty]
product: Dimm4_PartNum
vendor: Dimm4_Manufacturer
physical id: 3
serial: Dimm4_SerNum
slot: P2_DIMMF2
width: 64 bits
*-bank:4
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 4
serial: E7305938
slot: P2_DIMMG1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:5
description: DIMM Synchronous [empty]
product: Dimm7_PartNum
vendor: Dimm7_Manufacturer
physical id: 5
serial: Dimm7_SerNum
slot: P2_DIMMG2
width: 64 bits
*-bank:6
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 6
serial: E7305B38
slot: P2_DIMMH1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:7
description: DIMM Synchronous [empty]
product: Dimm10_PartNum
vendor: Dimm10_Manufacturer
physical id: 7
serial: Dimm10_SerNum
slot: P2_DIMMH2
width: 64 bits
*-memory:2 UNCLAIMED
physical id: 7
*-memory:3 UNCLAIMED
physical id: 9
Run Code Online (Sandbox Code Playgroud)
如您所见,#5 组上没有内存模块。所以我的问题是:你同意这条消息是关于内存故障的吗?如果是这样,我怎样才能找到要更换的模块?
小智 10
这些错误来自 EDAC - 错误检测和纠正。
edac_mc 设备的类别。
您收到的事件是 CE 事件(可纠正的错误)。这些迹象表明 DIMM 开始出现故障。
EDAC 没有报告关于它所指的内存行或通道的任何具体信息,因此很难判断要替换哪一个,直到那个失败为止。
但是看看/sys/devices/system/edac/mc/mc*这可能会告诉你更多关于哪一行/DIMM可能是有问题的。
例如
ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count 0 csrow1 0 csrow4 0 csrow7 0 reset_counters 0 size_mb
0 ce_noinfo_count 0 csrow2 0 csrow5 0 device 0 sdram_scrub_rate 0 ue_count
0 csrow0 0 csrow3 0 csrow6 0 mc_name 0 seconds_since_reset 0 ue_noinfo_count
Run Code Online (Sandbox Code Playgroud)
看看ce_count球场。
旁注:
该系统仍可继续运行,但安全性较低。具有 CE 的内存 DIMM 的预防性维护和主动部件更换可以降低可怕的 UE(不可纠正错误)事件和系统“恐慌”的可能性。
有关 EDAC 的更多信息,请访问:
https://www.kernel.org/doc/Documentation/edac.txt
| 归档时间: |
|
| 查看次数: |
43159 次 |
| 最近记录: |