Need help decoding EDAC memory error messages
Posted: 2018/10/13 20:03:19
I've spent the past week researching my problem online, but the information I'm finding is inconsistent and I'm not
confident about what is wrong. These errors are coming from a production machine (not life-threatening or mission-critical, but I do
need to minimize downtime). I need to schedule downtime and will only have a limited window to work in, so I'm
trying to figure out exactly what is wrong beforehand so I'm prepared when I work on this machine.
For starters, here's information about the machine:
HW: Supermicro X9DRD-7NL4F motherboard, w/ 2x E5-2680v2 (Ivy Bridge),
16x16GB DIMMs.
- There are 8 DIMM slots per CPU socket (P1 and P2), labeled as:
P1-DIMMA1, P1-DIMMB1, P1-DIMMC1, P1-DIMMD1, P1-DIMMA2, P1-DIMMB2, P1-DIMMC2, P1-DIMMD2
P2-DIMME1, P2-DIMMF1, P2-DIMMG1, P2-DIMMH1, P2-DIMME2, P2-DIMMF2, P2-DIMMG2, P2-DIMMH2
OS: CentOS 7.5 with all current updates as of 10/08/2018.
Background: This server was deployed around February of this year and had been running stably for many months. On 10/8/2018, it was rebooted
for a kernel update. Since then, it has been showing the errors I share below.
Before I go into the details, here's a high-level summary of my problem: I'm getting memory error messages, but I can't figure out
which module is bad because I get 3 different answers depending on how I analyze or source the data:
1) If I follow the reported memory address, combined with information from dmidecode, my conclusion is that the problem is located at
P2-DIMMF2.
2) If I look at the output of edac-util, my conclusion is that the problem is located at P2-DIMME1 and P2-DIMMG1.
3) If I look at the BMC event log from the Supermicro IPMI interface, my conclusion is that the problem is located at P2-DIMMF1.
As you can see, 3 different results do not give me much confidence.
DETAILS:
Here's modinfo on sb_edac:
Code: Select all
# modinfo sb_edac
filename: /lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/edac/sb_edac.ko.xz
description: MC Driver for Intel Sandy Bridge and Ivy Bridge memory controllers - Ver: 1.1.1
author: Red Hat Inc. (http://www.redhat.com)
author: Mauro Carvalho Chehab <mchehab@redhat.com>
license: GPL
retpoline: Y
rhelversion: 7.5
srcversion: 873094F4D6922741B0DE7CA
alias: x86cpu:vendor:0000:family:0006:model:0085:feature:*
alias: x86cpu:vendor:0000:family:0006:model:0057:feature:*
alias: x86cpu:vendor:0000:family:0006:model:0056:feature:*
alias: x86cpu:vendor:0000:family:0006:model:004F:feature:*
alias: x86cpu:vendor:0000:family:0006:model:003F:feature:*
alias: x86cpu:vendor:0000:family:0006:model:003E:feature:*
alias: x86cpu:vendor:0000:family:0006:model:002D:feature:*
depends:
intree: Y
vermagic: 3.10.0-862.14.4.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: E4:A1:B6:8F:46:8A:CA:5C:22:84:50:53:18:FD:9D:AD:72:4B:13:03
sig_hashalgo: sha256
parm: edac_op_state:EDAC Error Reporting state: 0=Poll,1=NMI (int)
Here are sample EDAC messages I'm seeing:
Code: Select all
[124519.723865] mce: [Hardware Error]: Machine check events logged
[124519.723881] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[124519.723883] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 7: 8c00004000010091
[124519.723885] EDAC sbridge MC1: TSC 0
[124519.723886] EDAC sbridge MC1: ADDR 2fa25b5880
[124519.723887] EDAC sbridge MC1: MISC 140724686
[124519.723889] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539121624 SOCKET 1 APIC 20
[124520.445994] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2fa25b5 offset:0x880 grain:32 syndrome:0x0 - areaRAM err_code:0001:0091 socket:1 ha:0 channel_mask:4 rank:1)
[125043.238458] mce: [Hardware Error]: Machine check events logged
[125043.238479] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[125043.238482] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 10: 8c000048000800c1
[125043.238483] EDAC sbridge MC1: TSC 0
[125043.238485] EDAC sbridge MC1: ADDR 2fa25b5000
[125043.238486] EDAC sbridge MC1: MISC 90840010001108c
[125043.238488] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539122148 SOCKET 1 APIC 20
[125043.470434] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - areaRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)
From the above messages, I see that the problem addresses are:
ADDR 2fa25b5000
ADDR 2fa25b5880
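For reference, the ADDR values can be cross-checked against the page and offset fields that the EDAC "CE" lines report. The following is a minimal Python sketch, not part of the original troubleshooting. It assumes 4 KiB pages (the x86_64 default), and the regular expression and the names `parse_ce_lines` and `PAGE_SHIFT` are illustrative only.

```python
import re

# The two corrected-error lines from the kernel log above, with wrapped lines rejoined.
LOG = """\
EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2fa25b5 offset:0x880 grain:32 syndrome:0x0 - areaRAM err_code:0001:0091 socket:1 ha:0 channel_mask:4 rank:1)
EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - areaRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)
"""

PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Hypothetical pattern matching only the fields needed here.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): \d+ CE .*?"
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)


def parse_ce_lines(text):
    """Return one dict per CE line with the controller, channel, slot and address."""
    events = []
    for m in CE_LINE.finditer(text):
        # Rebuild the physical address from the page frame number and in-page offset.
        addr = (int(m["page"], 16) << PAGE_SHIFT) | int(m["offset"], 16)
        events.append({
            "mc": int(m["mc"]),
            "channel": int(m["channel"]),
            "slot": int(m["slot"]),
            "addr": addr,
        })
    return events


events = parse_ce_lines(LOG)
for e in events:
    print(f"mc{e['mc']} channel {e['channel']} slot {e['slot']} addr {e['addr']:#x}")
```

Both reconstructed addresses match the ADDR lines, while the channel fields differ between the two events (2 and 0).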
If I look through the output of dmidecode -t 20, I find this entry that fits that address range:
Code: Select all
Handle 0x0048, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x02C00000000
Ending Address: 0x02FFFFFFFFF
Range Size: 16 GB
Physical Device Handle: 0x0047
Memory Array Mapped Address Handle: 0x0040
Partition Row Position: 1
2,C00,000,000 < 2,FA2,5B5,000 < 2,FA2,5B5,880 < 2,FFF,FFF,FFF, right?
So, if that is correct, the entry points to "Physical Device Handle: 0x0047". If I then look at dmidecode -t 17, I find:
Code: Select all
Handle 0x0047, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x003F
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMMF2
Bank Locator: P1_Node1_Channel1_Dimm1
Type: DDR3
Type Detail: Registered (Buffered)
Speed: 1600 MHz
Manufacturer: Hynix Semiconductor
Serial Number: 4F920CCC
Asset Tag: Dimm3_AssetTag
Part Number: HMT42GR7MFR4C-PB
Rank: 2
Configured Clock Speed: 1600 MHz
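The lookup from error address to DIMM locator described here can be sketched in a few lines of Python. This is only an illustration using the single type 20 entry and the single type 17 entry quoted in the post. The names `MAPPED_RANGES`, `DEVICES` and `locate` are made up for the example. The sketch also carries the same assumption as the manual lookup: that each firmware-reported range corresponds to exactly one DIMM, which the post does not establish.

```python
# Values copied from the dmidecode -t 20 and -t 17 entries shown in the post.
MAPPED_RANGES = [
    {"start": 0x02C00000000, "end": 0x02FFFFFFFFF, "device_handle": 0x0047},
]
DEVICES = {0x0047: "P2-DIMMF2"}  # Physical Device Handle -> Locator


def locate(addr):
    """Return the Locator of the DIMM whose mapped range contains addr, or None."""
    for r in MAPPED_RANGES:
        if r["start"] <= addr <= r["end"]:
            return DEVICES.get(r["device_handle"])
    return None


for addr in (0x2FA25B5000, 0x2FA25B5880):
    print(f"{addr:#x} -> {locate(addr)}")
```

Both addresses fall inside the 0x02C00000000 to 0x02FFFFFFFFF range, so under that assumption they resolve to handle 0x0047.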
So, I conclude that the problem is located at P2-DIMMF2.
Next, if I look at edac-util -v output:
Code: Select all
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 7 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 92 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
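mc1 points to CPU2. I interpret Ch#0-3_DIMM#0 as P2-DIMME1/F1/G1/H1, whereas I interpret Ch#0-3_DIMM#1 as P2-DIMME2/F2/G2/H2. If that is
correct, then the CEs are happening at P2-DIMME1 and P2-DIMMG1.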
At this point, I'm already confused, because the interpretation of the output of edac-util is not consistent with following the address of
the error + dmidecode! How can the error report an address that is not where the CEs are being counted?
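The channel-to-slot reading of the edac-util output can be written down as a small lookup, which makes the assumption explicit. This is a minimal Python sketch. The mapping (SrcID 0 to P1 channels A-D, SrcID 1 to P2 channels E-H, DIMM#0 to slot 1, DIMM#1 to slot 2) is the post's own interpretation and has not been verified against the board, and `CHANNEL_LETTERS`, `slot_label` and `corrected_by_slot` are hypothetical names.

```python
import re

# The mc1 DIMM counters copied from the edac-util -v output above.
EDAC_UTIL = """\
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 7 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 92 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
"""

# Assumed mapping (the post's interpretation, unverified): channel index -> letter.
CHANNEL_LETTERS = {0: "ABCD", 1: "EFGH"}

LINE = re.compile(
    r"CPU_SrcID#(\d+)_Ha#\d+_Chan#(\d+)_DIMM#(\d+): (\d+) Corrected Errors"
)


def slot_label(src_id, chan, dimm):
    """Translate an EDAC (SrcID, channel, DIMM#) triple into a silkscreen label."""
    return f"P{src_id + 1}-DIMM{CHANNEL_LETTERS[src_id][chan]}{dimm + 1}"


def corrected_by_slot(text):
    """Return {slot label: corrected error count} for every non-zero counter."""
    counts = {}
    for src, chan, dimm, n in LINE.findall(text):
        if int(n):
            counts[slot_label(int(src), int(chan), int(dimm))] = int(n)
    return counts


print(corrected_by_slot(EDAC_UTIL))
```

With this mapping the non-zero counters land on P2-DIMME1 (7) and P2-DIMMG1 (92), which is the reading given in the post. If the real channel ordering differs, only `CHANNEL_LETTERS` needs to change.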
Finally, to add more confusion to this, I went to look at the BMC event log, and found these messages:
Code: Select all
19 2018/10/12 16:44:20 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
20 2018/10/12 17:01:23 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
21 2018/10/12 17:31:56 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
22 2018/10/12 17:53:25 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
23 2018/10/12 18:16:43 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
24 2018/10/12 18:32:25 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
25 2018/10/12 18:53:08 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
26 2018/10/12 19:19:09 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
27 2018/10/12 19:33:15 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
28 2018/10/13 01:27:19 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
So, according to the BMC event log, the problem is located at CPU2 DIMMF1, or P2-DIMMF1.
Where am I misinterpreting this information?
Thanks in advance for any assistance...