Need help decoding EDAC memory error messages

Post by theoriginalguru » 2018/10/13 20:03:19

I've spent the past week researching this problem online, but the sources disagree and I'm not confident about what is wrong. The errors
are coming from a production machine (not life-threatening or mission-critical, but I do need to minimize downtime). I need to schedule
downtime and will have a limited window to work in, so I'm trying to figure out exactly what is wrong before I operate on this machine.

For starters, here's information about the machine:

HW: Supermicro X9DRD-7NL4F motherboard with 2x E5-2680v2 (Ivy Bridge) and 16x 16GB DIMMs.
- There are 8 DIMM slots per CPU socket (P1 and P2), labeled as follows:
P1-DIMMA1, P1-DIMMB1, P1-DIMMC1, P1-DIMMD1, P1-DIMMA2, P1-DIMMB2, P1-DIMMC2, P1-DIMMD2
P2-DIMME1, P2-DIMMF1, P2-DIMMG1, P2-DIMMH1, P2-DIMME2, P2-DIMMF2, P2-DIMMG2, P2-DIMMH2
OS: CentOS 7.5 with all current updates as of 10/08/2018.
Background: This server was deployed around February of this year and ran stably for many months. On 10/8/2018, it was rebooted
for a kernel update. Since then, it has been logging the errors shown below.

Before I go into the details, here's a high-level summary of my problem: I'm getting memory error messages, but I can't figure out
which module is bad because I get three different answers depending on how I analyze the data:

1) If I follow the reported memory address and combine it with information from dmidecode, I conclude that the problem is at P2-DIMMF2.
2) If I look at the output of edac-util, I conclude that the problem is at P2-DIMME1 and P2-DIMMG1.
3) If I look at the BMC event log in the Supermicro IPMI interface, I conclude that the problem is at P2-DIMMF1.

As you can see, three different answers do not inspire confidence.

DETAILS:

Here's modinfo on sb_edac:

# modinfo sb_edac
filename:       /lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/edac/sb_edac.ko.xz
description:    MC Driver for Intel Sandy Bridge and Ivy Bridge memory controllers -  Ver: 1.1.1
author:         Red Hat Inc. (http://www.redhat.com)
author:         Mauro Carvalho Chehab <mchehab@redhat.com>
license:        GPL
retpoline:      Y
rhelversion:    7.5
srcversion:     873094F4D6922741B0DE7CA
alias:          x86cpu:vendor:0000:family:0006:model:0085:feature:*
alias:          x86cpu:vendor:0000:family:0006:model:0057:feature:*
alias:          x86cpu:vendor:0000:family:0006:model:0056:feature:*
alias:          x86cpu:vendor:0000:family:0006:model:004F:feature:*
alias:          x86cpu:vendor:0000:family:0006:model:003F:feature:*
alias:          x86cpu:vendor:0000:family:0006:model:003E:feature:*
alias:          x86cpu:vendor:0000:family:0006:model:002D:feature:*
depends:
intree:         Y
vermagic:       3.10.0-862.14.4.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        E4:A1:B6:8F:46:8A:CA:5C:22:84:50:53:18:FD:9D:AD:72:4B:13:03
sig_hashalgo:   sha256
parm:           edac_op_state:EDAC Error Reporting state: 0=Poll,1=NMI (int)

Here are sample EDAC messages I'm seeing:

[124519.723865] mce: [Hardware Error]: Machine check events logged
[124519.723881] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[124519.723883] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 7: 8c00004000010091
[124519.723885] EDAC sbridge MC1: TSC 0
[124519.723886] EDAC sbridge MC1: ADDR 2fa25b5880
[124519.723887] EDAC sbridge MC1: MISC 140724686
[124519.723889] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539121624 SOCKET 1 APIC 20
[124520.445994] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2fa25b5 offset:0x880 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:4 rank:1)

[125043.238458] mce: [Hardware Error]: Machine check events logged
[125043.238479] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[125043.238482] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 10: 8c000048000800c1
[125043.238483] EDAC sbridge MC1: TSC 0
[125043.238485] EDAC sbridge MC1: ADDR 2fa25b5000
[125043.238486] EDAC sbridge MC1: MISC 90840010001108c
[125043.238488] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539122148 SOCKET 1 APIC 20
[125043.470434] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)

From the above messages, I see that the problem addresses are:
ADDR 2fa25b5000
ADDR 2fa25b5880
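
As a cross-check on the log above, here is a minimal Python sketch that splits the two status words from the "Machine Check Event" lines into fields. It assumes the IA32_MCi_STATUS layout described in the Intel SDM (bits 15:0 MCA error code, bits 31:16 model-specific code, bits 52:38 corrected error count) and the memory-controller compound code format 0000 0000 1MMM CCCC. The name decode_mci_status and the dictionary keys are my own, not taken from any tool.

```python
# Sketch: split an IA32_MCi_STATUS value into the fields relevant here.
# Bit positions follow the Intel SDM layout (an assumption, see above).

MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_mci_status(status: int) -> dict:
    """Return selected fields of a 64-bit machine-check status word."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory-controller errors use the compound form 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        fields["mem_op"] = MEM_OPS.get((mca_code >> 4) & 0x7, "reserved")
        fields["mca_channel"] = mca_code & 0xF  # 0xF means "not specified"
    return fields


if __name__ == "__main__":
    # The two status words quoted in the kernel log above.
    for word in ("8c00004000010091", "8c000048000800c1"):
        print(word, decode_mci_status(int(word, 16)))
```

Both words decode to a corrected (not uncorrected) error with a count of 1, and the model-specific/MCA code pairs 0001:0091 and 0008:00c1 match the err_code fields that EDAC prints. The low nibble of the MCA code is 1 in both cases, whereas the EDAC lines report channel 2 and channel 0. I have not verified whether that nibble uses the same channel numbering as the driver.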

If I look through the output of dmidecode -t 20, I find this entry that fits that address range:

Handle 0x0048, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x02C00000000
    Ending Address: 0x02FFFFFFFFF
    Range Size: 16 GB
    Physical Device Handle: 0x0047
    Memory Array Mapped Address Handle: 0x0040
    Partition Row Position: 1

Both addresses fall inside that range: 0x2C00000000 < 0x2FA25B5000 < 0x2FA25B5880 < 0x2FFFFFFFFF, right?
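
To rule out a slip in the hex comparison, the range check can be done mechanically. This is a minimal sketch that hard-codes only the one DMI type 20 entry quoted above. RANGES and handle_for are hypothetical names, and a complete check would list every "Memory Device Mapped Address" entry from dmidecode -t 20.

```python
# Sketch: find which DMI type 20 range contains a physical address.
# Only the single range quoted above is listed here (assumption: the
# other 15 entries do not overlap it).

RANGES = [
    # (start, end inclusive, physical device handle)
    (0x02C00000000, 0x02FFFFFFFFF, 0x0047),
]


def handle_for(addr: int):
    """Return the device handle whose range contains addr, else None."""
    for start, end, handle in RANGES:
        if start <= addr <= end:
            return handle
    return None


if __name__ == "__main__":
    # The two ADDR values from the EDAC messages.
    for text in ("2fa25b5000", "2fa25b5880"):
        handle = handle_for(int(text, 16))
        print(text, "->", None if handle is None else f"0x{handle:04X}")
```

This only confirms which SMBIOS range contains each address. It cannot tell me whether the BIOS table matches how the memory controller actually distributes addresses across channels.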

If that is correct, the entry points to "Physical Device Handle: 0x0047". Looking up that handle in dmidecode -t 17, I find:

Handle 0x0047, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x003F
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: P2-DIMMF2
        Bank Locator: P1_Node1_Channel1_Dimm1
        Type: DDR3
        Type Detail: Registered (Buffered)
        Speed: 1600 MHz
        Manufacturer: Hynix Semiconductor
        Serial Number: 4F920CCC
        Asset Tag: Dimm3_AssetTag
        Part Number: HMT42GR7MFR4C-PB
        Rank: 2
        Configured Clock Speed: 1600 MHz

So, by this method, the problem is located at P2-DIMMF2.

Next, if I look at edac-util -v output:

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 7 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 92 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

mc1 corresponds to CPU2. I interpret Chan#0-3_DIMM#0 as P2-DIMME1/F1/G1/H1 and Chan#0-3_DIMM#1 as P2-DIMME2/F2/G2/H2. If that is
correct, then the CEs are occurring at P2-DIMME1 and P2-DIMMG1.
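
The interpretation above can be written down as a small Python sketch so that the mapping is explicit and easy to correct if it turns out to be wrong. The channel-to-letter table is my assumption (socket 0 to P1 A-D, socket 1 to P2 E-H, DIMM#0 to suffix 1, DIMM#1 to suffix 2), not something confirmed by Supermicro documentation. The function names are hypothetical.

```python
import re

# Assumed mapping (my interpretation, unconfirmed):
# socket 0 -> P1 channels A-D, socket 1 -> P2 channels E-H,
# DIMM#0 -> slot suffix 1, DIMM#1 -> slot suffix 2.
CHANNEL_LETTERS = {0: "ABCD", 1: "EFGH"}

LABEL_RE = re.compile(r"CPU_SrcID#(\d+)_Ha#\d+_Chan#(\d+)_DIMM#(\d+)")
COUNT_RE = re.compile(r"(CPU_SrcID#\S+): (\d+) Corrected Errors")

# Three lines copied from the edac-util -v output above.
SAMPLE = """\
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 7 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 92 Corrected Errors
"""


def edac_label_to_slot(label: str) -> str:
    """Translate an EDAC DIMM label into a board slot name."""
    match = LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized EDAC label: {label!r}")
    socket, channel, dimm = (int(g) for g in match.groups())
    letter = CHANNEL_LETTERS[socket][channel]
    return f"P{socket + 1}-DIMM{letter}{dimm + 1}"


def slots_with_errors(edac_util_output: str) -> dict:
    """Return {slot name: corrected error count} for nonzero counts."""
    result = {}
    for label, count in COUNT_RE.findall(edac_util_output):
        if int(count) > 0:
            result[edac_label_to_slot(label)] = int(count)
    return result


if __name__ == "__main__":
    print(slots_with_errors(SAMPLE))
```

Under this mapping the nonzero counters land on P2-DIMME1 (7) and P2-DIMMG1 (92). If the channel lettering is different on this board, only CHANNEL_LETTERS needs to change.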

At this point, I'm already confused, because my reading of the edac-util output is not consistent with what I get from the error
address plus dmidecode. How can the error report an address that is not on the DIMM where the CEs are being counted?

Finally, to add to the confusion, I checked the BMC event log and found these messages:

19    2018/10/12 16:44:20    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
20    2018/10/12 17:01:23    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
21    2018/10/12 17:31:56    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
22    2018/10/12 17:53:25    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
23    2018/10/12 18:16:43    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
24    2018/10/12 18:32:25    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
25    2018/10/12 18:53:08    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
26    2018/10/12 19:19:09    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
27    2018/10/12 19:33:15    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
28    2018/10/13 01:27:19    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted

So, according to the BMC event log, the problem is located at CPU2 DIMMF1, i.e. P2-DIMMF1.
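
To line the kernel messages up against the BMC log, the TIME field in the EDAC output can be converted to a calendar time. This is a minimal sketch, assuming TIME is seconds since the Unix epoch. mce_time_to_utc is a name I made up, and I have not confirmed which time zone the BMC clock uses.

```python
from datetime import datetime, timezone


def mce_time_to_utc(epoch_seconds: int) -> str:
    """Format an MCE TIME value (assumed Unix epoch seconds) as UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # TIME values from the two sample EDAC messages.
    for value in (1539121624, 1539122148):
        print(value, "->", mce_time_to_utc(value), "UTC")
```

The two sample messages convert to 2018-10-09 21:47:04 and 2018-10-09 21:55:48 UTC, so they predate the BMC entries quoted above. A like-for-like comparison would need EDAC lines from the same time window as the BMC events.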

Where am I misinterpreting this information?

Thanks in advance for any assistance...
