confident about what is wrong. These errors are coming from a production machine (not life-threatening or mission-critical, but I do
need to minimize downtime). I have to schedule the downtime and will only have a limited window to work in, so I'm
trying to figure out exactly what is wrong before I operate on this machine.
For starters, here's information about the machine:
HW: Supermicro X9DRD-7NL4F motherboard, w/ 2x E5-2680v2 (Ivy Bridge), 16x16GB DIMMs.
- there are 8x DIMM slots per CPU socket, labeled P1 and P2 as:
P1-DIMMA1, P1-DIMMB1, P1-DIMMC1, P1-DIMMD1, P1-DIMMA2, P1-DIMMB2, P1-DIMMC2, P1-DIMMD2
P2-DIMME1, P2-DIMMF1, P2-DIMMG1, P2-DIMMH1, P2-DIMME2, P2-DIMMF2, P2-DIMMG2, P2-DIMMH2
OS: CentOS 7.5 with all current updates as of 10/08/2018.
Background: This server was deployed around February of this year and had been running stably for many months. On 10/08/2018, it was rebooted
for a kernel update. Since then, it has been showing the errors I share below.
Before I go into the details, here's a high-level summary of my problem: I'm getting memory error messages, but I can't figure out
which module is bad, because I get 3 different answers depending on how I source and analyze the data:
1) if I follow the reported memory address, combined with information from dmidecode, my conclusion is that the problem is located at P2-DIMMF2
2) if I look at the output of edac-util, my conclusion is that the problem is located at P2-DIMME1 and P2-DIMMG1
3) if I look at the BMC event log from the Supermicro IPMI interface, my conclusion is that the problem is located at P2-DIMMF1
As you can see, 3 different results do not give me much confidence.
DETAILS:
Here's modinfo on sb_edac:
# modinfo sb_edac
filename: /lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/edac/sb_edac.ko.xz
description: MC Driver for Intel Sandy Bridge and Ivy Bridge memory controllers - Ver: 1.1.1
author: Red Hat Inc. (http://www.redhat.com)
author: Mauro Carvalho Chehab <mchehab@redhat.com>
license: GPL
retpoline: Y
rhelversion: 7.5
srcversion: 873094F4D6922741B0DE7CA
alias: x86cpu:vendor:0000:family:0006:model:0085:feature:*
alias: x86cpu:vendor:0000:family:0006:model:0057:feature:*
alias: x86cpu:vendor:0000:family:0006:model:0056:feature:*
alias: x86cpu:vendor:0000:family:0006:model:004F:feature:*
alias: x86cpu:vendor:0000:family:0006:model:003F:feature:*
alias: x86cpu:vendor:0000:family:0006:model:003E:feature:*
alias: x86cpu:vendor:0000:family:0006:model:002D:feature:*
depends:
intree: Y
vermagic: 3.10.0-862.14.4.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: E4:A1:B6:8F:46:8A:CA:5C:22:84:50:53:18:FD:9D:AD:72:4B:13:03
sig_hashalgo: sha256
parm: edac_op_state:EDAC Error Reporting state: 0=Poll,1=NMI (int)
Here are the errors from the kernel log:
[124519.723865] mce: [Hardware Error]: Machine check events logged
[124519.723881] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[124519.723883] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 7: 8c00004000010091
[124519.723885] EDAC sbridge MC1: TSC 0
[124519.723886] EDAC sbridge MC1: ADDR 2fa25b5880
[124519.723887] EDAC sbridge MC1: MISC 140724686
[124519.723889] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539121624 SOCKET 1 APIC 20
[124520.445994] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2fa25b5 offset:0x880 grain:32 syndrome:0x0 - areaRAM err_code:0001:0091 socket:1 ha:0 channel_mask:4 rank:1)
[125043.238458] mce: [Hardware Error]: Machine check events logged
[125043.238479] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[125043.238482] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 10: 8c000048000800c1
[125043.238483] EDAC sbridge MC1: TSC 0
[125043.238485] EDAC sbridge MC1: ADDR 2fa25b5000
[125043.238486] EDAC sbridge MC1: MISC 90840010001108c
[125043.238488] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539122148 SOCKET 1 APIC 20
[125043.470434] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - areaRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)
ADDR 2fa25b5000
ADDR 2fa25b5880
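For reference, the two "Machine Check Event" status words above can be split into fields. This is only a minimal sketch: the bit positions are my reading of the architectural MCi_STATUS layout, which I assume the sb_edac driver follows, and the helper names (bits, decode_status) are made up for illustration.

```python
# Sketch: split the 64-bit MCE status words from the log into named fields.
# Assumption: bit positions follow the architectural MCi_STATUS layout
# (VAL=63, UC=61, MISCV=59, ADDRV=58, corrected count=52:38,
#  model-specific code=31:16, MCA error code=15:0).

def bits(value, lo, hi):
    """Return bits lo..hi (inclusive) of value as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

# Assumption: for memory-controller errors, bits 6:4 of the MCA error code
# give the operation type and bits 3:0 give a channel field.
OPTYPES = {0: "generic", 1: "memory read", 2: "memory write",
           3: "addr/cmd", 4: "memory scrubbing"}

def decode_status(status):
    errcode = bits(status, 0, 15)
    return {
        "valid": bool(bits(status, 63, 63)),
        "uncorrected": bool(bits(status, 61, 61)),
        "misc_valid": bool(bits(status, 59, 59)),
        "addr_valid": bool(bits(status, 58, 58)),
        "ce_count": bits(status, 38, 52),
        "mscod": bits(status, 16, 31),
        "errcode": errcode,
        "optype": OPTYPES.get(bits(errcode, 4, 6), "reserved"),
        "channel_field": bits(errcode, 0, 3),
    }

for raw in ("8c00004000010091", "8c000048000800c1"):
    fields = decode_status(int(raw, 16))
    print(raw, fields)
```

The mscod and errcode values come out as 0001:0091 and 0008:00c1, matching the err_code fields in the EDAC lines, and the operation types match the "memory read" and "memory scrubbing" wording. Both events decode to a channel field of 1; whether that field corresponds to the Chan#N in the EDAC label is not something this sketch settles.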
If I look through the output of dmidecode -t 20, I find this entry that fits that address range:
Handle 0x0048, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x02C00000000
Ending Address: 0x02FFFFFFFFF
Range Size: 16 GB
Physical Device Handle: 0x0047
Memory Array Mapped Address Handle: 0x0040
Partition Row Position: 1
If that is correct, it points to "Physical Device Handle: 0x0047". Looking that handle up in dmidecode -t 17, I find:
Handle 0x0047, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x003F
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMMF2
Bank Locator: P1_Node1_Channel1_Dimm1
Type: DDR3
Type Detail: Registered (Buffered)
Speed: 1600 MHz
Manufacturer: Hynix Semiconductor
Serial Number: 4F920CCC
Asset Tag: Dimm3_AssetTag
Part Number: HMT42GR7MFR4C-PB
Rank: 2
Configured Clock Speed: 1600 MHz
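The address lookup I did by hand can be sketched in a few lines. Assumptions: 4 KiB pages (so ADDR is page shifted left by 12, plus offset), and that each type 20 range maps to exactly one DIMM, which is how I read the dmidecode output. The record text is an abbreviated copy of the two entries above, and the function names are hypothetical.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, so ADDR = (page << 12) | offset

# Abbreviated copies of the two dmidecode records quoted in this post.
DMI_TEXT = """\
Handle 0x0048, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x02C00000000
Ending Address: 0x02FFFFFFFFF
Range Size: 16 GB
Physical Device Handle: 0x0047

Handle 0x0047, DMI type 17, 34 bytes
Memory Device
Locator: P2-DIMMF2
Bank Locator: P1_Node1_Channel1_Dimm1
"""

HANDLE_RE = re.compile(r"^Handle (0x[0-9A-Fa-f]+), DMI type (\d+)")

def parse_dmi_records(text):
    """Return {handle: {"type": int, field: value, ...}} from dmidecode text."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        m = HANDLE_RE.match(line)
        if m:
            current = {"type": int(m.group(2))}
            records[m.group(1)] = current
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records

def locator_for_address(records, addr):
    """Find the type 20 range containing addr and follow its device handle."""
    for rec in records.values():
        if rec["type"] != 20:
            continue
        start = int(rec["Starting Address"], 16)
        end = int(rec["Ending Address"], 16)
        if start <= addr <= end:
            return records[rec["Physical Device Handle"]]["Locator"]
    return None

records = parse_dmi_records(DMI_TEXT)
for page, offset in [(0x2fa25b5, 0x880), (0x2fa25b5, 0x0)]:
    addr = (page << PAGE_SHIFT) | offset
    print(hex(addr), locator_for_address(records, addr))
```

Both reported addresses land in the 0x02C00000000 to 0x02FFFFFFFFF range and resolve to P2-DIMMF2, which is the first of my three answers.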
Next, if I look at edac-util -v output:
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 7 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 92 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
correct, then the CEs are happening at P2-DIMME1 and P2-DIMMG1.
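To make that interpretation explicit, here is a rough sketch that pulls the non-zero corrected-error counts out of the edac-util output and maps them to slot labels. The mapping is my assumption, not something the driver or the board manual confirms: SrcID 0/1 is P1/P2, channels 0 to 3 are A to D (or E to H), and DIMM#0/#1 is slot 1/2. The function names are made up for illustration.

```python
import re

# A few of the mc1 lines from the edac-util -v output above.
EDAC_TEXT = """\
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 7 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 92 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
"""

LINE_RE = re.compile(
    r"^mc(\d+): csrow(\d+): "
    r"CPU_SrcID#(\d+)_Ha#(\d+)_Chan#(\d+)_DIMM#(\d+): "
    r"(\d+) Corrected Errors$"
)

def parse_ce_counts(text):
    """Return {(socket, channel, dimm): corrected_error_count}."""
    counts = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            socket, chan, dimm, n = (int(m.group(i)) for i in (3, 5, 6, 7))
            counts[(socket, chan, dimm)] = n
    return counts

# ASSUMPTION: channel N on socket 0 is letter A..D, on socket 1 is E..H.
CHANNEL_LETTERS = {0: "ABCD", 1: "EFGH"}

def assumed_slot_label(socket, chan, dimm):
    """Map an EDAC (socket, channel, dimm) triple to a silkscreen-style label."""
    return f"P{socket + 1}-DIMM{CHANNEL_LETTERS[socket][chan]}{dimm + 1}"

counts = parse_ce_counts(EDAC_TEXT)
for key, n in sorted(counts.items()):
    if n > 0:
        print(assumed_slot_label(*key), n)
```

Under that mapping, the 7 and 92 corrected errors land on P2-DIMME1 and P2-DIMMG1 respectively. If the channel-to-letter assumption is wrong, so is this answer.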
At this point, I'm already confused, because the interpretation of the output of edac-util is not consistent with following the address of
the error + dmidecode! How can the error report an address that is not where the CEs are being counted?
Finally, to add more confusion to this, I went to look at the BMC event log, and found these messages:
19 2018/10/12 16:44:20 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
20 2018/10/12 17:01:23 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
21 2018/10/12 17:31:56 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
22 2018/10/12 17:53:25 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
23 2018/10/12 18:16:43 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
24 2018/10/12 18:32:25 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
25 2018/10/12 18:53:08 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
26 2018/10/12 19:19:09 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
27 2018/10/12 19:33:15 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
28 2018/10/13 01:27:19 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
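For completeness, the BMC entries can be tallied per slot the same way. This is a minimal sketch that assumes every line has exactly the format shown above; the regular expression and the P2-style label it builds are my own construction, and only three of the ten log lines are included as sample input.

```python
import re
from collections import Counter

# Three of the BMC event log lines quoted above.
BMC_TEXT = """\
19 2018/10/12 16:44:20 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
20 2018/10/12 17:01:23 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
28 2018/10/13 01:27:19 OEM Memory Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
"""

EVENT_RE = re.compile(
    r"^(\d+)\s+(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"OEM Memory Correctable Memory ECC @ (DIMM\w+)\(CPU(\d+)\) - Asserted$"
)

def tally_bmc_events(text):
    """Count correctable ECC events per slot, labeled like P2-DIMMF1."""
    tally = Counter()
    for line in text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            dimm, cpu = m.group(3), m.group(4)
            tally[f"P{cpu}-{dimm}"] += 1
    return tally

tally = tally_bmc_events(BMC_TEXT)
print(dict(tally))
```

Every BMC entry names the same slot, P2-DIMMF1, which is the third answer and matches neither of the other two.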
Where am I misinterpreting this information?
Thanks in advance for any assistance...