ECC error (mcelog)

Issues related to hardware problems
CANnix
Posts: 6
Joined: 2019/02/06 15:34:17

ECC error (mcelog)

Post by CANnix » 2019/02/06 15:48:02

Hi all,

I am running a CentOS 6 server on a Xeon X5650 machine (SMP). Under high CPU load the server crashes regularly after some time. From the mcelog output I can trace the error back to data transfer between RAM and the memory controller (MCU):
...
STATUS 88010282 MCGSTATUS 0
MCGCAP 1c09 APICID 35 SOCKETID 1
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 20 BANK 6 TSC b7065eeaa1810
TIME 1545643603 Mon Dec 24 10:26:43 2018
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 13 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 6 TSC b7065eeaa18b0
TIME 1545643603 Mon Dec 24 10:26:43 2018
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8
MISC 5222508000086200
TIME 1547586533 Tue Jan 15 22:08:53 2019
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: MEMORY CONTROLLER MS_CHANNELunspecified_ERR
Transaction: Memory scrubbing error
Memory ECC error occurred during scrub
Memory corrected error count (CORE_ERR_CNT): 1
Memory transaction Tracker ID (RTId): 0
Memory DIMM ID of error: 0
Memory channel ID of error: 2
Memory ECC syndrome: 52225080
STATUS 88000040000200cf MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
...
Now the thing is that Memtest runs fine, except for the SMP version, which crashes. AFAIK this is not too uncommon and does not necessarily indicate a hardware fault. To rule out defective RAM as the root cause, the respective DIMM was replaced with a brand new one before the latest crash (the one documented in the log above).

Does anyone see a solution or a direction for further analysis?

Best regards

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: ECC error (mcelog)

Post by TrevorH » 2019/02/06 16:20:27

Your latest one is on a different CPU and memory bank than the first two. The first two are CPU 20 (and 4), bank 6; the latest one is CPU 1, bank 8. If your machines are anything like mine, then the even-numbered CPUs are all on one socket and the odd-numbered ones are on the other. I think it likely that bank 6 on CPUs 4 and 20 is the same memory module, while bank 8 on CPU 1 is likely to be a different one.
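The CPU/bank comparison above can also be done by script, which helps when the log holds many more records. The following is only a sketch: it assumes that each record starts at a "Hardware event." line (which is how the excerpt in the first post appears to be laid out), the function names are made up for illustration, and SAMPLE is an abridged copy of the three complete records from that excerpt.

```python
import re
from collections import Counter

# Abridged copy of the three complete records from the log in the first post.
SAMPLE = """\
Hardware event. This is not a software error.
MCE 0
CPU 20 BANK 6 TSC b7065eeaa1810
Uncorrected error
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 13 SOCKETID 0
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 6 TSC b7065eeaa18b0
Uncorrected error
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 4 SOCKETID 0
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8
Corrected error
MCA: MEMORY CONTROLLER MS_CHANNELunspecified_ERR
STATUS 88000040000200cf MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1
"""


def parse_records(text):
    """Split mcelog text into records (hypothetical helper, not part of mcelog)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Hardware event."):
            # Assumption: this line opens a new record.
            current = {}
            records.append(current)
            continue
        if current is None:
            continue  # tail of a record whose beginning was cut off
        m = re.match(r"CPU (\d+) BANK (\d+)", line)
        if m:
            current["cpu"] = int(m.group(1))
            current["bank"] = int(m.group(2))
        m = re.search(r"SOCKETID (\d+)", line)
        if m:
            current["socket"] = int(m.group(1))
        if line.startswith("MCA:"):
            current["mca"] = line[4:].strip()
        if line in ("Corrected error", "Uncorrected error"):
            current["severity"] = line.split()[0].lower()
    # Drop records that never reached their CPU/BANK line.
    return [r for r in records if "cpu" in r]


def summarize(records):
    """Count events per (socket, cpu, bank, severity)."""
    return Counter(
        (r.get("socket"), r["cpu"], r["bank"], r.get("severity")) for r in records
    )


for (socket, cpu, bank, severity), n in summarize(parse_records(SAMPLE)).items():
    print(f"socket {socket} cpu {cpu} bank {bank}: {n} {severity} event(s)")
```

To run it against a real log, read the file into a string and pass it to parse_records instead of SAMPLE. The "mca" field keeps mcelog's own description of each event, so events can also be grouped by type rather than by bank.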

If you have libvirt installed on there, then running virsh capabilities is one way to see which CPU numbers belong to which socket. If you don't have libvirt, then try lscpu -a --extended (part of the util-linux-ng package) instead and look for lines like


NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
Edit: updated yet again after I realised that this is CentOS 6 and lscpu is not installed by default and then again because it's old and doesn't know about lscpu -y
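As a rough sketch, the two NUMA lines above can be turned into a CPU-to-node lookup, so the CPU numbers from the mcelog records can be checked without counting by hand. The function name is hypothetical, and the sketch assumes that one NUMA node corresponds to one socket on this board, which lscpu's output alone does not prove.

```python
import re

# The two lines quoted above, copied verbatim.
LSCPU_LINES = """\
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
"""


def cpu_to_node(lscpu_text):
    """Map each CPU number to its NUMA node (hypothetical helper)."""
    mapping = {}
    for line in lscpu_text.splitlines():
        m = re.match(r"NUMA node(\d+) CPU\(s\):\s*(\S+)", line)
        if not m:
            continue
        node = int(m.group(1))
        for part in m.group(2).split(","):
            if "-" in part:
                # lscpu can also print ranges such as 0-5
                lo, hi = part.split("-")
                cpus = range(int(lo), int(hi) + 1)
            else:
                cpus = [int(part)]
            for cpu in cpus:
                mapping[cpu] = node
    return mapping


nodes = cpu_to_node(LSCPU_LINES)
for cpu in (20, 4, 1):  # the CPUs named in the mcelog records
    print(f"CPU {cpu} is on NUMA node {nodes[cpu]}")
```

With the lines above, CPUs 20 and 4 land on node 0 and CPU 1 on node 1, which agrees with the SOCKETID values in the records.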
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are deadest, do not use them.
Use the FAQ Luke

CANnix
Posts: 6
Joined: 2019/02/06 15:34:17

Re: ECC error (mcelog)

Post by CANnix » 2019/02/06 16:26:40

Thanks for the reply! Yes, I get the same output from lscpu as you posted. Am I right that defective RAM does not seem to be the problem here?

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: ECC error (mcelog)

Post by TrevorH » 2019/02/06 16:35:35

Depends. Did you replace all your RAM or just the defective DIMM in node 0 bank 6? Because the latest MCE is on bank 8 on the other node. Sounds like a different DIMM to me.

CANnix
Posts: 6
Joined: 2019/02/06 15:34:17

Re: ECC error (mcelog)

Post by CANnix » 2019/02/06 17:02:18

I replaced the DIMMs in banks 6, 8 and 10.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: ECC error (mcelog)

Post by TrevorH » 2019/02/06 17:26:53

As far as I know, banks exist on both sockets, so if you replaced the DIMM in bank 8 on CPU 0, then you still have the error on bank 8 of CPU 1.

CANnix
Posts: 6
Joined: 2019/02/06 15:34:17

Re: ECC error (mcelog)

Post by CANnix » 2019/02/06 17:37:56

This is my mainboard: S5520UR (Intel). lshw lists 12 banks with indices from 0 to 11 and the architecture is SMP. And as both CPUs share the same memory address space, the bank names are probably unique.

The log indicates that the replaced DIMMs fail as much as the old ones. Is there a way to detect failures at the memory controller?
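One way to see which unit reported each event is to decode the STATUS values from the log by hand. The sketch below assumes the architectural MCi_STATUS layout as I remember it from the Intel SDM (flag bits 63 to 57, corrected error count in bits 52:38, MCA error code in bits 15:0); please verify it against the manual for this CPU before relying on it. The function names are invented for the example.

```python
# Flag bits of an MCi_STATUS value. Assumed architectural layout,
# to be verified against the Intel SDM for the CPU in question.
FLAG_BITS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]


def decode_status(status_hex):
    """Split a STATUS value from mcelog into its fields (hypothetical helper)."""
    value = int(status_hex, 16)
    return {
        "flags": [name for bit, name in FLAG_BITS if (value >> bit) & 1],
        "corrected_count": (value >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (value >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": value & 0xFFFF,                     # bits 15:0
    }


def classify(mca_code):
    """Name the reporting unit for two of the compound error code formats."""
    code = mca_code & 0xEFFF  # ignore bit 12 (the F bit of the compound format)
    if (code & 0xFF80) == 0x0080:  # 0000 0000 1MMM CCCC
        return "memory controller"
    if (code & 0xFF00) == 0x0100:  # 0000 0001 RRRR TTLL
        return "cache hierarchy"
    return "other"


# The two distinct full-length STATUS values from the log in the first post.
for status in ("b200000080000106", "88000040000200cf"):
    fields = decode_status(status)
    print(status, fields["flags"], classify(fields["mca_code"]),
          "corrected count:", fields["corrected_count"])
```

Decoded this way, the two uncorrected events (bank 6) carry a cache hierarchy code, and only the corrected event (bank 8) carries a memory controller code. That matches what mcelog itself printed ("Data CACHE Level-2 Generic Error" versus "MEMORY CONTROLLER"), so it is a cross-check of the log rather than a diagnosis.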

CANnix
Posts: 6
Joined: 2019/02/06 15:34:17

Re: ECC error (mcelog)

Post by CANnix » 2019/02/11 10:22:08

As we cannot find the root cause of the problem, the server will be taken out of service. However, if anyone still has an idea how to find the problem, I would be glad to hear from you! Kind of a strange issue...

tunk
Posts: 1205
Joined: 2017/02/22 15:08:17

Re: ECC error (mcelog)

Post by tunk » 2019/02/11 11:58:29

You say it's happening during high load, so it could be stress or temperature related. Do the fans work? Maybe the PSUs have deteriorated, causing voltage instability under high load. AFAIK the memory controller is built into the CPU - could it be bad contact between CPU and motherboard?

CANnix
Posts: 6
Joined: 2019/02/06 15:34:17

Re: ECC error (mcelog)

Post by CANnix » 2019/02/11 13:35:26

It happens while rendering a simulation. However, if I just stress the CPU, it runs fine.
