I am running a CentOS 6 server on a Xeon X5650 machine (SMP). Under high CPU load the server crashes regularly after some time. In the mcelog output I can trace the error back to the data transfer between RAM and MCU.
Memtest itself runs fine; only the SMP version crashes. AFAIK this is not too uncommon and does not necessarily indicate a hardware fault. To rule out defective RAM as the root cause, the affected DIMM was replaced with a brand-new one before the latest crash (the one documented in the log below). The mcelog output:
STATUS 88010282 MCGSTATUS 0
MCGCAP 1c09 APICID 35 SOCKETID 1
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 20 BANK 6 TSC b7065eeaa1810
TIME 1545643603 Mon Dec 24 10:26:43 2018
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 13 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 6 TSC b7065eeaa18b0
TIME 1545643603 Mon Dec 24 10:26:43 2018
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8
MISC 5222508000086200
TIME 1547586533 Tue Jan 15 22:08:53 2019
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: MEMORY CONTROLLER MS_CHANNELunspecified_ERR
Transaction: Memory scrubbing error
Memory ECC error occurred during scrub
Memory corrected error count (CORE_ERR_CNT): 1
Memory transaction Tracker ID (RTId): 0
Memory DIMM ID of error: 0
Memory channel ID of error: 2
Memory ECC syndrome: 52225080
STATUS 88000040000200cf MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
...
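To cross-check the decoded flag lines against the raw STATUS values in the log above, here is a minimal Python sketch. It assumes the architectural IA32_MCi_STATUS bit layout and the compound MCA error code encodings from the Intel SDM; the function names `decode_status` and `describe_mca_code` are made up for illustration and are not part of mcelog. The corrected error count field is only meaningful for corrected errors on CPUs that report it.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM layout).
FLAGS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

TT = {0: "instruction", 1: "data", 2: "generic"}
LL = {0: "L0", 1: "L1", 2: "L2", 3: "generic level"}
RRRR = {0: "generic error", 1: "generic read", 2: "generic write",
        3: "data read", 4: "data write", 5: "instruction fetch",
        6: "prefetch", 7: "eviction", 8: "snoop"}
MMM = {0: "generic", 1: "memory read", 2: "memory write",
       3: "address/command", 4: "memory scrubbing"}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its main fields."""
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_code": status & 0xFFFF,             # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,   # bits 31:16, model specific
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
    }


def describe_mca_code(code):
    """Describe the two compound MCA error code families seen in this log."""
    code &= ~0x1000  # ignore the correction report filtering bit (bit 12)
    if code & 0xFF00 == 0x0100:
        # Cache hierarchy error: 0000 0001 RRRR TTLL
        return "cache: %s, %s, %s" % (
            TT.get((code >> 2) & 3, "?"),
            LL[code & 3],
            RRRR.get((code >> 4) & 0xF, "?"),
        )
    if code & 0xFF80 == 0x0080:
        # Memory controller error: 0000 0000 1MMM CCCC
        channel = code & 0xF
        chan = "unspecified" if channel == 0xF else str(channel)
        return "memory controller: %s, channel %s" % (
            MMM.get((code >> 4) & 7, "?"), chan)
    return "other (0x%04x)" % code


if __name__ == "__main__":
    for raw in (0xB200000080000106, 0x88000040000200CF):
        fields = decode_status(raw)
        print(hex(raw), fields["flags"], describe_mca_code(fields["mca_code"]))
```

For the two STATUS values taken from the log, the decoded flags agree with what mcelog prints: UC, EN and PCC are set for the bank 6 events, while the bank 8 event has only VAL and MISCV set, with a corrected error count of 1.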
Does anyone see a solution or a direction for further analysis?
Best regards