MCE with Ryzen, stability

Issues related to hardware problems
jacscha
Posts: 8
Joined: 2017/11/29 14:06:55

MCE with Ryzen, stability

Post by jacscha » 2017/11/29 14:29:00

I’ve got a new build with an ASRock AB350 board, a Ryzen 1700, 32 GB of RAM, and a cheap Nvidia 1030 card. It’s intended to run genomic analyses, so I will only access it via SSH and no GUI is needed. I put the card in for the installation.
I’m having random crashes with MCE errors. I’m having problems debugging since mcelog is telling me:

Code: Select all

mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
CPU is unsupported
I have not had any luck with this module.
I’ve eliminated RAM as the problem (an extensive memtest run was fine), and there are no SMART errors for any of the drives. I’ve upgraded to the most recent BIOS, and also installed the 4.14 kernel.

I’m starting to suspect the errors are with the Nvidia card. When looking at the errors with abrt-cli list, they are all the same and reference nvidia drivers (nouveau and Plymouth being the keywords). Here’s one:

Code: Select all

reason:         mce: [Hardware Error]: Machine check events logged
time:           Mon 27 Nov 2017 02:55:59 PM CST
cmdline:        BOOT_IMAGE=/vmlinuz-3.10.0-693.5.2.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet LANG=en_US.UTF-8 nouveau.modeset=0 rd.driver.blacklist=nouveau plymouth.ignore-udev
Right now, I’m running without the GUI and waiting to see if I get a random reboot (they occur roughly every 24 hours).

Anyone know how I might get more info out of mcelog? Any thoughts on the Nvidia card being the problem?
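
Since mcelog refuses this CPU family, the decoded events would have to come from the kernel log instead. A minimal sketch, assuming the edac_mce_amd module is available for the running kernel and that the commands are run as root; `show_mce_lines` is a made-up helper name that only filters whatever log text is piped into it.

```shell
#!/bin/sh
# Hypothetical helper: keep only machine-check related lines
# from kernel log text supplied on stdin.
show_mce_lines() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -i -E 'mce:|\[Hardware Error\]' || true
}

# Usage (as root), assuming the module exists for the running kernel:
#   modprobe edac_mce_amd
#   dmesg | show_mce_lines
#   journalctl -k -b -1 | show_mce_lines   # previous boot, only if the journal is persistent
```

If the filter prints nothing after a crash, the event may not have reached the log before the reset, so an empty result does not rule out a hardware fault.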

stevemowbray
Posts: 519
Joined: 2012/06/26 14:20:47

Re: MCE with Ryzen, stability

Post by stevemowbray » 2017/11/29 14:51:51

The nouveau references are a red herring: that line is just telling you the kernel command line that was used.

jacscha
Posts: 8
Joined: 2017/11/29 14:06:55

Re: MCE with Ryzen, stability

Post by jacscha » 2017/11/29 15:03:36

OK, any ideas on getting useful info from mcelog for ryzen systems?

stevemowbray
Posts: 519
Joined: 2012/06/26 14:20:47

Re: MCE with Ryzen, stability

Post by stevemowbray » 2017/11/29 15:06:48

Sorry, can't help with that as I don't have any Ryzens.

desertcat
Posts: 843
Joined: 2014/08/07 02:17:29
Location: Tucson, AZ

Re: MCE with Ryzen, stability

Post by desertcat » 2017/11/30 23:45:56

jacscha wrote:OK, any ideas on getting useful info from mcelog for ryzen systems?
I'm going to watch this also. Ryzen is still a new CPU, and while I have no intention of updating my workstation for several years, my next stop is Ryzen. There are people on these boards who do have ASUS mobos with Ryzen CPUs running CentOS 7.4.

jacscha
Posts: 8
Joined: 2017/11/29 14:06:55

Re: MCE with Ryzen, stability

Post by jacscha » 2017/12/02 15:36:03

Quick update...not loading the GUI (or Nvidia drivers) may have worked. I went from 1-2 crashes a day to none in the last 4 days. The crashes were never predictable so I can't say for sure. However, the system seems stable now.

jacscha
Posts: 8
Joined: 2017/11/29 14:06:55

Re: MCE with Ryzen, stability

Post by jacscha » 2017/12/05 16:07:52

This is very weird. The system has been stable, so I copied over some large (8-25 GB) data files and noticed some mismatching md5sum values. What was odd was that only the larger files failed to match.

No bad sectors on the drives, no SMART errors, no errors on the copy. Memory all tested fine.

Further, repeated md5sum runs on the same file give different results (again, only with files >10 GB). Here's an example where md5sum is stable for 4 GB and 6 GB files but unstable for a 12 GB file.

Code: Select all

dd if=/dev/urandom iflag=fullblock of=output.dat  bs=1G  count=4
4+0 records in
4+0 records out
4294967296 bytes (4.3 GB) copied, 66.9364 s, 64.2 MB/s
[jake@localhost ~]$ md5sum output.dat
d407ce8d4306ec8c65a0c2ebe27fa789  output.dat
[jake@localhost ~]$ md5sum output.dat
d407ce8d4306ec8c65a0c2ebe27fa789  output.dat
[jake@localhost ~]$ md5sum output.dat
d407ce8d4306ec8c65a0c2ebe27fa789  output.dat
[jake@localhost ~]$ md5sum output.dat
d407ce8d4306ec8c65a0c2ebe27fa789  output.dat
[jake@localhost ~]$ rm output.dat
[jake@localhost ~]$ dd if=/dev/urandom iflag=fullblock of=output.dat  bs=1G  count=6
6+0 records in
6+0 records out
6442450944 bytes (6.4 GB) copied, 98.9325 s, 65.1 MB/s
[jake@localhost ~]$ md5sum output.dat
6bc0ca2954a8df9067ff856d3b2d09db  output.dat
[jake@localhost ~]$ md5sum output.dat
6bc0ca2954a8df9067ff856d3b2d09db  output.dat
[jake@localhost ~]$ md5sum output.dat
6bc0ca2954a8df9067ff856d3b2d09db  output.dat
[jake@localhost ~]$ md5sum output.dat
6bc0ca2954a8df9067ff856d3b2d09db  output.dat
[jake@localhost ~]$ md5sum output.dat
6bc0ca2954a8df9067ff856d3b2d09db  output.dat
[jake@localhost ~]$ rm output.dat
[jake@localhost ~]$ dd if=/dev/urandom iflag=fullblock of=output.dat  bs=1G  count=12
12+0 records in
12+0 records out
12884901888 bytes (13 GB) copied, 198.55 s, 64.9 MB/s
[jake@localhost ~]$ md5sum output.dat
20a096b779ad26cd6c596f1657c6f5a1  output.dat
[jake@localhost ~]$ md5sum output.dat
05a08057cfdb3c418da34cd06531133a  output.dat
[jake@localhost ~]$ md5sum output.dat
30333dd0e1cb4f068d81548bbf8dabd8  output.dat
[jake@localhost ~]$ md5sum output.dat
30333dd0e1cb4f068d81548bbf8dabd8  output.dat
[jake@localhost ~]$ md5sum output.dat
30333dd0e1cb4f068d81548bbf8dabd8  output.dat
[jake@localhost ~]$ md5sum output.dat
30333dd0e1cb4f068d81548bbf8dabd8  output.dat
And if you hash only part of a larger file (head -c 4G | md5sum), you always get consistent values until you pass more than 8-10 GB to md5sum. These "partial" md5sums also always match those of the original files. If the files were corrupted, that wouldn't always be the case.
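
The repeated-hash check above can be scripted so it runs unattended at several sizes. A minimal sketch, assuming GNU coreutils (`md5sum`, `sha256sum`, and `head -c` with size suffixes such as `4G`); the function name `count_distinct_digests` and the `HASH_TOOL` variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: hash FILE (or only its first BYTES bytes) RUNS times
# and print how many distinct digests were seen. 1 means every run agreed.
# The body runs in a subshell, so its variables do not leak to the caller.
count_distinct_digests() (
    file=$1
    runs=${2:-5}
    bytes=${3:-}
    tool=${HASH_TOOL:-md5sum}   # e.g. HASH_TOOL=sha256sum
    i=0
    while [ "$i" -lt "$runs" ]; do
        if [ -n "$bytes" ]; then
            head -c "$bytes" "$file" | "$tool"
        else
            "$tool" < "$file"
        fi
        i=$((i + 1))
    done | awk '{print $1}' | sort -u | awk 'END {print NR}'
)

# Usage:
#   count_distinct_digests output.dat 6        # whole file, 6 runs
#   count_distinct_digests output.dat 6 4G     # first 4 GiB only
#   HASH_TOOL=sha256sum count_distinct_digests output.dat 6
```

Any result greater than 1 on an unchanged file means the data differed between reads or was altered somewhere between the disk and the hash computation, which points at hardware rather than at the file contents.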

I found this post:
https://askubuntu.com/questions/968123/ ... vme-drives

I removed some RAM as suggested, but it didn't change the behavior.

Any ideas? Maybe I should move this to a new post.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: MCE with Ryzen, stability

Post by hunter86_bg » 2017/12/05 19:03:33

Can you try with sha256sum or sha512sum? Maybe it's limited to md5sum.
Also consider running a prime95 test that uses RAM, and leave it running for at least 24 hours. As far as I remember, there was a bug with early-production Ryzens when compiling. AMD honours RMAs without too many questions.

jacscha
Posts: 8
Joined: 2017/11/29 14:06:55

Re: MCE with Ryzen, stability

Post by jacscha » 2017/12/05 20:29:23

sha256sum shows the same behavior. Immediate failure on the prime95 blend test. Tests that don't use RAM appear to be doing fine. So, I guess I have some bad RAM or something wrong with the memory controller.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: MCE with Ryzen, stability

Post by hunter86_bg » 2017/12/06 04:57:30

Disable the XMP profile or increase the RAM voltage by a notch. Update the BIOS and try again.
If you have an overclock on the CPU, increase the CPU voltage by a notch (but keep it below 1.35 V if possible).
If you have access to other RAM sticks (even smaller in size), replace the RAM and try again (no XMP or overclock).

As you don't have issues with hashes of smaller (less RAM-consuming) images, most probably it's undervolted RAM.
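
Before changing anything in the BIOS, it can help to see what speed and voltage the DIMMs are actually running at. A minimal sketch, assuming `dmidecode` is installed and run as root; `show_dimm_settings` is a made-up helper name, and the field names it matches vary between dmidecode versions ("Configured Clock Speed" in older releases, "Configured Memory Speed" in newer ones).

```shell
#!/bin/sh
# Hypothetical helper: from `dmidecode --type memory` text on stdin, keep the
# lines showing each DIMM's slot, rated speed, configured speed, and voltage.
show_dimm_settings() {
    grep -E '^[[:space:]]*(Locator|Speed|Configured (Clock|Memory) Speed|Configured Voltage):' || true
}

# Usage (as root):
#   dmidecode --type memory | show_dimm_settings
```

A configured speed above the sticks' base rating would suggest that an XMP profile or manual overclock is active.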
