NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [java:14903]

zaitsova
Posts: 9
Joined: 2018/01/30 05:34:07

Post by zaitsova » 2018/02/17 02:38:33

Intermittent freezes and black screens. The crashes occur at random times under no identifiable conditions, and every experiment to reproduce a crash from particular conditions has failed. The time between crashes averages 4 to 5 hours but varies in practice. The workstation froze at 5:44 AM this morning, and again at 9:30 AM.

The log of the 5:44 crash is here : http://pastebin.centos.org/551531/

Nothing was logged for four minutes.
Feb 16 05:40:30 localhost.localdomain systemd[1]: Started Hostname Service.
Feb 16 05:44:43 localhost.localdomain kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [java:14903]
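
The gap between those two timestamps can be worked out directly. This is a minimal sketch in POSIX shell; the helper name `to_seconds` is made up for illustration, and it assumes both entries fall on the same day.

```shell
# Convert HH:MM:SS to seconds since midnight.
# awk treats "05" as the number 5, so leading zeros are harmless.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

last_before=$(to_seconds 05:40:30)   # last entry before the silence
lockup=$(to_seconds 05:44:43)        # soft lockup message

gap=$((lockup - last_before))
echo "silent for ${gap}s"            # 253 seconds, i.e. 4 min 13 s
```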

In the time between 5:44:43 and -- Reboot -- the GNOME desktop was on screen, but the mouse cursor was frozen and key presses had no effect.

CentOS 7.4.1708
Gnome desktop 3.22.2
CPU = AMD Ryzen 1700. Eight core
GPU = gtx 1070 pcie sse2
Nvidia X server 390.25


Post by zaitsova » 2018/02/17 07:16:30

Should I assume that the item appearing in square brackets is the culprit for the lockup?

[java:14903]
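
For reference, the bracketed field is the command name and PID of the task that was running on that CPU when the watchdog fired. That makes it a lead, but it does not by itself prove that task caused the lockup. Below is a rough sketch for pulling the fields out of such a line. The function name `parse_lockup` is hypothetical, and it assumes a `sed` that accepts `-E`.

```shell
# Extract CPU number, stuck duration, command name and PID
# from a "soft lockup" kernel log line. Prints nothing if the
# line does not match.
parse_lockup() {
  printf '%s\n' "$1" | sed -n -E \
    's/.*soft lockup - CPU#([0-9]+) stuck for ([0-9]+)s! \[(.+):([0-9]+)\].*/cpu=\1 seconds=\2 comm=\3 pid=\4/p'
}

parse_lockup 'Feb 16 05:44:43 localhost.localdomain kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [java:14903]'
```

The greedy `(.+)` is deliberate: kernel thread names can themselves contain a colon, so only the last colon-separated number is taken as the PID.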


Post by zaitsova » 2018/02/18 02:24:18

(Tentatively)

(Post removed suggestion of wrong solution)

(Edit did not work see below :evil: )
Last edited by zaitsova on 2018/02/28 21:50:56, edited 1 time in total.


Post by zaitsova » 2018/02/22 00:42:42

I have returned since last time. The machine continues to freeze. For emphasis: these freezes occur when I am not using the machine at all, and it is sitting idle at the GNOME desktop.

The particular CPU in the system is an AMD Ryzen 7 1700 Eight Core, Socket AM4 1331.
I have now attempted a downcore configuration of FOUR (4+0). This has produced some mild success, and I will return if any additional problems manifest.
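
One way to confirm that a downcore setting is actually visible to the OS is to count the processors the kernel reports before and after the change. This is only a sketch: `count_cpus` is a made-up helper, and the count is of logical processors, so it includes SMT threads.

```shell
# Count "processor" entries in a cpuinfo-format file.
# $1: path to the file (defaults to /proc/cpuinfo)
count_cpus() {
  grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Only meaningful on Linux, where /proc/cpuinfo exists.
if [ -r /proc/cpuinfo ]; then
  count_cpus
fi
```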

The OS is currently using Oracle Java, not OpenJDK. Many systems have both installed at the same time; I have removed OpenJDK manually. My working hypothesis is that Oracle Java is not entirely compatible with 8-core AMD CPUs: there is likely some kind of thread deadlock happening in Oracle Java's service, or in a process related to that service. I will test this possibility next and return here with the results.
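
To check which Java is actually answering on the command line, the version banner can be inspected. The sketch below assumes the usual banner wording ("OpenJDK Runtime Environment" for OpenJDK, "Java(TM) SE Runtime Environment" for Oracle's build); `java_flavor` is a hypothetical helper name.

```shell
# Classify `java -version` output read from stdin.
java_flavor() {
  out=$(cat)
  case $out in
    *OpenJDK*)    echo openjdk ;;
    *'Java(TM)'*) echo oracle ;;
    *)            echo unknown ;;
  esac
}

# java -version writes its banner to stderr, so redirect it into the pipe.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | java_flavor
fi
```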


Post by zaitsova » 2018/02/28 21:56:56

The workstation has achieved 14 hours of stability.
The kernel was updated from 3.10.0-693.17.el7.x86_64 to 4.15.5-1.el7.elrepo.x86_64.
This was followed by uninstalling the NVIDIA display driver 390.25, rebooting, and re-installing it.
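
After an update like this, it is worth confirming that the machine really booted the new kernel and not the old one. A small sketch follows; `is_elrepo_kernel` is a made-up helper that only looks for the `.elrepo.` tag seen in the version string above.

```shell
# Succeed if the given kernel release string carries the ELRepo tag.
is_elrepo_kernel() {
  case $1 in
    *.elrepo.*) return 0 ;;
    *)          return 1 ;;
  esac
}

running=$(uname -r)
if is_elrepo_kernel "$running"; then
  echo "$running: ELRepo kernel"
else
  echo "$running: not an ELRepo kernel"
fi
```
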
(Attachment: stableworkstation.png)
In the next post, I will give a detailed description of what was done.


Post by zaitsova » 2018/02/28 22:25:20

First, update the system. This step can take up to an hour.

    yum update

An existing display driver may cause booting problems. If it does, press 'e' at the boot selection screen to edit the boot entry, append a single 3 to the very end of the linux16 line, and press CTRL-x. This boots to a command line instead of the desktop.

    linux16 /vmlinuz etc etc LANG=en_US.UTF-8 3

As super user root, add the ELRepo repository and install the mainline kernel:

    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
    yum --enablerepo=elrepo-kernel install -y kernel-ml
    yum --enablerepo=elrepo-kernel install -y kernel-ml-devel

The machine must be rebooted at this time. Then uninstall the NVIDIA driver:

    ./NVIDIA-Linux-x86_64-390.25.run --uninstall

Reboot again, then re-install the driver:

    ./NVIDIA-Linux-x86_64-390.25.run

kernel-ml is short for kernel MainLine Stable. kernel-ml-devel provides the kernel headers and build files for that kernel. It is not required in general, but it is required if an NVIDIA display driver is needed on the machine, because the ".run" file compiles a driver module against those files. If they are missing or do not match the running kernel, you will be greeted with this message:
ERROR: Unable to find the kernel source tree for the currently running kernel. Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed. If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.
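
Before running the installer, a quick check can tell whether build files for the running kernel are in place. This sketch assumes the conventional layout, where `/lib/modules/<release>/build` points at the installed devel tree. `has_build_tree` and its second argument (a root prefix, used only for testing) are hypothetical.

```shell
# Succeed if a kernel build tree exists for the given release.
# $1: kernel release, e.g. the output of `uname -r`
# $2: optional root prefix (hypothetical, for testing only)
has_build_tree() {
  # -d follows symlinks, so a dangling "build" link counts as missing.
  [ -d "${2:-}/lib/modules/$1/build" ]
}

if has_build_tree "$(uname -r)"; then
  echo "build tree found for $(uname -r)"
else
  echo "no build tree for $(uname -r); install the matching devel package"
fi
```

If the tree exists somewhere else, the error message itself names the way out: pass the location with `--kernel-source-path`.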
