I have a dedicated server with VMware ESXi-6.5.0-20181103001-standard-customized installed, which runs 8 small virtual machines. One of the VMs hosts my WordPress site and runs CentOS 7 (centos-release-7-6.1810.2.el7.centos.x86_64).
Everything was working fine, and then the server load suddenly started to jump from about 1 to as high as 50, and my website became unresponsive (very slow, or sometimes displaying Error 524, server down). After a minute or two the load comes back down, and the cycle repeats every 5-10 minutes.
At the times of high CPU load (or just before it starts to skyrocket), I very frequently get similar messages on my shell:
Message from syslogd@web at Apr 26 08:45:20 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#4 stuck for 25s! [ksoftirqd/4:29]
Message from syslogd@web at Apr 26 08:45:26 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:27134]
Message from syslogd@web at Apr 26 09:01:51 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [sw-collectd:7572]
Message from syslogd@web at Apr 26 09:03:38 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [lsphp:31825]
Message from syslogd@web at Apr 26 09:03:42 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [swapper/6:0]
Message from syslogd@web at Apr 26 09:03:43 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 24s! [swapper/5:0]
Message from syslogd@web at Apr 26 09:05:21 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 24s! [kworker/u16:1:22896]
Message from syslogd@web at Apr 26 09:05:50 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [ksoftirqd/2:19]
Message from syslogd@web at Apr 26 09:05:50 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [systemd-journal:3606]
Message from syslogd@web at Apr 26 09:05:50 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [ksoftirqd/2:19]
Message from syslogd@web at Apr 26 09:05:59 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [ksoftirqd/2:19]
Message from syslogd@web at Apr 26 09:06:40 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [lsphp:32091]
Message from syslogd@web at Apr 26 09:06:40 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [lsphp:31812]
Message from syslogd@web at Apr 26 09:06:45 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [tuned:14968]
Message from syslogd@web at Apr 26 09:07:31 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 24s! [xfsaild/dm-0:3534]
Message from syslogd@web at Apr 26 09:07:45 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 25s! [lsphp:32094]
Message from syslogd@web at Apr 26 09:07:45 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [lsphp:32092]
Message from syslogd@web at Apr 26 09:13:53 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#7 stuck for 21s! [lsphp:32611]
Message from syslogd@web at Apr 26 09:13:53 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [swapper/2:0]
Message from syslogd@web at Apr 26 09:13:53 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [swapper/1:0]
Message from syslogd@web at Apr 26 09:13:55 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 28s! [htop:18808]
Message from syslogd@web at Apr 26 09:13:55 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#4 stuck for 28s! [kworker/4:2:31820]
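In case it helps anyone reading the output, the messages can be tallied per CPU and per task with a short script. This is only a rough sketch: the function name summarize_lockups and the regular expression are my own assumptions based on the message format shown above, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Assumed pattern, derived from lines such as:
# kernel:NMI watchdog: BUG: soft lockup - CPU#4 stuck for 25s! [ksoftirqd/4:29]
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Count soft lockup reports per CPU and per task name.

    Returns (per_cpu, per_task, longest), where longest is the largest
    "stuck for Ns" value seen. Lines that do not match are ignored.
    """
    per_cpu = Counter()
    per_task = Counter()
    longest = 0
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match is None:
            continue  # e.g. the "Message from syslogd@web ..." header lines
        per_cpu[int(match.group("cpu"))] += 1
        per_task[match.group("task")] += 1
        longest = max(longest, int(match.group("secs")))
    return per_cpu, per_task, longest


# A few lines taken from the output above.
sample = [
    "Message from syslogd@web at Apr 26 08:45:20 ...",
    "kernel:NMI watchdog: BUG: soft lockup - CPU#4 stuck for 25s! [ksoftirqd/4:29]",
    "kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:27134]",
    "kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [lsphp:31825]",
]

per_cpu, per_task, longest = summarize_lockups(sample)
print("per CPU:", dict(per_cpu))
print("per task:", dict(per_task))
print("longest stall (s):", longest)
```

Run over the full output, a tally like this would show whether the lockups cluster on one CPU or one process, or are spread across all CPUs and unrelated tasks as they appear to be here.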
I am not an expert, so I searched the internet for answers.
I updated the kernel, which had no effect.
I temporarily disabled csf, which had no effect.
I read in the VMware forum that this could be a result of memory overcommitment. But my dedicated server has 128GB of RAM and only 80GB is used (and it never goes higher). The memory assigned to my web server is 32GB, and all the other small servers won't reach 40GB in total.
I also saw another post on the same topic on this forum, but it did not match my problem, which is why I am opening a new post.
The problem appears out of the blue, goes away on its own after a few hours, and then shows up again several hours later. As far as I can see, it has no direct connection with the number of users on my site.
I desperately need to fix the issue since it is affecting my website hugely.
Thanks in advance