CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

General support questions
kludge
Posts: 5
Joined: 2017/03/06 18:58:38

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by kludge » 2017/03/07 13:46:36

We're not doing much disk I/O at all; if anything, we're mostly doing massive amounts of USB3 I/O to a camera. But again, we never had the issue before the kernel upgrade.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by TrevorH » 2017/03/07 14:40:25

Well, whatever it is, it's been stuck for 2 whole minutes, which is about an ice age in computer terms. The process involved ought to be stuck in D state while this is happening, and using ps with wchan might give you enough clues to find out why or where. Or look at the rest of the stack trace in the log - that ought to tell you too.
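For example, something along these lines should show anything sitting in D state and the kernel function it's waiting in (just a minimal sketch; adjust the column widths to taste):

Code: Select all

# list PID, state, wait channel and command; keep the header plus any D (uninterruptible sleep) tasks
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'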
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are dead, do not use them.
Use the FAQ Luke

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by TrevorH » 2017/03/09 16:43:49

I've realised that this whole thread was not really in the best forum (it was in Networking problems), so I have moved it over to the General Support forum, where it seems to fit better.
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are dead, do not use them.
Use the FAQ Luke

neutronsnowball
Posts: 16
Joined: 2016/10/27 18:09:29

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by neutronsnowball » 2017/05/30 15:30:31

We started seeing this earlier this month, and it has recurred four times. I'm not seeing any reference to excessive load on the guest or the host. I'm not sure how to read it, but here's the stack trace:

Code: Select all

May 27 14:22:22 server-name kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [nrsysmond:1516]
May 27 14:22:22 server-name kernel: Modules linked in: xt_multiport vmw_vsock_vmci_transport vsock ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_conntrack_ipv6 nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter iptable_filter ip6_tables xfs libcrc32c intel_powerclamp coretemp crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon pcspkr sg vmw_vmci i2c_piix4 shpchp parport_pc parport ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm e1000 mptspi i2c_core scsi_transport_spi mptscsih ata_piix mptbase libata floppy fjes dm_mirror dm_region_hash
May 27 14:22:22 server-name kernel: dm_log dm_mod
May 27 14:22:22 server-name kernel: CPU: 1 PID: 1516 Comm: nrsysmond Not tainted 3.10.0-514.16.1.el7.x86_64 #1
May 27 14:22:22 server-name kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
May 27 14:22:22 server-name kernel: task: ffff880035985e20 ti: ffff8800b6e24000 task.ti: ffff8800b6e24000
May 27 14:22:22 server-name kernel: RIP: 0010:[<ffffffff8168e402>]  [<ffffffff8168e402>] _raw_spin_lock+0x32/0x50
May 27 14:22:22 server-name kernel: RSP: 0018:ffff8800b6e27cd0  EFLAGS: 00000202
May 27 14:22:22 server-name kernel: RAX: 000000000000362d RBX: ffff8800b6e27f10 RCX: 0000000000001180
May 27 14:22:22 server-name kernel: RDX: 0000000000003888 RSI: 0000000000003888 RDI: ffffc900016ae2f0
May 27 14:22:22 server-name kernel: RBP: ffff8800b6e27cd0 R08: ffffffff81a9cd40 R09: ffffc900016ae2f0
May 27 14:22:22 server-name kernel: R10: ffff8800a8a2d540 R11: 00000000ae757c88 R12: ffff8800b6e27d00
May 27 14:22:22 server-name kernel: R13: ffff8800557fb900 R14: 0000100000000000 R15: 7fffffffffffffff
May 27 14:22:22 server-name kernel: FS:  00007f8f7aa02700(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
May 27 14:22:22 server-name kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 27 14:22:22 server-name kernel: CR2: 00007f8f7c10c000 CR3: 00000000b6e1b000 CR4: 00000000000407e0
May 27 14:22:22 server-name kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 27 14:22:22 server-name kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 27 14:22:22 server-name kernel: Stack:
May 27 14:22:22 server-name kernel: ffff8800b6e27d70 ffffffff815bb546 03fffffffe06a735 26378de08704ae59
May 27 14:22:22 server-name kernel: ffff8800a8a2d540 ffffffff81a9cd40 405494c8000085fa ffffc900016ae2f0
May 27 14:22:22 server-name kernel: ffffffff815ba740 ffffffff815ba8f0 ffffffff81aa2300 ffff8800000085fa
May 27 14:22:22 server-name kernel: Call Trace:
May 27 14:22:22 server-name kernel: [<ffffffff815bb546>] __inet_hash_connect+0x1b6/0x3d0
May 27 14:22:22 server-name kernel: [<ffffffff815ba740>] ? inet_ehashfn+0xc0/0xc0
May 27 14:22:22 server-name kernel: [<ffffffff815ba8f0>] ? inet_unhash+0xb0/0xb0
May 27 14:22:22 server-name kernel: [<ffffffff815bb79b>] inet_hash_connect+0x3b/0x40
May 27 14:22:22 server-name kernel: [<ffffffff815d51b0>] tcp_v4_connect+0x2b0/0x4e0
May 27 14:22:22 server-name kernel: [<ffffffff815ec085>] __inet_stream_connect+0xb5/0x330
May 27 14:22:22 server-name kernel: [<ffffffff815d9413>] ? tcp_assign_congestion_control+0x43/0x170
May 27 14:22:22 server-name kernel: [<ffffffff815ec338>] inet_stream_connect+0x38/0x50
May 27 14:22:22 server-name kernel: [<ffffffff81555ae7>] SYSC_connect+0xe7/0x120
May 27 14:22:22 server-name kernel: [<ffffffff81552950>] ? sock_alloc_file+0xa0/0x140
May 27 14:22:22 server-name kernel: [<ffffffff8121d1f7>] ? __fd_install+0x47/0x60
May 27 14:22:22 server-name kernel: [<ffffffff815568ee>] SyS_connect+0xe/0x10
May 27 14:22:22 server-name kernel: [<ffffffff81697189>] system_call_fastpath+0x16/0x1b
May 27 14:22:22 server-name kernel: Code: 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 01 c3 55 83 e2 fe 0f b7 f2 48 89 e5 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 0f 1f 80 00 00 00
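
Not a definitive diagnosis, but the lock is being taken in __inet_hash_connect, i.e. while picking an ephemeral port for an outgoing TCP connect, so it may be worth checking how many sockets the box is holding and how wide the local port range is. A quick sketch:

Code: Select all

# socket summary (look for large TIME-WAIT / established counts)
ss -s
# ephemeral port range used for outgoing connects
cat /proc/sys/net/ipv4/ip_local_port_range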

kludge
Posts: 5
Joined: 2017/03/06 18:58:38

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by kludge » 2017/05/30 15:35:10

I will say that our problem seems to have been fixed by upgrading the kernel to 3.10.0-514.10.2.el7.x86_64.
Which version are you running?
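If it helps, checking and updating is quick (a rough sketch, assuming the stock base/updates repos):

Code: Select all

# show the running kernel
uname -r
# pull in the latest kernel package, then reboot into it
yum update kernel
reboot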

neutronsnowball
Posts: 16
Joined: 2016/10/27 18:09:29

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by neutronsnowball » 2017/05/30 15:36:42

Code: Select all

3.10.0-514.16.1.el7.x86_64

alexk2000
Posts: 2
Joined: 2017/06/21 09:35:04

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by alexk2000 » 2017/06/21 10:19:57

We have the same issue: a CentOS VM (8 vCPUs) on ESXi.
ESXi version: 6.5.0
Kernel 3.10.0-514.21.1.el7.x86_64

Code: Select all

Jun 18 15:54:33 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 21s! [swapper/7:0]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [watchdog/3:21]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 25s! [ksoftirqd/0:3]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:33]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [khungtaskd:49]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [ksoftirqd/5:33]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 28s! [watchdog/3:21]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [khungtaskd:49]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [ksoftirqd/0:3]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 21s! [mysqld:1420]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [swapper/7:0]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksoftirqd/0:3]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [sh:14207]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 24s! [watchdog/3:21]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 21s! [swapper/7:0]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 27s! [ksoftirqd/5:33]
Jun 18 15:54:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [watchdog/0:10]
Jun 18 17:42:23 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [kswapd0:60]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 21s! [httpd:55626]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:10564]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [scsi_eh_2:327]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [httpd:55626]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [ksoftirqd/4:28]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 21s! [mysqld:1424]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [mysqld:1430]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 21s! [httpd:55626]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [mysqld:1424]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
Jun 19 23:48:35 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#4 stuck for 21s! [mysql:22435]
Jun 19 23:49:34 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [httpd:39616]
Jun 17 00:14:15 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [zabbix_agentd:57659]
Jun 17 00:14:15 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 24s! [vsftpd:41679]
Jun 17 00:14:15 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 29s! [kworker/u16:0:64218]
Jun 17 00:14:15 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 25s! [awk:22047]
Jun 17 00:14:15 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 29s! [mysqld:23673]
Jun 17 00:14:15 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 32s! [httpd:60036]
Jun 17 00:14:15 somehost kernel: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 32s! [mysqld:1425]

The first message appeared on Jun 17 at 00:14:15; shortly before it I noticed this message:
Jun 17 00:11:49 somehost kernel: sched: RT throttling activated

I didn't find any evidence that the root cause is resource overcommitment: neither the ESXi host nor the CentOS VM was highly loaded (verified through Zabbix monitoring).
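
For what it's worth, the RT throttling message only says that realtime tasks hit their CPU budget; the budget itself is visible via sysctl (just a sketch, not claiming this is the root cause):

Code: Select all

# RT tasks may use sched_rt_runtime_us out of every sched_rt_period_us (defaults 950000/1000000)
sysctl kernel.sched_rt_runtime_us kernel.sched_rt_period_us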

neutronsnowball
Posts: 16
Joined: 2016/10/27 18:09:29

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by neutronsnowball » 2017/06/22 16:41:05

We removed the NewRelic agent and this problem went away.
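
In case anyone else wants to rule that out, roughly (the package and service names below are from our install, so treat them as an assumption):

Code: Select all

# see whether the New Relic server monitor (nrsysmond) is installed and running
rpm -qa | grep -i newrelic
systemctl status newrelic-sysmond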

alexk2000
Posts: 2
Joined: 2017/06/21 09:35:04

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by alexk2000 » 2017/06/23 07:20:50

neutronsnowball wrote:We removed the NewRelic agent and this problem went away.
We don't use it.

freemarket
Posts: 7
Joined: 2005/04/08 14:29:06
Location: New York City

Re: CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19]

Post by freemarket » 2017/09/27 16:24:46

kludge wrote:We're seeing similar issues on Centos 7. No problem on 6.8, but upgrading to 7 with the 3.10.0-514.6.1.el7.x86_64 kernel and nvidia driver 375.39 gives us seemingly identical behaviour, running directly on real hardware. I think this is a new kernel bug, or else something related to an nvidia driver bug. When this is going on, the CPU used by the migration task goes up to 100%.
I am seeing a similar issue on CentOS 7.4 with kernel 3.10.0-693.2.2.el7.x86_64 and the nvidia driver nvidia-x11-drv-384.90. It happens when I try to resume the system after a systemctl suspend; the kernel reports a soft lockup with the reason stated as:

NMI watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [192.168.15.66-m:5513]

where that node is my NFS server. This had not occurred prior to the 693 or 514 series kernels.
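
For anyone hitting the same thing: that task name looks like the NFS client thread for that server, so it may be worth checking what is mounted from it before suspending (a sketch):

Code: Select all

# list NFS mounts and their options
nfsstat -m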
