
Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 11:42:25
by andrewc24
Hi,

I am having huge problems with kernel panics and hanging systems on CentOS 7.7 (1908) on some new hardware (Intel Xeon Silver 4214).

I do not get these issues on CentOS 7.2 with older hardware, but I cannot downgrade the new systems to that OS version as the CPUs are not supported by 7.2.

Does anyone have any insight into kernel panics on long-running I/O processes?

I have tried network shares via AUTOFS and FSTAB, and various SMB versions, with no success in reducing this issue (we are writing image sequences out to our network storage). The system will run for some amount of time, then panic and hang forever, and only a reboot lets the system run again for a while. I have plenty of available RAM and swap. I am running a 10 Gb connection as well, and the network does not seem to be getting taxed over its limit.
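In case it helps, the fstab entry looks roughly like this (the server, mount point, and credentials file below are placeholders, not our real paths; vers= has been swapped between 1.0, 2.1, and 3.0 with the same result):

```shell
# /etc/fstab - example CIFS mount (paths and credentials file are placeholders)
# vers= selects the SMB protocol version; _netdev delays mounting until
# the network is up
//isilon01/projects  /mnt/projects  cifs  credentials=/root/.smbcreds,vers=3.0,_netdev  0  0
```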

Here is a dump below to hopefully at least prompt some help as a starting point.

This has been driving me mad for months. CentOS 8 is not suitable due to the number of package changes in it, and it is also not a VFX-supported platform.

Thanks,

--------------------------------
Nov 6 11:28:37 RENDER_NODE1L kernel: INFO: task mantra-bin:37395 blocked for more than 120 seconds.
Nov 6 11:28:37 RENDER_NODE1L kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 11:28:37 RENDER_NODE1L kernel: mantra-bin D ffff8da4572f20e0 0 37395 37248 0x00000080
Nov 6 11:28:37 RENDER_NODE1L kernel: Call Trace:
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91d7eb09>] schedule+0x29/0x70
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91d804f5>] rwsem_down_read_failed+0x105/0x1c0
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff919913f8>] call_rwsem_down_read_failed+0x18/0x30
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91d7dc90>] down_read+0x20/0x40
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffffc0c5184b>] cifs_has_mand_locks+0x1b/0x80 [cifs]
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffffc0c533fd>] cifs_reopen_file+0x53d/0x840 [cifs]
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffffc0c54005>] cifs_readpage_worker+0x195/0x630 [cifs]
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91986f04>] ? __radix_tree_lookup+0x84/0xf0
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffffc0c54765>] cifs_readpage+0x85/0x240 [cifs]
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff917bdb10>] generic_file_aio_read+0x3f0/0x790
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffffc0c5ab79>] cifs_strict_readv+0x149/0x180 [cifs]
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91848353>] do_sync_read+0x93/0xe0
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91848d8f>] vfs_read+0x9f/0x170
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91849c4f>] SyS_read+0x7f/0xf0
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91d8bede>] system_call_fastpath+0x25/0x2a
Nov 6 11:28:37 RENDER_NODE1L kernel: INFO: task kworker/29:1:38802 blocked for more than 120 seconds.
Nov 6 11:28:37 RENDER_NODE1L kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 11:28:37 RENDER_NODE1L kernel: kworker/29:1 D ffff8da455ce1070 0 38802 2 0x00000080
Nov 6 11:28:37 RENDER_NODE1L kernel: Workqueue: cifsoplockd cifs_oplock_break [cifs]
Nov 6 11:28:37 RENDER_NODE1L kernel: Call Trace:
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91d7eb09>] schedule+0x29/0x70
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91d80245>] rwsem_down_write_failed+0x215/0x3c0
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91991427>] call_rwsem_down_write_failed+0x17/0x30
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91d7dcdd>] down_write+0x2d/0x3d
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffffc0c528ab>] cifs_oplock_break+0xdb/0x390 [cifs]
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff916bd1df>] process_one_work+0x17f/0x440
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff916be2f6>] worker_thread+0x126/0x3c0
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff916be1d0>] ? manage_workers.isra.26+0x2a0/0x2a0
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff916c51b1>] kthread+0xd1/0xe0
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff916c50e0>] ? insert_kthread_work+0x40/0x40
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff91d8bd1d>] ret_from_fork_nospec_begin+0x7/0x21
Nov 6 11:28:37 RENDER_NODE1L kernel: [<ffffffff916c50e0>] ? insert_kthread_work+0x40/0x40

Re: Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 12:11:39
by TrevorH
What's the output from uname -a ?

Are you able to use something other than SMB to access your network filesystem? NFS?

Re: Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 12:16:01
by andrewc24
TrevorH wrote:
2019/11/12 12:11:39
What's the output from uname -a ?
Linux RENDER_NODE1L 3.10.0-1062.4.1.el7.x86_64 #1 SMP Fri Oct 18 17:15:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
TrevorH wrote:
2019/11/12 12:11:39
Are you able to use something other than SMB to access your network filesystem? NFS?

I would like to try NFS, but it means changing our primary Isilon, which is set up for SMB. I can push internally for a change to trial it, but it likely won't be anytime soon, as I would have to wait for another team to do that.

Re: Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 12:24:29
by BShT
try to mount with cache=none
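Something like this (share and paths below are just examples):

```shell
# Remount with client-side caching disabled (share/paths are examples only).
# cache=none sends reads and writes straight to the server, bypassing the
# pagecache/oplock paths that appear in the stack trace above.
mount -t cifs //isilon01/projects /mnt/projects -o credentials=/root/.smbcreds,vers=3.0,cache=none
```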

Re: Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 12:27:42
by TrevorH
Well the problem is pretty much definitely to do with SMB so avoiding it entirely would be one way to fix it!

Though, actually, now that I look at that stack trace, it's not crashing: this is the kernel going "I haven't had a response to that I/O I started two whole minutes ago, something must be wrong, let me kick it and see if it wakes up". That might mean your storage server is overloaded and not responding quickly enough. Or it might just be a bug and it's stuck ;)

Google has a number of hits on "Workqueue: cifsoplockd cifs_oplock_break", though none of them are much use. There is a bugzilla from early 2018 against RHGS (not RHEL, though RHGS probably sits on top of RHEL) which was closed saying "do not use SMB > 2". Which is helpful when the rest of the world is busy disabling SMB1 as fast as it can.

Re: Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 12:38:13
by andrewc24
Checking the Isilon resources shows we are nowhere near overload, so we can probably scratch that off.

It's purely isolated to this new hardware and OS version, everything else we run is fine, bizarre!

I will push for NFS and see if we can get a dummy run...it's basically the final thing I can try. :(

Re: Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 12:46:24
by TrevorH
You've tried forcing SMB1? And verified that it actually honoured it?

Re: Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 13:02:10
by andrewc24
SMB 1 through 3 will all work, but ultimately they all end up timing out after some amount of time.

When we used AUTOFS the problem was really bad.

When we use FSTAB, the problem is still bad...but not as much.

Re: Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 14:15:28
by TrevorH
Did you check if it was actually using SMB1 when you told it to use that? It's now so insecure that many servers have disabled it and silently upgrade you to a newer version.
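One way to check on the client side, assuming the share goes through the kernel cifs client, is to look at /proc/fs/cifs/DebugData. The session name and output line below are illustrative only (the exact DebugData layout varies by kernel), so the command can be shown end to end:

```shell
# On a live machine, check the negotiated dialect with:
#   grep -i dialect /proc/fs/cifs/DebugData
# The sample string below stands in for that file's output so the
# extraction step can be demonstrated.
sample='1) Name: isilon01 Uses: 1 Capability: 0x300045 Session Status: 1 Dialect 0x302'
echo "$sample" | grep -o 'Dialect 0x[0-9a-f]*'
```

A dialect like 0x302 means SMB 3.0.2 was negotiated; if you asked for vers=1.0 and still see a 2.x/3.x dialect, the server has silently upgraded you.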

Re: Centos 7.7 - Frequent kernel panics on long running I/O processes?

Posted: 2019/11/12 17:05:27
by andrewc24
Our setup does seem to honour it when we force version 1.

The main system was also re-load-balanced today, just to take that out of the equation for certain.

Next stop is NFS I guess!