Kernel Panic Advice

dcrdev
Posts: 70
Joined: 2015/10/25 23:42:17

Kernel Panic Advice

Post by dcrdev » 2017/08/08 14:41:31

Usually when I've dealt with kernel panics in the past I've been able to pinpoint them to a specific action I'd taken, but with the system I've just set up I have not been able to do so. The trouble is I can't reproduce them, and since this machine is a server I need to ensure it is reliable. I'm not really sure what to do in this situation, so I would like some advice.

Since setting the system up I've experienced 2 kernel panics:
  • The first time I was setting up a docker container and ssh just stopped working. Connecting via IPMI revealed that the machine had dropped to dracut and was asking me to decrypt the main volume so that kdump could kick in. Stupidly I just put this down to docker and removed the crash kernel.
  • The second time was maybe an hour after the first one, whilst I was configuring postfix. The same thing happened - it dropped to dracut and I decrypted the main volume - except this time it threw an out-of-memory error and froze. I'm a bit confused by that, given I have 64 GB of RAM and this is at the moment a fairly minimal install. Anyway, kdump did not manage to capture a crash dump, so I hard rebooted. On reboot my system logs were being flooded with thousands of selinux denials on /var/log/messages; I relabelled the entire filesystem and the errors were gone. Sometime later I started receiving crond errors about unexpected file endings in /var/log/sa - I cleared out /var/log/sa and ran "/usr/lib64/sa/sa2 -A" and that fixed the errors.
The system has been running for 48 hours now without issue - nothing unusual in the logs as far as I can tell, docker, postfix all working. My question is - should I trust this system? Are there any tests you'd advise?

Memtest ran fine for about 8 passes, IPMI monitoring shows everything within normal parameters, and I've stressed the machine with Prime95 for days. Additionally the system is running on ZFS and is showing no checksum errors; worth noting that I'm using ECC memory here.

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: Kernel Panic Advice

Post by aks » 2017/08/17 18:14:34

Well, we don't have a clue. The system has rebooted twice for unknown reasons. Running out of RAM (and thus swap - you do have swap, right?) is not a good thing! But we don't know why.
The sar problems would be related to a complete halt of the system and not being able to write "final" details to disk.
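
On the out-of-memory error during the dump: the kdump capture kernel only gets the memory reserved by the crashkernel= boot parameter, not the whole 64 GB, so it may be worth checking what is actually reserved. A minimal Python sketch, assuming a Linux system with /proc/cmdline (the function name is made up for illustration):

```python
from pathlib import Path


def crashkernel_setting(cmdline: str):
    """Return the value of the crashkernel= boot parameter, or None if absent."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")  # only exists on a running Linux system
    if cmdline_path.exists():
        value = crashkernel_setting(cmdline_path.read_text())
        print("crashkernel =", value if value is not None else "not set")
```

If nothing is reserved, or the reservation is small, an out-of-memory failure inside the capture environment would not be surprising, but that is a guess and not a diagnosis.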

Test the hardware (you've tested RAM). What's your IO bottleneck? Do you know how much IO your system can handle? I don't know Prime95, but yes, stress test the system: introduce a "leaky" app (RAM-wise), what happens? Do a whole bunch of IO, what happens? Run a CPU-intensive app (factoring prime numbers is usually a good one), what happens at saturation?
Now that you know (at least some of) your limits, what do the application(s) that are/will be running there require? Will the system support those application(s)? If yes, you're fine; if not, go back to the drawing board.
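
A rough Python sketch of the "leaky app" and CPU load ideas above. All names and default sizes are hypothetical, and it is nowhere near a replacement for a proper stress tool; it only shows the shape of the test. Keep limit_mib well below physical RAM unless you want to see what the OOM killer does.

```python
import time


def leak_memory(chunk_mib=64, limit_mib=1024, pause_s=0.0):
    """Hold on to chunk_mib-sized buffers until limit_mib is reached; return MiB held."""
    held = []
    total = 0
    while total + chunk_mib <= limit_mib:
        held.append(bytearray(chunk_mib * 1024 * 1024))  # zero-filled buffer
        total += chunk_mib
        time.sleep(pause_s)  # slow the "leak" down so it can be watched
    return total


def factorize(n):
    """Return the prime factors of n (n >= 2) in ascending order, by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def burn_cpu(seconds, start=10**9):
    """Factor consecutive integers for roughly `seconds`; return how many were done."""
    deadline = time.monotonic() + seconds
    count = 0
    n = start
    while time.monotonic() < deadline:
        factorize(n)
        n += 1
        count += 1
    return count
```

burn_cpu only loads one core, so to approach saturation you would start one process per core, and watch sar or the IPMI sensors while it runs.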

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: Kernel Panic Advice

Post by hunter86_bg » 2017/08/17 18:58:21

I'm in the same situation, but only suspecting a kernel panic. As the machine is in a cluster, it gets fenced.
I was thinking about activating the "netconsole" which might give a clue why this happened.

Have you thought about it?
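
netconsole sends kernel messages as plain UDP datagrams, so the receiving end can be very simple. A minimal Python sketch of a listener for another machine on the network; the function names are made up, and port 6666 is, as far as I recall, netconsole's default target port, so check it against your own netconsole settings.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on which netconsole datagrams can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_messages(sock, max_messages, timeout_s=5.0):
    """Collect up to max_messages datagrams as (sender_ip, text); stop early on timeout."""
    sock.settimeout(timeout_s)
    messages = []
    while len(messages) < max_messages:
        try:
            data, (sender_ip, _sender_port) = sock.recvfrom(65535)
        except socket.timeout:
            break
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        messages.append((sender_ip, text))
    return messages
```

In practice you would leave this running and append each message to a file, so the last lines before a panic survive even when the panicking machine gets fenced or cannot write to its own disk.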

dcrdev
Posts: 70
Joined: 2015/10/25 23:42:17

Re: Kernel Panic Advice

Post by dcrdev » 2017/08/18 16:16:28

Ok so it happened again when playing around with docker. This time, after several reboots, I now get a kernel panic before I can actually get into the system. It has given me the opportunity to see the error message though - looks like an occurrence of this: https://bugs.centos.org/print_bug_page.php?bug_id=13063

I've got the exact same error and am using NVMe + ZFS, although I have no idea what drbd is so I'm probably not using that.

Not really sure what to do now - I've spent ages configuring everything to perfection and am thoroughly fed up. The only way forward I can think of is to import the pool on a live image and grab the configuration, then start again.

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: Kernel Panic Advice

Post by aks » 2017/08/18 18:36:40

FYI: drbd is a distributed block replication device. Like a mirror of disks, but on different machines.
Seems the bug is related to heavy I/O (like Docker does with its "layering" of changes). Posts suggest a "flood" of the NVMe queue to the point where the kernel cannot cope. That suggests to me that the underlying filesystem (in your case ZFS) can't keep up. One of the big "things" with ZFS is the L2ARC - I seem to recall it's a big choke point.
Perhaps try it on a different filesystem?
ZFS is friggin' awesome (albeit like 10-15 years old now - when did Solaris 10 come out?) and works very well on Solaris, but Solaris is not Linux .....

dcrdev
Posts: 70
Joined: 2015/10/25 23:42:17

Re: Kernel Panic Advice

Post by dcrdev » 2017/08/18 21:48:57

^ Thanks for explaining it to me, although isn't L2ARC stored in volatile memory? Shouldn't that be uber fast? I have a lot of it and it's clocked at 2400 MHz.

In any case I initially tried importing the pool on a live image, destroying the datasets pertaining to docker and disabling docker. I then ran a scrub, added a selinux autorelabel file for good measure aaannd... still got the same kernel panic on reboot.

So I went back to the live image, created a tarball of the ZFS root dataset, copied that to an external drive and have now reinstalled CentOS on xfs using lvm. The system is up and docker is running without issue so far, so I'm currently in the process of moving across various configuration files from the tarball I created.
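
For anyone wanting to do the same, the tarball step could be sketched in Python roughly as below. This is not what I actually ran, the function name and paths are purely illustrative, and as far as I know the tarfile module does not store SELinux extended attributes, so expect to relabel after restoring.

```python
import tarfile
from pathlib import Path


def archive_tree(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tarball; return the number of members stored."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths in the archive relative to the directory's own name
        tar.add(str(source), arcname=source.name)
    # Re-open the archive to confirm it is readable and count what went in
    with tarfile.open(archive_path, "r:gz") as tar:
        return len(tar.getnames())
```

Something like archive_tree("/mnt/oldroot/etc", "/mnt/external/etc.tar.gz") would then cover a mounted copy of the old /etc, with both paths being placeholders.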

I'm a bit disappointed that I won't be running CentOS on a ZFS root, but ultimately there wasn't that much benefit anyway given that it was a single disk; I just wanted consistency across the system - I'm going to be using ZFS for a 4 disk mirrored/striped storage pool. I think lvm should handle the odd snapshot on root, even if it does degrade performance.

Hopefully I don't encounter issues with ZFS on the storage pool - which will be using spinning rust disks and not NVMe.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: Kernel Panic Advice

Post by hunter86_bg » 2017/08/19 18:52:36

ZFS is not a good choice for Linux in a production environment.
You can always use software RAID and/or LVM2 to set up a mirrored/striped layout.
They are pretty stable.

dcrdev
Posts: 70
Joined: 2015/10/25 23:42:17

Re: Kernel Panic Advice

Post by dcrdev » 2017/08/19 21:14:29

^ I don't know, a lot of people say ZoL is at about parity with ZFS on BSD. I'm new to ZoL so obviously not an expert, but I've heard many examples of people using ZoL in production environments.

I think possibly the issue here is NVMe, which I imagine has not had a great deal of testing on OpenZFS on any platform, given how new it is. Also when doing my initial research I stumbled across a couple of mentions of people using ZFS as a root filesystem on Gentoo with NVMe, without issue; so it could also be down to certain features lacking in the RHEL kernel compared to a bleeding edge one - again just speculating.

Like I said I'm now using LVM for root - but my storage array will be using ZFS because quite frankly it's the only option. LVM is great, but snapshots degrade performance significantly. The only other file system would be BTRFS and of course that's definitely not happening in a production environment, given the many issues around RAID and the fact Red Hat is dropping official support for it.

dcrdev
Posts: 70
Joined: 2015/10/25 23:42:17

Re: Kernel Panic Advice

Post by dcrdev » 2017/08/21 13:17:57

Ok so after some digging I came across this:
https://access.redhat.com/solutions/3094071

I have a Red Hat developer network subscription, so was able to see the content of the article - it's the same issue as I have. Unfortunately once again it references a private bug: https://bugzilla.redhat.com/show_bug.cgi?id=1450098 .

How do I go about finding out what's happening with it?

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: Kernel Panic Advice

Post by hunter86_bg » 2017/08/21 16:08:03

You can't view the private bug as it is opened by a Red Hat customer and only the customer can see it.
