CentOS server freeze/crash on megaraid rebuild, analysis and

Issues related to hardware problems
jamesNJ
Posts: 20
Joined: 2015/02/25 21:49:44

CentOS server freeze/crash on megaraid rebuild, analysis and

Post by jamesNJ » 2015/07/24 17:25:21

Hello all,

I have a problem with a large CentOS 7 server hosting an LSI MegaRAID controller with 16x 1 TB SAS drives. The server goes dead at night, requiring a forced reboot or power cycle to restore service. If it matters, this server has one large RAID-6 volume with one global hot spare available.

I believe I have narrowed this issue down to the MegaRAID controller being busy with a RAID rebuild, and some automated action occurring at night that confuses LVM into oblivion.

The issue is difficult to narrow down because this “automated action” seems to result in all file systems being marked read-only. Syslog seems to continue working, but obviously cannot write useful data out to disk. Hence I have only been able to collect data on those rare times that I can actually log in when this issue occurs. I was able to capture 2 points of data that seem to start out with the same error condition.

This only seems to occur when a drive fails and the MegaRAID rebuilds to the global hot spare, or when I force some action on the RAID that causes a drive failure and a rebuild to an alternate disk (a few disks had SMART predictive failures and I have been working to replace these with new drives). I initially thought this issue was related to smartd warning messages; however, when I replaced the last drive with predictive failures, the rebuild triggered the same behavior.

So the pattern seems to be that I kick off a rebuild (which takes many hours), then sometime around midnight a systemd-udevd process kicks in and the system eventually ends up unresponsive. From the two times I was able to get on, these messages seem to be in common right at the time the file systems go read-only:

Jul 15 00:45:44 server1 kernel: megaraid_sas: scanning for scsi0...
Jul 15 00:45:44 server1 systemd-udevd: failed to execute '/sbin/mdadm' '/sbin/mdadm -If sda5 --path pci-0000:09:00.0-scsi-0:2:0:0': Input/output error
Jul 15 00:45:44 server1 systemd-udevd: failed to execute '/sbin/mdadm' '/sbin/mdadm -If sda4 --path pci-0000:09:00.0-scsi-0:2:0:0': Input/output error
Jul 15 00:45:44 server1 systemd-udevd: failed to execute '/sbin/mdadm' '/sbin/mdadm -If sda3 --path pci-0000:09:00.0-scsi-0:2:0:0': Input/output error
Jul 15 00:45:44 server1 systemd: Stopping LVM2 PV scan on device 8:3...
Jul 15 00:45:44 server1 systemd-udevd: failed to execute '/sbin/mdadm' '/sbin/mdadm -If sda2 --path pci-0000:09:00.0-scsi-0:2:0:0': Input/output error
Jul 15 00:45:44 server1 systemd: Stopping LVM2 PV scan on device 8:4...
Jul 15 00:45:44 server1 systemd-udevd: failed to execute '/sbin/mdadm' '/sbin/mdadm -If sda1 --path pci-0000:09:00.0-scsi-0:2:0:0': Input/output error
Jul 15 00:45:44 server1 systemd: Stopping Local File Systems.
Jul 15 00:45:44 server1 systemd: Stopped target Local File Systems.
Jul 15 00:45:44 server1 systemd: Unmounting /boot...
Jul 15 00:45:44 server1 systemd: Failed at step EXEC spawning /bin/umount: Input/output error
Jul 15 00:45:44 server1 systemd: boot.mount mount process exited, code=exited status=203
Jul 15 00:45:44 server1 systemd: Failed unmounting /boot.
Jul 15 00:45:44 server1 systemd: Stopping File System Check on /dev/disk/by-uuid/14b55dbf-64b0-477f-a003-1ec7404cb363...
Jul 15 00:45:44 server1 systemd: Stopped File System Check on /dev/disk/by-uuid/14b55dbf-64b0-477f-a003-1ec7404cb363.
Jul 15 00:45:44 server1 lvm: Device 8:3 not found. Cleared from lvmetad cache.
Jul 15 00:45:44 server1 systemd: Stopped LVM2 PV scan on device 8:3.

Can anyone tell me what this systemd-udevd process is trying to do, and why it might have some adverse interaction with the RAID rebuild? If I scan my messages log for instances of systemd-udevd, I don't find anything interesting. I also don't know why the system is trying to spawn this mdadm process: CentOS only sees one giant 11 TB drive and then uses the space for LVM; Linux has no other interaction with RAID-related activities. Again, this only seems to happen while the RAID is rebuilding drives, so it should be a very rare occurrence; however, it could lead to mystery failures at night if a drive goes bad and the MegaRAID decides to rebuild automatically onto the available hot spare.
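For reference, the messages-log scan I mention is just a simple grep over the current and rotated logs, something like:

# grep -i 'systemd-udevd' /var/log/messages*

and it doesn't turn up anything interesting.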

If anyone can shed light on why this occurs and/or how I might further troubleshoot or prevent this I would appreciate it. If it helps, I have uploaded `getinfo.sh disk` output to: http://pastebin.com/1YiDmQVq

TrevorH
Forum Moderator
Posts: 23195
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: CentOS server freeze/crash on megaraid rebuild, analysis

Post by TrevorH » 2015/07/24 18:52:16

Could you post the output from cat /proc/mdstat?

I am really confused about why, when you have a hardware RAID controller, it is trying to use mdadm!
CentOS 5 died in March 2017 - migrate NOW!
Full time Geek, part time moderator. Use the FAQ Luke

jamesNJ
Posts: 20
Joined: 2015/02/25 21:49:44

Re: CentOS server freeze/crash on megaraid rebuild, analysis

Post by jamesNJ » 2015/07/24 20:15:12

Yes ... I am equally confused. It is a hardware RAID controller with no other separate devices, and for all intents and purposes a very straightforward CentOS install.

/proc/mdstat doesn't report anything useful:

# cat /proc/mdstat
Personalities :
unused devices: <none>

I didn't do any configuration of smartd, nor anything to systemd. From the log, it looks like an automated process meant to do some kind of maintenance on md devices, but I have no idea what triggers it or why.
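One thing I still need to check is whether the mdadm package itself ships udev rules or systemd units that could be responsible; a simple package query along these lines should list them (just my planned check, not run yet):

# rpm -ql mdadm | grep -E 'udev|systemd'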

jamesNJ
Posts: 20
Joined: 2015/02/25 21:49:44

Re: CentOS server freeze/crash on megaraid rebuild, analysis

Post by jamesNJ » 2015/07/28 22:13:39

Does anyone have a guess as to where I should start looking to trace this issue down?

I did a general scan of /lib/systemd/system for instances of mdadm but didn't find anything useful. I'm looking at the docs for systemd-udevd but haven't found anything interesting there either. The /etc/udev directory on this server is pretty bare.

I did find a suspicious file in /usr/lib/udev/rules.d, 65-md-incremental.rules; it has commands baked into it that resemble the output I posted (mdadm -If).
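For the record, roughly what I used to spot it was a grep for that mdadm incremental flag across the packaged rules:

# grep -l 'mdadm -If' /usr/lib/udev/rules.d/*.rules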

I don't know much about this area of udev and device management, but from sniffing around these files it seems plausible that the MegaRAID rebuild, at some point, generates a kernel event that causes this udev rule to fire and subsequently snarl up LVM.

I suppose one thing I could try is to disable this rule entirely (since I don't use md devices anyway) and test to see what happens.
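If I do go that route, my understanding (from the udev docs, not yet tested) is that the cleaner way is not to edit the packaged file, which an update would overwrite, but to mask it with a same-named link to /dev/null under /etc/udev/rules.d, since rules there take precedence over /usr/lib/udev/rules.d:

# ln -s /dev/null /etc/udev/rules.d/65-md-incremental.rules
# udevadm control --reload-rules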

Can anyone point me to information where I may be able to verify this, and possibly capture some logging on the (kernel?) events that might trigger this?
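The only idea I have come up with so far is to leave a udev monitor running during the next rebuild and send the output somewhere that survives the file systems going read-only (tmpfs, or better, another box), something like:

# udevadm monitor --kernel --udev --property > /dev/shm/udev-monitor.log 2>&1 &

If there is a better way to capture this, I'm all ears.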

gerald_clark
Posts: 10642
Joined: 2005/08/05 15:19:54
Location: Northern Illinois, USA

Re: CentOS server freeze/crash on megaraid rebuild, analysis

Post by gerald_clark » 2015/07/28 22:41:48

Do you have any 'fd' type partitions?

jamesNJ
Posts: 20
Joined: 2015/02/25 21:49:44

Re: CentOS server freeze/crash on megaraid rebuild, analysis

Post by jamesNJ » 2015/08/02 04:12:14

Do you mean fd as in floppy disk? These are the only partitions on the host:
# cat /proc/partitions
major minor #blocks name

8 0 12682608640 sda
8 1 2048 sda1
8 2 512000 sda2
8 3 53252096 sda3
8 4 12628841455 sda4
11 0 1048575 sr0
253 0 32768000 dm-0
253 1 20480000 dm-1
253 2 20971520 dm-2

gerald_clark
Posts: 10642
Joined: 2005/08/05 15:19:54
Location: Northern Illinois, USA

Re: CentOS server freeze/crash on megaraid rebuild, analysis

Post by gerald_clark » 2015/08/02 05:10:26

No. When you run fdisk, each partition has a type flag.
Partitions of type 'fd' are assumed to be software RAID partitions.
Having partitions of type 'fd' will trigger an mdraid assembly attempt.

Show us the output of the 'blkid' command and of 'fdisk -l'.

jamesNJ
Posts: 20
Joined: 2015/02/25 21:49:44

Re: CentOS server freeze/crash on megaraid rebuild, analysis

Post by jamesNJ » 2015/08/03 20:26:54

Thanks for the clarification. I have one huge device, so gdisk is needed here. Below is the output requested. Unfortunately, between the last time this error occurred and now I have repartitioned that one large device, and I don't recall what the partition codes would have been. The previous layout had a small partition for LVM and then an ordinary partition for a non-LVM filesystem; I think the codes were 8E00 and 8300 respectively. I built the server and it never had software RAID enabled, so worst case I might have used the defaults that gdisk presents when partitioning (Linux partition, 8300).

# blkid
/dev/block/8:3: UUID="Vz3FXg-1B1H-JVkH-vghb-jNwM-j91R-D2SDnN" TYPE="LVM2_member" PARTLABEL="/dev/sda3 LVM" PARTUUID="1af8c7f3-aa3c-47aa-ad33-743a3522a777"
/dev/block/253:1: UUID="a89c2299-5649-4473-915c-cd9ae2b6106f" TYPE="ext4"
/dev/block/8:2: UUID="14b55dbf-64b0-477f-a003-1ec7404cb363" TYPE="xfs" PARTUUID="ea90d665-41e7-476d-ba6f-07d25ae52882"
/dev/block/253:0: UUID="d86e4e6b-d4ca-4ce5-bec6-91440eab03ce" TYPE="swap"
/dev/sda4: UUID="eFXuf1-1c0G-mTHj-y93O-SAdl-eOw9-icc5zm" TYPE="LVM2_member" PARTLABEL="/dev/sda4 LVM" PARTUUID="e65b3086-5575-4220-8234-7f0ff26b0c6e"
/dev/mapper/centos00-mysql: UUID="32b56a09-611c-47e8-8c8f-b5ecdde66656" TYPE="ext4"

# gdisk -l /dev/sda
GPT fdisk (gdisk) version 0.8.6

Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 25365217280 sectors, 11.8 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 03EB60FD-E681-4E2D-9BCB-53409CDC81A8
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 25365217246
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number Start (sector) End (sector) Size Code Name
1 2048 6143 2.0 MiB EF02
2 6144 1030143 500.0 MiB 0700
3 1030144 107534335 50.8 GiB 8E00 /dev/sda3 LVM
4 107534336 25365217246 11.8 TiB 8E00 /dev/sda4 LVM

jamesNJ
Posts: 20
Joined: 2015/02/25 21:49:44

Re: CentOS server freeze/crash on megaraid rebuild, analysis

Post by jamesNJ » 2015/08/10 18:30:41

I still have this problem, but I have some new information ... I wonder if this thread should be transferred to the CentOS hardware forum.

I suffered a disk failure in the MegaRAID set this weekend, so it rebuilt automatically onto the hot spare. During this time the server went dead twice; I was not able to collect anything useful, as I had to force a reboot remotely. When I could get to the system, I edited /usr/lib/udev/rules.d/65-md-incremental.rules to disable the md maintenance, in hopes that would prevent future outages. I then restarted systemd-udevd.service.
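As a sanity check that the edit took, my understanding is that udevadm can replay the rules for the disk without firing the RUN commands, so I plan to confirm with something like:

# udevadm test /sys/block/sda 2>&1 | grep -i mdadm

which should come back empty if the md rule is really out of the picture.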

Today I hit another outage and was able to collect some extra data. From what I can see, the md utilities are not kicking off, so at this point I don't think the md action from udev was the cause of the issue; whatever goes wrong seems to happen earlier, and the md errors were just a red herring.

Here is some data I collected from dmesg and messages logs. Any comments or direction would be appreciated.

It looks like the first two lines in dmesg might be the thing to look at. Can anyone tell me what causes the messages about the megaraid scanning and the SCSI cache sync to occur? Maybe there is some automated process that triggers the "LVM2 PV scan" that I should look into?
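On that last point, the only automated thing I can find that would issue an "LVM2 PV scan" is the lvmetad auto-activation path; if I read the CentOS 7 setup correctly, a udev rule (69-dm-lvm-metad.rules) ties each PV to a lvm2-pvscan@<major>:<minor>.service unit, which matches the "LVM2 PV scan on device 8:3/8:4" units being stopped in the log. My planned check (not yet verified):

# systemctl -a | grep lvm2-pvscan
# grep -n pvscan /usr/lib/udev/rules.d/69-dm-lvm-metad.rules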

Here are my log outputs ...

DMESG:
[143476.406788] megaraid_sas: scanning for scsi0...
[143476.460051] sd 0:2:0:0: [sda] Synchronizing SCSI cache
[143476.783166] EXT4-fs warning (device dm-2): ext4_end_bio:332: I/O error -19 writing to inode 15 (offset 0 size 0 starting block 532513)
[143476.783173] Buffer I/O error on device dm-2, logical block 532513
[143477.711084] Aborting journal on device dm-2-8.
[143477.711093] Buffer I/O error on device dm-2, logical block 2655236
[143477.711095] lost page write due to I/O error on dm-2
[143477.711097] JBD2: Error -5 detected when updating journal superblock for dm-2-8.
[143480.014933] Buffer I/O error on device dm-1, logical block 2129940
[143480.014936] lost page write due to I/O error on dm-1
[143480.015074] Aborting journal on device dm-1-8.
[143480.015078] Buffer I/O error on device dm-1, logical block 2129920
[143480.015080] lost page write due to I/O error on dm-1
[143480.015082] JBD2: Error -5 detected when updating journal superblock for dm-1-8.
[143480.015111] Buffer I/O error on device dm-1, logical block 0
[143480.015118] lost page write due to I/O error on dm-1
[143480.015126] EXT4-fs (dm-1): previous I/O error to superblock detected
[143480.015121] EXT4-fs error (device dm-1): ext4_journal_check_start:56: Detected aborted journal
[143480.015132] EXT4-fs (dm-1): Remounting filesystem read-only
[143480.015152] Buffer I/O error on device dm-1, logical block 0
[143480.015158] lost page write due to I/O error on dm-1
[143480.015163] EXT4-fs error (device dm-1): ext4_journal_check_start:56: Detected aborted journal
[143480.015170] Buffer I/O error on device dm-1, logical block 0
[143480.015173] lost page write due to I/O error on dm-1
[143485.039033] Buffer I/O error on device dm-1, logical block 524302
[143485.039036] lost page write due to I/O error on dm-1
[143485.039040] Buffer I/O error on device dm-1, logical block 524414
[143485.039042] lost page write due to I/O error on dm-1
[143485.039050] Buffer I/O error on device dm-1, logical block 1088189
[143485.039051] lost page write due to I/O error on dm-1
[143485.039056] Buffer I/O error on device dm-1, logical block 4196613
[143485.039057] lost page write due to I/O error on dm-1


/var/log/messages:
Aug 10 13:47:05 smaug kernel: megaraid_sas: scanning for scsi0...
Aug 10 13:47:05 smaug systemd: Stopping LVM2 PV scan on device 8:3...
Aug 10 13:47:05 smaug systemd: Stopping LVM2 PV scan on device 8:4...
Aug 10 13:47:05 smaug systemd: Stopping Local File Systems.
Aug 10 13:47:05 smaug systemd: Stopped target Local File Systems.
Aug 10 13:47:05 smaug systemd: Unmounting /boot...
Aug 10 13:47:05 smaug systemd: Failed at step EXEC spawning /bin/umount: Input/output error
Aug 10 13:47:05 smaug systemd: boot.mount mount process exited, code=exited status=203
Aug 10 13:47:05 smaug systemd: Failed unmounting /boot.
Aug 10 13:47:05 smaug systemd: Stopping File System Check on /dev/disk/by-uuid/14b55dbf-64b0-477f-a003-1ec7404cb363...
Aug 10 13:47:05 smaug systemd: Stopped File System Check on /dev/disk/by-uuid/14b55dbf-64b0-477f-a003-1ec7404cb363.
Aug 10 13:47:05 smaug lvm: Device 8:3 not found. Cleared from lvmetad cache.
Aug 10 13:47:05 smaug systemd: Stopped LVM2 PV scan on device 8:3.
Aug 10 13:47:05 smaug kernel: sd 0:2:0:0: [sda] Synchronizing SCSI cache
Aug 10 13:47:05 smaug lvm: Device 8:4 not found. Cleared from lvmetad cache.
Aug 10 13:47:05 smaug systemd: Stopped LVM2 PV scan on device 8:4.
Aug 10 13:47:05 smaug mysqld_safe: 150810 13:47:05 mysqld_safe Number of processes running now: 0
Aug 10 13:47:05 smaug mysqld_safe: 150810 13:47:05 mysqld_safe mysqld restarted
Aug 10 13:47:05 smaug kernel: EXT4-fs warning (device dm-2): ext4_end_bio:332: I/O error -19 writing to inode 15 (offset 0 size 0 starting block 532513)
Aug 10 13:47:05 smaug kernel: Buffer I/O error on device dm-2, logical block 532513
Aug 10 13:47:05 smaug mysqld_safe: 150810 13:47:05 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
Aug 10 13:47:06 smaug kernel: Aborting journal on device dm-2-8.
Aug 10 13:47:06 smaug kernel: Buffer I/O error on device dm-2, logical block 2655236
Aug 10 13:47:06 smaug kernel: lost page write due to I/O error on dm-2
Aug 10 13:47:06 smaug kernel: JBD2: Error -5 detected when updating journal superblock for dm-2-8.

[Thread moved, as suggested, to CentOS 7 - Hardware Support.]

jamesNJ
Posts: 20
Joined: 2015/02/25 21:49:44

Re: CentOS server freeze/crash on megaraid rebuild, analysis

Post by jamesNJ » 2015/08/11 02:26:11

One more interesting fact: I noted above that I suffered another outage today and initially wasn't sure why.

When I finally inspected the hardware expecting to see one failed drive, I actually had two failed drives. Unlucky me, but not a surprise; I chose RAID-6 plus a hot spare to help insulate against problems like this. The second server outage occurred after my other administrator confirmed that one drive had failed ....

So I am speculating that adverse hardware events on the MegaRAID correlate with these outages -- the issue has only been observed when drives fail or are in the process of recovery/rebuilding.

One thing I am going to try is disabling the smartd daemon. Right now that is the only thing I can find that periodically polls the hardware for status.
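If I go ahead with that, the plan is simply:

# systemctl stop smartd.service
# systemctl disable smartd.service

and then wait for the next rebuild to see whether the outages stop.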

Does anyone else have any suggestions for additional troubleshooting or logging that might help?
