So... I started with two blank 6TB hard drives. Ahead of time, I preallocated a 512MiB FAT32 partition on sdb to act as a place to later put a duplicate copy of the EFI partition that I was going to let the CentOS 7 installer create on sda (since EFI partitions can't be mirrored via RAID). From within the CentOS installer, I created the EFI partition and then set up RAID arrays for /boot, swap, /, and /home (ext4 for all but swap, and no LVM). The installation went fine, and afterwards I used mdadm -E and the links in /dev/disk/by-uuid to check that things were the way I wanted them; about my only surprise was that the md0, md1, etc. I had specified during installation had instead turned into md124, md125, etc.
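For reference, my sanity checks looked roughly like the following (the partition numbers here are just for illustration; my actual layout may map differently):

```shell
# Do the member partitions on both disks agree about their arrays?
mdadm -E /dev/sda2 /dev/sdb2    # e.g. the /boot array members
mdadm -E /dev/sda4 /dev/sdb4    # e.g. the / array members

# Overall array state and names (this is where md124, md125, etc. showed up)
cat /proc/mdstat

# Which UUIDs do the by-uuid symlinks resolve to?
ls -l /dev/disk/by-uuid/
```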
Having done that, I used dd to copy the real EFI partition (sda1) onto sdb1. I changed the UUID on sdb1, set its flags to match sda1, and then used efibootmgr to create a boot entry for its copy of shimx64.efi. My reasoning was that, in the event of a complete sda failure, I could still boot into my OS via the EFI loader on sdb, then use mdadm to fail and remove the sda partitions from their respective arrays, add a blank replacement disk with an identical partition table, and then use mdadm again to rebuild the arrays.
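In case it matters, the commands I used were along these lines (the partition number in the mdadm recovery example is a placeholder, and the mlabel serial is just an example value):

```shell
# Mirror the ESP by hand (sda1 -> sdb1); both partitions are the same size
dd if=/dev/sda1 of=/dev/sdb1 bs=4M conv=fsync

# Give the copy its own FAT volume serial so the two ESPs are distinguishable
# (DEADBEEF is a placeholder serial)
mlabel -N DEADBEEF -i /dev/sdb1 ::

# Mark sdb1 as an ESP, same as sda1
parted /dev/sdb set 1 esp on

# Add a firmware boot entry pointing at the copied shim
efibootmgr -c -d /dev/sdb -p 1 -L "CentOS (sdb)" -l '\EFI\centos\shimx64.efi'

# Planned recovery after swapping in a blank, identically partitioned sda:
# mdadm /dev/md126 --add /dev/sda4    # resync to the new member starts automatically
```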
Oh, one final thing: I added the nofail flag to the installer-created fstab entry for mounting the EFI partition in /boot/efi, because I didn't want that missing partition to break the boot process in the event of an sda failure. All the other fstab entries used the RAID UUIDs, so I figured they'd still mount properly if sda went south (which apparently was an incorrect assumption, as you'll see).
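The resulting fstab entry for the ESP looked something like this (UUID is a placeholder; the key addition is nofail):

```
UUID=XXXX-XXXX  /boot/efi  vfat  umask=0077,shortname=winnt,nofail  0 0
```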
Now it was time to test what I had done. I disconnected both drives, removed sdb, and used a physical HD duplicator to create a clone of it on a blank drive. I replaced the original sdb with the clone and booted with only that one drive attached (i.e. to simulate a total sda failure). The CentOS EFI entry wouldn't work (obviously, since it was pointing at sda1), but the sdb1 entry I made with efibootmgr did work; I got to grub fine, and I even got a bit of the graphical CentOS loading screen. But then things died, and I was dropped into emergency mode.
The relevant messages I got on screen are as follows:
Code:
[ OK ] Reached target Basic System.
[ TIME ] Timed out waiting for device dev-disk-by\x2duuid-(uuid of md126, i.e. root partition).
[ TIME ] Timed out waiting for device dev-disk-by\x2duuid-(uuid of md126, i.e. root partition).
[ DEPEND ] Dependency failed for File System Check on /dev/disk/by-uuid/(uuid of md126).
[ DEPEND ] Dependency failed for /sysroot.
[ DEPEND ] Dependency failed for Initrd Root File System.
[ DEPEND ] Dependency failed for Reload Configuration from the Real Root.
Anyway, that's where I'm at. It's almost as if initrd doesn't register md126's UUID as valid if both disks aren't present. Obviously there's something I'm missing, but nothing obvious is jumping out at me. Any help would be appreciated.
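If it would help with diagnosis, I can poke at things from the emergency shell or from another machine; I'm assuming the relevant checks would be something like:

```shell
# Does the initramfs carry an mdadm.conf, and does it list the arrays' UUIDs?
lsinitrd /boot/initramfs-$(uname -r).img -f etc/mdadm.conf

# From the emergency shell: will the arrays assemble and run degraded?
mdadm --assemble --scan --run
cat /proc/mdstat
```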