RAID5 failure of one partition, not others

Issues related to hardware problems
nosebreaker
Posts: 72
Joined: 2010/08/09 16:10:26

RAID5 failure of one partition, not others

Post by nosebreaker » 2011/11/26 22:10:19

I have a 6-drive RAID5 array. I added 3 drives a few days ago, and I already have a failure! I don't actually think the drive has really failed; I just want it to try rebuilding the array onto that drive. Here is my /proc/mdstat:

Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sdc1[2](S) sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md1 : active raid5 sdf2[3] sde2[4] sdd2[5] sdc2[2] sdb2[1] sda2[0]
81930240 blocks level 5, 256k chunk, algorithm 2 [6/6] [UUUUUU]

md2 : active raid5 sdf3[3] sde3[4] sdd3[6](F) sdc3[2] sdb3[1] sda3[0]
9685105920 blocks level 5, 256k chunk, algorithm 2 [6/5] [UUUUU_]

unused devices: <none>

How can I tell it to rebuild the array onto that drive? The drive is brand new, and its other partition (in md1) is working fine! Do I have to remove the drive from the array and then add it back?
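
I'm guessing the usual sequence is something like this, but I haven't tried it yet, so corrections are welcome:

# mark the member as failed (if md hasn't already), drop it, then re-add it so md rebuilds onto it
/sbin/mdadm /dev/md2 --fail /dev/sdd3
/sbin/mdadm /dev/md2 --remove /dev/sdd3
/sbin/mdadm /dev/md2 --add /dev/sdd3
# then watch the rebuild
cat /proc/mdstat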

nosebreaker
Posts: 72
Joined: 2010/08/09 16:10:26

Re: RAID5 failure of one partition, not others

Post by nosebreaker » 2011/11/27 00:22:27

I tried to remove /dev/sdd3 and it worked, but then it wouldn't let me add it back. So I rebooted the system.

Now when I look at mdstat, it thinks that sdf is the failed drive!

Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sdc1[2](S) sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md1 : active raid5 sde2[3] sdd2[4] sdc2[2] sdb2[1] sda2[0]
81930240 blocks level 5, 256k chunk, algorithm 2 [6/5] [UUUUU_]

md2 : active raid5 sde3[3] sdd3[4] sdc3[2] sdb3[1] sda3[0]
9685105920 blocks level 5, 256k chunk, algorithm 2 [6/5] [UUUUU_]

unused devices: <none>

nosebreaker
Posts: 72
Joined: 2010/08/09 16:10:26

Re: RAID5 failure of one partition, not others

Post by nosebreaker » 2011/11/27 00:33:43

Perhaps this will help:

dmesg | grep md
Kernel command line: ro root=/dev/md2 rhgb quiet
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
md: Autodetecting RAID arrays.
md: invalid raid superblock magic on sdd1
md: sdd1 has invalid sb, not importing!
md: invalid raid superblock magic on sde1
md: sde1 has invalid sb, not importing!
md: autorun ...
md: considering sde3 ...
md: adding sde3 ...
md: sde2 has different UUID to sde3
md: adding sdd3 ...
md: sdd2 has different UUID to sde3
md: adding sdc3 ...
md: sdc2 has different UUID to sde3
md: sdc1 has different UUID to sde3
md: adding sdb3 ...
md: sdb2 has different UUID to sde3
md: sdb1 has different UUID to sde3
md: adding sda3 ...
md: sda2 has different UUID to sde3
md: sda1 has different UUID to sde3
md: created md2
md: bind
md: bind
md: bind
md: bind
md: bind
md: running:
raid5: allocated 6286kB for md2
raid5: raid level 5 set md2 active with 5 out of 6 devices, algorithm 2
md: considering sde2 ...
md: adding sde2 ...
md: adding sdd2 ...
md: adding sdc2 ...
md: sdc1 has different UUID to sde2
md: adding sdb2 ...
md: sdb1 has different UUID to sde2
md: adding sda2 ...
md: sda1 has different UUID to sde2
md: created md1
md: bind
md: bind
md: bind
md: bind
md: bind
md: running:
raid5: allocated 6286kB for md1
raid5: raid level 5 set md1 active with 5 out of 6 devices, algorithm 2
md: considering sdc1 ...
md: adding sdc1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md0
md: bind
md: bind
md: bind
md: running:
md: personality for level 1 is not loaded!
md: do_md_run() returned -22
md: md0 stopped.
md: unbind
md: export_rdev(sdc1)
md: unbind
md: export_rdev(sdb1)
md: unbind
md: export_rdev(sda1)
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sda1 ...
md: adding sda1 ...
md: adding sdb1 ...
md: adding sdc1 ...
md: created md0
md: bind
md: bind
md: bind
md: running:
md: personality for level 1 is not loaded!
md: do_md_run() returned -22
md: md0 stopped.
md: unbind
md: export_rdev(sda1)
md: unbind
md: export_rdev(sdb1)
md: unbind
md: export_rdev(sdc1)
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdc1 ...
md: adding sdc1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md0
md: bind
md: bind
md: bind
md: running:
md: raid1 personality registered for level 1
raid1: raid set md0 active with 2 out of 2 mirrors
md: ... autorun DONE.
EXT3 FS on md2, internal journal
EXT3 FS on md0, internal journal
Adding 32772088k swap on /dev/md1. Priority:-1 extents:1 across:32772088k




dmesg | grep raid
raid5: automatically using best checksumming function: pIII_sse
raid5: using function: pIII_sse (4941.000 MB/sec)
raid6: int32x1 1176 MB/s
raid6: int32x2 1299 MB/s
raid6: int32x4 1197 MB/s
raid6: int32x8 1266 MB/s
raid6: mmxx1 2400 MB/s
raid6: mmxx2 4523 MB/s
raid6: sse1x1 2000 MB/s
raid6: sse1x2 3551 MB/s
raid6: sse2x1 3718 MB/s
raid6: sse2x2 5994 MB/s
raid6: using algorithm sse2x2 (5994 MB/s)
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
device-mapper: dm-raid45: initialized v0.2594l
md: invalid raid superblock magic on sdd1
md: invalid raid superblock magic on sde1
raid5: device sde3 operational as raid disk 3
raid5: device sdd3 operational as raid disk 4
raid5: device sdc3 operational as raid disk 2
raid5: device sdb3 operational as raid disk 1
raid5: device sda3 operational as raid disk 0
raid5: allocated 6286kB for md2
raid5: raid level 5 set md2 active with 5 out of 6 devices, algorithm 2
raid5: device sde2 operational as raid disk 3
raid5: device sdd2 operational as raid disk 4
raid5: device sdc2 operational as raid disk 2
raid5: device sdb2 operational as raid disk 1
raid5: device sda2 operational as raid disk 0
raid5: allocated 6286kB for md1
raid5: raid level 5 set md1 active with 5 out of 6 devices, algorithm 2
md: raid1 personality registered for level 1
raid1: raid set md0 active with 2 out of 2 mirrors

nosebreaker
Posts: 72
Joined: 2010/08/09 16:10:26

Re: RAID5 failure of one partition, not others

Post by nosebreaker » 2011/11/27 00:38:50

/sbin/fdisk -l

Disk /dev/sda: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104391 fd Linux raid autodetect
/dev/sda2 14 2053 16386300 fd Linux raid autodetect
/dev/sda3 2054 243201 1937021310 fd Linux raid autodetect

Disk /dev/sdb: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sdb1 * 1 13 104391 fd Linux raid autodetect
/dev/sdb2 14 2053 16386300 fd Linux raid autodetect
/dev/sdb3 2054 243201 1937021310 fd Linux raid autodetect

Disk /dev/sdc: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sdc1 * 1 13 104391 fd Linux raid autodetect
/dev/sdc2 14 2053 16386300 fd Linux raid autodetect
/dev/sdc3 2054 243201 1937021310 fd Linux raid autodetect

Disk /dev/sdd: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sdd1 1 13 104391 fd Linux raid autodetect
/dev/sdd2 14 2053 16386300 fd Linux raid autodetect
/dev/sdd3 2054 243201 1937021310 fd Linux raid autodetect

Disk /dev/sde: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sde1 1 13 104391 fd Linux raid autodetect
/dev/sde2 14 2053 16386300 fd Linux raid autodetect
/dev/sde3 2054 243201 1937021310 fd Linux raid autodetect

Disk /dev/md2: 9917.5 GB, 9917548462080 bytes
2 heads, 4 sectors/track, -1873690816 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md2 doesn't contain a valid partition table

Disk /dev/md1: 83.8 GB, 83896565760 bytes
2 heads, 4 sectors/track, 20482560 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md0: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md0 doesn't contain a valid partition table

nosebreaker
Posts: 72
Joined: 2010/08/09 16:10:26

Re: RAID5 failure of one partition, not others

Post by nosebreaker » 2011/11/27 01:08:04

OK, so I shut the box down, went to the console, opened it up, and checked that all the cables were connected. Then I powered it on, went into the BIOS, and it saw all 6 drives. When I let it boot, it showed that /dev/sdd3 had failed, just like the first time. Since it had already been removed, I simply added it back (/sbin/mdadm /dev/md2 -a /dev/sdd3) and it worked! Now it is rebuilding the array. I have no idea why it behaved this strangely; it had been running for about a year with just the 3 drives without any issues.
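
I'm keeping an eye on the rebuild with something along these lines:

# rebuild progress shows up in mdstat
watch -n 10 cat /proc/mdstat
# more detail on the array itself
/sbin/mdadm --detail /dev/md2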

nosebreaker
Posts: 72
Joined: 2010/08/09 16:10:26

Re: RAID5 failure of one partition, not others

Post by nosebreaker » 2011/11/27 17:18:11

OK, new problem. Today I checked on the rebuild, and it shows that sde has now failed and sdd3 is a spare! Yet the array seems to be working, with no data loss so far. Help!


Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sdc1[2](S) sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md1 : active raid5 sdf2[3] sde2[6](F) sdd2[5] sdc2[2] sdb2[1] sda2[0]
81930240 blocks level 5, 256k chunk, algorithm 2 [6/5] [UUUU_U]

md2 : active raid5 sdd3[6](S) sdf3[3] sde3[7](F) sdc3[2] sdb3[1] sda3[0]
9685105920 blocks level 5, 256k chunk, algorithm 2 [6/4] [UUUU__]

unused devices: <none>
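
Before I touch anything else, I'm going to check what md and SMART think of the suspect members, roughly like this (smartctl is from smartmontools, assuming it's installed):

# RAID superblock info as md sees it on the suspect members
/sbin/mdadm --examine /dev/sde3 /dev/sdd3
# overall view of the array
/sbin/mdadm --detail /dev/md2
# drive health according to SMART
smartctl -a /dev/sde
smartctl -a /dev/sdd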

nosebreaker
Posts: 72
Joined: 2010/08/09 16:10:26

Re: RAID5 failure of one partition, not others

Post by nosebreaker » 2011/11/27 18:08:17

I take it back; there is apparent data loss. I can get a directory listing, but I cannot read files. I can still log in and run commands, though.

There is a VM running on this box that appears to be fine so far; it can read/write to its virtual HDD, which is on md2.

If I "cat /dev/sde3" it does show data, however it says "input/output error" as well.

nosebreaker
Posts: 72
Joined: 2010/08/09 16:10:26

Failed RAID5 volume (see latest post)

Post by nosebreaker » 2011/11/28 03:07:11

It appears that the culprit was the cheap SATA cables that came with the new drives. I rebooted the server again and it didn't detect one of the drives at all, so I powered it off and opened it up again. I noticed the new cables sit a bit looser than the old ones; after pushing them back on and powering up again, it detects all the drives.

Now the system won't boot: it says 2 of the 6 drives have failed and it cannot find the superblock. I also noticed it reported no RAID superblock on sdd, sde, and sdf. It printed the raid456 insmod error before as well, so I believe that one is spurious.

insmod: error inserting '/lib/raid456.ko': -1 File exists
insmod: error inserting '/lib/raid456.ko': -1 File exists
md: invalid raid superblock magic on sdd1
md: invalid raid superblock magic on sde1
md: invalid raid superblock magic on sdf1
raid5: not enough operational devices for md1 (2/6 failed)
raid5: failed to run raid set md2
md: pers->run() failed ...
EXT3-fs: unable to read superblock
mount: error mounting /dev/root on /sysroot as ext3: Invalid argument
setuproot: moving /dev failed: No such file or directory
setuproot: error mounting /proc: No such file or directory
setuproot: error mounting /sys: No such file or directory
switchroot: mount failed: No such file or directory
Kernel panic - not syncing: Attempted to kill init!
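
My plan is to boot from rescue media and compare the event counters on the member partitions before forcing anything; stale members should show noticeably lower Events values. Roughly like this:

# print the Events counter from each member's RAID superblock
for d in /dev/sd[abcdef]3; do echo "$d"; mdadm --examine "$d" | grep Events; done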

nosebreaker
Posts: 72
Joined: 2010/08/09 16:10:26

Re: Failed RAID5 volume (see latest post) [solved]

Post by nosebreaker » 2011/11/28 23:25:29

I was able to repair the array! I first tried an Ubuntu live CD, but 'mdadm --assemble --scan -fv' wouldn't repair it there, as it didn't see enough drives. I then plugged in an IDE/PATA drive, installed the same version of CentOS (5.5) onto it, and ran the mdadm --assemble --scan -fv command again. This time it said it found 5/6 and was able to do something, and then the machine crashed/locked. After a reboot it wouldn't boot to the console, but since it had said it fixed something on the array, I removed the IDE drive and turned the box back on.
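
For reference, that same command spelled out with long options, plus a note on what the flags do:

# scan all devices and force assembly even though some members look failed/stale; verbose shows what it decides per member
mdadm --assemble --scan --force --verbose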

It booted saying 5/6 and is rebuilding the 6th drive now!

r_hartman
Posts: 711
Joined: 2009/03/23 15:08:11
Location: Netherlands
Contact:

Re: Failed RAID5 volume (see latest post) [solved]

Post by r_hartman » 2011/11/29 09:25:06

Quite a story. Makes me wonder what drives you added. The WD Caviar Green is notorious for RAID issues; WD sells special (expensive!) RE (RAID Edition) drives for that. Even the Caviar Black is not recommended for RAID purposes.

I've seen issues with flimsy SATA cables as well. There's only one remedy: get proper ones. You don't want to lose data just because a cheap cable is intermittent.

Post Reply