software RAID-1 data recovery

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Post by markmh » 2012/01/30 00:31:14

Hello,

We have a software (not hardware) RAID 1 array, and a few hundred gigabytes of data have disappeared. I believe the two disks, sda and sdb, should be mirroring each other. An error message in /var/log/messages (below) indicates a problem with sda.

Our priority is to recover the data. It contains the research of a few people. What is a good way to access the data safely?
We would like to be able to copy it to an external disk. What is the safest way to look at the contents of each of the two disks, sda and sdb, separately? I am afraid to try anything in case I make things worse.

I saw these commands suggested somewhere, but I am not sure if these apply in my situation:

umount /compute
mkdir /compute-other
mount /dev/sdb /compute
mount /dev/sda /compute-other


Any help would be appreciated. Thanks.


(more info, FYI, if useful:

system info: CentOS release 4.4 (final) operating system
distribution is Rocks release 4.2.1 (Cydonia)

/proc/mdstat gives:
Personalities :
unused devices:

/etc/fstab gives:
[code]# This file is edited by fstab-sync - see 'man fstab-sync' for details
LABEL=/ / ext3 defaults 1 1
none /dev/pts devpts gid=5,mode=620 0 0
none /dev/shm tmpfs noexec,nosuid 0 0
none /proc proc defaults 0 0
LABEL=/state/partition /state/partition1 ext3 nosuid,defaults 1 2
none /sys sysfs defaults 0 0
LABEL=/var /var ext3 defaults 1 2
/dev/sda3 swap swap defaults 0 0
192.168.1.250:/export/home /home nfs nosuid,intr,rsize=32768,wsize=32768,noatime 0 0
192.168.1.250:/compute /compute nfs rsize=32768,wsize=32768,noatime,nosuid,nodev,intr 0 0
/dev/hdc /media/cdrom auto pamconsole,exec,noauto,managed 0 0
/dev/fd0 /media/floppy auto pamconsole,exec,noauto,managed 0 0
[/code]
[Moderator edit: Added [i]code[/i] tags to preserve formatting.]

/var/log/messages gives this, and repeats:
Jan 22 04:09:20 compute-1-22.local ntpdate[28121]: cap_set_proc failed.
Jan 22 04:09:21 compute-1-22.local ntpd[28125]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:20:50 compute-1-11.local smartd[3642]: Device: /dev/sda, 1 Currently unreadable (pending) sectors
Jan 22 04:20:50 compute-1-11.local smartd[3642]: Device: /dev/sda, 1 Offline uncorrectable sectors
Jan 22 04:27:40 compute-1-17.local ntpdate[16403]: cap_set_proc failed.
Jan 22 04:27:41 compute-1-17.local ntpd[16407]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:37:07 compute-1-20.local ntpdate[25104]: cap_set_proc failed.
Jan 22 04:37:07 compute-1-20.local ntpd[25108]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:44:03 compute-1-19.local ntpdate[17662]: cap_set_proc failed.
Jan 22 04:44:03 compute-1-19.local ntpd[17666]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:44:23 compute-1-24.local ntpdate[17321]: cap_set_proc failed.
Jan 22 04:44:23 compute-1-24.local ntpd[17325]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:50:49 compute-1-11.local smartd[3642]: Device: /dev/sda, 1 Currently unreadable (pending) sectors
Jan 22 04:50:49 compute-1-11.local smartd[3642]: Device: /dev/sda, 1 Offline uncorrectable sectors
Jan 22 04:54:13 compute-1-16.local ntpdate[8123]: cap_set_proc failed.
Jan 22 04:54:13 compute-1-16.local ntpd[8127]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 05:00:12 compute-1-21.local ntpdate[23242]: cap_set_proc failed.
Jan 22 05:00:12 compute-1-21.local ntpd[23246]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 05:01:04 compute-1-7.local ntpdate[8102]: no server suitable for synchronization found
Jan 22 05:01:04 compute-1-12.local ntpdate[11055]: no server suitable for synchronization found
Jan 22 05:01:04 compute-1-14.local ntpdate[17838]: no server suitable for synchronization found
Jan 22 05:01:05 compute-1-25.local ntpdate[12129]: no server suitable for synchronization found
Jan 22 05:01:05 compute-1-9.local ntpdate[27772]: no server suitable for synchronization found
Jan 22 05:01:07 compute-1-15.local ntpdate[16657]: no server suitable for synchronization found
Jan 22 05:01:07 compute-1-6.local ntpdate[13937]: no server suitable for synchronization found
)

pschaff
Retired Moderator
Posts: 18276
Joined: 2006/12/13 20:15:34
Location: Tidewater, Virginia, North America

Post by pschaff » 2012/01/30 10:34:18

Welcome to the CentOS fora. Please see the recommended reading for new users linked in my signature.

I'm a bit confused as to why /proc/mdstat does not show the RAID.
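One possible explanation, judging only from the fstab posted above: /compute is listed there as an NFS mount from 192.168.1.250, so those commands may have been run on a client node rather than on the machine that actually holds the array. Here is a minimal sketch for checking where a mount point comes from. `mount_source` is a made-up helper name, not a standard tool, and it simply matches the second column of an fstab-style file.

```shell
#!/bin/sh
# Print the source and filesystem type that an fstab-format file lists
# for a given mount point.
# Arguments: mount point, then the file to read (defaults to /etc/fstab;
# /proc/mounts uses the same first three columns).
# mount_source is a hypothetical helper, not part of any package.
mount_source() {
    awk -v mp="$1" '$1 !~ /^#/ && $2 == mp { print $1, $3 }' "${2:-/etc/fstab}"
}

# Example use on the machine in question (read-only, changes nothing):
#   mount_source /compute /proc/mounts
#   cat /proc/mdstat
```

If the type comes back as nfs, /proc/mdstat would need to be read on the NFS server instead.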

It appears /dev/sda has serious problems and may not be recoverable. If it contains the missing data, and cannot be successfully mounted, then you may want to look at some rescue tools such as the following from [url=http://wiki.centos.org/AdditionalResources/Repositories?highlight=%28rpmforge%29]RPMforge[/url]:

dd_rescue
dd_rhelp
ddrescue

I'm not sure whether these are available for CentOS-4. If they are not, your best bet may be to mount the disks on a system with a later OS installation. Whatever you do, avoid writing to the disks.
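As a rough illustration of the non-destructive approach, here is a sketch of how a GNU ddrescue run is commonly laid out. It only prints the commands, it does not run them. The device name and the /mnt/external paths are placeholders I made up, and the image has to go on a different, healthy disk with enough free space. The options shown are GNU ddrescue's; dd_rescue uses a different syntax.

```shell
#!/bin/sh
# Print (not run) a two-pass GNU ddrescue plan that copies a failing disk
# to an image file on a separate, healthy disk.
# ddrescue_plan is a hypothetical helper; all three arguments are
# placeholders chosen by the caller.
ddrescue_plan() {
    src=$1      # disk to read from, e.g. /dev/sda (only ever read)
    img=$2      # image file on the external disk
    map=$3      # ddrescue map/log file, lets an interrupted run resume
    # Pass 1: copy everything that reads cleanly, skip the slow scraping phase.
    echo "ddrescue -n $src $img $map"
    # Pass 2: return to the bad areas with direct disc access, up to 3 retry passes.
    echo "ddrescue -d -r3 $src $img $map"
}

ddrescue_plan /dev/sda /mnt/external/sda.img /mnt/external/sda.map
```

Any later inspection can then be done on the image (or a copy of it) rather than on the failing disk.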

Mounting each device independently, as you outlined, rather than as RAID members should let you access the contents, if they can be mounted successfully.
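A minimal sketch of what a read-only mount of a single member might look like. It only prints the commands. It assumes an ext3 file system directly on the whole-disk member and the old 0.90 md superblock format, which keeps its metadata at the end of the device so the file system starts at offset 0. Neither assumption has been confirmed for this array, and the /mnt/... directories are placeholders.

```shell
#!/bin/sh
# Print (not run) the commands for a read-only look at one RAID-1 member.
# Assumption: ext3 starting at offset 0 of the member (true for the 0.90
# md superblock format, which stores its metadata at the end of the device).
# ro_mount_plan is a hypothetical helper name.
ro_mount_plan() {
    dev=$1   # member device, e.g. /dev/sdb
    dir=$2   # empty directory to mount on (placeholder path)
    echo "mkdir -p $dir"
    # ro: mount read-only; noload: do not replay the ext3 journal either.
    echo "mount -t ext3 -o ro,noload $dev $dir"
}

ro_mount_plan /dev/sdb /mnt/sdb-ro
ro_mount_plan /dev/sda /mnt/sda-ro
```

The noload option keeps ext3 from replaying its journal, which could otherwise write to the disk even with ro. A member that is still part of a running array will normally be reported as busy, so the array would have to be stopped first (mdadm --stop) or the disk moved to another machine.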

Do note that you are running a very obsolete, insecure, and unsupported release. The current CentOS-4 release is 4.9 and the whole series reaches [url=https://www.centos.org/modules/newbb/viewtopic.php?topic_id=35288&forum=27&post_id=152075#forumpost152075]end of life[/url] in a few weeks.

If more help is needed then please [url=http://www.centos.org/modules/newbb/viewtopic.php?topic_id=28723&forum=54]provide more information about your system[/url] by running "./getinfo.sh" and showing us the output file.

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Post by markmh » 2012/01/31 15:56:44

Thanks Phil, I will try to mount them today and see what happens.

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Post by markmh » 2012/02/01 22:51:35

OK, here is some more info:

./getinfo.sh basic:


Basic system information.
[code]
== BEGIN uname -rmi ==
2.6.9-78.ELsmp x86_64 x86_64
== END uname -rmi ==

== BEGIN rpm -qa \*-release\* ==
centos-release-4-4.2
rpmforge-release-0.3.6-1.el4.rf
== END rpm -qa \*-release\* ==

== BEGIN cat /etc/redhat-release ==
CentOS release 4.4 (Final)
== END cat /etc/redhat-release ==

== BEGIN getenforce ==
Disabled
== END getenforce ==

== BEGIN free -m ==
             total       used       free     shared    buffers     cached
Mem:          4903       2156       2746          0         98        983
-/+ buffers/cache:       1074       3828
Swap:          996        521        474
== END free -m ==

[/code]





df -h (as root, on head node):

Filesystem            Size  Used Avail Use% Mounted on
/dev/sdc2              32G   11G   20G  36% /
/dev/sdc1             251M   51M  187M  22% /boot
none                  3.0G     0  3.0G   0% /dev/shm
/dev/sdc5             258G  242G  3.1G  99% /export
tmpfs                 1.5G  9.2M  1.5G   1% /var/lib/ganglia/rrds
/dev/md0              1.4T  1.1T  281G  79% /compute
/export/home/opt      258G  242G  3.1G  99% /home/opt





lsscsi (as root):

[0:0:0:0]    disk    ATA      ST31500341AS     CC1H  /dev/sda
[1:0:0:0]    disk    ATA      ST31500341AS     CC1H  /dev/sdb
[6:0:0:0]    disk    AMCC     9550SX-4LP DISK  3.04  /dev/sdc





umount /compute (as root):

umount: /compute : device is busy
umount: /compute : device is busy


so, as far as I can tell, it doesn't even want to be unmounted. Does this help? Is there something I need to do to be able to unmount it?


Thanks

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Post by markmh » 2012/02/01 23:02:54

also,

cat /proc/mdstat (as root):

Personalities: [raid]
md0: active raid1 sda [0] sdb [1]
1465138496 block [2/2] [UU]
unused devices: (none)
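
For anyone reading along: [2/2] [UU] means both halves of the mirror are currently marked as up. Below is a small sketch for spotting an array that is not in that state. `md_degraded` is a hypothetical helper name, and all it does is look for an underscore inside the status brackets of an mdstat-format file.

```shell
#!/bin/sh
# List md arrays whose status brackets contain "_" (a missing member),
# reading an mdstat-format file (defaults to /proc/mdstat).
# md_degraded is a hypothetical helper, not part of mdadm.
md_degraded() {
    awk '
        /^md/ { name = $1; sub(/:$/, "", name) }   # remember the array name
        /\[[U_]*_[U_]*\]/ { print name }           # e.g. [_U] or [U_]
    ' "${1:-/proc/mdstat}"
}

# Read-only follow-ups for more detail on an array and its members:
#   mdadm --detail /dev/md0
#   mdadm --examine /dev/sda
```

No output from md_degraded means every array it found shows all members up.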

TrevorH
Forum Moderator
Posts: 23181
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: software RAID-1 data recovery

Post by TrevorH » 2012/02/02 01:06:35

[quote]
so, as far as I can tell, it doesn't even want to be unmounted. Does this help? Is there something I need to do to be able to unmount it?
[/quote]

I'd guess the file system is exported over NFS and some or all of your clients are using it (have it mounted). Does `showmount -d` give any clue? Or maybe it's just exported and you need to stop the nfs service.
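
A sketch of the order in which one might try things on the server, printed rather than executed. `umount_plan` is just an illustrative name; the commands it prints are the standard showmount, fuser, service and umount. As far as I know the kernel NFS server does not show up in fuser output, so an empty list there does not rule out NFS as the holder.

```shell
#!/bin/sh
# Print (not run) a possible sequence of checks before unmounting an
# NFS-exported file system on the server.
# umount_plan is a hypothetical helper name.
umount_plan() {
    mp=$1
    echo "showmount -a"      # client:directory pairs the server has recorded (may be stale)
    echo "fuser -vm $mp"     # local processes with files open on the mount
    echo "service nfs stop"  # stops ALL exports, so every NFS client loses access
    echo "umount $mp"
}

umount_plan /compute
```

Stopping the nfs service affects every exported directory, not only the one being unmounted, so clients that depend on the other exports would be cut off until it is started again.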

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Post by markmh » 2012/02/03 22:01:13

OK, I stopped the jobs that were running on the nodes, and even turned off some of the nodes, but it still won't unmount. Is there possibly more I have to do to disconnect the nodes?

showmount -d gives:

Directories on lcpp-cluster.bgsu.edu:
/compute
/diskless/x64/RHEL4-AS/root
/export
192.168.1.0/255.255.255.0


Does that tell you anything?

pschaff
Retired Moderator
Posts: 18276
Joined: 2006/12/13 20:15:34
Location: Tidewater, Virginia, North America

Re: software RAID-1 data recovery

Post by pschaff » 2012/02/03 22:06:26

What does [code]showmount -a[/code] show?

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Post by markmh » 2012/02/04 23:08:37

I am going to have to wait until Monday to access the machine and to get that info. But, is it possible for me to edit my previous post? I probably should not have included the address... :/

pschaff
Retired Moderator
Posts: 18276
Joined: 2006/12/13 20:15:34
Location: Tidewater, Virginia, North America

Re: software RAID-1 data recovery

Post by pschaff » 2012/02/05 13:08:40

No, not after the one-hour edit period has expired, but a moderator can do it for you. However, I see nothing but non-routable private IP addresses, accessible only on the LAN, and no public IP addresses that could be reached from the Internet, unless I am (again) missing something.

We really needed to see more than "./getinfo.sh basic" - that's why I said "./getinfo.sh". It appears you are still at 4.4. An update (and a migration plan) would be highly advised, as many mirrors will stop carrying 4.x once it goes EOL.
