software RAID-1 data recovery

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

software RAID-1 data recovery

Postby markmh » 2012/01/30 00:31:14

Hello,

We have a software (not hardware) RAID 1 array, and a few hundred gigabytes of data have disappeared from it. I believe the two disks, sda and sdb, are supposed to be mirroring each other. An error message in /var/log/messages (below) indicates a problem with sda.

Our priority is to recover the data. It contains the research of a few people. What is a good way to access the data safely?
We would like to be able to copy it to an external disk. What is the safest way to look at the contents of each of the two disks, sda and sdb, separately? I am afraid to try anything in case I make things worse.

I saw these commands suggested somewhere, but I am not sure if these apply in my situation:

umount /compute
mkdir /compute-other
mount /dev/sdb /compute
mount /dev/sda /compute-other


Any help would be appreciated. Thanks.


(more info, FYI, if useful:

system info: CentOS release 4.4 (final) operating system
distribution is Rocks release 4.2.1 (Cydonia)

/proc/mdstat gives:
Personalities :
unused devices: <none>

/etc/fstab gives:

# This file is edited by fstab-sync - see 'man fstab-sync' for details
LABEL=/                 /                       ext3    defaults        1 1
none                    /dev/pts                devpts  gid=5,mode=620  0 0
none                    /dev/shm                tmpfs   noexec,nosuid        0 0
none                    /proc                   proc    defaults        0 0
LABEL=/state/partition  /state/partition1       ext3    nosuid,defaults        1 2
none                    /sys                    sysfs   defaults        0 0
LABEL=/var              /var                    ext3    defaults        1 2
/dev/sda3               swap                    swap    defaults        0 0
192.168.1.250:/export/home      /home           nfs     nosuid,intr,rsize=32768,wsize=32768,noatime 0 0
192.168.1.250:/compute /compute nfs rsize=32768,wsize=32768,noatime,nosuid,nodev,intr 0 0
/dev/hdc                /media/cdrom            auto    pamconsole,exec,noauto,managed 0 0
/dev/fd0                /media/floppy           auto    pamconsole,exec,noauto,managed 0 0

[Moderator edit: Added code tags to preserve formatting.]

/var/log/messages gives this, and repeats:
Jan 22 04:09:20 compute-1-22.local ntpdate[28121]: cap_set_proc failed.
Jan 22 04:09:21 compute-1-22.local ntpd[28125]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:20:50 compute-1-11.local smartd[3642]: Device: /dev/sda, 1 Currently unreadable (pending) sectors
Jan 22 04:20:50 compute-1-11.local smartd[3642]: Device: /dev/sda, 1 Offline uncorrectable sectors
Jan 22 04:27:40 compute-1-17.local ntpdate[16403]: cap_set_proc failed.
Jan 22 04:27:41 compute-1-17.local ntpd[16407]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:37:07 compute-1-20.local ntpdate[25104]: cap_set_proc failed.
Jan 22 04:37:07 compute-1-20.local ntpd[25108]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:44:03 compute-1-19.local ntpdate[17662]: cap_set_proc failed.
Jan 22 04:44:03 compute-1-19.local ntpd[17666]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:44:23 compute-1-24.local ntpdate[17321]: cap_set_proc failed.
Jan 22 04:44:23 compute-1-24.local ntpd[17325]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 04:50:49 compute-1-11.local smartd[3642]: Device: /dev/sda, 1 Currently unreadable (pending) sectors
Jan 22 04:50:49 compute-1-11.local smartd[3642]: Device: /dev/sda, 1 Offline uncorrectable sectors
Jan 22 04:54:13 compute-1-16.local ntpdate[8123]: cap_set_proc failed.
Jan 22 04:54:13 compute-1-16.local ntpd[8127]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 05:00:12 compute-1-21.local ntpdate[23242]: cap_set_proc failed.
Jan 22 05:00:12 compute-1-21.local ntpd[23246]: cap_set_proc() failed to drop root privileges: Operation not permitted
Jan 22 05:01:04 compute-1-7.local ntpdate[8102]: no server suitable for synchronization found
Jan 22 05:01:04 compute-1-12.local ntpdate[11055]: no server suitable for synchronization found
Jan 22 05:01:04 compute-1-14.local ntpdate[17838]: no server suitable for synchronization found
Jan 22 05:01:05 compute-1-25.local ntpdate[12129]: no server suitable for synchronization found
Jan 22 05:01:05 compute-1-9.local ntpdate[27772]: no server suitable for synchronization found
Jan 22 05:01:07 compute-1-15.local ntpdate[16657]: no server suitable for synchronization found
Jan 22 05:01:07 compute-1-6.local ntpdate[13937]: no server suitable for synchronization found
)
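
The excerpt above interleaves messages from many compute nodes, and the smartd warnings in it are logged by compute-1-11.local. As a rough sketch (not part of the original thread), a small shell function can pull out only the smartd lines and count them per host and device, which makes it easier to see whose /dev/sda is being reported. The function name `smartd_by_host` is hypothetical, and the field positions assume the syslog layout shown above.

```shell
#!/bin/sh
# smartd_by_host (hypothetical helper): read syslog-style lines on stdin and
# print "host device count" for every smartd "Device:" message.
# Assumes the layout: Mon DD HH:MM:SS host smartd[pid]: Device: /dev/xxx, ...
smartd_by_host() {
    awk '
        $5 ~ /^smartd\[/ && $6 == "Device:" {
            dev = $7
            sub(/,$/, "", dev)          # strip the trailing comma after the device
            seen[$4 " " dev]++          # key is "host device"
        }
        END { for (k in seen) print k, seen[k] }
    ' | sort
}

# Example (read-only, run as a user who can read the log):
#   smartd_by_host < /var/log/messages
```

This only reads the log; it does not touch the disks.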

pschaff
Retired Moderator
Posts: 18276
Joined: 2006/12/13 20:15:34
Location: Tidewater, Virginia, North America

software RAID-1 data recovery

Postby pschaff » 2012/01/30 10:34:18

Welcome to the CentOS fora. Please see the recommended reading for new users linked in my signature.

I'm a bit confused as to why /proc/mdstat does not show the RAID.

It appears /dev/sda has serious problems and may not be recoverable. If it contains the missing data, and cannot be successfully mounted, then you may want to look at some rescue tools such as the following from RPMforge:

dd_rescue
dd_rhelp
ddrescue

I'm not sure if these are available for CentOS-4. If they are not, your best bet may be to attach the disks to a system with a later OS installation. Whatever you do, writing to the disks should be avoided.
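
As an illustration of how one of these tools is typically used, here is a minimal dry-run sketch for GNU ddrescue. It only prints the commands and does not run them. The image and mapfile paths are hypothetical, and the sketch assumes a recent GNU ddrescue and a separate, healthy disk with enough free space to hold the image.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a two-pass GNU ddrescue plan.
# All paths are hypothetical; check them carefully before running anything.
print_rescue_plan() {
    src=$1      # failing disk; it is only ever read from
    img=$2      # image file on a separate, healthy disk
    map=$3      # ddrescue mapfile: records progress so a run can be resumed

    # Pass 1: copy everything that reads cleanly, skipping the slow scraping phase.
    echo "ddrescue -n $src $img $map"
    # Pass 2: return to the bad areas and retry them up to 3 times.
    echo "ddrescue -r3 $src $img $map"
}

print_rescue_plan /dev/sda /mnt/external/sda.img /mnt/external/sda.map
```

Working from the image afterwards, rather than from the failing disk, keeps further reads off the damaged hardware.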

Mounting each device independently, as you outlined, rather than as RAID members should let you access the contents, if they can be mounted successfully.
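
A minimal dry-run sketch of that idea, with the mounts made read-only so nothing is written to either disk. It prints the commands instead of running them. The mount points under /mnt are hypothetical. It also assumes the md array is not assembled at the time (an active array keeps its members busy) and that the md metadata sits at the end of each member, as with the older 0.90 format, so the filesystem starts at the beginning of the device. If the filesystem is ext3, which the thread does not state, adding `noload` to the options should also avoid replaying the journal.

```shell
#!/bin/sh
# Dry-run sketch: print read-only mount commands for each device given.
# Mount points (/mnt/<name>-ro) are hypothetical; nothing is executed here.
print_ro_mounts() {
    for dev in "$@"; do
        dir="/mnt/$(basename "$dev")-ro"
        echo "mkdir -p $dir"
        echo "mount -o ro $dev $dir"
    done
}

print_ro_mounts /dev/sda /dev/sdb
```

Using fresh mount points, instead of reusing /compute, keeps the two copies clearly separate while they are compared.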

Do note that you are running a very obsolete, insecure, and unsupported release. The current CentOS-4 release is 4.9 and the whole series reaches end of life in a few weeks.

If more help is needed then please provide more information about your system by running "./getinfo.sh" and showing us the output file.

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Postby markmh » 2012/01/31 15:56:44

Thanks Phil, I will try to mount them today and see what happens.

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Postby markmh » 2012/02/01 22:51:35

OK, here is some more info:

./getinfo.sh basic:


Basic system information.

== BEGIN uname -rmi ==
2.6.9-78.ELsmp x86_64 x86_64
== END   uname -rmi ==

== BEGIN rpm -qa \*-release\* ==
centos-release-4-4.2
rpmforge-release-0.3.6-1.el4.rf
== END   rpm -qa \*-release\* ==

== BEGIN cat /etc/redhat-release ==
CentOS release 4.4 (Final)
== END   cat /etc/redhat-release ==

== BEGIN getenforce ==
Disabled
== END   getenforce ==

== BEGIN free -m ==
             total       used       free     shared    buffers     cached
Mem:          4903       2156       2746          0         98        983
-/+ buffers/cache:       1074       3828
Swap:          996        521        474
== END   free -m ==

df -h (as root, on head node):

Filesystem            Size  Used Avail Use% Mounted on
/dev/sdc2              32G   11G   20G  36% /
/dev/sdc1             251M   51M  187M  22% /boot
none                  3.0G     0  3.0G   0% /dev/shm
/dev/sdc5             258G  242G  3.1G  99% /export
tmpfs                 1.5G  9.2M  1.5G   1% /var/lib/ganglia/rrds
/dev/md0              1.4T  1.1T  281G  79% /compute
/export/home/opt      258G  242G  3.1G  99% /home/opt

lsscsi (as root):

[0:0:0:0]    disk    ATA      ST31500341AS      CC1H  /dev/sda
[1:0:0:0]    disk    ATA      ST31500341AS      CC1H  /dev/sdb
[6:0:0:0]    disk    AMCC     9550SX-4LP DISK   3.04  /dev/sdc

umount /compute (as root):

umount: /compute : device is busy
umount: /compute : device is busy


so, as far as I can tell, it doesn't even want to be unmounted. Does this help? Is there something I need to do to be able to unmount it?


Thanks

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Postby markmh » 2012/02/01 23:02:54

also,

cat /proc/mdstat (as root):

Personalities: [raid]
md0: active raid1 sda [0] sdb [1]
1465138496 block [2/2] [UU]
unused devices: (none)
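
In /proc/mdstat, the bracketed status string has one character per member: `U` for a member that is up and `_` for one that is missing or failed, so `[UU]` above means md considers both halves of the mirror present. A hedged sketch of checking this automatically follows. The function name `list_degraded` is hypothetical, and the parsing assumes the usual kernel layout of /proc/mdstat, not the hand-typed version above.

```shell
#!/bin/sh
# list_degraded (hypothetical helper): read /proc/mdstat-style text on stdin
# and print the name of every array whose status string contains "_".
list_degraded() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdb[1] sda[0]"
        /^md[0-9]+ *:/ { name = $1; sub(/:$/, "", name); next }
        # First following line with a status string such as [UU] or [U_]
        name != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
            name = ""
        }
    '
}

# Example (read-only):
#   list_degraded < /proc/mdstat
```

No output means no array is reporting a missing member. That says nothing about whether files have gone missing inside the filesystem.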

TrevorH
Forum Moderator
Posts: 21200
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: software RAID-1 data recovery

Postby TrevorH » 2012/02/02 01:06:35

so, as far as I can tell, it doesn't even want to be unmounted. Does this help? Is there something I need to do to be able to unmount it?


I'd guess the file system is exported over NFS and some or all of your clients are using it (have it mounted). Does `showmount -d` give any clue? Or maybe it's just exported and you need to stop the nfs service.

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Postby markmh » 2012/02/03 22:01:13

OK, I stopped the jobs that were running on the nodes, and even powered off some of the nodes, but it still won't unmount. Is there something more I have to do to disconnect the nodes?

showmount -d gives:

Directories on lcpp-cluster.bgsu.edu:
/compute
/diskless/x64/RHEL4-AS/root
/export
192.168.1.0/255.255.255.0


does that tell you anything?

pschaff
Retired Moderator
Posts: 18276
Joined: 2006/12/13 20:15:34
Location: Tidewater, Virginia, North America

Re: software RAID-1 data recovery

Postby pschaff » 2012/02/03 22:06:26

What does

showmount -a
show?
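
For reference, `showmount -a` prints a header line followed by one `client:directory` pair per line. A small sketch for listing only the clients recorded against one export is shown here. The function name `clients_of` is hypothetical, the parsing assumes IPv4 addresses or plain host names (no colons in the client part), and the list the server keeps can be stale, so treat the result as a hint, not as proof that a client is still connected.

```shell
#!/bin/sh
# clients_of (hypothetical helper): read `showmount -a` output on stdin and
# print the unique clients recorded as mounting the given directory.
clients_of() {
    dir=$1
    # Skip the header line, split "client:directory" on the colon.
    awk -F: -v dir="$dir" 'NR > 1 && $2 == dir { print $1 }' | sort -u
}

# Example (run on the NFS server):
#   showmount -a | clients_of /compute
```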

markmh
Posts: 24
Joined: 2012/01/29 21:57:39

Re: software RAID-1 data recovery

Postby markmh » 2012/02/04 23:08:37

I am going to have to wait until Monday to access the machine and to get that info. But, is it possible for me to edit my previous post? I probably should not have included the address... :/

pschaff
Retired Moderator
Posts: 18276
Joined: 2006/12/13 20:15:34
Location: Tidewater, Virginia, North America

Re: software RAID-1 data recovery

Postby pschaff » 2012/02/05 13:08:40

No, not after the one-hour edit period has expired, but a moderator can do it for you. However, I see nothing but non-routable private IP addresses, which are only accessible on the LAN. There are no public IP addresses that could be reached from the Internet, unless I am (again) missing something.

We really needed to see more than "./getinfo.sh basic" - that's why I said "./getinfo.sh". It appears you are still at 4.4. An update (and a migration plan) would be highly advised, as many mirrors will stop carrying 4.x once it goes EOL.