Is my SATA disk going bad?

MarkEHansen
Posts: 118
Joined: 2005/11/25 02:50:31
Location: Sacramento, CA

Post by MarkEHansen » 2015/12/20 18:35:08

I'm running CentOS 7.2.1511 and have been getting errors from time to time which look like this:

Code:

 WARNING:  Kernel Errors Present
    ata3.00: irq_stat 0x08000000, interface fatal error ...:  1 Time(s)
    ata3: SError: { CommWake 10B8 ...:  1 Time(s)
 
 1 Time(s): ata3.00: cmd 61/00:00:f8:8e:aa/04:00:03:00:00/40 tag 0 ncq 524288 out
 1 Time(s): ata3.00: cmd 61/00:08:f8:92:aa/04:00:03:00:00/40 tag 1 ncq 524288 out
 1 Time(s): ata3.00: cmd 61/00:10:f8:96:aa/04:00:03:00:00/40 tag 2 ncq 524288 out
 ...
 1 Time(s): ata3.00: cmd 61/00:f0:f8:8a:aa/04:00:03:00:00/40 tag 30 ncq 524288 out
 1 Time(s): ata3.00: configured for UDMA/133
 1 Time(s): ata3.00: exception Emask 0x10 SAct 0x7fffffff SErr 0x4c0000 action 0x6 frozen
 31 Time(s): ata3.00: failed command: WRITE FPDMA QUEUED
 31 Time(s): ata3.00: status: { DRDY }
 1 Time(s): ata3: EH complete
 1 Time(s): ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 1 Time(s): ata3: hard resetting link

First, does ata3 mean that it is the disk attached to the SATA-3 port on the motherboard?

The disk attached to SATA-3 is a secondary disk (not used for the OS partitions). I'm trying to check its health but am running into problems and could use some help. When I umount it, I can run fsck, but within a few minutes the disk is mounted again. I'm guessing something auto-mounts it? Here is what I see in /var/log/messages:

Code:

Dec 20 10:32:06 stargate kernel: sdb: sdb1
Dec 20 10:32:07 stargate systemd: Mounting /meh...
Dec 20 10:32:07 stargate kernel: XFS (sdb1): Mounting V4 Filesystem
Dec 20 10:32:07 stargate kernel: XFS (sdb1): Ending clean mount
Dec 20 10:32:07 stargate systemd: Mounted /meh.

How can I unmount the disk so I can check it?

While it is unmounted, I run fsck against the disk device (/dev/sdb) and get the following error:

Code:

# fsck /dev/sdb
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sdb

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

Does this mean the disk is bad, or that I'm not checking it correctly?

Thanks for any help.
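
On the ata3 question: as far as I know, ataN is the port number assigned by the kernel's libata layer, and it need not match the label printed on the motherboard. On kernels that expose the port in sysfs, the resolved path of /sys/block/sdX contains the ataN component, so it can be read from there. A minimal sketch; the helper name ata_port_of and the sample path are made up for illustration:

```shell
# Hypothetical helper: print the ataN component of a resolved sysfs path.
# Prints nothing when the path has no such component.
ata_port_of() {
    printf '%s\n' "$1" | sed -n 's|.*/\(ata[0-9][0-9]*\)/.*|\1|p'
}

# Illustrative path only; real PCI addresses and host numbers will differ.
sample='/sys/devices/pci0000:00/0000:00:1f.2/ata3/host2/target2:0:0/2:0:0:0/block/sdb'
ata_port_of "$sample"
```

On a live system, something like ata_port_of "$(readlink -f /sys/block/sdb)" should show which kernel port sdb sits on.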

gerald_clark
Posts: 10642
Joined: 2005/08/05 15:19:54
Location: Northern Illinois, USA

Re: Is my SATA disk going bad?

Post by gerald_clark » 2015/12/21 13:56:48

You need to fsck the block device that contains the filesystem (the partition, e.g. /dev/sdb1), not the raw drive (/dev/sdb).
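
To make that concrete: the partition shown in the /var/log/messages excerpt above is sdb1, and the filesystem type on it decides which checker applies. A rough sketch, assuming the type is read with lsblk -no FSTYPE; the function name check_command_for is hypothetical, and the commands it prints are the read-only (no-modify) forms:

```shell
# Hypothetical helper: suggest a read-only check command for a device,
# given the filesystem type reported for it (empty if none was detected).
check_command_for() {
    dev=$1
    fstype=$2
    case "$fstype" in
        xfs)            echo "xfs_repair -n $dev" ;;
        ext2|ext3|ext4) echo "e2fsck -n $dev" ;;
        "")             echo "no filesystem detected on $dev (whole disk instead of a partition?)" ;;
        *)              echo "fsck -N $dev" ;;
    esac
}

# The log above shows XFS on sdb1.
check_command_for /dev/sdb1 xfs
```

On a real system the second argument would come from the device itself, e.g. check_command_for /dev/sdb1 "$(lsblk -no FSTYPE /dev/sdb1)". Whichever checker is suggested should only be run while the filesystem is unmounted.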

SamLee
Posts: 66
Joined: 2015/08/05 08:13:26
Location: ShenZhen China

Re: Is my SATA disk going bad?

Post by SamLee » 2015/12/21 14:13:31

Hi MarkEHansen,

Good day.
You can also use S.M.A.R.T. to check the health of your disks.

For example (I am using sda here; substitute your actual disk):
# smartctl --all /dev/sda
# smartctl --test=long /dev/sda

Regards,
Sam
Study More, work hard

MarkEHansen
Posts: 118
Joined: 2005/11/25 02:50:31
Location: Sacramento, CA

Re: Is my SATA disk going bad?

Post by MarkEHansen » 2015/12/21 17:27:24

I ran fsck /dev/sdb1 and got back the following:

Code:

fsck /dev/sdb1
fsck from util-linux 2.23.2
If you wish to check the consistency of an XFS filesystem or
repair a damaged filesystem, see xfs_repair(8).

I don't really know what this is telling me.

I also ran smartctl --test=long /dev/sdb and got back the following:

Code:

smartctl --test=long /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.3.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 263 minutes for test to complete.
Test will complete after Mon Dec 21 11:06:19 2015

Use smartctl -X to abort test.

It looks like it's going to take a while to run. Note that in both cases, I haven't unmounted the disk partition.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Is my SATA disk going bad?

Post by TrevorH » 2015/12/21 18:46:59

You should never run fsck against a mounted filesystem.
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are deadest, do not use them.
Use the FAQ Luke
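
A quick way to confirm that a partition really is unmounted before running a checker is to look the device up in the mount table. A sketch, assuming the device appears in /proc/mounts under the same name you pass in; is_mounted is a made-up helper and the sample table is illustrative:

```shell
# Hypothetical helper: succeed if device $1 appears in the first column of a
# mount table (in /proc/mounts format) read from standard input.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }'
}

# Sample mount-table lines; on a real system read /proc/mounts instead.
table='/dev/sda1 /boot xfs rw,relatime 0 0
/dev/sdb1 /meh xfs rw,relatime 0 0'

if printf '%s\n' "$table" | is_mounted /dev/sdb1; then
    echo "/dev/sdb1 is mounted; unmount it before checking"
fi
```

Against the live table that would be is_mounted /dev/sdb1 < /proc/mounts. findmnt -S /dev/sdb1 gives similar information if it is installed.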

MarkEHansen
Posts: 118
Joined: 2005/11/25 02:50:31
Location: Sacramento, CA

Re: Is my SATA disk going bad?

Post by MarkEHansen » 2015/12/21 20:20:05

I don't know how to tell if the smartctl background test is done. It is past the time it said it would be done, so I ran smartctl -l error and got the following:

Code:

 smartctl -l error /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.3.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

Does this mean my disk is fine, or that I didn't run the correct type of test on it?

What is the best way to determine whether or not my disk is having problems? If I need to unmount the file system to run a test, can someone please tell me how to do that? (As I said above, after unmounting, it just automatically mounts again.)

Thanks,
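
On telling whether the background test has finished: the self-test log (smartctl -l selftest) lists the most recent test as entry # 1, and while a test is still running, smartctl -c reports "Self-test routine in progress" in the execution status. A small sketch that picks those two things out of smartctl output; the function names are hypothetical and the sample lines are copied from the self-test log in the smartctl -a output posted in this thread:

```shell
# Hypothetical helpers; both read smartctl output on standard input.

# Print the newest self-test log entry (the line numbered "# 1").
newest_selftest() {
    awk '/^[#] *1 /'
}

# Succeed if the execution status says a self-test is still running.
selftest_running() {
    grep -q 'Self-test routine in progress'
}

log='# 1  Extended offline    Completed without error       00%      4870         -
# 2  Short offline       Completed without error       00%      4827         -'

printf '%s\n' "$log" | newest_selftest
```

On a live system: smartctl -l selftest /dev/sdb | newest_selftest, and smartctl -c /dev/sdb | selftest_running && echo "still running". Note that smartctl -l error shows the drive's error log, not the self-test results.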

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Is my SATA disk going bad?

Post by TrevorH » 2015/12/21 20:25:05

Can you post the output from smartctl -a /dev/sdb?

MarkEHansen
Posts: 118
Joined: 2005/11/25 02:50:31
Location: Sacramento, CA

Re: Is my SATA disk going bad?

Post by MarkEHansen » 2015/12/21 21:18:15

Yes, here it is:

Code:

smartctl -a /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.3.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD20EZRX-00D8PB0
Serial Number:    WD-WCC4M2UEY5EJ
LU WWN Device Id: 5 0014ee 2606cd0b6
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Dec 21 13:18:46 2015 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (25980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 263) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   173   173   021    Pre-fail  Always       -       4308
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   093   000    Old_age   Always       -       4872
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       41
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7796
194 Temperature_Celsius     0x0022   116   106   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       13
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4870         -
# 2  Short offline       Completed without error       00%      4827         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Is my SATA disk going bad?

Post by TrevorH » 2015/12/21 21:32:00

The important indicators there are all zero, so I think it looks OK (I'm looking at attributes 5, 7, and 196-198).

Given that the original error that you quoted said interface error, I'd be tempted to change the cable if you have a spare one lying around.
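
For anyone who wants to repeat that check, the attribute table can be filtered for the IDs mentioned above. A minimal sketch, assuming the ten-column layout that smartctl -A prints here; watch_attrs is a made-up name, and adding attribute 199 (UDMA_CRC_Error_Count) to the list is my own choice, since it counts CRC errors on the link between drive and controller. The sample rows are copied from the output posted above:

```shell
# Hypothetical helper: from a smartctl attribute table on standard input,
# print ID, name, and raw value for watched attributes with a non-zero raw value.
watch_attrs() {
    awk '$1 ~ /^(5|7|196|197|198|199)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}

rows='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       13'

printf '%s\n' "$rows" | watch_attrs
# prints: 199 UDMA_CRC_Error_Count 13
```

A non-zero count on 199, especially one that keeps growing, would be consistent with the cable suggestion, though the counter is cumulative, so the 13 here may be old. Raw value encodings vary by vendor, so a non-zero result is a prompt to look closer rather than a verdict.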

MarkEHansen
Posts: 118
Joined: 2005/11/25 02:50:31
Location: Sacramento, CA

Re: Is my SATA disk going bad?

Post by MarkEHansen » 2015/12/21 21:40:06

Thanks. Was there something I needed to do to get those error values to fill in properly, or are they automatically filled in? I'm just wondering if I needed to run something (correctly) before those fields would contain valid values, and the zeros may just mean that the appropriate test(s) haven't yet been run.

I'll look at changing the cable.
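
On whether those values fill in by themselves: per the smartctl man page, the UPDATED column says when the drive refreshes each attribute. Always means during normal operation as well as offline testing; Offline means only during offline data collection. A sketch that lists the Offline-only ones, assuming the ten-column layout shown in the posted table; offline_only is a hypothetical name and the rows are copied from that output:

```shell
# Hypothetical helper: print ID and name of attributes that the drive
# only refreshes during offline data collection (UPDATED column = Offline).
offline_only() {
    awk '$8 == "Offline" { print $1, $2 }'
}

rows='197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       13
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0'

printf '%s\n' "$rows" | offline_only
```

The posted output also reports that offline data collection completed without error and that Auto Offline Data Collection is enabled, so the zeros are likely real readings rather than unfilled fields.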
