Failed drive?

Issues related to hardware problems
nycaleksey
Posts: 4
Joined: 2017/09/14 16:17:16

Failed drive?

Post by nycaleksey » 2017/10/29 17:32:57

Hi,

I have a 9TB RAID5 MD array made of four 3TB drives. Today I saw the following messages in the logs:

Code: Select all

[471636.561493] md: data-check of RAID array md127
[484513.684815] ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
[484513.691946] ata3.00: irq_stat 0x40000008
[484513.695959] ata3.00: failed command: READ FPDMA QUEUED
[484513.701187] ata3.00: cmd 60/00:28:a8:11:7f/04:00:e1:00:00/40 tag 5 ncq 524288 in
         res 51/40:f8:b0:12:7f/00:02:e1:00:00/40 Emask 0x409 (media error) <F>
[484513.716985] ata3.00: status: { DRDY ERR }
[484513.721084] ata3.00: error: { UNC }
[484513.727777] ata3.00: configured for UDMA/133
[484513.732199] sd 3:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[484513.740029] sd 3:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor] 
[484513.747340] sd 3:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
[484513.755606] sd 3:0:0:0: [sdb] CDB: Read(16) 88 00 00 00 00 00 e1 7f 11 a8 00 00 04 00 00 00
[484513.764044] blk_update_request: I/O error, dev sdb, sector 3783201456
[484513.770633] ata3: EH complete
[484518.460033] ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
[484518.467167] ata3.00: irq_stat 0x40000008
[484518.471179] ata3.00: failed command: READ FPDMA QUEUED
[484518.476407] ata3.00: cmd 60/f8:40:b0:12:7f/02:00:e1:00:00/40 tag 8 ncq 389120 in
         res 51/40:f8:b0:12:7f/00:02:e1:00:00/40 Emask 0x409 (media error) <F>
[484518.492232] ata3.00: status: { DRDY ERR }
[484518.496329] ata3.00: error: { UNC }
[484518.503040] ata3.00: configured for UDMA/133
[484518.507437] sd 3:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[484518.515268] sd 3:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor] 
[484518.522574] sd 3:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
[484518.530849] sd 3:0:0:0: [sdb] CDB: Read(16) 88 00 00 00 00 00 e1 7f 12 b0 00 00 02 f8 00 00
[484518.539274] blk_update_request: I/O error, dev sdb, sector 3783201456
[484518.545853] ata3: EH complete
[484528.493366] md/raid:md127: read error corrected (8 sectors at 3783201456 on sdb)
[484528.500874] md/raid:md127: read error corrected (8 sectors at 3783201464 on sdb)
[484528.508370] md/raid:md127: read error corrected (8 sectors at 3783201472 on sdb)
[484528.515882] md/raid:md127: read error corrected (8 sectors at 3783201480 on sdb)
[484528.523372] md/raid:md127: read error corrected (8 sectors at 3783201488 on sdb)
[484528.530891] md/raid:md127: read error corrected (8 sectors at 3783201496 on sdb)
[484528.538378] md/raid:md127: read error corrected (8 sectors at 3783201504 on sdb)
[484528.545867] md/raid:md127: read error corrected (8 sectors at 3783201512 on sdb)
[484528.553354] md/raid:md127: read error corrected (8 sectors at 3783201520 on sdb)
[484528.560843] md/raid:md127: read error corrected (8 sectors at 3783201528 on sdb)
[493330.449005] md: md127: data-check done.
It looks like one drive (sdb) had a failure and should be replaced. However, mdadm shows no problem:

Code: Select all

root@widebox[~]# mdadm --detail /dev/md127 
/dev/md127:
           Version : 1.2
     Creation Time : Wed Sep 18 23:49:09 2013
        Raid Level : raid5
        Array Size : 8790405888 (8383.18 GiB 9001.38 GB)
     Used Dev Size : 2930135296 (2794.39 GiB 3000.46 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Sun Oct 29 13:23:15 2017
             State : clean 
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 128K

Consistency Policy : resync

              Name : blackbox.xxxx.net.:0
              UUID : c86469a8:4b2a7ce1:6eaf97d5:8fe8aa92
            Events : 2636

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd
       4       8       64        3      active sync   /dev/sde
The array status is "clean", all 4 drives are listed as "Working", and 0 are listed as "Failed" - I don't see any problem reported in this output.
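For what it's worth, the "read error corrected" lines in the log suggest md rewrote the bad sectors from parity on the other members, which would explain why the array still reports clean. A quick way to watch for this kind of self-healing (a sketch only; the md127 name is taken from the log above) is:

```shell
# Mismatches found by the most recent data-check on this array
# (sysfs path is standard for md devices)
cat /sys/block/md127/md/mismatch_cnt
# How many read errors md has corrected since boot, per the kernel log
dmesg | grep -c 'read error corrected'
```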

Is this normal? Hasn't the drive failed, and doesn't it need replacement?

Thank you.

tunk
Posts: 1206
Joined: 2017/02/22 15:08:17

Re: Failed drive?

Post by tunk » 2017/10/30 10:41:50

Can you check whether SMART reports any problems: smartctl -a /dev/sdb | more
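If the full output is overwhelming, you can filter for the attributes that most directly indicate failing media (a sketch, assuming smartmontools' standard attribute names):

```shell
# Pull only the media-health attributes: reallocated, pending, and
# uncorrectable sectors, plus interface CRC errors (a cable indicator)
smartctl -A /dev/sdb | grep -Ei 'Reallocated|Pending|Uncorrectable|CRC'
```

Non-zero raw values in Reallocated_Sector_Ct or Current_Pending_Sector are the usual early warning.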

nycaleksey
Posts: 4
Joined: 2017/09/14 16:17:16

Re: Failed drive?

Post by nycaleksey » 2017/10/30 14:02:06

tunk wrote:Can you see if SMART reports any problems: smartctl -a /dev/sdb|more
I did try looking there, but I don't see a straightforward answer to the question "has the drive failed and does it need replacement?" in the output.
Here's what I see:

Code: Select all

root@widebox[/root]# smartctl -a /dev/sdb
smartctl 6.2 2017-02-27 r4394 [x86_64-linux-3.10.0-693.5.2.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" DT01ACA... Desktop HDD
Device Model:     TOSHIBA DT01ACA300
Serial Number:    43NP9S8YS
LU WWN Device Id: 5 000039 ff4c9b065
Firmware Version: MX6OABB0
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Oct 30 09:58:23 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  36)	The self-test routine was interrupted
					by the host with a hard or soft reset.
Total time to complete Offline 
data collection: 		(21504) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 359) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   140   140   054    Pre-fail  Offline      -       69
  3 Spin_Up_Time            0x0007   187   187   024    Pre-fail  Always       -       323 (Average 289)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       156
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   117   117   020    Pre-fail  Offline      -       36
  9 Power_On_Hours          0x0012   095   095   000    Old_age   Always       -       35945
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       156
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       533
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       533
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 9/58)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 2
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 35916 hours (1496 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 f8 b0 12 7f 01  Error: UNC at LBA = 0x017f12b0 = 25105072

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 60 a8 85 83 40 00   6d+01:51:03.549  READ FPDMA QUEUED
  60 00 58 a8 81 83 40 00   6d+01:51:03.549  READ FPDMA QUEUED
  60 00 50 a8 7d 83 40 00   6d+01:51:03.533  READ FPDMA QUEUED
  60 00 48 a8 79 83 40 00   6d+01:51:03.533  READ FPDMA QUEUED
  60 00 38 a8 75 83 40 00   6d+01:51:03.527  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 35916 hours (1496 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 f8 b0 12 7f 01  Error: UNC at LBA = 0x017f12b0 = 25105072

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 40 a8 1d 7f 40 00   6d+01:50:58.698  READ FPDMA QUEUED
  60 00 38 a8 19 7f 40 00   6d+01:50:58.698  READ FPDMA QUEUED
  60 00 30 a8 15 7f 40 00   6d+01:50:58.698  READ FPDMA QUEUED
  60 00 28 a8 11 7f 40 00   6d+01:50:58.698  READ FPDMA QUEUED
  60 00 20 a8 0d 7f 40 00   6d+01:50:58.698  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      40%     35928         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
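One detail I notice in the log above: the last extended self-test was interrupted at 40%, so it never scanned the whole surface. Re-running it to completion (a sketch; the drive stays online while the test runs) would look like:

```shell
# Start a full surface scan; the polling time above says ~359 minutes
smartctl -t long /dev/sdb
# Check progress later; when done, re-read the self-test log and look
# for a Completed status and an LBA_of_first_error entry
smartctl -c /dev/sdb | grep -A1 'Self-test execution'
```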
Am I missing something?

Thank you.

northpoint
Posts: 107
Joined: 2016/05/23 11:57:12

Re: Failed drive?

Post by northpoint » 2017/10/30 18:52:03

To me it looks like the drive has not failed outright but has reported bad sectors. In my experience, drives that develop bad sectors without failing only get worse, so I would replace it. :(
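For reference, the replacement would go roughly like this (a sketch only; device names match the thread, and the new disk is assumed to come up as /dev/sdb after the swap):

```shell
# Mark the failing member as faulty and remove it from the array
mdadm /dev/md127 --fail /dev/sdb --remove /dev/sdb
# After physically swapping the disk, add the new one; md rebuilds
# the member from parity on the remaining three drives
mdadm /dev/md127 --add /dev/sdb
# Watch the rebuild progress
cat /proc/mdstat
```

With RAID5 you have no redundancy left during the rebuild, so a backup first is a good idea.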
Ryzen x1800 * Asus x370 Pro * CentOS 7.4 64bit / Icewarp /

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria
Contact:

Re: Failed drive?

Post by hunter86_bg » 2017/11/02 21:41:50

9 Power_On_Hours 0x0012 095 095 000 Old_age Always - 35945
Either the disk has bad sectors or the cable is starting to cause issues. I'd bet on the former, as this drive is quite old.
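The age works out quickly from that attribute:

```shell
# 35945 power-on hours, converted to days and years of continuous spinning
echo "$((35945 / 24)) days"                                # ~1497 days
awk 'BEGIN { printf "%.1f years\n", 35945 / 24 / 365 }'    # ~4.1 years
```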
