Issues related to hardware problems
-
nycaleksey
- Posts: 4
- Joined: 2017/09/14 16:17:16
by nycaleksey » 2017/10/29 17:32:57
Hi,
I have a 9TB RAID5 MD array made of four 3TB drives. Today I saw the following messages in the logs:
Code:
[471636.561493] md: data-check of RAID array md127
[484513.684815] ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
[484513.691946] ata3.00: irq_stat 0x40000008
[484513.695959] ata3.00: failed command: READ FPDMA QUEUED
[484513.701187] ata3.00: cmd 60/00:28:a8:11:7f/04:00:e1:00:00/40 tag 5 ncq 524288 in
res 51/40:f8:b0:12:7f/00:02:e1:00:00/40 Emask 0x409 (media error) <F>
[484513.716985] ata3.00: status: { DRDY ERR }
[484513.721084] ata3.00: error: { UNC }
[484513.727777] ata3.00: configured for UDMA/133
[484513.732199] sd 3:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[484513.740029] sd 3:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
[484513.747340] sd 3:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
[484513.755606] sd 3:0:0:0: [sdb] CDB: Read(16) 88 00 00 00 00 00 e1 7f 11 a8 00 00 04 00 00 00
[484513.764044] blk_update_request: I/O error, dev sdb, sector 3783201456
[484513.770633] ata3: EH complete
[484518.460033] ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
[484518.467167] ata3.00: irq_stat 0x40000008
[484518.471179] ata3.00: failed command: READ FPDMA QUEUED
[484518.476407] ata3.00: cmd 60/f8:40:b0:12:7f/02:00:e1:00:00/40 tag 8 ncq 389120 in
res 51/40:f8:b0:12:7f/00:02:e1:00:00/40 Emask 0x409 (media error) <F>
[484518.492232] ata3.00: status: { DRDY ERR }
[484518.496329] ata3.00: error: { UNC }
[484518.503040] ata3.00: configured for UDMA/133
[484518.507437] sd 3:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[484518.515268] sd 3:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
[484518.522574] sd 3:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
[484518.530849] sd 3:0:0:0: [sdb] CDB: Read(16) 88 00 00 00 00 00 e1 7f 12 b0 00 00 02 f8 00 00
[484518.539274] blk_update_request: I/O error, dev sdb, sector 3783201456
[484518.545853] ata3: EH complete
[484528.493366] md/raid:md127: read error corrected (8 sectors at 3783201456 on sdb)
[484528.500874] md/raid:md127: read error corrected (8 sectors at 3783201464 on sdb)
[484528.508370] md/raid:md127: read error corrected (8 sectors at 3783201472 on sdb)
[484528.515882] md/raid:md127: read error corrected (8 sectors at 3783201480 on sdb)
[484528.523372] md/raid:md127: read error corrected (8 sectors at 3783201488 on sdb)
[484528.530891] md/raid:md127: read error corrected (8 sectors at 3783201496 on sdb)
[484528.538378] md/raid:md127: read error corrected (8 sectors at 3783201504 on sdb)
[484528.545867] md/raid:md127: read error corrected (8 sectors at 3783201512 on sdb)
[484528.553354] md/raid:md127: read error corrected (8 sectors at 3783201520 on sdb)
[484528.560843] md/raid:md127: read error corrected (8 sectors at 3783201528 on sdb)
[493330.449005] md: md127: data-check done.
It looks like one drive (sdb) had a failure and should be replaced. However, mdadm shows no problem:
Code:
root@widebox[~]# mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Wed Sep 18 23:49:09 2013
Raid Level : raid5
Array Size : 8790405888 (8383.18 GiB 9001.38 GB)
Used Dev Size : 2930135296 (2794.39 GiB 3000.46 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Sun Oct 29 13:23:15 2017
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
Consistency Policy : resync
Name : blackbox.xxxx.net.:0
UUID : c86469a8:4b2a7ce1:6eaf97d5:8fe8aa92
Events : 2636
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
2 8 48 2 active sync /dev/sdd
4 8 64 3 active sync /dev/sde
The array state is "clean", all 4 drives are listed as "Working" and 0 are listed as "Failed", so I don't see this output reporting any problem.
Is this normal? Hasn't the drive failed, and doesn't it need to be replaced?
Thank you.
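For reference, the "read error corrected" lines in the kernel log above are what reconcile the two outputs: as far as I understand md's behaviour, during a data-check it rebuilds an unreadable block from the other members and writes it back to the same drive, and the drive only gets marked faulty if that rewrite fails. A minimal Python sketch to tally those messages per device follows; the function name and the regular expression are my own, written against the message format shown in this thread, so treat it as an assumption rather than a stable kernel interface.

```python
import re
from collections import Counter

# Matches md messages in the form seen in this thread, e.g.
#   md/raid:md127: read error corrected (8 sectors at 3783201456 on sdb)
CORRECTED = re.compile(
    r"md/raid:(?P<array>\w+): read error corrected "
    r"\((?P<count>\d+) sectors at (?P<sector>\d+) on (?P<dev>\w+)\)"
)

def tally_corrected(log_text):
    """Return {(array, device): total corrected sectors} for a kernel log."""
    totals = Counter()
    for line in log_text.splitlines():
        m = CORRECTED.search(line)
        if m:
            totals[(m["array"], m["dev"])] += int(m["count"])
    return dict(totals)

# Three lines copied from the log in this post.
sample = """\
[484528.493366] md/raid:md127: read error corrected (8 sectors at 3783201456 on sdb)
[484528.500874] md/raid:md127: read error corrected (8 sectors at 3783201464 on sdb)
[493330.449005] md: md127: data-check done.
"""
print(tally_corrected(sample))
```

Fed the full log from this post, it would count ten corrections of 8 sectors each on sdb, i.e. 80 sectors (40 KiB) that md rewrote while the array stayed "clean".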
-
tunk
- Posts: 1206
- Joined: 2017/02/22 15:08:17
by tunk » 2017/10/30 10:41:50
Can you see if SMART reports any problems: smartctl -a /dev/sdb|more
-
nycaleksey
- Posts: 4
- Joined: 2017/09/14 16:17:16
by nycaleksey » 2017/10/30 14:02:06
tunk wrote: Can you see if SMART reports any problems: smartctl -a /dev/sdb|more
I did try looking there, but I don't see a straightforward answer to the question "is the drive failed and does it need replacement?" in the output.
Here's what I see:
Code:
root@widebox[/root]# smartctl -a /dev/sdb
smartctl 6.2 2017-02-27 r4394 [x86_64-linux-3.10.0-693.5.2.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Toshiba 3.5" DT01ACA... Desktop HDD
Device Model: TOSHIBA DT01ACA300
Serial Number: 43NP9S8YS
LU WWN Device Id: 5 000039 ff4c9b065
Firmware Version: MX6OABB0
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Oct 30 09:58:23 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 36) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: (21504) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 359) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 140 140 054 Pre-fail Offline - 69
3 Spin_Up_Time 0x0007 187 187 024 Pre-fail Always - 323 (Average 289)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 156
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 117 117 020 Pre-fail Offline - 36
9 Power_On_Hours 0x0012 095 095 000 Old_age Always - 35945
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 156
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 533
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 533
194 Temperature_Celsius 0x0002 200 200 000 Old_age Always - 30 (Min/Max 9/58)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 occurred at disk power-on lifetime: 35916 hours (1496 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 f8 b0 12 7f 01 Error: UNC at LBA = 0x017f12b0 = 25105072
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 60 a8 85 83 40 00 6d+01:51:03.549 READ FPDMA QUEUED
60 00 58 a8 81 83 40 00 6d+01:51:03.549 READ FPDMA QUEUED
60 00 50 a8 7d 83 40 00 6d+01:51:03.533 READ FPDMA QUEUED
60 00 48 a8 79 83 40 00 6d+01:51:03.533 READ FPDMA QUEUED
60 00 38 a8 75 83 40 00 6d+01:51:03.527 READ FPDMA QUEUED
Error 1 occurred at disk power-on lifetime: 35916 hours (1496 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 f8 b0 12 7f 01 Error: UNC at LBA = 0x017f12b0 = 25105072
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 40 a8 1d 7f 40 00 6d+01:50:58.698 READ FPDMA QUEUED
60 00 38 a8 19 7f 40 00 6d+01:50:58.698 READ FPDMA QUEUED
60 00 30 a8 15 7f 40 00 6d+01:50:58.698 READ FPDMA QUEUED
60 00 28 a8 11 7f 40 00 6d+01:50:58.698 READ FPDMA QUEUED
60 00 20 a8 0d 7f 40 00 6d+01:50:58.698 READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 40% 35928 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Am I missing something?
Thank you.
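One way to get closer to a yes/no answer from this output is to look at the raw values of the attributes usually associated with failing media. Below is a rough Python sketch that parses the attribute table from `smartctl -a` text. The watch list (IDs 5, 196, 197, 198) is a common rule of thumb that I am assuming here, not something smartctl defines, and the column layout is taken from the output above.

```python
# IDs often watched for media problems (an assumed rule-of-thumb list).
WATCH = (5, 196, 197, 198)

def parse_attributes(text):
    """Map attribute ID -> (name, raw value) from `smartctl -a` text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, hex flag, ..., "-" or failure note, raw value.
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x") and fields[9].isdigit()):
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs

def nonzero_watched(attrs, watch=WATCH):
    """Return {name: raw} for watched attributes whose raw value is above 0."""
    return {attrs[i][0]: attrs[i][1]
            for i in watch if i in attrs and attrs[i][1] > 0}

# Rows copied from the smartctl output in this post.
sample = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
"""
attrs = parse_attributes(sample)
print(nonzero_watched(attrs))  # empty dict when all watched raw values are 0
```

For this drive the watched raw values are all 0, so the only trace of the problem in the SMART data is the two UNC entries in the ATA error log. An empty result here is not proof that the drive is healthy.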
-
northpoint
- Posts: 107
- Joined: 2016/05/23 11:57:12
by northpoint » 2017/10/30 18:52:03
To me it looks like the drive has not failed outright but has reported bad sectors. I have seen drives keep running while showing bad sectors, and I would still replace the drive. They only get worse.
-
hunter86_bg
- Posts: 2019
- Joined: 2015/02/17 15:14:33
- Location: Bulgaria
by hunter86_bg » 2017/11/02 21:41:50
9 Power_On_Hours 0x0012 095 095 000 Old_age Always - 35945
Either the disk has bad sectors or the cable is starting to cause issues. I'd bet on the former, as this drive is quite old.
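To check that the kernel log and the SMART error log earlier in the thread point at the same spot on the disk, the Read(16) CDB from the kernel log can be decoded by hand. A small Python sketch follows, assuming the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13). The function name is my own, and the note about 28 bits is my reading of the SMART summary error log, not something the output states.

```python
def decode_read16(cdb_hex):
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes.

    Returns (start_lba, sector_count).
    """
    cdb = bytes.fromhex(cdb_hex)  # spaces between byte pairs are skipped
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")     # bytes 2-9: logical block address
    count = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, count

# CDB of the first failed command in the kernel log.
lba, count = decode_read16("88 00 00 00 00 00 e1 7f 11 a8 00 00 04 00 00 00")
failed = 3783201456  # sector from the blk_update_request line

print(lba, count)                     # 3783201192 1024
print(lba <= failed < lba + count)    # True: the bad sector is inside the request
# Assumption: the SMART summary error log keeps only the low 28 LBA bits.
print(failed & 0x0FFFFFFF == 0x017F12B0)  # True
```

So the SMART entry "UNC at LBA = 0x017f12b0" and kernel sector 3783201456 (0xe17f12b0) appear to be the same location, with the SMART summary log dropping the upper bits. That fits a media error on the disk better than a cable problem, as does UDMA_CRC_Error_Count staying at 0.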