I saw the same disk fail on four nodes today within a short period of time, but only one of the systems rebooted.
I have two questions:
- Can this disk or its physical connection [cable or connectors] be the cause of the system crash and reboot, even though the failed disk is not an OS disk?
- The smartctl output shows some attributes, on both good and bad drives, as Unknown_Attribute, which I think could be vital information for finding out whether the drives are burned out. I tried updating smartmontools from smartmontools-6.2-8.el7.x86_64 to smartmontools-6.5-1.el7.x86_64, but no luck. I also tried running /usr/sbin/update-smart-drivedb.
Node that rebooted:
Jul 18 01:43:16 kernel: ata4.00: exception Emask 0x0 SAct 0x100000 SErr 0x0 action 0x6 frozen
Jul 18 01:43:16 kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Jul 18 01:43:16 kernel: ata4.00: cmd 61/40:a0:00:9a:c5/00:00:02:00:00/40 tag 20 ncq 32768 out
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 18 01:43:16 kernel: ata4.00: status: { DRDY }
Jul 18 01:43:16 kernel: ata4: hard resetting link
Jul 18 01:43:21 kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 18 01:43:26 kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:43:26 kernel: ata4: hard resetting link
Jul 18 01:43:31 kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 18 01:43:36 kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:43:36 kernel: ata4: hard resetting link
Jul 18 01:43:41 kernel: ata4: link is slow to respond, please be patient (ready=0)
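To compare the nodes more easily, the libata messages can be tallied per port. The following is only a rough sketch: the function name `summarize_ata_events`, the event labels, and the regular expressions are my own assumptions based on the message text shown in these logs, not anything provided by the kernel or by smartmontools.

```python
import re
from collections import Counter

# Hypothetical event labels mapped to the message text seen in the logs above.
EVENT_PATTERNS = {
    "hard_reset": re.compile(r"hard resetting link"),
    "comreset_failed": re.compile(r"COMRESET failed"),
    "slow_link": re.compile(r"link is slow to respond"),
    "speed_limited": re.compile(r"limiting SATA link speed"),
    "gave_up": re.compile(r"reset failed, giving up"),
    "disabled": re.compile(r"disabled$"),
}

# Matches "kernel: ata4: ..." and "kernel: ata4.00: ..." in syslog-style lines.
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_events(lines):
    """Return {port: Counter(event_label -> count)} for the given log lines."""
    counts = {}
    for line in lines:
        match = PORT_RE.search(line.strip())
        if not match:
            continue  # not a libata message
        port, message = match.groups()
        for label, pattern in EVENT_PATTERNS.items():
            if pattern.search(message):
                counts.setdefault(port, Counter())[label] += 1
    return counts


# Lines copied from the log of the node that rebooted.
sample = [
    "Jul 18 01:43:16 kernel: ata4: hard resetting link",
    "Jul 18 01:43:21 kernel: ata4: link is slow to respond, please be patient (ready=0)",
    "Jul 18 01:43:26 kernel: ata4: COMRESET failed (errno=-16)",
    "Jul 18 01:43:26 kernel: ata4: hard resetting link",
    "Jul 18 01:43:31 kernel: ata4: link is slow to respond, please be patient (ready=0)",
    "Jul 18 01:43:36 kernel: ata4: COMRESET failed (errno=-16)",
    "Jul 18 01:43:36 kernel: ata4: hard resetting link",
    "Jul 18 01:43:41 kernel: ata4: link is slow to respond, please be patient (ready=0)",
]

counts = summarize_ata_events(sample)
for port, events in counts.items():
    print(port, dict(events))
```

Running the same function over /var/log/messages from each node would show whether the rebooted node logged a different sequence of events than the others before going down.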
grep -i sata dmesg
[ 25.909619] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0xf impl SATA mode
[ 25.919855] ata1: SATA max UDMA/133 abar m2048@0x56200000 port 0x56200100 irq 329
[ 25.919857] ata2: SATA max UDMA/133 abar m2048@0x56200000 port 0x56200180 irq 329
[ 25.919858] ata3: SATA max UDMA/133 abar m2048@0x56200000 port 0x56200200 irq 329
[ 25.919860] ata4: SATA max UDMA/133 abar m2048@0x56200000 port 0x56200280 irq 329
[ 26.225256] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 26.225296] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 26.225344] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 80.982600] ata4: limiting SATA link speed to 3.0 Gbps
Nodes that had the same drive failure but did not reboot:
Jul 18 01:36:23 kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:36:23 kernel: ata4: hard resetting link
Jul 18 01:36:28 kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 18 01:36:58 kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:36:58 kernel: ata4: limiting SATA link speed to 3.0 Gbps
Jul 18 01:36:58 kernel: ata4: hard resetting link
Jul 18 01:37:01 systemd: Started Session 275176 of user root.
Jul 18 01:37:01 systemd: Starting Session 275176 of user root.
Jul 18 01:37:03 kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:37:03 kernel: ata4: reset failed, giving up
Jul 18 01:37:03 kernel: ata4.00: disabled
Jul 18 01:37:03 kernel: ata4: EH complete
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 4e 00 00 00 10 00
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 55 00 00 00 10 00
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 47336960
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 47338752
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 47 00 00 00 10 00
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 47335168
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 47336448
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 43 00 00 00 10 00
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 54 00 00 00 10 00
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 47334144
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 47338496
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 4b 00 00 00 10 00
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 47336192
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 16
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: md: super_written gets error=-5, uptodate=0
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 49 00 00 00 10 00
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 47335680
Jul 18 01:37:03 kernel: md/raid1:md125: Disk failure on sdi, disabling device.
md/raid1:md125: Operation continuing on 1 devices.
Jul 18 01:37:03 kernel: blk_update_request: I/O error, dev sdi, sector 47335424
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03 kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 50 00 00 00 10 00
Jul 18 01:37:03 kernel: md: super_written gets error=-5, uptodate=0
Jul 18 01:37:06 kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 51 00 00 00 10 00
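The CDB lines and the blk_update_request lines above describe the same failed writes. In a SCSI WRITE(10) command, bytes 2 to 5 hold the starting logical block address and bytes 7 to 8 hold the transfer length in blocks, both big-endian. A small sketch of decoding one of the CDBs from the log (the helper name `decode_write10` is made up for illustration):

```python
def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    Returns (starting LBA, transfer length in blocks).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


# CDB copied from the log above.
lba, blocks = decode_write10("2a 00 02 d2 4e 00 00 00 10 00")
print(lba, blocks)  # 47336960 16
```

The decoded address 47336960 is the same sector reported in one of the "I/O error, dev sdi" lines, so these are 16-block writes to sdi that failed after the port was disabled.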
smartctl -a /dev/ime_cmtl_sg1
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-693.17.1.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA THNSNJ480PCS3
Serial Number: 64MS1010T5RW
LU WWN Device Id: 5 00080d 910170124
Firmware Version: J3E16101
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Jul 18 09:16:18 2018 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 100 100 050 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0013 100 100 050 Pre-fail Always - 0
7 Unknown_SSD_Attribute 0x000b 100 100 050 Pre-fail Always - 0
8 Unknown_SSD_Attribute 0x0005 100 100 050 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 13989
10 Unknown_SSD_Attribute 0x0013 100 100 050 Pre-fail Always - 0
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 263
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0013 100 100 010 Pre-fail Always - 100
170 Unknown_Attribute 0x0013 100 100 010 Pre-fail Always - 0
173 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 203819
174 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 454275
175 Program_Fail_Count_Chip 0x0013 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 147
194 Temperature_Celsius 0x0022 074 037 000 Old_age Always - 26 (Min/Max 19/63)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
232 Available_Reservd_Space 0x0022 100 100 000 Old_age Always - 0
240 Unknown_SSD_Attribute 0x0013 100 100 050 Pre-fail Always - 0
241 Total_LBAs_Written 0x0012 100 100 000 Old_age Always - 13912
242 Total_LBAs_Read 0x0012 100 100 000 Old_age Always - 306
243 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
249 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 5003
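Since the attribute names are missing, about all I can do for now is pick out the unnamed attributes that carry a non-zero raw value so they can be compared between good and bad drives. A minimal sketch, assuming the ten-column layout of the table above; the function names are hypothetical and the script makes no claim about what these vendor attributes actually mean:

```python
def parse_smart_attributes(text):
    """Parse the 'smartctl -a' attribute table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header or unrelated line
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": fields[9].strip(),
        })
    return rows


def unnamed_nonzero(rows):
    """Return (id, raw) for attributes smartctl could not name and whose raw value is not 0."""
    return [
        (row["id"], row["raw"])
        for row in rows
        if row["name"].startswith("Unknown") and row["raw"].split()[0] != "0"
    ]


# A few rows copied from the output above.
sample_table = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0013 100 100 010 Pre-fail Always - 100
173 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 203819
194 Temperature_Celsius 0x0022 074 037 000 Old_age Always - 26 (Min/Max 19/63)
"""

rows = parse_smart_attributes(sample_table)
suspects = unnamed_nonzero(rows)
print(suspects)
```

On the full table this would single out IDs 169, 173, 174 and 249, which are the ones I would most like to have decoded.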