System rebooted automatically while trying to hard reset an ATA link

Post by pharthiphan » 2018/07/18 13:53:24

The system rebooted automatically after trying to hard reset an ATA link. I have four drives: the OS disk is mirrored, and the other two drives are mirrored for data. The failed disk is not an OS disk.

I saw the same disk fail on four nodes today within a short period of time, but only one of the systems rebooted.

I have two questions:

- Can this disk or its physical connection (cable or connectors) be the cause of the system crash and reboot, even though the failed disk is not an OS disk?
- The smartctl output shows some attributes as Unknown_Attribute, on both good and bad drives. I think these could be vital for finding out whether the drives are worn out. I tried updating smartmontools from smartmontools-6.2-8.el7.x86_64 to smartmontools-6.5-1.el7.x86_64, but no luck. I also tried running /usr/sbin/update-smart-drivedb, which did not help either :(

Node that rebooted:

Jul 18 01:43:16  kernel: ata4.00: exception Emask 0x0 SAct 0x100000 SErr 0x0 action 0x6 frozen
Jul 18 01:43:16  kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Jul 18 01:43:16  kernel: ata4.00: cmd 61/40:a0:00:9a:c5/00:00:02:00:00/40 tag 20 ncq 32768 out
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 18 01:43:16  kernel: ata4.00: status: { DRDY }
Jul 18 01:43:16  kernel: ata4: hard resetting link
Jul 18 01:43:21  kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 18 01:43:26  kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:43:26  kernel: ata4: hard resetting link
Jul 18 01:43:31  kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 18 01:43:36  kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:43:36  kernel: ata4: hard resetting link
Jul 18 01:43:41  kernel: ata4: link is slow to respond, please be patient (ready=0)

grep -i sata dmesg
[   25.909619] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0xf impl SATA mode
[   25.919855] ata1: SATA max UDMA/133 abar m2048@0x56200000 port 0x56200100 irq 329
[   25.919857] ata2: SATA max UDMA/133 abar m2048@0x56200000 port 0x56200180 irq 329
[   25.919858] ata3: SATA max UDMA/133 abar m2048@0x56200000 port 0x56200200 irq 329
[   25.919860] ata4: SATA max UDMA/133 abar m2048@0x56200000 port 0x56200280 irq 329
[   26.225256] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   26.225296] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   26.225344] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   80.982600] ata4: limiting SATA link speed to 3.0 Gbps

Nodes that had the same drive failure but did not reboot:

Jul 18 01:36:23  kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:36:23  kernel: ata4: hard resetting link
Jul 18 01:36:28  kernel: ata4: link is slow to respond, please be patient (ready=0)
Jul 18 01:36:58  kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:36:58  kernel: ata4: limiting SATA link speed to 3.0 Gbps
Jul 18 01:36:58  kernel: ata4: hard resetting link
Jul 18 01:37:01  systemd: Started Session 275176 of user root.
Jul 18 01:37:01  systemd: Starting Session 275176 of user root.
Jul 18 01:37:03  kernel: ata4: COMRESET failed (errno=-16)
Jul 18 01:37:03  kernel: ata4: reset failed, giving up
Jul 18 01:37:03  kernel: ata4.00: disabled
Jul 18 01:37:03  kernel: ata4: EH complete
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 4e 00 00 00 10 00
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 55 00 00 00 10 00
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 47336960
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 47338752
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 47 00 00 00 10 00
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 47335168
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 47336448
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 43 00 00 00 10 00
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 54 00 00 00 10 00
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 47334144
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 47338496
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 4b 00 00 00 10 00
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 47336192
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 16
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: md: super_written gets error=-5, uptodate=0
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 49 00 00 00 10 00
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 47335680
Jul 18 01:37:03  kernel: md/raid1:md125: Disk failure on sdi, disabling device.
md/raid1:md125: Operation continuing on 1 devices.
Jul 18 01:37:03  kernel: blk_update_request: I/O error, dev sdi, sector 47335424
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 18 01:37:03  kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 50 00 00 00 10 00
Jul 18 01:37:03  kernel: md: super_written gets error=-5, uptodate=0
Jul 18 01:37:06  kernel: sd 9:0:0:0: [sdi] CDB: Write(10) 2a 00 02 d2 51 00 00 00 10 00
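
To compare the rebooted node against the ones that stayed up, it may help to count the libata error-handling events per port in each node's log. The following is only a rough sketch: the function name and event labels are my own, and it assumes syslog lines in the same "kernel: ataN: ..." form as the excerpts above.

```python
import re
from collections import Counter

# Matches kernel lines such as "kernel: ata4: hard resetting link"
# and "kernel: ata4.00: disabled" (the ".00" device suffix is optional).
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

# Message prefixes seen in the logs above, mapped to short labels
# (the labels are made up for this sketch).
EVENTS = {
    "hard resetting link": "hard_reset",
    "link is slow to respond": "slow_link",
    "COMRESET failed": "comreset_failed",
    "limiting SATA link speed": "speed_limited",
    "reset failed, giving up": "gave_up",
    "disabled": "disabled",
}


def summarize_ata_events(lines):
    """Return {port: Counter of event labels} for the given log lines."""
    summary = {}
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # not an ataN kernel message
        port, message = match.groups()
        for prefix, label in EVENTS.items():
            if message.startswith(prefix):
                summary.setdefault(port, Counter())[label] += 1
                break
    return summary


if __name__ == "__main__":
    sample = [
        "Jul 18 01:36:58  kernel: ata4: COMRESET failed (errno=-16)",
        "Jul 18 01:36:58  kernel: ata4: hard resetting link",
        "Jul 18 01:37:03  kernel: ata4: reset failed, giving up",
        "Jul 18 01:37:03  kernel: ata4.00: disabled",
    ]
    for port, counts in summarize_ata_events(sample).items():
        print(port, dict(counts))
```

On the nodes that survived, the summary should show a "gave_up" and a "disabled" event for ata4, whereas the excerpt from the rebooted node ends while the link is still being reset.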

smartctl -a /dev/ime_cmtl_sg1
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-693.17.1.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA THNSNJ480PCS3
Serial Number:    64MS1010T5RW
LU WWN Device Id: 5 00080d 910170124
Firmware Version: J3E16101
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 18 09:16:18 2018 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0013   100   100   050    Pre-fail  Always       -       0
  7 Unknown_SSD_Attribute   0x000b   100   100   050    Pre-fail  Always       -       0
  8 Unknown_SSD_Attribute   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       13989
 10 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       263
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       100
170 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       0
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       203819
174 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       454275
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       147
194 Temperature_Celsius     0x0022   074   037   000    Old_age   Always       -       26 (Min/Max 19/63)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0022   100   100   000    Old_age   Always       -       0
240 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       13912
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       306
243 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
249 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       5003
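
Since the drive is not in the smartctl database, one way to still make use of the unnamed attributes is to pull them out of the table and compare their raw values between good and bad drives. Below is a minimal sketch of that idea. The function names are hypothetical, and it assumes the saved text has the same column layout as the `smartctl -a` output above.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type,
# updated, when_failed, raw value (the raw value may contain spaces).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_attributes(text):
    """Parse the attribute table of `smartctl -a` output into a dict keyed by ID."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank, or information-section line
        attr_id, name, _flag, value, worst, thresh, _type, _upd, _failed, raw = match.groups()
        rows[int(attr_id)] = {
            "name": name,
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "raw": raw.strip(),
        }
    return rows


def unknown_attributes(rows):
    """Keep only the attributes smartctl could not name."""
    return {i: r for i, r in rows.items() if r["name"].startswith("Unknown_")}


if __name__ == "__main__":
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       13989
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       203819
"""
    for attr_id, row in unknown_attributes(parse_attributes(sample)).items():
        print(attr_id, row["raw"])
```

Running this over saved output from each of the four nodes would show which unnamed raw values differ between the failed and healthy drives, even without knowing what the vendor calls them.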
