I have a CentOS 7 system that refuses to boot after a PSU failure (yes, it has a new PSU now...).
In brief: smartctl reports two pending sectors on /dev/sda, and I suspect that is what is stopping the boot.
I wonder if there's a chance to recover the system rather than reinstall from scratch...
Here are some more details:
After the new PSU was installed, the system fails to boot properly: it drops into emergency mode and complains about an I/O error on /dev/sda:
blk_update_request: I/O error, dev sda, sector 105129760
XFS (dm-1): metadata I/O error, block 0x62e2db8 ("xlog_bread_noalign") error 5 numblks 8200
[root@mgmt ~]# xfs_repair /dev/sda1
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!
attempting to find secondary superblock...
.........................................................................................................................................................................................Sorry, could not find valid secondary superblock
Exiting now.
[root@mgmt ~]# lvscan
ACTIVE '/dev/centos/root' [<98.83 GiB] inherit
ACTIVE '/dev/centos/home' [<638.31 GiB] inherit
ACTIVE '/dev/centos/swap' [<7.52 GiB] inherit
[root@mgmt ~]#
[root@mgmt ~]# xfs_repair /dev/centos/root
Phase 1 - find and verify superblock...
superblock read failed, offset 53057945600, size 131072, ag 2, rval -1
fatal error -- Input/output error
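To compare these messages with each other, it helps to put the offsets in the same units. The following is only a rough sketch, assuming 512-byte logical sectors and assuming that the XFS "block" in the kernel message is counted in 512-byte units; the variable names are just for illustration. Both results are relative to the start of the logical volume, not to /dev/sda itself, so they are not expected to equal the raw disk sector in the blk_update_request line.

```shell
#!/bin/sh
# Assumption: 512-byte logical sectors.
sector_size=512

# Byte offset reported by "xfs_repair /dev/centos/root".
offset_bytes=53057945600

# XFS block number from the kernel log, given in hex.
log_block_hex=0x62e2db8

# Convert the byte offset to a sector number within the logical volume.
lv_sector=$((offset_bytes / sector_size))

# Convert the hex block number to decimal.
log_block=$((log_block_hex))

echo "xfs_repair failed at LV sector: $lv_sector"   # 103628800
echo "kernel log block (decimal):     $log_block"   # 103689656
```

If those assumptions hold, the two failures sit within roughly 61,000 sectors (about 30 MiB) of each other inside the root volume.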
And here is the smartctl status of /dev/sda:
[root@mgmt ~]# smartctl --all /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68AX9N0
Serial Number: WD-WMC1T0921526
LU WWN Device Id: 5 0014ee 6ad8ba9a1
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jul 9 22:44:07 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (39540) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 397) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       15
  3 Spin_Up_Time            0x0027   183   179   021    Pre-fail  Always       -       5808
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       173
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   023   023   000    Old_age   Always       -       56924
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       173
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       111
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       61
194 Temperature_Celsius     0x0022   107   091   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     56923         54439176
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
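The two numbers that matter most in this output are the raw Current_Pending_Sector value and the LBA of the first read error in the self-test log. A minimal sketch for pulling them out of smartctl output on stdin, assuming the layout shown above (value in the last column); the function names pending_sectors and first_error_lbas are made up for this example:

```shell
#!/bin/sh
# Print the raw Current_Pending_Sector value (last column of that attribute row).
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $NF }'
}

# Print LBA_of_first_error for every self-test log entry that ended in a read failure.
# Self-test log entries start with "#"; the LBA is the last column.
first_error_lbas() {
    awk '/^#/ && /read failure/ { print $NF }'
}

# Example usage (needs root and smartmontools, so it is left commented out):
#   smartctl -A /dev/sda          | pending_sectors
#   smartctl -l selftest /dev/sda | first_error_lbas
```

Run against the output above, this should give 2 pending sectors and a first-error LBA of 54439176. Note that this LBA differs from the sector in the kernel message (105129760), which would fit with there being two separate bad spots.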
So, two major questions here:
1. How do I check that /dev/sda has only those two troublesome sectors and no other major issues?
2. How can I try to recover this system? I would really like to make it boot properly again...