Server is responding very slow

mayav
Posts: 2
Joined: 2018/08/22 02:55:38

Post by mayav » 2018/08/22 03:06:09

Hi,

I have a CentOS 6 minimal install running as a KVM host, with a CentOS 6 KVM guest.

Server details:
2 x 120 GB SSD for the OS (CentOS 6) = RAID 1
4 x 2 TB HDD = RAID 5

mdstat details

cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md124 : active raid5 sdc[3] sdd[2] sde[1] sdf[0]
5567516672 blocks super external:/md0/0 level 5, 128k chunk, algorithm 0 [4/4] [UUUU]

md0 : inactive sdf[3](S) sde[2](S) sdd[1](S) sdc[0](S)
12612 blocks super external:imsm

md125 : active raid1 sda[1] sdb[0]
111357952 blocks super external:/md1/0 [2/2] [UU]

md1 : inactive sdb[1](S) sda[0](S)
6306 blocks super external:imsm

unused devices: <none>

These are the error messages seen in /var/log/messages:

Aug 22 07:53:18 localhost kernel: ata7.00: error: { ABRT }
Aug 22 07:53:18 localhost kernel: ata7: hard resetting link
Aug 22 07:53:18 localhost kernel: ata7.00: configured for UDMA/33
Aug 22 07:53:18 localhost kernel: ata7: EH complete
Aug 22 07:53:21 localhost kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Aug 22 07:53:21 localhost kernel: ata7.00: failed command: READ DMA
Aug 22 07:53:21 localhost kernel: ata7.00: cmd c8/00:08:48:14:ed/00:00:00:00:00/e0 tag 0 dma 4096 in
Aug 22 07:53:21 localhost kernel: res 01/04:00:af:88:e0/00:00:e8:00:00/e0 Emask 0x3 (HSM violation)
Aug 22 07:53:21 localhost kernel: ata7.00: status: { ERR }
Aug 22 07:53:21 localhost kernel: ata7.00: error: { ABRT }
Aug 22 07:53:21 localhost kernel: ata7: hard resetting link
Aug 22 07:53:21 localhost kernel: ata7.00: configured for UDMA/33
Aug 22 07:53:21 localhost kernel: ata7: EH complete
Aug 22 07:53:24 localhost kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Aug 22 07:53:24 localhost kernel: ata7.00: failed command: READ DMA
Aug 22 07:53:24 localhost kernel: ata7.00: cmd c8/00:08:48:14:ed/00:00:00:00:00/e0 tag 0 dma 4096 in
Aug 22 07:53:24 localhost kernel: res 01/04:00:af:88:e0/00:00:e8:00:00/e0 Emask 0x3 (HSM violation)
Aug 22 07:53:24 localhost kernel: ata7.00: status: { ERR }
Aug 22 07:53:24 localhost kernel: ata7.00: error: { ABRT }
Aug 22 07:53:24 localhost kernel: ata7: hard resetting link
Aug 22 07:53:24 localhost kernel: ata7.00: configured for UDMA/33
Aug 22 07:53:24 localhost kernel: ata7: EH complete
Aug 22 07:53:27 localhost kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Aug 22 07:53:27 localhost kernel: ata7.00: failed command: READ DMA
Aug 22 07:53:27 localhost kernel: ata7.00: cmd c8/00:08:48:14:ed/00:00:00:00:00/e0 tag 0 dma 4096 in
Aug 22 07:53:27 localhost kernel: res 01/04:00:af:88:e0/00:00:e8:00:00/e0 Emask 0x3 (HSM violation)
Aug 22 07:53:27 localhost kernel: ata7.00: status: { ERR }
Aug 22 07:53:27 localhost kernel: ata7.00: error: { ABRT }
Aug 22 07:53:27 localhost kernel: ata7: hard resetting link
Aug 22 07:53:27 localhost kernel: ata7.00: configured for UDMA/33
Aug 22 07:53:27 localhost kernel: ata7: EH complete
Aug 22 07:53:30 localhost kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Aug 22 07:53:30 localhost kernel: ata7.00: failed command: READ DMA
Aug 22 07:53:30 localhost kernel: ata7.00: cmd c8/00:08:48:14:ed/00:00:00:00:00/e0 tag 0 dma 4096 in
Aug 22 07:53:30 localhost kernel: res 01/04:00:af:88:e0/00:00:e8:00:00/e0 Emask 0x3 (HSM violation)
Aug 22 07:53:30 localhost kernel: ata7.00: status: { ERR }
Aug 22 07:53:30 localhost kernel: ata7.00: error: { ABRT }
Aug 22 07:53:30 localhost kernel: ata7: hard resetting link
Aug 22 07:53:30 localhost kernel: ata7.00: configured for UDMA/33
Aug 22 07:53:30 localhost kernel: ata7: EH complete
Aug 22 07:53:33 localhost kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6

I am not sure how to fix this issue. Any ideas?

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Server is responding very slow

Post by TrevorH » 2018/08/22 10:20:24

That looks suspiciously like a hardware problem to me. You can use https://ata.wiki.kernel.org/index.php/L ... r_messages to help decode all those error messages and see if you can work out what the root problem is. I would try different SATA cables for one thing, though I have no idea whether that will help or whether it is the disk(s) themselves that are having problems. Also try running smartctl -a /dev/sdX against each of your disks in turn and see what the stats look like.
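For anyone who wants to script the "see what the stats look like" step, here is a minimal Python sketch. It assumes the usual ten-column attribute table that smartctl prints for ATA disks. The helper names (`flag_attributes`, `read_attributes`) and the list of watched attributes are illustrative choices of this sketch, not part of smartmontools, and a non-zero raw value is only a hint that a disk deserves a closer look.

```python
import subprocess

# Attributes whose raw value is normally 0 on a healthy disk.
# This selection is an assumption of the sketch, not an official list.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def flag_attributes(smart_text, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE ...
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        if name not in watched:
            continue
        try:
            raw = int(fields[9])
        except ValueError:
            continue  # raw value is not a plain integer, skip it
        if raw != 0:
            flagged[name] = raw
    return flagged


def read_attributes(device):
    """Run `smartctl -A` on one device (needs root) and return its text output."""
    # smartctl uses its exit status as a bit mask, so a non-zero status
    # is not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return result.stdout
```

Calling `flag_attributes(read_attributes("/dev/sda"))` for each disk in turn returns an empty dict for a disk with none of the watched counters set. Substitute your own device names.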

And, if you don't have one already, now would be a very good time to be making a backup. Actually, yesterday would have been an even better time... :-(

mayav
Posts: 2
Joined: 2018/08/22 02:55:38

Re: Server is responding very slow

Post by mayav » 2018/08/23 10:07:03

Thanks for the reply.

/dev/sde has bad blocks.

smartctl -a /dev/sde shows the following output. We have started the backup process and ordered a new HDD.

smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-431.20.3.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Constellation CS
Device Model: ST2000NC001-1DY164
Serial Number: Z1E461PN
LU WWN Device Id: 5 000c50 05091d56e
Firmware Version: CN02
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Aug 23 14:34:02 2018 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 222) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   050   042   044    Pre-fail  Always   In_the_past 192174272
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           20
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           648
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always   -           208488862
  9 Power_On_Hours          0x0032   048   048   000    Old_age   Always   -           45995
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always   -           20
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always   -           16024
188 Command_Timeout         0x0032   100   100   000    Old_age   Always   -           0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   073   061   045    Old_age   Always   -           27 (Min/Max 24/30)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always   -           0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           11
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always   -           20
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always   -           27 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   092   091   000    Old_age   Always   -           1384
198 Offline_Uncorrectable   0x0010   092   091   000    Old_age   Offline  -           1384
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0

SMART Error Log Version: 1
ATA Error Count: 32046 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 32046 occurred at disk power-on lifetime: 45994 hours (1916 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 14 ed 00 Error: UNC at LBA = 0x00ed1448 = 15537224

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 14 ed e0 00 04:10:48.387 READ DMA
27 00 00 00 00 00 e0 00 04:10:48.386 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 04:10:48.386 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 04:10:48.386 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 04:10:48.386 READ NATIVE MAX ADDRESS EXT

Error 32045 occurred at disk power-on lifetime: 45994 hours (1916 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 14 ed 00 Error: UNC at LBA = 0x00ed1448 = 15537224

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 14 ed e0 00 04:10:48.387 READ DMA
27 00 00 00 00 00 e0 00 04:10:48.386 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 04:10:48.386 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 04:10:48.386 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 04:10:48.386 READ NATIVE MAX ADDRESS EXT

Error 32044 occurred at disk power-on lifetime: 45994 hours (1916 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 14 ed 00 Error: UNC at LBA = 0x00ed1448 = 15537224

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 14 ed e0 00 04:10:45.339 READ DMA
27 00 00 00 00 00 e0 00 04:10:45.338 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 04:10:45.338 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 04:10:45.338 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 04:10:45.337 READ NATIVE MAX ADDRESS EXT

Error 32043 occurred at disk power-on lifetime: 45994 hours (1916 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 14 ed 00 Error: UNC at LBA = 0x00ed1448 = 15537224

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 14 ed e0 00 04:10:45.339 READ DMA
27 00 00 00 00 00 e0 00 04:10:45.338 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 04:10:45.338 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 04:10:45.338 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 04:10:45.337 READ NATIVE MAX ADDRESS EXT

Error 32042 occurred at disk power-on lifetime: 45994 hours (1916 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 14 ed 00 Error: UNC at LBA = 0x00ed1448 = 15537224

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 14 ed e0 00 04:10:42.277 READ DMA
27 00 00 00 00 00 e0 00 04:10:42.276 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 04:10:42.276 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 04:10:42.276 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 04:10:42.275 READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
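
The LBA in the SMART error log above (0x00ed1448 = 15537224) can be cross-checked against the "cmd" lines in /var/log/messages. Below is a rough Python sketch of that check. It assumes the taskfile dump is laid out as command/feature:nsect:lbal:lbam:lbah/(five HOB bytes)/device, which is my reading of the libata message format, and it assumes 28-bit addressing, which applies to READ DMA (0xc8) but not to the EXT commands. The name `decode_lba28` is made up for this example.

```python
import re

# Matches the taskfile dump that libata logs, for example:
#   cmd c8/00:08:48:14:ed/00:00:00:00:00/e0 tag 0 dma 4096 in
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"                      # command register
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"      # feature:nsect:lbal:lbam:lbah
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"      # HOB bytes (unused for LBA28)
    r"/([0-9a-f]{2})"                         # device register
)


def decode_lba28(log_line):
    """Return (command, sector_count, lba) from a libata 'cmd' line, or None.

    Assumes a 28-bit LBA command: bits 24-27 come from the low nibble of
    the device register, the rest from lbah, lbam and lbal.
    """
    match = CMD_RE.search(log_line)
    if match is None:
        return None
    command = int(match.group(1), 16)
    _feature, nsect, lbal, lbam, lbah = (
        int(byte, 16) for byte in match.group(2).split(":")
    )
    device = int(match.group(4), 16)
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return command, nsect, lba
```

Under those assumptions, the repeated kernel log line decodes to command 0xc8 (READ DMA), 8 sectors, LBA 15537224. That is the same LBA as in the SMART error log, and 8 sectors of 512 bytes matches the "dma 4096 in" in the log.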
