amuck-landowner

Hard Drive Failure

splitice

Just a little bit crazy...
Verified Provider
Ok, Guys. You people who deal with VPS & Dedicated server hardware must learn lots about hard drive failures (Spinning Rust Buckets, not speedy silicon).

Any tips for recognizing failures before they happen? I just had two drives fail (partially, corrupt, or unknown) while maintaining a healthy SMART status. Fortunately it's RAID6, phew. Let's hope the rebuild doesn't take any others out.

One of the drives kills the HBA's kernel module after a bit of I/O, so I don't have a copy of its SMART report; I am writing it off. But here is the report from the other one, which seems healthy-ish. It just forgot its place in the array, which I don't take as a good sign.


smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda Green (Adv. Format)
Device Model: ST2000DL003-9VT166
Serial Number: 5YD46B1D
LU WWN Device Id: 5 000c50 038eaf34c
Firmware Version: CC98
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Oct 22 19:07:54 2014 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 623) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1033) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       120390944
  3 Spin_Up_Time            0x0003   083   070   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       241
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   074   060   030    Pre-fail  Always       -       26584053
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       18363
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       154
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   082   082   000    Old_age   Always       -       18
190 Airflow_Temperature_Cel 0x0022   063   046   045    Old_age   Always       -       37 (Min/Max 26/37)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       203
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1499
194 Temperature_Celsius     0x0022   037   054   000    Old_age   Always       -       37 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a   021   004   000    Old_age   Always       -       120390944
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       131614977831727
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3825181282
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1905166982

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     18363         -
# 2  Short offline       Aborted by host               90%     18363         -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Thoughts? Tips for better managing the array? This is a home NAS, so all consumer hardware and drives.
 

rds100

New Member
Verified Provider
A temperature of 37 degrees sounds worrying, but other than that I don't see any sign of failure in the output you provided.

Run a long SMART self-test on the drive? It will take a lot of time to complete though - smartctl -t long /dev/whatever

By the way why do you say it has failed? How did you come to this conclusion?
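
Once a long test finishes, the result shows up in the self-test log (smartctl -l selftest). As a rough sketch only, the log lines could be parsed like this; the function names are made up for illustration, and the regex assumes the log layout shown in the first post (test descriptions ending in "offline" or "captive"):

```python
import re
from typing import Dict, List

# Matches lines such as "# 1 Short offline Completed without error 00% 18363 -".
# ASSUMPTION: the layout matches the smartctl 5.41 output quoted in this thread.
SELFTEST_LINE = re.compile(
    r"^#\s*(\d+)\s+(\w+ (?:offline|captive))\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text: str) -> List[Dict[str, object]]:
    """Return one dict per self-test log entry found in the text."""
    entries = []
    for line in text.splitlines():
        match = SELFTEST_LINE.match(line.strip())
        if not match:
            continue  # header lines and anything else are skipped
        num, description, status, remaining, hours, lba = match.groups()
        entries.append({
            "num": int(num),
            "description": description,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            "lba_of_first_error": lba,
        })
    return entries


def needs_attention(entries: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Entries that did not finish cleanly (aborted, interrupted, or failed)."""
    return [e for e in entries if e["status"] != "Completed without error"]


# The two log lines from the first post in this thread.
SAMPLE_LOG = """\
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 18363 -
# 2 Short offline Aborted by host 90% 18363 -
"""

if __name__ == "__main__":
    for entry in needs_attention(parse_selftest_log(SAMPLE_LOG)):
        print(f"test #{entry['num']}: {entry['status']}")
```

An aborted test is inconclusive rather than a failure, so it is only flagged for a re-run here.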
 

splitice

Just a little bit crazy...
Verified Provider
A combination of logged I/O errors and the drive being marked as removed in mdadm (after a reboot; the other faulty drive froze the kernel).

I don't doubt the other drive is faulty: it throws a lot of I/O errors, and after a few smartctl calls the kernel watchdog starts killing threads. I don't doubt that the RAID card's kernel module is pretty crap either, but without that drive it's otherwise fine. I did check SMART on that drive though, and it passed too, although I didn't keep the output.

It's a toasty 30 degrees at the moment, the case is off, and there is a RAID rebuild going on. It's quite toasty, so I'm going to increase the airflow shortly. From memory I've always aimed for 25°C to 40°C; I think Google said 25-45 in their study.
 

rds100

New Member
Verified Provider
I'd take the drive out and test it in another computer. The errors might not be from the drive, but due to a motherboard, cable, or power problem.

I've had bad luck in the past with bad HDD power cables. Especially with cables that convert from molex power connector to SATA power connector - the drive would throw errors, SATA bus resets, timeouts, etc. After this power cable was replaced with a good one the errors stopped.
 

HalfEatenPie

The Irrational One
Retired Staff
Ugh. Sometimes it's the smallest thing that ruins a great thing.

I remember one time finding out it was the power cable to the entire computer that needed to be replaced.    
 

splitice

Just a little bit crazy...
Verified Provider
The WD drive that breaks the HBA I have tested on a different backplane attached to a different port. I've written that one off. I don't think it's a problem with the HBA / port, mSAS cable or power (cable or backplane). None of the other ports on the backplane / mSAS cable / power have any issues, obviously. It's still in warranty by 2 weeks, so I'll just get an RMA and get it replaced.

The Seagate though might be salvageable, although to me its Seek_Error_Rate & Raw_Read_Error_Rate seem awfully high. Not sure if @rds100 missed that. Most of the other drives are at zero, and a couple have a handful of errors, but none of the others are in the hundreds of millions.

It might be within warranty; I'll have to find the physical port and remove it post rebuild. The shitty HBA doesn't have an activity LED flash control that works. I have a good idea which drive it is, but yeah, post rebuild. The label tells me which it should be, and the temperature readings back it up (center drives are 2 degrees hotter than outer drives). Anyway, I don't want any issues until I have some redundancy.
 

rds100

New Member
Verified Provider
Seek_Error_Rate & Raw_Read_Error_Rate are perfectly fine for a Seagate drive; they are always some large numbers. They shouldn't be interpreted as large error counts; probably smartctl just doesn't know how to display them in a meaningful way.

Just look at some other Seagate, even a new one - these two are always large numbers.
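
To make that concrete, here is a minimal sketch of one commonly cited community interpretation of these Seagate raw values. It is an assumption, not something Seagate or smartmontools documents in this thread: the raw value is treated as a 48-bit field whose upper 16 bits count errors and whose lower 32 bits count operations. The function name is made up for illustration.

```python
from typing import Tuple


def split_seagate_raw(raw: int) -> Tuple[int, int]:
    """Split a 48-bit raw value into (error_count, operation_count).

    ASSUMPTION: upper 16 bits = errors, lower 32 bits = total operations
    (seeks or sector reads). This is a community interpretation only.
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("expected a 48-bit raw value")
    return raw >> 32, raw & 0xFFFFFFFF


if __name__ == "__main__":
    # Raw values taken from the smartctl output in the first post.
    for name, raw in (
        ("Raw_Read_Error_Rate", 120390944),
        ("Seek_Error_Rate", 26584053),
    ):
        errors, operations = split_seagate_raw(raw)
        print(f"{name}: {errors} errors over {operations} operations")
```

Under that assumption both values from the first post are below 2^32, so the error half is zero and the large number is just an operation counter.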
 

lbft

New Member
IMHO it's always a good idea to run regular SMART self tests - a full test after purchase, a conveyance test after any physical move (if the drive supports it), regular short self tests (maybe weekly) and occasional long self tests (maybe monthly). You can do it in a cron but smartd (likely part of your distro's smartmontools package) will do it for you as well as emailing you about any failures or errors (I particularly like to keep an eye on the reallocated sector count and the temperature, although 37 C for a busy drive doesn't seem so terrible to me).

When you run self tests, do it at less busy times because it will screw up performance.

Of course, this may not have caught this failure - drives can and do fail without giving any warning, and from what I can see nothing looks bad in that smartctl output. As a 2007 Google research paper noted, it's really hard to accurately predict drive failures:

Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives. Figure 14 shows that even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables.
One interesting thing about that study is that they also found temperature has less of an impact than had been suggested elsewhere - perhaps that 37 C temperature isn't worth freaking out over even if it would be nice for it to be lower.
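
For anyone who wants to script that kind of check, here is a rough sketch that parses attribute lines from smartctl -A output and flags a few counters. The watched IDs (5, 187, 197, 198) are an assumption chosen for illustration, not the exact four signals from the Google paper, and the helper names are hypothetical.

```python
import re
from typing import Dict, List, NamedTuple


class Attr(NamedTuple):
    name: str
    value: int
    worst: int
    thresh: int
    raw: int


# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, leading raw number.
ATTR_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# ASSUMPTION: counters treated as worth a warning when their raw value is nonzero.
WATCHED_IDS = (5, 187, 197, 198)


def parse_attributes(text: str) -> Dict[int, Attr]:
    """Map attribute ID to its parsed fields for every attribute line found."""
    attrs = {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            attr_id, name, value, worst, thresh, raw = match.groups()
            attrs[int(attr_id)] = Attr(name, int(value), int(worst), int(thresh), int(raw))
    return attrs


def find_warnings(attrs: Dict[int, Attr]) -> List[str]:
    """Flag attributes at or below threshold, and watched counters above zero."""
    warnings = []
    for attr_id, attr in sorted(attrs.items()):
        # A threshold of 0 means the attribute can never trip, so skip it.
        if attr.thresh > 0 and attr.value <= attr.thresh:
            warnings.append(f"{attr.name}: value {attr.value} <= threshold {attr.thresh}")
        if attr_id in WATCHED_IDS and attr.raw > 0:
            warnings.append(f"{attr.name}: raw count {attr.raw}")
    return warnings


# A few lines copied from the report in the first post.
SAMPLE = """
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 063 046 045 Old_age Always - 37 (Min/Max 26/37)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
"""

if __name__ == "__main__":
    print(find_warnings(parse_attributes(SAMPLE)) or "no warnings")
```

On the lines from the first post this reports nothing, which matches the point above: a clean report does not mean the drive is healthy, only that these particular counters are quiet.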
 

splitice

Just a little bit crazy...
Verified Provider
@rds100: I wasn't aware of that. FYI to others, http://sgros.blogspot.com.au/2013/01/seagate-disk-smart-values.html

@lbft Regular SMART short tests are already run at 5:30AM every day :) I think I got the base script off here or LET. It caught one of the drives pre-fail ages ago.

My gut is screaming at me not to trust that drive, but then again maybe I am paranoid. Most of the drives in that server have done at least 3,000km in my car over their lives, which probably isn't all that good for them. Nor is the 46 degree Adelaide heat wave (even if it is cooler inside), or just the general Aussie summer. I expect they have got a bit toasty at times. I certainly don't run the air-con just for the server to heat it back up.

FYI

Three of the drives are going strong with 5 years of powered-on hours. Of that, they spent 1-2 years as external drives, probably getting all manner of beatings (as only drives used to swap media do). All three are WD Greens. Gotta say, everyone cries foul about them, but some of them do seem to beat the odds.
 