Anatomy of a Drive Failure

I’ve lost a lot of hard drives over the years, but I’ve never really had the ability to put one under the microscope, so to speak, to see what happened and what I could have done to detect the failure before it became a problem. In general, even an extra 24 hours’ notice would greatly reduce the amount of data lost and reduce the pain involved in replacing failed drives. Drive makers understand this, and added the S.M.A.R.T. drive monitoring standard to drives years ago. Under Linux, the smartmontools package provides a number of tools for monitoring drives’ SMART status; I’ve been increasingly vigilant about running it on all of my systems, hoping that it’ll let me spot drive failures before data loss occurs.
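
For reference, the monitoring half of the package is the smartd daemon, configured through /etc/smartd.conf. A minimal sketch of an entry (not my exact config; the device name and test schedule are just examples):

# Monitor all SMART attributes, enable automatic offline data collection
# and attribute autosave, run a short self-test daily at 2am and a long
# self-test Saturdays at 3am, and mail warnings to root.
/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root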

I lost another drive this week. This is the first drive I’ve lost that was actively monitored by smartmontools the entire time, and the logs produced are instructive. Unfortunately, I didn’t pay close enough attention to SMART to prevent data loss, but there are a number of lessons contained in the logs. By understanding the precursors of this drive failure, we should be able to react faster when faced with future failures.

First, here are the basic specs on the system and drives involved:

  • Athlon 700 (slot A)
  • 384 MB RAM (PC133)
  • Via KT133 chipset (Asus K7A MB, I think)
  • 3ware 7500-8 8-channel IDE RAID controller
  • 3 Maxtor 160 GB drives, 1 Hitachi 160 GB drive

The drive that failed was a Maxtor, on channel #2. Here’s what smartmontools 5.30 has to say about the drive in its current condition. Since the drive sits behind the 3ware card, smartctl has to be pointed at the controller device and given a port number; the invocation looks something like this:
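
# Port 2 on the first 3ware controller; the /dev/twe0 device node is
# an assumption about how this particular controller shows up.
smartctl -a -d 3ware,2 /dev/twe0

And the output: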

Device Model:     Maxtor 4A160J0
Serial Number:    A608B7WE
Firmware Version: RAMB1TU0
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Fri Jan  7 11:47:02 2005 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:  (  24) The self-test routine was aborted by
                                        the host.
Total time to complete Offline 
data collection:             ( 243) seconds.
Offline data collection
capabilities:                (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:        (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:    (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  99) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   214   214   063    Pre-fail  Always       -       11805
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       73
  5 Reallocated_Sector_Ct   0x0033   249   249   063    Pre-fail  Always       -       41
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   252   244   187    Pre-fail  Always       -       34394
  9 Power_On_Hours          0x0032   224   224   000    Old_age   Always       -       24560
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       76
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       38
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       43456
196 Reallocated_Event_Count 0x0008   251   251   000    Old_age   Offline      -       2
197 Current_Pending_Sector  0x0008   249   249   000    Old_age   Offline      -       41
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   216   000    Old_age   Always       -       37
202 TA_Increase_Count       0x000a   253   248   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   245   180    Pre-fail  Always       -       19
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   154   148   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0

smartctl also reports a bunch of event log results after this, but they’re not completely relevant right now; the events in question didn’t occur until things started failing.
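
Those logs are easy to pull by hand when the time comes; the device arguments here are the same guess as before:

# Dump the drive's ATA error log and self-test log, respectively.
smartctl -l error -d 3ware,2 /dev/twe0
smartctl -l selftest -d 3ware,2 /dev/twe0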

Looking at the results that smartctl reports, it doesn’t look like anything is particularly wrong. None of the pre-fail attributes have dropped below their failure thresholds, and the old-age statistics make the drive look nearly new. Just looking at these numbers wouldn’t give you any indication that the drive was throwing uncorrectable read errors every few minutes.
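
For future reference, the value-versus-threshold comparison (plus the raw sector counts, which the normalized numbers can mask) is easy to script. A rough sketch, with the same assumed device arguments:

# Flag any attribute at or below its failure threshold, plus nonzero raw
# reallocated/pending/uncorrectable sector counts.  Fields in smartctl -A
# output: $1=ID, $2=name, $4=VALUE, $6=THRESH, $10=RAW_VALUE.
smartctl -A -d 3ware,2 /dev/twe0 | awk '
  $1 ~ /^[0-9]+$/ {
    if ($6 + 0 > 0 && $4 + 0 <= $6 + 0)
      print "at/below threshold:", $2
    if (($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0)
      print "nonzero raw count:", $2, "=", $10
  }'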

So, let’s move on to the syslog results. The smartd daemon from the smartmontools package monitors each of these parameters and logs changes to syslog each time it polls the drives (every half hour by default). You can look at the raw logs if you want to see the whole picture, but it’s way too long to include in its entirety here. The short version goes like this:

Dec  5 07:31:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252 
Dec  5 15:01:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Dec  5 15:31:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252 
Dec  5 20:01:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 

This pattern continues the whole time, with Seek_Time_Performance wandering between 251 and 253 and back. All 3 of my Maxtor drives do this constantly, and have since they were brand-new. It’s just noise in the logs, not a real problem. Next:

Dec  8 01:31:06 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec  8 02:01:05 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 

This is the first indication of trouble. Notice that it’s not very threatening: Hardware_ECC_Recovered barely changed, and it immediately flipped back to its old value. Plus, it’s marked as a “usage attribute,” which means the vendor doesn’t consider it a pre-failure indicator. Continuing:

Dec 13 04:50:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec 13 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 
Dec 13 06:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252 
Dec 13 07:20:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Dec 13 09:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec 13 11:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 
Dec 13 13:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252 
Dec 13 21:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec 13 21:50:57 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Dec 13 21:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 

This is the first time Hardware_ECC_Recovered showed up again after the initial occurrence on the 8th. I left the Seek_Time_Performance lines in just to show that the ECC lines aren’t particularly common; the Seek Time lines show up every couple of hours, day in, day out.
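
A quick pair of greps against syslog makes the difference in frequency obvious (the log path is an assumption; use wherever your syslog actually writes):

# Rough frequency comparison: the noisy attribute vs. the suspicious one.
grep -c 'Seek_Time_Performance changed' /var/log/messages
grep -c 'Hardware_ECC_Recovered changed' /var/log/messages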

The ECC notices continued, showing up on the 16th, 18th, and 25th, and again at 5:20 AM on the 1st. That’s where things started getting interesting:

Jan  1 03:20:57 starting scheduled Long Self-Test. 
Jan  1 03:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 251 
Jan  1 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Jan  1 05:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252 
Jan  1 05:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 253 to 252 
Jan  1 05:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252 
Jan  1 05:50:56 Self-Test Log error count increased from 0 to 1 
Jan  1 06:20:55 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Jan  1 06:20:55 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 

At this point, I hadn’t seen any actual errors yet, but the drive’s SMART self-test had spotted a bad sector. The 2nd and 3rd were basically the same; their self-tests reported that the same sector was still bad. All hell started to break loose on the 4th:

Jan  4 02:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 252 to 251 
Jan  4 02:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 252 to 253 
Jan  4 07:35:40 ATA error count increased from 980 to 981 
Jan  4 08:35:40 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Jan  5 02:05:42 starting scheduled Short Self-Test. 
Jan  5 02:35:40 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252 
Jan  5 02:35:40 Self-Test Log error count increased from 3 to 4 
Jan  5 06:36:08 SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 253 to 252 
Jan  5 06:36:08 SMART Usage Attribute: 197 Current_Pending_Sector changed from 253 to 252 
Jan  5 06:36:10 ATA error count increased from 981 to 1293 
Jan  5 07:14:45 ATA error count increased from 1293 to 2377 

By this point, I was seeing errors in the filesystem. Syslog was filling up with 3ware and XFS errors about disk problems. Things were starting to suck. On the 6th, I ordered new drives, and this morning I started installing them. I’m currently attempting to recover whatever data I can off of the bad disk.

So, there are a couple of things we can learn from this. First, if I’d been paying attention and had immediately migrated data off of the failing disk as soon as SMART told me that it had developed a bad sector, I’d probably have been okay. It took 2 or 3 days before the problem got bad enough to be visible at the filesystem level. Second, if I’d been more familiar with this particular Maxtor drive, I’d have noticed that something weird was happening when the ECC errors started climbing. None of my other Maxtor drives has ever logged an ECC message; that makes the Hardware_ECC_Recovered message look suspicious, but that probably only holds for this exact family of Maxtor drives. In a commercial environment, where I had dozens or hundreds of similar drives, I’d want to tell my log monitoring software to pay special attention to that message, because it looks like a good indicator of impending drive failure.
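
Even something as blunt as a cron’d shell script would do the job; here’s a rough sketch (the log path and recipient are assumptions, and smartd’s -m directive can send similar warnings by mail):

#!/bin/sh
# Rough sketch: mail any Hardware_ECC_Recovered activity from syslog to
# root.  Crude, but it turns a quiet log line into something visible.
matches=$(grep 'Hardware_ECC_Recovered changed' /var/log/messages)
if [ -n "$matches" ]; then
    echo "$matches" | mail -s "SMART ECC activity on $(hostname)" root
fi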

More importantly, though: if I’d been paying closer attention to my 3ware card, I would have noticed that this 4-drive RAID 5 array was running in degraded mode before the drive failed. If I’d fixed that at the time, the drive failure wouldn’t have cost me any data; the array would have dropped the failing drive and warned me, and that would have been that. Instead, I’m looking at a weekend’s worth of hassle as well as some data loss. When I get everything back up and running, I’m probably going to switch from the 3ware card’s hardware RAID 5 to software RAID 5; I trust Linux’s RAID monitoring tools more than I trust 3ware’s. Also, I was only getting ~25 MB/sec writing with the 3ware’s hardware RAID 5, while I should get closer to 100 MB/sec with software RAID 5.
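
For what it’s worth, the Linux-side monitoring I’m talking about is simple to use; the array name below is just an example:

# A degraded md array is visible at a glance...
cat /proc/mdstat
mdadm --detail /dev/md0

# ...and mdadm can watch all arrays itself, mailing on failure events.
mdadm --monitor --scan --mail=root --daemonise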

Posted by Scott Laird Sat, 08 Jan 2005 08:29:24 GMT


Comments

  1. JerryB about 11 hours later:

    As someone who monitors/supports thousands of drives from this manufacturer I can say, as you have discovered, that when you start getting a slew of ECCs it’s time to get important data off that drive. An ECC every now and then is not a big deal, though. Here we have set thresholds (which are more restrictive than Maxtor’s) for critical SMART counters, and consider the drive bad as soon as one of them is exceeded. Off the top of my head I know we watch ECCs, off-tracks, and reallocations.

  2. opjose 23 days later:

    You should always take the drive offline and run SPINRITE on the ailing device with the machine down.

    Usually spinrite will refresh the drive surface and fix the “bad bits” so that data recovery is possible (mount the drive read-only).

    It will also force a remap of the bad sector and further verify the entire drive so that it may be put back into service (note: I never then use it for important non-recoverable data…).

  3. Scott Laird 23 days later:

    Well, ignoring the fact that Spinrite doesn’t run under Linux, I’m not really sure what the point is of using a $90 data recovery program to fix a $75 drive. I mean, I guess I could buy it, build myself a Windows system, build a Spinrite boot CD, pull the failing drive from the RAID array and put it into a different PC that could run Spinrite, and then have it tell me that there are a bunch of bad sectors, but I don’t exactly see the point.

  4. Chris Samuel about 1 month later:

    Really useful article, thanks Scott!

    Just checked out one machine I recently got working with smartutils and it’s logged its first (and so far only) Hardware_ECC_Recovered message, so I’ll be keeping a close eye (and current rsync snapshots elsewhere) just in case!

    Chris

  5. Metal Gear about 1 month later:

    Those Maxtor drives are the source of all your problems. They’ve been known to fail suddenly, not gradually. Lots of people are already complaining about them; just Google around and you will see why. Like this one: http://www.hardwareguys.com/picks/harddisk.html

    Cheers..

  6. Kevin 2 months later:

    I’d have to say that the error logs you were being given were not that helpful. All I can see in there is that the ECC count was changing, but not to what, or by how much. The “VALUE” field does not seem to signify anything other than that something changed. The raw value that smartctl/smartd gives back is a lot more meaningful to my eyes.

    I’d find it very suspicious that any drive reports any significant number of hardware ECC events, because it would indicate that an error has occurred SOMEWHERE. Just because the drive could fix it before it got to the OS doesn’t mean it’s okay. Sometimes it can also point you to a bad cable (i.e. most rounded ones.)

    That said, Maxtor drives, in particular, are pieces of crap. Over the last few years I’ve observed close to a 50% failure rate within the first 2 years. The only reasons I would ever use a Maxtor drive for anything now would be if I couldn’t get a hold of anything else, or if a customer demanded it, and even then I’d buy a replacement and swap it in ASAP.

    As regards the comments regarding Spinrite: it doesn’t run under Linux, that’s true, but it boots off its own floppy/CD, and that’s all you need. The cost of the drive is not really relevant; it’s how much your data and time are worth. If Spinrite can get the drive back to a state where fsck/chkdsk/scandisk can actually work, and you might only have lost a little bit of data, then I’d say it’s worth it. Far better that than losing chunks of inode/block bitmaps and directory blocks and not being able to even get filenames back!

    Believe me, I’ve been there, and I’m not looking for a return visit any time soon.

    Just to make this even longer, I would also add that Maxtor drives seem to be especially susceptible to failure at temperatures that are only approaching the operating limits. For the last few years now, I’ve been trying to make sure that all my installations have fan cooling to stay within 25-30 degrees. Ideally, I’d use 120mm fans for longevity, airflow, and reduced noise. Even better, use a temperature-controlled fan setup with the sensors on the top surface of the drives.

    I hope it makes a difference, but I’ll only know for sure in a few years time if the drives are still running well.

  7. brendon 9 months later:

    Interesting read, as I recently went through an issue with a server that runs Linux software RAID and started using smartd.

    Regarding your last point when you say you’ll stop using the 3ware card, what card are you going to use in conjunction with software raid?

  8. Uzi2 10 months later:

    The cost of Spinrite that was mentioned here was not to compare it with the value of the data, but rather to compare it with other methods of protecting the data - such as RAID, which provides better protection (when configured to provide it).

  9. Jochen Pawletta 10 months later:

    Hi

    197 Current_Pending_Sector 0x0008 249 249 000 Old_age Offline - 41

    “Looking at the results that smartctl reports, it doesn’t look like anything is particularly wrong.”

    Sorry, but this shows there are BIG problems on the drive, I wouldn’t have used it in the first place!

  10. Brett 10 months later:

    I have seen Maxtors fail before after only 6 months use.

    I’m afraid I’m going to see my first Western Digital fail. 2 Write Errors, 5 Reallocated Sector Count, 2 Pending Sector Count, 2 Off-Line Uncorrectable Sector Count and 3 Reallocation Event Count.

    I back up twice every Friday, so I’m not worried about data loss.

  11. null about 1 year later:

    i have a samsung drive, smartctl currently states:

    Raw_Read_Error_Rate     0x000f  100  100  Pre-fail  41
    Spin_Up_Time            0x0007  100  100  Pre-fail  5824
    Start_Stop_Count        0x0032  100  100  Old_age   32
    Power_On_Half_Minutes   0x0032  100  100  Old_age   33h+49m
    Power_Cycle_Count       0x0032  100  100  Old_age   25
    Unknown_Attribute       0x0022  160  130  Old_age   26
    Temperature_Celsius     0x0022  160  130  Old_age   26
    Hardware_ECC_Recovered  0x001a  100  100  Old_age   69650862

    (all other values are zero)

    are you sure that Hardware_ECC_Recovered should not be THAT big?

  12. Bruce Allen about 1 year later:

    Your mistake was not configuring smartd to send you warning emails. See the ‘-m’ directive for smartd.conf. This would have emailed you warnings when the self-tests began failing.

    You could also have configured the 3ware monitoring tool 3dm2 to send you email warnings when a drive failed or dropped out of the array.

    Bruce Allen

  13. Tim Berger about 1 year later:

    Scott, your first mistake (obviously) was not replacing the failed disk from the 3ware raid. If the 3ware failed it out, it was toast. 3ware’s 3dm2 mail alerts work; I’ve never had a problem with them. I think it’s a mistake to move to software raid because all large disks today develop bad sectors. That’s just the way it is. As long as your raid is intact, 3ware hardware raids can transparently remap bad sectors. Software raid can’t deal with this in any useful way.

    Interesting read about smartd and such though.