I’ve lost a lot of hard drives over the years, but I’ve never really had the ability to put one under the microscope, so to speak, to see what happened and what I could have done to detect the failure before it became a problem. In general, even an extra 24 hours’ notice would greatly reduce the amount of data lost and the pain involved in replacing failed drives. Drive makers understand this, and added the S.M.A.R.T. drive monitoring standard to drives years ago. Under Linux, the smartmontools package provides a number of tools for monitoring drives’ SMART status; I’ve been increasingly vigilant about running it on all of my systems, hoping that it’ll let me spot drive failures before data loss occurs.
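
To make that concrete: the actual monitoring is done by smartd, the daemon half of smartmontools, driven by /etc/smartd.conf. Something like the line below covers a plain IDE disk; treat it as a minimal sketch, since the device name, the self-test schedule, and the mail address are all placeholders you’ll want to adjust:

# Track all SMART attributes on /dev/hda, run a short self-test every night
# at 2 AM and a long self-test Saturday mornings at 3 AM, and mail root when
# something looks wrong.
/dev/hda -a -s (S/../.././02|L/../../6/03) -m root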

I lost another drive this week. This is the first drive I’ve lost that was being actively monitored by smartmontools the entire time, and the logs it produced are instructive. Unfortunately, I didn’t pay close enough attention to SMART to prevent data loss, but there are a number of lessons buried in those logs. By understanding the precursors of this drive failure, we should be able to react more quickly when faced with future failures.

First, here are the basic specs on the system and drives involved:

  • Athlon 700 (slot A)
  • 384 MB RAM (PC133)
  • Via KT133 chipset (Asus K7A MB, I think)
  • 3ware 7500-8 8-channel IDE RAID controller
  • 3 Maxtor 160 GB drives, 1 Hitachi 160 GB drive

The drive that failed was a Maxtor, on channel #2. Here’s what smartmontools 5.30 has to say about the drive in its current condition:

Device Model:     Maxtor 4A160J0
Serial Number:    A608B7WE
Firmware Version: RAMB1TU0
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Fri Jan  7 11:47:02 2005 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:  (  24) The self-test routine was aborted by
                                        the host.
Total time to complete Offline 
data collection:             ( 243) seconds.
Offline data collection
capabilities:                (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:        (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:    (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  99) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   214   214   063    Pre-fail  Always       -       11805
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       73
  5 Reallocated_Sector_Ct   0x0033   249   249   063    Pre-fail  Always       -       41
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   252   244   187    Pre-fail  Always       -       34394
  9 Power_On_Hours          0x0032   224   224   000    Old_age   Always       -       24560
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       76
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       38
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       43456
196 Reallocated_Event_Count 0x0008   251   251   000    Old_age   Offline      -       2
197 Current_Pending_Sector  0x0008   249   249   000    Old_age   Offline      -       41
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   216   000    Old_age   Always       -       37
202 TA_Increase_Count       0x000a   253   248   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   245   180    Pre-fail  Always       -       19
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   154   148   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0

smartctl also reports the drive’s error log and self-test log after this, but they’re not completely relevant right now–the events in question didn’t occur until things started failing.
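
If you want to look at those logs on their own, smartctl will print just the ATA error log with -l error. One wrinkle on this box: the drives hang off the 3ware card, so smartctl also has to be told which port to talk to with -d 3ware,N. The device name and port number below are my best guess for channel #2; they depend on the driver and kernel, so treat this as a sketch:

smartctl -l error -d 3ware,2 /dev/sda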

Looking at the results that smartctl reports, it doesn’t look like anything is particularly wrong. None of the pre-fail attributes are anywhere near their failure thresholds, and the old-age statistics make the drive look nearly new. Just looking at these numbers wouldn’t give you any indication that the drive was throwing uncorrectable read errors every few minutes.
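
If you don’t feel like eyeballing the whole table, a quick awk pass over the attribute dump will flag any attribute whose normalized value has fallen to within a few points of its threshold. This assumes the ten-column layout shown above, and the 10-point margin is arbitrary; add the -d 3ware,N option from above for drives behind the 3ware card:

smartctl -A /dev/hda | awk '$1 ~ /^[0-9]+$/ && $4+0 <= $6+10'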

So, let’s move on to the syslog results. The smartd daemon from the smartmontools package polls each of these attributes every half hour and logs any changes to syslog. You can look at the raw logs if you want to see the whole picture, but they’re way too long to include in their entirety here. The short version goes like this:

Dec  5 07:31:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252 
Dec  5 15:01:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Dec  5 15:31:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252 
Dec  5 20:01:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 

This pattern continues the whole time, with Seek_Time_Performance wandering between 251 and 253 and back. All 3 of my Maxtor drives do this all the time, and have since they were brand-new. It’s just noise in the logs, not a real problem. Next:

Dec  8 01:31:06 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec  8 02:01:05 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 

This is the first indication of trouble. Notice that it’s not very threatening–Hardware_ECC_Recovered barely changed, and it immediately flipped back to its old value. Plus, it’s marked as a “usage attribute,” which is supposed to track normal wear rather than predict imminent failure. Continuing:

Dec 13 04:50:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec 13 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 
Dec 13 06:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252 
Dec 13 07:20:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Dec 13 09:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec 13 11:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 
Dec 13 13:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252 
Dec 13 21:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec 13 21:50:57 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Dec 13 21:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 

This is the first time the Hardware_ECC_Recovered message recurred after its first appearance on the 8th. I left the Seek_Time_Performance lines in just to show that the ECC lines aren’t particularly common–the Seek Time lines show up every couple of hours, day in, day out.
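
As an aside, smartd can be told to stop reporting changes in a single attribute with the -I directive in smartd.conf, which is handy for exactly this kind of noise. A sketch, again with the device name and 3ware port as assumptions for my setup:

# Same monitoring as before, but don't report value changes in attribute 8
# (Seek_Time_Performance).
/dev/sda -d 3ware,2 -a -I 8 -s (S/../.././02|L/../../6/03) -m root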

The ECC notices continue, showing up again on the 16th, 18th, 25th, and again at 5:20 AM on the 1st. That’s where things start getting interesting:

Jan  1 03:20:57 starting scheduled Long Self-Test. 
Jan  1 03:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 251 
Jan  1 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Jan  1 05:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252 
Jan  1 05:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 253 to 252 
Jan  1 05:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252 
Jan  1 05:50:56 Self-Test Log error count increased from 0 to 1 
Jan  1 06:20:55 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Jan  1 06:20:55 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 
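
That “Self-Test Log error count increased” line is smartd noticing a new entry in the drive’s self-test log. The log itself records which LBA the test choked on, which is worth knowing if you want to figure out what data sits on top of the bad sector. Something like this dumps it (same caveats as before about the device name and port):

smartctl -l selftest -d 3ware,2 /dev/sda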

At this point, I hadn’t seen any actual errors yet, but the drive’s SMART self-test had spotted a bad sector. The 2nd and 3rd were basically the same–the self-tests reported that the same sector was still bad. All hell started to break loose on the 4th:

Jan  4 02:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 252 to 251 
Jan  4 02:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 252 to 253 
Jan  4 07:35:40 ATA error count increased from 980 to 981 
Jan  4 08:35:40 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Jan  5 02:05:42 starting scheduled Short Self-Test. 
Jan  5 02:35:40 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252 
Jan  5 02:35:40 Self-Test Log error count increased from 3 to 4 
Jan  5 06:36:08 SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 253 to 252 
Jan  5 06:36:08 SMART Usage Attribute: 197 Current_Pending_Sector changed from 253 to 252 
Jan  5 06:36:10 ATA error count increased from 981 to 1293 
Jan  5 07:14:45 ATA error count increased from 1293 to 2377 

By this point, I was seeing errors in the filesystem. Syslog was filling up with 3ware and XFS errors about disk problems. Things were starting to suck. On the 6th, I ordered new drives, and this morning I started installing them. I’m currently attempting to recover whatever data I can off of the bad disk.
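
For the recovery itself, plain dd gives up as soon as it hits an unreadable block; a tool like dd_rescue keeps going and just leaves gaps where the bad sectors were, which is what you want from a dying disk. A rough sketch, assuming the failing drive shows up as /dev/sdb and there’s somewhere with 160 GB free to park the image:

dd_rescue /dev/sdb /mnt/spare/failing-disk.img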

So, there are a couple of things that we can learn from this. First, if I’d been paying attention and immediately migrated data off of the failing disk as soon as SMART told me that it had developed a bad sector, then I’d probably have been okay. It took 2 or 3 days before the problem got bad enough to be visible at the filesystem level. Second, if I’d had enough familiarity with this particular Maxtor drive, I would have noticed that something weird was happening when the ECC errors started showing up in the logs. None of my other Maxtor drives have ever logged an ECC message; that makes the Hardware_ECC_Recovered message look kind of suspicious, but that probably only holds for this exact family of Maxtor drives. In a commercial environment, where I had dozens or hundreds of similar drives, I’d want to tell my log monitoring software to pay special attention to that message, because it looks like a good indicator of impending drive failure.
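
Absent fancier log monitoring, even a dumb cron job would have been a decent tripwire here: cron mails you whatever output a job produces and stays quiet otherwise, so a nightly grep is enough. The log path below is an assumption; use whatever your syslogd actually writes to:

# crontab entry: complain by mail if the suspicious message ever shows up
0 7 * * * grep Hardware_ECC_Recovered /var/log/syslog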

More importantly, though–if I’d been paying closer attention to my 3ware card, I would have noticed that this 4-drive RAID 5 array was running in degraded mode before the drive failed. If I’d fixed that at the time, the drive failure wouldn’t have cost me any data–the array would have dropped the failing drive and warned me, and that would have been that. Instead, I’m looking at a weekend’s worth of hassle as well as some data loss. When I get everything back up and running, I’m probably going to switch from the 3ware card’s hardware RAID 5 to Linux software RAID 5–I trust Linux’s RAID monitoring tools more than I trust 3ware’s. Also, I was only getting ~25 MB/sec writes with the 3ware’s hardware RAID 5, while I should get closer to 100 MB/sec with software RAID 5.
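
For what it’s worth, keeping an eye on a Linux software RAID array is easy. A degraded array is obvious in /proc/mdstat, and mdadm’s monitor mode will mail you when a disk drops out. A sketch, assuming the new array ends up as /dev/md0:

cat /proc/mdstat                                    # a degraded 4-disk array shows [UUU_] instead of [UUUU]
mdadm --detail /dev/md0                             # per-disk status for the array
mdadm --monitor --daemonise --mail=root /dev/md0    # run in the background, mail root on failures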