Disks are fun

Posted by Scott Laird Mon, 10 Jan 2005 23:00:49 GMT

Okay, so my RAID array died because I wasn’t paying enough attention and my 3ware card had already kicked out one perfectly good drive for no obvious reason. No sweat, I can handle that. I as I mentioned before, I took me most of a day, but I recovered almost all of the data off of the failed 4-drive array onto a new 2-drive RAID-0 array. Once the copy was complete, the goal was to destroy the old, broken RAID-5 array, create a new, working RAID-5 array, and then copy all of the data off of the RAID-0 array onto the new RAID-5 array. Then, when everything was complete, I was planning on using the RAID-0 disks as parity and spare drives for the RAID-5 set. Nice and simple, right?

So, by Friday night, I had 6 drives in front of me. One was bad, three were good, but part of the broken RAID array, and two held the data that had been on the RAID array. My goal was to take the 3 good drives and use them to build a new 4-drive RAID-5 array, so I built a software RAID-5 array in degraded mode–that way, I could get away with leaving out the 4th drive at the beginning. Once I copied the data off of the 5th and 6th drives, I was planning on adding them to the RAID-5 array so I’d have a 4th disk plus a spare.

I was very careful not to re-use the broken drive–it was on 3ware channel #2, so I cleverly built my new array using Linux’s sda, sdc, and sdd devices, skipping sdb. Once RAID-5 was running, I formatted the new array, copied everything from the RAID-0 set, broke down the RAID-0 set, and added the drives to the RAID-5 array. And promptly watched everything crumble to dust. My RAID-5 array started out in degraded mode, with 3 of 4 drives active. I then added 2 additional drives, and instead of watching it rebuild to 4 of 4 plus 1 spare, it went to 2 of 4 active. It even sent me this helpful email:

From:   scott@mail.sigkill.org
Subject: Fail event on /dev/md1:nfs
Date: January 8, 2005 8:16:43 AM PST
To:   scott@sigkill.org

This is an automatically generated mail message from mdadm
running on nfs

A Fail event had been detected on md device /dev/md1.

Faithfully yours, etc.

Although the array was still mounted, any attempt to access it generated a steady stream of I/O errors. What happened, you ask?

Basically, I was an idiot. Like I said, the drive on 3ware channel #2 failed, so I didn’t use drive sdb. Except that 3ware numbers their channels starting with 0. So channel #2 was drive number 3—sdc, not sdb. So I’d rebuilt by array using the bad drive, then copied my data onto the broken disk, and destroyed all of my good copies. I spent all morning Saturday trying to fix things, but I couldn’t even get the kernel to acknowledge that the RAID array existed. I finally gave up and tried cloning sdb onto sdc, to see if that’d work, but it didn’t make a bit of difference–I could at least get mdadm to tell me that sdb had once been a part of a RAID array, but it didn’t recognize any of the data on sdc as any part of anything.

In desperation, I tried re-creating the RAID array exactly as I’d first built it, using sda, sdc, and sdd. Amazingly enough, that worked, and I was able to mount the drive. I then carefully added sdc into the array, watched it rebuild the first 20% of the array, and then fail sdc back out of the array, leaving me back where I started. I finally turned off the computer in disgust and went and played with what was left of our snow.

Sunday was more snow, so I played with the kids, and then finally took one last swing at the computer. I re-built the RAID array again, and then built a RAID-0 array from sde and sdf. I then tried to copy anything that was salvageable off of the broken RAID-5 array. I figured that I’d be able to copy something before it croaked again. I checked back a couple hours later to discover that it’d copied all 216 GB without error. I was stunned–apparently the drive’s problem was really just corruption of a few sectors–writing new data back onto the drive overwrote the weak parts with a new, strong signal, and it was able to read them back safely. Ugh. It wouldn’t resync right because there were still a number of old sectors with old data on them–if I’d zeroed out the whole drive, it’d probably have worked right from the start, for at least a couple months, until it failed again.

So, I went back through the process again, destroying the array built from sda, sdc, and sdd, and then building a new one with sdb this time. There’s no way I’m going to trust the failing drive, even if it did work this time. I copied everything off of the little RAID-0 array, then carefully tore it apart and used its drives to rebuild the big array into its full RAID-5 glory. And it actually worked this time, without errors. Everything was finally finished around midnight last night, and I was able to reboot without problems.

All done, right?

Ha.

This morning I got up to find the screen full of syslogged Ethernet problems–apparently the network card had locked up. I could log in on the console, but I couldn’t ping anything. I rebooted, everything came up okay, and I tried copying a bunch of stuff onto the new RAID array. It copied just fine for about 5 minutes, and then the box locked up hard. No kernel panic or anything, just a dead box. The reset button didn’t help, and it ignored the soft power button, so I had to do the hold-the-power-button-for-5-seconds trick. After that, it didn’t boot right–there were 3ware card errors everywhere–timeouts, not drive problems. It locked up again halfway through booting.

So, practically speaking, I’m right back where I started on Friday morning–my box is dead, but the data is probably fine. I’m going to pop the box open and wiggle some cables, but I probably have bad hardware somewhere in the box–motherboard, 3ware card, or power supply. If this had happened at work, I’d just RMA the whole mess and let the vendor sort it out, but that’s not very useful at home, especially when dealing with a 4-year-old system with a second-hand RAID card. Ugh.

Update: I powered it off for a while, wiggled cables, removed spare hardware, rebooted, and found a nice kernel bug. If you have a RAID array with 4 drives plus a spare, and for some reason the spare’s RAID superblock has a higher timestamp then the 4 data drives, then the kernel’s RAID code will gladly kick the 4 good drives out of the array and keep just the spare. I sense a bug report in my near future.

Posted in  | Tags , , ,  | no comments

Anatomy of a Drive Failure

Posted by Scott Laird Sat, 08 Jan 2005 08:29:24 GMT

I’ve lost a lot of hard drives over the years, but I’ve never really had the ability to put one under the microscope, so to speak, to see what happened and what I could have done to detect the failure before it became a problem. In generally, even an extra 24 hours’ notice would greatly reduce the amount of data lost and reduce the pain involved in replacing failed drives. Drive makers understand this, and added the S.M.A.R.T. drive monitoring standard to drives years ago. Under Linux, the smartmontools package provides a number of tools for monitoring drives’ SMART status; I’ve been increasingly vigilant about running it on all of my systems, hoping that it’ll let me spot drive failures before data loss occurs.

I lost another drive this week. This is the first drive that I’ve lost that has been actively monitored by smartmontools the entire time, and the logs produced are instructive. Unfortunately, I didn’t pay close enough attention to SMART to prevent data loss, but there are a number of lessons contained in the logs produced. By understanding what the precursors of this drive failure, we should be able to be more reactive when faced with future failures.

First, here are the basic specs on the system and drives involved:

  • Athlon 700 (slot A)
  • 384 MB RAM (PC133)
  • Via KT133 chipset (Asus K7A MB, I think)
  • 3ware 7500-8 8-channel IDE RAID controller
  • 3 Maxtor 160 GB drives, 1 Hitachi 160 GB drive

The drive that failed was a Maxtor, on channel #2. Here’s what smartmontools 5.30 has to say about the drive in its current condition:

Device Model: Maxtor 4A160J0
Serial Number:A608B7WE
Firmware Version: RAMB1TU0
Device is:    Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:Fri Jan  7 11:47:02 2005 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:  (  24) The self-test routine was aborted by
                                        the host.
Total time to complete Offline 
data collection:             ( 243) seconds.
Offline data collection
capabilities:                (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:        (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:    (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  99) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME      FLAG VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time        0x0027   214   214   063Pre-fail  Always   -   11805
  4 Start_Stop_Count    0x0032   253   253   000Old_age   Always   -   73
  5 Reallocated_Sector_Ct   0x0033   249   249   063Pre-fail  Always   -   41
  6 Read_Channel_Margin 0x0001   253   253   100Pre-fail  Offline  -   0
  7 Seek_Error_Rate     0x000a   253   252   000Old_age   Always   -   0
  8 Seek_Time_Performance   0x0027   252   244   187Pre-fail  Always   -   34394
  9 Power_On_Hours      0x0032   224   224   000Old_age   Always   -   24560
 10 Spin_Retry_Count    0x002b   253   252   157Pre-fail  Always   -   0
 11 Calibration_Retry_Count 0x002b   253   252   223Pre-fail  Always   -   0
 12 Power_Cycle_Count   0x0032   253   253   000Old_age   Always   -   76
192 Power-Off_Retract_Count 0x0032   253   253   000Old_age   Always   -   0
193 Load_Cycle_Count    0x0032   253   253   000Old_age   Always   -   0
194 Temperature_Celsius 0x0032   253   253   000Old_age   Always   -   38
195 Hardware_ECC_Recovered  0x000a   253   252   000Old_age   Always   -   43456
196 Reallocated_Event_Count 0x0008   251   251   000Old_age   Offline  -   2
197 Current_Pending_Sector  0x0008   249   249   000Old_age   Offline  -   41
198 Offline_Uncorrectable   0x0008   253   252   000Old_age   Offline  -   0
199 UDMA_CRC_Error_Count0x0008   199   199   000Old_age   Offline  -   0
200 Multi_Zone_Error_Rate   0x000a   253   252   000Old_age   Always   -   0
201 Soft_Read_Error_Rate0x000a   253   216   000Old_age   Always   -   37
202 TA_Increase_Count   0x000a   253   248   000Old_age   Always   -   0
203 Run_Out_Cancel      0x000b   253   245   180Pre-fail  Always   -   19
204 Shock_Count_Write_Opern 0x000a   253   252   000Old_age   Always   -   0
205 Shock_Rate_Write_Opern  0x000a   253   252   000Old_age   Always   -   0
207 Spin_High_Current   0x002a   253   252   000Old_age   Always   -   0
208 Spin_Buzz           0x002a   253   252   000Old_age   Always   -   0
209 Offline_Seek_Performnce 0x0024   154   148   000Old_age   Offline  -   0
 99 Unknown_Attribute   0x0004   253   253   000Old_age   Offline  -   0
100 Unknown_Attribute   0x0004   253   253   000Old_age   Offline  -   0
101 Unknown_Attribute   0x0004   253   253   000Old_age   Offline  -   0

smartctl also reports a bunch of event log results after this, but they’re not completely relevant right now–the events in question didn’t occur until things started failing.

Looking at the results that smartctl reports, it doesn’t look like anything is particularly wrong. None of the pre-fail statistics are outside of their ideal range, and then old-age statistics make the drive look nearly new. Just looking at these numbers wouldn’t give you any indication that the drive was throwing uncorrectable read errors every few minutes.

So, let’s move on to the syslog results. The smartmontools package actively monitors each of these parameters and logs changes to syslog from time to time. You can look at the raw logs if you want to see the whole picture, but it’s way too long to include in its entirety here. The short version goes like this:

Dec  5 07:31:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252 
Dec  5 15:01:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Dec  5 15:31:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252 
Dec  5 20:01:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 

This pattern continues on like this the whole time, with Seek_Time_Performance wandering from 251 to 253 and back. All 3 of my Maxtor drives do this all the time, and have since they were brand-new. It’s just noise in the logs, not a real problem. Next:

Dec  8 01:31:06 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec  8 02:01:05 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 

This is the first indication of trouble. Notice that it’s not very threatening–Hardware_ECC_Recovered just barely changed and it immediately flipped back to its old value. Plus, it’s marked as a “usage attribute,” which indicates that it’s non-threatening. Continuing:

Dec 13 04:50:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec 13 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 
Dec 13 06:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252 
Dec 13 07:20:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Dec 13 09:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec 13 11:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 
Dec 13 13:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252 
Dec 13 21:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Dec 13 21:50:57 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Dec 13 21:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 

This is the first time that Hardware_ECC_Recovered reoccurred after the first occurrence on the 8th. I left the Seek_Time_Performance lines in, just to show that the ECC lines aren’t particularly common–the Seek Time lines show up every couple hours, day in, day out.

The ECC notices continue, showing up again on the 16th, 18th, 25th, and again at 5:20 AM on the 1st. That’s where things start getting interesting:

Jan  1 03:20:57 starting scheduled Long Self-Test. 
Jan  1 03:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 251 
Jan  1 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 
Jan  1 05:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252 
Jan  1 05:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 253 to 252 
Jan  1 05:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252 
Jan  1 05:50:56 Self-Test Log error count increased from 0 to 1 
Jan  1 06:20:55 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Jan  1 06:20:55 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 

At this point, I hadn’t seen any actual errors yet, but the drive’s SMART self-test had spotted a bad sector. The 2nd and 3rd were basically the same–their self test reported that the same sector was still bad. All hell started to break lose on the 4th:

Jan  4 02:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 252 to 251 
Jan  4 02:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 252 to 253 
Jan  4 07:35:40 ATA error count increased from 980 to 981 
Jan  4 08:35:40 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 
Jan  5 02:05:42 starting scheduled Short Self-Test. 
Jan  5 02:35:40 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252 
Jan  5 02:35:40 Self-Test Log error count increased from 3 to 4 
Jan  5 06:36:08 SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 253 to 252 
Jan  5 06:36:08 SMART Usage Attribute: 197 Current_Pending_Sector changed from 253 to 252 
Jan  5 06:36:10 ATA error count increased from 981 to 1293 
Jan  5 07:14:45 ATA error count increased from 1293 to 2377 

By this point, I was seeing errors in the filesystem. Syslog was filling up with 3ware and XFS errors about disk problems. Things were starting to suck. On the 6th, I ordered new drives, and this morning I started installing them. I’m currently attempting to recover whatever data I can off of the bad disk.

So, there are a couple things that we can learn from this. First, if I’d been paying attention and immediately migrated data off of the failing disk as soon as SMART told me that it had developed a bad sector, then I’d probably have been okay. It took 2 or 3 days before the problem got bad enough to be visible at the filesystem level. Second, if I’d had enough familiarity with this particular Maxtor drive, then I should have noticed that something weird was happening when the ECC errors started climbing. None of my other Maxtor drives have ever logged an ECC message; that makes the Hardware_ECC_Recovered message look kind of suspicious, but that probably only holds for this exact family of Maxtor drives. In a commercial environment, where I had dozens or hundreds of similar drives, I’d want to tell my log monitoring software to pay special attention to that message, because it looks like a good indicator of drive failure.

More importantly, though–if I’d been paying closer attention to my 3ware card, I would have noticed that this 4-drive RAID 5 array was running in degraded mode before the drive failed. If I’d fixed that then, then the drive failure wouldn’t have cost me any data–the array would have dropped the failing drive and warned me, and that would have been that. Instead, I’m looking at a weekend’s worth of hassle as well as some data loss. When I get everything back up and running, I’m probably going to switch from using the 3ware card’s hardware RAID 5 to software RAID 5–I trust Linux’s RAID monitoring tools more then I trust 3ware’s. Also, I was only getting ~25 MB/sec writing with the 3ware’s hardware RAID 5, while I should get closer to 100 MB/sec with software RAID 5.

Posted in  | Tags , , , ,  | 13 comments

One step forward, one step back

Posted by Scott Laird Thu, 06 Jan 2005 19:49:46 GMT

As mentioned earlier, I spent part of the long weekend cleaning up home theater stuff. Part of this involved migrating files onto my home file server, which is an old Athlon 700 with an 8-channel 3ware RAID card and 4 160 GB drives in a 450 GB RAID 5 array.

So what happens as soon as I finish copying stuff onto the array? A drive starts failing on the RAID array, and I discover that it was already running in degraded mode. Now I’m in danger of losing all 200 GB on the array. Most likely, it won’t come to that, but it’s still fantastically irritating. Of the 4 160 GB drives that I bought last year, 2 of them have now failed.

To make sure that this doesn’t happen again, I just ordered 2 more 160 GB drives from NewEgg (only $76 each), along with a 3-in-2 style drive cooler. Assuming that it all arrives tomorrow, I should be able to rebuild the array, including a spare drive this time, and hopefully I won’t have to worry about it failing again.

Posted in , ,  | Tags , , , , ,  | no comments