Posted by Scott Laird
Mon, 22 Oct 2007 21:39:47 GMT
I mentioned that I bought a Gigabyte GC-RAMDISK (a.k.a. i-RAM) to go in my new home file server, largely to see if using it as a solid-state log device would improve ZFS performance.
Unfortunately, I’ve been completely and totally unable to get the card to do anything at all. I’m not sure if I have a defective card or if Gigabyte’s SATA implementation is just really buggy. When I plugged it into the motherboard’s ICH9R SATA ports, the BIOS didn’t even show it on the boot-up scan and Solaris reported it as failing to initialize correctly. When I plugged it into the Supermicro AOC-SAT2-MV8 8-port SATA card, the BIOS could see it but Solaris gave a similar error. Connecting it to the motherboard’s Marvell eSATA ports made the Marvell hang at bootup and made Solaris really unhappy, spewing drive failure messages all over the console.
I can’t find a single review that suggests that anyone has got this to work with a recent motherboard. Digging through the Linux kernel mailing list suggests that it has a really spotty SATA implementation. Apparently they developed it using a couple Windows drivers as a comparison, instead of actually paying attention to the SATA spec.
So, it’s going back to Amazon today. It was a nice idea, but it just doesn’t appear to work. I don’t know if it’s broken or just poorly designed, but either way it’s not useful to me.
Tags broken, gc, gigabyte, i, ram, ramdisk, ssd, zfs | 7 comments
Posted by Scott Laird
Sun, 14 Oct 2007 12:47:00 GMT
What’s worse than the sound of one hard drive going “click, click?” Why, two drives going “click, click, click” in the same RAID 5 array, of course.
I’m not very happy with my little Infrant NAS box right now. I think I’ve had it with RAID 5–if I’m going to pile my life onto a disk array, then I really want something that can survive a 2-drive failure without croaking, and that’s basically impossible in a 4-drive enclosure.
I’m seriously considering replacing the Infrant with an OpenSolaris box running ZFS with RAID-Z2 across 6–10 drives; that should live through 2-drive failures, right? Anyone feel the need to talk me out of it?
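As a rough sanity check on that question, here is a minimal sketch of the parity arithmetic. It assumes equal-sized drives and ignores filesystem overhead; the function name and the 500 GB drive size are made up for illustration. Single parity (RAID 5) survives one failed drive, double parity (RAID-Z2) survives two.

```python
def array_summary(drives, parity_drives, drive_gb):
    """Return (usable_gb, failures_survived) for a simple parity array.

    Assumes equal-sized drives and ignores filesystem overhead.
    """
    if drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    usable_gb = (drives - parity_drives) * drive_gb
    return usable_gb, parity_drives


# Hypothetical 500 GB drives, chosen only for illustration.
for drives in (4, 6, 8, 10):
    for name, parity in (("single parity", 1), ("double parity", 2)):
        usable, survives = array_summary(drives, parity, 500)
        print(f"{drives} drives, {name}: {usable} GB usable, "
              f"survives {survives} failure(s)")
```

With only four drives, double parity leaves half of the raw capacity usable, which is part of the appeal of a bigger enclosure.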
Tags broken, hardware | 9 comments
Posted by Scott Laird
Thu, 22 Jun 2006 03:41:31 GMT
I’m mostly happy with my new Nokia E61, but there are 2 or 3 things about it that have been annoying the heck out of me. The most serious one is power-related: it’s been a royal pain in the neck to charge the phone. When I plug the phone into the charger, it’ll charge for a little while–30 seconds to 10 minutes–and then beep and say “Not charging.” If I unplug and replug the charger, then it’ll charge for a little while longer, but I have to keep doing it over and over again to get a full charge. This isn’t exactly an enjoyable and productive way to spend my days.
Obviously, it isn’t supposed to work this way. From doing a bit of digging online, it looks like I’m not alone with this problem. The general issue seems to be the charger–some third-party chargers (or even older Nokia chargers) don’t put out enough voltage to charge the phone. The phone sees the voltage as too low, so it aborts with the “not charging” message. I can see why this happens with the cheap third-party car charger that came with my phone, but it’s not at all clear why Nokia’s own wall charger does it. Admittedly, the one that came in the box with the phone is a UK model, but it’s supposed to take 100-240V. It just doesn’t quite manage to work right. There’s a slim chance that the vendor that sold me the E61 swapped out an older charger, but that’s kind of weird.
In an effort to avoid having to return the phone, I picked up a Nokia AC-3U charger from CompUSA on the way home today. The AC-3U is the travel charger that Nokia lists on their website for the E61. It uses the right connector for the phone (unlike the charger that came with the phone–it needs an adapter), and it’s quite a bit smaller than the original charger, too.
I plugged it in and the phone charged for about a half hour before “not charging” showed up. That’s a new record, but more importantly, it claimed to be fully charged at that point. So, most likely, the AC-3U does the trick. I’ll give it another shot tomorrow once it’s drained down a bit, but I think it’ll be okay.
Tags broken, nokiae61 | 21 comments
Posted by Scott Laird
Mon, 31 Oct 2005 03:49:49 GMT
This was one of those weekends where I practically lived with a camera in my hand; I took around 500 Halloween pictures yesterday and then moved on to Christmas-card pictures of the kids today. Everything was going well enough until late in the day, when I was trying to get a nice black and white shot of my son. In the middle of shooting I took a quick peek at my camera’s LCD display and noticed that the last shot had been completely underexposed. So I scrolled back a few shots and discovered that I’d been shooting nothing but black frames for about 30 seconds. One second it worked, the next it didn’t. All of the camera settings were the same–same aperture, ISO, and shutter speed. Same light. But no picture.
I double-checked things, but I was still getting nothing but black. With a sinking feeling, I popped the lens off and took a multi-second exposure while staring into the camera. I could see the mirror flip up, but instead of seeing the sensor, I was left staring at a closed shutter, which strongly suggests that my shutter has died. The shutter on my D60 is rated for 30,000 exposures, and I think I’m around 25,000 right now, so it’s a bit earlier than normal, but not utterly unexpected.
So, I guess I’ll be packing it up and shipping it off to Canon for service this week. I’m not sure how long it’ll take to get back, but I was hoping to use it at Mind Camp next weekend. It looks like I’m going to have to make alternate plans.
Tags broken, camera, photography | 1 comment
Posted by Scott Laird
Tue, 13 Sep 2005 17:31:16 GMT
So pretty much nothing worked yesterday.
It started with an 8:00 phone call over VoIP that only had one-way audio. Then my 11:00 phone call gave a potential client an “all circuits busy” instead of ringing through to my VoIP phone. I’ve been using Asterisk for almost 18 months, and this is the first time that I’ve ever seen either of these, so I spent a while trying to reproduce the problems and sending support email off to two different VoIP providers.
After that, I finally started in on The Evil Thing–I bought a copy of XP and I was planning on installing it on a spare PC so I could test Typo with IE 6. How hard could it be, right?
I gave up on it at 8:00 PM.
Here’s a short list of things that went wrong:
I couldn’t find an old Windows CD to use to make the upgrade test on my XP CD happy. I should have CDs for ‘98 and 2000 sitting here somewhere, but I couldn’t find either. I had to borrow one to make XP’s installer happy. It would have been easier to download a cracked copy than to use the legitimate version and fight with its copy and licensing protection.
Once I got past the upgrade test, the installer refused to format my hard drive. No matter which options I picked (full disk or small partition, NTFS or FAT, quick format or full format), it would always die out within 5 seconds with a “Setup was unable to format the partition” error. The error suggests that I check the power on my external SCSI drive. Since I’m installing onto a completely standard 80 GB internal IDE drive, the error isn’t very helpful. Digging around a bit, bad IDE cables and bad CD drives seem to be the most common causes for this error. Since this is an old box that I put together from spare parts, the system is using old 40-conductor IDE cables; I need to swing by a store and pick up a couple of 80-conductor IDE cables. Maybe that will help.
For the fun of it, I tried booting my borrowed XP disk (the one that I was using to pass the upgrade test), and *it* partitioned the drive without any problems. Unfortunately, it refused to take my license key. The nice hologrammed one that came directly from Microsoft. Apparently my key is just good for XP Pro Upgrade CDs that come with SP2 pre-installed or something. Rebooting with my CD put me right back into formatting limbo.
I swear, I should have just downloaded and installed a cracked version–I would have been done early yesterday afternoon.
Tags broken, hardware, voip, windows | 3 comments
Posted by Scott Laird
Wed, 06 Jul 2005 04:22:55 GMT
The AC adapter for my PowerBook died this morning. It was weird–it was working fine at home this morning, but it completely failed to work at the office. I changed power plugs and wiggled all of the connectors, all without success.
Fortunately, my office is only 20 minutes from one of the local Apple stores, so I was able to dash out and get a replacement. After 40 months, I guess I’m not really surprised when pieces die on my laptop anymore.
Posted in Mac stuff | Tags apple, applestore, broken, powerbook | 2 comments
Posted by Scott Laird
Fri, 18 Mar 2005 20:16:19 GMT
For some reason, the trackpad on my PowerBook started acting up this morning. Once or twice per minute, the mouse pointer just stops moving. Picking my finger up and waiting for a couple seconds usually fixes it; tapping hard on the trackpad seems to work, too. I rebooted, and it didn’t seem to happen while I was sitting at the login screen, but it restarted as soon as I logged in. So, it could be a software issue. Ugh. As it is, it’s really awkward to use the trackpad. I have a USB mouse that I bought to use with the laptop years ago, but I gave up on it after I got used to the trackpad. Now it looks like I might have to drag the mouse out of retirement, unless I can find a simple solution to the problem.
I think this is my Mac’s way of telling me to order a new PowerBook, but I’m not going to take the hint until the PowerBook G5 shows up.
Posted in Mac stuff | Tags broken, powerbook | no comments
Posted by Scott Laird
Tue, 08 Mar 2005 00:31:13 GMT
Today’s weather for Seattle, WA:
60°F, mostly cloudy
Today’s weather for my office:
72°F, light rain.
For the fourth time in the last two years, the air conditioner hiding above my office’s suspended ceiling is dripping, sending a stream of water onto the floor of my office. Fortunately, my current office layout doesn’t have any critical hardware sitting underneath the leak.
The really annoying thing is that the leak seems to follow me–my previous office had the same problem, and it was in a completely different building.
Posted in Work | Tags broken, office, seattle | no comments
Posted by Scott Laird
Wed, 16 Feb 2005 22:28:37 GMT
I’ve never really figured out why, but Google really likes me. Or, rather, it likes this blog. I keep showing up amazingly highly-ranked in common Google searches. Today’s example is treo wifi. In order, here are the top 10 results, out of 2.2 million possible matches:
- TreoCentral: No Treo WiFi
- TreoCentral: Treo 600 and WiFi?
- .:UNEASYsilence: Treo 650 WiFi
- PDA News: Treo 650 WiFi, Verizon announces XV6600, PalmOne…
- scottstuff: another WiFi solution for the Treo 650
- Slashdot: Enthusiast Hacks WiFi Into Treo 650
- CNet: Treo 650 Update WiFi-less
- Engadget: Add WiFi to your Treo 650! SD WiFi card drivers hacked
- Engadget: some random search page
- PalmInfocenter: HOWTO: Make that palmOne Treo 650 Even Better!
So, as I see this, Google sees me as a better source of information than Slashdot, CNet, Engadget, and PalmInfocenter? It can’t just be PageRank–from what I can see, I’m just a lowly PR5 this month, while Slashdot and CNet are PR9s, and Engadget is a PR6. It can’t be the number of links in Google’s database, because no one links to my Treo WiFi page. Can anyone explain how this works?
Posted in Blog stuff | Tags blog, broken, google, search | 2 comments
Posted by Scott Laird
Wed, 16 Feb 2005 19:10:17 GMT
I’m not an expert in cryptography, but I try to pay attention to what’s happening in the crypto world. Today, Bruce Schneier announced:
SHA-1 has been broken. Not a reduced-round version. Not a simplified version. The real thing.
The research team of Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu (mostly from Shandong University in China) have been quietly circulating a paper announcing their results:
- collisions in the full SHA-1 in 2^69 hash operations, much less than the brute-force attack of 2^80 operations based on the hash length.
- collisions in SHA-0 in 2^39 operations.
- collisions in 58-round SHA-1 in 2^33 operations.
So, unless I’m mis-reading this, SHA-1 lost a factor of 2,048; that’s enough to start moving away from SHA-1, but not enough to run screaming in the streets. The last time that SHA-1 attacks showed up, similar attacks were possible against MD5 and possibly also the newer SHA family members; I’m not really sure if there are cryptographic hashes in common use that aren’t at least slightly tainted right now.
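For anyone who wants to check the factor of 2,048, here is a minimal sketch of the arithmetic: the brute-force figure is 2 to the 80th operations and the new attack is 2 to the 69th, so the ratio is 2 to the 11th. The function name is just something I made up for the example.

```python
def attack_speedup(brute_force_bits, attack_bits):
    """Ratio between two work factors given as powers of two."""
    return 2 ** (brute_force_bits - attack_bits)


# 80-bit brute-force bound versus the 69-bit attack quoted above.
print(attack_speedup(80, 69))
```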
Posted in Computer Security | Tags broken, cryptography, security, sha1 | 5 comments
Posted by Scott Laird
Sat, 05 Feb 2005 01:44:18 GMT
It all started on Wednesday night, when I discovered that my Asterisk VoIP server couldn’t make long-distance phone calls any more. I could dial, and the phone would ring, but as soon as the other end answered, the line went dead and the logs started filling up with error messages:
Feb 2 20:24:03 WARNING[-1250968656]: Huh? An ilbc frame that isn't a multiple of 50 bytes long from IAX2 (16)?
That’s kind of weird looking. Google wasn’t particularly helpful, so I decided that this was probably a VoIP provider problem and sent mail to NuFone’s support people. Historically, NuFone’s support department has a reputation of being a bit spotty–either they handle your problem right off the bat, or you never hear from them again, but that was probably a combination of broken email servers and staffing shortages due to rapid growth. In this case, I got a response back from NuFone’s president first thing in the morning. We’ve exchanged a bit of mail in the past, and in this case he agreed that it was a weird error, and the first thing to try was upgrading. I was running Asterisk 1.0.2, while the current version is 1.0.5. He suggested that I might be happier with checking a development version out of CVS–they’ve been more stable for him, anyway, and there are a ton of new features and bug fixes that aren’t in 1.0.5.
What the heck, I said. Sounds like fun.
So, I checked a copy of Asterisk and several of its supporting libraries and drivers out of CVS, built them, installed them, and tried restarting Asterisk. First problem: I’d built my kernel without support for module unloading, somehow. So, I had to reboot to upgrade the drivers for my two PCI VoIP cards. Grumble, grumble. Oh well, I’d been meaning to shut it down and add more RAM anyway; the RAM was sitting right there, and it shouldn’t take long to slide the server out from under its desk and install the RAM. Five minutes later, the box was booting back up with the new drivers and twice as much memory.
Problem 2: As soon as I started Asterisk, I discovered that the X101P PCI card that handles my regular POTS line was reporting a ‘red alarm’ on the line, suggesting that it wasn’t connected right. I crawled under the desk, unplugged the phone line, wiggled connectors, and tested it with a $10 phone that I keep on hand for situations like this. I got a dial tone just fine with the phone, so I plugged the line back into the PC and crawled back out from under the desk to discover…
Problem 3: the server locked up while I was fiddling with the phone line. So I rebooted it, only to discover…
Problem 4: it couldn’t load the drivers for the PCI card anymore. The driver spit out a number of weird errors:
kernel: wcfxo: Out of space to write register 06 with e0
kernel: wcfxo: Out of space to write register 0f with 00
That’s ugly. I developed a new theory: the card wasn’t getting reset right on reboot. So, I powered the box completely down, pulled the power cord for a minute, and then tried again. It worked this time; everything came up right, and I was able to verify that everything was working correctly. Incoming calls worked right, and outgoing calls worked, via both POTS and VoIP.
So fast-forward a few hours. I try to call out, and nothing happens–I just get a dead line. Checking Asterisk’s console, it looks like it dialed out on the POTS line, but I wasn’t hearing ringing or anything. The driver for the X101P PCI card had apparently choked again. I tried stopping and restarting Asterisk, but it didn’t help. I didn’t worry too much, though–incoming calls on my POTS line get forwarded to VoIP after about 15 seconds’ worth of ringing. Somewhere in the middle of this, a couple calls came in from family members, but I wasn’t close enough to a phone to answer them, so I let them go to voice mail. Except neither call left a message. That seemed strange (my family is big on leaving messages), so I called back, and they said that the line had gone dead right after it answered–they never heard the “leave a message” message.
Problem 5: incoming VoIP calls weren’t working, either. The connection died in the middle of a Playback instruction. I tried changing codecs, restarting Asterisk, and even rebooting, all without success. My frustration level was rising. Quickly. At this point, almost nothing was working right–POTS was dead, and incoming VoIP calls only worked right if I answered the phone. I took a break, put the kids to bed, and resisted the urge to scream. I mean, I debug problems like this for a living, which means that I really hate to do it at home.
So I spent a minute thinking–it was really weird that rebooting didn’t fix the VoIP Voicemail problem. I’d tested it earlier in the day, and it had worked after the upgrade to CVS Asterisk. And I hadn’t changed any other software since then. I rebooted again, and this time the X101P PCI VoIP card disappeared completely. Asterisk wouldn’t even start now–my config files are sprinkled with references to this card, and Asterisk was aborting with “can’t find card” errors. So I thought a bit more and came up with a new theory–when I’d pulled the box out to install more RAM, maybe I’d pulled on the phone cable that goes into the X101P a bit too hard, and the card had popped partway out of the slot. It’s easy to test, so I shut the server down again, slid it out from under the desk, popped the PCI card out, checked it for obvious problems, and put it back in firmly. I then plugged everything back in, rebooted, and discovered that everything was working perfectly. POTS worked, incoming voicemail worked via POTS or VoIP, outgoing VoIP calls worked.
I checked again this morning, and everything’s still working fine.
So, did I really pull the card halfway out, or was yesterday’s Asterisk CVS tree just really unstable? And what sort of person goes through this just to save $15 per month on phone calls? And speaking of that–if I’d had a second VoIP provider set up, then I could have shunted calls to them when outgoing calls to NuFone failed, and I would have been able to avoid part of the mess. Hey, yeah, and I still need to finish the Asterisk-caller-ID-on-MythTV project, so I can see who’s calling on the TV while I’m watching movies. Oohh, and my shiny new internal PCI ADSL card is supposed to ship soon. And I need to mod an Xbox so I can run MythTV on it downstairs…
I guess that should answer the “what sort of person” question. A person with incurable gadget lust. I’m getting better. Really.
Posted in Asterisk | Tags asterisk, broken, ilbc, nufone, upgrade | 3 comments
Posted by Scott Laird
Thu, 27 Jan 2005 15:25:09 GMT
This is almost too much to believe:
A Londoner made a tsunami-relief donation using lynx – a text-based browser used by the blind, Unix-users and others – on Sun’s Solaris operating system. The site-operator decided that this “unusual” event in the system log indicated a hack-attempt, and the police broke down the donor’s door and arrested him.
(From Boing Boing)
Posted in Web stuff | Tags arrested, broken, lynx | no comments
Posted by Scott Laird
Sat, 22 Jan 2005 16:03:58 GMT
Yesterday morning, when I arrived at work, I noticed that my laptop couldn’t connect to my home email server for some reason. Attempts to ping my home web server showed 90% packet loss. That’s kind of an unusual situation for a home network–I’ve had DSL go out quite a few times over the past 5 years, and I’ve had routers crash, but this is the first time that I’ve seen crippling packet loss. My best guess was that something strange had happened with a VPN that I’d set up between home and work, and that something was flinging non-rate-limited UDP or ESP packets at an insane rate.
Since 90% packet loss affects our home VoIP service as well, I had my wife hit reset on our home router/firewall PC. That fixed the unusual 90% packet loss, replacing it with 100% packet loss. When I got home last night, I found that the system had dropped into the BIOS setup screen with a “the previous boot didn’t complete right, so you probably want to change some BIOS settings” error. Grr. I changed the boot time error settings to tell it to ignore all errors, but this will probably happen the next time I need to do an emergency reboot.
Of course, it goes without saying that yesterday was the slowest day of the month, in terms of traffic to my blog. Traffic is way up this month–my average number of visits so far this month is only slightly behind the single best day from last year. My previous high was 595 hits, followed by 524 hits in second place. This month, I’m averaging around 540 hits per day, with a high of almost 800. Yeah, except for yesterday, which was barely 350.
Posted in Personal | Tags broken, dsl, outage, webstats | no comments
Posted by Scott Laird
Mon, 10 Jan 2005 23:00:49 GMT
Okay, so my RAID array died because I wasn’t paying enough attention and my 3ware card had already kicked out one perfectly good drive for no obvious reason. No sweat, I can handle that. As I mentioned before, it took me most of a day, but I recovered almost all of the data off of the failed 4-drive array onto a new 2-drive RAID-0 array. Once the copy was complete, the goal was to destroy the old, broken RAID-5 array, create a new, working RAID-5 array, and then copy all of the data off of the RAID-0 array onto the new RAID-5 array. Then, when everything was complete, I was planning on using the RAID-0 disks as parity and spare drives for the RAID-5 set. Nice and simple, right?
So, by Friday night, I had 6 drives in front of me. One was bad, three were good, but part of the broken RAID array, and two held the data that had been on the RAID array. My goal was to take the 3 good drives and use them to build a new 4-drive RAID-5 array, so I built a software RAID-5 array in degraded mode–that way, I could get away with leaving out the 4th drive at the beginning. Once I copied the data off of the 5th and 6th drives, I was planning on adding them to the RAID-5 array so I’d have a 4th disk plus a spare.
I was very careful not to re-use the broken drive–it was on 3ware channel #2, so I cleverly built my new array using Linux’s sda, sdc, and sdd devices, skipping sdb. Once RAID-5 was running, I formatted the new array, copied everything from the RAID-0 set, broke down the RAID-0 set, and added the drives to the RAID-5 array. And promptly watched everything crumble to dust. My RAID-5 array started out in degraded mode, with 3 of 4 drives active. I then added 2 additional drives, and instead of watching it rebuild to 4 of 4 plus 1 spare, it went to 2 of 4 active. It even sent me this helpful email:
From: scott@mail.sigkill.org
Subject: Fail event on /dev/md1:nfs
Date: January 8, 2005 8:16:43 AM PST
To: scott@sigkill.org
This is an automatically generated mail message from mdadm
running on nfs
A Fail event had been detected on md device /dev/md1.
Faithfully yours, etc.
Although the array was still mounted, any attempt to access it generated a steady stream of I/O errors. What happened, you ask?
Basically, I was an idiot. Like I said, the drive on 3ware channel #2 failed, so I didn’t use drive sdb. Except that 3ware numbers their channels starting with 0. So channel #2 was drive number 3–sdc, not sdb. So I’d rebuilt my array using the bad drive, then copied my data onto the broken disk, and destroyed all of my good copies. I spent all morning Saturday trying to fix things, but I couldn’t even get the kernel to acknowledge that the RAID array existed. I finally gave up and tried cloning sdb onto sdc, to see if that’d work, but it didn’t make a bit of difference–I could at least get mdadm to tell me that sdb had once been a part of a RAID array, but it didn’t recognize any of the data on sdc as any part of anything.
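The off-by-one is easy to see in a minimal sketch. This assumes every channel has a drive attached and that Linux names the disks in channel order starting at sda, which is how things looked on this box; the helper name is made up for the example.

```python
import string


def channel_to_device(channel):
    """Map a zero-based controller channel number to a Linux sdX name.

    Assumption: one drive per channel, named in channel order from sda,
    with no other SCSI-style disks in the system.
    """
    if not 0 <= channel < 26:
        raise ValueError("only single-letter device names are handled")
    return "sd" + string.ascii_lowercase[channel]


# Channels are numbered from 0, so channel 2 is the third drive.
for channel in range(4):
    print(f"channel {channel} -> {channel_to_device(channel)}")
```

Under those assumptions channel 2 comes out as sdc, the drive I should have skipped.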
In desperation, I tried re-creating the RAID array exactly as I’d first built it, using sda, sdc, and sdd. Amazingly enough, that worked, and I was able to mount the drive. I then carefully added sdc into the array, watched it rebuild the first 20% of the array, and then fail sdc back out of the array, leaving me back where I started. I finally turned off the computer in disgust and went and played with what was left of our snow.
Sunday was more snow, so I played with the kids, and then finally took one last swing at the computer. I re-built the RAID array again, and then built a RAID-0 array from sde and sdf. I then tried to copy anything that was salvageable off of the broken RAID-5 array. I figured that I’d be able to copy something before it croaked again. I checked back a couple hours later to discover that it’d copied all 216 GB without error. I was stunned–apparently the drive’s problem was really just corruption of a few sectors–writing new data back onto the drive overwrote the weak parts with a new, strong signal, and it was able to read them back safely. Ugh. It wouldn’t resync right because there were still a number of old sectors with old data on them–if I’d zeroed out the whole drive, it’d probably have worked right from the start, for at least a couple months, until it failed again.
So, I went back through the process again, destroying the array built from sda, sdc, and sdd, and then building a new one with sdb this time. There’s no way I’m going to trust the failing drive, even if it did work this time. I copied everything off of the little RAID-0 array, then carefully tore it apart and used its drives to rebuild the big array into its full RAID-5 glory. And it actually worked this time, without errors. Everything was finally finished around midnight last night, and I was able to reboot without problems.
All done, right?
Ha.
This morning I got up to find the screen full of syslogged Ethernet problems–apparently the network card had locked up. I could log in on the console, but I couldn’t ping anything. I rebooted, everything came up okay, and I tried copying a bunch of stuff onto the new RAID array. It copied just fine for about 5 minutes, and then the box locked up hard. No kernel panic or anything, just a dead box. The reset button didn’t help, and it ignored the soft power button, so I had to do the hold-the-power-button-for-5-seconds trick. After that, it didn’t boot right–there were 3ware card errors everywhere–timeouts, not drive problems. It locked up again halfway through booting.
So, practically speaking, I’m right back where I started on Friday morning–my box is dead, but the data is probably fine. I’m going to pop the box open and wiggle some cables, but I probably have bad hardware somewhere in the box–motherboard, 3ware card, or power supply. If this had happened at work, I’d just RMA the whole mess and let the vendor sort it out, but that’s not very useful at home, especially when dealing with a 4-year-old system with a second-hand RAID card. Ugh.
Update: I powered it off for a while, wiggled cables, removed spare hardware, rebooted, and found a nice kernel bug. If you have a RAID array with 4 drives plus a spare, and for some reason the spare’s RAID superblock has a higher timestamp than the 4 data drives, then the kernel’s RAID code will gladly kick the 4 good drives out of the array and keep just the spare. I sense a bug report in my near future.
Posted in Computer System Administration | Tags broken, ide, linux, raid | no comments
Posted by Scott Laird
Sat, 08 Jan 2005 08:29:24 GMT
I’ve lost a lot of hard drives over the years, but I’ve never really had the ability to put one under the microscope, so to speak, to see what happened and what I could have done to detect the failure before it became a problem. In general, even an extra 24 hours’ notice would greatly reduce the amount of data lost and reduce the pain involved in replacing failed drives. Drive makers understand this, and added the S.M.A.R.T. drive monitoring standard to drives years ago. Under Linux, the smartmontools package provides a number of tools for monitoring drives’ SMART status; I’ve been increasingly vigilant about running it on all of my systems, hoping that it’ll let me spot drive failures before data loss occurs.
I lost another drive this week. This is the first drive that I’ve lost that has been actively monitored by smartmontools the entire time, and the logs produced are instructive. Unfortunately, I didn’t pay close enough attention to SMART to prevent data loss, but there are a number of lessons contained in the logs produced. By understanding the precursors of this drive failure, we should be able to be more reactive when faced with future failures.
First, here are the basic specs on the system and drives involved:
- Athlon 700 (slot A)
- 384 MB RAM (PC133)
- Via KT133 chipset (Asus K7A MB, I think)
- 3ware 7500-8 8-channel IDE RAID controller
- 3 Maxtor 160 GB drives, 1 Hitachi 160 GB drive
The drive that failed was a Maxtor, on channel #2. Here’s what smartmontools 5.30 has to say about the drive in its current condition:
Device Model:     Maxtor 4A160J0
Serial Number:    A608B7WE
Firmware Version: RAMB1TU0
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Fri Jan 7 11:47:02 2005 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  24) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                 ( 243) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  99) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   214   214   063    Pre-fail  Always       -       11805
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       73
  5 Reallocated_Sector_Ct   0x0033   249   249   063    Pre-fail  Always       -       41
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   252   244   187    Pre-fail  Always       -       34394
  9 Power_On_Hours          0x0032   224   224   000    Old_age   Always       -       24560
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       76
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       38
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       43456
196 Reallocated_Event_Count 0x0008   251   251   000    Old_age   Offline      -       2
197 Current_Pending_Sector  0x0008   249   249   000    Old_age   Offline      -       41
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   216   000    Old_age   Always       -       37
202 TA_Increase_Count       0x000a   253   248   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   245   180    Pre-fail  Always       -       19
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   154   148   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
smartctl also reports a bunch of event log results after this, but they’re not completely relevant right now–the events in question didn’t occur until things started failing.
Looking at the results that smartctl reports, it doesn’t look like anything is particularly wrong. None of the pre-fail statistics are outside of their ideal range, and the old-age statistics make the drive look nearly new. Just looking at these numbers wouldn’t give you any indication that the drive was throwing uncorrectable read errors every few minutes.
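To make the "nothing looks wrong" claim concrete, here is a minimal Python sketch that parses a few attribute rows and applies the usual SMART rule: an attribute counts as failed when its normalized VALUE is at or below THRESH. The rows are copied from the output above with the column spacing normalized. The names `parse_attributes` and `failing` are made up for this illustration and are not part of smartmontools.

```python
import re

# A few rows copied from the smartctl output above (spacing normalized).
SAMPLE = """\
  3 Spin_Up_Time 0x0027 214 214 063 Pre-fail Always - 11805
  5 Reallocated_Sector_Ct 0x0033 249 249 063 Pre-fail Always - 41
  8 Seek_Time_Performance 0x0027 252 244 187 Pre-fail Always - 34394
197 Current_Pending_Sector 0x0008 249 249 000 Old_age Offline - 41
"""

# ID, name, flag, value, worst, threshold, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(Pre-fail|Old_age)\s+(\S+)\s+(\S+)\s+(\d+)\s*$"
)


def parse_attributes(text):
    """Turn attribute table rows into a list of dicts; skip non-matching lines."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(4)),
                "worst": int(m.group(5)),
                "thresh": int(m.group(6)),
                "type": m.group(7),
                "raw": int(m.group(10)),
            })
    return rows


def failing(rows):
    """Names of attributes whose normalized value is at or below threshold."""
    return [r["name"] for r in rows if r["value"] <= r["thresh"]]


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    print("failing attributes:", failing(attrs))
```

Run against the sample rows, the list of failing attributes comes back empty, which matches what the table shows by eye: every normalized value sits comfortably above its threshold even though the drive is dying.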
So, let’s move on to the syslog results. The smartmontools package actively monitors each of these parameters and logs changes to syslog from time to time. You can look at the raw logs if you want to see the whole picture, but it’s way too long to include in its entirety here. The short version goes like this:
Dec 5 07:31:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252
Dec 5 15:01:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 5 15:31:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 5 20:01:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
This pattern continues the whole time, with Seek_Time_Performance wandering from 251 to 253 and back. All 3 of my Maxtor drives do this all the time, and have since they were brand-new. It’s just noise in the logs, not a real problem. Next:
Dec 8 01:31:06 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 8 02:01:05 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
This is the first indication of trouble. Notice that it’s not very threatening–Hardware_ECC_Recovered just barely changed and it immediately flipped back to its old value. Plus, it’s marked as a “usage attribute,” which indicates that it’s non-threatening. Continuing:
Dec 13 04:50:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
Dec 13 06:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 13 07:20:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 13 09:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 11:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
Dec 13 13:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 13 21:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 21:50:57 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 13 21:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
This is the first time that Hardware_ECC_Recovered showed up again after its first appearance on the 8th. I left the Seek_Time_Performance lines in, just to show that the ECC lines aren’t particularly common–the Seek Time lines show up every couple of hours, day in, day out.
The ECC notices continue, showing up again on the 16th, 18th, 25th, and again at 5:20 AM on the 1st. That’s where things start getting interesting:
Jan 1 03:20:57 starting scheduled Long Self-Test.
Jan 1 03:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 251
Jan 1 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Jan 1 05:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252
Jan 1 05:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 253 to 252
Jan 1 05:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252
Jan 1 05:50:56 Self-Test Log error count increased from 0 to 1
Jan 1 06:20:55 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Jan 1 06:20:55 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
At this point, I hadn’t seen any actual errors yet, but the drive’s SMART self-test had spotted a bad sector. The 2nd and 3rd were basically the same–the self-test reported that the same sector was still bad. All hell started to break loose on the 4th:
Jan 4 02:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 252 to 251
Jan 4 02:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 252 to 253
Jan 4 07:35:40 ATA error count increased from 980 to 981
Jan 4 08:35:40 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Jan 5 02:05:42 starting scheduled Short Self-Test.
Jan 5 02:35:40 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252
Jan 5 02:35:40 Self-Test Log error count increased from 3 to 4
Jan 5 06:36:08 SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 253 to 252
Jan 5 06:36:08 SMART Usage Attribute: 197 Current_Pending_Sector changed from 253 to 252
Jan 5 06:36:10 ATA error count increased from 981 to 1293
Jan 5 07:14:45 ATA error count increased from 1293 to 2377
By this point, I was seeing errors in the filesystem. Syslog was filling up with 3ware and XFS errors about disk problems. Things were starting to suck. On the 6th, I ordered new drives, and this morning I started installing them. I’m currently attempting to recover whatever data I can off of the bad disk.
So, there are a couple of things that we can learn from this. First, if I’d been paying attention and immediately migrated data off of the failing disk as soon as SMART told me that it had developed a bad sector, then I’d probably have been okay. It took 2 or 3 days before the problem got bad enough to be visible at the filesystem level. Second, if I’d had enough familiarity with this particular Maxtor drive, then I would have noticed that something weird was happening when the ECC messages started showing up. None of my other Maxtor drives have ever logged an ECC message; that makes the Hardware_ECC_Recovered message look kind of suspicious, but that probably only holds for this exact family of Maxtor drives. In a commercial environment, where I had dozens or hundreds of similar drives, I’d want to tell my log monitoring software to pay special attention to that message, because it looks like a good indicator of drive failure.
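As a rough idea of what "pay special attention to that message" could look like, here is a small Python sketch that scans syslog lines shaped like the excerpts above, ignores the Seek_Time_Performance noise, and reports changes to a watch list of attributes. The watch list, the function name `watched_events`, and the assumption that every relevant line contains "SMART ... Attribute: N NAME changed from X to Y" are all illustrative; a real smartd log will have a hostname and device prefix, which the pattern skips over.

```python
import re
from collections import Counter

# Matches lines like the syslog excerpts in this post. Anything between the
# timestamp and "SMART" (hostname, smartd[pid], device name) is skipped.
CHANGE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<time>[\d:]+)\s+.*?"
    r"SMART (?:Prefailure|Usage) Attribute: \d+ (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)

# Hypothetical watch list: attributes that mattered for this drive family.
WATCHED = {
    "Hardware_ECC_Recovered",
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

SAMPLE_LINES = [
    "Dec 5 07:31:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252",
    "Dec 8 01:31:06 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252",
    "Dec 8 02:01:05 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253",
    "Jan 1 05:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252",
]


def watched_events(lines, watched=WATCHED):
    """Return (date, attribute, old, new) for every change to a watched attribute."""
    events = []
    for line in lines:
        m = CHANGE.match(line)
        if m and m.group("name") in watched:
            date = f"{m.group('month')} {m.group('day')}"
            events.append((date, m.group("name"), int(m.group("old")), int(m.group("new"))))
    return events


if __name__ == "__main__":
    events = watched_events(SAMPLE_LINES)
    per_day = Counter(date for date, _, _, _ in events)
    for date, count in per_day.items():
        print(f"{date}: {count} watched attribute change(s)")
```

Counting per day rather than alerting on every line is a deliberate choice here: a single flip and flip-back looks harmless, but the same attribute showing up on more and more days is the pattern that preceded this failure.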
More importantly, though–if I’d been paying closer attention to my 3ware card, I would have noticed that this 4-drive RAID 5 array was running in degraded mode before the drive failed. If I’d fixed that at the time, the drive failure wouldn’t have cost me any data–the array would have dropped the failing drive and warned me, and that would have been that. Instead, I’m looking at a weekend’s worth of hassle as well as some data loss. When I get everything back up and running, I’m probably going to switch from using the 3ware card’s hardware RAID 5 to software RAID 5–I trust Linux’s RAID monitoring tools more than I trust 3ware’s. Also, I was only getting ~25 MB/sec writing with the 3ware’s hardware RAID 5, while I should get closer to 100 MB/sec with software RAID 5.
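For the software RAID side, a degraded array is easy to spot in /proc/mdstat: the status line shows something like [4/3] [UUU_] instead of [4/4] [UUUU]. The sketch below is a minimal Python check along those lines. The sample text, device names, and the function name `degraded_arrays` are made up for illustration, and this is not meant as a replacement for mdadm’s own monitoring.

```python
import re

# Illustrative /proc/mdstat contents; array and device names are hypothetical.
SAMPLE_MDSTAT = """\
Personalities : [raid5]
md0 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      732587712 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid5 sdg1[2] sdf1[1] sde1[0]
      732587712 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

unused devices: <none>
"""

ARRAY = re.compile(r"^(md\w+)\s*:")                      # "md0 : active raid5 ..."
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")    # "[4/3] [UUU_]"


def degraded_arrays(mdstat_text):
    """Map array name -> (expected devices, active devices) for degraded arrays."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current is not None:
            total, active = int(status.group(1)), int(status.group(2))
            # An underscore in the map marks a missing or failed member.
            if active < total or "_" in status.group(3):
                degraded[current] = (total, active)
    return degraded


if __name__ == "__main__":
    for name, (total, active) in degraded_arrays(SAMPLE_MDSTAT).items():
        print(f"{name} is degraded: {active} of {total} devices active")
```

On a real machine, the same function could be fed the contents of /proc/mdstat from a cron job, so that a degraded array gets noticed before the next drive starts failing.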
Posted in Computer System Administration | Tags broken, drive, ide, linux, smart | 13 comments