I’ve lost a lot of hard drives over the years, but I’ve never really had the ability to put one under the microscope, so to speak, to see what happened and what I could have done to detect the failure before it became a problem. In general, even an extra 24 hours’ notice would greatly reduce the amount of data lost and reduce the pain involved in replacing failed drives. Drive makers understand this, and added the S.M.A.R.T. drive monitoring standard to drives years ago. Under Linux, the
smartmontools package provides a number of tools for monitoring drives’ SMART status; I’ve been increasingly vigilant about running it on all of my systems, hoping that it’ll let me spot drive failures before data loss occurs.
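Under the hood, it’s the smartd daemon from that package doing the watching. A minimal /etc/smartd.conf entry looks something like this (a sketch, not my exact config; the device name and mail address are placeholders):

```shell
# /etc/smartd.conf (sketch)
# -a        monitor all SMART attributes and log changes to syslog
# -o on     enable the drive's automatic offline testing
# -S on     enable attribute autosave
# -s (...)  schedule a short self-test daily at 2am, a long one Sundays at 3am
# -m        mail a warning when something trips
/dev/hda -a -o on -S on -s (S/../.././02|L/../../7/03) -m root@localhost
```

The -s scheduling syntax only exists in newer smartmontools releases; older ones need a cron job running smartctl -t instead.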
I lost another drive this week. This is the first drive that I’ve lost that has been actively monitored by
smartmontools the entire time, and the logs produced are instructive. Unfortunately, I didn’t pay close enough attention to SMART to prevent data loss, but there are a number of lessons contained in the logs. By understanding the precursors of this drive failure, we should be able to react more quickly when faced with future failures.
First, here are the basic specs on the system and drives involved:
- Athlon 700 (slot A)
- 384 MB RAM (PC133)
- Via KT133 chipset (Asus K7A MB, I think)
- 3ware 7500-8 8-channel IDE RAID controller
- 3 Maxtor 160 GB drives, 1 Hitachi 160 GB drive
The drive that failed was a Maxtor, on channel #2. Here’s what
smartmontools 5.30 has to say about the drive in its current condition:
Device Model:     Maxtor 4A160J0
Serial Number:    A608B7WE
Firmware Version: RAMB1TU0
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Fri Jan 7 11:47:02 2005 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  24) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                 ( 243) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  99) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027  214   214   063   Pre-fail Always      -       11805
  4 Start_Stop_Count        0x0032  253   253   000   Old_age  Always      -       73
  5 Reallocated_Sector_Ct   0x0033  249   249   063   Pre-fail Always      -       41
  6 Read_Channel_Margin     0x0001  253   253   100   Pre-fail Offline     -       0
  7 Seek_Error_Rate         0x000a  253   252   000   Old_age  Always      -       0
  8 Seek_Time_Performance   0x0027  252   244   187   Pre-fail Always      -       34394
  9 Power_On_Hours          0x0032  224   224   000   Old_age  Always      -       24560
 10 Spin_Retry_Count        0x002b  253   252   157   Pre-fail Always      -       0
 11 Calibration_Retry_Count 0x002b  253   252   223   Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032  253   253   000   Old_age  Always      -       76
192 Power-Off_Retract_Count 0x0032  253   253   000   Old_age  Always      -       0
193 Load_Cycle_Count        0x0032  253   253   000   Old_age  Always      -       0
194 Temperature_Celsius     0x0032  253   253   000   Old_age  Always      -       38
195 Hardware_ECC_Recovered  0x000a  253   252   000   Old_age  Always      -       43456
196 Reallocated_Event_Count 0x0008  251   251   000   Old_age  Offline     -       2
197 Current_Pending_Sector  0x0008  249   249   000   Old_age  Offline     -       41
198 Offline_Uncorrectable   0x0008  253   252   000   Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x0008  199   199   000   Old_age  Offline     -       0
200 Multi_Zone_Error_Rate   0x000a  253   252   000   Old_age  Always      -       0
201 Soft_Read_Error_Rate    0x000a  253   216   000   Old_age  Always      -       37
202 TA_Increase_Count       0x000a  253   248   000   Old_age  Always      -       0
203 Run_Out_Cancel          0x000b  253   245   180   Pre-fail Always      -       19
204 Shock_Count_Write_Opern 0x000a  253   252   000   Old_age  Always      -       0
205 Shock_Rate_Write_Opern  0x000a  253   252   000   Old_age  Always      -       0
207 Spin_High_Current       0x002a  253   252   000   Old_age  Always      -       0
208 Spin_Buzz               0x002a  253   252   000   Old_age  Always      -       0
209 Offline_Seek_Performnce 0x0024  154   148   000   Old_age  Offline     -       0
 99 Unknown_Attribute       0x0004  253   253   000   Old_age  Offline     -       0
100 Unknown_Attribute       0x0004  253   253   000   Old_age  Offline     -       0
101 Unknown_Attribute       0x0004  253   253   000   Old_age  Offline     -       0
smartctl also reports a bunch of event log results after this, but they’re not completely relevant right now–the events in question didn’t occur until things started failing.
Looking at the results that
smartctl reports, it doesn’t look like anything is particularly wrong. None of the pre-fail statistics are outside of their ideal range, and the old-age statistics make the drive look nearly new. Just looking at these numbers wouldn’t give you any indication that the drive was throwing uncorrectable read errors every few minutes.
So, let’s move on to the syslog results. The
smartmontools package actively monitors each of these parameters and logs changes to syslog from time to time. You can look at the raw logs if you want to see the whole picture, but it’s way too long to include in its entirety here. The short version goes like this:
Dec 5 07:31:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252
Dec 5 15:01:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 5 15:31:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 5 20:01:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
This pattern continues on like this the whole time, with
Seek_Time_Performance wandering from 251 to 253 and back. All 3 of my Maxtor drives do this all the time, and have since they were brand-new. It’s just noise in the logs, not a real problem. Next:
Dec 8 01:31:06 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 8 02:01:05 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
This is the first indication of trouble. Notice that it’s not very threatening–
Hardware_ECC_Recovered just barely changed and it immediately flipped back to its old value. Plus, it’s marked as a “usage attribute,” which indicates that it’s non-threatening. Continuing:
Dec 13 04:50:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
Dec 13 06:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 13 07:20:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 13 09:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 11:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
Dec 13 13:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 13 21:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 21:50:57 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 13 21:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
This is the first time that
Hardware_ECC_Recovered reoccurred after the first occurrence on the 8th. I left the
Seek_Time_Performance lines in, just to show that the ECC lines aren’t particularly common–the Seek Time lines show up every couple hours, day in, day out.
The ECC notices continue, showing up again on the 16th, 18th, 25th, and again at 5:20 AM on the 1st. That’s where things start getting interesting:
Jan 1 03:20:57 starting scheduled Long Self-Test.
Jan 1 03:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 251
Jan 1 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Jan 1 05:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252
Jan 1 05:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 253 to 252
Jan 1 05:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252
Jan 1 05:50:56 Self-Test Log error count increased from 0 to 1
Jan 1 06:20:55 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Jan 1 06:20:55 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
At this point, I hadn’t seen any actual errors yet, but the drive’s SMART self-test had spotted a bad sector. The 2nd and 3rd were basically the same–the self-test reported that the same sector was still bad. All hell started to break loose on the 4th:
Jan 4 02:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 252 to 251
Jan 4 02:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 252 to 253
Jan 4 07:35:40 ATA error count increased from 980 to 981
Jan 4 08:35:40 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Jan 5 02:05:42 starting scheduled Short Self-Test.
Jan 5 02:35:40 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252
Jan 5 02:35:40 Self-Test Log error count increased from 3 to 4
Jan 5 06:36:08 SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 253 to 252
Jan 5 06:36:08 SMART Usage Attribute: 197 Current_Pending_Sector changed from 253 to 252
Jan 5 06:36:10 ATA error count increased from 981 to 1293
Jan 5 07:14:45 ATA error count increased from 1293 to 2377
By this point, I was seeing errors in the filesystem. Syslog was filling up with 3ware and XFS errors about disk problems. Things were starting to suck. On the 6th, I ordered new drives, and this morning I started installing them. I’m currently attempting to recover whatever data I can off of the bad disk.
So, there are a couple things that we can learn from this. First, if I’d been paying attention and immediately migrated data off of the failing disk as soon as SMART told me that it had developed a bad sector, then I’d probably have been okay. It took 2 or 3 days before the problem got bad enough to be visible at the filesystem level. Second, if I’d had enough familiarity with this particular Maxtor drive, then I should have noticed that something weird was happening when the ECC errors started climbing. None of my other Maxtor drives have ever logged an ECC message; that makes the
Hardware_ECC_Recovered message look kind of suspicious, but that probably only holds for this exact family of Maxtor drives. In a commercial environment, where I had dozens or hundreds of similar drives, I’d want to tell my log monitoring software to pay special attention to that message, because it looks like a good indicator of drive failure.
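As a sketch of what that log monitoring could look like, a one-liner is enough to count ECC-recovery notices per day, so a sudden uptick stands out. The sample input here is made up, patterned on the log lines above:

```shell
# Count Hardware_ECC_Recovered notices per day in smartd's syslog output.
# (In real life, feed in /var/log/syslog or similar instead of a here-doc.)
ecc_per_day=$(awk '/Hardware_ECC_Recovered/ { count[$1 " " $2]++ }
                   END { for (day in count) print day, count[day] }' <<'EOF'
Dec 13 04:50:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
Dec 13 06:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
EOF
)
echo "$ecc_per_day"    # prints: Dec 13 2
```

A day that suddenly shows several of these would be worth a closer look, at least on drives from this family.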
More importantly, though–if I’d been paying closer attention to my 3ware card, I would have noticed that this 4-drive RAID 5 array was running in degraded mode before the drive failed. If I’d fixed that, the drive failure wouldn’t have cost me any data–the array would have dropped the failing drive and warned me, and that would have been that. Instead, I’m looking at a weekend’s worth of hassle as well as some data loss. When I get everything back up and running, I’m probably going to switch from the 3ware card’s hardware RAID 5 to software RAID 5–I trust Linux’s RAID monitoring tools more than I trust 3ware’s. Also, I was only getting ~25 MB/sec writing with the 3ware’s hardware RAID 5, while I should get closer to 100 MB/sec with software RAID 5.
Well, this isn’t looking promising–my new drives arrived, but the big RAID array is throwing errors left and right. I’m not sure how much data I’ll be able to recover off the thing. Most of the data on the drive is reconstructible, but not everything. Most of its contents are old digital pictures, but I’ve tried to write them all to DVD before throwing them onto the RAID array. Odds are I missed some stuff, though.
Amazingly enough, the system logs seem to have survived unscathed, so I’ll write up an “anatomy of a drive failure” article later, showing what this looked like from a SMART perspective. Since I’ve never actually seen a SMART-monitored drive failure before, it should be somewhat educational.
If you’ve been living under a rock, then you might not have noticed that PalmSource has announced that they’re going to be building a version of the Palm operating system that runs on top of Linux. It’s not completely clear what this means–are they replacing the kernel in PalmOS 6 with Linux, or is this a parallel project, intended to fit into new niches? PalmSource released an open letter to the Linux community that provides a few details:
- Existing 68k-based Palm apps will work fine.
- Apps based on the new Cobalt API will need to be recompiled.
- ARM-based apps for PalmOS 5 aren’t mentioned; it’s probably safe to assume that most of them will break.
- They’re going to enhance the Linux kernel as needed and contribute their changes back to the community.
- It’ll be possible to run Linux apps underneath their UI, but if you want a user interface, you’ll need to use their API. In other words, it’ll be possible to run things like Apache and MySQL on PalmOS for Linux, but not X applications.
- Their licensing model for PalmOS itself isn’t changing–they’re still licensing the whole package to hardware manufacturers and expecting them to port it to their hardware. Presumably, this will become easier when using Linux, because it comes with more drivers, and Linux driver programming is an easier skill to hire for than PalmOS driver programming.
Of course, that glosses over most of the important issues. Particularly, is any vendor actually going to ship this? Ever? PalmOS 6 (“Cobalt”) was released to manufacturers at the end of 2003, and not only is there no PalmOS 6 hardware available, there aren’t even any rumors of any on the horizon. It’s unclear if PalmOne will ever ship a PalmOS 6 device. It’s entirely possible that the only PalmOS 6 hardware to ship in 2005 will be from a fleet of small Asian contract manufacturers building for local markets, although Samsung may have something up their sleeves.
Given the glacial rate of PalmOS 6’s adoption, PalmSource will probably be best off focusing all of their attention onto PalmOS for Linux and calling it PalmOS 7, because there’s no way they can carry three software lines–PalmOS 5, PalmOS 6, and PalmOS for Linux. Since current PalmOS 6 applications won’t be binary-compatible with PalmOS for Linux, there’s no way they can call it PalmOS 6.2 and pretend that it’s an extension of the current 6.x line. If they’re going to push a Linux product at all, then they need to push it hard, and they can’t push two “next generation” products that are mutually incompatible.
Which brings up the big question: when will it be ready? After reading their press releases, I don’t think they’ve been working on this for very long. They certainly aren’t ready to ship anything, and I’d be surprised if they actually have much more than a proof-of-concept port in-house. On the other hand, they have a solid, well-known base to work from, so it’s not like they have to fight with alpha-grade build tools, flaky OSes, and all of the other moving targets that they presumably had to deal with when building PalmOS 6. Porting the current PalmOS to run on top of the Linux framebuffer device shouldn’t be very hard. Adding support for Linux’s network stack might be interesting–as I recall, PalmOS 5’s TCP stack lived entirely in user space, so the API might not be very close to the traditional BSD socket API, but I don’t really know. Porting 68k apps will be easy; they already have an emulator that runs on Linux and has for years. Adapting it to the new framework shouldn’t require a whole lot of work.
Unfortunately, the one thing that will probably be hardest is the thing that makes PalmOS so unique–its filesystem, or rather the lack of one. Traditionally, PalmOS applications didn’t really have the notion of saving or multitasking–everything lived in RAM, and switching between programs didn’t involve a whole lot of extra effort. Applications kept their data organized into databases, not files, and they edited the databases directly, without any sort of “save” step. This meant that switching between apps was fast and gave a good user experience for simple applications, but it hasn’t scaled well because it doesn’t provide an easy way to manage block-based storage, like external flash cards or internal hard drives. Instead, PalmOS has had to add a whole extra API for accessing filesystem-based devices, and this has left us in a state where some applications won’t run off of flash cards, and many applications are unable to access data saved on flash cards.
With a virtual-memory based OS like Linux, it’s possible to fake a lot of this with
mmap, but that isn’t ideal when you’re dealing with flash cards–it’s easy to wear out most flash cards today by sending them thousands of small writes, and that’s what I’d expect to see when changing a
mmapped database. Also, what happens when a flash card is ejected while an application has a file mapped? Linux is never happy when removable devices go away, but causing applications to crash just because the card was removed is seriously user-unfriendly. If mmap won’t work, the big alternative is to copy things to RAM transparently and then copy them back out when done, but that will push the memory requirements up, which will push up costs and limit battery life.
Given all of this, I’d be surprised to see a PalmOS for Linux device before mid-2006, and that’s a long ways away. It’s not clear that the Palm world can wait for another year and a half, falling further and further behind the networking and multitasking abilities of their PocketPC-based competitors. Given that, PalmSource must be feeling a lot of pressure from their licensees to switch to Linux, or they wouldn’t have made this announcement at all.
I made a bit more MythTV progress today. DVD playing now works perfectly. I had had three problems:
- Audio was really quiet. After upgrading mplayer, I decided that this was really an issue with my receiver–it was decoding analog Dolby Surround correctly, but it wasn’t really configured for my speakers. A little bit of fiddling and it’s acceptable, if still a bit quiet. The reference source that I was using for comparisons is really loud–the meter on the receiver is peaking out all the time, while Finding Nemo (my DVD test today) is really just about where it should be.
- Mplayer was dropping frames while playing DVDs, but DVD rips played just fine. DMA wasn’t enabled on my DVD drive. Once I fixed that, it became perfect.
- I couldn’t eject DVDs without opening up a shell and ejecting /dev/cdrom by hand. I’m not sure what was up here, but something in KnoppMyth was automounting /dev/cdrom every few seconds. I commented out the entry in /etc/fstab, and everything seems okay–I can still play DVDs, and the eject button on the drive works now.
At this point, MythTV is an acceptable DVD player for me. It still isn’t perfect–it takes too many button pushes on the remote to start playing, and the remote buttons aren’t mapped quite right. In other words, it’s still kind of complex, but it works fine once you get through the complexity.
On the other hand, the image is stunning on the projector. I think the jump from NTSC DVD player to VGA DVD player is almost as big as the jump from VHS to NTSC DVD, at least in my setup.
Slashdot has an article this morning on the OpenBSD people’s new BGP daemon, OpenBGPD. In essence, the OpenBSD people did the same thing that they’ve done repeatedly before: they took a protocol that didn’t have an open, secure implementation and provided a clean, minimalistic, BSD-licensed tool.
Personally, I find OpenBGPD kind of fascinating, because I’ve worked with router jockeys for years, and I get dragged into “can we run a BGP daemon on this PC” discussions with surprising frequency.
OpenBGPD’s stated goals include this fun little snippet:
Provide a lean implementation, sufficient for a majority. Don’t try to support each and every obscure usage case, but cover the typical ones
And that’s where my problem lies. I don’t think I’ve ever been asked for a “lean implementation” of BGP. Every time I’ve been dragged into a BGP discussion, it’s been because network engineers have been trying to do something bizarre and creative with BGP, and the tools that they’re used to using aren’t sufficient. For instance, at Internap, we wanted to add per-prefix, per-peer prepending for a huge number of prefixes, and we wanted to change the path selection algorithm to include a bunch of extra information that we had on reachability and performance. In other cases, I’ve been asked for simulators and BGP loggers that could feed BGP prefix reachability information into a database. Inevitably, every time someone needed just a “lean implementation,” they’d already have a Cisco box handy and they’d use it instead of monkeying with BGP on a PC.
That’s not to say that PCs make lousy routers or anything like that–the price/performance is impossible to match with anything from Cisco–but the total costs involved in any BGP peering that I’ve seen make the cost of the router little more than noise in the equation. If you’re paying tens of thousands of dollars per month for multiple pipes to providers, then what does saving $20k on a router buy you, besides maintenance and reliability headaches and a hard time finding network engineers familiar with your setup? Most of the time, it’s cheaper to spend $20k on hardware and make it up on productivity and reduced downtime.
So, while OpenBGPD is cool, I’m not sure how useful it really is outside of test labs and maybe small ISPs, if there are any of them left. On the other hand, I’d love a good OpenBGPD-ish OSPF implementation. I’ve played with Zebra, and the whole design of the thing just rubs me the wrong way (although Quagga might be better). I need to remember to actually give Xorp a try, too. OSPF is more useful inside of existing networks, and it makes a lot more sense on a LAN than BGP does.
When it gets down to it, I suppose my real point is this: it’s largely pointless to scale PC-based routers up to make them compete toe-to-toe against Cisco’s big WAN routers, because the network costs and the maintenance costs of doing one-off routers work against us. It’s also really hard to get reliable, well-tested WAN interface cards for anything faster than a T1. Try finding a PCI OC-12 POS card with Linux drivers sometime.
On the other hand, other alternatives make a huge amount of sense:
- Scale them down. You can build a cheap Linux router for almost no money these days–look at the Linksys WRT54G.
- Scale them out. Imagine a medium sized company replacing all of their assorted branch office routers with PCs talking to DSL and providing QoS, routing, firewalling, VPNs, VoIP, etc. It’s expensive to do it once, but you can replicate the work onto a hundred devices for very little additional cost.
- Push them into niches. There are cases where the fantastic flexibility of PCs can make them much more useful than an equivalent Cisco. Linux, for example, has no problem running multiple routing tables and a fantastic number of firewall rules. You can do amazingly creative things with just the stock tools, if you can figure out how to use them.
Pretty much everyone is reporting that Sharp has announced a new Japan-only Zaurus, the SL-C3000. The big news is that Sharp has stuffed a 1” 4GB hard drive into this model, making it the first PDA on the market with a built-in hard drive.
I’ve been predicting that this would happen for a while; I’m kind of surprised that it’s taken this long.
That reminds me, I need to produce a set of PDA predictions for 2005 and score how my 2004 predictions went.
I just noticed that Busybox 1.00 has been released. If you aren’t familiar with it, Busybox is a software package that provides about 90% of what you need to build a simple Linux system (init, shell, cp/mv/ls/rm, ifconfig, dhcp, etc). It’s commonly used in embedded systems and on install disks.
I’ve found it fantastically useful on a number of occasions, just because it makes it *so* easy to build your own miniature Linux distribution for things like recovery disks, install disks, appliances, and so forth. Busybox, a kernel, a couple empty directories, a few devices in
/dev, and a bootloader, and you’re set to go.
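That recipe is short enough to sketch out. The applet list and paths here are just for illustration, and a real system still needs device nodes (mknod, as root), an inittab, and the bootloader itself:

```shell
# Skeleton of a busybox-based miniature root filesystem. Each applet is a
# symlink back to the single busybox binary, which picks the applet to run
# from argv[0].
ROOT=miniroot
mkdir -p "$ROOT/bin" "$ROOT/dev" "$ROOT/etc" "$ROOT/proc" "$ROOT/tmp"
# Copy in a statically linked busybox, if one is handy (placeholder path).
cp /bin/busybox "$ROOT/bin/busybox" 2>/dev/null || :
for app in sh init mount ls cp mv rm ifconfig; do
    ln -sf busybox "$ROOT/bin/$app"
done
# Still missing for a bootable system: /dev/console and friends (mknod),
# /etc/inittab for busybox init, a kernel, and a bootloader.
```

Point a kernel at it as the root filesystem and busybox init takes it from there.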
So, suppose you’re setting up a test environment, and you want a server to be able to handle some improbably large number of IP addresses, like a /16 or even larger. You could just write a script to add them all one at a time, or you could use this little shortcut and add the entire netblock at once:
# ip route add local 10.0.0.0/8 dev eth0 proto kernel scope host table local
That’ll make the entire
10/8 block local to the box. It’ll answer pings sent to any address in the 10.x.x.x space, and you can bind to any address in the netblock and use it as the source address for packets. For all intents and purposes, it’ll act just like the addresses were added with
ip addr add or the like.
I used this last week when I needed to simulate a network with 4,000+ hosts all sending UDP traffic to a specific server. It was a piece of cake.
I’ve seen this a few places online, but it’s never been very easy to find. To the best of my knowledge, it’s worked with Linux back to at least 2.0 or so; I’ve used it with 2.4 and 2.6 recently.
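Once the route is in place, it’s easy to convince yourself it worked. A quick sketch (this assumes the route from above is installed; the addresses are arbitrary picks from the block, and netcat’s source-address flag varies between versions):

```shell
# the kernel now treats the whole block as local
ip route show table local | grep '^local 10.'

# any address in the block answers pings...
ping -c 1 10.37.129.5

# ...and can be used as a source address, e.g. with netcat
nc -s 10.200.1.1 192.168.0.10 8080
```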
Re-reading my “One Year of Blog” post reminded me of another anniversary that I’d forgotten: I’ve now been using Linux for over 12 years. I started in mid-August 1992, with Linux 0.97. It’s grown a bit since then; 0.97’s
tar.gz file was around 288 kB, while Linux 2.6.8 is over 44 MB. That’s a factor of 154 growth in 12 years, or a compounded rate of around 50% per year. At that rate, the compressed kernel source will be too big to fit onto a CD by 2011.
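That arithmetic is easy to double-check. Taking 44 MB as 44 × 1024 kB gives a slightly bigger factor than the 154 above (the exact number depends on the real file sizes), but the growth rate lands in the same ballpark:

```shell
# Sanity-check the kernel-growth figures: 288 kB in 1992, ~44 MB in 2004.
growth=$(awk 'BEGIN {
    factor = (44 * 1024) / 288            # overall growth factor
    rate = exp(log(factor) / 12) - 1      # compound annual growth over 12 years
    printf "factor %.0f, yearly rate %.0f%%", factor, rate * 100
}')
echo "$growth"    # prints: factor 156, yearly rate 52%
```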
Backups suck. They always have, they always will, it’s just the nature of the beast. The big problem is that open-source backup solutions tend to suck really bad, and decent commercial solutions obey Joel’s pricing law, where no enterprise software costs between $3,000 and $100,000. At Internap, we paid well over $100k to Legato for their buggy, inflexible backup software, and we paid a similar amount of money for backup hardware. That’s what it took to do good backups of several hundred machines per night. On a smaller scale, people seem to have decent luck with Arkeia, but I’m not all that convinced that it’s much cheaper than Legato or Veritas once you pile on dozens of clients.
I’ve wanted a decent open-source solution for years. In ‘97 or so, I used Amanda for a while, but it had a kind of bizarre approach to tapes, weird security needs, and it didn’t scale very well. It still seems to be under some sort of active development, but I was using version 2.4.0 in the late ’90s, and the current release is 2.4.4p3; draw your own conclusions. I’ve been spending around one afternoon per year since then looking for something free that can handle a couple dozen mixed Linux and Windows systems, and can back up to a mix of tape and disk, and I’ve come away empty-handed every time.
Until today. Somehow, Bacula slipped past me without my noticing it. Besides the catchy name and motto (“It comes by night and sucks the vital essence from your computers”), it seems to have the features that I’m looking for. It understands tape changers as well as filesystem backups. It’s network based with clients for Windows and Unix (and sort of OS X). It looks like it does an okay job at backup parallelism. It doesn’t have plugins for hot backups of popular databases, but most of the time, it’s easier to dump the database to a file and then back up the file, at least when you’re dealing with smallish databases.
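For instance, Bacula can run a script before each job (the Client Run Before Job directive in the director’s Job resource), so dump-then-backup only takes a few lines. Everything below is a placeholder sketch: the dump directory, the mysqldump invocation, and the assumption that your FileSet covers the dump path:

```shell
#!/bin/sh
# dump-db.sh -- run by Bacula before the nightly job (sketch; adjust the
# paths and credentials). The dump lands in a directory that the job's
# FileSet already includes, so the flat file gets backed up normally.
DUMPDIR=${DUMPDIR:-./db-dumps}    # placeholder; something like /var/backups/db
mkdir -p "$DUMPDIR" || exit 1
if command -v mysqldump >/dev/null 2>&1; then
    mysqldump --opt --all-databases > "$DUMPDIR/mysql-all.sql"
else
    echo "mysqldump not found; nothing dumped"
fi
```

The corresponding director config would name it with something like `Client Run Before Job = "/usr/local/bin/dump-db.sh"`.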
It’s currently at version 1.34, although Debian seems stuck at 1.32f4 for some reason. I’ll be doing some testing tonight at home, since I’ve been without a comprehensive backup solution for years, but I’m feeling pretty good about this.
Having said that, it still lacks a few things that I’d consider rather useful:
- It can’t migrate backups between storage devices. No staging from disk to tape, for instance.
- It doesn’t do anything with optical disks. I’ve had good luck with doing full backups of important data to CDs or DVDs; it’d be nice to integrate that into one central backup system. This is better for smaller-scale setups, though.
- It can’t stripe a single backup across multiple drives.
- It’s not clear if it can interleave backups onto a single tape. When you’re backing up slow hosts over a big network, this can be critical. For LAN usage with cheap tape drives, it’s less critical.
They maintain a to-do list online. I haven’t really looked into their security history; there are a couple worrisome comments about
sscanf in their to-do list, but it can’t be any worse than Legato Networker–it had holes that you could drive a bus through, along with a few really fun failure modes. We triggered a great DDoS on ourselves one night in 2000, with every single backup client in the company sending 64-byte UDP packets to our backup server as fast as they could transmit them. We really loved them for that.
Update (8/6/2004): I let it rip last night. First, its UI needs work. Most configuration is done via text files, which is okay, but monitoring and management happen through either a bad CLI without command completion or a bad GUI that is basically just the CLI in a window with a scrollbar. Windows backups don’t include open files *or* the registry. Mac backups don’t include resource forks, but I knew that going into it. Finally, I’m not getting any email notification of success or failure, but that might be a mail problem on my backup server. All in all, it’s still better than most of the free solutions that I’ve looked at, but still not 100% there. It’s usable if you only care about Unix systems and you’re willing to spend a bit of time learning and scripting. It’s not usable if you’re looking for an out-of-the-box system that’ll work with Windows.
A friend suggested that I look at Box Backup. It doesn’t do tapes at all, which is a bit of a shortcoming in my mind, but it looks like it might be better for pure-Unix backups. On the other hand, it’s designed to do continuous backups, not nightly snapshots, which is probably a better design, if you can handle the load.
I should really write the backup strategy guide that I keep meaning to write. Leave a comment if you’re interested.