So, as part of my new home server series, I want to explain why I’m using OpenSolaris instead of Linux.
I’ve used Linux since 0.97.1, in August of 1992. I’ve had at least one Linux box at home continuously since 1993 or so. I’ve had a few small chunks of my code added to the kernel over the years. I’ve built several install disks and one embedded appliance distro from scratch, starting with a kernel and busybox and going on up from there. I’ve written X drivers, camera drivers, and drivers for embedded devices on the motherboard. I’ve managed Great Heaping Big Gobs of Hardware at various jobs. Basically, I know Linux well, and I’ve used it for almost half of my life.
That in itself might mean that it’s time for a change–professionally, I’ve been very tightly focused on Linux, and diversity is a good thing. But that’s not why I’m using Solaris this week. I’m using it because I’m fed up with losing data to weird RAID issues with Linux, and I believe that OpenSolaris with ZFS will be substantially more reliable long-term. Things I’m specifically fed up with:
- md (the Linux RAID driver)’s response to any sort of drive error, even a transient timeout, is to kick the drive from the array, no matter what. Most of the IDE drives that I’ve had over the years have been prone to random timeouts every few months, at least once you bundle more than 2 or 3 of them in a single box and then try snaking massive ribbon cable through the case. My SATA experiences haven’t been substantially better. Linux will happily bump an otherwise working 4-drive RAID 5 array to a 3-drive degraded RAID 5 array on the first failure, and then on to a 2-drive failed array on the second failure. Even when a simple retry would have cleared both errors. This has cost me data repeatedly, because I’ve been forced to manually intervene and re-add “failed” disks to RAID arrays. If I was too slow, then a second drive failure risked total data loss. Even worse, these random transient failures blind you to real drive failures, like the one that ate my NAS box last weekend.
- Actual drive failures can hang the kernel. I’ve had at least 3 cases at home where broken drives either caused system lockups or completely kept the system from booting. That sucks. Odds are some drivers are good while others are broken; apparently I’ve just had bad luck.
- None of Linux’s filesystems are particularly resilient in the face of on-disk data corruption. Compare with ZFS, which checksums everything that it reads or writes.
In short: everything works great when things are perfect, but building a reliable multi-drive storage system requires careful component and kernel compatibility work, and then you have to stay right on top of things if you want everything to keep working. When things stop working, they usually fail badly. That’s almost the complete antithesis of what I want for home: plug it in, and it just keeps working. I don’t want small failures to cascade through the system. Little failures should be isolated, identified, and automatically repaired whenever possible. OpenSolaris and ZFS seem to provide that, while Linux with md and ext3 does not.
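To make the checksum point concrete, here is a minimal sketch of the general idea: record a checksum when a block is written, and refuse to return data that no longer matches it on read. This is an illustration only, not ZFS's actual code or on-disk format; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def write_block(data):
    """Return the block plus the checksum recorded at write time."""
    return data, hashlib.sha256(data).digest()


def read_block(data, checksum):
    """Return the data only if it still matches its recorded checksum."""
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return data


block, digest = write_block(b"family photos")
corrupted = b"family phot0s"  # simulate a single damaged byte on disk

read_block(block, digest)  # intact data is returned normally
try:
    read_block(corrupted, digest)
except IOError as err:
    print("detected:", err)
```

With a redundant copy or parity available, a filesystem that notices the mismatch can go fetch a good copy instead of silently handing back garbage, which is the behavior I'm hoping for.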
That’s why I’m planning on using ZFS. My logic for building a server vs. buying another little NAS box is simple: none of the little NAS boxes on the market use ZFS right now, and none of the cheap ones have room for more than 5 drives. I’m planning on using a double-parity system (RAID 6 or ZFS’s raidz2, where the system can cope with a 2-drive failure) plus a spare drive, and that’d only leave me with 2 data disks. The only way that I can get enough storage with only 2 disks would be to use 1TB drives, and they’re too pricy right now.
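The drive-count arithmetic is simple enough to write down. This is a rough sketch; the function names are made up for the example, and the 8-bay, 500 GB case is a hypothetical comparison rather than a build I've priced out.

```python
def data_disks(bays, parity=2, spares=1):
    """Disks left for data after double parity and a hot spare."""
    return bays - parity - spares


def usable_gb(bays, drive_gb, parity=2, spares=1):
    """Approximate usable capacity, ignoring filesystem overhead."""
    return data_disks(bays, parity, spares) * drive_gb


print(data_disks(5))       # 5-bay NAS box: 2 data disks
print(usable_gb(5, 1000))  # 2000 GB, but only with 1TB drives
print(usable_gb(8, 500))   # hypothetical 8-bay box with 500 GB drives: 2500 GB
```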
So, I’m willing to spend the time to build a somewhat complex server because I believe (hope?) that it’ll save me time in the future, and it’ll let me avoid ever having to do the reconstruct-from-the-source dance again. I don’t think I lost anything critical last weekend, and I’m reasonably confident that I’ll be able to get things limping along well enough to recover data anyway, but I’ve now done this 3 times in the past 4 years, and I’ve had it.
Coming up soon: backups, OpenSolaris hardware compatibility, and GC-RAMDISK performance benchmarks. Stay tuned :-).
CNet says that there’s a security bug in zlib 1.2.2. There’s no exploit yet, but since everything uses zlib, this will probably turn into a problem for those who don’t upgrade to 1.2.3 once it becomes available.
Since libpng and OpenSSL both use zlib, we’re going to see a lot of network-based programs with issues.
According to a number of sources, Nokia has just announced a new tablet-like wireless internet device, the Nokia 770. No one really seems to know what to do with it–it’s slightly larger than a PDA with a 4.13 inch 800x480 LCD, 802.11 and Bluetooth, 64 MB of RAM, 128 MB of flash, and an RS-MMC socket. Nokia’s positioning it as a cheaper, more portable alternative to the laptop, and equipping it with a web browser and email software. There have been a number of products with similar aims in the past, but none of them have been able to achieve any amount of success.
The 770 will probably fail, too. It does have a couple of things going for it, though–it’s a relatively open platform (it supposedly runs Debian Linux), and the software for the device is open-source. The hardware is surprisingly capable for the cost–at $350, this is cheaper than any PocketPC with a VGA screen. It’s a bit limited on the storage front, with room for only a single RS-MMC card (up to 512 MB), but that’s not really all that bad.
Personally, I wouldn’t mind something like this, but I’d be tempted to use it as a portable video player, and I doubt that the 770’s 200-ish MHz OMAP chip has enough oomph to play back video at any reasonable resolution and frame rate.
I’m not really sure what Nokia has up their sleeves here. On one hand, the hardware looks pretty good. Unfortunately, the software is brand new and doesn’t seem to include any PDA-type features–it’s focused entirely on web browsing (using a scaled-down Opera), email, and RSS reading. If Nokia can keep the platform alive for a year or two, it might gain enough support to be interesting, but as it stands I don’t see how it’ll have much of a chance in the market.
Newsforge is running an interview with the three main participants in The Great Linux SCM Saga, Linus, Larry McVoy, and Tridge. By and large, it’s a good article, but I suspect that someone who didn’t know the people involved would assume that the whole mess was Tridge’s fault–he’s the one that was working on cloning BitKeeper, even though any sane person would know that it would really piss Larry off. Even after people pointed this fact out to him, he kept working on his BitKeeper tools.
I’d be remiss if I didn’t point out that Tridge has a history of doing this sort of thing. I’m aware of two other cases where he’s dug in and reverse-engineered similar sets of protocols and file formats. The first time, the result was Samba, which was (and still is) really one of Linux’s first killer apps. The second time, he decoded TiVo’s on-disk media format. Pretty much any tool on the net that knows how to extract video from TiVos (except for TiVo’s recent TiVo-to-Go release) is based on Tridge’s work.
That’s not to say that reverse-engineering is all that he does–rsync is his too.
I remember people questioning his ethics during his TiVo work–besides just downloading video from TiVos, his work could (in theory at least) allow someone to buy a TiVo and feed it program guide information without paying TiVo’s monthly subscription. Without that, TiVo’s revenue model falls apart, and the company would be forced to either sue their own users or go out of business. The BitKeeper folks might have paid attention to how he handled the TiVo issue–as I recall, he released the video download code, but kept the programming guide code to himself. In some ways, that actually helped TiVo–I had no qualms about buying a second TiVo, even when their financial footing was shaky. Without Tridge’s programming guide code, a TiVo box without TiVo, Inc would just be a big paperweight. Just knowing that the program guide code existed was enough to ensure that my TiVo would continue to be useful, because someone would pick up the torch if TiVo fell.
I don’t know what Tridge was planning to do with his BitKeeper tool, but based on his past record, I really doubt that he would have used it to sabotage BitMover. Or, at least not to do anything that he saw as sabotage. Clearly Larry McVoy (and to some extent Linus) saw things differently.
Getting off of BitKeeper is probably best for Linux in the long run. It’s a pity we couldn’t have waited for another year or so for open-source SCM software to mature more, though. There are a number of promising contenders, but they all have issues that keep them from being usable for the Linux kernel today.
I finally have Xen working on a system at home. I hadn’t expected this to be very difficult, but apparently Xen doesn’t like my new Athlon 64 system (bought mostly for running Xen). They’ll fix it eventually, but for now I’m using an old Athlon 700 system that I had sitting around. It needed a new CPU fan (just try finding Slot A fans these days!), but I was able to scrounge up 512 MB of RAM and an 80 GB hard drive, so it’s perfectly usable.
I built a couple quick disk images and booted them under Xen, and everything worked as expected. This is always a good sign, and it suggests that I’ll be able to make progress on my little virtual-server project without a whole lot of trouble.
It’s sort of an axiom of programming that features that aren’t continually used or tested won’t actually work. A similar rule holds for system administration–any feature that hasn’t been tested since the last upgrade is probably broken. An obvious corollary suggests that systems get more reliable as their user load increases–more users means more features are used more frequently, and broken features will be spotted sooner. And the corollary to that is that any server wedged under a desk in someone’s home office is probably flakier than hell because it’s probably just sitting there collecting dust and not getting used.
I’m not convinced that that applies to my home gateway box. It’s a busy little beaver:
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes prot opt in     out     source               destination
 234M   75G all  --  dsl0   *       0.0.0.0/0            0.0.0.0/0
  47M 1001G all  --  eth0   *       0.0.0.0/0            0.0.0.0/0
In the 25.75 days since I last rebooted this system, it’s received over 75 GB via its DSL link and around 1 TB over its main Ethernet link. If my math is right, that’s an average of 3.6 Mbps on the Ethernet link and around 270 kbps over DSL. I wasn’t keeping outgoing traffic stats when I first booted this box, but more recent estimates make it look like there’s almost as much outgoing traffic on dsl0 as there is incoming.
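For anyone who wants to check the math, here's a quick sketch of the calculation. It assumes the counters' "G" suffix means 10^9 bytes; the helper name is made up for the example.

```python
def average_bps(total_bytes, days):
    """Average bit rate over an uptime given in days."""
    return total_bytes * 8 / (days * 86400)


uptime_days = 25.75
eth0_bps = average_bps(1001e9, uptime_days)  # 1001G received on eth0
dsl0_bps = average_bps(75e9, uptime_days)    # 75G received on dsl0

print(f"eth0: {eth0_bps / 1e6:.1f} Mbps")  # eth0: 3.6 Mbps
print(f"dsl0: {dsl0_bps / 1e3:.0f} kbps")  # dsl0: 270 kbps
```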
CPU load is similarly heavy–the box has averaged 51.9% idle since it was rebooted. My rule of thumb for years was that any production box that was under 80% idle was due to be upgraded soon, because it was probably pegging the CPU during peak times during the day. If the box was under 70% idle, then it was time to start scrounging for an immediate upgrade. By those metrics, this box is way overdue for a major upgrade. Fortunately for my wallet, those metrics don’t really apply to this box–it’s spending a lot of its CPU time on tasks that aren’t particularly critical. Also, Linux 2.6 made some changes that procinfo doesn’t seem to have picked up on; once you factor those into the equation, the box is really closer to 75% idle. Subtract off the non-critical usage, and the system is probably only 10% busy. I’ll probably upgrade it later this year if my virtual-server project works out, but that’s more for security and reliability than pure performance.
There’s an interesting thread going on right now on the Linux netdev mailing list, speculating about the network accelerator technology that Intel’s been talking about recently. No one’s quite sure what Intel is planning on adding, but for the past several years “network accelerator” has usually meant TCP offload engines (ToE), and Linux’s core networking guys are almost famously anti-ToE. Even though no one really knows what Intel’s up to, there’s a feeling that it’s not just ToE this time.
Several people have pointed out other technologies that can make a huge difference without requiring the sorts of compromises that ToE needs to work. For instance, this post by Lennert Buytenhek suggests that PCI and memory system latency is a big problem, but fixing it can have huge payoffs:
The reason a 1.4GHz IXP2800 processes 15Mpps while a high-end PC hardly does 1Mpps is exactly because the PC spends all of its cycles stalling on memory and PCI reads (i.e. ‘latency’), and the IXP2800 has various ways of mitigating this cost that the PC doesn’t have. First of all, the IXP has 16 cores which are 8-way ‘hyperthreaded’ each (128 threads total.)
I haven’t paid much attention to Intel’s IXP network processor family in the past, and that may be a mistake–from the description here, the IXP2800 sounds like a cross between Tera’s multithreaded CPU and IBM’s new Cell processor. Tera’s CPU, which was designed to support tons of threads, automatically switches between threads whenever one thread blocks due to I/O or memory access. The goal with Tera was to be able to remain efficient while the gap between CPU and memory speeds continued to grow. The IXP2800 isn’t as ambitious as the Tera, but the fundamental concept looks similar–support lots of threads in hardware, and switch when latency gets in the way. The IXP2800’s threaded CPUs aren’t full-blown processors, though–like the Cell, the IXP2800 contains one main CPU and a cluster of smaller domain-specific processors that are specialized for one specific task.
It’s unlikely that Intel will roll something like this into their Xeon CPUs anytime soon, though. It’s certainly not a quick fix–it’d require major changes in any OS that wanted to make use of it, and would probably take 3-6 years before it was really fully utilized.
Massively-multithreaded CPUs aren’t the only approach that has paid off for dedicated network processors, though. Some of FreeScale and Broadcom’s chips know how to pre-populate the CPU’s cache with headers from recently-received packets. This drastically cuts latency, but it seems to require that the CPU and network interface be very tightly coupled. Reducing the overhead needed to talk to the NIC can help, too–apparently some of Intel’s 865 and 875 motherboards use a version of their GigE chip that is connected directly to the north bridge, bypassing the PCI bus entirely, and some benchmarks show substantial improvements.
Reading the thread suggests that most of the effort going into Linux network optimization in the next few years will be happening on the receive end of things. Over the past several years, most higher-end NICs have added limited support for checksum generation and TCP segmentation offloading (TSO), where the CPU can hand the NIC a block of data and a TCP header template, and then have the NIC produce a stream of TCP packets without requiring the CPU to touch the data at all. Relatively little has happened on the receive side, but this seems to be changing. For example, Neterion’s newest card can separate headers from data, and is nearly able to re-assemble TCP streams on its own, sort of the inverse of transmit-time TSO. It’s not clear how many streams the card can handle at a time, though–even my little web server at home is currently maintaining 384 simultaneous TCP connections, and a busy system could easily have tens or hundreds of thousands of open streams. Odds are, throwing 100,000 streams at the card would run it out of RAM and completely negate any benefit that receive offloading would have. Unless it’s bright enough to be able to handle the 1,000 or so fastest streams and then let the main CPU handle the 99,000 that are dribbling data at 28k modem speeds.
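As a rough illustration of what TSO and its receive-side inverse do, here's a toy sketch: one big block plus a starting sequence number goes in, MSS-sized segments come out, and the receive side stitches them back together. It isn't modeled on any particular NIC; the names, the 1460-byte MSS, and the lack of sequence-number wraparound, checksums, or real headers are all simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int        # sequence number of the first payload byte
    payload: bytes


def segment(data, start_seq, mss=1460):
    """Transmit side: split one large block into MSS-sized segments."""
    return [
        Segment(seq=start_seq + offset, payload=data[offset:offset + mss])
        for offset in range(0, len(data), mss)
    ]


def reassemble(segments):
    """Receive side: put segments back in order and join the payloads."""
    ordered = sorted(segments, key=lambda s: s.seq)
    return b"".join(s.payload for s in ordered)


block = bytes(4000)
segs = segment(block, start_seq=1000)
print([(s.seq, len(s.payload)) for s in segs])
# [(1000, 1460), (2460, 1460), (3920, 1080)]
```

The per-stream state a card would need for the receive half (expected sequence number, buffered payloads) is exactly what makes the how-many-streams question matter.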
This is a fascinating topic, and I can’t wait to see how this will turn out.
As regular readers know, I recently turned up a new DSL circuit at home, replacing an older, slower line that Verizon had refused to upgrade for months. As part of the upgrade process, I needed to buy a new DSL modem. Instead of using an external DSL modem (DSL-Ethernet bridge would probably be more accurate, but “modem” seems to have stuck), I decided to buy a Sangoma S518 PCI ADSL modem. I had two main reasons for preferring this internal modem to a generic external model:
- Better control over upstream buffering, for better VoIP QoS.
- Better visibility into the modem’s state, so I can syslog minor outages and notice things like speed changes.
I chose the Sangoma model instead of a cheap, generic card because the manufacturer strongly supports its use with Linux, and a number of people on the Asterisk-Users mailing list have recommended it. I paid $115 plus shipping from BSD Mall.
Packaging and physical installation
The card arrived via UPS about a week after I ordered it from BSD Mall. Historically, Sangoma has mostly made cards for T1s and leased lines; the S518 is their lowest-end product, but it uses the same configuration tools and drivers as their more expensive cards. Since this is Sangoma’s only product line, and Linux is a big part of their market (and has been for over 10 years), the drivers are much more mature and stable than you’d really expect for a DSL card.
The packaging for the card is fairly generic–it looks like Sangoma uses the same box for all of their products. Inside were an RJ11 cord, the PCI card (wrapped in an anti-static bag and bubble wrap), a manual, and a CD with drivers.
I downloaded their most recent drivers from their web site a few days before the card arrived and pre-installed them, so I’d be ready to install the card as soon as it arrived. After untarring the drivers, all you have to do to install them is run ./Setup install and their setup script takes care of everything else. Rather impressively, it located the source for the 2.6 kernel that I’m running, patched it (after asking my permission), built new kernel modules for everything needed, installed them, and then compiled and installed the user tools for configuring their interfaces. It also installed a startup script into /etc/init.d/wanrouter and created all of the right links in /etc/rc*.d to make sure that it starts on boot. The fact that it all worked correctly on my Debian unstable system was rather impressive, and a sign that Sangoma’s been doing this for a while.
Configuring the card was also relatively easy. The setup program installed a tool called wancfg that provides a curses-based UI for setting up their network interface cards. It took me a couple minutes to tell it that I was about to install an S518, guess which ADSL encapsulation I’d need, and tell it to assign the new interface the name dsl0 and a dummy IP address.
Once the drivers were installed and configured, I shut the system down, installed the card in a spare PCI slot, and rebooted. The system came back up, loaded the S518 drivers successfully, set up the ADSL interface using the specs that I’d provided, and started training. After about 30 seconds, it told me that training had failed, and it couldn’t find a signal on the line. Since the line wasn’t supposed to be live for two more days, this didn’t seem like a problem. I left the interface installed in the box, perpetually attempting to re-train, and went to bed.
At 8:19 the next morning, Verizon finished configuring their end of the DSL line. The S518 immediately trained on the line, syslogging:
Feb 9 08:19:22 guam kernel: wanpipe1: ADSL Link connected (Down 1792 kbps, Up 448 kbps)
Feb 9 08:19:30 guam kernel: wanpipe1: Link connected!
About two seconds after logging the last line, the system locked up. I rebooted the box, only to discover that the system could no longer see the S518 card. Even lspci failed to detect anything–the card was locked up so hard that PCI bus probing no longer worked. I had to power the system down before I could access the card. I tried rolling the driver back to the previous stable version, without any luck, and rolled it forward to a newer beta, but that didn’t work, either. Everything was stable until it trained, so I unplugged the DSL line from the S518 and left for work.
During the day, I contacted Sangoma’s tech support department and asked what was wrong. They emailed me back within 15 minutes and asked a few questions – “is this a 64-bit system in 64-bit mode?” for example. They suggested several things that I’d already tried–rolling forward or back to new releases. They looked through my lspci output and noticed the Digium cards that I use with Asterisk and suggested removing them, just for testing. Their support engineer admitted he was grasping at straws–they had the same card with the same drivers and kernel working in the labs. They suggested re-compiling the driver, targeting it for a generic 386 kernel, instead of the Athlon-optimized version that it built by default.
That night, I tried the re-compiled drivers: no luck. I pulled both Digium cards: no luck. It still crashed immediately after training.
Finally, I moved the S518 to a new PCI slot, and discovered that that worked perfectly. So, either the S518 driver had a hard time sharing an IRQ with my Ethernet card, or I have a bad PCI slot in my system. Since it’s a cheap motherboard, and it’s almost 5 years old, I’m going to go with ‘bad slot’. I re-installed the Digium cards, cleaned everything up, and it all continued working perfectly.
I was very pleased with Sangoma’s support–they seemed competent, they responded quickly to my request for help, they asked sensible questions, made decent suggestions, and didn’t disappear once they ran out of easy fixes. Frankly, this is why I paid more than I had to for their card–good support. In the future, when people come to me for recommendations on T1 or ADSL cards, I’m going to recommend Sangoma without any reservations at all.
ADSL and IP configuration
By this point, Verizon had emailed me my new static IP address, so I configured dsl0 for the right IP and tried pinging my gateway. No luck–I couldn’t ARP the gateway, so ping wasn’t working. I fired up tcpdump, and saw that there was traffic on the link, but it didn’t look right–the source MAC address was 00:00:00:ff:ff:ff and the Ethernet frame type was ffff. So, most likely, the DSL framing option that I’d picked when I’d configured the link was wrong. I fired wancfg back up and looked over my options:
- Bridged Eth LLC over ATM (PPPoE)
- Bridged Eth VC over ATM
- Classical IP LLC over ATM
- PPP (LLC) over ATM
- PPP (VC) over ATM (PPPoA)
I was pretty sure that I was being fed bridged Ethernet, although Verizon hadn’t actually told me what they were using, so I’d picked ‘Bridged Eth VC over ATM’. I looked at the configuration screen for a minute, deciding which one to try next, when I noticed the next line down: ATM_AUTOCFG-> NO. I set that to yes, ran /etc/init.d/wanrouter restart, and watched syslog. Within 30 seconds, it reported that the link was up, that there was traffic on VCI 35, VPI 0 (which was the default in wancfg), but that it wasn’t framed right. The kernel driver said that it was expecting Bridged Eth VC over ATM, but it was seeing Bridged Eth LLC instead. So I ran wancfg, turned off autoconfig, and changed the encapsulation to ‘Bridged Eth LLC over ATM (PPPoE)’, saved, and re-ran /etc/init.d/wanrouter restart. As soon as it came back up, I was able to ping the gateway; everything was working. For what it’s worth, the PPPoE in the encapsulation name is a complete misnomer in this case–there’s no PPP involved anywhere in this system.
If I’d spent a bit of time with Google, I probably wouldn’t have had to fiddle with encapsulation settings, but it was nice to see that it could auto-detect it for me.
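For the curious, here is roughly how the LLC and VC flavors of bridged Ethernet differ on the wire, going by the RFC 2684 header layout: LLC-encapsulated frames start with an LLC/SNAP header (AA-AA-03, OUI 00-80-C2, PID 00-01 or 00-07), while VC-multiplexed frames start with just two pad bytes. This is only a sketch of that distinction. I have no idea how Sangoma's driver actually does its autodetection, and the function name and labels here are made up.

```python
LLC_SNAP_BRIDGED = bytes.fromhex("aaaa03" "0080c2")  # LLC header + 802.1 OUI
ETHERNET_PIDS = {bytes.fromhex("0001"), bytes.fromhex("0007")}  # with/without FCS


def guess_encapsulation(pdu):
    """Guess the bridged-Ethernet encapsulation from the first bytes of a PDU."""
    if pdu[:6] == LLC_SNAP_BRIDGED and pdu[6:8] in ETHERNET_PIDS:
        return "Bridged Eth LLC"
    if pdu[:2] == b"\x00\x00":
        # VC-mux has no header at all, only padding, so this is a weak guess.
        return "Bridged Eth VC"
    return "unknown"


broadcast_mac = b"\xff" * 6
llc_pdu = LLC_SNAP_BRIDGED + bytes.fromhex("0007") + b"\x00\x00" + broadcast_mac
vc_pdu = b"\x00\x00" + broadcast_mac

print(guess_encapsulation(llc_pdu))  # Bridged Eth LLC
print(guess_encapsulation(vc_pdu))   # Bridged Eth VC
```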
Once I was confident that the link was up, I changed IP addresses in DNS, edited my system startup scripts to use the new IP address and device dsl0 instead of eth1 and rebooted. I noticed that the Sangoma /etc/init.d/wanrouter was set to run at step 20 in /etc/rc3.d, while my IP configuration script ran way earlier. This caused problems: DNS lookups failed for some system services, like NTP and Apache, because the WAN link wasn’t up before they started. So I deleted the calls to start wanrouter out of /etc/rc*.d, and then called it by hand right before configuring my IP addresses and firewall. One more quick reboot, and everything seems to be working fine. I was able to download a kernel image over the DSL line at nearly 150 KB/sec, which was around twice as fast as before.
My main reason for ditching my old DSL line and modem and installing a new DSL line with the S518 was lower latency and jitter for VoIP. With normal external Ethernet-to-DSL modems, the modem has a buffer that it uses to hold outgoing packets. If you’ve ever tried uploading a large file over DSL, only to discover that it takes 5 or 6 seconds for ping packets to cross the link, you’ve discovered the joys of big buffers in DSL modems. While a 5 second delay is bad for SSH, it’s horrible for VoIP.
The usual way around this is to install a rate limiter on your router’s outbound Ethernet interface, like Wondershaper on Linux. This works by limiting how fast your router’s Ethernet interface can transmit data. If you slow the router down so it’s just a bit slower than the DSL line, then the router can prioritize packets and let VoIP packets go to the head of the line, without letting the DSL modem receive enough traffic to fill up its buffers. This works, but it’s always a spotty thing–for best results, you need to set Wondershaper to be a bit slower than your DSL line, so you lose some performance there. In addition, Wondershaper’s default settings don’t really pay full attention to the ToS headers on IP packets, so you have to spend some time tweaking its idea of high-priority and low-priority traffic. On top of that, it’s really hard to make changes–Linux’s QoS tools are powerful, but they’re complex and hard to understand.
The first thing that I did when I brought the S518 up was to turn off Wondershaper completely and see how well the kernel’s default QoS scheme (pfifo_fast) worked. By default, Linux prioritizes packets based on the ToS field on the IP header, and most tools actually seem to set the header to reasonable values. Or, at least Asterisk and BitTorrent both use reasonable settings. Since the S518 doesn’t have a buffer built into it, the kernel’s native queueing works perfectly, and I’m seeing nearly perfect Asterisk VoIP performance, even without a complex set of shaping tools.
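The basic idea behind pfifo_fast is easy to sketch: three FIFO bands, and the lowest-numbered band that has anything in it always gets to send first. The code below is a simplified model, not the kernel's implementation. The class and function names are invented, and the classifier only looks at the "minimize delay" (0x10) and "maximize throughput" (0x08) ToS bits, where the real qdisc uses a fuller priority map.

```python
from collections import deque


def band_for_tos(tos):
    """Simplified classifier: low-delay traffic first, bulk traffic last."""
    if tos & 0x10:   # minimize delay (e.g. VoIP, interactive SSH)
        return 0
    if tos & 0x08:   # maximize throughput (e.g. bulk transfers)
        return 2
    return 1         # everything else


class ThreeBandQueue:
    def __init__(self):
        self.bands = [deque(), deque(), deque()]

    def enqueue(self, packet, tos):
        self.bands[band_for_tos(tos)].append(packet)

    def dequeue(self):
        """Return the next packet from the highest-priority non-empty band."""
        for band in self.bands:
            if band:
                return band.popleft()
        return None


queue = ThreeBandQueue()
queue.enqueue("bulk upload", tos=0x08)
queue.enqueue("voip frame", tos=0x10)
queue.enqueue("web request", tos=0x00)
print([queue.dequeue() for _ in range(3)])
# ['voip frame', 'web request', 'bulk upload']
```

This only helps if the queue that fills up is the one doing the prioritizing, which is why a big dumb buffer in an external modem defeats it and a card with no buffer of its own doesn't.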
Since this was my primary goal when I bought the S518, I’m quite pleased with the card.
I’d strongly recommend this card to anyone with a need for decent QoS over ADSL, as long as they have the technical skills needed to get it to work. As mentioned, the drivers did a good job of installing themselves, and Sangoma’s tech support is good, but it still took some understanding to get the system working correctly. Sangoma supports most of the *BSDs as well as Linux and Windows, so the only people left out in the cold are the ones trying to use OS X as a router.
In addition, based on what I’ve seen of Sangoma’s drivers and toolset, as well as their tech support, I’d recommend that people in the market for 1-4 channel T1 cards for data or voice check their offerings out as well. Sangoma supports Asterisk directly on their T1 cards, and while they’re slightly more expensive than Digium’s cards, they probably come with better support. Given what I’ve seen of their tools, setting up data T1s with Sangoma’s drivers looks like child’s play.
If Sangoma’s looking for suggestions, I’d love to see a model of the S518 that can act as an FXO card with Asterisk while still acting as an ADSL card. One card could handle voice and DSL at the same time. The market for this isn’t huge today, but it’s a relatively simple change that could have huge benefits as Asterisk grows.
My DSL modem showed up yesterday, so I dropped it into my gateway box and fired it up. It immediately reported that it was unable to train; there was nothing to talk to on the other end of the phone line yet. Since my official install day is still a couple days out, that didn’t surprise me. Then this morning, I saw this in the logs:
Feb 9 08:19:22 guam kernel: wanpipe1: ADSL Link connected (Down 1792 kbps, Up 448 kbps)
Feb 9 08:19:30 guam kernel: wanpipe1: Link connected!
Feb 9 08:41:03 guam kernel: klogd 1.4.1#11, log source = /proc/kmsg started.
The gap between the second and third lines is the problem–the box went down, hard, right after the DSL line came up. On the other hand, it looks like I’m provisioned above 1.5/384 on the ATM side. Assuming a 20% cell tax, this gives me a usable connection of around 1430 kbps down and 360 kbps up, which isn’t too bad. Now I just have to keep the thing from crashing. I’m rolling my ADSL drivers back from the beta version that I’d started with to the most recent release; hopefully that’ll be good enough to fix my problem.
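The cell tax estimate works out like this. The 20% figure is my own rough assumption for ATM overhead, not a measured number, and the function name is just for the example.

```python
def usable_kbps(atm_kbps, cell_tax=0.20):
    """Approximate usable rate after subtracting assumed ATM overhead."""
    return atm_kbps * (1 - cell_tax)


down = usable_kbps(1792)
up = usable_kbps(448)
print(f"down: {down:.0f} kbps, up: {up:.0f} kbps")  # down: 1434 kbps, up: 358 kbps
```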
Okay, so my RAID array died because I wasn’t paying enough attention and my 3ware card had already kicked out one perfectly good drive for no obvious reason. No sweat, I can handle that. As I mentioned before, it took me most of a day, but I recovered almost all of the data off of the failed 4-drive array onto a new 2-drive RAID-0 array. Once the copy was complete, the goal was to destroy the old, broken RAID-5 array, create a new, working RAID-5 array, and then copy all of the data off of the RAID-0 array onto the new RAID-5 array. Then, when everything was complete, I was planning on using the RAID-0 disks as parity and spare drives for the RAID-5 set. Nice and simple, right?
So, by Friday night, I had 6 drives in front of me. One was bad, three were good, but part of the broken RAID array, and two held the data that had been on the RAID array. My goal was to take the 3 good drives and use them to build a new 4-drive RAID-5 array, so I built a software RAID-5 array in degraded mode–that way, I could get away with leaving out the 4th drive at the beginning. Once I copied the data off of the 5th and 6th drives, I was planning on adding them to the RAID-5 array so I’d have a 4th disk plus a spare.
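In case it isn't obvious why a 4-drive RAID-5 array can run with only 3 drives: the parity block in each stripe is the XOR of the data blocks, so any one missing block can be recomputed from the others. Here's a toy sketch of a single stripe. Real arrays rotate parity across the drives and work on much larger chunks; the helper name and the 4-byte blocks are just for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of a 4-drive array: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Degraded mode: the drive holding data[1] is missing, so rebuild its block
# from the two surviving data blocks and the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)  # b'BBBB'
```

Lose a second drive before the rebuild finishes, though, and there's nothing left to XOR against.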
I was very careful not to re-use the broken drive–it was on 3ware channel #2, so I cleverly built my new array using Linux’s
sdd devices, skipping
sdb. Once RAID-5 was running, I formatted the new array, copied everything from the RAID-0 set, broke down the RAID-0 set, and added the drives to the RAID-5 array. And promptly watched everything crumble to dust. My RAID-5 array started out in degraded mode, with 3 of 4 drives active. I then added 2 additional drives, and instead of watching it rebuild to 4 of 4 plus 1 spare, it went to 2 of 4 active. It even sent me this helpful email:
From: email@example.com
Subject: Fail event on /dev/md1:nfs
Date: January 8, 2005 8:16:43 AM PST
To: firstname.lastname@example.org

This is an automatically generated mail message from mdadm running on nfs

A Fail event had been detected on md device /dev/md1.

Faithfully yours, etc.
Although the array was still mounted, any attempt to access it generated a steady stream of I/O errors. What happened, you ask?
Basically, I was an idiot. Like I said, the drive on 3ware channel #2 failed, so I didn’t use drive
sdb. Except that 3ware numbers their channels starting with 0. So channel #2 was drive number 3—
sdb. So I’d rebuilt my array using the bad drive, then copied my data onto the broken disk, and destroyed all of my good copies. I spent all morning Saturday trying to fix things, but I couldn’t even get the kernel to acknowledge that the RAID array existed. I finally gave up and tried cloning
sdc, to see if that’d work, but it didn’t make a bit of difference–I could at least get
mdadm to tell me that
sdb had once been a part of a RAID array, but it didn’t recognize any of the data on
sdc as any part of anything.
In desperation, I tried re-creating the RAID array exactly as I’d first built it, using
sdd. Amazingly enough, that worked, and I was able to mount the drive. I then carefully added
sdc into the array, watched it rebuild the first 20% of the array, and then fail
sdc back out of the array, leaving me back where I started. I finally turned off the computer in disgust and went and played with what was left of our snow.
Sunday was more snow, so I played with the kids, and then finally took one last swing at the computer. I re-built the RAID array again, and then built a RAID-0 array from
sdf. I then tried to copy anything that was salvageable off of the broken RAID-5 array. I figured that I’d be able to copy something before it croaked again. I checked back a couple hours later to discover that it’d copied all 216 GB without error. I was stunned–apparently the drive’s problem was really just corruption of a few sectors–writing new data back onto the drive overwrote the weak parts with a new, strong signal, and it was able to read them back safely. Ugh. It wouldn’t resync right because there were still a number of old sectors with old data on them–if I’d zeroed out the whole drive, it’d probably have worked right from the start, for at least a couple months, until it failed again.
So, I went back through the process again, destroying the array built from
sdd, and then building a new one with
sdb this time. There’s no way I’m going to trust the failing drive, even if it did work this time. I copied everything off of the little RAID-0 array, then carefully tore it apart and used its drives to rebuild the big array into its full RAID-5 glory. And it actually worked this time, without errors. Everything was finally finished around midnight last night, and I was able to reboot without problems.
All done, right?
This morning I got up to find the screen full of syslogged Ethernet problems–apparently the network card had locked up. I could log in on the console, but I couldn’t ping anything. I rebooted, everything came up okay, and I tried copying a bunch of stuff onto the new RAID array. It copied just fine for about 5 minutes, and then the box locked up hard. No kernel panic or anything, just a dead box. The reset button didn’t help, and it ignored the soft power button, so I had to do the hold-the-power-button-for-5-seconds trick. After that, it didn’t boot right–there were 3ware card errors everywhere–timeouts, not drive problems. It locked up again halfway through booting.
So, practically speaking, I’m right back where I started on Friday morning–my box is dead, but the data is probably fine. I’m going to pop the box open and wiggle some cables, but I probably have bad hardware somewhere in the box–motherboard, 3ware card, or power supply. If this had happened at work, I’d just RMA the whole mess and let the vendor sort it out, but that’s not very useful at home, especially when dealing with a 4-year-old system with a second-hand RAID card. Ugh.
Update: I powered it off for a while, wiggled cables, removed spare hardware, rebooted, and found a nice kernel bug. If you have a RAID array with 4 drives plus a spare, and for some reason the spare’s RAID superblock has a higher timestamp than the 4 data drives, then the kernel’s RAID code will gladly kick the 4 good drives out of the array and keep just the spare. I sense a bug report in my near future.