Posted by Scott Laird
Sat, 20 Oct 2007 04:56:19 GMT
A few days ago, I mentioned that my home NAS box had failed, and that I was considering replacing it with a PC server running OpenSolaris and ZFS. I’ve read a pile of ZFS docs, and it looks like the best option available to me today, so I decided to order some suitable hardware.
At that point, pretty much everything broke down. I have a hard enough time keeping track of which hardware works with Linux this week, and OpenSolaris is completely new to me. Sun’s list of officially-supported hardware is pretty sparse, and digging through their mailing list archives gets frustrating quickly. From what I can tell, it boils down to:
- Current Intel and AMD CPUs are all fine.
- Most of Intel’s chipsets are fine.
- Most of nVidia’s AMD chipsets are fine.
- nVidia and Intel video chips are good.
- Most common Ethernet chipsets are either supported natively or have drivers available.
- The only SATA controllers that work are Intel’s ICH southbridges, Silicon Image’s PCI and PCI-E chips, Marvell’s PCI chips, and nVidia’s southbridges. It’s not clear that Marvell’s PCI-E chips work. Most motherboards with additional, non-southbridge SATA ports probably won’t work.
- Venturing too far outside of this list will probably result in problems.
I was looking for a motherboard with 8 SATA ports, and was hoping that the Intel D975XBX2 (“Bad Axe 2”) would work, but 4 of its 8 SATA ports belong to a Marvell PCI-E SATA chip that doesn’t appear to be supported. I went through every single 8-port motherboard in Newegg’s (the ‘WS’ is important–the P5K is a different board). It only has 6 on-board SATA ports, but it includes a PCI-X slot. That’ll let me use the Supermicro AOC-SAT2-MV8, which is far and away the cheapest 8-port SATA card on the market. That’ll give me a total of 14 SATA ports, which should be enough for whatever I want to throw at it. The Marvell PCI-X chip at the heart of the Supermicro card is the same one used in Sun’s Sun Fire x4500 48-drive server, so it’s safe to assume that Sun has put a lot of effort into the driver.
Most of the rest of the system is fairly generic–a cheap nVidia 7200GS video card (the cheapest PCI-E card that NewEgg carries), a nice case and power supply, RAM, and a boatload of drives.
The one odd component that I’ve added is a Gigabyte GC-RAMDISK with 1 GB of RAM. The GC-RAMDISK is a battery-backed SATA ramdisk; it looks like a hard drive to the system and can survive up to 18 hours without power. I’ve had my eye on this thing for years, and it looks like it’ll be a perfect external log device for ZFS. I had to ask to see how ZFS will behave if the device fails, and it looks like manual intervention may be required after an 18+ hour power outage, but it should be pretty minimal. I’m planning on posting some benchmarks here once I’ve had a chance to try it out.
Assuming that I’m able to get this whole mess to work at all, I should have lots to write about here over the next week or so. I’m going to start by explaining why I want to use Solaris instead of Linux or *BSD, and why I’m building something instead of buying a pre-built NAS box.
Tags home, opensolaris, raid, server, zfs | 3 comments
Posted by Scott Laird
Fri, 19 Oct 2007 23:28:44 GMT
So, as part of my new home server series, I want to explain why I’m using OpenSolaris instead of Linux.
I’ve used Linux since 0.97.1, in August of 1992. I’ve had at least one Linux box at home continuously since 1993 or so. I’ve had a few small chunks of my code added to the kernel over the years. I’ve built several install disks and one embedded appliance distro from scratch, starting with a kernel and busybox and going on up from there. I’ve written X drivers, camera drivers, and drivers for embedded devices on the motherboard. I’ve managed Great Heaping Big Gobs of Hardware at various jobs. Basically, I know Linux well, and I’ve used it for almost half of my life.
That in itself might mean that it’s time for a change–professionally, I’ve been very tightly focused on Linux, and diversity is a good thing. But that’s not why I’m using Solaris this week. I’m using it because I’m fed up with losing data to weird RAID issues with Linux, and I believe that OpenSolaris with ZFS will be substantially more reliable long-term. Things I’m specifically fed up with:
- md (the Linux RAID driver)’s response to any sort of drive error, even a transient timeout, is to kick the drive from the array, no matter what. Most of the IDE drives that I’ve had over the years have been prone to random timeouts every few months, at least once you bundle more than 2 or 3 of them in a single box and then try snaking massive ribbon cable through the case. My SATA experiences haven’t been substantially better. Linux will happily bump an otherwise working 4-drive RAID 5 array to a 3-drive degraded RAID 5 array on the first failure, and then on to a 2-drive failed array on the second failure. Even when a simple retry would have cleared both errors. This has cost me data repeatedly, because I’ve been forced to manually intervene and re-add “failed” disks to RAID arrays. If I was too slow, then a second drive failure risked total data loss. Even worse, these random transient failures blind you to real drive failures, like the one that ate my NAS box last weekend.
- Actual drive failures can hang the kernel. I’ve had at least 3 cases at home where broken drives either caused system lockups or completely kept the system from booting. That sucks. Odds are some drivers are good while others are broken; apparently I’ve just had bad luck.
- None of Linux’s filesystems are particularly resilient in the face of on-disk data corruption. Compare with ZFS, which checksums everything that it reads or writes.
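To make the first complaint concrete, here is a toy Python simulation. It is not md's actual code, just a sketch comparing "eject the drive on its first error" with "retry a few times before giving up". The `Drive` class, the error rate, and the retry count are all made up for illustration.

```python
import random


class Drive:
    """Hypothetical drive model: each read fails transiently with a fixed probability."""

    def __init__(self, name, transient_error_rate, rng):
        self.name = name
        self.transient_error_rate = transient_error_rate
        self.rng = rng

    def read(self):
        # True means the read succeeded; False stands in for a transient timeout.
        return self.rng.random() >= self.transient_error_rate


def surviving_drives(drives, reads_per_drive, max_retries):
    """Return the names of the drives still in the array after the workload.

    max_retries=0 mimics the eject-on-first-error behavior described above;
    a larger value retries a failed read before ejecting the drive.
    """
    alive = []
    for drive in drives:
        ejected = False
        for _ in range(reads_per_drive):
            # any() stops at the first successful attempt.
            if not any(drive.read() for _ in range(max_retries + 1)):
                ejected = True
                break
        if not ejected:
            alive.append(drive.name)
    return alive


def make_drives(seed):
    rng = random.Random(seed)
    return [Drive(f"disk{i}", 0.001, rng) for i in range(4)]


if __name__ == "__main__":
    print("no retries:", surviving_drives(make_drives(1), 10_000, max_retries=0))
    print("2 retries: ", surviving_drives(make_drives(1), 10_000, max_retries=2))
```

With these invented numbers, the no-retry run will most likely eject most or all of the drives, while a couple of retries will most likely keep the whole array intact. Real drives don't fail independently like this, so treat it as an illustration of the policy difference rather than a prediction.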
In short: everything works great when things are perfect, but building a reliable multi-drive storage system requires careful component and kernel compatibility work, and then you have to stay right on top of things if you want everything to keep working. When things stop working, they usually fail badly. That’s almost the complete antithesis of what I want for home: plug it in, and it just keeps working. I don’t want small failures to cascade through the system. Little failures should be isolated, identified, and automatically repaired whenever possible. OpenSolaris and ZFS seem to provide that, while Linux with md and ext3 does not.
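The "isolate, identify, repair" idea can be sketched in a few lines of Python. This is a toy model and not ZFS's on-disk format: it keeps a SHA-256 checksum apart from two copies of each block, and on read it returns whichever copy still matches and rewrites the one that doesn't. `MirroredStore` and its methods are hypothetical names.

```python
import hashlib


class MirroredStore:
    """Toy two-way mirror that verifies a checksum on every read."""

    def __init__(self):
        self.copies = [{}, {}]   # two "disks", each mapping block id -> bytes
        self.checksums = {}      # checksums kept separately from the data

    def write(self, block_id, data):
        self.checksums[block_id] = hashlib.sha256(data).digest()
        for disk in self.copies:
            disk[block_id] = data

    def read(self, block_id):
        expected = self.checksums[block_id]
        good = None
        bad_disks = []
        for disk in self.copies:
            data = disk[block_id]
            if hashlib.sha256(data).digest() == expected:
                good = data
            else:
                bad_disks.append(disk)
        if good is None:
            raise IOError(f"block {block_id}: every copy failed its checksum")
        for disk in bad_disks:
            disk[block_id] = good   # repair the damaged copy from the good one
        return good


if __name__ == "__main__":
    store = MirroredStore()
    store.write(7, b"holiday photo")
    store.copies[0][7] = b"holiday phXto"   # simulate silent corruption on one disk
    print(store.read(7))                    # the read still returns the good data
    print(store.copies[0][7])               # and the damaged copy has been rewritten
```

The point of the sketch is that a bad block on one disk gets caught and fixed during an ordinary read, instead of being handed back to the application or taking the whole drive out of service.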
That’s why I’m planning on using ZFS. My logic for building a server vs. buying another little NAS box is simple: none of the little NAS boxes on the market use ZFS right now, and none of the cheap ones have room for more than 5 drives. I’m planning on using a double-parity system (RAID 6 or ZFS’s raidz2, where the system can cope with a 2-drive failure) plus a spare drive, and that’d only leave me with 2 data disks. The only way that I can get enough data with only 2 disks would be to use 1TB drives, and they’re too pricy right now.
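The drive-count arithmetic above is simple enough to write down. This is a small Python sketch with made-up function names; it gives nominal capacity only and ignores filesystem overhead, and it treats 1 TB as 1000 GB.

```python
def data_disks(slots, parity, spares):
    """Drives left for data after reserving parity and hot-spare drives."""
    remaining = slots - parity - spares
    if remaining < 1:
        raise ValueError("no drives left for data")
    return remaining


def usable_gb(slots, parity, spares, drive_gb):
    """Nominal usable capacity in GB, ignoring filesystem overhead."""
    return data_disks(slots, parity, spares) * drive_gb


if __name__ == "__main__":
    # A 5-bay NAS with double parity (raidz2 / RAID 6) plus one spare:
    print(data_disks(slots=5, parity=2, spares=1))                 # data drives
    print(usable_gb(slots=5, parity=2, spares=1, drive_gb=1000))   # GB with 1 TB drives
```

For the 5-bay case this works out to 2 data drives, or about 2000 GB with 1 TB disks, which is the squeeze described above.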
So, I’m willing to spend the time to build a somewhat complex server because I believe (hope?) that it’ll save me time in the future, and it’ll let me avoid ever having to do the reconstruct-from-the-source dance again. I don’t think I lost anything critical last weekend, and I’m reasonably confident that I’ll be able to get things limping along well enough to recover data anyway, but I’ve now done this 3 times in the past 4 years, and I’ve had it.
Coming up soon: backups, OpenSolaris hardware compatibility, and GC-RAMDISK performance benchmarks. Stay tuned :-).
Tags linux, opensolaris, raid, solaris, storage, zfs | 3 comments
Posted by Scott Laird
Tue, 16 Oct 2007 11:40:48 GMT
So, the comments on yesterday’s post about my nasty RAID failure encouraged me to spend some time looking at ZFS on OpenSolaris, and I really like what I see. I’ve ordered some new hardware, so I should have lots to write about by next weekend.
Reading the ZFS docs reminded me of my Holy Grail of Storage: a storage system that could actually do reasonably smart things with 3–5 drives. Imagine a system where you could start with 3 drives and simply plug new drives in as you need more space, without worrying about RAID or data layout. When you run out of slots, then just unplug the oldest, smallest drive and plug in a new, larger one, and the data will resync, giving you more disk space without needing any special work on your part. For bonus points, you’d be able to designate specific bits of your data as more or less important, so Bittorrent files might not be replicated at all, while your Word documents might be replicated onto every available drive.
I’ve wanted that for years, but I’ve largely dismissed it as a pipe dream, because it doesn’t fit cleanly into the drive/RAID/LVM/filesystem model that everything uses. The only thing that I’ve seen that even comes close is Drobo, and it’s supposedly fairly slow and really just too “magic” for me to trust.
I realized this morning that it’d be easy to build a storage system like this using ZFS. Just create a zpool with 3 drives to start, and then create zfs filesystems with copies=2 on top of it. When you add new drives, just add them to the pool. Blindly removing a single old drive will only leave you with a single copy of some of your files, but that shouldn’t be fatal, and ZFS can copy everything off of it if you give it a chance. There are some corner cases that will give you less redundancy–if you manage to fill the system 98% full before adding a new drive, then all of the replicas of new data will probably end up on the same disk. There are a couple obvious workarounds, and Sun will probably add replication rebalancing at some point, if it isn’t there already.
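In command terms, the recipe above might look something like the following. This Python sketch only builds and prints `zpool` and `zfs` command lines; it doesn't run anything. The pool name `tank`, the filesystem name `docs`, and the device names are placeholders, and real device names will depend on the system.

```python
def pool_commands(pool, initial_devices, filesystem, copies=2):
    """Commands to create a pool and a filesystem that keeps extra copies of its data."""
    return [
        ["zpool", "create", pool, *initial_devices],
        ["zfs", "create", "-o", f"copies={copies}", f"{pool}/{filesystem}"],
    ]


def grow_command(pool, new_device):
    """Command to add one more drive to an existing pool."""
    return ["zpool", "add", pool, new_device]


if __name__ == "__main__":
    # Placeholder device names; substitute whatever the system reports.
    for argv in pool_commands("tank", ["c1t0d0", "c1t1d0", "c1t2d0"], "docs"):
        print(" ".join(argv))
    print(" ".join(grow_command("tank", "c1t3d0")))
```

Because `copies` is set per filesystem, the "important data gets replicated, BitTorrent files don't" idea would presumably come down to creating separate filesystems with different `copies` values in the same pool.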
Tags opensolaris, raid, storage, zfs | no comments
Posted by Scott Laird
Tue, 28 Jun 2005 02:28:37 GMT
Somehow I missed this earlier this month–Gigabyte has announced a $50-ish PCI card that takes up to 4 DDR DIMMs and acts like a SATA RAMDISK. It has a battery that supposedly lasts 12-16 hours and will recharge via the PCI standby power line.
I’ve seen a bunch of people excited about using this as a boot disk or a Windows paging disk, but personally I’d love to see this used as an external journal for EXT3 filesystems. For some workloads, this would result in huge performance boosts for an amazingly small amount of cash. It’d be nice to have more battery life (36-72 hours would be ideal)–my personal record for a home power outage is 13 hours. All of my work-related outages have been brief, except for the facility in Manhattan that was dark for about a week in September of 2001.
via Ambient Irony
Update: I’ve been thinking about this, and the whole thing would be massively more useful with a couple small additions. First, add a Compact Flash socket to the board, and then update the ASIC that runs the board so it will copy the contents of the RAM onto the CF card after an hour or two without power. Then copy it all back when the power comes back up. You should be able to buy 1 GB of DRAM and 1 GB of CF flash for around $150; adding $100 for the PCI card gives you 1 GB of seriously non-volatile memory for $250. I’d probably make them a standard feature in every server that I bought, just for the performance boost.
Posted in Computer Hardware | Tags ddr, gigabyte, journal, nonvolatile, raid, ramdisk, sata | 1 comment
Posted by Scott Laird
Mon, 10 Jan 2005 23:00:49 GMT
Okay, so my RAID array died because I wasn’t paying enough attention and my 3ware card had already kicked out one perfectly good drive for no obvious reason. No sweat, I can handle that. As I mentioned before, it took me most of a day, but I recovered almost all of the data off of the failed 4-drive array onto a new 2-drive RAID-0 array. Once the copy was complete, the goal was to destroy the old, broken RAID-5 array, create a new, working RAID-5 array, and then copy all of the data off of the RAID-0 array onto the new RAID-5 array. Then, when everything was complete, I was planning on using the RAID-0 disks as parity and spare drives for the RAID-5 set. Nice and simple, right?
So, by Friday night, I had 6 drives in front of me. One was bad, three were good, but part of the broken RAID array, and two held the data that had been on the RAID array. My goal was to take the 3 good drives and use them to build a new 4-drive RAID-5 array, so I built a software RAID-5 array in degraded mode–that way, I could get away with leaving out the 4th drive at the beginning. Once I copied the data off of the 5th and 6th drives, I was planning on adding them to the RAID-5 array so I’d have a 4th disk plus a spare.
I was very careful not to re-use the broken drive–it was on 3ware channel #2, so I cleverly built my new array using Linux’s sda, sdc, and sdd devices, skipping sdb. Once RAID-5 was running, I formatted the new array, copied everything from the RAID-0 set, broke down the RAID-0 set, and added the drives to the RAID-5 array. And promptly watched everything crumble to dust. My RAID-5 array started out in degraded mode, with 3 of 4 drives active. I then added 2 additional drives, and instead of watching it rebuild to 4 of 4 plus 1 spare, it went to 2 of 4 active. It even sent me this helpful email:
From: scott@mail.sigkill.org
Subject: Fail event on /dev/md1:nfs
Date: January 8, 2005 8:16:43 AM PST
To: scott@sigkill.org
This is an automatically generated mail message from mdadm
running on nfs
A Fail event had been detected on md device /dev/md1.
Faithfully yours, etc.
Although the array was still mounted, any attempt to access it generated a steady stream of I/O errors. What happened, you ask?
Basically, I was an idiot. Like I said, the drive on 3ware channel #2 failed, so I didn’t use drive sdb. Except that 3ware numbers their channels starting with 0. So channel #2 was drive number 3: sdc, not sdb. So I’d rebuilt my array using the bad drive, then copied my data onto the broken disk, and destroyed all of my good copies. I spent all morning Saturday trying to fix things, but I couldn’t even get the kernel to acknowledge that the RAID array existed. I finally gave up and tried cloning sdb onto sdc, to see if that’d work, but it didn’t make a bit of difference–I could at least get mdadm to tell me that sdb had once been a part of a RAID array, but it didn’t recognize any of the data on sdc as any part of anything.
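The off-by-one is easy to see once it's written down. This is a small Python sketch with a hypothetical helper name; it assumes the drives show up as sda, sdb, and so on in channel order, with nothing else claiming sd letters, which won't hold on every system.

```python
import string


def device_for_channel(channel, zero_based=True):
    """Map a controller channel number to an sdX name.

    Assumes drives are enumerated in channel order and that no other
    disks take up sd letters ahead of them.
    """
    index = channel if zero_based else channel - 1
    if not 0 <= index < 26:
        raise ValueError("channel out of range for a single-letter device name")
    return "sd" + string.ascii_lowercase[index]


if __name__ == "__main__":
    print(device_for_channel(2))                    # channels counted from 0
    print(device_for_channel(2, zero_based=False))  # channels counted from 1
```

Counting from 0, channel 2 is sdc; counting from 1, it would be sdb. Checking the drive's serial number against the controller's report before rebuilding would have avoided relying on either guess.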
In desperation, I tried re-creating the RAID array exactly as I’d first built it, using sda, sdc, and sdd. Amazingly enough, that worked, and I was able to mount the drive. I then carefully added sdc into the array, watched it rebuild the first 20% of the array, and then fail sdc back out of the array, leaving me back where I started. I finally turned off the computer in disgust and went and played with what was left of our snow.
Sunday was more snow, so I played with the kids, and then finally took one last swing at the computer. I re-built the RAID array again, and then built a RAID-0 array from sde and sdf. I then tried to copy anything that was salvageable off of the broken RAID-5 array. I figured that I’d be able to copy something before it croaked again. I checked back a couple hours later to discover that it’d copied all 216 GB without error. I was stunned–apparently the drive’s problem was really just corruption of a few sectors–writing new data back onto the drive overwrote the weak parts with a new, strong signal, and it was able to read them back safely. Ugh. It wouldn’t resync right because there were still a number of old sectors with old data on them–if I’d zeroed out the whole drive, it’d probably have worked right from the start, for at least a couple months, until it failed again.
So, I went back through the process again, destroying the array built from sda, sdc, and sdd, and then building a new one with sdb this time. There’s no way I’m going to trust the failing drive, even if it did work this time. I copied everything off of the little RAID-0 array, then carefully tore it apart and used its drives to rebuild the big array into its full RAID-5 glory. And it actually worked this time, without errors. Everything was finally finished around midnight last night, and I was able to reboot without problems.
All done, right?
Ha.
This morning I got up to find the screen full of syslogged Ethernet problems–apparently the network card had locked up. I could log in on the console, but I couldn’t ping anything. I rebooted, everything came up okay, and I tried copying a bunch of stuff onto the new RAID array. It copied just fine for about 5 minutes, and then the box locked up hard. No kernel panic or anything, just a dead box. The reset button didn’t help, and it ignored the soft power button, so I had to do the hold-the-power-button-for-5-seconds trick. After that, it didn’t boot right–there were 3ware card errors everywhere–timeouts, not drive problems. It locked up again halfway through booting.
So, practically speaking, I’m right back where I started on Friday morning–my box is dead, but the data is probably fine. I’m going to pop the box open and wiggle some cables, but I probably have bad hardware somewhere in the box–motherboard, 3ware card, or power supply. If this had happened at work, I’d just RMA the whole mess and let the vendor sort it out, but that’s not very useful at home, especially when dealing with a 4-year-old system with a second-hand RAID card. Ugh.
Update: I powered it off for a while, wiggled cables, removed spare hardware, rebooted, and found a nice kernel bug. If you have a RAID array with 4 drives plus a spare, and for some reason the spare’s RAID superblock has a higher timestamp than the 4 data drives, then the kernel’s RAID code will gladly kick the 4 good drives out of the array and keep just the spare. I sense a bug report in my near future.
Posted in Computer System Administration | Tags broken, ide, linux, raid | no comments
Posted by Scott Laird
Fri, 07 Jan 2005 17:51:59 GMT
Well, this isn’t looking promising–my new drives arrived, but the big RAID array is throwing errors left and right. I’m not sure how much data I’ll be able to recover off the thing. Most of the data on the drive is reconstructible, but not everything. Most of the contents are old digital pictures, but I’ve tried to write them all to DVD before throwing them onto the RAID array. Odds are I missed some stuff, though.
Amazingly enough, the system logs seem to have survived unscathed, so I’ll write up an “anatomy of a drive failure” article later, showing what this looked like from a SMART perspective. Since I’ve never actually seen a SMART-monitored drive failure before, it should be somewhat educational.
Posted in Computer System Administration | Tags broken, drive, failure, linux, raid, smart | no comments
Posted by Scott Laird
Thu, 06 Jan 2005 19:49:46 GMT
As mentioned earlier, I spent part of the long weekend cleaning up home theater stuff. Part of this involved migrating files onto my home file server, which is an old Athlon 700 with an 8-channel 3ware RAID card and 4 160 GB drives in a 450 GB RAID 5 array.
So what happens as soon as I finish copying stuff onto the array? A drive starts failing on the RAID array, and I discover that it was already running in degraded mode. Now I’m in danger of losing all 200 GB on the array. Most likely, it won’t come to that, but it’s still fantastically irritating. Of the 4 160 GB drives that I bought last year, 2 of them have now failed.
To make sure that this doesn’t happen again, I just ordered 2 more 160 GB drives from NewEgg (only $76 each), along with a 3-in-2 style drive cooler. Assuming that it all arrives tomorrow, I should be able to rebuild the array, including a spare drive this time, and hopefully I won’t have to worry about it failing again.
Posted in Computer System Administration, Toys, Personal | Tags 3ware, broken, drive, failure, ide, raid | no comments