Why not Linux (new server part 2)

Posted by Scott Laird Fri, 19 Oct 2007 23:28:44 GMT

So, as part of my new home server series, I want to explain why I’m using OpenSolaris instead of Linux.

I’ve used Linux since 0.97.1, in August of 1992. I’ve had at least one Linux box at home continuously since 1993 or so. I’ve had a few small chunks of my code added to the kernel over the years. I’ve built several install disks and one embedded appliance distro from scratch, starting with a kernel and busybox and going on up from there. I’ve written X drivers, camera drivers, and drivers for embedded devices on the motherboard. I’ve managed Great Heaping Big Gobs of Hardware at various jobs. Basically, I know Linux well, and I’ve used it for almost half of my life.

That in itself might mean that it’s time for a change–professionally, I’ve been very tightly focused on Linux, and diversity is a good thing. But that’s not why I’m using Solaris this week. I’m using it because I’m fed up with losing data to weird RAID issues with Linux, and I believe that OpenSolaris with ZFS will be substantially more reliable long-term. Things I’m specifically fed up with:

  • md (the Linux RAID driver)’s response to any sort of drive error, even a transient timeout, is to kick the drive from the array, no matter what. Most of the IDE drives that I’ve had over the years have been prone to random timeouts every few months, at least once you bundle more than 2 or 3 of them in a single box and then try snaking massive ribbon cable through the case. My SATA experiences haven’t been substantially better. Linux will happily bump an otherwise working 4-drive RAID 5 array to a 3-drive degraded RAID 5 array on the first failure, and then on to a 2-drive failed array on the second failure, even when a simple retry would have cleared both errors. This has cost me data repeatedly, because I’ve been forced to manually intervene and re-add “failed” disks to RAID arrays. If I was too slow, then a second drive failure risked total data loss. Even worse, these random transient failures blind you to real drive failures, like the one that ate my NAS box last weekend.
  • Actual drive failures can hang the kernel. I’ve had at least 3 cases at home where broken drives either caused system lockups or completely kept the system from booting. That sucks. Odds are some drivers are good while others are broken; apparently I’ve just had bad luck.
  • None of Linux’s filesystems are particularly resilient in the face of on-disk data corruption. Compare with ZFS, which checksums everything that it reads or writes.
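
To make the first complaint concrete, here is a minimal Python sketch of the difference between kicking a drive on its first error and retrying before giving up. It is a toy model, not md's actual code: `FlakyDrive`, `stays_in_array`, and `active_drives` are hypothetical names invented for this illustration, and the timeouts are scripted rather than real.

```python
class FlakyDrive:
    """Hypothetical drive that times out on a scripted set of read attempts."""

    def __init__(self, name: str, failing_attempts: list[int]):
        self.name = name
        self.failing_attempts = set(failing_attempts)
        self.attempts = 0

    def read(self) -> bytes:
        self.attempts += 1
        if self.attempts in self.failing_attempts:
            raise TimeoutError(f"{self.name}: read timed out")
        return b"data"


def stays_in_array(drive: FlakyDrive, max_retries: int) -> bool:
    """Return True if the drive survives, False if it gets kicked.

    max_retries=0 models "kick on the first error"; larger values
    give a transient timeout a chance to clear before giving up.
    """
    for _ in range(max_retries + 1):
        try:
            drive.read()
            return True
        except TimeoutError:
            continue
    return False


def active_drives(max_retries: int) -> int:
    """Count surviving drives in a 4-drive array with two transient timeouts."""
    drives = [
        FlakyDrive("disk0", failing_attempts=[]),
        FlakyDrive("disk1", failing_attempts=[1]),  # one transient timeout
        FlakyDrive("disk2", failing_attempts=[1]),  # one transient timeout
        FlakyDrive("disk3", failing_attempts=[]),
    ]
    return sum(stays_in_array(d, max_retries) for d in drives)


print("kick on first error:", active_drives(0), "drives left")
print("retry once first:   ", active_drives(1), "drives left")
```

With no retries, the two transient timeouts take the array from 4 drives to 2. With a single retry, all 4 drives stay in.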

In short: everything works great when things are perfect, but building a reliable multi-drive storage system requires careful component and kernel compatibility work, and then you have to stay right on top of things if you want everything to keep working. When things stop working, they usually fail badly. That’s almost the complete antithesis of what I want for home: plug it in, and it just keeps working. I don’t want small failures to cascade through the system. Little failures should be isolated, identified, and automatically repaired whenever possible. OpenSolaris and ZFS seem to provide that, while Linux with md and ext3 does not.
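
The "isolated, identified, and automatically repaired" idea can be sketched in a few lines of Python. This is only an illustration of checksum-on-read with repair from a second copy. It is not how ZFS is implemented, and `MirroredStore` and its methods are names made up for the example.

```python
import hashlib


class MirroredStore:
    """Toy two-way mirror that keeps a SHA-256 checksum for every block."""

    def __init__(self) -> None:
        self.copies: list[dict[str, bytes]] = [{}, {}]
        self.checksums: dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        # Record the checksum once, then store the data on both "disks".
        self.checksums[key] = hashlib.sha256(data).digest()
        for copy in self.copies:
            copy[key] = data

    def read(self, key: str) -> bytes:
        expected = self.checksums[key]
        for copy in self.copies:
            data = copy.get(key)
            if data is not None and hashlib.sha256(data).digest() == expected:
                # Found a good copy: overwrite any copy that disagrees with it.
                for other in self.copies:
                    if other.get(key) != data:
                        other[key] = data
                return data
        raise OSError(f"all copies of {key!r} failed checksum verification")


store = MirroredStore()
store.write("photo", b"important bits")
store.copies[0]["photo"] = b"important bitz"  # simulate silent on-disk corruption
assert store.read("photo") == b"important bits"       # bad copy is detected and skipped
assert store.copies[0]["photo"] == b"important bits"  # and then repaired in place
```

The point of the sketch is that the corruption is caught at read time and fixed from the good copy, instead of being handed back to the application as if nothing were wrong.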

That’s why I’m planning on using ZFS. My logic for building a server vs. buying another little NAS box is simple: none of the little NAS boxes on the market use ZFS right now, and none of the cheap ones have room for more than 5 drives. I’m planning on using a double-parity system (RAID 6 or ZFS’s raidz2, where the system can cope with a 2-drive failure) plus a spare drive, and in a 5-bay box that’d only leave me with 2 data disks. The only way that I can get enough space with only 2 disks would be to use 1TB drives, and they’re too pricy right now.
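
The drive-count arithmetic is simple enough to write down. The following Python sketch just subtracts parity and spare drives from the number of bays. The helper names are made up for this example, and it ignores filesystem overhead.

```python
def data_disks(bays: int, parity: int, spares: int) -> int:
    """Drives left for data after parity and hot spares are set aside."""
    return bays - parity - spares


def usable_tb(bays: int, parity: int, spares: int, drive_tb: float) -> float:
    """Rough usable capacity in TB, ignoring filesystem overhead."""
    return data_disks(bays, parity, spares) * drive_tb


# A 5-bay NAS with double parity (RAID 6 / raidz2) and one spare:
print(data_disks(bays=5, parity=2, spares=1))                 # 2 data disks
print(usable_tb(bays=5, parity=2, spares=1, drive_tb=1.0))    # 2.0 TB with 1TB drives
```

Any box with more bays changes only the first argument, so the same double-parity-plus-spare layout costs 3 drives no matter how many data disks sit beside it.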

So, I’m willing to spend the time to build a somewhat complex server because I believe (hope?) that it’ll save me time in the future, and it’ll let me avoid ever having to do the reconstruct-from-the-source dance again. I don’t think I lost anything critical last weekend, and I’m reasonably confident that I’ll be able to get things limping along well enough to recover data anyway, but I’ve now done this 3 times in the past 4 years, and I’ve had it.

Coming up soon: backups, OpenSolaris hardware compatibility, and GC-RAMDISK performance benchmarks. Stay tuned :-).


ZFS and The Holy Grail of Storage

Posted by Scott Laird Tue, 16 Oct 2007 11:40:48 GMT

So, the comments on yesterday’s post about my nasty RAID failure encouraged me to spend some time looking at ZFS on OpenSolaris, and I really like what I see. I’ve ordered some new hardware, so I should have lots to write about by next weekend.

Reading the ZFS docs reminded me of my Holy Grail of Storage: a storage system that could actually do reasonably smart things with 3–5 drives. Imagine a system where you could start with 3 drives and simply plug new drives in as you need more space, without worrying about RAID or data layout. When you run out of slots, then just unplug the oldest, smallest drive and plug in a new, larger one, and the data will resync, giving you more disk space without needing any special work on your part. For bonus points, you’d be able to designate specific bits of your data as more or less important, so BitTorrent files might not be replicated at all, while your Word documents might be replicated onto every available drive.

I’ve wanted that for years, but I’ve largely dismissed it as a pipe dream, because it doesn’t fit cleanly into the drive/RAID/LVM/filesystem model that everything uses. The only thing that I’ve seen that even comes close is Drobo, and it’s supposedly fairly slow and really just too “magic” for me to trust.

I realized this morning that it’d be easy to build a storage system like this using ZFS. Just create a zpool with 3 drives to start, and then create ZFS filesystems with copies=2 on top of it. When you add new drives, just add them to the pool. Blindly removing a single old drive will only leave you with a single copy of some of your files, but that shouldn’t be fatal, and ZFS can copy everything off of it if you give it a chance. There are some corner cases that will give you less redundancy–if you manage to fill the system 98% full before adding a new drive, then all of the replicas of new data will probably end up on the same disk. There are a couple of obvious workarounds, and Sun will probably add replication rebalancing at some point, if it isn’t there already.
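
The nearly-full corner case is easy to see in a toy model. The Python sketch below is not ZFS's allocator. It assumes a made-up placement rule (put each copy on the disk with the most free space, preferring a disk that doesn't already hold a copy of the same block), and `place_copies` and `same_disk_blocks` are hypothetical names.

```python
def place_copies(free: dict[str, int], copies: int = 2) -> list[str]:
    """Pick a disk for each copy of one block, updating free space in place."""
    chosen: list[str] = []
    for _ in range(copies):
        candidates = [d for d, blocks in free.items() if blocks > 0]
        if not candidates:
            raise RuntimeError("pool is full")
        # Prefer disks that do not already hold a copy of this block;
        # fall back to any disk with room if there is no such disk.
        fresh = [d for d in candidates if d not in chosen]
        disk = max(fresh or candidates, key=lambda d: free[d])
        free[disk] -= 1
        chosen.append(disk)
    return chosen


def same_disk_blocks(free: dict[str, int], new_blocks: int) -> int:
    """Write new_blocks blocks and count how many have both copies on one disk."""
    count = 0
    for _ in range(new_blocks):
        if len(set(place_copies(free))) == 1:
            count += 1
    return count


# Three old 100-block disks that are 98% full, plus one empty new disk.
pool = {"old0": 2, "old1": 2, "old2": 2, "new": 100}
print(same_disk_blocks(pool, new_blocks=10), "of 10 blocks have both copies on one disk")
```

In this model the old disks have room for only 6 second copies, so 4 of the 10 new blocks end up with both copies on the new drive. Adding the new drive before the pool gets that full is one way to avoid the problem.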
