Why not Linux (new server part 2)

So, as part of my new home server series, I want to explain why I’m using OpenSolaris instead of Linux.

I’ve used Linux since 0.97.1, in August of 1992. I’ve had at least one Linux box at home continuously since 1993 or so. I’ve had a few small chunks of my code added to the kernel over the years. I’ve built several install disks and one embedded appliance distro from scratch, starting with a kernel and busybox and going on up from there. I’ve written X drivers, camera drivers, and drivers for embedded devices on the motherboard. I’ve managed Great Heaping Big Gobs of Hardware at various jobs. Basically, I know Linux well, and I’ve used it for almost half of my life.

That in itself might mean that it’s time for a change–professionally, I’ve been very tightly focused on Linux, and diversity is a good thing. But that’s not why I’m using Solaris this week. I’m using it because I’m fed up with losing data to weird RAID issues with Linux, and I believe that OpenSolaris with ZFS will be substantially more reliable long-term. Things I’m specifically fed up with:

  • md (the Linux RAID driver)’s response to any sort of drive error, even a transient timeout, is to kick the drive from the array, no matter what. Most of the IDE drives that I’ve had over the years have been prone to random timeouts every few months, at least once you bundle more than 2 or 3 of them in a single box and then try snaking massive ribbon cables through the case. My SATA experiences haven’t been substantially better. Linux will happily bump an otherwise working 4-drive RAID 5 array to a 3-drive degraded RAID 5 array on the first failure, and then on to a 2-drive failed array on the second failure, even when a simple retry would have cleared both errors. This has cost me data repeatedly, because recovery means manually intervening and re-adding “failed” disks to the array (see the mdadm sketch after this list), and if I was too slow, a second drive failure risked total data loss. Even worse, these random transient failures blind you to real drive failures, like the one that ate my NAS box last weekend.
  • Actual drive failures can hang the kernel. I’ve had at least 3 cases at home where broken drives either caused system lockups or completely kept the system from booting. That sucks. Odds are some drivers are good while others are broken; apparently I’ve just had bad luck.
  • None of Linux’s filesystems are particularly resilient in the face of on-disk data corruption. Compare with ZFS, which checksums everything that it reads or writes.
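
For anyone who hasn’t had the pleasure, the manual intervention from the first item above looks roughly like this. It’s a sketch from memory, not a recipe: /dev/md0 and /dev/sdc1 are placeholder names, and the exact mdadm options vary a bit between versions.

    # See whether md has kicked a drive out of the array.
    cat /proc/mdstat
    mdadm --detail /dev/md0

    # If the “failed” drive is actually fine (a transient timeout), pull it
    # from the failed list and re-add it, which kicks off a rebuild.
    mdadm /dev/md0 --remove /dev/sdc1
    mdadm /dev/md0 --re-add /dev/sdc1

    # Watch the rebuild; a second hiccup during this window can take out the array.
    watch cat /proc/mdstat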

In short: everything works great when things are perfect, but building a reliable multi-drive storage system on Linux requires careful component and kernel compatibility work, and then you have to stay right on top of things if you want everything to keep working. When things stop working, they usually fail badly. That’s almost the complete antithesis of what I want for home: plug it in, and it just keeps working. I don’t want small failures to cascade through the system. Little failures should be isolated, identified, and automatically repaired whenever possible. OpenSolaris and ZFS seem to provide that, while Linux with md and ext3 does not.
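
To make “identified and automatically repaired” concrete: ZFS keeps per-device read/write/checksum error counters and heals bad blocks from redundancy as it finds them, so the admin’s job mostly shrinks to glancing at a status report. A minimal sketch, assuming a pool named tank (the name is just a placeholder):

    # Per-device READ/WRITE/CKSUM counters, plus a note about anything
    # the pool has repaired on its own from parity or mirrors.
    zpool status -v tank

    # Once you’ve decided a burst of errors was transient, reset the counters.
    zpool clear tank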

That’s why I’m planning on using ZFS. My logic for building a server vs. buying another little NAS box is simple: none of the little NAS boxes on the market use ZFS right now, and none of the cheap ones have room for more than 5 drives. I’m planning on using a double-parity layout (RAID 6 or ZFS’s raidz2, where the system can cope with a 2-drive failure) plus a spare drive, and in a 5-bay box that would only leave me with 2 data disks. The only way to get enough capacity out of 2 data disks would be to use 1TB drives, and they’re too pricey right now.
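
To make the disk math concrete, here’s roughly the sort of layout I have in mind for the new box, written as a hypothetical zpool command; the seven-disk count and the c1t0d0-style device names are placeholders, not a final hardware list.

    # Six disks in a raidz2 vdev (any two can fail) plus one hot spare.
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        spare c1t6d0

    # A 6-disk raidz2 leaves 4 disks of usable space; the same scheme in a
    # 5-bay NAS box (4-disk raidz2 plus a spare) leaves only 2.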

So, I’m willing to spend the time to build a somewhat complex server because I believe (hope?) that it’ll save me time in the future, and it’ll let me avoid ever having to do the reconstruct-from-the-source dance again. I don’t think I lost anything critical last weekend, and I’m reasonably confident that I’ll be able to get things limping along well enough to recover data anyway, but I’ve now done this 3 times in the past 4 years, and I’ve had it.

Coming up soon: backups, OpenSolaris hardware compatibility, and GC-RAMDISK performance benchmarks. Stay tuned :-).

Posted by Scott Laird Fri, 19 Oct 2007 23:28:44 GMT


Comments

  1. Igor M Podlesny 8 months later:

    «Even worse, these random transient failures blind you to real drive failures». Huh. man smartd.

    «plus a spare drive» — I almost wonder what prevented you from having a spare drive when using Linux Software RAID.

    Almost — because I really don’t. ;-)

  2. Scott Laird 8 months later:

    Well, SMART’s of limited use–about half of all drive failures happen with no warning (see the Google drive failure paper from last year). In my experience over about 10 years of using it, both personally and professionally, Linux’s RAID system is about 80% worthless–it looks nice in principle, and it’s fast, but it will eat your data if you give it long enough. It’s just not reliable enough, and it has all sorts of weird failure scenarios that are hard to trigger and hard to recover from.

    Spares are nice in theory, but again, they’re of limited use. They’re only about 90% effective at saving your data. According to the Solaris folks’ statistical models (and backed up by some of the large storage admins who hang around on Sun’s mailing lists), with big drives, about 10% of all single-drive-failure-with-spare-in-place rebuilds still end in a failed RAID array, either due to undetected media problems (a second drive was actually bad, but the bad sectors hadn’t been read recently enough to notice) or a second total drive failure during the rebuild window.
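
    The usual mitigation for the latent-bad-sector half of that failure mode is a periodic scrub, so every block gets read and verified before a rebuild has to depend on it. Roughly, with “tank” as a placeholder pool name:

        # Walk the whole pool, verifying every block against its checksum
        # and repairing from redundancy where possible.
        zpool scrub tank

        # Check scrub progress and see whether anything needed repair.
        zpool status tank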

    My problem is that I (a) don’t want to lose data due to drive failure ever again and (b) don’t want to spend any time at all thinking about storage at home in an average month.

  3. Igor M Podlesny 8 months later:

    Even if smartd is of limited use, it still helps sometimes. At the least, it could keep you from being «blind» to those «UFO» timeouts, which I’ve personally never seen in more than 5 years of Linux Software RAID (LSR) practice.

    Speaking of the «spare drive», I just realized that you’re giving ZFS safer and better conditions than you ever gave LSR; it’s… ergh… biased. :)

    I don’t have anything against ZFS, but I’m pretty sure (having read about those voodoo «timeouts») that all your trouble came from rather crappy hardware and your own inadequate actions. Who, tell me, keeps running a degraded RAID-5, even while «blinded» by «these random transient failures»? It’s RAID-5, man: one disk less and you lose. As you did.

    P.S. There’s still RAID-6, BTW.

  4. http://www.nexenta.com/corp/ 9 months later:

    Scott Laird wrote:

    My logic for building a server vs. buying another little NAS box is simple: none of the little NAS boxes on the market use ZFS right now, and none of the cheap ones have room …

    NexentaStor Developer Edition is a time-unlimited, full-featured product that can be deployed with up to one terabyte of user data, with no limit on the total size of attached storage.

    http://www.nexenta.com/corp/

    NexentaStor http://www.nexenta.com/corp/index.php?option=com_content&task=blogsection&id=4&Itemid=67

    Evaluate a FREE unlimited trial, pay for support.