I try to be pragmatic, but sometimes I just can't help it and try to pick up lost causes. I think the world would be a better place if computers were easier to maintain. Fortunately, I'm a server person, and the server side of things is actually a lot easier than the desktop side, at least for now.

Warning: most of my experience is with Linux and Solaris boxes in ISP-like settings, although I've done a fair bit of time in small non-computer-related businesses and software houses. I have no idea how much of this applies to Windows.

I've been thinking about server management for years. Sometimes I've been paid for it; sometimes (like now) I'm paid for other things. I still can't stop thinking about it, though. There has to be a better way to manage servers than the way we're doing it now. As I mentioned yesterday, I think I might have a solution for at least a few common cases.

Traditionally, there are two models for server deployment: the heavyweight model (deploy a small number of servers and run lots of services on each) and the lightweight model (deploy a lot of servers and run a small number of services on each). One of the problems is that, at least for small services, the heavyweight model seems cheaper. Why buy 10 servers that are going to sit 95% idle when you could buy 2 servers and have them be 75% idle? Or even one server that'll only be 50% busy?

What happens pretty much every time, though, is that a couple of the services start conflicting with each other somehow: one needs perl 5.6 for something while another needs 5.8, or they need two different versions of the JVM, or one needs a critical security upgrade that ends up killing another service. So you keep tweaking things, and you (barely) keep everything running, largely by avoiding making changes. Except, when you avoid making small changes, you inevitably miss little security fixes and little bug fixes, and you drift further and further from the mainline of whatever OS you're running.

So, inevitably, you reach the "server event horizon," where things have grown so complex and unmanageable that the only thing you can do is buy 2-3 new computers to replace your one big system, and then slowly migrate services off of the old box onto the new ones. Except there are a lot of implicit assumptions lurking: things assume that DNS and DHCP are on the same server, or that Apache and MySQL are on the same box, and it takes forever to untangle them. Even once that's done, you'll find out that people have hard-coded server names into applications deployed all over the company, and you'll end up spending 3 months untangling the one heavyweight server that seemed like such a good way to save money at the time.

Conventional wisdom says that the way out of this problem is virtualization. Instead of buying 10 small computers, you buy one or two really big computers, partition them in software, and then install the software that you would have installed onto the little computers onto partitions of the big ones. Lots of vendors love this model; IBM's whole Linux-on-mainframes push is based on it. As I see it, though, there are a couple of problems. First, you'll end up paying a ton of money for virtualization hardware or software: VMware wants at least $2,500 per server for their PC-based virtualization code, and pretty much everything else is more expensive. Second, you're still left with a bunch of small general-purpose servers that you need to manage individually, even if they do happen to physically reside within a single box. There are also reliability issues, but I'm going to ignore them for now; in my experience, even cheap PCs running Linux rarely crash, and when they do, it's usually a bad power supply, a bad hard drive, or bad RAM. Spending more money on hardware gets you multiple power supplies, better RAID, and more redundant memory. Plus buggy virtualization software, but we'll come back to that, too.

Fortunately, the open-source world is making progress. User-Mode Linux (UML) is making a lot of headway. It's included in Linux 2.6, although it still needs a few little patches for optimal operation. It seems to carry about a 30% speed hit in a lot of cases; sometimes that's a problem, sometimes it isn't. Using it, you can build one big Linux host server and then run a bunch of little virtualized servers on it for free. Sounds nice? Sort of: you still have to admin a ton of little general-purpose boxes, but at least you've mostly solved the dependency problem that killed us a few paragraphs ago.
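For the curious, here's a minimal sketch of what booting one of those little UML guests from the host might look like, wrapped in Python so it could grow into a management script later. The paths, memory size, and addresses are made up, but the kernel arguments (ubd0= for the root disk image, umid= to name the instance, eth0=tuntap for host networking) are standard UML fare.

    # Minimal sketch: boot a User-Mode Linux guest from the host.
    # The paths, memory size, and IP address are illustrative, not prescriptive.
    import subprocess

    def start_uml_guest(name, root_image, mem="64M", host_tap_ip="192.168.0.254"):
        """Boot a UML instance named `name` from a root filesystem image."""
        cmd = [
            "./linux",                       # the UML kernel, built as an ordinary userspace binary
            "umid=" + name,                  # names the instance so uml_mconsole can control it
            "ubd0=" + root_image,            # first UML block device = the root filesystem image
            "mem=" + mem,                    # how much RAM the guest thinks it has
            "eth0=tuntap,,," + host_tap_ip,  # simple TUN/TAP networking back to the host
        ]
        return subprocess.Popen(cmd)

    # e.g. start_uml_guest("dns1", "/vm/dns1/root_fs")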

The nice thing about Linux is that it's so flexible. Unlike every other OS that I'm aware of, there's no one environment that is definitively Linux. Instead, we have a herd of Linux distributions, ranging from Red Hat to Debian to Gentoo to "Linux From Scratch," all the way down to the mini Linux distributions that wireless access point vendors call "firmware." There's no real reason for a special-purpose DNS server to run a full Linux distribution, except that it's usually less work that way. However, once we have a UML-based virtualization scheme in place, it can actually be easier to use specialized distributions than general-purpose ones. The hard parts of a distribution are generally the installer, the hardware handling, and the update code. With a virtualized server, none of that applies: there is no real hardware (the virtual hardware always looks the same), the installer is really just a script that copies a disk image into place on the host system, and the update system is even easier; you just save the data and completely discard the old OS image. In an ideal world, the OS image would be completely read-only, with configuration settings and data kept outside of the server in a standardized format. Then software upgrades become truly trivial: kill off the old server VM and start up a new VM using the old data.
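To make that concrete, here's a rough sketch of what an upgrade could look like under this scheme. The disk layout (a shared read-only OS image with a per-guest copy-on-write file, plus a separate writable data image) and the paths are my own assumptions; the point is just that the OS image is disposable and the data disk is the only thing that survives.

    # Hypothetical upgrade under the "read-only OS image" scheme: halt the old VM,
    # throw away its OS image, and boot a new one against the same data disk.
    import os
    import subprocess

    def upgrade_guest(name, new_os_image, data_image):
        # Ask the running instance (named via umid=) to shut down cleanly.
        subprocess.call(["uml_mconsole", name, "halt"])

        # The old OS image holds nothing worth keeping; start a fresh COW file
        # layered over the new, shared, read-only image.
        cow_file = "/vm/%s/os.cow" % name
        if os.path.exists(cow_file):
            os.remove(cow_file)

        cmd = [
            "./linux",
            "umid=" + name,
            "ubd0=%s,%s" % (cow_file, new_os_image),  # COW file over the read-only OS image
            "ubd1=" + data_image,                      # config and data: the only surviving state
            "mem=64M",
        ]
        return subprocess.Popen(cmd)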

This won't work for everything, of course. It'd be horrible for big database servers, or frankly for big servers of any type. In my experience, though, there are a lot more small servers than there are big servers.

The other nice thing about this scheme is that the virtual server images are simple to build and easy to trade. They're not utterly trivial, but it's easier to build a server image than to actually write the server software or maintain a full-sized OS distribution. Given a standardized interface between the host OS and the server image (things like IP address, DNS server, hostname, logging, and all of the other little details needed to make a server run), there's no real reason that you can't swap between server images from different "vendors," grabbing whatever best serves your needs.
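As a strawman (and nothing more), that interface could be as simple as the host writing a little key=value file onto the guest's data disk before boot, which the guest's init scripts read to configure themselves. All of the field names below are hypothetical placeholders, not an actual spec.

    # Strawman host-to-guest contract: a tiny key=value file the host drops onto
    # the guest's data disk before boot. Field names are hypothetical placeholders.
    GUEST_SETTINGS = {
        "hostname":   "dns1.example.com",
        "ip_address": "192.168.0.53/24",
        "gateway":    "192.168.0.1",
        "dns_server": "192.168.0.53",
        "log_host":   "192.168.0.10",  # forward syslog here instead of keeping logs in the image
    }

    def write_guest_settings(path, settings):
        """Write the settings file that the guest's init scripts read at boot."""
        f = open(path, "w")
        for key, value in settings.items():
            f.write("%s=%s\n" % (key, value))
        f.close()

    # e.g. write_guest_settings("/vm/dns1/data/guest.conf", GUEST_SETTINGS)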

I'm starting to build a framework for this; more details as I have time to write them down.