The importance of a good test environment

I’ve been a Unix sysadmin for years, professionally since 1994 or 1995, but my current job is more programming and embedded design than traditional system administration. Since we’re just a small startup, there really isn’t anyone here working on the servers unless things break. When I started, I asked my boss “how much time should I spend as a sysadmin working on our servers?” and the answer was “as little as possible.” So I did. When things broke, I fixed them, and when we needed a new box internally for something, I set it up, but that’s pretty much it. Someone else, yet to be hired, will be the sysadmin.

That is, until last week, when we finally broke down and decided that we needed to fix a bunch of things, including printing, LDAP, and Samba. So, I had 3 days in the schedule for sysadmin work, but LDAP ended up taking most of a week to get straightened out. Even though things worked correctly in testing, they didn't quite work right in production. Plus, I had to wait until after 6:00 or so to work on production systems, and I needed to be in the office at 8:30 or so to verify that things weren't broken when people showed up. It was a pain, and everything took longer than it was supposed to.

At my previous job, stuff like this was still irritating, but actually rolling things out in production tended to go very smoothly. That's because we (a) could clone the production environment to produce an accurate test environment and (b) once the test environment worked, we could merge the changes made back onto the production environment.

Of course, we don't have any of that here, at least not for IT servers. We do have that for our product, but that's not strictly relevant here. The way that Internap did it was wonderful, but I don't think it scales down far enough--it's great with 700 servers, and probably even with 70 servers, but with 7 it's probably overkill.

Testing is absolutely one place where system administration can learn something from programming. No matter what you think about XP, it's obvious that automated unit tests are a major win for program reliability. I've never worked any place that put any thought at all into automated system testing (outside of a few things like DNS and ping tests), but it seems obvious that it's a good thing. Or, rather, once the tests exist, running them would be a good thing. Actually creating tests (and a testing framework) is, as always, a pain.
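
To make that a little more concrete, here is a rough sketch of what a tiny system-test script might look like in Python. It only covers the simple cases mentioned above (does a name resolve, does a service port answer), and everything specific in it is an assumption for illustration: the helper names resolves, port_open, run_checks, and example_checks are made up, and the example.com hostnames are placeholders rather than real servers. The port numbers are the standard ones for LDAP (389) and SMB (445).

```python
import socket


def resolves(hostname):
    """Return True if the name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed lookups.
        return False


def run_checks(checks):
    """Run (description, callable) pairs; return descriptions that failed."""
    failures = []
    for description, check in checks:
        if not check():
            failures.append(description)
    return failures


def example_checks():
    """Hypothetical check list; the hostnames are placeholders."""
    return [
        ("LDAP name resolves", lambda: resolves("ldap.example.com")),
        ("LDAP port answers", lambda: port_open("ldap.example.com", 389)),
        ("Samba port answers", lambda: port_open("files.example.com", 445)),
    ]
```

Something like run_checks(example_checks()) could be run by hand after each change, or from cron, printing whatever comes back. It only proves that a service is listening, not that it behaves correctly, so a real test for LDAP or Samba would still need to log in and do some actual work.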

This is one of the things that I want to fix with the server management stuff that I'm slowly working on. Smaller servers are (obviously) easier to test than bigger servers, because the number of weird interactions is lower, and the server's function is much more obvious.

If anyone has a good source of 27 hour days, let me know.