Posted by Scott Laird
Thu, 28 Jul 2005 22:18:05 GMT
ThinkSecret says that Apple is getting ready to upgrade OS X Server with some sort of improved mail and calendar solution, probably Hula. That’s nice and all, but I REALLY want them to upgrade iCal to support some publicly-available calendar server. The ability to publish read-only calendars was nice in 2002 when it was first added, but it’s been three years, and I’m still waiting for the ability to share read/write calendars with other family members. I’m aware that I could probably do this with .Mac, but I’m not willing to pay $100/year just so I can edit events on my wife’s calendar a couple times per week.
Having said that, Hula looks pretty nice. Even without iCal syncing support, I’ll probably consider it when it’s time to upgrade my mail server software again; fortunately that’s probably at least a year away still. If iCal gets CalDAV support before then, then I might have to be a bit more aggressive with the timeframe. Either that or look for other CalDAV servers.
Posted in Mac stuff, Computer System Administration | Tags apple, caldav, hula, ical, imap, mail, pop | no comments
Posted by Scott Laird
Wed, 13 Jul 2005 02:11:30 GMT
I’ve spent most of the past two days working on a little project at work that needs the ability to generate Java JKS keystore files (compatible with the Java keytool program) containing X.509 certificates signed by a private certificate authority.
If you think that sounds simple, then you’ve obviously never worked with X.509.
This turns out to be astoundingly difficult, largely because X.509 is insane. It doesn’t help that Sun’s keytool program is missing a lot of functionality–if you want to rename keys or extract the private keys from the keystore file, then you’ll need to resort to coding it in Java. It’s also really hard to find usable certificate authority software. I’ve been looking for a complete open-source corporate CA for at least 8 years! There are lots of partial solutions out there, but none of the ones that I’ve used have actually been able to solve all of the problems that I’ve needed solved. I’ve always fallen back on scripting openssl directly, and that always requires a day or so of digging through OpenSSL documentation to find the right incantations to get it to work.
In the end, all I needed to do was run openssl 3 times per key generated (make key, sign key, convert to PKCS#12), then run a bit of Java code out of Jetty to convert the PKCS#12 key to a format that keytool can read. Don’t ask why I had to drag Jetty into the picture–that’s like requiring Apache in order to get your version control software to compile or something–it just doesn’t make any sense. Sigh.
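As a rough sketch, the three openssl runs described above (make key, sign key, convert to PKCS#12) could look something like this. The file names, subject, key size, validity period, and password are all made up for illustration; only the three-step sequence comes from the description above. The function just builds the argument lists and doesn't execute anything.

```python
def openssl_steps(name, ca_cert="ca.crt", ca_key="ca.key", password="changeit"):
    """Build the three openssl command lines for one key (nothing is executed)."""
    # Hypothetical file naming: <name>.key, <name>.csr, <name>.crt, <name>.p12
    key, csr, crt, p12 = (f"{name}.{ext}" for ext in ("key", "csr", "crt", "p12"))
    return [
        # 1. Make a key, plus a certificate signing request for it.
        ["openssl", "req", "-newkey", "rsa:2048", "-nodes",
         "-keyout", key, "-out", csr, "-subj", f"/CN={name}"],
        # 2. Sign the request with the private CA.
        ["openssl", "x509", "-req", "-in", csr, "-CA", ca_cert, "-CAkey", ca_key,
         "-CAcreateserial", "-out", crt, "-days", "365"],
        # 3. Bundle the key and signed certificate into a PKCS#12 file.
        ["openssl", "pkcs12", "-export", "-in", crt, "-inkey", key,
         "-out", p12, "-name", name, "-passout", f"pass:{password}"],
    ]
```

Each list can be handed to subprocess.run(cmd, check=True) in order. The final PKCS#12-to-JKS conversion is the Java step and isn't covered here.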
For future reference, here are a few useful references:
Posted in Computer Security, Computer System Administration, Work | Tags cryptography, openssl, rant, x509 | 4 comments
Posted by Scott Laird
Wed, 29 Jun 2005 18:12:17 GMT
I just locked myself out of a remote server for the first time in years. I’m usually better than that, but I finally screwed up and typed something that ended up requiring local intervention.
I’m going to blame all of the GNU tools for this–GNU getopt almost universally allows you to enter command-line flags anywhere on the command line. So ls -l foo and ls foo -l are equivalent. Frequently, if I need to add a new flag to an existing command line, I’ll just tack it on at the end rather than using the arrows to go back a word or two.
Unfortunately, sometimes the order matters. For instance, kill -1 1234 and kill 1234 -1 do very different things. The first one sends signal 1 (SIGHUP) to process 1234. The second one sends the default signal, SIGTERM, to process 1234, as well as every other process on the system.
Oops.
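Here's a simplified model of why the order matters. It's only a sketch of how kill reads its arguments, not the real implementation (the real thing also takes signal names, -s, -l, and so on), and parse_kill_args is a made-up name.

```python
SIGHUP, SIGTERM = 1, 15  # conventional Linux signal numbers

def parse_kill_args(args):
    """Return (signal, pids) roughly the way kill reads its arguments."""
    signal = SIGTERM  # the default when no signal is given
    rest = list(args)
    # Only a leading "-N" is taken as a signal number.
    if rest and rest[0].startswith("-") and rest[0][1:].isdigit():
        signal = int(rest.pop(0)[1:])
    # Everything after that is a PID; -1 means "every process I may signal".
    return signal, [int(a) for a in rest]

print(parse_kill_args(["-1", "1234"]))   # (1, [1234])
print(parse_kill_args(["1234", "-1"]))   # (15, [1234, -1])
```

In the second call the trailing -1 is no longer a flag. It's a process ID, and PID -1 is shorthand for everything.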
Posted in Computer System Administration | Tags oops, sysadmin | no comments
Posted by Scott Laird
Tue, 08 Mar 2005 07:23:21 GMT
It’s sort of an axiom of programming that features that aren’t continually used or tested won’t actually work. A similar rule holds for system administration–any feature that hasn’t been tested since the last upgrade is probably broken. An obvious corollary suggests that systems get more reliable as their user load increases–more users means more features are used more frequently, and broken features will be spotted sooner. And the corollary to that is that any server wedged under a desk in someone’s home office is probably flakier than hell because it’s probably just sitting there collecting dust and not getting used.
I’m not convinced that that applies to my home gateway box. It’s a busy little beaver:
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes prot opt in out source destination
234M 75G all -- dsl0 * 0.0.0.0/0 0.0.0.0/0
47M 1001G all -- eth0 * 0.0.0.0/0 0.0.0.0/0
In the 25.75 days since I last rebooted this system, it’s received over 75 GB via its DSL link and around 1 TB over its main Ethernet link. If my math is right, that’s an average of 3.6 Mbps on the Ethernet link and around 270 kbps over DSL. I wasn’t keeping outgoing traffic stats when I first booted this box, but more recent estimates make it look like there’s almost as much outgoing traffic on dsl0 as there is incoming.
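The arithmetic behind those averages is easy to check. This sketch assumes the G suffix in the counters means 10^9 bytes and that the uptime is exactly 25.75 days; average_bps is just a helper name for the example.

```python
def average_bps(total_bytes, days):
    """Average rate in bits per second over the given number of days."""
    return total_bytes * 8 / (days * 86_400)

UPTIME_DAYS = 25.75
eth0_mbps = average_bps(1001e9, UPTIME_DAYS) / 1e6  # 1001G from the eth0 counter
dsl0_kbps = average_bps(75e9, UPTIME_DAYS) / 1e3    # 75G from the dsl0 counter
print(f"eth0: {eth0_mbps:.1f} Mbps, dsl0: {dsl0_kbps:.0f} kbps")
# prints "eth0: 3.6 Mbps, dsl0: 270 kbps"
```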
CPU load is similarly heavy–the box has averaged 51.9% idle since it was rebooted. My rule of thumb for years was that any production box that was under 80% idle was due to be upgraded soon, because it was probably pegging the CPU during peak times during the day. If the box was under 70% idle, then it was time to start scrounging for an immediate upgrade. By those metrics, this box is way overdue for a major upgrade. Fortunately for my wallet, those metrics don’t really apply to this box–it’s spending a lot of its CPU time on tasks that aren’t particularly critical. Also, Linux 2.6 made some changes to /proc/stat that procinfo doesn’t seem to have picked up on; once you factor those into the equation, the box is really closer to 75% idle. Subtract off the non-critical usage, and the system is probably only 10% busy. I’ll probably upgrade it later this year if my virtual-server project works out, but that’s more for security and reliability than pure performance.
Posted in Computer System Administration | Tags dsl, home, linux, networking, router | no comments
Posted by Scott Laird
Tue, 08 Mar 2005 00:36:14 GMT
I’ve been watching Xen for a while now, and I’m nearly ready to take the jump and do some testing with it. I’m thinking about ordering a cheap Athlon 64 box for home to use as a testbed for the lightweight server concept that I’ve been kicking around for years. In the 18 months that have passed since I last talked about it, virtualization on the PC has advanced by leaps and bounds; at the time, I was looking at UML, which wasn’t really fast or stable enough. Xen looks to be both fast and stable, and it has a clear migration path onto the virtualization hardware offered by the next generation of PC hardware. That makes it nearly ideal for my purposes.
Posted in Linux, Xen, Computer System Administration, LWVS | Tags sysadmin, xen | no comments
Posted by Scott Laird
Mon, 10 Jan 2005 23:00:49 GMT
Okay, so my RAID array died because I wasn’t paying enough attention and my 3ware card had already kicked out one perfectly good drive for no obvious reason. No sweat, I can handle that. As I mentioned before, it took me most of a day, but I recovered almost all of the data off of the failed 4-drive array onto a new 2-drive RAID-0 array. Once the copy was complete, the goal was to destroy the old, broken RAID-5 array, create a new, working RAID-5 array, and then copy all of the data off of the RAID-0 array onto the new RAID-5 array. Then, when everything was complete, I was planning on using the RAID-0 disks as parity and spare drives for the RAID-5 set. Nice and simple, right?
So, by Friday night, I had 6 drives in front of me. One was bad, three were good, but part of the broken RAID array, and two held the data that had been on the RAID array. My goal was to take the 3 good drives and use them to build a new 4-drive RAID-5 array, so I built a software RAID-5 array in degraded mode–that way, I could get away with leaving out the 4th drive at the beginning. Once I copied the data off of the 5th and 6th drives, I was planning on adding them to the RAID-5 array so I’d have a 4th disk plus a spare.
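For reference, mdadm lets you build a degraded array by putting the word missing in place of each absent drive. The sketch below only assembles such a command line and doesn't run it. The device names and the degraded_raid5_create helper are placeholders for illustration, not a record of the exact command used here.

```python
def degraded_raid5_create(md_device, present, total):
    """Build an mdadm command that creates a RAID-5 array with absent slots."""
    # One "missing" placeholder for every slot that has no drive yet.
    missing = ["missing"] * (total - len(present))
    return ["mdadm", "--create", md_device, "--level=5",
            f"--raid-devices={total}", *present, *missing]

cmd = degraded_raid5_create("/dev/md1", ["/dev/sda", "/dev/sdc", "/dev/sdd"], 4)
print(" ".join(cmd))
```

The fourth drive and a spare can be added afterwards with mdadm --add, which is when the rebuild onto the missing slot starts.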
I was very careful not to re-use the broken drive–it was on 3ware channel #2, so I cleverly built my new array using Linux’s sda, sdc, and sdd devices, skipping sdb. Once RAID-5 was running, I formatted the new array, copied everything from the RAID-0 set, broke down the RAID-0 set, and added the drives to the RAID-5 array. And promptly watched everything crumble to dust. My RAID-5 array started out in degraded mode, with 3 of 4 drives active. I then added 2 additional drives, and instead of watching it rebuild to 4 of 4 plus 1 spare, it went to 2 of 4 active. It even sent me this helpful email:
From: scott@mail.sigkill.org
Subject: Fail event on /dev/md1:nfs
Date: January 8, 2005 8:16:43 AM PST
To: scott@sigkill.org
This is an automatically generated mail message from mdadm
running on nfs
A Fail event had been detected on md device /dev/md1.
Faithfully yours, etc.
Although the array was still mounted, any attempt to access it generated a steady stream of I/O errors. What happened, you ask?
Basically, I was an idiot. Like I said, the drive on 3ware channel #2 failed, so I didn’t use drive sdb. Except that 3ware numbers their channels starting with 0. So channel #2 was drive number 3–sdc, not sdb. So I’d rebuilt my array using the bad drive, then copied my data onto the broken disk, and destroyed all of my good copies. I spent all morning Saturday trying to fix things, but I couldn’t even get the kernel to acknowledge that the RAID array existed. I finally gave up and tried cloning sdb onto sdc, to see if that’d work, but it didn’t make a bit of difference–I could at least get mdadm to tell me that sdb had once been a part of a RAID array, but it didn’t recognize any of the data on sdc as any part of anything.
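The off-by-one is easy to see once it's written down. This is a minimal sketch that assumes every channel starting from 0 has a drive on it, each drive shows up as its own disk, and nothing else is claiming sd names first; channel_to_device is a hypothetical helper.

```python
import string

def channel_to_device(channel):
    """Map a zero-based controller channel to its sdX name (no gaps assumed)."""
    return "sd" + string.ascii_lowercase[channel]

print(channel_to_device(2))  # sdc
```

With zero-based channels, sdb is channel #1, not channel #2.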
In desperation, I tried re-creating the RAID array exactly as I’d first built it, using sda, sdc, and sdd. Amazingly enough, that worked, and I was able to mount the drive. I then carefully added sdc into the array, watched it rebuild the first 20% of the array, and then fail sdc back out of the array, leaving me back where I started. I finally turned off the computer in disgust and went and played with what was left of our snow.
Sunday was more snow, so I played with the kids, and then finally took one last swing at the computer. I re-built the RAID array again, and then built a RAID-0 array from sde and sdf. I then tried to copy anything that was salvageable off of the broken RAID-5 array. I figured that I’d be able to copy something before it croaked again. I checked back a couple hours later to discover that it’d copied all 216 GB without error. I was stunned–apparently the drive’s problem was really just corruption of a few sectors–writing new data back onto the drive overwrote the weak parts with a new, strong signal, and it was able to read them back safely. Ugh. It wouldn’t resync right because there were still a number of old sectors with old data on them–if I’d zeroed out the whole drive, it’d probably have worked right from the start, for at least a couple months, until it failed again.
So, I went back through the process again, destroying the array built from sda, sdc, and sdd, and then building a new one with sdb this time. There’s no way I’m going to trust the failing drive, even if it did work this time. I copied everything off of the little RAID-0 array, then carefully tore it apart and used its drives to rebuild the big array into its full RAID-5 glory. And it actually worked this time, without errors. Everything was finally finished around midnight last night, and I was able to reboot without problems.
All done, right?
Ha.
This morning I got up to find the screen full of syslogged Ethernet problems–apparently the network card had locked up. I could log in on the console, but I couldn’t ping anything. I rebooted, everything came up okay, and I tried copying a bunch of stuff onto the new RAID array. It copied just fine for about 5 minutes, and then the box locked up hard. No kernel panic or anything, just a dead box. The reset button didn’t help, and it ignored the soft power button, so I had to do the hold-the-power-button-for-5-seconds trick. After that, it didn’t boot right–there were 3ware card errors everywhere–timeouts, not drive problems. It locked up again halfway through booting.
So, practically speaking, I’m right back where I started on Friday morning–my box is dead, but the data is probably fine. I’m going to pop the box open and wiggle some cables, but I probably have bad hardware somewhere in the box–motherboard, 3ware card, or power supply. If this had happened at work, I’d just RMA the whole mess and let the vendor sort it out, but that’s not very useful at home, especially when dealing with a 4-year-old system with a second-hand RAID card. Ugh.
Update: I powered it off for a while, wiggled cables, removed spare hardware, rebooted, and found a nice kernel bug. If you have a RAID array with 4 drives plus a spare, and for some reason the spare’s RAID superblock has a higher timestamp than the 4 data drives, then the kernel’s RAID code will gladly kick the 4 good drives out of the array and keep just the spare. I sense a bug report in my near future.
Posted in Computer System Administration | Tags broken, ide, linux, raid | no comments
Posted by Scott Laird
Sat, 08 Jan 2005 08:29:24 GMT
I’ve lost a lot of hard drives over the years, but I’ve never really had the ability to put one under the microscope, so to speak, to see what happened and what I could have done to detect the failure before it became a problem. In general, even an extra 24 hours’ notice would greatly reduce the amount of data lost and reduce the pain involved in replacing failed drives. Drive makers understand this, and added the S.M.A.R.T. drive monitoring standard to drives years ago. Under Linux, the smartmontools package provides a number of tools for monitoring drives’ SMART status; I’ve been increasingly vigilant about running it on all of my systems, hoping that it’ll let me spot drive failures before data loss occurs.
I lost another drive this week. This is the first drive that I’ve lost that has been actively monitored by smartmontools the entire time, and the logs produced are instructive. Unfortunately, I didn’t pay close enough attention to SMART to prevent data loss, but there are a number of lessons contained in the logs produced. By understanding the precursors of this drive failure, we should be able to react faster when faced with future failures.
First, here are the basic specs on the system and drives involved:
- Athlon 700 (slot A)
- 384 MB RAM (PC133)
- Via KT133 chipset (Asus K7A MB, I think)
- 3ware 7500-8 8-channel IDE RAID controller
- 3 Maxtor 160 GB drives, 1 Hitachi 160 GB drive
The drive that failed was a Maxtor, on channel #2. Here’s what smartmontools 5.30 has to say about the drive in its current condition:
Device Model: Maxtor 4A160J0
Serial Number: A608B7WE
Firmware Version: RAMB1TU0
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Fri Jan 7 11:47:02 2005 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was
completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 24) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: ( 243) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 99) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0027 214 214 063 Pre-fail Always - 11805
4 Start_Stop_Count 0x0032 253 253 000 Old_age Always - 73
5 Reallocated_Sector_Ct 0x0033 249 249 063 Pre-fail Always - 41
6 Read_Channel_Margin 0x0001 253 253 100 Pre-fail Offline - 0
7 Seek_Error_Rate 0x000a 253 252 000 Old_age Always - 0
8 Seek_Time_Performance 0x0027 252 244 187 Pre-fail Always - 34394
9 Power_On_Hours 0x0032 224 224 000 Old_age Always - 24560
10 Spin_Retry_Count 0x002b 253 252 157 Pre-fail Always - 0
11 Calibration_Retry_Count 0x002b 253 252 223 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 253 253 000 Old_age Always - 76
192 Power-Off_Retract_Count 0x0032 253 253 000 Old_age Always - 0
193 Load_Cycle_Count 0x0032 253 253 000 Old_age Always - 0
194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 38
195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age Always - 43456
196 Reallocated_Event_Count 0x0008 251 251 000 Old_age Offline - 2
197 Current_Pending_Sector 0x0008 249 249 000 Old_age Offline - 41
198 Offline_Uncorrectable 0x0008 253 252 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0008 199 199 000 Old_age Offline - 0
200 Multi_Zone_Error_Rate 0x000a 253 252 000 Old_age Always - 0
201 Soft_Read_Error_Rate 0x000a 253 216 000 Old_age Always - 37
202 TA_Increase_Count 0x000a 253 248 000 Old_age Always - 0
203 Run_Out_Cancel 0x000b 253 245 180 Pre-fail Always - 19
204 Shock_Count_Write_Opern 0x000a 253 252 000 Old_age Always - 0
205 Shock_Rate_Write_Opern 0x000a 253 252 000 Old_age Always - 0
207 Spin_High_Current 0x002a 253 252 000 Old_age Always - 0
208 Spin_Buzz 0x002a 253 252 000 Old_age Always - 0
209 Offline_Seek_Performnce 0x0024 154 148 000 Old_age Offline - 0
99 Unknown_Attribute 0x0004 253 253 000 Old_age Offline - 0
100 Unknown_Attribute 0x0004 253 253 000 Old_age Offline - 0
101 Unknown_Attribute 0x0004 253 253 000 Old_age Offline - 0
smartctl also reports a bunch of event log results after this, but they’re not completely relevant right now–the events in question didn’t occur until things started failing.
Looking at the results that smartctl reports, it doesn’t look like anything is particularly wrong. None of the pre-fail statistics are outside of their ideal range, and the old-age statistics make the drive look nearly new. Just looking at these numbers wouldn’t give you any indication that the drive was throwing uncorrectable read errors every few minutes.
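The check being done by eye here is roughly "has any normalized VALUE dropped to its THRESH?" A minimal sketch of that comparison follows, assuming the ten-column row layout shown in the table; the helper names are made up for the example, and a zero threshold is treated as "never fails".

```python
def parse_attribute(line):
    """Split one row of the smartctl attribute table into a small dict."""
    f = line.split()
    return {"id": int(f[0]), "name": f[1], "value": int(f[3]),
            "worst": int(f[4]), "thresh": int(f[5]), "type": f[6]}

def below_threshold(attrs):
    """Names of attributes whose normalized value has reached the threshold."""
    return [a["name"] for a in attrs
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]

# Three rows copied from the table above.
rows = [
    "3 Spin_Up_Time 0x0027 214 214 063 Pre-fail Always - 11805",
    "5 Reallocated_Sector_Ct 0x0033 249 249 063 Pre-fail Always - 41",
    "8 Seek_Time_Performance 0x0027 252 244 187 Pre-fail Always - 34394",
]
attrs = [parse_attribute(r) for r in rows]
print(below_threshold(attrs))  # []
```

An empty list matches the PASSED self-assessment: by the drive's own thresholds, nothing has failed.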
So, let’s move on to the syslog results. The smartmontools package actively monitors each of these parameters and logs changes to syslog from time to time. You can look at the raw logs if you want to see the whole picture, but it’s way too long to include in its entirety here. The short version goes like this:
Dec 5 07:31:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252
Dec 5 15:01:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 5 15:31:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 5 20:01:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
This pattern continues on like this the whole time, with Seek_Time_Performance wandering from 251 to 253 and back. All 3 of my Maxtor drives do this all the time, and have since they were brand-new. It’s just noise in the logs, not a real problem. Next:
Dec 8 01:31:06 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 8 02:01:05 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
This is the first indication of trouble. Notice that it’s not very threatening–Hardware_ECC_Recovered just barely changed and it immediately flipped back to its old value. Plus, it’s marked as a “usage attribute,” which indicates that it’s non-threatening. Continuing:
Dec 13 04:50:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
Dec 13 06:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 13 07:20:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 13 09:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 11:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
Dec 13 13:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 13 21:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 21:50:57 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 13 21:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
This is the first time that Hardware_ECC_Recovered reoccurred after the first occurrence on the 8th. I left the Seek_Time_Performance lines in, just to show that the ECC lines aren’t particularly common–the Seek Time lines show up every couple hours, day in, day out.
The ECC notices continue, showing up again on the 16th, 18th, 25th, and again at 5:20 AM on the 1st. That’s where things start getting interesting:
Jan 1 03:20:57 starting scheduled Long Self-Test.
Jan 1 03:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 251
Jan 1 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Jan 1 05:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252
Jan 1 05:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 253 to 252
Jan 1 05:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252
Jan 1 05:50:56 Self-Test Log error count increased from 0 to 1
Jan 1 06:20:55 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Jan 1 06:20:55 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
At this point, I hadn’t seen any actual errors yet, but the drive’s SMART self-test had spotted a bad sector. The 2nd and 3rd were basically the same–the self-test reported that the same sector was still bad. All hell started to break loose on the 4th:
Jan 4 02:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 252 to 251
Jan 4 02:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 252 to 253
Jan 4 07:35:40 ATA error count increased from 980 to 981
Jan 4 08:35:40 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Jan 5 02:05:42 starting scheduled Short Self-Test.
Jan 5 02:35:40 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252
Jan 5 02:35:40 Self-Test Log error count increased from 3 to 4
Jan 5 06:36:08 SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 253 to 252
Jan 5 06:36:08 SMART Usage Attribute: 197 Current_Pending_Sector changed from 253 to 252
Jan 5 06:36:10 ATA error count increased from 981 to 1293
Jan 5 07:14:45 ATA error count increased from 1293 to 2377
By this point, I was seeing errors in the filesystem. Syslog was filling up with 3ware and XFS errors about disk problems. Things were starting to suck. On the 6th, I ordered new drives, and this morning I started installing them. I’m currently attempting to recover whatever data I can off of the bad disk.
So, there are a couple things that we can learn from this. First, if I’d been paying attention and immediately migrated data off of the failing disk as soon as SMART told me that it had developed a bad sector, then I’d probably have been okay. It took 2 or 3 days before the problem got bad enough to be visible at the filesystem level. Second, if I’d had enough familiarity with this particular Maxtor drive, then I should have noticed that something weird was happening when the ECC errors started climbing. None of my other Maxtor drives have ever logged an ECC message; that makes the Hardware_ECC_Recovered message look kind of suspicious, but that probably only holds for this exact family of Maxtor drives. In a commercial environment, where I had dozens or hundreds of similar drives, I’d want to tell my log monitoring software to pay special attention to that message, because it looks like a good indicator of drive failure.
More importantly, though–if I’d been paying closer attention to my 3ware card, I would have noticed that this 4-drive RAID 5 array was running in degraded mode before the drive failed. If I’d fixed that then, then the drive failure wouldn’t have cost me any data–the array would have dropped the failing drive and warned me, and that would have been that. Instead, I’m looking at a weekend’s worth of hassle as well as some data loss. When I get everything back up and running, I’m probably going to switch from using the 3ware card’s hardware RAID 5 to software RAID 5–I trust Linux’s RAID monitoring tools more than I trust 3ware’s. Also, I was only getting ~25 MB/sec writing with the 3ware’s hardware RAID 5, while I should get closer to 100 MB/sec with software RAID 5.
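The log-monitoring idea mentioned above could start as something as small as this sketch. It flags any change to a watched attribute plus any "error count increased" line, and ignores everything else (including the Seek_Time_Performance noise). The watch list and the flag function are assumptions for illustration, tuned to this one drive family, not a general rule.

```python
import re

# Matches smartd lines like "... Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252"
CHANGE = re.compile(r"Attribute: (\d+) (\S+) changed from (\d+) to (\d+)")

def flag(lines, watched=frozenset({"Hardware_ECC_Recovered"})):
    """Return the log lines that deserve a closer look."""
    hits = []
    for line in lines:
        m = CHANGE.search(line)
        if m and m.group(2) in watched:
            hits.append(line)          # a watched attribute moved
        elif "error count increased" in line:
            hits.append(line)          # self-test or ATA errors
    return hits

log = [
    "Dec 5 07:31:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252",
    "Dec 8 01:31:06 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252",
    "Jan 1 05:50:56 Self-Test Log error count increased from 0 to 1",
]
for line in flag(log):
    print(line)
```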
Posted in Computer System Administration | Tags broken, drive, ide, linux, smart | 13 comments
Posted by Scott Laird
Fri, 07 Jan 2005 17:51:59 GMT
Well, this isn’t looking promising–my new drives arrived, but the big RAID array is throwing errors left and right. I’m not sure how much data I’ll be able to recover off the thing. Most of the data on the drive is reconstructible, but not everything. Most of the contents are old digital pictures, but I’ve tried to write them all to DVD before throwing them onto the RAID array. Odds are I missed some stuff, though.
Amazingly enough, the system logs seem to have survived unscathed, so I’ll write up an “anatomy of a drive failure” article later, showing what this looked like from a SMART perspective. Since I’ve never actually seen a SMART-monitored drive failure before, it should be somewhat educational.
Posted in Computer System Administration | Tags broken, drive, failure, linux, raid, smart | no comments
Posted by Scott Laird
Thu, 06 Jan 2005 19:49:46 GMT
As mentioned earlier, I spent part of the long weekend cleaning up home theater stuff. Part of this involved migrating files onto my home file server, which is an old Athlon 700 with an 8-channel 3ware RAID card and 4 160 GB drives in a 450 GB RAID 5 array.
So what happens as soon as I finish copying stuff onto the array? A drive starts failing on the RAID array, and I discover that it was already running in degraded mode. Now I’m in danger of losing all 200 GB on the array. Most likely, it won’t come to that, but it’s still fantastically irritating. Of the 4 160 GB drives that I bought last year, 2 of them have now failed.
To make sure that this doesn’t happen again, I just ordered 2 more 160 GB drives from NewEgg (only $76 each), along with a 3-in-2 style drive cooler. Assuming that it all arrives tomorrow, I should be able to rebuild the array, including a spare drive this time, and hopefully I won’t have to worry about it failing again.
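The roughly 450 GB figure for four 160 GB drives is what you'd expect from RAID 5, which gives up one drive's worth of space to parity. A quick sketch, assuming each drive holds exactly 160 x 10^9 bytes; raid5_usable_bytes is a made-up helper name.

```python
def raid5_usable_bytes(drives, drive_bytes):
    """RAID 5 spends one drive's worth of space on parity: (n - 1) * size."""
    return (drives - 1) * drive_bytes

usable = raid5_usable_bytes(4, 160e9)
print(f"{usable / 1e9:.0f} GB decimal, {usable / 2**30:.0f} GiB binary")
# prints "480 GB decimal, 447 GiB binary"
```

Adding a hot spare doesn't change the usable size; the spare sits idle until a member drive fails.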
Posted in Computer System Administration, Toys, Personal | Tags 3ware, broken, drive, failure, ide, raid | no comments
Posted by Scott Laird
Sat, 07 Aug 2004 01:38:05 GMT
I’m planning on doing more research on this in a while, but I should mention it now: the latest release of Stalker Software’s CommuniGate Pro email software includes basic VoIP support.
CommuniGate Pro (CGP) is kind of fascinating to me. At its heart, it’s just commercial email software. It does SMTP, POP, IMAP, LDAP, and web mail, all of which you can do for free with open-source software. However, if you’re a small business or ISP, and email means anything to you, and you aren’t tied to Exchange, you owe it to yourself to take a serious look at CGP. It’s fast, it’s reliable, it’s completely standards-based, it’s trivial to configure, and it’s cheap. It starts at $500 for 50 users and drops off quickly. For $2,000, you can get a 1,000 user license. Now ask yourself, how long would it take to set up a 1,000-user POP/Webmail/SMTP mail server? How much support time will it take?
I’m starting to sound like an ad. I’ll try to stop.
They also do clustered mail servers, but their previously-reasonable prices suddenly jump well into the 6-figure price range. This isn’t the way to go if you’re looking for SPOF-free corporate email for cheap.
Their more recent releases have added some Exchange-like functionality–they support MAPI- and web-based calendaring with an Outlook plugin (for an additional cost), and they’ll provide spam and virus filtering for a price.
The thing that’s always fascinated me about these guys is that they seem to be a dinky, 5-10 person outfit, but they’re able to keep adding features faster than anyone else on the market, and do it without turning their software into a complete pig. At Internap, we were amazed to discover that their basic server with SMTP, HTTP, POP, IMAP, LDAP, SSL for everything, decent logging, a web UI for configuration and for email, and a mailing list manager all fit into under 2 MB of RAM. Once it got running, with hundreds of busy users, it grew to need 15 MB or so, but that was about it. I think we only managed to crash it once or twice in two years, and that’s under a murderous load–I think I was averaging over 2,000 email messages/day for part of that, and I was rarely the busiest user. We had way more problems getting Linux to keep up with the server’s I/O load, but that’s a whole different issue–we were saturating 2 external RAID arrays for almost the entire day every day.
Anyway, the latest release (4.2) adds SIP and RADIUS to their list of supported protocols. It isn’t really intended for serious PBX-replacing VoIP, but rather for IM and voice messaging. Since Windows XP includes SIP IM software, this seems like a useful addition to CGP. It’ll do VoIP as well, but it’s based on email addresses, not phone numbers, so it’ll be hard to get SIP phones to interoperate with it (although not impossible–most of them will let you dial names, but it’s hard to enter them with a phone keypad).
Personally, I’m going to keep my eye on them over the next year or two–they aren’t very far from turning CGP into a cheap all-in-one solution for small-business communications. All they need is a dialing plan, voicemail, support for external SIP-to-PSTN devices, and maybe faxing.
One quick disclaimer–it’s been a couple years since I last used their software. I’m not a sysadmin at my present job, and I have nothing to do with our email environment. And I’m not willing to pay $500 for my home email server, although I was tempted back in the .com days.
Posted in Computer System Administration, Asterisk | Tags communigatepro, email, imap, ldap, smtp, voip | 2 comments
Posted by Scott Laird
Fri, 06 Aug 2004 03:38:22 GMT
Backups suck. They always have, they always will, it’s just the nature of the beast. The big problem is that open-source backup solutions tend to suck really bad, and decent commercial solutions obey Joel’s pricing law, where no enterprise software costs between $3,000 and $100,000. At Internap, we paid well over $100k to Legato for their buggy, inflexible backup software, and we paid a similar amount of money for backup hardware. That’s what it took to do good backups of several hundred machines per night. On a smaller scale, people seem to have decent luck with Arkeia, but I’m not all that convinced that it’s much cheaper than Legato or Veritas once you pile on dozens of clients.
I’ve wanted a decent open-source solution for years. In ’97 or so, I used Amanda for a while, but it had a kind of bizarre approach to tapes, weird security needs, and it didn’t scale very well. It still seems to be under some sort of active development, but I was using version 2.4.0 in the late ’90s, and the current release is 2.4.4p3; draw your own conclusions. I’ve been spending around one afternoon per year since then looking for something free that can handle a couple dozen mixed Linux and Windows systems, and can back up to a mix of tape and disk, and I’ve come away empty-handed every time.
Until today. Somehow, Bacula slipped past me without my noticing it. Besides the catchy name and motto (“It comes by night and sucks the vital essence from your computers”), it seems to have the features that I’m looking for. It understands tape changers as well as filesystem backups. It’s network based with clients for Windows and Unix (and sort of OS X). It looks like it does an okay job at backup parallelism. It doesn’t have plugins for hot backups of popular databases, but most of the time, it’s easier to dump the database to a file and then back up the file, at least when you’re dealing with smallish databases.
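The dump-then-back-up approach can be sketched roughly like this. This is only an illustration under assumptions that aren't in the post: it assumes a PostgreSQL database and the stock pg_dump tool, and the helper names, the database name, and the /var/backups/db directory are all made up. The idea is just to write the dump into a directory that the nightly file backup already covers.

```python
import subprocess
from datetime import date
from pathlib import Path


def dump_command(dbname, out_dir, today=None):
    """Build the pg_dump command line for a dated dump file (hypothetical helper)."""
    today = today or date.today()
    target = Path(out_dir) / f"{dbname}-{today.isoformat()}.dump"
    # --format=custom writes pg_dump's compressed archive format;
    # --file names the output file instead of writing to stdout.
    return ["pg_dump", "--format=custom", "--file", str(target), dbname]


def dump_database(dbname, out_dir):
    """Run pg_dump and return the path of the dump file it wrote."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = dump_command(dbname, out_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if pg_dump fails
    return Path(cmd[3])


if __name__ == "__main__":
    # Assumed names: a database called "wiki" and a dump directory that
    # the regular file-level backup job already includes.
    print(dump_database("wiki", "/var/backups/db"))
```

Run something like this from cron shortly before the backup window and the backup software never needs to know that a database is involved.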
It’s currently at version 1.34, although Debian seems stuck at 1.32f4 for some reason. I’ll be doing some testing tonight at home, since I’ve been without a comprehensive backup solution for years, but I’m feeling pretty good about this.
Having said that, it still lacks a few things that I’d consider rather useful:
- It can’t migrate backups between storage devices. No staging from disk to tape, for instance.
- It doesn’t do anything with optical disks. I’ve had good luck with doing full backups of important data to CDs or DVDs; it’d be nice to integrate that into one central backup system. This is better for smaller-scale setups, though.
- It can’t stripe a single backup across multiple drives.
- It’s not clear if it can interleave backups onto a single tape. When you’re backing up slow hosts over a big network, this can be critical. For LAN usage with cheap tape drives, it’s less critical.
They maintain a to-do list online. I haven’t really looked into their security history; there are a couple worrisome comments about sscanf in their to-do list, but it can’t be any worse than Legato Networker–it had holes that you could drive a bus through along with a few really fun failure modes. We triggered a great DDoS on ourselves one night in 2000 with every single backup client in the company sending 64-byte UDP packets to our backup server as fast as they could transmit them. We really loved them for that.
Update (8/6/2004): I let it rip last night. First, its UI needs work. Most configuration is done via text files, which is okay, but monitoring and management can be done via either a bad CLI without command-completion, or via a bad GUI which is basically just the CLI in a window with a scrollbar. Windows backups don’t include open files *or* the registry. Mac backups don’t include resource forks, but I knew that going into it. Finally, I’m not getting any email notification of success or failure, but that might be a mail problem on my backup server. All in all, it’s still better than most of the free solutions that I’ve looked at, but still not 100% there. It’s usable if you only care about Unix systems and you’re willing to spend a bit of time learning and scripting. It’s not usable if you’re looking for an out-of-the-box system that’ll work with Windows.
A friend suggested that I look at Box Backup. It doesn’t do tapes at all, which is a bit of a shortcoming in my mind, but it looks like it might be better for pure-Unix backups. On the other hand, it’s designed to do continuous backups, not nightly snapshots, which is probably a better design, if you can handle the load.
I should really write the backup strategy guide that I keep meaning to write. Leave a comment if you’re interested.
Posted in Computer System Administration | Tags backups, bacula, linux | 7 comments
Posted by Scott Laird
Thu, 06 May 2004 17:59:00 GMT
I’ve looked at Zoe once or twice in the past, but it never quite grabbed enough of my interest for me to bother installing it. If you aren’t familiar with Zoe, it’s a Java-based email search proxy thing that they’ve never really been able to explain on their website. Yesterday I was searching for more information on Near-Time Flow, and came across a blog entry by Tom Malaher titled “Google your Email”:
Who needs GMail? You’ve got your own CPU and Disk space, use it.
ZOE lets you read and search your email (with Lucene), without supplying helpful related advertising. Not to mention that it also has a very cool non-linear email access metaphor. Forget Inbox/Sent Mail/…customFolders.. you just browse.
Ah, finally–someone explains the point of Zoe. It’s basically a personal email search engine. Once I got that, I grabbed a copy and tried it out. It’s trivial to install–just extract the files from the archive and double-click on Zoe.jar. Zoe runs its own web server on port 10080, and automatically fires up your favorite browser when it starts. The web interface is intuitive and reasonably attractive, and it’s easy to add new POP or IMAP accounts and have Zoe import mail from them. While it’s possible to use Zoe as a web-based mail reader, it’s not really very good at that–it doesn’t do folders at all, and I can’t figure out how to get it to do threads. That’s not really a problem, though, because it’s not supposed to be used for normal mail reading: it’s a search engine, not a mail reader.
I probably have around 100,000 messages sitting in assorted IMAP mailboxes in various places, and Zoe is the first program that I’ve found that is actually usable for searching them. OS X’s Mail program isn’t very good at searching huge volumes of mail, particularly when most of it lives on IMAP servers.
The big problem with Zoe is its resource needs–it’s written in Java, and wants at least 70 MB of RAM when it’s running on my laptop, plus a few hundred MB of disk space. I just don’t have enough free RAM on my laptop to add another 70+ MB program, so I’m going to try running it on one of my Linux servers at home and see how that goes.
A couple points about Zoe: while its UI is predictable and easy to use, its documentation is nearly nonexistent. Like Asterisk, you’re stuck using Google to search mailing lists and third-party wikis to find details. Zoe really needs a more detailed configuration interface. As it is, a lot of less-common features need to be controlled by editing Java property files.
It’s an easy install, though, and it’s very usable right out of the box, so I’d recommend installing it and checking it out.
Posted in Computer System Administration | Tags email, java, search, zoe
Posted by Scott Laird
Wed, 05 May 2004 01:32:21 GMT
A co-worker just pointed this out. On my Powerbook (a G4/550, 768 MB, 40 GB 4200 RPM drive), running ‘find /usr’ takes around 30 seconds every time I run it:
tibook$ time find /usr | wc
   48402   48402 2390367
real    0m30.041s
user    0m0.410s
sys     0m2.620s
tibook$ time find /usr | wc
   48402   48402 2390367
real    0m32.034s
user    0m0.450s
sys     0m2.710s
On the other hand, one of my Linux boxes at home isn’t that much faster (Athlon 700, 384 MB, old Maxtor 5 GB drive), but it’s able to do repeated finds much quicker:
debian# time find /usr | wc
  124088  124108 5869110
real    1m43.631s
user    0m0.680s
sys     0m1.170s
debian# time find /usr | wc
  124088  124108 5869110
real    0m2.090s
user    0m0.530s
sys     0m0.700s
Notice that repeated finds drop from 103 seconds to 2 seconds on the Linux box, while they stay around 30 seconds on the Mac, even though the Mac has twice the RAM of the PC.
I’m assuming that OS X is restricting the amount of RAM used for disk caching, but it’s really painful in this case.
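If you want to try the same comparison on your own machine without juggling `time` and `wc`, here is a minimal sketch in Python. The function names are mine, not part of any tool, and the entry count only approximates what `find | wc -l` reports. It walks the tree several times in a row and records how long each pass takes; a large drop from the first pass to the second suggests that the directory metadata is being served from cache.

```python
import os
import time


def count_entries(root):
    """Walk root and count every file and directory, roughly like `find root | wc -l`."""
    total = 1  # find prints the starting directory itself
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def time_repeated_walks(root, runs=2):
    """Return a list of (entries, seconds) tuples, one per consecutive walk."""
    results = []
    for _ in range(runs):
        start = time.perf_counter()
        entries = count_entries(root)
        results.append((entries, time.perf_counter() - start))
    return results


if __name__ == "__main__":
    for i, (entries, seconds) in enumerate(time_repeated_walks("/usr"), 1):
        print(f"run {i}: {entries} entries in {seconds:.2f}s")
```

The first pass is only a true cold-cache measurement if nothing has touched the tree since boot, so treat the numbers as a rough indication rather than a benchmark.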
Posted in Mac stuff, Computer System Administration | Tags linux, macosx, powerbook, slow | no comments
Posted by Scott Laird
Tue, 04 May 2004 21:54:40 GMT
You know, this error message might just make it worth it to go through the incredible hassle required to install Kerberos.
Posted in Computer System Administration | Tags funny, kerberos, windows | no comments
Posted by Scott Laird
Tue, 06 Apr 2004 17:40:15 GMT
databasejournal.com has a nice article on PostgreSQL query analysis from one of the guys behind RubyForge. It’s not rocket science, but it demonstrates that it’s really easy to do statistical query analysis with Postgres and a bit of Ruby code. The article concentrates on the statistical side of things (“which queries are we running most often?”) rather than the query analysis side (“why is this query so slow?”). The implication is that the biggest performance wins are to be found by removing unneeded and excessive queries, rather than speeding up the ones that you’re already making. I’m not sure that I completely agree with that, but most of the database tuning articles that I’ve seen concentrate on the other side of things, so it’s nice to see some balance.
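To give a feel for the "which queries are we running most often" side, here is a rough sketch of the idea. It is not the article's code: the article uses Ruby, this is Python, and the function names and normalization rules are my own assumptions. It takes query strings that have already been pulled out of a log, replaces literal values with placeholders so that queries differing only in their parameters are counted together, and reports the most frequent shapes.

```python
import re
from collections import Counter


def normalize(query):
    """Collapse literals so queries that differ only in values count as one shape."""
    q = re.sub(r"'(?:[^']|'')*'", "?", query)   # quoted strings -> ?
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "?", q)    # standalone numeric literals -> ?
    q = re.sub(r"\s+", " ", q).strip()          # squeeze runs of whitespace
    return q.lower()


def top_queries(queries, n=5):
    """Return the n most common normalized queries as (query, count) pairs."""
    return Counter(normalize(q) for q in queries).most_common(n)


if __name__ == "__main__":
    # Made-up sample data standing in for statements extracted from a log.
    sample = [
        "SELECT * FROM users WHERE id = 1",
        "select * from users where id = 42",
        "SELECT name FROM groups WHERE name = 'admin'",
    ]
    for query, count in top_queries(sample):
        print(count, query)
```

Getting the statements out of the log in the first place depends on how Postgres logging is configured, which this sketch deliberately leaves out.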
Posted in Computer System Administration, Ruby | Tags database, postgres, ruby | no comments