I’ve been using SpamProbe for almost two years, and it’s done a great job of filtering my spam. Unfortunately, it’s become a resource pig in the process. My spam database has grown to over 500 MB, and iostat -x suggested that SpamProbe was keeping my disk busy almost 80% of the time for minutes at a stretch. It wasn’t uncommon for messages to sit in the queue for up to 10 minutes, delayed by spam checking.

I finally decided that this was too much, so I’m re-training SpamProbe using its new hash database format. Instead of saving the text of each Bayes entry, it simply saves a 32-bit hash of that text. It costs a little bit of accuracy, but it’s supposed to be a huge speed win. Unfortunately, this will require over an hour of CPU and disk time to reprocess thousands of messages, but it should be worth it.
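To make the trade-off concrete, here’s a rough Python sketch of the general idea: key the counts by a 32-bit hash of each term instead of by the term’s text. This is not SpamProbe’s actual code or on-disk format. CRC-32 is just a stand-in hash, and the names term_key and HashedCounts are made up for illustration.

```python
import zlib
from collections import defaultdict


def term_key(term: str) -> int:
    """Map a term to a 32-bit key (CRC-32 is a stand-in, not SpamProbe's real hash)."""
    return zlib.crc32(term.encode("utf-8")) & 0xFFFFFFFF


class HashedCounts:
    """Hypothetical in-memory table: 32-bit key -> [spam count, good count]."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, terms, is_spam):
        # Only the fixed-size key is stored; the term text is thrown away.
        column = 0 if is_spam else 1
        for term in terms:
            self.counts[term_key(term)][column] += 1

    def lookup(self, term):
        # Unseen keys report zero counts without adding an entry.
        spam, good = self.counts.get(term_key(term), (0, 0))
        return spam, good


db = HashedCounts()
db.train(["free", "pills"], is_spam=True)
db.train(["free", "meeting"], is_spam=False)
print(db.lookup("free"))
```

Every key is the same small size no matter how long the term is, which is presumably where the space and speed savings come from. The catch is that two different terms can occasionally hash to the same key and end up sharing counts, which would explain the small hit to accuracy.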