Time to retrain SpamProbe

I’ve been using SpamProbe for almost two years, and it’s done a great job of filtering my spam. Unfortunately, it’s become a resource pig in the process. My spam database has grown to over 500 MB, and iostat -x suggests that SpamProbe has been keeping my disk busy almost 80% of the time for minutes at a stretch. It hasn’t been uncommon for messages to sit in the queue for up to 10 minutes, delayed by spam checking.

I finally decided that this was too much, so I’m retraining SpamProbe using its new hash database format. Instead of saving the text of each Bayes entry, it saves only a 32-bit hash of that text. It costs a little bit of accuracy, but it’s supposed to be a huge speed win. Unfortunately, this will require over an hour of CPU and disk time to reprocess thousands of messages, but it should be worth it.
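To make the trade-off concrete, here is a rough Python sketch of the general idea: key the spam/good counts by a 32-bit hash of each term instead of by the term's text. This is not SpamProbe's code or its on-disk format. The names token_key and HashedCounts are made up for illustration, and CRC-32 is only a stand-in, since I haven't checked which hash function SpamProbe actually uses.

```python
import zlib
from collections import defaultdict


def token_key(token: str) -> int:
    """Map a term to a 32-bit integer key.

    CRC-32 is an assumed stand-in hash, not necessarily what SpamProbe uses.
    """
    return zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF


class HashedCounts:
    """Hypothetical in-memory table of [spam, good] counts keyed by hash."""

    def __init__(self):
        # key (32-bit int) -> [spam_count, good_count]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, tokens, is_spam):
        """Add one message's terms to the spam or good counts."""
        idx = 0 if is_spam else 1
        for token in tokens:
            self.counts[token_key(token)][idx] += 1

    def lookup(self, token):
        """Return (spam_count, good_count); unseen terms give (0, 0)."""
        spam, good = self.counts.get(token_key(token), (0, 0))
        return spam, good


# Example usage with made-up messages.
db = HashedCounts()
db.train(["cheap", "pills", "free"], is_spam=True)
db.train(["meeting", "notes", "free"], is_spam=False)
print(db.lookup("free"))
```

Every key is now a fixed four bytes no matter how long the original term was, which is where the space and speed savings would come from. The accuracy cost is that two different terms can hash to the same key and end up sharing one pair of counts.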

Posted by Scott Laird Wed, 20 Jul 2005 01:36:07 GMT