I finally turned on server-side spam filtering at home this weekend. I resisted doing it for a couple of years: first for lack of decent spam-filtering software, and then because Jaguar's Mail.app did decent spam filtering on its own. Why bother with server-side filtering when client-side filtering works well and is easier to train?

That argument held up for a while, but the latest rounds of pharmaceutical spam have been pretty good at getting past Mail.app's filters, and moving to Panther hasn't done anything to help. So, over the weekend, I added a Bayesian filter (Spamprobe) to my Courier-based mail server. As usual, it was more difficult than I had hoped it would be, but not as bad as I'd feared. The first step was turning on Courier's maildrop filter. This was easy: just a simple edit of /etc/courier/courierd to add

DEFAULTDELIVERY="| /usr/bin/maildrop"

I could have used procmail, but I've grown increasingly irritated by its obtuse syntax over the years. I refuse to use languages that consist mostly of punctuation. Once maildrop was running, I added this to my .mailfilter file:

# keep a copy of every message in the "saved" folder, better safe than sorry
cc "$HOME/Maildir/.spam.saved"

# score the mail and tag it
SCORE=`spamprobe -8 receive`
xfilter "reformail -I \"X-SpamProbe: $SCORE\""

echo "Score: $SCORE"

# if it's spam, deliver it to the spam folder instead of the inbox
if (/^X-SpamProbe: SPAM/)
  to "$HOME/Maildir/.spam.spam"

This is mostly copied from the README.maildrop that came with the Debian version of spamprobe, but I had to tweak it a bit before it'd drop mail into the right maildir. I then had to create $HOME/.spamprobe and three mail folders: spam/saved, spam/spam, and spam/ham.
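
For what it's worth, creating those folders by hand can be sketched roughly as below. This assumes Courier's usual on-disk layout, where an IMAP folder like spam/saved lives in a dot-prefixed directory (.spam.saved) containing cur, new, and tmp plus an empty maildirfolder marker file. The make_folder and make_spam_folders names are made up for this sketch; Courier's own maildirmake tool should be able to do the same job.

```shell
#!/bin/sh
# Sketch only: make_folder and make_spam_folders are hypothetical helpers,
# not part of Courier or spamprobe.

make_folder() {
    # $1 = Maildir root, $2 = on-disk folder name, e.g. ".spam.saved"
    mkdir -p "$1/$2/cur" "$1/$2/new" "$1/$2/tmp" &&
    # assumption: subfolders are marked with an empty "maildirfolder" file
    : > "$1/$2/maildirfolder"
}

make_spam_folders() {
    # $1 = Maildir root; creates the three folders used by the filter
    for f in .spam.saved .spam.spam .spam.ham; do
        make_folder "$1" "$f" || return 1
    done
}
```

Running make_spam_folders "$HOME/Maildir" and then mkdir -p "$HOME/.spamprobe" would cover everything listed above.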

Once this is complete, all mail that comes through my system will be copied to spam/saved and then scored by spamprobe. Mail tagged as spam is delivered to spam/spam instead of the inbox, while non-spam mail is delivered normally.

The next step is to fill spam/spam and spam/ham ("ham is not spam") with a bunch of samples of spam and non-spam mail. Fortunately, I had 500 or so of each just sitting around. I copied them into place, and then ran a script like this:

IMAPDIR=$HOME/Maildir
spamprobe good $IMAPDIR/.spam.ham/*/*
spamprobe spam $IMAPDIR/.spam.spam/*/*

This tells spamprobe to analyze the contents of my spam/spam and spam/ham folders to discover which keywords signify spam and which signify ham. I then added a cron job to re-run this script hourly.
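
Since cron runs this unattended, a slightly more defensive variant might be worth having. The sketch below is not what I actually run; it assumes the same folder names as above, feeds spamprobe one message at a time, and skips a folder whose cur or new directory is empty (an unmatched glob would otherwise be passed along as a literal file name). The train_folder function and the SPAMPROBE override variable are names invented for this example.

```shell
#!/bin/sh
# Sketch of a cron-friendly training wrapper. train_folder and the
# SPAMPROBE variable are hypothetical names used only in this example.
IMAPDIR="${IMAPDIR:-$HOME/Maildir}"

train_folder() {
    # $1 = spamprobe command ("good" or "spam"), $2 = maildir folder
    for msg in "$2"/cur/* "$2"/new/*; do
        # an unmatched glob stays literal, so skip anything that is not a file
        [ -f "$msg" ] || continue
        "${SPAMPROBE:-spamprobe}" "$1" "$msg"
    done
}

train_folder good "$IMAPDIR/.spam.ham"
train_folder spam "$IMAPDIR/.spam.spam"
```

Calling spamprobe once per message is slower than handing it the whole glob, so with folders of a few hundred messages this is a trade of speed for robustness.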

To train the spam filter, all I need to do is drag messages around in Mail.app. If a spam message appears in my inbox, then I drag it to the spam/spam folder. From time to time, I check the spam/spam folder to look for false positives, and then drag them to spam/ham. The next time the cron job runs, my filter will adjust itself and do a better job categorizing spam.

So far, it's working well. Most of the spam that I receive is addressed to one specific account that is forwarded from a previous employer. Until last night, I was just dumping all of the mail from this account into a folder automatically and checking it a couple of times per week to remove the ~150 spams/day that it receives. Last night, I stopped filtering it into its own box and let spamprobe handle it, and so far it's doing a good job. I've only seen 3 or 4 false negatives, and those were from early in the training process.

Annoyingly, I've had 6 false positives that I had to pluck out of the spam folder. One was a MAILTO web form that went to my old college user group mailing list; it was categorized as spam, and that primed the pump so that several followups to the same list also went into the spam box. Once I moved them to the ham folder, mail for that list started making its way into my inbox correctly. Spamprobe also ate an opt-in ad from REI and a notice from a vendor that I wanted to see, but moving both of these to spam/ham seems to have fixed the problem.

According to a grep of my spam/spam folder, I received 217 spam messages yesterday and 101 so far today. Good riddance.
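
A count like that can be had with something along these lines. This is a rough sketch rather than the exact command I used: it assumes one message per file in the folder's cur and new directories, and it matches on the Date: header, which spammers often forge, so treat the numbers as approximate. The count_spam_for_day name is made up for the example.

```shell
#!/bin/sh
# Sketch only: count_spam_for_day is a hypothetical helper.
count_spam_for_day() {
    # $1 = maildir folder, $2 = date fragment as it appears in a Date: header
    n=0
    for msg in "$1"/cur/* "$1"/new/*; do
        [ -f "$msg" ] || continue
        # count the message if any line starting with "Date:" mentions the day
        if grep -q "^Date:.*$2" "$msg"; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

For example, count_spam_for_day "$HOME/Maildir/.spam.spam" "17 Nov 2003" would print the number of messages in the spam folder whose Date: header mentions that (purely illustrative) day.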