Spam filtering

Posted by Scott Laird Wed, 29 Oct 2003 22:32:26 GMT

I finally turned on server-side spam filtering at home this weekend. I’ve resisted doing it for a couple of years: first for lack of decent spam filtering software, and then because Jaguar’s Mail.app did decent spam filtering on its own. Why bother with server-side filtering when client-side filtering works well and is easier to train?

That argument held up for a while, but the latest rounds of pharmaceutical spam have been pretty good at getting past Mail.app’s filters, and moving to Panther hasn’t done anything to help. So, over the weekend, I added a Bayesian filter (Spamprobe) to my Courier-based mail server. As usual, it was more difficult than I had hoped it would be, but not as bad as I’d feared. The first step was turning on Courier’s maildrop filter. This was easy: just a simple edit of /etc/courier/courierd to add

DEFAULTDELIVERY="| /usr/bin/maildrop"

I could have used procmail, but I’ve grown increasingly irritated by its obtuse syntax over the years. I refuse to use languages that consist mostly of punctuation. Once maildrop was running, I added this to my .mailfilter file:

# save a copy of every message to the "saved" folder, better safe than sorry
cc "$HOME/Maildir/.spam.saved"

# score the mail and tag it
SCORE=`spamprobe -8 receive`
xfilter "reformail -I \"X-SpamProbe: $SCORE\""

echo "Score: $SCORE"

# if it's spam, deliver it to the spam folder instead of the inbox
if (/^X-SpamProbe: SPAM/)
  to "$HOME/Maildir/.spam.spam"

This is mostly copied from the README.maildrop that came with the Debian version of spamprobe, but I had to tweak it a bit before it’d drop mail into the right maildir. I then had to create $HOME/.spamprobe and spam/saved, spam/spam, and spam/ham mail folders.
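For anyone following along, here is a rough sketch of that folder setup in plain shell. It assumes the Maildir++ layout that the paths in the filter imply (a dot-prefixed directory per folder, each with cur/, new/ and tmp/ plus an empty maildirfolder marker file); the function name make_spam_folders is made up for illustration.

```shell
# make_spam_folders MAILDIR
# Create the three subfolders used in this post (spam/saved, spam/spam and
# spam/ham) under MAILDIR. Assumes Maildir++ naming: ".spam.ham" shows up in
# the mail client as spam/ham.
make_spam_folders() {
    maildir=$1
    for name in saved spam ham; do
        folder="$maildir/.spam.$name"
        # every maildir folder needs these three subdirectories
        mkdir -p "$folder/cur" "$folder/new" "$folder/tmp" || return 1
        # the empty marker file flags this as a subfolder, not a top-level maildir
        touch "$folder/maildirfolder" || return 1
    done
}
```

Running make_spam_folders "$HOME/Maildir" followed by mkdir -p "$HOME/.spamprobe" should cover this step. Courier's own maildirmake -f (for example, maildirmake -f spam.ham "$HOME/Maildir") ought to do the same job for the folders.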

Once this is complete, all mail that comes through my system is copied to spam/saved and then scored by spamprobe. Spam is delivered to spam/spam, while non-spam mail is delivered normally.
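The routing decision itself is tiny. As an illustration only, this shell function mirrors the test the filter above makes on the X-SpamProbe header: anything whose score line starts with SPAM goes to the spam folder, everything else falls through to normal delivery. The function name route_for and the "inbox" label are invented for this sketch, and the sample score lines in the usage note only show the general shape the filter assumes.

```shell
# route_for SCORE_LINE
# Print where a message would land, given the score line that the filter
# stores in the X-SpamProbe header. Only the leading word matters, which is
# the same thing the /^X-SpamProbe: SPAM/ test checks.
route_for() {
    case "$1" in
        SPAM*) echo ".spam.spam" ;;  # matched: rerouted to the spam folder
        *)     echo "inbox" ;;       # no match: normal delivery
    esac
}
```

For example, route_for "SPAM 0.9876543 abcdef" prints .spam.spam, while route_for "GOOD 0.0001234 abcdef" prints inbox.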

The next step is to fill spam/spam and spam/ham (“ham is not spam”) with a bunch of samples of spam and non-spam mail. Fortunately, I had 500 or so of each just sitting around. I copied them into place, and then ran a script like this:

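#!/bin/sh
# Retrain spamprobe from the hand-sorted folders.
# The */* glob picks up messages in each folder's cur/ and new/ subdirectories.
IMAPDIR="$HOME/Maildir"
spamprobe good "$IMAPDIR"/.spam.ham/*/*
spamprobe spam "$IMAPDIR"/.spam.spam/*/*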

This tells spamprobe to analyze the contents of my spam/spam and spam/ham folders to discover which keywords signify spam and which signify ham. I then added a cron job to re-run this script hourly.
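An hourly crontab entry for this might look something like the line below. The script path is a made-up placeholder; put in wherever the training script actually lives.

```
# minute hour day-of-month month day-of-week  command
0 * * * * $HOME/bin/spamprobe-train.sh
```

This runs the script at the top of every hour, as the user who owns the crontab.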

To train the spam filter, all I need to do is drag messages around in Mail.app. If a spam message appears in my inbox, then I drag it to the spam/spam folder. From time to time, I check the spam/spam folder to look for false positives, and then drag them to spam/ham. The next time the cron job runs, my filter will adjust itself and do a better job categorizing spam.

So far, it’s working well. Most of the spam that I receive is addressed to one specific account that is forwarded from a previous employer. Until last night, I was just dumping all of the mail from this account into a folder automatically, and then checking it a couple of times per week to remove the ~150 spams/day that it receives. Last night, I stopped filtering it into its own box and let spamprobe handle it, and so far it’s doing a good job. I’ve only seen 3 or 4 false negatives, and those were from early in the training process.

Annoyingly, I’ve had 6 false positives that I had to pluck out of the spam folder. One was a MAILTO web form that went to my old college user group mailing list; it was categorized as spam, and that primed the pump so that several followups to the same list also went into the spam box. Once I moved them to the ham folder, mail for that list started making its way into my inbox correctly. Spamprobe also ate an opt-in ad from REI and a notice from a vendor that I wanted to see, but moving both of these to spam/ham seems to have fixed the problem.

According to a grep of my spam/spam folder, I received 217 spam messages yesterday and 101 so far today. Good riddance.
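A count like that can be pulled out with something along these lines. This is a sketch, not the exact command used for the numbers above: the function name count_matching is invented, and matching on the Date: header is an assumption (that header is supplied by the sender, so forged dates will skew the count a little).

```shell
# count_matching FOLDER PATTERN
# Print the number of messages in FOLDER/cur and FOLDER/new that contain at
# least one line matching PATTERN. grep -l lists each matching file once, so
# a message is never counted twice. Note the pattern is checked against the
# whole message, body included.
count_matching() {
    grep -l -e "$2" "$1"/cur/* "$1"/new/* 2>/dev/null | wc -l | tr -d ' '
}
```

Usage would be something like count_matching "$HOME/Maildir/.spam.spam" '^Date: .*29 Oct 2003'. The 2>/dev/null hides the complaint grep makes when one of the two subdirectories is empty, and tr strips the padding that some versions of wc add.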


Comments

  1. Scott Laird said 5 days later:

    I’ve used SpamAssassin before, at Internap, and it didn’t work very well for us. That is, it was okay, but tweaking heuristics was a pain, and a lot of small stuff slipped through. The filtering in Apple’s Mail.app worked better for me, but it was still a bit flaky. Pure Bayesian filtering is less complex but seems to work substantially better, at least for me. And the only tuning required is moving spam into my ‘spam’ folder and moving false positives into my ‘ham’ folder. No tweaking config files and trying to guess which scoring changes will do the right thing.
