Posted by Scott Laird
Wed, 20 Jul 2005 01:36:07 GMT
I’ve been using SpamProbe for almost two years, and it’s done a great job of filtering my spam. Unfortunately, it’s become a resource pig in the process. My spam database has grown to over 500 MB, and iostat -x suggests that SpamProbe was keeping my disk busy almost 80% of the time for minutes at a stretch. It wasn’t uncommon for messages to sit in the queue for up to 10 minutes, delayed by spam checking.
I finally decided that this is too much, so I’m re-training SpamProbe using its new hash database format. Instead of saving the text from each Bayes entry, it simply saves a 32-bit hash of that text. It costs a little bit of accuracy, but it’s supposed to be a huge speed win. Unfortunately, this will require over an hour of CPU and disk time to reprocess thousands of messages, but it should be worth it.
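To make the trade-off concrete, here is a minimal Python sketch of the idea. It is not SpamProbe's actual code: CRC-32 stands in for whatever hash function SpamProbe really uses, and the names `token_key` and `HashedCounts` are made up for illustration.

```python
import zlib
from collections import defaultdict


def token_key(token: str) -> int:
    """Map a token to a 32-bit integer (CRC-32 is an assumption, not SpamProbe's hash)."""
    return zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF


class HashedCounts:
    """Spam/ham counts keyed by a 32-bit hash instead of by the token text."""

    def __init__(self):
        # key -> [spam_count, ham_count]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, tokens, is_spam):
        """Bump the spam or ham count for every token in a message."""
        for token in tokens:
            self.counts[token_key(token)][0 if is_spam else 1] += 1

    def lookup(self, token):
        """Return (spam_count, ham_count) for a token, or (0, 0) if unseen."""
        spam, ham = self.counts.get(token_key(token), (0, 0))
        return spam, ham
```

Every key is a fixed four bytes no matter how long the token is, which is where the space and speed savings would come from. The accuracy cost is that two different tokens can hash to the same key and end up sharing counts.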
Posted in Spam | Tags email, spam, spamprobe | no comments
Posted by Scott Laird
Thu, 24 Jun 2004 19:34:01 GMT
I’ve received a handful of email messages recently that aren’t exactly normal spam. This includes messages with no body and very few headers, and messages composed of slightly random text with no real attempt to sell anything. Here’s an example:
Hello, handsome!
No one ever lost his honor, except he who had it not.
Difficult times always create opportunities for you to experience more love in your life. A man who exposes himself when he is intoxicated, has not the art of getting drunk.
necessarian unseldom pronationalistic malthas scarfing
I tend to play mostly villains and twisted people. Unsavory guys. I think it’s my face, the way I look.
Ain’t no man can avoid being born average, but there ain’t no man got to be common.
I can only assume that both of these types of messages are an attempt to screw up Bayesian filtering tables by sneaking borderline words into your pool of non-spam (“ham”) email messages. The idea is that spam filters won’t find anything objectionable in the message, so they won’t mark it as spam, and users will just delete the message without using it to train their filters. I’m not convinced that it’ll work, but it’s a nice try.
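A rough sketch of why the trick might help a spammer, assuming a filter that combines per-token probabilities in the naive-Bayes style Paul Graham described. The probabilities below are invented for illustration and are not taken from any real database.

```python
from math import prod


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, naive-Bayes style."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


# Hypothetical per-token spam probabilities for a sales pitch.
pitch = [0.95, 0.90, 0.85]

# The same pitch padded with filler words that have drifted toward "ham"
# because earlier junk mail containing them was counted as non-spam.
padded = pitch + [0.20, 0.20, 0.20, 0.20]

print(round(combined_spam_probability(pitch), 3))
print(round(combined_spam_probability(padded), 3))
```

With these made-up numbers the padded message scores noticeably lower than the bare pitch, though it still lands on the spam side. That fits the suspicion that the attack is a nice try rather than a sure thing.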
Posted in Spam | Tags bayes, spam | no comments
Posted by Scott Laird
Tue, 09 Mar 2004 22:50:39 GMT
My spam is missing.
I used to receive over 100 spams per day, but that was mostly due to forwarding from a former employer. Once the forwarding stopped, I was still receiving around 20 spams per day. Recently, though, they’ve all but stopped. I only received 5 spam messages yesterday.
It’s not a filter issue–spam only rarely makes it through the filter gauntlet into my inbox. For some reason, spammers aren’t sending me as much spam as usual. My usual spam load isn’t very diverse; it’s possible that only one or two spammers make up the bulk of my spam. Maybe they’re on vacation. Maybe they removed me from their lists for some reason (ha, right). Maybe the CAN-SPAM law scared them straight (ha, right). Maybe whoever’s been paying them for spam stopped (ha).
Or maybe I’ll get 50 tomorrow, just to even out the average.
Posted in Spam | Tags spam | no comments
Posted by Scott Laird
Wed, 31 Dec 2003 02:38:46 GMT
Interesting: a proposal (from Microsoft) to implement a “sender pays” scheme for email by a cycle tax, instead of an unworkable micropayment scheme:
De-spamming with a cycle tax - A summary and an extension idea
[Via Ole Eichhorn, via Dave Winer] Microsoft is noising around an anti-spam technique that would essentially create a cycle tax on each piece of e-mail. This is done by forcing a client computer that wants to submit e-mail to a server to solve a cryptographic problem of known difficulty set by the server, presumably by adding a challenge/response step in the mail protocol. To the normal sender of mail, the few second delay is no problem. For a spammer, it bogs down even networks of hijacked machines and reduces the flow of garbage into the network. There remains an interesting problem of the inter-server protocols, since replicating the same technique per message would become an egregious burden, but something must be done since hijacked relays are part of the problem. But there are a variety of options there: batching messages, trust networks among servers, throttled tiers of forwarding service based on the size of cycle tax provably paid by the originator.
This is one of the anti-spam options explored by a Microsoft Research project called Penny Black, named after the original postage stamp. It has the merit of creating a real cost, without requiring all the apparatus and problematic economics of a microtransaction infrastructure. Like Dave, I’ll wait and see if it’s a ploy by Microsoft to sink its proprietary hooks into the mail networks before I cheer too loud, but this does have potential.
[Due Diligence]
I haven’t read all of the original article, but the basic scheme is probably workable. You just add an extra step in SMTP, much as authentication is handled today. Just as SMTP authentication is usually required before a server will relay messages for non-local IP addresses, you could require this before servers will accept messages from unknown servers. The mail server could base the complexity of the problem on a pile of things, like spam IP blacklists, previous traffic patterns, and so forth. So, if the server has some reason to believe that it’s going to receive good mail from the peer, then it doesn’t need to pose a problem at all, while when dealing with random dialup IPs, it can request a fairly complex problem.
Now, this suffers from the usual implementation problems that all SMTP replacements (or mandatory enhancements, they’re basically the same thing) suffer from–basically, it’s worthless until most of the market is already using it, and that means that no one will probably ever deploy it. The basic idea seems sound, though. It’d probably make an interesting addition to new protocols, even if it’s logistically hard to add to existing protocols.
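The puzzle step described above can be sketched in a few lines of Python. This uses a hashcash-style hash puzzle as a stand-in, since the quoted summary doesn't say which function the Microsoft proposal uses. The challenge string, the difficulty values, and the `difficulty_for` policy are all hypothetical.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty: int) -> int:
    """Sender side: search for a nonce whose hash has `difficulty` leading zero bits."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the sender's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def difficulty_for(peer_is_trusted: bool, on_blacklist: bool) -> int:
    """Hypothetical policy: known-good peers skip the puzzle, suspicious ones pay more."""
    if peer_is_trusted:
        return 0
    return 20 if on_blacklist else 12
```

The asymmetry is the point: each extra bit of difficulty roughly doubles the sender's expected work, while the server's check stays at one hash per message.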
My personal experience shows that Bayesian filters work really well for personal email. Paul Graham suggests that the recent anti-spam legislation plays right into our hands; while it doesn’t do much to stop hard-core fraudulent spammers, it’s going to destroy middle-of-the-road spammers, because decent filters will be 100% effective stopping their spam. Which means that we’re winning, really. Some of his other ideas (“Filters that Fight Back,” particularly) sound like they should raise the effective cost of spamming, and that’s really everyone’s goal: make spam uneconomical without destroying “regular” email. So, I’m not really sure that this cycle-tax plan really matters in the long run. But it’s an interesting idea.
Posted in Spam | 2 comments
Posted by Scott Laird
Tue, 02 Dec 2003 20:46:54 GMT
I’ve been reading a few things that suggest that the PSTN (Public Switched Telephone Network–the traditional phone system) is dying, soon to be devastated by VOIP. Since companies like Vonage are starting to switch consumers and small businesses, and larger companies have been moving internal phone service to VOIP for a while, within a few years most profitable customers will have left traditional telcos for nice, cheap facility-less VOIP. That’ll leave expensive customers (rural consumers, for example) as the primary users of the old phone system, and that’ll destroy the business model of all of the telcos.
We’ll see. Maybe it’ll go that way, maybe it won’t, but I wouldn’t buy telco stock right now :-).
The story goes like this: once we have a good, semi-open way to map traditional phone numbers onto VoIP providers, we’ll start seeing pure-VoIP calls between (say) Vonage and 8x8 customers. Companies can jump in and do direct VoIP calls to other companies and consumers using the same database, either the ENUM thing that never seems to go anywhere, something DNS driven, or something new. Doesn’t really matter which way it happens, because one of them is going to happen very soon, probably within 6 months, and it’s probably going to be at least partially driven by the Vonages of the world in an attempt to cut their costs when talking to customers of the other VoIP providers.
Once we have open-ish IP telephony, unless regulation rears its ugly head, phone service will end up looking a lot like email. Consumers and very small businesses will pay a provider, and larger companies (and geeks) will handle it themselves, directly. In either case, you’ll end up paying someone to connect your calls to (and from) the legacy PSTN, so Aunt Mildred in Kansas can call you, but 95% of the traffic will be SIP end-to-end. Once this is in place, we can start exploring what you can really do with SIP above and beyond traditional telephone service (your phone number follows your laptop on vacation, for instance).
But, here’s the problem: SIP spam. What’s going to keep the spamming scum of the earth (and I’m being charitable here) from blasting your phone with automated crap 24x7? Regulation probably won’t do it–they’ll just connect directly (via IP) from outside of the US, just like some SMTP spammers do today. The economics are basically the same as they are for SMTP spam: it’s astoundingly cheap to send, and you only need a few returns per million to break even. It’s not completely clear that today’s spam filtering techniques really apply to SIP spam, besides blacklisting and whitelisting.
I’ve been enjoying the near-complete lack of phone spam at home since the FTC do-not-call list took effect. I wonder how long the quiet is going to last, though.
Posted in Business, Spam | no comments
Posted by Scott Laird
Tue, 02 Dec 2003 01:36:39 GMT
I’ve now been using SpamProbe for over a month to filter my spam at home, and it’s been working perfectly. It’s blocking an average of 210 messages per day, although it’s climbing; November 18-28 only had 2 days below 210, and averaged closer to 250. The last weekend was quite a bit lower, mostly due to server problems, but I’m currently at 433 spam messages today and counting.
Of those 210 messages each day, I’m probably seeing 1-2 false negatives. I haven’t seen any real false positives, although I’ve pulled a couple bounce messages out of my spam folder, just to keep from poisoning the spam database.
Interestingly enough, spamprobe is also eating all of the current Windows email virus messages. So, not only is it a spam filter, it does viruses too :-).
All in all, I’m a very happy camper. My personal spam count has been rolled back to 1998 levels. In essence, spam is now a complete non-issue for me.
Posted in Spam | no comments
Posted by Scott Laird
Mon, 03 Nov 2003 23:29:09 GMT
VentureBlog has an interesting bit on spam, claiming that spam is going to give Microsoft control over the entire email server market. The logic is kind of interesting; basically it boils down to using your Exchange server license as a bond against sending spam. If you spam, they yank your license, so owning a valid Exchange server license is an automatic key to spam whitelisting:
However, corporations are already shelling out big bucks for email - specifically for Microsoft Exchange or IBM/Lotus which between them have 75% of the corporate market.
Microsoft could just provide a stamp on each outgoing message (think public key cryptography) identifying that it came from a specific exchange server. This would be verified with Microsoft, which would provide a whitelist of valid exchange servers to every anti-spam company.
[VentureBlog]
Three problems with this:
- Bayesian filtering seems to work really well. My home email filter is over 99% effective right now, blocking roughly 200 messages per day with no false positives.
- Spammers are already using viruses to generate open relays. How long will it take before office computers are attacked deliberately to use their whitelisted Exchange server for spamming?
- The liability issues of point 2 will effectively keep Microsoft from blacklisting large customers, even when bushels of spam are pouring out of their servers.
So, in short, I think it’s a neat idea, and I wouldn’t be surprised if Microsoft tries it, but it isn’t going to help. In fact, it’ll probably just make corporate PCs even more attractive to spammers.
Posted in Spam | 1 comment
Posted by Scott Laird
Sat, 01 Nov 2003 21:00:54 GMT
My spam blocker is working better than expected; over the past two days, it’s been over 99% accurate, with 1 or 2 false negatives and no false positives. I’ve been receiving around 200 spams/day, and Apple’s Mail.app was only catching 80% of them, with a handful of false positives. I’m pretty happy with the new system.
Posted in Spam | no comments
Posted by Scott Laird
Thu, 30 Oct 2003 03:15:15 GMT
I just added a mailto: link to each post on scottstuff.net. Since I’ve added a new spam filter, I’m not quite as worried about putting my email address up on a website. But, in a fit of spam prevention, I’m using a JavaScript mailto-rewriter. If your browser doesn’t support JavaScript, then you’ll get a nasty-but-still-understandable mailto: URL. If JavaScript works, then you should get a perfectly decent working URL without even knowing it.
Posted in Spam, Blog stuff | no comments
Posted by Scott Laird
Wed, 29 Oct 2003 22:32:26 GMT
I finally turned on server-side spam filtering at home this weekend. I’ve fought doing it for a couple years; first due to lack of decent spam filtering software, and next because Jaguar’s Mail.app did decent spam filtering on its own. Why bother with server-side filtering when client-side filtering works well and is easier to train?
That argument held up for a while, but the latest rounds of pharmaceutical spam have been pretty good at getting past Mail.app’s filters, and moving to Panther hasn’t done anything to help. So, over the weekend, I added a Bayesian filter (Spamprobe) into my Courier-based mail server. As usual, it was more difficult than I had hoped it would be, but not as bad as I’d feared. The first step was turning on Courier’s maildrop filter. This was easy, just a simple edit of /etc/courier/courierd to add
DEFAULTDELIVERY="| /usr/bin/maildrop"
I could have used procmail, but I’ve grown increasingly irritated by its obtuse syntax over the years. I refuse to use languages that consist mostly of punctuation. Once maildrop was running, I added this to my .mailfilter file:
# save a copy of every message to the "saved" maildir, better safe than sorry
cc "$HOME/Maildir/.spam.saved"
# score the mail and tag it with an X-SpamProbe header
SCORE=`spamprobe -8 receive`
xfilter "reformail -I \"X-SpamProbe: $SCORE\""
echo "Score: $SCORE"
# if it's spam, deliver it to the spam maildir and stop
if (/^X-SpamProbe: SPAM/)
{
    to "$HOME/Maildir/.spam.spam"
}
This is mostly copied from the README.maildrop that came with the Debian version of spamprobe, but I had to tweak it a bit before it’d drop mail into the right maildir. I then had to create $HOME/.spamprobe and spam/saved, spam/spam, and spam/ham mail folders.
Once this is complete, all mail that comes through my system will be copied to spam/saved, and then scored by spamprobe. Spam will be delivered to spam/spam, while non-spam mail is delivered normally.
The next step is to fill spam/spam and spam/ham (“ham is not spam”) with a bunch of samples of spam and non-spam mail. Fortunately, I had 500 or so of each just sitting around. I copied them into place, and then ran a script like this:
#!/bin/sh
# retrain spamprobe from the hand-sorted ham and spam folders
IMAPDIR="$HOME/Maildir"
spamprobe good "$IMAPDIR"/.spam.ham/*/*
spamprobe spam "$IMAPDIR"/.spam.spam/*/*
This tells spamprobe to analyze the contents of my spam/spam and spam/ham folders to discover which keywords signify spam and which signify ham. I then added a cron job to re-run this script hourly.
To train the spam filter, all I need to do is drag messages around in Mail.app. If a spam message appears in my inbox, then I drag it to the spam/spam folder. From time to time, I check the spam/spam folder to look for false positives, and then drag them to spam/ham. The next time the cron job runs, my filter will adjust itself and do a better job categorizing spam.
So far, it’s working well. Most of the spam that I receive is addressed to one specific account that is forwarded from a previous employer; until last night, I was just dumping all of the mail from this account into a folder automatically, and then checking it a couple times per week to remove the ~150 spams/day that it receives. Last night, I stopped filtering it into its own box, and let spamprobe handle it. And, so far, it’s doing a good job. I’ve only seen 3 or 4 false negatives, and those were from early in the training process. Annoyingly, I’ve had 6 false positives that I had to pluck out of the spam folder; one was a MAILTO web form that went to my old college user group mailing list; it was categorized as spam, and that primed the pump so that several followups to the same list also went into the spam box. Once I moved them to the ham folder, mail for that list started making its way into my inbox correctly. Spamprobe also ate an opt-in ad from REI and a notice from a vendor that I wanted to see, but moving both of these to spam/ham seems to have fixed the problem.
According to a grep of my spam/spam folder, I received 217 spam messages yesterday and 101 so far today. Good riddance.
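For a per-day tally without grep, something like the following Python sketch should work. It assumes the spam folder is a standard maildir and counts by delivery time rather than by the Date header, which spammers often forge. The function names are made up for illustration.

```python
import mailbox
from collections import Counter
from datetime import date


def spam_per_day(messages):
    """Count messages per local calendar day, using each message's delivery time."""
    return Counter(date.fromtimestamp(msg.get_date()) for msg in messages)


def count_folder(path):
    """Open an existing maildir (for example the spam/spam folder) and tally it."""
    folder = mailbox.Maildir(path, create=False)
    return spam_per_day(folder)
```

Calling `count_folder` with the expanded path to `~/Maildir/.spam.spam` (the location used in the maildrop filter above) returns a `Counter` keyed by date, which can be sorted and printed.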
Posted in Spam | 1 comment