This seems to be the day for people griping about spam in blog comments. I haven't been hit yet, but I've seen a lot of attempts to access cgi-bin/mt/mt-comments.cgi. I guess I'm glad that I didn't install MT in cgi-bin. In general, most of these attempts aren't spam in the traditional sense; they're an attempt to influence Google's ranking of whatever site the spam links to. This has a few implications. First, things like Bayesian spam filtering won't help much; the comment spam is supposed to look just like a regular post. Second, unlike email spammers, blog spammers don't even care whether people read their comments, so they don't have to make them stand out from the crowd. As long as Google sees the comment, they're happy.

I've seen a lot of people claim that blog spam will be easy to kill. In their favor, spammers tend to link to a smallish number of web sites and to post from a small number of hosts. Open HTTP proxies should be less common than open mail relays (although I've seen quite a few people fishing for open proxies recently), which makes it harder to get someone else to relay your spam for you. On the other hand, the same pool of compromised hosts that currently sends DDoS attacks and email spam can be repurposed for blog spam without any trouble. Adding authentication, email call-backs, or CAPTCHAs would inconvenience human users, but it would come close to completely blocking automated blog comment spam.
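Roughly, the CAPTCHA-ish idea looks like this. This is just a Python sketch of the general technique, not anything MT actually does; the function names and the secret are made up for illustration:

```python
# Sketch: before accepting a comment POST, require an answer to a challenge
# that a human can solve trivially but a bulk comment-posting script won't
# bother with. make_challenge/check_answer are hypothetical names, not part
# of any real blog package.
import hashlib
import hmac
import random
import time

SECRET = b"change-me"  # assumed server-side config value

def make_challenge():
    """Return (question, token) to embed in the comment form."""
    a, b = random.randint(1, 9), random.randint(1, 9)
    answer = str(a + b)
    expires = str(int(time.time()) + 3600)
    sig = hmac.new(SECRET, (answer + expires).encode(), hashlib.sha1).hexdigest()
    return f"What is {a} + {b}?", f"{expires}:{sig}"

def check_answer(answer, token):
    """True if the submitted answer matches the signed token and isn't stale."""
    expires, sig = token.split(":")
    if int(expires) < time.time():
        return False
    expected = hmac.new(SECRET, (answer.strip() + expires).encode(),
                        hashlib.sha1).hexdigest()
    return hmac.compare_digest(sig, expected)
```

The point is that the cost per comment stays near zero for a human but becomes a real nuisance for a script trying to post thousands of comments an hour.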

Which brings up the other, harder problem: blog trackback spam. I haven't seen any of it yet, but it'll show up eventually, and it's going to be a bitch to stop. Trackback is the mechanism blog software uses to automatically notify other sites of links between posts. This is part of the real power of blogs, and a lot of interesting things will come of it in the future. Unfortunately, since it's an automated, machine-to-machine mechanism, it's not really amenable to CAPTCHA tests. There are a few suggestions floating around that may help, but most of them rely on external services like Technorati.
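For concreteness, here's roughly what a trackback ping looks like on the wire, as I understand the Movable Type TrackBack spec; the Python below is only a sketch, and the function name is mine:

```python
# A rough sketch of a TrackBack ping: a plain form-encoded HTTP POST with no
# authentication at all, which is exactly why it's so easy to spam.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def send_trackback_ping(trackback_url, post_url, title, excerpt, blog_name):
    """POST a TrackBack ping; returns the server's XML response as text."""
    data = urlencode({
        "url": post_url,        # the page that (supposedly) links to the target
        "title": title,
        "excerpt": excerpt,
        "blog_name": blog_name,
    }).encode("utf-8")
    req = Request(trackback_url, data=data,
                  headers={"Content-Type": "application/x-www-form-urlencoded"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", "replace")

# Anyone can claim any "url" here; the receiving blog can't tell a legitimate
# ping from a spammer's without fetching the page and checking that it
# actually links back.
```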

So, in short, it seems like we can let the spamming scum of the earth run over us and screw up yet another useful, innovative communication system, we can hand control of part of our infrastructure over to (probably friendly) companies and let them manage it, or we can build walls around what we have and destroy most of its utility. I just love spammers.

Update: Hmm. Does Google do full HTML parsing, or does it just extract URLs from documents? If PageRank only counts actual <a href="..."> links, then a combined approach might work: only turn a commenter's URLs into HTML links when the target site passes some sort of verification, whether manual, via Technorati, or via a heuristic, and render everything else as plain text. That'll stop PageRank-driven spam. Which just means that we'll get to see what spammers come up with next. Sigh.
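Something like the following could handle the rendering side of that. It's just a sketch; the verified-host set and the is_verified() check are placeholders for whatever verification actually ends up being (manual whitelist, Technorati lookup, heuristic):

```python
# Sketch: when rendering a comment, only emit a real <a href> for links whose
# target host is on a verified list; everything else is shown as plain text,
# so it still works for readers but carries no PageRank.
import html
import re
from urllib.parse import urlparse

VERIFIED_HOSTS = {"example.org", "some-blog-i-trust.net"}  # stand-in data

URL_RE = re.compile(r"https?://[^\s<>\"]+")

def is_verified(url):
    host = urlparse(url).hostname or ""
    return host in VERIFIED_HOSTS

def render_comment(text):
    """Escape comment text, linking only URLs that pass verification."""
    def render_url(url):
        if is_verified(url):
            return f'<a href="{html.escape(url, quote=True)}">{html.escape(url)}</a>'
        return html.escape(url)  # visible and copy-pastable, but not a link
    # Escape the non-URL parts, then splice the rendered URLs back in.
    parts, last = [], 0
    for m in URL_RE.finditer(text):
        parts.append(html.escape(text[last:m.start()]))
        parts.append(render_url(m.group(0)))
        last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

With the stand-in list above, render_comment("see http://example.org/ and http://evil.example.com/") would link the first URL and leave the second as inert text.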