Over the past few months, I’ve dealt with something of a blog spam nightmare on Copyrighteous: my blog running the PyBlosxom weblog software.
Wiser and less stubborn individuals might have given up on either PyBlosxom or the ability to receive public comments. However, I find PyBlosxom unique in its flexibility and great ReStructured Text support and am always frustrated with others’ blogs that don’t accept comments. At the end of the day, I couldn’t bring myself to part with either.
Historically, my blogspam protection has been to use a simple weak CAPTCHA and to have my blog software email me each time a comment is successfully submitted so that I can (with a built-in macro in mutt) delete each spam comment that slips in. This has worked well for the last couple years.
This summer, Mika pointed out that my blog was full of Chinese link spam that I had not noticed or been notified about. Around the same time, I realized that my website had been dealt a massive spam penalty by Google and was basically not showing up in any search results.
I have spent a significant amount of time over the last month repairing the damage and working to prevent it from recurring. I’m documenting this process here in the hopes that it might save others time and energy.
Upon reflection, the situation could have been prevented in three relatively easy ways, all of which I have now implemented.
- Had the PyBlosxom comment.py plugin’s mail function been working properly, I would have known that I was being spammed.
- Had PyBlosxom been configured to only make blog entries (and not comments) indexable by search engines, Google and others never would have seen the link spam.
- Had I installed a stronger CAPTCHA, I might have blocked the spam from being submitted in the first place (although at the expense of the participation in comments by visually impaired users).
A month or so, hours of work, and a Google reinclusion request later, my website is beginning to show up in search results again. Hopefully this message will help save others from a similar fate.
Fixing PyBlosxom’s Comment Notification
The most critical problem was a bug in PyBlosxom’s contributed comment.py plugin and its comment notification system. In short, the email-based comment notification system failed silently if the body of the email, which included the full text of the comment, contained any non-ASCII UTF-8 encoded text.
I’ve filed a bug against PyBlosxom and included a patch that fixes this issue. However, since this is a rather critical problem and because PyBlosxom releases tend to be few and far between, it might be worth patching your system now. My patch is against version 1.3 but can easily be modified and applied to version 1.2.
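As a rough illustration of the kind of fix involved, here is a minimal sketch of building a notification email that tolerates non-ASCII comment text. This is not the patch itself and not comment.py's code: the function name, addresses, and sample comment are all hypothetical, and it uses only Python's standard email package.

```python
from email.header import Header
from email.mime.text import MIMEText


def build_notification(sender, recipient, subject, body):
    """Build a notification message with an explicit UTF-8 charset.

    Hypothetical helper for illustration; not part of comment.py.
    Declaring the charset makes the email package encode the body
    so the serialized message contains only ASCII characters.
    """
    msg = MIMEText(body, "plain", "utf-8")
    msg["From"] = sender
    msg["To"] = recipient
    # Encode the subject too, in case a comment title is non-ASCII.
    msg["Subject"] = Header(subject, "utf-8")
    return msg


# Example comment body containing non-ASCII text (made-up data).
comment_body = "Nice post! 谢谢 http://spam.example/"
message = build_notification(
    "blog@example.org",
    "owner@example.org",
    "New comment on Copyrighteous",
    comment_body,
)

# The serialized form is safe to hand to an SMTP client.
wire_text = message.as_string()
```

The resulting message object can then be passed to whatever mail transport the blog already uses. The point of the sketch is only that the charset is declared up front instead of assuming ASCII.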
Hiding Comments from Search Engines
The major reason that the successful spam became a problem was that it triggered Google’s abuse detector, resulted in a spam penalty, and made all of the (non-spam) material on my website more difficult for others to find. A simple way to prevent this is to hide all comments from the search engines.
I’ve done this by creating a new PyBlosxom flavor that shows comments (and allows them to be submitted) and is not indexed by search engines, and by removing comments altogether from the default indexable flavors.
To do this, I removed all of the comment-* templates from the html flavor and created a new flavor called comment.flav that includes the comment templates. I also had to make the comment submit action point to the new flavor and change the "Comments: N" link to point to the .comment flavor rather than .html. The rest of the templates are simply symbolic links to the HTML templates.
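The symlinking step can be scripted. The following is a minimal sketch under assumptions that may not match your install: that the html templates live in one directory with names like head.html, that the new flavor expects the same names with a .comment extension, and that the comment-* templates are maintained by hand in the new flavor. The function name and directory layout are hypothetical.

```python
import os


def link_shared_templates(html_dir, comment_dir, new_ext=".comment"):
    """Symlink every non-comment template from html_dir into comment_dir.

    Hypothetical helper: the directory layout and file naming are
    assumptions, not something PyBlosxom prescribes here.
    Returns the list of link paths that were created.
    """
    os.makedirs(comment_dir, exist_ok=True)
    created = []
    for name in sorted(os.listdir(html_dir)):
        # comment-* templates are real files in the new flavor, so skip them.
        if name.startswith("comment-"):
            continue
        stem, _ = os.path.splitext(name)
        link_path = os.path.join(comment_dir, stem + new_ext)
        # Leave any existing file or link untouched.
        if os.path.lexists(link_path):
            continue
        # Use a relative target so the flavor tree can be moved as a unit.
        target = os.path.relpath(os.path.join(html_dir, name), comment_dir)
        os.symlink(target, link_path)
        created.append(link_path)
    return created


# Example call (paths are placeholders):
# link_shared_templates("flavours/html.flav", "flavours/comment.flav")
```

Relative link targets are a design choice here: they keep the links valid if the whole flavor directory is copied to another machine.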
The next step is to ensure that the comment flavor is not indexed by search engines. I found two ways of doing this and did both. The first was to add a "no index" meta tag to the header of each .comment page. It looked like this:
<meta name="robots" content="noindex, nofollow">
This is necessary because the robots.txt standard, the normal way to tell search engines not to index a page, does not support wildcards.
Luckily, Google (and others, I imagine) supports an extension to robots.txt that allows wildcards. To take advantage of this, I created a robots.txt for mako.cc that blocks indexing of the entire comment flavor. The following robots.txt did the trick for me:
User-agent: Googlebot
Disallow: /copyrighteous/*.comment$
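To check which URLs a wildcard rule like the one above would cover, here is a minimal sketch of the matching semantics as I understand Google's extension: * matches any run of characters and a trailing $ anchors the end of the path. It is an approximation for testing your own rules, not Googlebot's actual implementation, and both function names are hypothetical.

```python
import re


def rule_to_regex(pattern):
    """Translate a wildcard Disallow pattern into a compiled regex.

    Returns (regex, anchored), where anchored is True if the pattern
    ended with "$" and must therefore match the whole path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body), anchored


def is_blocked(path, disallow_pattern):
    """Return True if path would be covered by the Disallow pattern."""
    regex, anchored = rule_to_regex(disallow_pattern)
    if anchored:
        return regex.fullmatch(path) is not None
    # Without "$", a rule is a prefix match from the start of the path.
    return regex.match(path) is not None


rule = "/copyrighteous/*.comment$"
blocked = is_blocked("/copyrighteous/2006/09/some-post.comment", rule)
allowed = not is_blocked("/copyrighteous/2006/09/some-post.html", rule)
```

Running your real entry URLs through a check like this is a cheap way to confirm that the .html pages stay indexable while every .comment page is excluded.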
An Improved CAPTCHA
Ultimately, the best solution would be to keep the spam from showing up on the blog at all.
The only decent PyBlosxom CAPTCHA is the "nospam" plugin by Steven Armstrong. It is a simple image-based CAPTCHA and I was running it when I was spammed. It uses PIL but generates purely number-based strings and does only minimal obfuscation. Basically, spambots were able to break the CAPTCHA and, toward the end, I was receiving thousands of pieces of blog spam a day.
I’ve incorporated the PIL image generation code from MediaWiki’s ConfirmEdit/FancyCaptcha extension into nospam.py with this patch, which I have also sent to Steven Armstrong, nospam.py’s original author. It’s much stronger.
Apologies, of course, go to all of my vision-impaired users. Image-based CAPTCHAs really are evil. In this situation, though, with many thousands of attempts a day, the alternative is that I turn off comments altogether: the standard (and poor) lesser-of-two-evils argument.
Ultimately, I will write a python implementation of a new strong text-based CAPTCHA I’ve invented that uses commonsense knowledge and pulls off some cool data acquisition in the process. I presented this project at the Wikimania Hacking Days and at a Media Lab open house for AAAI 2006 where I got universally positive and useful feedback. CAPTCHA inventor and recent genius-grantee Luis von Ahn seemed to like the idea too. I’ll write more about this on another day though.
In newer comment.py versions you can simply set the comment_nofollow configuration variable and all links will be written with the “nofollow” attribute as defined by Google:
py['comment_nofollow'] = 1
Man, your Captcha is way too strong, this is my third try!
I’ve increased the font size. It makes it a lot easier. :)
I had nofollow on. The problem was that my site was being ranked highly for certain words because they were uncommon and on my site, and this was seen as trying to attract traffic (which it was) unfairly on the goodwill of my legitimate content.
How about using Akismet (http://akismet.com/)? A Python implementation can be found at http://kemayo.wordpress.com/2005/12/02/akismet-py/
So far, it has captured tons and tons of spam for me.
Thanks for the hint Wari. I was talking with someone about Akismet last night. If it works reasonably well (and the person I was talking with had mixed reviews) then I’d be thrilled to install it instead of a CAPTCHA.
Just ensure that you don’t delete comments outright, have spam/ham folders or something. The more spam you report to akismet, the better the service gets. Only downside is that you need a wordpress account to use it, if you call that a downer.
PS: Toughest CAPTCHA eva! Too many retries, even for a human (me) – 4th try now
BTW, using Akismet on my less than one month old blog. 2/108 spam caught. That’s just over 1% of error. Eases my mind a lot :)
Yes, the CAPTCHA is too hard. I’m working on it. 1% is still a lot. I received 10K successful spam while using a weaker CAPTCHA last week alone.
I got an Akismet account this morning and will try it out today.
Do you have a patch to PyBlosxom to do the spam/ham stuff? Your full Akismet stuff as a patch would be great as well.
Those that Akismet didn’t catch actually tried to make the spam look like a very normal conversation (except for the URL). My blog is relatively new, so catching a ton of it makes me happy.
I do not have any implementation of Akismet in pyblosxom anywhere. Where I use pyblosxom is not exposed to the web, and heck, I use pyblosxom 1.1 and I’m quite happy with that.
I have not touched pyblosxom development for the longest time, and the current code base is quite different from what I’m familiar with.
My usage of Akismet is with WordPress; that’s the extent of my experience with it.
Thanks for all the hints Wari. I’ll write up an Akismet implementation for PyBlosxom in the next day or so and then post something about it on my blog here.
What, you have a really cool captcha but you don’t have room in this margin to show it? :)
Heh. magicword.py works just dandy for me. Simple and low-tech. Not sure that I get spammed that much, but so far, zero. knock on silicon
-todd
check this post, it’s got comment spam on it, right above my comment. TC
I deleted it. Thanks Simon!