Wiser and less stubborn individuals might have given up on either PyBlosxom or the ability to receive public comments. However, I find PyBlosxom unique in its flexibility and great ReStructured Text support and am always frustrated with others’ blogs that don’t accept comments. At the end of the day, I couldn’t bring myself to part with either.
Historically, my blogspam protection has been to use a simple weak CAPTCHA and to have my blog software email me each time a comment is successfully submitted so that I can (with a built-in macro in mutt) delete each spam comment that slips in. This has worked well for the last couple years.
This summer, Mika pointed out that my blog was full of Chinese link spam that I had not noticed or been notified about. Around the same time, I realized that my website had been dealt a massive spam penalty by Google and was basically not showing up in any search results.
I have spent a significant amount of time over the last month repairing the damage and working to prevent it from recurring. I’m documenting this process here in the hopes that it might save others time and energy.
Upon reflection, the situation could have been prevented in three relatively easy ways, all of which I have now implemented.
- Had the PyBlosxom comment.py plugin’s mail function been working properly, I would have known that I was being spammed.
- Had PyBlosxom been configured to only make blog entries (and not comments) indexable by search engines, Google and others never would have seen the link spam.
- Had I installed a stronger CAPTCHA, I might have blocked the spam from being submitted in the first place (although at the expense of the participation in comments by visually impaired users).
A month or so, hours of work, and a Google reinclusion request later, my website is beginning to show up in search results again. Hopefully this message will help save others from a similar fate.
Fixing PyBlosxom’s Comment Notification
The most critical problem was a bug in PyBlosxom’s contributed comment.py plugin and its comment notification system. In short, the email-based comment notification system failed silently if the body of the email (which included the full text of the comment) contained any non-ASCII UTF-8 encoded text.
I’ve filed a bug against PyBlosxom and included a patch that fixes this issue. However, since this is a rather critical problem and because PyBlosxom releases tend to be few and far between, it might be worth patching your system now. My patch is against version 1.3 but can easily be modified and applied to version 1.2.
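To illustrate the kind of failure involved, here is a minimal Python sketch of building a notification email that tolerates non-ASCII comment text. It is not the patch mentioned above and it is not PyBlosxom's API: the function build_notification, its parameters, and the example addresses are all hypothetical. It only shows the general technique of declaring a UTF-8 charset with the standard library's email package.

```python
from email.header import Header
from email.mime.text import MIMEText


def build_notification(sender, recipient, subject, comment_body):
    """Build a comment-notification email that tolerates non-ASCII text.

    Hypothetical helper for illustration; not part of PyBlosxom.
    """
    # Declaring the charset makes MIMEText transfer-encode the body
    # instead of assuming that it is plain ASCII.
    msg = MIMEText(comment_body, "plain", "utf-8")
    msg["From"] = sender
    msg["To"] = recipient
    # Header() RFC 2047-encodes the subject, so non-ASCII text is safe here too.
    msg["Subject"] = Header(subject, "utf-8")
    return msg


# Example addresses and text are made up.
comment_text = "Comment containing non-ASCII text: 你好"
msg = build_notification(
    "blog@example.org",
    "author@example.org",
    "New comment on your blog",
    comment_text,
)

# The serialized message contains only ASCII bytes and can be handed
# to smtplib for delivery.
wire = msg.as_bytes()
```

Whatever the mail code looks like, it also helps to log or surface any exception raised while sending, so that a failure of this kind is not silent.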
Hiding Comments from Search Engines
The major reason that the successful spam became a problem was that it triggered Google’s abuse detector, resulted in a spam penalty, and made all of the (non-spam) material on my website more difficult for others to find. A simple way to prevent this is to hide all comments from the search engines.
I’ve done this by creating a new PyBlosxom flavor that shows comments (and allows them to be input) but is not indexed by search engines, and by removing comments altogether from the default indexable flavors.
To do this, I removed all of the comment-* templates from the html flavor and created a new flavor called comment.flav that included the comment templates. I also had to make the comment submit action point to the new flavor and change the "Comments: N" link to point to the .comment flavor rather than the .html one. The rest of the templates are simply symbolic links to the HTML templates.
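The symbolic-link step can be sketched in Python. This is only an illustration under stated assumptions: the directory name html.flav, the template names head.html, story.html, and foot.html, and the helper link_shared_templates are hypothetical and not taken from my actual setup. The comment-* templates themselves would be moved into comment.flav by hand.

```python
import os
import tempfile


def link_shared_templates(html_dir, comment_dir):
    """Symlink every non-comment template from html_dir into comment_dir.

    A template named head.html gets a link named head.comment.
    Returns the sorted list of link names handled.
    """
    os.makedirs(comment_dir, exist_ok=True)
    handled = []
    for name in sorted(os.listdir(html_dir)):
        base, ext = os.path.splitext(name)
        # comment-* templates belong only to the comment flavor, so skip them.
        if ext != ".html" or base.startswith("comment-"):
            continue
        link = os.path.join(comment_dir, base + ".comment")
        if not os.path.lexists(link):
            # A relative target keeps the links valid if the tree is moved.
            target = os.path.relpath(os.path.join(html_dir, name), comment_dir)
            os.symlink(target, link)
        handled.append(base + ".comment")
    return handled


# Demonstration in a throwaway directory with made-up template files.
with tempfile.TemporaryDirectory() as root:
    html_dir = os.path.join(root, "html.flav")
    comment_dir = os.path.join(root, "comment.flav")
    os.makedirs(html_dir)
    for name in ("head.html", "story.html", "foot.html"):
        with open(os.path.join(html_dir, name), "w") as fh:
            fh.write("<!-- %s -->\n" % name)
    links = link_shared_templates(html_dir, comment_dir)
    # Reading through the link returns the HTML template's content.
    with open(os.path.join(comment_dir, "head.comment")) as fh:
        head_text = fh.read()
```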
The next step is to ensure that the comment flavor is not indexed by search engines. I found two ways of doing this and did both. The first was to add a "no index" meta tag to the header of each .comment page. It looked like this:
<meta name="robots" content="noindex, nofollow">
This is necessary because the robots.txt standard, the normal way to tell search engines not to index a page, does not support wildcards.
Luckily, Google (and others, I imagine) do support an extension to robots.txt that allows you to use wildcards. To take advantage of this, I created a robots.txt for mako.cc that blocks indexing of every page in the comment flavor. The following robots.txt did the trick for me:
User-agent: Googlebot
Disallow: /copyrighteous/*.comment$
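To make the wildcard rule concrete, here is a small Python sketch of how a pattern like the one above can be matched against URL paths. It assumes the commonly documented semantics of the extension, where "*" matches any run of characters and a trailing "$" anchors the end of the path. The helper names and the example paths are hypothetical, and this is not how any particular crawler is implemented.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a wildcard robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end
    of the path. Without '$' the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


rule = robots_pattern_to_regex("/copyrighteous/*.comment$")


def is_blocked(path):
    """Return True if the Disallow rule above covers this URL path."""
    return rule.match(path) is not None


# Made-up example paths: only the .comment flavor should be blocked.
comment_page_blocked = is_blocked("/copyrighteous/2006/10/entry.comment")
html_page_blocked = is_blocked("/copyrighteous/2006/10/entry.html")
```

Because of the trailing "$", a path that merely contains ".comment" somewhere in the middle is not matched, so the ordinary .html pages stay indexable.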
An Improved CAPTCHA
Ultimately, the best solution would be to keep the spam from showing up on the blog at all.
The only decent PyBlosxom CAPTCHA is the "nospam" plugin by Steven Armstrong. It is a simple image-based CAPTCHA and I was running it when I was spammed. It uses PIL but generates purely number-based strings and does only minimal obfuscation. Basically, spambots were able to break the CAPTCHA and, toward the end, I was receiving thousands of pieces of blog spam a day.
I’ve incorporated the PIL image generation code from MediaWiki’s ConfirmEdit/FancyCaptcha extension into nospam.py with this patch, which I have also sent to Steven Armstrong, nospam.py’s original author. It’s much stronger.
Apologies, of course, go to all of my vision-impaired users. Image-based CAPTCHAs really are evil. In this situation, though, with many thousands of attempts a day, the alternative is that I turn off comments altogether: the standard (and poor) lesser-of-two-evils argument.
Ultimately, I will write a Python implementation of a new strong text-based CAPTCHA I’ve invented that uses commonsense knowledge and pulls off some cool data acquisition in the process. I presented this project at the Wikimania Hacking Days and at a Media Lab open house for AAAI 2006, where I got universally positive and useful feedback. CAPTCHA inventor and recent genius-grantee Luis von Ahn seemed to like the idea too. I’ll write more about this another day, though.