Deleting 10,000 Spam Comments

I haven’t always done the best job of managing comments on this blog. In fact, for the longest time, I did nothing about the piles of comments other than letting them sit in the moderation queue. As part of cleaning house I decided to review and moderate all of the old comments. Turns out, I had only missed a couple of legitimate ones, mostly because I used to check for new comments for a while after posting each entry. I’m planning on moving to a hosted solution for comments (most likely Disqus) in the next iteration so that I can allow comments without moderation, which I feel stunts the conversation.

More interestingly, this was the first time I’d looked at my spam comments holistically instead of individually, and a lot of patterns emerged. Spam filter writers take note:

  1. 2,000 comments from a single IP address on a single post are going to be spam.
  2. Not a single one of the comments that contained the words “sex,” “casino,” “cheap,” or “drugs” was valid.
  3. None of the comments whose opening characters were the beginning of an HTML element were valid.
  4. Every comment whose email address ended in .ru was spam.
  5. The phrase “I’m new here” is synonymous with “I’m a spambot.”
  6. BBCode was also universally a trigger, but mostly because I’m not hosting a forum.
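
As a rough illustration, the per-comment patterns above (items 2 through 6) could be sketched in Python like this. The function name `spam_reasons`, the reason labels, and the exact regular expressions are my own assumptions for the example, not the filter that was actually run against this blog, and the BBCode tag list is deliberately short.

```python
import re

# Hypothetical patterns; tune the word and tag lists for your own site.
SPAM_WORDS = re.compile(r"\b(sex|casino|cheap|drugs)\b", re.IGNORECASE)
LEADING_HTML = re.compile(r"\s*<[a-zA-Z]")  # comment opens with an HTML element
NEW_HERE = re.compile(r"i[’']m new here", re.IGNORECASE)
BBCODE = re.compile(r"\[/?(?:url|img|link|b|i)\b[^\]]*\]", re.IGNORECASE)


def spam_reasons(body, email):
    """Return the labels of every pattern a comment trips (empty list if none)."""
    reasons = []
    if SPAM_WORDS.search(body):
        reasons.append("keyword")
    if LEADING_HTML.match(body):
        reasons.append("leading-html")
    if email.strip().lower().endswith(".ru"):
        reasons.append("ru-email")
    if NEW_HERE.search(body):
        reasons.append("new-here")
    if BBCODE.search(body):
        reasons.append("bbcode")
    return reasons
```

Returning the list of reasons rather than a bare yes/no makes it easier to see which pattern is doing the work before deleting anything. The first item (one IP flooding one post) needs counts across comments, so it does not fit a per-comment check like this one.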

I’m sure these facts are well known at this point, but it appears that most spammers simply haven’t adapted in ways that would get them past filters. After a quick SQL cleanse for these most common scenarios I was left with 500 comments, which I quickly scrolled through looking for valid ones. The valid ones stood out plainly because the content of the remaining spam was complete gibberish (gibberish comments were the only spam that got past my first few filters).
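
For the curious, a cleanse along these lines might look like the following sketch, written with Python’s built-in `sqlite3` module. The `comments` table and its `post_id`, `ip`, `email`, and `body` columns are a hypothetical schema, not this blog’s actual database, and the `LIKE` patterns are blunt substring matches (so “cheap” also catches “cheaper”). Back up before running anything like it.

```python
import sqlite3

# Substring patterns for LIKE, which is case-insensitive for ASCII in SQLite.
CONTENT_PATTERNS = [
    "%sex%", "%casino%", "%cheap%", "%drugs%",
    "%I'm new here%", "%I’m new here%",
    "%[url%", "%[/url]%",
]


def purge_obvious_spam(conn, per_ip_limit=2000):
    """Delete comments matching the common patterns; return rows removed."""
    def count():
        return conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]

    before = count()

    # One IP flooding one post: find the (post, IP) pairs, then delete them.
    offenders = conn.execute(
        "SELECT post_id, ip FROM comments "
        "GROUP BY post_id, ip HAVING COUNT(*) >= ?",
        (per_ip_limit,),
    ).fetchall()
    conn.executemany(
        "DELETE FROM comments WHERE post_id = ? AND ip = ?", offenders
    )

    # Keyword, "I'm new here", and BBCode matches on the comment body.
    for pattern in CONTENT_PATTERNS:
        conn.execute("DELETE FROM comments WHERE body LIKE ?", (pattern,))

    # Bodies that open with "<" (a crude HTML check) and .ru addresses.
    conn.execute(
        "DELETE FROM comments "
        "WHERE LTRIM(body) LIKE '<%' OR LOWER(email) LIKE '%.ru'"
    )
    conn.commit()
    return before - count()
```

Calling `purge_obvious_spam(sqlite3.connect("blog.db"))` (with `blog.db` standing in for a real database file) returns the number of rows removed, which is a handy sanity check against the size of the queue.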

I would wager that some of the gibberish ones were intended to plant searchable strings that would identify possible future victims, rather than leading with actual spam. If I were a spammer, that would seem like a reasonable place to start the process. Bonus points for real-sounding posts that end with a gibberish “flag” string.