SpamLookup's Keyword Filter Explained


The SpamLookup 'Keyword Filter' plugin provided with MT 3.2 works in a similar manner to MT-Blacklist (which SpamLookup hopes to make obsolete), but there are some differences in the way it works with keywords and regexes compared to MT-Blacklist. This post will explain some of those differences so you can understand how SpamLookup's 'Keyword Filter' uses the keywords and regexes when it is filtering spam, and perhaps experience a smoother transition when migrating from MT-Blacklist.

SpamLookup's 'Keyword Filter' plugin accepts two types of entries for filtering: 1) keywords, and 2) regexes. MT-Blacklist, by comparison, accepts three types of entries: 1) strings, 2) URL patterns, and 3) regexes.

1. Keywords

Keywords in the SpamLookup 'Keyword Filter' plugin are the equivalent of blacklist strings in MT-Blacklist:

# Words and phrases can be listed plainly. They are tested in a
# case-insensitive manner and match against "whole" words:

The 'match against "whole" words' is a significant difference from how MT-Blacklist handles blacklist strings. When checking a comment or trackback for matching keywords, the SpamLookup 'Keyword Filter' code converts the keyword to a regex and encloses the keyword in '\b' (word boundary) regex metacharacters. The example keyword 'cialis' provided in SpamLookup is converted to the regex '/\bcialis\b/i'.

Because of this behavior, 'cialis' will only be matched in a comment or trackback if it is preceded and followed by a non-word character (a character that is not 0-9, a-z, A-Z, or '_') or by the very beginning or end of the text. In particular, it will match 'Buy cialis!', because the space and exclamation point are non-word characters, but it will not match 'buycialis.com', since the 'y' preceding 'cialis' is a word character. SpamLookup will match 'cialis' if it is a "whole" word (as Perl's regex routines define them), but it will not match 'cialis' if it is within a larger 'word'.

This behavior can be desirable in some circumstances. Using '\bcialis\b' to perform the match prevents SpamLookup from flagging a comment or trackback that merely contains the word 'specialist'. But this behavior can also be a source of frustration if you don't understand how SpamLookup works.
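Here's a minimal sketch of the whole-word behavior described above. It uses Python's re module as a stand-in for Perl (the two treat '\b' the same way for these ASCII examples), and the helper names keyword_to_regex and is_match are my own, not anything from SpamLookup's code:

```python
import re

def keyword_to_regex(keyword):
    # Escape the keyword and wrap it in word-boundary anchors,
    # mirroring the /\bkeyword\b/i conversion described in the post.
    return re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)

cialis = keyword_to_regex("cialis")

def is_match(text):
    # True if the keyword appears as a "whole" word anywhere in the text.
    return cialis.search(text) is not None

print(is_match("Buy cialis!"))       # space and '!' are non-word characters
print(is_match("buycialis.com"))     # 'y' before 'cialis' is a word character
print(is_match("Ask a specialist"))  # 'cialis' is buried inside a larger word
```

Only the first call reports a match; the other two fail because 'cialis' is glued to neighboring word characters.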

Another side-effect of this behavior by SpamLookup: Keywords that start or end with a non-word character will not be matched when that end of the keyword sits at the beginning or end of the comment/trackback, or next to another non-word character such as a space.

An example of this: I get dozens of spam comments every day that start with the HTML tag '<h1>'. It seemed to me it should be very easy to block these comments by adding '<h1>' as a keyword to SpamLookup's Keyword Filter junk list. This in fact does not work.

The reason this does not work is that the keyword '<h1>' is converted to the regex '/\b\<h1\>\b/i'. A '\b' only matches at a position that has a word character on one side and a non-word character on the other. The very beginning of a string is treated as a non-word character, and since '<' is also a non-word character, the leading '\b' in the regex cannot match there. The same thing occurs when trying to match '</h1>' at the end of a comment - the very end of the string is also treated as a non-word character, and the trailing '>' is a non-word character, so the trailing '\b' in the regex cannot match there either. The solution to this issue is to set up '<h1>' as a regex instead.

(Note: I've also been getting a lot of comments with ' <h1>' - the comment starts with a space character before the '<h1>'. Since the space is a non-word character, SpamLookup will not match these comments to a keyword of '<h1>' either.)
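The '<h1>' problem is easy to reproduce. This is a small sketch, again using Python's re module in place of Perl, with made-up sample comments; keyword_matches is a hypothetical helper name:

```python
import re

# The keyword '<h1>' wrapped in word boundaries, as the post describes.
h1_keyword = re.compile(r"\b" + re.escape("<h1>") + r"\b", re.IGNORECASE)

def keyword_matches(text):
    return h1_keyword.search(text) is not None

samples = [
    "<h1>Buy now</h1>",   # '<h1>' at the very start of the comment
    " <h1>Buy now</h1>",  # '<h1>' preceded by a space
    "foo<h1>bar",         # '<h1>' squeezed between word characters
]

for text in samples:
    print(repr(text), keyword_matches(text))
```

Only the last sample matches, because it is the only one where a word character sits directly before the '<' and directly after the '>'. That is rarely what a spam comment looks like, which is why the keyword form is effectively useless for tags.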

There are some types of entries that are well suited to SpamLookup keywords. Domain names are one, since the domain name in a URL will begin and end with a word character, and be preceded and followed by a non-word character (either '/' or '.'). Plain-text words and phrases (such as profanity) would be another. But if you want to block words that may appear within other words (such as within a URL), a regex is probably a better choice.

2. Regexes

Regexes in the SpamLookup 'Keyword Filter' plugin are the equivalent of regexes in MT-Blacklist.

Regexes in SpamLookup must be entered in the keyword filter list with '/' delimiters (the regex must start and end with a '/' character). No other delimiters are recognized by SpamLookup. You'll probably want to use the 'i' modifier after the regex, so SpamLookup will perform a case-insensitive match. Taking the above example, to set up '<h1>' as a regex in SpamLookup, you could use the following (without the quotes): '/<h1>/i'. This will match the text '<h1>' or '<H1>' if it occurs anywhere within a comment or trackback.
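To make the keyword/regex distinction concrete, here is a rough sketch of how a filter list entry could be classified. This is my own illustration in Python, not SpamLookup's actual parsing code; compile_entry and ENTRY_RE are hypothetical names, and only the 'i' modifier is handled:

```python
import re

# An entry counts as a regex only if it starts with '/' and ends with
# '/' plus optional modifier letters; anything else is a plain keyword.
ENTRY_RE = re.compile(r"^/(.*)/([a-z]*)$")

def compile_entry(entry):
    m = ENTRY_RE.match(entry)
    if m:
        pattern, modifiers = m.groups()
        flags = re.IGNORECASE if "i" in modifiers else 0
        return re.compile(pattern, flags)
    # Plain keyword: case-insensitive, whole-word match.
    return re.compile(r"\b" + re.escape(entry) + r"\b", re.IGNORECASE)

as_regex = compile_entry("/<h1>/i")
as_keyword = compile_entry("<h1>")

comment = "<H1>Buy now</H1>"
print(as_regex.search(comment) is not None)
print(as_keyword.search(comment) is not None)
```

The regex entry catches the comment and the keyword entry does not, which matches the behavior I described for '<h1>' earlier in this post.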

URL patterns are an MT-Blacklist feature that is not present in SpamLookup. MT-Blacklist searches the domain names of all URLs appearing within comments and trackbacks for any matching URL patterns. This was a powerful feature of MT-Blacklist, as it allowed blocking of any domain names containing the word 'poker', for example, while still allowing 'poker' to be used as a plain word in comments and trackbacks. Spammers have caught on to this technique, however, and are now using ordinary domain names instead, putting the spam terms in the page names of their links to escape this method of blocking.

Adding the equivalent of an MT-Blacklist URL pattern to SpamLookup requires using a regex. I've crafted a regex that works in a similar manner to MT-Blacklist's URL patterns, but will scan the entire URL for the matching term (not just the domain name) in all URLs that appear in the comment or trackback:

/https?:\/\/[^\s\'"<>]*(?:spamword1|spamword2|spamword3)[^\s\'"<>]*/i

Replace 'spamword1', 'spamword2', etc., with the actual words you want to check for and block within URLs. You can add more words within a single regex - separate each word with a "|".
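If you want to try the pattern out before adding it to your filter list, something like the following works. It's a sketch in Python rather than Perl, url_pattern is a hypothetical helper of my own, and the example.com URL is made up:

```python
import re

def url_pattern(words):
    # Join the spam words with '|' and embed them in the URL regex:
    # scheme, then any run of characters that can't end a URL, then
    # one of the words, then the rest of the URL.
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(
        r"https?://[^\s'\"<>]*(?:" + alternation + r")[^\s'\"<>]*",
        re.IGNORECASE,
    )

poker = url_pattern(["poker", "casino"])

linked = '<a href="http://example.com/best-poker-page.html">hi</a>'
plain = "I played poker last night"

print(poker.search(linked) is not None)  # the word is inside a URL
print(poker.search(plain) is not None)   # the word is only plain text
```

The first check matches and the second does not, so 'poker' stays usable as an ordinary word in comments while still being blocked anywhere in a link, including the page name.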

I've set up a series of these regexes in my installation. Each one generally has somewhere between 3 and 7 words to be searched for. This is to minimize the total number of regexes that SpamLookup will execute for each comment/trackback submission. SpamLookup will check every keyword and regex even after it encounters a match, and SpamLookup actually executes each keyword and regex twice (once on the raw comment/trackback text, and once after the comment/trackback has been HTML-decoded).
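The double pass matters because spam text can hide a word behind HTML entities. The sketch below is only my guess at why the decoded pass is useful, written in Python with the standard html module; matches_raw_or_decoded is a hypothetical helper, not SpamLookup's routine:

```python
import html
import re

def matches_raw_or_decoded(pattern, text):
    # Try the raw text first, then the HTML-decoded version.
    return bool(pattern.search(text) or pattern.search(html.unescape(text)))

cialis = re.compile(r"\bcialis\b", re.IGNORECASE)

# '&#99;' is the numeric character reference for 'c'.
encoded = "Buy &#99;ialis now"

print(cialis.search(encoded) is not None)       # raw text only
print(matches_raw_or_decoded(cialis, encoded))  # raw plus decoded
```

The raw pass misses the word and the decoded pass catches it. It also shows where the cost comes from: every entry in the list can be run up to twice per submission.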

Here's what I have set up to block various gambling/poker web sites from spamming me:

/https?:\/\/[^\s\'"<>]*(?:online|poker|casino)[^\s\'"<>]*/i 2
/https?:\/\/[^\s\'"<>]*(?:blackjack|roulette|slots|craps|gambling)[^\s\'"<>]*/i 2
/https?:\/\/[^\s\'"<>]*(?:texas[\w\-_.]*hold[\w\-_.]*em)[^\s\'"<>]*/i 2

The first regex will block comments and trackbacks that contain a URL with any of the words 'online', 'poker' or 'casino' within it. The second regex performs the same function, with a different group of gambling-related words. The last regex is a special one that looks for variations on 'texas hold-em'. Each regex also has a '2' following it, as I wanted any matching comment or trackback to have its junk score docked 2 points.

These could have all been set up as one regex. I've split them up because 1) each one fits within the display width of the keyword entry box, so I don't have to scroll back and forth to see the entire regex, and 2) if more than one regex is matched, the junk score is docked once for each regex matched.
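Here's a small sketch of how the per-regex scoring adds up, using my three entries verbatim. It's written in Python as an approximation; the parsing of the trailing weight and the names parse_line and total_penalty are my own assumptions about the format, not SpamLookup's code, and the URLs are made up:

```python
import re

ENTRIES = [
    r"""/https?:\/\/[^\s\'"<>]*(?:online|poker|casino)[^\s\'"<>]*/i 2""",
    r"""/https?:\/\/[^\s\'"<>]*(?:blackjack|roulette|slots|craps|gambling)[^\s\'"<>]*/i 2""",
    r"""/https?:\/\/[^\s\'"<>]*(?:texas[\w\-_.]*hold[\w\-_.]*em)[^\s\'"<>]*/i 2""",
]

# Assumed line format: /pattern/ with an optional 'i', then a weight.
LINE_RE = re.compile(r"^/(.*)/(i?)\s+(\d+)$")

def parse_line(line):
    pattern, modifier, weight = LINE_RE.match(line).groups()
    flags = re.IGNORECASE if modifier else 0
    return re.compile(pattern, flags), int(weight)

RULES = [parse_line(line) for line in ENTRIES]

def total_penalty(text):
    # Each matching regex contributes its own weight to the penalty.
    return sum(weight for regex, weight in RULES if regex.search(text))

print(total_penalty("see http://example.com/texas-hold-em-poker.html"))
print(total_penalty("see http://example.com/roulette"))
print(total_penalty("I played poker online last night"))
```

The first URL trips two of the three regexes, so it is penalized twice; the second trips one; the plain-text sentence trips none. Had everything been folded into a single regex, the first URL would only have been penalized once.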

Similar regexes can be set up for other types of spam sites that are related, such as sites selling various drugs, lending/mortgage sites, and adult web sites.

I currently have about a dozen regexes in my SpamLookup setup. For the most part, SpamLookup's IP address and domain name lookups are able to identify spam, but for the few that get by (new domain names, or machines on IP addresses that have not yet been identified as a spam source), these keyword regexes add an extra layer of defense to help make sure the spam doesn't get through.

4 TrackBacks

David Phillips explains SpamLookup's Keyword Filter, how it's similar to MT Blacklist and how it's different. Read More

OK, we've been back from Mexico for two weeks now, and to post our exploits here, I've had to resort to a few small repairs on this blog. I normally don't like talking about my blog maintenance work - it's a bunch of navel-gazing I say - but it's important ... Read More

Quick Way To Eliminate a LOT of Comment Spam from Three Years of Hell to Become the Devil on October 19, 2005 9:34 AM

I've been getting over 100 spam messages a day, almost all of the form: You might be interested in Notice two things: the spam message is included in H1 tags, and as a result it's extra-obnoxious, because it's very, very... Read More

Recently I've noticed a marked increase in the amount of comment spam I've been receiving. It got so bad at one point that I could have a hundred or (many) more spam comments to wade through in my inbox.... Read More

2 Comments

Tweezer, thanks for this. I've been looking everywhere for a definitive "how to convert your MT-Blacklist entries to SpamLookup" tutorial, and you've cinched it. I don't know why the regex is different between the two systems (one would even think that Jay would have just dumped his latest blacklist.txt into the Spamlookup wordlist, go figure), but it is frustrating that there is virtually no SpamLookup documentation, and the spammers are already working their way around MT3.2's SpamLookup through random URL spam.

Anyway, thanks again.

It took me a while to find out that, on MT 3.3, HTML tags have to be enabled on comments; otherwise they get stripped out before SpamLookup can test its filters on them.
Thanks for the explanations.