Better spam filtering

The classic method for spam detection is Bayesian filtering but, honestly, filters that rely on it are getting easier and easier to circumvent thanks to Bayesian poisoning.

And of course, Bayesian poisoning works because computers are stupid.

We’re so used to this stupidity, but when we’re very involved in a project we also tend to forget it. It’s almost like the omnipresent security cameras in cities; we know they’re there, we know they see everything, but after a while we lose our inhibition about publicly picking our noses (or picking our underwear out of our buttcrack) as long as no “real” humans are around. Why? We’re still being watched.

Back to the topic…

Use the stupidity

Fighting computer stupidity… is stupid. It doesn’t make sense to make computers understand what “spam” is in order to stop it, so we need a better way of quantifying a message by turning it into either a number or a simple searchable string.

If computers excel at anything, it’s raw number-crunching, particularly arithmetic. So it makes sense that rolling hashes are the way to go when it comes to turning a message into a basic number or string that can be scanned for commonalities. There are many ways to turn a sentence or paragraph into a rolling hash, but one that particularly caught my eye was the Rabin-Karp algorithm. I read that page a few times, but the words just blurred after a while (probably due to lack of coffee). This part did stick out, though…

A practical application of Rabin–Karp is detecting plagiarism. Given source material, Rabin–Karp can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.

And that makes sense because words can be written in any number of ways and spammers often use punctuation and other special characters to obfuscate what they’re pushing. They also add extra random text, which makes pure Bayesian filtering a problem.
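To make the rolling-hash idea concrete, here’s a minimal sketch of the polynomial hash at the core of Rabin-Karp (in Python for brevity; the base and modulus are arbitrary illustrative choices, not anything from the algorithm’s definition). Sliding the window one character forward updates the hash in constant time, so every window of the text can be checked against the target hash in a single pass.

```python
# Polynomial rolling hash, the core trick behind Rabin-Karp.
BASE, MOD = 256, 1_000_000_007

def window_hash(s):
    """Hash a string as a base-256 number modulo MOD."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def find_all(needle, haystack):
    """Return the start index of every occurrence of needle in haystack."""
    n, m = len(haystack), len(needle)
    if m == 0 or m > n:
        return []
    target = window_hash(needle)
    h = window_hash(haystack[:m])
    top = pow(BASE, m - 1, MOD)  # weight of the outgoing character
    hits = []
    for i in range(n - m + 1):
        # Compare hashes first; confirm with a real string comparison
        # to rule out hash collisions
        if h == target and haystack[i:i + m] == needle:
            hits.append(i)
        if i < n - m:
            # Drop haystack[i], shift left, pull in haystack[i + m]
            h = ((h - ord(haystack[i]) * top) * BASE
                 + ord(haystack[i + m])) % MOD
    return hits

print(find_all("viagra", "buy viagra and more viagra"))  # → [4, 20]
```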

People don’t read punctuation

We don’t when the gist of the message is all that matters. Periods, question marks, commas etc… represent clarity in language. For the purpose of pattern matching, though, they’re irrelevant, since most spam messages have no need for them as long as the product name gets through.

So let’s start with a function that does the following:

  • Strips punctuation
  • Removes line-breaks and special whitespace characters
  • Strips special characters (@#$%^&/[]{}-_ etc…)
  • Converts unicode accented characters into their base characters (á into a etc…)
/// <summary>
/// Helper function that strips punctuation, symbols,
/// accents and newline characters
/// </summary>
/// <param name="source">Original raw text</param>
/// <returns>Cleaned text</returns>
public string RemoveNoise(string source)
{
	if (String.IsNullOrEmpty(source))
		return String.Empty;

	StringBuilder sb = new StringBuilder();

	// Normalize the string and strip combining accent marks (á into a etc...)
	char[] chars = source.Normalize(NormalizationForm.FormD)
		.Where(c => CharUnicodeInfo.GetUnicodeCategory(c)
			!= UnicodeCategory.NonSpacingMark).ToArray();

	// Append characters to the StringBuilder, replacing punctuation,
	// separators, control characters, whitespace and symbols with spaces
	for (int i = 0; i < chars.Length; i++)
		sb.Append(
			(char.IsPunctuation(chars[i]) ||
			char.IsSeparator(chars[i]) ||
			char.IsControl(chars[i]) ||
			char.IsWhiteSpace(chars[i]) ||
			char.IsSymbol(chars[i])) ?
			' ' : chars[i]);

	// Lowercase trimmed text
	return sb.ToString().ToLower().Trim();
}


We now need a way to calculate the distance of one word from another. The best method I’ve found so far to do this (and one that works with most languages) is the Damerau-Levenshtein distance algorithm. That Wikipedia article was once again a bit of a blur, but another part stood out to me…

…the original motivation was to measure distance between human misspellings to improve applications such as spell checkers…

So this is what powers those nifty “suggestions” for misspelled words. It sounds and looks awfully complicated, but essentially it calculates the minimum number of steps needed to turn one word (say, “Hello”) into another (say, “Goodbye”), so we can check the distance between each word in a sentence. For a spellcheck app it makes sense to keep only the candidates with the smallest number of steps, but for our purposes we just need a consistent number for the distance between words.

This is the first step in building our “hash”.

/// <summary>
/// Damerau-Levenshtein distance algorithm
/// </summary>
/// <param name="source">Original text</param>
/// <param name="target">Checking text</param>
/// <param name="limit">Optional maximum distance before bailing out</param>
/// <returns>Match distance between source and target</returns>
public int Distance(string source, string target, int limit = 50)
{
	if (source.Equals(target))
		return 0;

	if (String.IsNullOrEmpty(source) || String.IsNullOrEmpty(target))
		return (source ?? "").Length + (target ?? "").Length;

	// Always treat the shorter string as the source
	if (source.Length > target.Length)
	{
		var t = source;
		source = target;
		target = t;
	}

	if (target.Contains(source))
		return target.Length - source.Length;

	int sLen = source.Length;
	int tLen = target.Length;

	int[,] d = new int[sLen + 1, tLen + 1];

	// Load the matrix
	for (var i = 0; i <= sLen; i++)
		d[i, 0] = i;

	for (var i = 0; i <= tLen; i++)
		d[0, i] = i;

	for (var i = 1; i <= sLen; i++)
	{
		var min = limit;

		for (var j = 1; j <= tLen; j++)
		{
			var cost =
				(source[i - 1] == target[j - 1]) ? 0 : 1;

			// Deletion, insertion and substitution
			d[i, j] =
				Math.Min(d[i - 1, j] + 1,
				Math.Min(d[i, j - 1] + 1,
				d[i - 1, j - 1] + cost));

			// Transposition of two adjacent characters
			if (i > 1 &&
				j > 1 &&
				source[i - 1] == target[j - 2] &&
				source[i - 2] == target[j - 1])
				d[i, j] =
					Math.Min(d[i, j], d[i - 2, j - 2] + cost);

			if (d[i, j] < min)
				min = d[i, j];
		}

		// Every distance in this row exceeded the limit, so bail out
		if (min > limit)
			return int.MaxValue;
	}

	return (d[sLen, tLen] > limit) ? int.MaxValue : d[sLen, tLen];
}


And of course we need to actually build our hash using the above two functions. We first clean the text, then build the hash by appending the distance between each word and its previous neighbor in the sentence.

/// <summary>
/// A simple distance aggregation function that checks
/// the distance between each word in a block of text
/// and builds a rudimentary hash
/// </summary>
/// <param name="source">Source text</param>
/// <returns>Hash</returns>
public string RollingHash(string source)
{
	if (string.IsNullOrEmpty(source))
		return String.Empty;

	StringBuilder sb = new StringBuilder();

	string[] data = RemoveNoise(source)
		.Split(new char[] { ' ' },
			StringSplitOptions.RemoveEmptyEntries);

	// Placeholder to check distance with current string
	string previous = "";

	foreach (string current in data)
	{
		sb.Append(Distance(previous, current).ToString());
		previous = current;
	}

	return sb.ToString();
}

How does this work

Let’s take a typical spam sentence:

Buy Viagra and Cialis today

If we use the above RollingHash function on this, we end up with the hash: 36556.
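As a sanity check, here’s a rough Python port of the three C# functions above (a hypothetical translation, with the limit early-exit dropped for brevity); running it on the sample sentence reproduces the 36556 hash.

```python
import unicodedata

def remove_noise(source):
    """Strip accents, then replace punctuation, symbols and
    whitespace runs with single spaces (mirrors RemoveNoise)."""
    decomposed = unicodedata.normalize("NFD", source)
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")
    cleaned = "".join(c if c.isalnum() else " " for c in stripped)
    return " ".join(cleaned.lower().split())

def distance(source, target):
    """Damerau-Levenshtein distance with the same shortcuts
    as the C# version (minus the limit early-exit)."""
    if source == target:
        return 0
    if not source or not target:
        return len(source) + len(target)
    if len(source) > len(target):
        source, target = target, source
    if source in target:
        return len(target) - len(source)
    s, t = len(source), len(target)
    d = [[0] * (t + 1) for _ in range(s + 1)]
    for i in range(s + 1):
        d[i][0] = i
    for j in range(t + 1):
        d[0][j] = j
    for i in range(1, s + 1):
        for j in range(1, t + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                # transposition of adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[s][t]

def rolling_hash(source):
    """Concatenate the distances between consecutive words."""
    previous, out = "", []
    for current in remove_noise(source).split():
        out.append(str(distance(previous, current)))
        previous = current
    return "".join(out)

print(rolling_hash("Buy Viagra and Cialis today"))  # → 36556
```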

Now let’s throw this a curve ball. We all know how much spammers love to obfuscate their products with nonsense padding and odd characters. Let’s see what one of those messages may look like…

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Búy viagrÆ and Çiâlis today non tincidunt ipsum porta vel.

Turning this into a hash produces the following: 5455391194655647755.
Notice the block after the “4” that it has in common with the hash of the original, unobfuscated string.

Let’s take another example:

Vestibulum quis massa turpis. Ut buy ..viägra.. and *&&ciÅlis!! today vel laoreet dolor. Integer euismod, lectus a buy {[ViÃgRa@$]]. and***ciálÏS*** TôDaÿ faucibus congue.

After turning the above into a hash, we’re still able to pick out the original hash: 1094652655656667763655687. The two instances of “Buy Viagra and Cialis today” can be inferred even with the ridiculous amount of obfuscation.

Here’s a side-by-side example that shows the pattern match more clearly.

The first number in the hash doesn't match because there's no word before "buy", but the rest do.


Instead of checking word by word for commonalities with spam text, which is what most Bayesian filters do, it would be more practical to convert the entire block of text into a hash and search that. For the discussion forum I’m writing, I’m thinking of creating a hash of each post and storing it in the database along with the text content so filtering will be a lot easier.


I overlooked something in the Rabin-Karp algorithm: it not only treats each word as a hash, but also relies on each hash having a good collision ratio. For this to work more effectively, we need something more than just the distance from one word to its neighbor. If we also include the lengths of each word in the hash along with the distance between them, we instantly get much better collision resistance.

E.g. by changing the StringBuilder Append in RollingHash…

foreach (string current in data)
{
	sb.Append(
		QuickHash(previous, current, Distance(previous, current)));

	previous = current;
}

Where QuickHash is as follows…

private static string QuickHash(string c, string s, int d)
{
	return String.Concat(c.Length.ToString(),
		s.Length.ToString(), d.ToString());
}

This will not only make the hash signature of the search text longer, it will also make it less likely to collide.
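As a rough Python illustration of the scheme (using the per-pair distances from the 36556 example earlier), each word pair now contributes both word lengths plus the distance, so the signature stretches from 5 digits to 15:

```python
def quick_hash(prev, cur, dist):
    """Concatenate both word lengths with the distance between them."""
    return f"{len(prev)}{len(cur)}{dist}"

# Words from the sample sentence, with the distances the Distance
# function produced between each consecutive pair (see the 36556 example)
words = ["", "buy", "viagra", "and", "cialis", "today"]
dists = [3, 6, 5, 5, 6]

signature = "".join(quick_hash(words[i], words[i + 1], dists[i])
                    for i in range(len(dists)))
print(signature)  # → 033366635365656
```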

Hash matching improved


The creepiest spam comment ever!

I really don’t enjoy seeing things like this, especially before I’ve had coffee. I don’t know if it was sheer desperation for attention or whether this was just a randomly scanned bit of text, but I sincerely hope it’s the latter.

And then when the blood started coming in gushes and gushes I new that the pregnancy was all over and that the little soul had already gone to the next world, and that there was nothing I could do.

… Also, ew!

See this is why I’m voting for the Spammers Tarred and Feathered party. Don’t know about the rest of their platform policies (something about free LSD and marijuana victory gardens), but I’m all for them!