Better spam filtering

The classic method for spam detection used to be Bayesian filtering, but honestly, filters using it are getting easier and easier to circumvent due to Bayesian poisoning.

And of course, Bayesian poisoning works because computers are stupid.

We’re so used to this stupidity, but when we’re very involved in a project we also tend to forget it. It’s almost like the omnipresent security cameras in cities; we know they’re there, we know they see everything, but after a while we lose our inhibition to publicly pick our noses (or pick our underwear out of our buttcrack) as long as no “real” humans are around. Why? We’re still being watched.

Back to the topic…

Use the stupidity

Fighting computer stupidity… is stupid. It doesn’t make sense to try to make computers understand what “spam” is in order to stop it; instead, we need a better way of quantifying a message, by turning it into either a number or a simple searchable string.

If computers excel at anything, it’s raw number-crunching, particularly arithmetic. So it makes sense that rolling hashes are the way to go when it comes to turning a message into a basic number or string that can be scanned for commonalities. There are many ways to turn a sentence or paragraph into a rolling hash, but one that particularly caught my eye was the Rabin-Karp algorithm. I read that page a few times, but the words just blurred after a while (probably due to lack of coffee). Although this part did stick out…

A practical application of Rabin–Karp is detecting plagiarism. Given source material, Rabin–Karp can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.

And that makes sense because words can be written in any number of ways and spammers often use punctuation and other special characters to obfuscate what they’re pushing. They also add extra random text, which makes pure Bayesian filtering a problem.
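For context, here’s a minimal sketch of the rolling-hash trick Rabin-Karp relies on: hash one window of text, then slide the window one character at a time by dropping the outgoing character and mixing in the incoming one, so each step is cheap arithmetic instead of a full re-hash. This Search helper is purely illustrative and isn’t part of the filter below.

/// <summary>
/// Illustrative Rabin-Karp substring search using a rolling hash.
/// Not part of the spam filter; just here to show the idea.
/// </summary>
public static int Search(string text, string pattern)
{
	const long B = 256;        // Base (treat characters as digits)
	const long M = 1000000007; // Large prime modulus

	int n = text.Length, m = pattern.Length;
	if (m == 0 || m > n)
		return -1;

	// B^(m-1) mod M, used to remove the outgoing character
	long high = 1;
	for (int i = 0; i < m - 1; i++)
		high = (high * B) % M;

	long patternHash = 0, windowHash = 0;
	for (int i = 0; i < m; i++)
	{
		patternHash = (patternHash * B + pattern[i]) % M;
		windowHash = (windowHash * B + text[i]) % M;
	}

	for (int i = 0; i + m <= n; i++)
	{
		// Hashes match; compare characters to rule out a collision
		if (windowHash == patternHash &&
			string.CompareOrdinal(text, i, pattern, 0, m) == 0)
			return i;

		// Roll the window: drop text[i], append text[i + m]
		if (i + m < n)
		{
			windowHash = (windowHash - text[i] * high % M + M) % M;
			windowHash = (windowHash * B + text[i + m]) % M;
		}
	}

	return -1;
}

The approach below keeps that spirit (one signature you can scan linearly) but swaps the character arithmetic for word distances.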

People don’t read punctuation

We don’t when the gist of the message is all that matters. Periods, question marks, commas etc. add clarity to language, but for the purposes of pattern matching they’re irrelevant, since most spam messages have no need for them as long as the product name gets through.

So let’s start with a function that does the following:

  • Strips punctuation
  • Removes line-breaks and special whitespace characters
  • Strips special characters (@#$%^&/[]{}-_ etc…)
  • Converts unicode accented characters into their base characters (á into a etc…)
/// <summary>
/// Helper function removes all punctuation and newline characters
/// </summary>
/// <param name="source">Original raw text</param>
/// <returns>Cleaned text</returns>
public string RemoveNoise(string source)
{
	if (String.IsNullOrEmpty(source))
		return String.Empty;

	StringBuilder sb = new StringBuilder();

	// Normalize the string and convert accents etc...
	char[] chars = source.Normalize(NormalizationForm.FormD)
		.Where(c => CharUnicodeInfo.GetUnicodeCategory(c)
			!= UnicodeCategory.NonSpacingMark).ToArray();

	// Append only characters to the StringBuilder
	for (int i = 0; i < chars.Length; i++)
	{
		sb.Append(
			(char.IsPunctuation(chars[i]) ||
			char.IsSeparator(chars[i]) ||
			char.IsControl(chars[i]) ||
			char.IsWhiteSpace(chars[i]) ||
			char.IsSymbol(chars[i])) ?
			' ' : chars[i]
			);
	}

	// Lowercase trimmed text
	return sb.ToString()
		.ToLowerInvariant()
		.Trim();
}
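For example, a call like this should come out as plain lowercase words (the exact spacing doesn’t matter, since the split in RollingHash further down throws away empty entries):

// Accents stripped, punctuation and symbols turned into spaces
string clean = RemoveNoise("Búy ..Viagra!! and *Cialis* today");
// clean is roughly "buy   viagra   and  cialis  today"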

 

We now need a way to calculate the distance of one word from another. The best method I’ve found so far for this (and one that works with most languages) is the Damerau-Levenshtein distance algorithm. That Wikipedia article was once again a bit of a blur, but another part stood out to me…

…the original motivation was to measure distance between human misspellings to improve applications such as spell checkers…

So this is what drives those nifty “suggestions” for misspelled words. It sounds and looks awfully complicated, but essentially it calculates the minimum number of steps needed to turn one word into another, e.g. “Hello” into “Goodbye”, which lets us measure the distance between each word in a sentence. A spellcheck app only cares about the candidates with the smallest number of steps, but for our purposes we just need a consistent number for the distance between neighboring words.

This is the first step in building our “hash”.

/// <summary>
/// Damerau-Levenshtein distance algorithm (optimal string alignment variant)
/// </summary>
/// <param name="source">Original text</param>
/// <param name="target">Checking text</param>
/// <param name="limit">Optional maximum word size</param>
/// <returns>Match distance between source and target</returns>
public int Distance(string source, string target, int limit = 50)
{
	if (source.Equals(target))
		return 0;

	if (String.IsNullOrEmpty(source) ||
		String.IsNullOrEmpty(target))
		return (source ?? "").Length + (target ?? "").Length;

	if (source.Length > target.Length)
	{
		var t = source;
		source = target;
		target = t;
	}

	if (target.Contains(source))
		return target.Length - source.Length;

	int sLen = source.Length;
	int tLen = target.Length;

	int[,] d = new int[sLen + 1, tLen + 1];

	// Load the matrix
	for (var i = 0; i <= sLen; i++)
		d[i, 0] = i;

	for (var i = 0; i <= tLen; i++)
		d[0, i] = i;

	for (var i = 1; i <= sLen; i++)
	{
		// Track the smallest value in this row for the early bail-out below
		var min = int.MaxValue;

		for (var j = 1; j <= tLen; j++)
		{
			var cost =
				(source[i - 1] == target[j - 1]) ? 0 : 1;

			d[i, j] =
				Math.Min(d[i - 1, j] + 1,
				Math.Min(d[i, j - 1] + 1,
				d[i - 1, j - 1] + cost));

			if (i > 1 &&
				j > 1 &&
				source[i - 1] == target[j - 2] &&
				source[i - 2] == target[j - 1])
				d[i, j] =
					Math.Min(d[i, j], d[i - 2, j - 2] + cost);

			if (d[i, j] < min)
				min = d[i, j];
		}

		if (min > limit)
			return int.MaxValue;
	}

	return (d[sLen, tLen] > limit)? int.MaxValue : d[sLen, tLen];
}
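A few quick sanity checks, worked out by hand against the matrix above:

Console.WriteLine(Distance("hello", "hello"));  // 0 - identical
Console.WriteLine(Distance("hello", "hallo"));  // 1 - one substitution
Console.WriteLine(Distance("hello", "ehllo"));  // 1 - one adjacent transposition
Console.WriteLine(Distance("buy", "viagra"));   // 6 - no letters in common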

 

And of course we need to actually build our hash using the above two functions: first clean the text, then build the hash by appending the distance between each word and its previous neighbor in the sentence.

/// <summary>
/// A simple distance aggregation function checks
/// the distance between each word in a block of text
/// and builds a rudimentary hash
/// </summary>
/// <param name="source">Source text</param>
/// <returns>Hash</returns>
public string RollingHash(string source)
{
	if (string.IsNullOrEmpty(source))
		return String.Empty;

	StringBuilder sb = new StringBuilder();

	string[] data = RemoveNoise(source)
		.Split(new char[] { ' ' },
		StringSplitOptions.RemoveEmptyEntries);

	// Placeholder to check distance with current string
	string previous = "";

	foreach (string current in data)
	{
		sb.Append(Distance(previous, current).ToString());
		previous = current;
	}

	return sb.ToString();
}

How does this work?

Let’s take a typical spam sentence:

Buy Viagra and Cialis today

If we use the above RollingHash function on this, we end up with the hash: 36556.
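Or, in code (assuming the three functions above sit on the same class):

string spamHash = RollingHash("Buy Viagra and Cialis today");
// spamHash == "36556" : 3 (""->"buy"), 6 (buy->viagra),
// 5 (viagra->and), 5 (and->cialis), 6 (cialis->today)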

Now let’s throw this a curve ball. We all know how much spammers love to obfuscate their products with nonsense padding and odd characters. Let’s see what one of those messages may look like…

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Búy viagrÆ and Çiâlis today non tincidunt ipsum porta vel.

Turning this into a hash produces the following: 5455391194655647755. Notice the common block after the “4”, matching the hash of the original unobfuscated string above.

Let’s take another example:

Vestibulum quis massa turpis. Ut buy ..viägra.. and *&&ciÅlis!! today vel laoreet dolor. Integer euismod, lectus a buy {[ViÃgRa@$]]. and***ciálÏS*** TôDaÿ faucibus congue.

After turning the above into a hash, we get 1094652655656667763655687, and we can still pick out the original pattern. The two instances of “Buy Viagra and Cialis today” can be inferred even with the ridiculous amount of obfuscation.

Here’s a side-by-side example that shows the pattern match more clearly.

The first number in the hash doesn't match because there's no word before "buy", but the rest do.

 

Instead of checking word by word for commonalities with spam text, which is what most Bayesian filters do, it would be more practical to convert the entire block of text into a hash and search that instead. For the discussion forum I’m writing, I’m thinking of creating a hash of each post and storing it in the database along with the text content so filtering would be a lot easier.
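As a rough sketch of what that lookup could look like (IsProbablySpam and spamPhrases are just placeholder names, and this assumes the single-digit distances seen in the examples above):

/// <summary>
/// Hypothetical filter; in practice spamPhrases would come from the database
/// </summary>
public bool IsProbablySpam(string post, IEnumerable<string> spamPhrases)
{
	string postHash = RollingHash(post);

	return spamPhrases
		.Select(p => RollingHash(p))
		.Where(h => h.Length > 1)
		// The first number of a phrase's hash depends on whatever word
		// happens to precede it (see above), so match on the rest
		.Any(h => postHash.Contains(h.Substring(1)));
}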

Update

I overlooked something about the Rabin-Karp algorithm: it not only treats each word as a hash, it also relies on those hashes having a low collision rate. For this to work more effectively, we need more than just the distance from one word to its neighbor. If we also include the lengths of each word in the hash along with the distance between them, we instantly get much better collision resistance.

E.g. by changing the StringBuilder Append in RollingHash…

foreach (string current in data)
{
	sb.Append(
		QuickHash(previous, current, Distance(previous, current))
		);

	previous = current;
}

Where QuickHash is as follows…

private static string QuickHash(string c, string s, int d)
{
	return String.Concat(c.Length.ToString(),
		s.Length.ToString(), d.ToString());
}

This will not only make the hash signature of the search text longer, it will also make it less likely to collide.
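Working through the same sentence by hand with the distances behind the 36556 example above, “Buy Viagra and Cialis today” would now come out to something like 033366635365656 (each block being previous length, current length, distance: 0-3-3, 3-6-6, 6-3-5, 3-6-5, 6-5-6), which is a much harder pattern to stumble into by accident.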

Hash matching improved


3 thoughts on “Better spam filtering”

  1. This is very interesting. Though as a general rule, I cannot stand natural language processing; consequently, my knowledge in the area is somewhat limited.

    First, you could easily apply that preprocessing step to a naive Bayesian classifier to improve its accuracy–removing whitespace, punctuation (unless very intelligently included in the model), and other characters that spammers like to throw in are steps that just about any classifier should be employing. But I think your main point is that the Bayesian classifier itself is underarmed for outsmarting the craftier spammers, a point I would tend to agree with.

    Second, the edit distance you mentioned–measuring the “distance” between two words–has a lot of roots in information theory and is central to genetic algorithms. The idea behind the latter is to use random “evolution” of a hypothesis in order to explore a search space that is otherwise intractably large for an iterative search. Hamming distance, for example, is a binary version that literally counts the number of bits that differ in two strings. I’m not familiar with the theoretical underpinnings of the DL metric, but it would appear to be a reasonable one to use in your rolling-hash classifier.

    All in all, very interesting read. While it doesn’t quite overlap with my own work, it’s awfully close :)

    • Thanks!

      Oh yes, this is meant to go right in front of a Bayesian filter. In fact we were in the middle of a retrofit (the previous dev placed filtering right in the core) so I had to find a way to improve the hit/miss ratio until the upcoming rewrite.

      It’s funny how these kinds of problems pop up in unexpected places.

      Case in point, Hamming distance came up the other day too because we were fixing some DIY crypto (always a bad idea) and realized that the previously computed hashes were predictable. Kinda defeats the whole purpose of crypto.

      Your post on the lecture reminded me of how I had to explain Bayesian filtering to my boss yesterday. After the first 5 minutes, I reverted to…

      (Good Words / Bad Words) > Minimum = OK :D

  2. Pingback: Discussion Forum Update (tables and classes) « This page intentionally left ugly
