Whitelist HTML sanitizing with PHP

The following is a single class written to perform comprehensive HTML input filtering with minimal dependencies (essentially just the Tidy extension) and should work in PHP 5.3+. It will be included in my forum script as the default filter.

This version catches URL-encoded XSS attempts with deep attribute inspection (to a decoding depth of 6 by default), scrubs all non-whitelisted tags and attributes, and converts surviving attribute data into HTML entities.

In addition, it will attempt to catch directory traversal attempts ( ../ or \\ or /~/ etc… ), which could give access to restricted areas of a site. Your web server should deny access to such URLs by default, but that won’t stop someone from posting links pointing elsewhere. Filtering them out reduces your liability should such a link be included in your site content by a user.
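The traversal check amounts to a short pattern match. Here's a rough Python sketch of the same idea (an approximation of the class's rx_esc pattern, with my own function name):

```python
import re

# Roughly equivalent to the class's rx_esc pattern: flags "~/", "..",
# a lone backslash, and "--" (also a common SQL comment marker)
TRAVERSAL = re.compile(r'~/|\.\.|\\|--')

def looks_like_traversal(url: str) -> bool:
    """Return True if the URL contains a traversal/escape sequence."""
    return TRAVERSAL.search(url) is not None
```

So `looks_like_traversal("../../etc/passwd")` and `looks_like_traversal("/~/private")` both trip the filter, while an ordinary page URL passes through.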

You can post sourcecode within <code> tags and it will be encoded by default.

<?php

/**
 * HTML parsing, filtering and sanitization
 * This class depends on the Tidy extension, which ships bundled with PHP
 *
 * @author Eksith Rodrigo <reksith at gmail.com>
 * @license http://opensource.org/licenses/ISC ISC License
 * @version 0.2
 */

class Html {
	
	/**
	 * @var array HTML filtering options
	 */
	public static $options = array( 
		'rx_url'	=> // URLs over 255 chars can cause problems
			'~^(?=.{1,255}$)(http|ftp)(s)?\:\/\/((([a-z0-9\-]{1,25})(\.)?){2,7})($|/.*$)~i',
		
		'rx_js'		=> // Questionable attributes
			'/((java)?script|eval|document)/ism',
		
		'rx_xss'	=> // XSS (<style> can also be a vector. Stupid IE 6!)
			'/(<(s(?:cript|tyle)).*?)/ism',
		
		'rx_xss2'	=> // More potential XSS
			'/(document\.|window\.|eval\(|\(\))/ism',
		
		'rx_esc'	=> // Directory traversal/escaping/injection
			'/(\\~\/|\.\.|\\\\|\-\-)/sm'	,
		
		'scrub_depth'	=> 6, // URL Decoding depth (fails on exceeding this)
		
		'nofollow'	=> true // Set rel='nofollow' on all links

	);
	
	/**
	 * @var array List of HTML Tidy output settings
	 * @link http://tidy.sourceforge.net/docs/quickref.html
	 */
	private static $tidy = array(
		// Preserve whitespace inside tags
		'add-xml-space'			=> true,
		
		// Remove proprietary markup (E.G. og:tags)
		'bare'				=> true,
		
		// More proprietary markup
		'drop-proprietary-attributes'	=> true,
		
		// Remove blank (E.G. <p></p>) paragraphs
		'drop-empty-paras'		=> true,
		
		// Wraps bare text in <p> tags
		'enclose-text'			=> true,
		
		// Removes illegal/invalid characters in URIs
		'fix-uri'			=> true,
		
		// Removes <!-- Comments -->
		'hide-comments'			=> true,
		
		// Removing indentation saves storage space
		'indent'			=> false,
		
		// Combine individual formatting styles
		'join-styles'			=> true,
		
		// Converts <i> to <em> & <b> to <strong>
		'logical-emphasis'		=> true,
		
		// Byte Order Mark isn't really needed
		'output-bom'			=> false,
		
		// Ensure UTF-8 characters are preserved
		'output-encoding'		=> 'utf8',
		
		// W3C standards compliant markup
		'output-xhtml'			=> true,
		
		// Had some unexpected behavior with this
		//'markup'			=> true,

		// Merge multiple <span> tags into one		
		'merge-spans'			=> true,
		
		// Only outputs <body> (<head> etc... not needed)
		'show-body-only'		=> true,
		
		// Removing empty lines saves storage
		'vertical-space'		=> false,
		
		// Wrapping tags not needed (saves bandwidth)
		'wrap'				=> 0
	);
	
	
	/**
	 * @var array Whitelist of tags. Trim or expand these as necessary
	 * @example 'tag' => array( of, allowed, attributes )
	 */
	private static $whitelist = array(
		'p'		=> array( 'style', 'class', 'align' ),
		'div'		=> array( 'style', 'class', 'align' ),
		'span'		=> array( 'style', 'class' ),
		'br'		=> array( 'style', 'class' ),
		'hr'		=> array( 'style', 'class' ),
		
		'h1'		=> array( 'style', 'class' ),
		'h2'		=> array( 'style', 'class' ),
		'h3'		=> array( 'style', 'class' ),
		'h4'		=> array( 'style', 'class' ),
		'h5'		=> array( 'style', 'class' ),
		'h6'		=> array( 'style', 'class' ),
		
		'strong'	=> array( 'style', 'class' ),
		'em'		=> array( 'style', 'class' ),
		'u'		=> array( 'style', 'class' ),
		'strike'	=> array( 'style', 'class' ),
		'del'		=> array( 'style', 'class' ),
		'ol'		=> array( 'style', 'class' ),
		'ul'		=> array( 'style', 'class' ),
		'li'		=> array( 'style', 'class' ),
		'code'		=> array( 'style', 'class' ),
		'pre'		=> array( 'style', 'class' ),
		
		'sup'		=> array( 'style', 'class' ),
		'sub'		=> array( 'style', 'class' ),
		
		// Took out 'rel' and 'title', because we're using those below
		'a'		=> array( 'style', 'class', 'href' ),
		
		'img'		=> array( 'style', 'class', 'src', 'height', 
					  'width', 'alt', 'longdesc', 'title', 
					  'hspace', 'vspace' ),
		
		'table'		=> array( 'style', 'class', 'border-collapse', 
					  'cellspacing', 'cellpadding' ),
					
		'thead'		=> array( 'style', 'class' ),
		'tbody'		=> array( 'style', 'class' ),
		'tfoot'		=> array( 'style', 'class' ),
		'tr'		=> array( 'style', 'class' ),
		'td'		=> array( 'style', 'class', 
					'colspan', 'rowspan' ),
		'th'		=> array( 'style', 'class', 'scope', 'colspan', 
					  'rowspan' ),
		
		'q'		=> array( 'style', 'class', 'cite' ),
		'cite'		=> array( 'style', 'class' ),
		'abbr'		=> array( 'style', 'class' ),
		'blockquote'	=> array( 'style', 'class' ),
		
		// Stripped out
		'body'		=> array()
	);
	
	
	
	/**#@+
	 * HTML Filtering
	 */
	
	
	/**
	 * Convert content between code blocks into code tags
	 * 
	 * @param $val string Value to encode to entities
	 */
	protected function escapeCode( $val ) {
		
		if ( is_array( $val ) ) {
			$out = self::entities( $val[1] );
			return '<code>' . $out . '</code>';
		}
		
		return $val;
	}
	
	
	/**
	 * Convert an unformatted text block to paragraphs
	 * 
	 * @link http://stackoverflow.com/a/2959926
	 * @param $val string Filter variable
	 */
	protected function makeParagraphs( $val ) {
		
		/**
		 * Convert newlines to linebreaks first
		 * This is why PHP both sucks and is awesome at the same time
		 */
		$out = nl2br( $val );
		
		/**
		 * Turn consecutive <br>s to paragraph breaks and wrap the 
		 * whole thing in a paragraph
		 */
		$out = '<p>' . preg_replace('#(?:<br\s*/?>\s*?){2,}#', 
			'<p></p><p>', $out ) . '</p>';
		
		/**
		 * Remove <br> abnormalities
		 */
		$out = preg_replace( '#<p>(\s*<br\s*/?>)+#', '</p><p>', $out );
		$out = preg_replace( '#<br\s*/?>(\s*</p>)+#', '<p></p>', $out );
		
		return $out;
	}
	
	
	/**
	 * Filters HTML content through whitelist of tags and attributes
	 * 
	 * @param $val string Value filter
	 */
	public function filter( $val ) {
		
		if ( empty( $val ) ) {
			return '';
		}
		
		/**
		 * Escape the content of any code blocks before we parse HTML or 
		 * they will get stripped
		 */
		$out	= preg_replace_callback( "/\<code\>(.*?)\<\/code\>/siu", 
				array( $this, 'escapeCode' ) , $val
			);
		
		/**
		 * Convert to paragraphs and begin
		 */
		$out	= $this->makeParagraphs( $out );
		$dom	= new DOMDocument();
		
		/**
		 * Hide parse warnings since we'll be cleaning the output anyway
		 */
		$err	= libxml_use_internal_errors( true );
		
		$dom->loadHTML( $out );
		$dom->encoding = 'utf-8';
		
		$body	= $dom->getElementsByTagName( 'body' )->item( 0 );
		$badTags = array();
		$this->cleanNodes( $body, $badTags );
		
		/**
		 * Iterate through bad tags found above and convert them to 
		 * harmless text
		 */
		foreach ( $badTags as $node ) {
			if( $node->nodeName != "#text" ) {
				$ctext = $dom->createTextNode( 
						$dom->saveHTML( $node )
					);
				$node->parentNode->replaceChild( 
					$ctext, $node 
				);
			}
		}
		
		
		/**
		 * Filter the junk and return only the contents of the body tag
		 */
		$out = tidy_repair_string( 
				$dom->saveHTML( $body ), 
				self::$tidy
			);
		
		
		/**
		 * Reset errors
		 */
		libxml_clear_errors();
		libxml_use_internal_errors( $err );
		
		return $out;
	}
	
	
	protected function cleanAttributeNode( 
		&$node, 
		&$attr, 
		&$goodAttributes, 
		&$href 
	) {
		/**
		 * Why the devil is an attribute name called "nodeName"?!
		 */
		$name = $attr->nodeName;
		
		/**
		 * And an attribute value is still "nodeValue"?? Damn you PHP!
		 */
		$val = $attr->nodeValue;
		
		/**
		 * Default action is to remove the attribute completely
		 * It's reinstated only if it's allowed and only after 
		 * it's filtered
		 */
		$node->removeAttributeNode( $attr );
		
		if ( in_array( $name, $goodAttributes ) ) {
			
			switch ( $name ) {
				
				/**
				 * Validate URL attribute types
				 */
				case 'url':
				case 'src':
				case 'href':
				case 'longdesc':
					if ( self::urlFilter( $val ) ) {
						$href = $val;
					} else {
						$val = '';
					}
					break;
				
				/**
				 * Everything else gets default scrubbing
				 */
				default:
					if ( self::decodeScrub( $val ) ) {
						$val = self::entities( $val );
					} else {
						$val = '';
					}
			}
			
			if ( '' !== $val ) {
				$node->setAttribute( $name, $val );
			}
		}
	}
	
	
	/**
	 * Modify links to display their domains and add 'nofollow'.
	 * Also puts the linked domain in the title as well as the file name
	 */
	protected static function linkAttributes( &$node, $href ) {
		try {
			$parsed	= parse_url( $href );
			$title	= $parsed['host'] . ' ';
			
			$f	= pathinfo( 
					isset( $parsed['path'] ) ? 
						$parsed['path'] : '/'
				);
			$title	.= ' ( /' . $f['basename'] . ' ) ';
				
			$node->setAttribute( 
				'title', $title
			);
			
			if ( self::$options['nofollow'] ) {
				$node->setAttribute(
					'rel', 'nofollow'
				);
			}
			
		} catch ( Exception $e ) { }
	}
	
	
	/**
	 * Iterate through each tag and add non-whitelisted tags to the 
	 * bad list. Also filter the attributes and remove non-whitelisted ones.
	 * 
	 * @param DOMNode $node Current HTML node
	 * @param array $badTags Cumulative list of tags for deletion
	 */
	protected function cleanNodes( $node, &$badTags = array() ) {
		
		if ( array_key_exists( $node->nodeName, self::$whitelist ) ) {
			
			if ( $node->hasAttributes() ) {
				
				/**
				 * Prepare for href attribute which gets special 
				 * treatment
				 */
				$href = '';
				
				/**
				 * Filter through attribute whitelist for this 
				 * tag
				 */
				$goodAttributes = 
					self::$whitelist[$node->nodeName];
				
				
				/**
				 * Check out each attribute in this tag
				 */
				foreach ( 
					iterator_to_array( $node->attributes ) 
					as $attr ) {
					$this->cleanAttributeNode( 
						$node, $attr, $goodAttributes, 
						$href
					);
				}
				
				/**
				 * This is a link. Treat it accordingly
				 */
				if ( 'a' === $node->nodeName && '' !== $href ) {
					self::linkAttributes( $node, $href );
				}
				
			} // End if( $node->hasAttributes() )
			
			/**
			 * If we have childnodes, recursively call cleanNodes 
			 * on those as well
			 */
			if ( $node->childNodes ) {
				foreach ( $node->childNodes as $child ) {
					$this->cleanNodes( $child, $badTags );
				}
			}
			
		} else {
			
			/**
			 * Not in whitelist so no need to check its child nodes. 
			 * Simply add to array of nodes pending deletion.
			 */
			$badTags[] = $node;
			
		} // End if array_key_exists( $node->nodeName, self::$whitelist )
		
	}
	
	/**#@-*/
	
	
	/**
	 * Returns true if the URL passed value is harmless.
	 * This regex takes into account Unicode domain names; however, it 
	 * doesn't check for a TLD (.com, .net, .mobi, .museum etc...) as that 
	 * list is too long.
	 * The purpose is to ensure your visitors are not harmed by invalid 
	 * markup, not that they get a functional domain name.
	 * 
	 * @param string $v Raw URL to validate
	 * @returns boolean
	 */
	public static function urlFilter( $v ) {
		
		$v = strtolower( $v );
		$out = false;
		
		if ( filter_var( $v, 
			FILTER_VALIDATE_URL, FILTER_FLAG_SCHEME_REQUIRED ) ) {
			
			/**
			 * PHP's native filter isn't restrictive enough.
			 */
			$out = ( bool ) preg_match( self::$options['rx_url'], $v );
			
			if ( $out ) {
				$out = self::decodeScrub( $v );
			}
		}
		
		return $out;
	}
	
	
	/**
	 * Regular expressions don't work well for validating HTML, but they
	 * really shine when evaluating text, so that's what we're doing here
	 * 
	 * @param string $v Attribute value (URL decoded up to 'scrub_depth' times)
	 * @returns boolean True if nothing unsavory was found.
	 */
	public static function decodeScrub( $v ) {
		if ( empty( $v ) ) {
			return true;
		}
		
		$depth		= self::$options['scrub_depth'];
		$i		= 1;
		$success	= false;
		$old		= '';
		
		
		while( $i <= $depth && !empty( $v ) ) {
			// Check for any JS and other shenanigans
			if (
				preg_match( self::$options['rx_xss'], $v ) || 
				preg_match( self::$options['rx_xss2'], $v ) || 
				preg_match( self::$options['rx_esc'], $v )
			) {
				$success = false;
				break;
			} else {
				$old	= $v;
				$v	= self::utfdecode( $v );
				
				/**
				 * We found the lowest decode level.
				 * No need to continue decoding.
				 */
				if ( $old === $v ) {
					$success = true;
					break;
				}
			}
			
			$i++;
		}
		
		
		/**
		 * If after decoding $depth times we still haven't reached a 
		 * stable string, then there's something wrong with it
		 */
		if ( $old !== $v && $i > $depth ) {
			return false;
		}
		
		return $success;
	}
	
	
	/**
	 * UTF-8 compatible URL decoding
	 * 
	 * @link http://www.php.net/manual/en/function.urldecode.php#79595
	 * @returns string
	 */
	public static function utfdecode( $v ) {
		$v = urldecode( $v );
		$v = preg_replace( '/%u([0-9a-f]{4})/i', '&#x\\1;', $v );
		return html_entity_decode( $v, ENT_QUOTES, 'UTF-8' );
	}
	
	
	/**
	 * HTML safe character entities in UTF-8
	 * 
	 * @returns string
	 */
	public static function entities( $v ) {
		return htmlentities( 
			iconv( 'UTF-8', 'UTF-8', $v ), 
			ENT_NOQUOTES | ENT_SUBSTITUTE, 
			'UTF-8'
		);
	}	
}

Usage is pretty simple:

$data = $_POST['body'];
$html = new Html();
$data = $html->filter( $data );
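The decode-and-rescan idea behind decodeScrub() generalizes to any language. As a minimal sketch, with abbreviated versions of the rx_xss/rx_xss2 checks and my own function names, it looks like this in Python:

```python
import re
from urllib.parse import unquote

# Abbreviated stand-in for the class's rx_xss / rx_xss2 checks
XSS = re.compile(r"<s(?:cript|tyle)|document\.|window\.|eval\(|\(\)", re.I)

def decode_scrub(value: str, depth: int = 6) -> bool:
    """URL-decode up to `depth` times, rescanning for payloads at each layer.
    Returns True only if a clean fixed point is reached within `depth` passes."""
    for _ in range(depth):
        if XSS.search(value):
            return False
        decoded = unquote(value)
        if decoded == value:    # fully decoded and nothing nasty found
            return True
        value = decoded
    return False                # still decoding after `depth` passes: reject
```

A double-encoded `%253Cscript%253E…` payload is caught on the second decoding pass, and a string that keeps decoding past the depth limit is rejected outright.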

Better spam filtering

The classic method for spam detection used to be Bayesian filtering, but honestly, filters using it are getting easier and easier to circumvent due to Bayesian poisoning.

And of course, Bayesian poisoning works because computers are stupid.

We’re so used to this stupidity, but when we’re deeply involved in a project we also tend to forget it. It’s almost like the omnipresent security cameras in cities; we know they’re there, we know they see everything, but after a while we lose our inhibition to publicly pick our noses (or pick our underwear out of our buttcrack) as long as no “real” humans are around. Why? We’re still being watched.

Back to the topic…

Use the stupidity

Fighting computer stupidity… is stupid. It doesn’t make sense to make computers understand what “spam” is in order to stop it, so we need a better way of quantifying a message: turning it into either a number or a simple searchable string.

If computers excel at anything, it’s raw number-crunching, particularly arithmetic. So it makes sense that rolling hashes are the way to go when it comes to turning a message into a basic number or string that can be scanned for commonalities. There are many ways to turn a sentence or paragraph into a rolling hash, but one that particularly caught my eye was the Rabin-Karp algorithm. I read that page a few times, but the words just blurred after a while (probably due to lack of coffee). Although this part did stick out…

A practical application of Rabin–Karp is detecting plagiarism. Given source material, Rabin–Karp can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.

And that makes sense because words can be written in any number of ways and spammers often use punctuation and other special characters to obfuscate what they’re pushing. They also add extra random text, which makes pure Bayesian filtering a problem.

People don’t read punctuation

We don’t need to when the gist of the message is all that matters. Periods, question marks, commas etc… represent clarity in language. For the purpose of pattern matching they’re also irrelevant, since most spam messages have no need for them as long as the product name gets through.

So let’s start with a function that does the following:

  • Strips punctuation
  • Removes line-breaks and special whitespace characters
  • Strips special characters (@#$%^&/[]{}-_ etc…)
  • Converts unicode accented characters into their base characters (á into a etc…)
/// <summary>
/// Helper function removes all punctuation and newline characters
/// </summary>
/// <param name="source">Original raw text</param>
/// <returns>Cleaned code</returns>
public string RemoveNoise(string source)
{
	if (String.IsNullOrEmpty(source))
		return String.Empty;

	StringBuilder sb = new StringBuilder();

	// Normalize the string and convert accents etc...
	char[] chars = source.Normalize(NormalizationForm.FormD)
		.Where(c => CharUnicodeInfo.GetUnicodeCategory(c)
			!= UnicodeCategory.NonSpacingMark).ToArray();

	// Append only characters to the StringBuilder
	for (int i = 0; i < chars.Length; i++)
	{
		sb.Append(
			(char.IsPunctuation(chars[i]) ||
			char.IsSeparator(chars[i]) ||
			char.IsControl(chars[i]) ||
			char.IsWhiteSpace(chars[i]) ||
			char.IsSymbol(chars[i])) ?
			' ' : chars[i]
			);
	}

	// Lowercase trimmed text
	return sb.ToString()
		.ToLowerInvariant()
		.Trim();
}
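For reference, the same cleanup is easy to sketch in Python with unicodedata; this is an illustrative port (my naming), not part of the original code:

```python
import unicodedata

def remove_noise(source: str) -> str:
    """Strip accents, then replace punctuation, symbols, separators and
    control characters with spaces; lowercase and trim the result."""
    if not source:
        return ""
    # NFD-decompose so accented letters split into base + combining mark,
    # then drop the combining marks (category Mn): 'a' + U+0301 -> 'a'
    decomposed = unicodedata.normalize("NFD", source)
    out = []
    for c in decomposed:
        cat = unicodedata.category(c)
        if cat == "Mn":
            continue
        # P* punctuation, S* symbols, Z* separators, C* control -> space
        out.append(" " if cat[0] in "PSZC" else c)
    return "".join(out).lower().strip()
```

So an obfuscated fragment like "Búy and Çiâlis today!" collapses to the plain words "buy and cialis today".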

 

We now need a way to calculate the distance of one word from another. The best method I’ve found so far to do this (and one that works with most languages) is the Damerau-Levenshtein distance algorithm. That Wikipedia article was once again a bit of a blur, but another part stood out to me…

…the original motivation was to measure distance between human misspellings to improve applications such as spell checkers…

So this would be used in those nifty “suggestions” for misspelled words. It sounds and looks awfully complicated, but essentially it calculates the minimum number of steps needed to turn, for example, the word “Hello” into the word “Goodbye”, so we can check the distance between each word in a sentence. For a spellcheck app it makes sense to pick only the candidates with the smallest number of steps, but for our purposes, we just need a consistent number for the distance between words.

This is the first step in building our “hash”.

/// <summary>
/// Damerau - Levenshtein distance algorithm
/// </summary>
/// <param name="source">Original text</param>
/// <param name="target">Checking text</param>
/// <param name="limit">Optional maximum word size</param>
/// <returns>Match distance between source and target</returns>
public int Distance(string source, string target, int limit = 50)
{
	if (String.Equals(source, target))
		return 0;

	if (String.IsNullOrEmpty(source) ||
		String.IsNullOrEmpty(target))
		return (source ?? "").Length + (target ?? "").Length;

	if (source.Length > target.Length)
	{
		var t = source;
		source = target;
		target = t;
	}

	if (target.Contains(source))
		return target.Length - source.Length;

	int sLen = source.Length;
	int tLen = target.Length;

	int[,] d = new int[sLen + 1, tLen + 1];

	// Load the matrix
	for (var i = 0; i <= sLen; i++)
		d[i, 0] = i;

	for (var i = 0; i <= tLen; i++)
		d[0, i] = i;

	for (var i = 1; i <= sLen; i++)
	{
		var min = limit;

		for (var j = 1; j <= tLen; j++)
		{
			var cost =
				(source[i - 1] == target[j - 1]) ? 0 : 1;

			d[i, j] =
				Math.Min(d[i - 1, j] + 1,
				Math.Min(d[i, j - 1] + 1,
				d[i - 1, j - 1] + cost));

			if (i > 1 &&
				j > 1 &&
				source[i - 1] == target[j - 2] &&
				source[i - 2] == target[j - 1])
				d[i, j] =
					Math.Min(d[i, j], d[i - 2, j - 2] + cost);

			if (d[i, j] < min)
				min = d[i, j];
		}

		if (min > limit)
			return int.MaxValue;
	}

	return (d[sLen, tLen] > limit)? int.MaxValue : d[sLen, tLen];
}
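Here's an illustrative Python port of the same algorithm, including the shortcuts (swap, containment, row-minimum cutoff) from the C# version:

```python
import sys

def distance(source: str, target: str, limit: int = 50) -> int:
    """Damerau-Levenshtein (optimal string alignment) distance,
    mirroring the shortcuts of the C# version above."""
    if source == target:
        return 0
    if not source or not target:
        return len(source or "") + len(target or "")
    if len(source) > len(target):
        source, target = target, source
    if source in target:                       # containment shortcut
        return len(target) - len(source)

    s_len, t_len = len(source), len(target)
    d = [[0] * (t_len + 1) for _ in range(s_len + 1)]
    for i in range(s_len + 1):
        d[i][0] = i
    for j in range(t_len + 1):
        d[0][j] = j

    for i in range(1, s_len + 1):
        row_min = limit
        for j in range(1, t_len + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Transposition of adjacent characters
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
            row_min = min(row_min, d[i][j])
        if row_min > limit:                    # whole row over the limit
            return sys.maxsize
    return sys.maxsize if d[s_len][t_len] > limit else d[s_len][t_len]
```

For example, "buy" to "viagra" is 6 steps, and the transposition rule makes "abcd" to "abdc" a single step.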

 

And of course we need to actually build our hash using the above two functions. We first clean the text, then build the hash by appending the distance between each word and its previous neighbor in the sentence.

/// <summary>
/// A simple distance aggregation function checks
/// the distance between each word in a block of text
/// and builds a rudimentary hash
/// </summary>
/// <param name="source">Source text</param>
/// <returns>Hash</returns>
public string RollingHash(string source)
{
	if (string.IsNullOrEmpty(source))
		return String.Empty;

	StringBuilder sb = new StringBuilder();

	string[] data = RemoveNoise(source)
		.Split(new char[] { ' ' },
		StringSplitOptions.RemoveEmptyEntries);

	// Placeholder to check distance with current string
	string previous = "";

	foreach (string current in data)
	{
		sb.Append(Distance(previous, current).ToString());
		previous = current;
	}

	return sb.ToString();
}
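Putting the three pieces together, a condensed, self-contained Python sketch (my function names; the helper ports are simplified but equivalent for this purpose) reproduces the hashes discussed next:

```python
import unicodedata

def _clean(text):
    """RemoveNoise: strip accents, turn punctuation/symbols into spaces."""
    nfd = unicodedata.normalize("NFD", text)
    chars = [" " if unicodedata.category(c)[0] in "PSZC" else c
             for c in nfd if unicodedata.category(c) != "Mn"]
    return "".join(chars).lower().strip()

def _dist(a, b):
    """Damerau-Levenshtein (optimal string alignment) distance."""
    if a == b:
        return 0
    if not a or not b:
        return len(a) + len(b)
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[len(a)][len(b)]

def rolling_hash(source):
    """Concatenate each word's distance from its predecessor."""
    out, prev = [], ""
    for word in _clean(source).split():
        out.append(str(_dist(prev, word)))
        prev = word
    return "".join(out)
```

Running `rolling_hash("Buy Viagra and Cialis today")` yields "36556", the hash discussed below.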

How does this work

Let’s take a typical spam sentence :

Buy Viagra and Cialis today

If we use the above RollingHash function on this, we end up with the hash : 36556.

Now let’s throw this a curve ball. We all know how much spammers love to obfuscate their products with nonsense padding and odd characters. Let’s see what one of those messages may look like…

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Búy viagrÆ and Çiâlis today non tincidunt ipsum porta vel.

Turning this into a hash produces the following hash : 5455391194655647755
Notice the common block after the “4”, matching the hash of the original, unobfuscated string above.

Let’s take another example :

Vestibulum quis massa turpis. Ut buy ..viägra.. and *&&ciÅlis!! today vel laoreet dolor. Integer euismod, lectus a buy {[ViÃgRa@$]]. and***ciálÏS*** TôDaÿ faucibus congue.

After turning the above into a hash, we’re still able to pick out the original pattern in : 1094652655656667763655687. The two instances of “Buy Viagra and Cialis today” can be inferred even with the ridiculous amount of obfuscation.

Here’s a side-by-side example that shows the pattern match more clearly.

The first number in the hash doesn't match because there's no word before "buy", but the rest do.
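A quick sanity check on the hashes quoted above, treating the match as a plain substring search:

```python
spam_hash = "36556"                     # "Buy Viagra and Cialis today"
obfuscated_1 = "5455391194655647755"
obfuscated_2 = "1094652655656667763655687"

# The leading digit differs (a word now precedes "buy"), so match the tail
fragment = spam_hash[1:]                # "6556"
print(fragment in obfuscated_1)         # found once
print(obfuscated_2.count(fragment))     # found twice
```

This is exactly why a hash-substring scan is cheap: spotting the pattern is a plain string search, no per-word comparison needed.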

 

Instead of checking word by word for commonalities with spam text, which is what most Bayesian filters do, it would be more practical to convert the entire block of text into a hash and search that instead. For the discussion forum I’m writing, I’m thinking of creating a hash of each post and storing it in the database along with the text content so filtering would be a lot easier.

Update

I overlooked something in the Rabin-Karp algorithm: it doesn’t just treat each word as a hash, it also expects each hash to have good collision resistance. For this to work more effectively, we need something more than just the distance from one word to its neighbor. If we also include the lengths of each word in the hash along with the distance between them, we instantly get much better collision resistance.

E.G. By changing the StringBuilder Append in RollingHash…

foreach (string current in data)
{
	sb.Append(
		QuickHash(previous, current, Distance(previous, current))
		);

	previous = current;
}

Where QuickHash is as follows…

private static string QuickHash(string c, string s, int d)
{
	return String.Concat(c.Length.ToString(),
		s.Length.ToString(), d.ToString());
}
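QuickHash simply concatenates the two word lengths with their distance. A one-line Python equivalent (illustrative naming):

```python
def quick_hash(prev: str, current: str, dist: int) -> str:
    """Concatenate previous-word length, current-word length and distance."""
    return str(len(prev)) + str(len(current)) + str(dist)
```

For "buy" to "viagra" at distance 6 this yields "366" instead of the bare "6", so each word pair contributes three digits to the signature instead of one.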

This will not only make the hash signature of the search text longer, it will also make it less likely to collide.

Hash matching improved