Whitelist HTML sanitizing with PHP

The following is a single class that performs comprehensive HTML input filtering with minimal dependencies (basically only Tidy). It should work in PHP 5.4+, since the ENT_SUBSTITUTE flag used in entities() isn’t available in 5.3. This will be included in my forum script as the default filter.

This version catches URL-encoded XSS attempts with deep attribute inspection (to a decoding depth of 6 by default). It also scrubs all non-whitelisted attributes and tags, and converts surviving attribute data into HTML entities.
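To picture the decoding depth, here is a minimal sketch of just the decoding loop, assuming plain urldecode() is enough. The function name decodeToDepth() is made up for illustration and isn't part of the class below, which also runs its XSS patterns at every level.

```php
<?php
/**
 * Hypothetical helper (not part of the Html class): URL decode a value 
 * repeatedly until it stops changing or the depth limit is exceeded.
 *
 * @param string $value Raw attribute value
 * @param int $maxDepth Maximum number of decoding passes
 * @return string|false Fully decoded value, or false if still encoded
 */
function decodeToDepth( $value, $maxDepth = 6 ) {
	for ( $i = 0; $i < $maxDepth; $i++ ) {
		$decoded = urldecode( $value );
		
		// Nothing changed, so this is the lowest level
		if ( $decoded === $value ) {
			return $value;
		}
		$value = $decoded;
	}
	
	// Out of passes: only accept the value if it's finally stable
	return ( urldecode( $value ) === $value ) ? $value : false;
}

// "<script>" encoded twice ( %3C became %253C )
var_dump( decodeToDepth( '%253Cscript%253E' ) );	// string(8) "<script>"

// Encoded three times, but only two passes allowed
var_dump( decodeToDepth( '%25253C', 2 ) );		// bool(false)
```

Failing outright when the limit is hit is the point: a value that still changes after that many passes is far more likely to be hostile than legitimate.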

In addition, it will attempt to catch directory traversal attempts ( ../ or \\ or /~/ etc… ) which may give access to restricted areas of a site. Your web server should deny access to these URLs by default; however, that won’t stop someone from posting links pointing elsewhere. This will reduce your liability should a user include such a link in your site content.
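As a rough illustration of that kind of check, the sketch below flags the three sequences named above. The helper name looksLikeTraversal() and its exact pattern are assumptions for this example; the class itself uses its own 'rx_esc' option, which is broader.

```php
<?php
/**
 * Hypothetical helper: true if the value contains a common 
 * directory traversal or escaping sequence.
 *
 * @param string $value Decoded URL or attribute value
 * @return boolean
 */
function looksLikeTraversal( $value ) {
	// Two dots in a row, any backslash, or /~/
	return 1 === preg_match( '#\.\.|\\\\|/~/#', $value );
}

var_dump( looksLikeTraversal( 'http://example.com/../etc/passwd' ) );	// bool(true)
var_dump( looksLikeTraversal( 'http://example.com/a/b.html' ) );	// bool(false)
```

A pattern this strict will also reject harmless values such as a file called "notes..txt". For user-posted links that trade-off is usually acceptable.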

You can post source code within <code> tags and it will be encoded by default.
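Encoding the contents of <code> tags can be sketched on its own like this. The helper name encodeCodeBlocks() is hypothetical and it uses htmlspecialchars() for brevity, where the class below goes through its own entities() method.

```php
<?php
/**
 * Hypothetical helper: entity-encode everything between <code> tags 
 * so the markup inside is displayed instead of parsed.
 *
 * @param string $html Raw user input
 * @return string
 */
function encodeCodeBlocks( $html ) {
	return preg_replace_callback(
		// Lazy match so two blocks aren't merged, 's' so blocks 
		// can span several lines
		'#<code>(.*?)</code>#is',
		function ( $m ) {
			return '<code>' . 
				htmlspecialchars( $m[1], ENT_NOQUOTES, 'UTF-8' ) . 
				'</code>';
		},
		$html
	);
}

echo encodeCodeBlocks( 'Try <code><b>bold</b></code> here' );
// Try <code>&lt;b&gt;bold&lt;/b&gt;</code> here
```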

<?php

/**
 * HTML parsing, filtering and sanitization
 * This class depends on Tidy, which has been bundled with PHP since 
 * version 5 but may need to be enabled on your server
 *
 * @author Eksith Rodrigo <reksith at gmail.com>
 * @license http://opensource.org/licenses/ISC ISC License
 * @version 0.2
 */

class Html {
	
	/**
	 * @var array HTML filtering options
	 */
	public static $options = array( 
		'rx_url'	=> // URLs over 255 chars can cause problems
			'~^(http|ftp)(s)?\:\/\/((([a-z|0-9|\-]{1,25})(\.)?){2,7})($|/.*$){4,255}$~i',
		
		'rx_js'		=> // Questionable attributes
			'/((java)?script|eval|document)/ism',
		
		'rx_xss'	=> // XSS (<style> can also be a vector. Stupid IE 6!)
			'/(<(s(?:cript|tyle)).*?)/ism',
		
		'rx_xss2'	=> // More potential XSS
			'/(document\.|window\.|eval\(|\(\))/ism',
		
		'rx_esc'	=> // Directory traversal/escaping/injection
			'/(\\~\/|\.\.|\\\\|\-\-)/sm'	,
		
		'scrub_depth'	=> 6, // URL Decoding depth (fails on exceeding this)
		
		'nofollow'	=> true // Set rel='nofollow' on all links

	);
	
	/**
	 * @var array List of HTML Tidy output settings
	 * @link http://tidy.sourceforge.net/docs/quickref.html
	 */
	private static $tidy = array(
		// Preserve whitespace inside tags
		'add-xml-space'			=> true,
		
		// Remove proprietary markup (E.G. og:tags)
		'bare'				=> true,
		
		// More proprietary markup
		'drop-proprietary-attributes'	=> true,
		
		// Remove blank (E.G. <p></p>) paragraphs
		'drop-empty-paras'		=> true,
		
		// Wraps bare text in <p> tags
		'enclose-text'			=> true,
		
		// Removes illegal/invalid characters in URIs
		'fix-uri'			=> true,
		
		// Removes <!-- Comments -->
		'hide-comments'			=> true,
		
		// Removing indentation saves storage space
		'indent'			=> false,
		
		// Combine individual formatting styles
		'join-styles'			=> true,
		
		// Converts <i> to <em> & <b> to <strong>
		'logical-emphasis'		=> true,
		
		// Byte Order Mark isn't really needed
		'output-bom'			=> false,
		
		// Ensure UTF-8 characters are preserved
		'output-encoding'		=> 'utf8',
		
		// W3C standards compliant markup
		'output-xhtml'			=> true,
		
		// Had some unexpected behavior with this
		//'markup'			=> true,

		// Merge multiple <span> tags into one		
		'merge-spans'			=> true,
		
		// Only outputs <body> (<head> etc... not needed)
		'show-body-only'		=> true,
		
		// Removing empty lines saves storage
		'vertical-space'		=> false,
		
		// Wrapping tags not needed (saves bandwidth)
		'wrap'				=> 0
	);
	
	
	/**
	 * @var array Whitelist of tags. Trim or expand these as necessary
	 * @example 'tag' => array( of, allowed, attributes )
	 */
	private static $whitelist = array(
		'p'		=> array( 'style', 'class', 'align' ),
		'div'		=> array( 'style', 'class', 'align' ),
		'span'		=> array( 'style', 'class' ),
		'br'		=> array( 'style', 'class' ),
		'hr'		=> array( 'style', 'class' ),
		
		'h1'		=> array( 'style', 'class' ),
		'h2'		=> array( 'style', 'class' ),
		'h3'		=> array( 'style', 'class' ),
		'h4'		=> array( 'style', 'class' ),
		'h5'		=> array( 'style', 'class' ),
		'h6'		=> array( 'style', 'class' ),
		
		'strong'	=> array( 'style', 'class' ),
		'em'		=> array( 'style', 'class' ),
		'u'		=> array( 'style', 'class' ),
		'strike'	=> array( 'style', 'class' ),
		'del'		=> array( 'style', 'class' ),
		'ol'		=> array( 'style', 'class' ),
		'ul'		=> array( 'style', 'class' ),
		'li'		=> array( 'style', 'class' ),
		'code'		=> array( 'style', 'class' ),
		'pre'		=> array( 'style', 'class' ),
		
		'sup'		=> array( 'style', 'class' ),
		'sub'		=> array( 'style', 'class' ),
		
		// Took out 'rel' and 'title', because we're using those below
		'a'		=> array( 'style', 'class', 'href' ),
		
		'img'		=> array( 'style', 'class', 'src', 'height', 
					  'width', 'alt', 'longdesc', 'title', 
					  'hspace', 'vspace' ),
		
		'table'		=> array( 'style', 'class', 'border-collapse', 
					  'cellspacing', 'cellpadding' ),
					
		'thead'		=> array( 'style', 'class' ),
		'tbody'		=> array( 'style', 'class' ),
		'tfoot'		=> array( 'style', 'class' ),
		'tr'		=> array( 'style', 'class' ),
		'td'		=> array( 'style', 'class', 
					'colspan', 'rowspan' ),
		'th'		=> array( 'style', 'class', 'scope', 'colspan', 
					  'rowspan' ),
		
		'q'		=> array( 'style', 'class', 'cite' ),
		'cite'		=> array( 'style', 'class' ),
		'abbr'		=> array( 'style', 'class' ),
		'blockquote'	=> array( 'style', 'class' ),
		
		// Stripped out
		'body'		=> array()
	);
	
	
	
	/**#@+
	 * HTML Filtering
	 */
	
	
	/**
	 * Convert content between code blocks into code tags
	 * 
	 * @param $val string Value to encode to entities
	 */
	protected function escapeCode( $val ) {
		
		if ( is_array( $val ) ) {
			$out = self::entities( $val[1] );
			return '<code>' . $out . '</code>';
		}
		
	}
	
	
	/**
	 * Convert an unformatted text block to paragraphs
	 * 
	 * @link http://stackoverflow.com/a/2959926
	 * @param $val string Filter variable
	 */
	protected function makeParagraphs( $val ) {
		
		/**
		 * Convert newlines to linebreaks first
		 * This is why PHP both sucks and is awesome at the same time
		 */
		$out = nl2br( $val );
		
		/**
		 * Turn consecutive <br>s to paragraph breaks and wrap the 
		 * whole thing in a paragraph
		 */
		$out = '<p>' . preg_replace('#(?:<br\s*/?>\s*?){2,}#', 
			'</p><p>', $out ) . '</p>';
		
		/**
		 * Remove <br> abnormalities
		 */
		$out = preg_replace( '#<p>(\s*<br\s*/?>)+#', '</p><p>', $out );
		$out = preg_replace( '#<br\s*/?>(\s*</p>)+#', '<p></p>', $out );
		
		return $out;
	}
	
	
	/**
	 * Filters HTML content through whitelist of tags and attributes
	 * 
	 * @param $val string Value filter
	 */
	public function filter( $val ) {
		
		if ( !isset( $val ) || empty( $val ) ) {
			return '';
		}
		
		/**
		 * Escape the content of any code blocks before we parse HTML or 
		 * they will get stripped
		 */
		$out	= preg_replace_callback( "/\<code\>(.*?)\<\/code\>/isu", 
				array( $this, 'escapeCode' ) , $val
			);
		
		/**
		 * Convert to paragraphs and begin
		 */
		$out	= $this->makeParagraphs( $out );
		$dom	= new DOMDocument();
		
		/**
		 * Hide parse warnings since we'll be cleaning the output anyway
		 */
		$err	= libxml_use_internal_errors( true );
		
		$dom->loadHTML( $out );
		$dom->encoding = 'utf-8';
		
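		$body	= $dom->getElementsByTagName( 'body' )->item( 0 );
		
		// Start with an empty list so the loop below is safe when 
		// nothing gets flagged
		$badTags = array();
		$this->cleanNodes( $body, $badTags );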
		
		/**
		 * Iterate through bad tags found above and convert them to 
		 * harmless text
		 */
		foreach ( $badTags as $node ) {
			if( $node->nodeName != "#text" ) {
				$ctext = $dom->createTextNode( 
						$dom->saveHTML( $node )
					);
				$node->parentNode->replaceChild( 
					$ctext, $node 
				);
			}
		}
		
		
		/**
		 * Filter the junk and return only the contents of the body tag
		 */
		$out = tidy_repair_string( 
				$dom->saveHTML( $body ), 
				self::$tidy
			);
		
		
		/**
		 * Reset errors
		 */
		libxml_clear_errors();
		libxml_use_internal_errors( $err );
		
		return $out;
	}
	
	
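	/**
	 * Removes an attribute, then reinstates it only if it's whitelisted 
	 * and its value passes filtering
	 * 
	 * @param DOMElement $node Element that owns the attribute
	 * @param DOMAttr $attr Attribute being inspected
	 * @param array $goodAttributes Attribute whitelist for this tag
	 * @param string $href Receives the last URL value that passed validation
	 */
	protected function cleanAttributeNode( 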
		&$node, 
		&$attr, 
		&$goodAttributes, 
		&$href 
	) {
		/**
		 * Why the devil is an attribute name called "nodeName"?!
		 */
		$name = $attr->nodeName;
		
		/**
		 * And an attribute value is still "nodeValue"?? Damn you PHP!
		 */
		$val = $attr->nodeValue;
		
		/**
		 * Default action is to remove the attribute completely
		 * It's reinstated only if it's allowed and only after 
		 * it's filtered
		 */
		$node->removeAttributeNode( $attr );
		
		if ( in_array( $name, $goodAttributes ) ) {
			
			switch ( $name ) {
				
				/**
				 * Validate URL attribute types
				 */
				case 'url':
				case 'src':
				case 'href':
				case 'longdesc':
					if ( self::urlFilter( $val ) ) {
						$href = $val;
					} else {
						$val = '';
					}
					break;
				
				/**
				 * Everything else gets default scrubbing
				 */
				default:
					if ( self::decodeScrub( $val ) ) {
						$val = self::entities( $val );
					} else {
						$val = '';
					}
			}
			
			if ( '' !== $val ) {
				$node->setAttribute( $name, $val );
			}
		}
	}
	
	
	/**
	 * Modify links to show the linked domain and file name in the title.
	 * Also adds rel='nofollow' when that option is enabled.
	 */
	protected static function linkAttributes( &$node, $href ) {
		$parsed	= parse_url( $href );
		
		// Nothing useful to show without a host
		if ( false === $parsed || empty( $parsed['host'] ) ) {
			return;
		}
		
		$title	= $parsed['host'];
		
		/**
		 * The path is optional (E.G. http://example.com), so only 
		 * append the file name when there is one
		 */
		if ( !empty( $parsed['path'] ) ) {
			$f	= pathinfo( $parsed['path'] );
			$title	.= ' ( /' . $f['basename'] . ' )';
		}
		
		$node->setAttribute( 'title', $title );
		
		if ( self::$options['nofollow'] ) {
			$node->setAttribute( 'rel', 'nofollow' );
		}
	}
	
	
	/**
	 * Iterate through each tag and add non-whitelisted tags to the 
	 * bad list. Also filter the attributes and remove non-whitelisted ones.
	 * 
	 * @param htmlNode $node Current HTML node
	 * @param array $badTags Cumulative list of tags for deletion
	 */
	protected function cleanNodes( $node, &$badTags = array() ) {
		
		if ( array_key_exists( $node->nodeName, self::$whitelist ) ) {
			
			if ( $node->hasAttributes() ) {
				
				/**
				 * Prepare for href attribute which gets special 
				 * treatment
				 */
				$href = '';
				
				/**
				 * Filter through attribute whitelist for this 
				 * tag
				 */
				$goodAttributes = 
					self::$whitelist[$node->nodeName];
				
				
				/**
				 * Check out each attribute in this tag
				 */
				foreach ( 
					iterator_to_array( $node->attributes ) 
					as $attr ) {
					$this->cleanAttributeNode( 
						$node, $attr, $goodAttributes, 
						$href
					);
				}
				
				/**
				 * This is a link. Treat it accordingly
				 */
				if ( 'a' === $node->nodeName && '' !== $href ) {
					self::linkAttributes( $node, $href );
				}
				
			} // End if( $node->hasAttributes() )
			
			/**
			 * If we have childnodes, recursively call cleanNodes 
			 * on those as well
			 */
			if ( $node->childNodes ) {
				foreach ( $node->childNodes as $child ) {
					$this->cleanNodes( $child, $badTags );
				}
			}
			
		} else {
			
			/**
			 * Not in whitelist so no need to check its child nodes. 
			 * Simply add to array of nodes pending deletion.
			 */
			$badTags[] = $node;
			
		} // End if array_key_exists( $node->nodeName, self::$whitelist )
		
	}
	
	/**#@-*/
	
	
	/**
	 * Returns true if the URL passed value is harmless.
	 * This regex takes into account Unicode domain names however, it 
	 * doesn't check for TLD (.com, .net, .mobi, .museum etc...) as that 
	 * list is too long.
	 * The purpose is to ensure your visitors are not harmed by invalid 
	 * markup, not that they get a functional domain name.
	 * 
	 * @param string $v Raw URL to validate
	 * @returns boolean
	 */
	public static function urlFilter( $v ) {
		
		$v = strtolower( $v );
		$out = false;
		
		if ( filter_var( $v, 
			FILTER_VALIDATE_URL, FILTER_FLAG_SCHEME_REQUIRED ) ) {
			
			/**
			 * PHP's native filter isn't restrictive enough.
			 */
			if ( preg_match( self::$options['rx_url'], $v ) ) {
				$out = true;
			} else {
				$out = false;
			}
			
			if ( $out ) {
				$out = self::decodeScrub( $v );
			}
		} else {
			$out = false;
		}
		
		return $out;
	}
	
	
	/**
	 * Regular expressions don't work well when used for validating HTML.
	 * They really shine when evaluating text so that's what we're doing here.
	 * The number of decoding passes comes from the 'scrub_depth' option.
	 * 
	 * @param string $v Attribute value to inspect
	 * @returns boolean True if nothing unsavory was found.
	 */
	public static function decodeScrub( $v ) {
		if ( empty( $v ) ) {
			return true;
		}
		
		$depth		= self::$options['scrub_depth'];
		$i		= 1;
		$success	= false;
		$old		= '';
		
		
		while( $i <= $depth && !empty( $v ) ) {
			// Check for any JS and other shenanigans
			if (
				preg_match( self::$options['rx_xss'], $v ) || 
				preg_match( self::$options['rx_xss2'], $v ) || 
				preg_match( self::$options['rx_esc'], $v )
			) {
				$success = false;
				break;
			} else {
				$old	= $v;
				$v	= self::utfdecode( $v );
				
				/**
				 * We found the lowest decode level.
				 * No need to continue decoding.
				 */
				if ( $old === $v ) {
					$success = true;
					break;
				}
			}
			
			$i++;
		}
		
		
		/**
		 * If we ran out of decoding passes and the value was still 
		 * changing, then there's something still wrong
		 */
		if ( $old !== $v && $i > $depth ) {
			return false;
		}
		
		return $success;
	}
	
	
	/**
	 * UTF-8 compatible URL decoding
	 * 
	 * @link http://www.php.net/manual/en/function.urldecode.php#79595
	 * @returns string
	 */
	public static function utfdecode( $v ) {
		$v = urldecode( $v );
		$v = preg_replace( '/%u([0-9a-f]{3,4})/i', '&#x\\1;', $v );
		return html_entity_decode( $v, ENT_COMPAT, 'UTF-8' );
	}
	
	
	/**
	 * HTML safe character entities in UTF-8
	 * 
	 * @returns string
	 */
	public static function entities( $v ) {
		return htmlentities( 
			iconv( 'UTF-8', 'UTF-8', $v ), 
			ENT_NOQUOTES | ENT_SUBSTITUTE, 
			'UTF-8'
		);
	}	
}

Usage is pretty simple:

$data = $_POST['body'];
$html = new Html();
$data = $html->filter( $data );

AntiXss 4.2 Breaks everything

This is one of those situations where none of your available options are good and your least harmful alternative is to shoot yourself in the foot at a slightly odd angle so as to only lose the little toe and not the big one.

All of this happened when Microsoft revealed in January that their AntiXss library, now known as the Microsoft Web Protection Library (never seen a more ironic combination of words), had a vulnerability and, like all obedient drones, we must update immediately to avoid shooting ourselves in our big toe. The problem is that updating will cause you to lose your little toe.

You see, the new library BREAKS EVERYTHING and eats your children.

Update 11/14/2013:
A new HTML sanitizer is now available for PHP.

I WILL EAT ALL YOUR TAGS!!!

I think the problem is best described by someone who left a comment at the project discussion board.

I was using an old version of Anti-XSS with a rich text editor (CkEditor). It was working very great. But when upgrading to latest version, I discovered the new sanitized is way too much aggressive and is removing almost everything “rich” in the rich editor, specially colors, backgrounds, font size, etc… It’s a disaster for my CMS!

Is there any migration path I can use to keep some of the features of the rich text editor and having at least minimal XSS protection ?

Lovely eh?

Here’s the response from the coordinator.

CSS will always be stripped now – it’s too dangerous, but in other cases it is being too greedy, dropping hrefs from a tags for example. That is being looked at.

I know this may be a strange idea to comprehend for the good folks who developed the library, but you see in the civilized world, many people tend to use WYSIWYG in their projects so as to not burden their users with tags. These days more people are familiar with rudimentary HTML, but when you just want to quickly make a post, comment or otherwise share something, it’s nice to know there’s an editor that can accommodate rich formatting. This is especially true on a mobile device, where switching from text to special characters for tags is still annoying.

Those WYSIWYGs invariably use CSS and inline styles to accomplish this rich formatting, thereby making your assertion ridiculous and this library now completely impractical.

A very quick test on the 4.2 Sanitizer shows that it totally removes strong tags, h1 tags, section tags and as mentioned above strips href attributes from anchor tags. At this rate the output will soon be string.Empty. I hope that the next version will allow basic markup tags and restore the href to anchors.

So in other words, AntiXss is now like an antidepressant. You’ll feel a lot better after taking it, but you may end up killing yourself.

And that’s not all…

I would have kept my mouth shut about this even though I’ve had my doubts about depending on the library over something DIY, but since I work with a bunch of copycat monkeys, I have to use whatever everyone else deems worthy of being included in a project (common sense be damned). I thought, surely there would at least be the older versions available, but no…

It’s company policy I’m afraid. The source will remain though, so if you desperately wanted you could download and compile your own versions of older releases.

Of course, I lost my temper at that. I’m forced to use this library, and one of the devs went ahead and upgraded without backing up the old version or finding out exactly how the vulnerability would affect us, so I now had to go treasure hunting across three computers to find 4.0 after just getting home.

AntiXss 4.2 is stupid and so is Microsoft.

Here’s my current workaround until MS comes up with a usable alternative. I’m also using the HtmlAgilityPack which at the moment hasn’t contracted rabies, thankfully, and the 4.0 library.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace Arcturus.Helpers
{
	/// <summary>
	/// This is an HTML cleanup utility combining the benefits of the
	/// HtmlAgilityPack to parse raw HTML and the AntiXss library
	/// to remove potentially dangerous user input.
	///
	/// Additionally it uses a list created by Robert Beal to limit
	/// the number of allowed tags and attributes to a sensible level
	/// </summary>
	public sealed class HtmlUtility
	{
		private static volatile HtmlUtility _instance;
		private static object _root = new object();

		private HtmlUtility() { }

		public static HtmlUtility Instance
		{
			get
			{
				if (_instance == null)
					lock (_root)
						if (_instance == null)
							_instance = new HtmlUtility();

				return _instance;
			}
		}

		// Original list courtesy of Robert Beal :
		// http://www.robertbeal.com/

		private static readonly Dictionary<string, string[]> ValidHtmlTags =
			new Dictionary<string, string[]>
        {
            {"p", new string[]          {"style", "class", "align"}},
            {"div", new string[]        {"style", "class", "align"}},
            {"span", new string[]       {"style", "class"}},
            {"br", new string[]         {"style", "class"}},
            {"hr", new string[]         {"style", "class"}},
            {"label", new string[]      {"style", "class"}},

            {"h1", new string[]         {"style", "class"}},
            {"h2", new string[]         {"style", "class"}},
            {"h3", new string[]         {"style", "class"}},
            {"h4", new string[]         {"style", "class"}},
            {"h5", new string[]         {"style", "class"}},
            {"h6", new string[]         {"style", "class"}},

            {"font", new string[]       {"style", "class",
				"color", "face", "size"}},
            {"strong", new string[]     {"style", "class"}},
            {"b", new string[]          {"style", "class"}},
            {"em", new string[]         {"style", "class"}},
            {"i", new string[]          {"style", "class"}},
            {"u", new string[]          {"style", "class"}},
            {"strike", new string[]     {"style", "class"}},
            {"ol", new string[]         {"style", "class"}},
            {"ul", new string[]         {"style", "class"}},
            {"li", new string[]         {"style", "class"}},
            {"blockquote", new string[] {"style", "class"}},
            {"code", new string[]       {"style", "class"}},
			{"pre", new string[]       {"style", "class"}},

            {"a", new string[]          {"style", "class", "href", "title"}},
            {"img", new string[]        {"style", "class", "src", "height",
				"width", "alt", "title", "hspace", "vspace", "border"}},

            {"table", new string[]      {"style", "class"}},
            {"thead", new string[]      {"style", "class"}},
            {"tbody", new string[]      {"style", "class"}},
            {"tfoot", new string[]      {"style", "class"}},
            {"th", new string[]         {"style", "class", "scope"}},
            {"tr", new string[]         {"style", "class"}},
            {"td", new string[]         {"style", "class", "colspan"}},

            {"q", new string[]          {"style", "class", "cite"}},
            {"cite", new string[]       {"style", "class"}},
            {"abbr", new string[]       {"style", "class"}},
            {"acronym", new string[]    {"style", "class"}},
            {"del", new string[]        {"style", "class"}},
            {"ins", new string[]        {"style", "class"}}
        };

		/// <summary>
		/// Takes raw HTML input and cleans against a whitelist
		/// </summary>
		/// <param name="source">Html source</param>
		/// <returns>Clean output</returns>
		public string SanitizeHtml(string source)
		{
			HtmlDocument html = GetHtml(source);
			if (html == null) return String.Empty;

			// All the nodes
			HtmlNode allNodes = html.DocumentNode;

			// Select whitelist tag names
			string[] whitelist = (from kv in ValidHtmlTags
								  select kv.Key).ToArray();

			// Scrub tags not in whitelist
			CleanNodes(allNodes, whitelist);

			// Filter the attributes of the remaining
			foreach (KeyValuePair<string, string[]> tag in ValidHtmlTags)
			{
				IEnumerable<HtmlNode> nodes = (from n in allNodes.DescendantsAndSelf()
											   where n.Name == tag.Key
											   select n);

				// No nodes? Skip.
				if (nodes == null) continue;

				foreach (var n in nodes)
				{
					// No attributes? Skip.
					if (!n.HasAttributes) continue;

					// Get all the allowed attributes for this tag
					HtmlAttribute[] attr = n.Attributes.ToArray();
					foreach (HtmlAttribute a in attr)
					{
						if (!tag.Value.Contains(a.Name))
						{
							a.Remove(); // Attribute wasn't in the whitelist
						}
						else
						{
							// *** New workaround. This wasn't necessary with the old library
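							if (a.Name == "href" || a.Name == "src") {
								// Strip whitespace tricks (E.G. a line break inside "javascript:")
								string v = (a.Value ?? "")
									.Replace("\r", "").Replace("\n", "").Replace("\t", "");

								// Reject the whole value instead of deleting substrings.
								// A single Replace() pass turns "javajavascriptscript:" back
								// into "javascript:" and the old check was case sensitive
								bool bad =
									v.IndexOf("javascript", StringComparison.OrdinalIgnoreCase) >= 0 ||
									v.IndexOf("eval", StringComparison.OrdinalIgnoreCase) >= 0;
								a.Value = bad ? "" : v;
							}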
							else if (a.Name == "class" || a.Name == "style")
							{
								a.Value =
									Microsoft.Security.Application.Encoder.CssEncode(a.Value);
							}
							else
							{
								a.Value =
									Microsoft.Security.Application.Encoder.HtmlAttributeEncode(a.Value);
							}
						}
					}
				}
			}

			// *** New workaround (DO NOTHING HAHAHA! Fingers crossed)
			return allNodes.InnerHtml;

			// *** Original code below

			/*
			// Anything we missed will get stripped out
			return
				Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(allNodes.InnerHtml);
			 */
		}

		/// <summary>
		/// Takes a raw source and removes all HTML tags
		/// </summary>
		/// <param name="source"></param>
		/// <returns></returns>
		public string StripHtml(string source)
		{
			source = SanitizeHtml(source);

			// No need to continue if we have no clean Html
			if (String.IsNullOrEmpty(source))
				return String.Empty;

			HtmlDocument html = GetHtml(source);
			StringBuilder result = new StringBuilder();

			// For each node, extract only the innerText
			foreach (HtmlNode node in html.DocumentNode.ChildNodes)
				result.Append(node.InnerText);

			return result.ToString();
		}

		/// <summary>
		/// Recursively delete nodes not in the whitelist
		/// </summary>
		private static void CleanNodes(HtmlNode node, string[] whitelist)
		{
			if (node.NodeType == HtmlNodeType.Element)
			{
				if (!whitelist.Contains(node.Name))
				{
					node.ParentNode.RemoveChild(node);
					return; // We're done
				}
			}

			if (node.HasChildNodes)
				CleanChildren(node, whitelist);
		}

		/// <summary>
		/// Apply CleanNodes to each of the child nodes
		/// </summary>
		private static void CleanChildren(HtmlNode parent, string[] whitelist)
		{
			for (int i = parent.ChildNodes.Count - 1; i >= 0; i--)
				CleanNodes(parent.ChildNodes[i], whitelist);
		}

		/// <summary>
		/// Helper function that returns an HTML document from text
		/// </summary>
		private static HtmlDocument GetHtml(string source)
		{
			HtmlDocument html = new HtmlDocument();
			html.OptionFixNestedTags = true;
			html.OptionAutoCloseOnEnd = true;
			html.OptionDefaultStreamEncoding = Encoding.UTF8;

			html.LoadHtml(source);

			// Encode any code blocks independently so they won't
			// be stripped out completely when we do a final cleanup
			foreach (var n in html.DocumentNode.DescendantNodesAndSelf())
			{
				if (n.Name == "code") {
					//** Code tag attribute vulnerability fix 28-9-12 (thanks to Natd)
					HtmlAttribute[] attr = n.Attributes.ToArray();
					foreach (HtmlAttribute a in attr) {
						if (a.Name != "style" && a.Name != "class")  { a.Remove(); }
					} //** End fix
					n.InnerHtml =
						Microsoft.Security.Application.Encoder.HtmlEncode(n.InnerHtml);
				}
			}

			return html;
		}
	}
}

This is a singleton class, so you need to call Instance to initiate.

E.G.

HtmlUtility util = HtmlUtility.Instance;

7:40AM… Bedtime!

Update : September 28.

Natd discovered a vulnerability in this code that allowed onclick attributes to be added to the code tag itself. Fixed.

ASP.Net BBCode (C#)

Update

This code has now been superseded by a better alternative that will allow you to use an off-the-shelf WYSIWYG and still allow custom tags.

This problem comes up if you find yourself creating a forum from scratch or implementing some sort of comment system and want to make sure you can introduce some formatting functionality without compromising security.

Well, there are plenty of regular expressions examples out there, but few deal directly with BBCode and of those, most don’t go beyond the basic Bold, Italic, and Strike formatting plus HTML links, images etc…

This example not only formats the above basic stuff, but also does quotes, alignment, Google search links, Wikipedia article links, as well as embedded videos for several video sharing sites. You can always add more tags by following the same pattern. All the content is formatted into paragraphs (<p>) for proper validation. It checks for nested quotes up to a specified depth.
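The quote handling isn't part of the excerpt further down, so the following is only a guess at how a depth limit on nested quotes could work, written in PHP to match the sanitizer above. The function name convertQuotes() and the <blockquote> output are assumptions, and only the plain [quote] form is handled.

```php
<?php
/**
 * Hypothetical helper: convert [quote] tags to <blockquote>, innermost 
 * first, repeating up to $maxDepth times to handle nesting.
 *
 * @param string $text BBCode input
 * @param int $maxDepth Maximum nesting level to convert
 * @return string
 */
function convertQuotes( $text, $maxDepth = 3 ) {
	// Matches only quotes with no other quote tag inside them
	$rx = '#\[quote\]((?:(?!\[/?quote\b).)*)\[/quote\]#is';
	
	for ( $i = 0; $i < $maxDepth; $i++ ) {
		$out = preg_replace( $rx, '<blockquote>$1</blockquote>', $text );
		
		// Stop on a regex error or when nothing is left to convert
		if ( null === $out || $out === $text ) {
			break;
		}
		$text = $out;
	}
	return $text;
}

echo convertQuotes( '[quote]a [quote]b[/quote] c[/quote]' );
// <blockquote>a <blockquote>b</blockquote> c</blockquote>
```

Each pass peels off one nesting level, so anything nested deeper than $maxDepth is simply left as raw text.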

There is no extensive input cleanup to prevent XSS attacks through tags. I’m just showing the basics like tag stripping, which can be circumvented by clever attackers, so it’s up to you to implement a more thorough system. The reason I’m excluding it is because there are already many, many, many examples out there that do a wonderful job at it.

It does provide a limited set of formatting options.

To make everything easier to read in the code file, I used multiple Regex replacements instead of one super duper pattern. This also makes adding quick tags for something else much simpler.
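The one-pattern-per-tag idea carries over to other languages as well. Here is a minimal sketch in PHP, to match the sanitizer above. The function name bbcodeToHtml(), the choice of tags and the use of htmlspecialchars() as the cleanup step are all assumptions for this example.

```php
<?php
/**
 * Hypothetical sketch: apply a list of simple BBCode patterns in order
 *
 * @param string $text BBCode input
 * @return string HTML output
 */
function bbcodeToHtml( $text ) {
	// Escape raw HTML first so only the generated tags survive
	$text = htmlspecialchars( $text, ENT_QUOTES, 'UTF-8' );
	
	// 'pattern' => 'replacement', one pair per tag
	$rules = array(
		'#\[b\]([^\]]+)\[/b\]#i'	=> '<strong>$1</strong>',
		'#\[i\]([^\]]+)\[/i\]#i'	=> '<em>$1</em>',
		'#\[del\]([^\]]+)\[/del\]#i'	=> '<del>$1</del>',
		'#\[hr\]#i'			=> '<hr />'
	);
	
	foreach ( $rules as $rx => $html ) {
		$text = preg_replace( $rx, $html, $text );
	}
	return $text;
}

echo bbcodeToHtml( '[b]Bold[/b] and [i]<i>[/i]' );
// <strong>Bold</strong> and <em>&lt;i&gt;</em>
```

Adding a tag means adding one line to the array, which is the same convenience the separate Regex replacements give in the C# version.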

A sample of the rendered markup :

  • [b]Bold[/b] = Bold
  • [i]Italic[/i] = Italic
  • [del]Strike[/del] = Strike
  • [color=blue]Blue[/color] = Blue
  • [color=#FA9A99]Pinkish[/color] = Pinkish
  • [size=2]Larger text[/size] (between 2 & 5) = Larger text
  • [google]once in a blue moon[/google] = once in a blue moon
  • [wikipedia]Captain Haddock[/wikipedia] = Captain Haddock (Remember, Wikipedia articles are case sensitive)
  • [img]http://www.google.com/logos/Logo_40blk.gif[/img] =
  • [img=Google Logo]http://www.google.com/logos/Logo_40blk.gif[/img] = Google Logo

One major difference from other BBCode functions is the ability to embed YouTube, Metacafe and LiveVideo media players. You only need to specify the URL within the tags.

E.G. For youtube :


For LiveVideo
[livevideo]http://www.livevideo.com/video/Megalis/E7C2EEE8A7C740379F40B5ECA56ACE8A/momma-sophia-s-dream.aspx[/livevideo]

I don’t think WordPress supports embedding Metacafe videos at the time of this post, but if you want to include those videos :
[metacafe]http://www.metacafe.com/watch/685732/the_diet/[/metacafe]

To find the exact URL of the LiveVideo page, click on “Get Codes” link underneath the video player.
I tried to keep this as simple as possible for users, as they just need to wrap the URL in [tag][/tag] markers.

You can embed quotes following a similar convention to phpBB. But there are slight differences as this was for a custom application.

[quote]This is a quote[/quote]

This is a quote

[quote=Author]This is a quote[/quote]

Author wrote


This is a quote

And so on…

This version also deals with Headers
([h#]Header[/h#] becomes <h#>Header</h#> from 1 to 6.)

[h3]Header[/h3]

[h4]Header[/h4]

And so on…
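All six header levels can also be covered by one pattern with a backreference. This is a sketch in PHP rather than the C# used below, and the function name convertHeaders() is made up for the example.

```php
<?php
/**
 * Hypothetical helper: convert [h1] through [h6] in a single pass
 *
 * @param string $text BBCode input
 * @return string
 */
function convertHeaders( $text ) {
	// \1 forces the closing tag to use the same level as the opening tag
	return preg_replace(
		'#\[h([1-6])\]([^\]]+)\[/h\1\]#i',
		'<h$1>$2</h$1>',
		$text
	);
}

echo convertHeaders( '[h3]Header[/h3]' );
// <h3>Header</h3>
```

Mismatched pairs such as [h3]Header[/h4] are left untouched instead of producing broken markup.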

Of course, you would want to apply special formatting via CSS to keep the look of the page consistent with the rest of the site.

This particular excerpt was written for a .Net 3.5 app, but this portion should work on 2.0+ with no alterations since it doesn’t use anything unique to the newer framework.

This class is by no means meant to be a comprehensive bbcode plugin, but it should get you on the way to create your own custom tags.

Once again, this code has no usage restrictions. I’m just including a disclaimer like all other code samples I’ve posted here. You don’t have to ask me permission to use it for any purpose and I only ask that you abide by the disclaimer.

THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

The tag function

/// <summary>
/// Converts the input plain-text BBCode to HTML output and replaces carriage returns
/// and spaces with <br /> and &nbsp; etc...
/// Recommended: Use this function only during storage and updates.
/// Keep a separate field in your database for HTML formatted content and raw text.
/// An optional third, plain text field, with no formatting info will make full text searching
/// more accurate.
/// E.G. BodyText(with BBCode for textarea/WYSIWYG), BodyPlain(plain text for searching),
/// BodyHtml(formatted HTML for output pages)
/// </summary>
public static string ConvertToHtml(string content)
{
    // Clean your content here... E.G.:
    // content = CleanText(content);

    // Basic tag stripping for this example (PLEASE EXTEND THIS!)
    content = StripTags(content);

    content = MatchReplace(@"\[b\]([^\]]+)\[\/b\]", "<strong>$1</strong>", content);
    content = MatchReplace(@"\[i\]([^\]]+)\[\/i\]", "<em>$1</em>", content);
    content = MatchReplace(@"\[u\]([^\]]+)\[\/u\]", @"<span style=""text-decoration:underline"">$1</span>", content);
    content = MatchReplace(@"\[del\]([^\]]+)\[\/del\]", @"<span style=""text-decoration:line-through"">$1</span>", content);

    // Colors and sizes
    content = MatchReplace(@"\[color=(#[0-9a-fA-F]{6}|[a-z-]+)]([^\]]+)\[\/color\]", @"<span style=""color:$1;"">$2</span>", content);
    content = MatchReplace(@"\[size=([2-5])]([^\]]+)\[\/size\]", @"<span style=""font-size:$1em; font-weight:normal;"">$2</span>", content);

    // Text alignment
    content = MatchReplace(@"\[left\]([^\]]+)\[\/left\]", @"<span style=""text-align:left"">$1</span>", content);
    content = MatchReplace(@"\[right\]([^\]]+)\[\/right\]", @"<span style=""text-align:right"">$1</span>", content);
    content = MatchReplace(@"\[center\]([^\]]+)\[\/center\]", @"<span style=""text-align:center"">$1</span>", content);
    content = MatchReplace(@"\[justify\]([^\]]+)\[\/justify\]", @"<span style=""text-align:justify"">$1</span>", content);

    // HTML Links
    content = MatchReplace(@"\[url\]([^\]]+)\[\/url\]", @"<a href=""$1"">$1</a>", content);
    content = MatchReplace(@"\[url=([^\]]+)]([^\]]+)\[\/url\]", @"<a href=""$1"">$2</a>", content);

    // Images
    content = MatchReplace(@"\[img\]([^\]]+)\[\/img\]", @"<img src=""$1"" alt="""" />", content);
    content = MatchReplace(@"\[img=([^\]]+)]([^\]]+)\[\/img\]", @"<img src=""$2"" alt=""$1"" />", content);

    // Lists
    content = MatchReplace(@"\[\*\]([^\[]+)", "<li>$1</li>", content);
    content = MatchReplace(@"\[list\]([^\]]+)\[\/list\]", "</p><ul>$1</ul><p>", content);
    content = MatchReplace(@"\[list=1\]([^\]]+)\[\/list\]", "</p><ol>$1</ol><p>", content);

    // Headers
    content = MatchReplace(@"\[h1\]([^\]]+)\[\/h1\]", "<h1>$1</h1>", content);
    content = MatchReplace(@"\[h2\]([^\]]+)\[\/h2\]", "<h2>$1</h2>", content);
    content = MatchReplace(@"\[h3\]([^\]]+)\[\/h3\]", "<h3>$1</h3>", content);
    content = MatchReplace(@"\[h4\]([^\]]+)\[\/h4\]", "<h4>$1</h4>", content);
    content = MatchReplace(@"\[h5\]([^\]]+)\[\/h5\]", "<h5>$1</h5>", content);
    content = MatchReplace(@"\[h6\]([^\]]+)\[\/h6\]", "<h6>$1</h6>", content);

    // Horizontal rule
    content = MatchReplace(@"\[hr\]", "<hr />", content);

    // Set a maximum quote depth (In this case, hard coded to 3).
    // Each pass converts the innermost remaining quotes, so 3 passes handle 3 levels.
    for (int i = 0; i < 3; i++)
    {
        // Quotes (QuoteUrl is a helper, not shown here, that builds the link from the "$3" placeholder)
        content = MatchReplace(@"\[quote=([^\]]+)@([^\]]+)\|([^\]]+)]([^\]]+)\[\/quote\]", @"</p><div class=""block""><blockquote><cite>$1 <a href=""" + QuoteUrl("$3") + @""">wrote</a> on $2</cite><hr /><p>$4</p></blockquote></div><p>", content);
        content = MatchReplace(@"\[quote=([^\]]+)@([^\]]+)]([^\]]+)\[\/quote\]", @"</p><div class=""block""><blockquote><cite>$1 wrote on $2</cite><hr /><p>$3</p></blockquote></div><p>", content);
        content = MatchReplace(@"\[quote=([^\]]+)]([^\]]+)\[\/quote\]", @"</p><div class=""block""><blockquote><cite>$1 wrote</cite><hr /><p>$2</p></blockquote></div><p>", content);
        content = MatchReplace(@"\[quote\]([^\]]+)\[\/quote\]", @"</p><div class=""block""><blockquote><p>$1</p></blockquote></div><p>", content);
    }

    // The following markup is for embedded video -->

    // YouTube
    content = MatchReplace(@"\[youtube\]http:\/\/([a-zA-Z]+\.)youtube\.com\/watch\?v=([a-zA-Z0-9_\-]+)\[\/youtube\]",
        @"<object width=""425"" height=""344""><param name=""movie"" value=""http://www.youtube.com/v/$2""></param><param name=""allowFullScreen"" value=""true""></param><embed src=""http://www.youtube.com/v/$2"" type=""application/x-shockwave-flash"" allowfullscreen=""true"" width=""425"" height=""344""></embed></object>", content);

    // LiveVideo
    content = MatchReplace(@"\[livevideo\]http:\/\/([a-zA-Z]+\.)livevideo\.com\/video\/([a-zA-Z0-9_\-]+)\/([a-zA-Z0-9]+)\/([a-zA-Z0-9_\-]+)\.aspx\[\/livevideo\]",
        @"<object width=""445"" height=""369""><embed src=""http://www.livevideo.com/flvplayer/embed/$3"" type=""application/x-shockwave-flash"" quality=""high"" width=""445"" height=""369"" wmode=""transparent""></embed></object>", content);

    // LiveVideo (There are two types of links for LV)
    content = MatchReplace(@"\[livevideo\]http:\/\/([a-zA-Z]+\.)livevideo\.com\/video\/([a-zA-Z0-9]+)\/([a-zA-Z0-9_\-]+)\.aspx\[\/livevideo\]",
        @"<object width=""445"" height=""369""><embed src=""http://www.livevideo.com/flvplayer/embed/$2&amp;autostart=0"" type=""application/x-shockwave-flash"" quality=""high"" width=""445"" height=""369"" wmode=""transparent""></embed></object>", content);

    // Metacafe
    content = MatchReplace(@"\[metacafe\]http\:\/\/([a-zA-Z]+\.)metacafe\.com\/watch\/([0-9]+)\/([a-zA-Z0-9_]+)/\[\/metacafe\]",
        @"<object width=""400"" height=""345""><embed src=""http://www.metacafe.com/fplayer/$2/$3.swf"" width=""400"" height=""345"" wmode=""transparent"" pluginspage=""http://www.macromedia.com/go/getflashplayer"" type=""application/x-shockwave-flash""></embed></object>", content);

    // LiveLeak
    content = MatchReplace(@"\[liveleak\]http:\/\/([a-zA-Z]+\.)liveleak\.com\/view\?i=([a-zA-Z0-9_]+)\[\/liveleak\]",
        @"<object width=""450"" height=""370""><param name=""movie"" value=""http://www.liveleak.com/e/$2""></param><param name=""wmode"" value=""transparent""></param><embed src=""http://www.liveleak.com/e/$2"" type=""application/x-shockwave-flash"" wmode=""transparent"" width=""450"" height=""370""></embed></object>", content);

    // <-- End video markup

    // Google and Wikipedia page links
    content = MatchReplace(@"\[google\]([^\]]+)\[\/google\]", @"<a href=""http://www.google.com/search?q=$1"">$1</a>", content);
    content = MatchReplace(@"\[wikipedia\]([^\]]+)\[\/wikipedia\]", @"<a href=""http://www.wikipedia.org/wiki/$1"">$1</a>", content);

    // Put the content in a paragraph
    content = "<p>" + content + "</p>";

    // Clean up a few potential markup problems
    content = content.Replace("\t", "    ")
        .Replace("  ", "&nbsp; ")
        .Replace("<br />", "")
        .Replace("<p><br />", "<p>")
        .Replace("<p><blockquote>", "<blockquote>")
        .Replace("</blockquote></p>", "</blockquote>")
        .Replace("<p></p>", "")
        .Replace("<p><ul>", "<ul>")
        .Replace("</ul></p>", "</ul>")
        .Replace("<p><ol>", "<ol>")
        .Replace("</ol></p>", "</ol>")
        .Replace("<p><li>", "<li><p>")
        .Replace("</li></p>", "</p></li>");

    return content;
}
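
The quote loop relies on a property of the `[^\]]+` pattern that is easy to miss: it cannot cross a closing bracket, so on each pass only a quote with no tags left inside it can match. Nested quotes are therefore converted from the inside out, one level per pass. Here is a minimal sketch of that idea on its own, assuming a hypothetical `QuoteDepthSketch` class and a simplified `<blockquote>` replacement rather than the full markup used above:

```csharp
using System.Text.RegularExpressions;

public static class QuoteDepthSketch
{
    // Hypothetical helper (not part of the original class): converts plain
    // [quote]...[/quote] pairs, running one regex pass per nesting level.
    public static string ConvertQuotes(string content, int maxDepth)
    {
        for (int i = 0; i < maxDepth; i++)
        {
            // [^\]]+ stops at the first ']', so only a quote whose body
            // contains no remaining BBCode tags matches on this pass.
            content = Regex.Replace(content,
                @"\[quote\]([^\]]+)\[\/quote\]",
                "<blockquote>$1</blockquote>",
                RegexOptions.IgnoreCase);
        }
        return content;
    }
}
```

Calling `QuoteDepthSketch.ConvertQuotes("[quote]a [quote]b[/quote] c[/quote]", 1)` converts only the inner quote and leaves the outer tags as text; with a depth of 2 both levels are converted. Anything nested deeper than the chosen depth simply stays as literal BBCode.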

StripTags and MatchReplace functions

/// <summary>
/// Strip any existing HTML tags
/// </summary>
/// <param name="content">Raw input from user</param>
/// <returns>Tag stripped storage safe text</returns>
public static string StripTags(string content)
{
	return MatchReplace(@"<[^>]+>", "", content, true, true, true);
}

public static string MatchReplace(string pattern, string match, string content)
{
	return MatchReplace(pattern, match, content, false, false, false);
}

public static string MatchReplace(string pattern, string match, string content, bool multi)
{
	return MatchReplace(pattern, match, content, multi, false, false);
}

public static string MatchReplace(string pattern, string match, string content, bool multi, bool white)
{
	return MatchReplace(pattern, match, content, multi, white, false);
}

/// <summary>
/// Match and replace a specific pattern with formatted text
/// </summary>
/// <param name="pattern">Regular expression pattern</param>
/// <param name="match">Match replacement</param>
/// <param name="content">Text to format</param>
/// <param name="multi">Multiline text (optional)</param>
/// <param name="white">Ignore white space (optional)</param>
/// <param name="cult">Culture invariant matching (optional)</param>
/// <returns>HTML Formatted from the original BBCode</returns>
public static string MatchReplace(string pattern, string match, string content, bool multi, bool white, bool cult)
{
	if (multi && white && cult)
		return Regex.Replace(content, pattern, match, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);
	else if (multi && white)
		return Regex.Replace(content, pattern, match, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
	else if (multi && cult)
		return Regex.Replace(content, pattern, match, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant);
	else if (white && cult)
		return Regex.Replace(content, pattern, match, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);
	else if (multi)
		return Regex.Replace(content, pattern, match, RegexOptions.IgnoreCase | RegexOptions.Multiline);
	else if (white)
		return Regex.Replace(content, pattern, match, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
	else if (cult)
		return Regex.Replace(content, pattern, match, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

	// Default
	return Regex.Replace(content, pattern, match, RegexOptions.IgnoreCase);
}
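
The overloads above spell out every combination of the three switches by hand. Since `RegexOptions` is a flags enumeration, the same behaviour can probably be had by building the option set up one flag at a time. The following is only a sketch of that alternative, assuming C# 4.0 or later for the optional parameters; the `MatchReplaceSketch` class name is made up for the example:

```csharp
using System.Text.RegularExpressions;

public static class MatchReplaceSketch
{
    // Same parameters as the original MatchReplace; the three switches
    // default to false so the shorter overloads are no longer needed.
    public static string MatchReplace(string pattern, string match, string content,
        bool multi = false, bool white = false, bool cult = false)
    {
        // Case-insensitive matching is always on, as in the original
        RegexOptions options = RegexOptions.IgnoreCase;

        if (multi) options |= RegexOptions.Multiline;
        if (white) options |= RegexOptions.IgnorePatternWhitespace;
        if (cult) options |= RegexOptions.CultureInvariant;

        return Regex.Replace(content, pattern, match, options);
    }
}
```

A call such as `MatchReplaceSketch.MatchReplace(@"\[b\]([^\]]+)\[\/b\]", "<strong>$1</strong>", "[B]hi[/B]")` returns `<strong>hi</strong>`, and a single switch can be set by name, e.g. `white: true`, without listing the ones before it.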

Enjoy!

Addendum…
Robert Beal, in particular, has created a wonderful HtmlUtility class for C# 3.0+ that allows only certain tags and tag attributes. If your visitors make extensive use of HTML tags, that is a better option than my system. If you want to implement a feedback system for guest writers with strong HTML support, Robert’s example is highly recommended.

My example is really only for people who post in plain text most of the time and only occasionally post formatted text or videos.

In fact, it’s probably best to avoid extensive HTML support for ordinary comments, as that will only encourage users to abuse the system. You’re better off using this for something minimal.
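
In the same minimal spirit, it is worth noting that the `[url]` pattern shown earlier accepts any text up to the closing bracket, which includes `javascript:` targets. A rough sketch of a stricter link tag follows, assuming only absolute http, https and ftp links are wanted. The `SafeLinkSketch` class and its `ConvertUrls` method are hypothetical names for this example, not part of the code above:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SafeLinkSketch
{
    // Hypothetical helper: turns [url]...[/url] into a link only when the
    // target parses as an absolute http, https or ftp URI.
    public static string ConvertUrls(string content)
    {
        return Regex.Replace(content, @"\[url\]([^\]]+)\[\/url\]", m =>
        {
            string raw = m.Groups[1].Value.Trim();

            Uri uri;
            bool allowed = Uri.TryCreate(raw, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp
                    || uri.Scheme == Uri.UriSchemeHttps
                    || uri.Scheme == Uri.UriSchemeFtp);

            // Anything else is kept as encoded plain text, not as a link
            if (!allowed)
                return WebUtility.HtmlEncode(raw);

            // Encode the normalized URI before placing it in the attribute
            string safe = WebUtility.HtmlEncode(uri.AbsoluteUri);
            return "<a href=\"" + safe + "\">" + safe + "</a>";
        }, RegexOptions.IgnoreCase);
    }
}
```

With this in place, `[url]javascript:alert(1)[/url]` comes out as plain text rather than a clickable link. The same check could be applied to `[img]` and the `[url=...]` form if you need them.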