Microsoft has a different definition for “Open Source”

The Microsoft Web Protection Library, commonly referred to as AntiXss (the previous class name) has had a bit of a vulnerability. Now we all know that vulnerability and Microsoft usually go together like chicken and eggs, but you see what makes this different is that it’s an apparent earnest effort at an Open Source project. Unfortunately, the newest iteration of this project, 4.2 breaks everything. This isn’t unusual by itself either as that too is a Microsoft staple and indeed other projects have faced similar issues with “fixes”. What is unusual is that MS has reinterpreted the meaning of Open Source and removed all previous binaries to the library that actually worked, even with the vulnerability (thereby making it non-Open), and the sources for the 4.2 “fixes” are still unavailable.

Now I’ve read the FAQ for Open Source, but I couldn’t find single instance where this behavior would fit under the term. I scanned through the FAQ again, and found a behavior similar to what MS is actually doing and came across the following :

What if I do not want to distribute my program in source code form? Or what if I don’t want to distribute it in either source or binary form?

If you don’t distribute source code, then what you are distributing cannot meaningfully be called “Open Source”. And if you don’t distribute at all, then by definition you’re not distributing source code, so you’re not distributing anything Open Source.

Interesting…

What’s more, “Barry” the coordinator of the project on CodePlex has stated the following when another user wanted the sources listed :

The source branch for 4.0, release, is available – the dates don’t ever match due to the way we publish.

The source for 4.2 is not available – it takes a bit of cleaning before publication (we usually have a 1-2 week gap), and as we’re working on getting the sanitizer functional again for 4.3 taking the time to publish the 4.2 code would remove effort from tracking down what is going on.

That’s not quite what it means to be Open Source, Barry.

But earlier on the same thread he had said :

It’s company policy I’m afraid. The source will remain though, so if you desperately wanted you could download and compile your own versions of older releases.

Notice it wasn’t project policy or community policy. But it was company policy; meaning Microsoft has a different definition of the term Open Source.

So AntiXss is no longer an Open Source project.

AntiXss 4.2 Breaks everything

This is one of those situations where none of your available options are good and your least harmful alternative is to shoot yourself in the foot at a slightly odd angle so as to only lose the little toe and not the big one.

All of this happened when Microsoft revealed January that their AntiXss library, now known as the Microsoft Web Protection Library (never seen a more ironic combination of words), had a vulnerability and like all obedient drones, we must update immediately to avoid shooting ourselves in our big toe. The problem is that updating will cause you to loose your little toe.

You see, the new library BREAKS EVERYTHING and eats your children.

Update 11/14/2013:
A new HTML sanitizer is now available for PHP.

I WILL EAT ALL YOUR TAGS!!!

I think the problem is best described by someone who left a comment at the project discussion board.

I was using an old version of Anti-XSS with a rich text editor (CkEditor). It was working very great. But when upgrading to latest version, I discovered the new sanitized is way too much aggressive and is removing almost everything “rich” in the rich editor, specially colors, backgrounds, font size, etc… It’s a disaster for my CMS!

Is there any migration path I can use to keep some of the features of the rich text editor and having at least minimal XSS protection ?

Lovely eh?

Here’s the response from the coordinator.

CSS will always be stripped now – it’s too dangerous, but in other cases it is being too greedy, dropping hrefs from a tags for example. That is being looked at.

I know this may be a strange idea to comprehend for the good folks who developed the library, but you see in the civilized world, many people tend to use WYSIWYG in their projects so as to not burden their users with tags. These days more people are familiar with rudimentary HTML, but when you just want to quickly make a post, comment or otherwise share something, it’s nice to know there’s an editor that can accommodate rich formatting. This is especially true on a mobile device, where switching from text to special characters for tags is still annoying.

Those WYSIWYGs invariably use CSS and inline styles to accomplish this rich formatting, thereby making your assertion ridiculous and this library now completely impractical.

A very quick test on the 4.2 Sanitizer shows that it totally removes strong tags, h1 tags, section tags and as mentioned above strips href attributes from anchor tags. At this rate the output will soon be string.Empty. I hope that the next version will allow basic markup tags and restore the href to anchors.

So in other words, AntiXss is now like an antidepressant. You’ll feel a lot better after taking it, but you may end up killing yourself.

And that’s not all…

I would have kept my mouth shut about this even though I’ve had my doubts about depending on the library over something DIY, but since I work with a bunch of copycat monkeys, I have to use whatever everyone else deems worthy of being included in a project (common sense be damned). I thought, surely there would at least be the older versions available, but no

It’s company policy I’m afraid. The source will remain though, so if you desperately wanted you could download and compile your own versions of older releases.

Of course, I lost my temper at that. Since I’m forced to use this library and one of the devs went ahead and upgraded without backing up the old version or finding out exactly how the vulnerability would affect us. I now had to go treasure hunting across three computers to find 4.0 after just getting home.

AntiXss 4.2 is stupid and so is Microsoft.

Here’s my current workaround until MS comes up with a usable alternative. I’m also using the HtmlAgilityPack which at the moment hasn’t contracted rabies, thankfully, and the 4.0 library.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace Arcturus.Helpers
{
	/// <summary>
	/// This is an HTML cleanup utility combining the benefits of the
	/// HtmlAgilityPack to parse raw HTML and the AntiXss library
	/// to remove potentially dangerous user input.
	///
	/// Additionally it uses a list created by Robert Beal to limit
	/// the number of allowed tags and attributes to a sensible level
	/// </summary>
	public sealed class HtmlUtility
	{
		private static volatile HtmlUtility _instance;
		private static object _root = new object();

		private HtmlUtility() { }

		public static HtmlUtility Instance
		{
			get
			{
				if (_instance == null)
					lock (_root)
						if (_instance == null)
							_instance = new HtmlUtility();

				return _instance;
			}
		}

		// Original list courtesy of Robert Beal :
		// http://www.robertbeal.com/

		private static readonly Dictionary<string, string[]> ValidHtmlTags =
			new Dictionary<string, string[]>
        {
            {"p", new string[]          {"style", "class", "align"}},
            {"div", new string[]        {"style", "class", "align"}},
            {"span", new string[]       {"style", "class"}},
            {"br", new string[]         {"style", "class"}},
            {"hr", new string[]         {"style", "class"}},
            {"label", new string[]      {"style", "class"}},

            {"h1", new string[]         {"style", "class"}},
            {"h2", new string[]         {"style", "class"}},
            {"h3", new string[]         {"style", "class"}},
            {"h4", new string[]         {"style", "class"}},
            {"h5", new string[]         {"style", "class"}},
            {"h6", new string[]         {"style", "class"}},

            {"font", new string[]       {"style", "class",
				"color", "face", "size"}},
            {"strong", new string[]     {"style", "class"}},
            {"b", new string[]          {"style", "class"}},
            {"em", new string[]         {"style", "class"}},
            {"i", new string[]          {"style", "class"}},
            {"u", new string[]          {"style", "class"}},
            {"strike", new string[]     {"style", "class"}},
            {"ol", new string[]         {"style", "class"}},
            {"ul", new string[]         {"style", "class"}},
            {"li", new string[]         {"style", "class"}},
            {"blockquote", new string[] {"style", "class"}},
            {"code", new string[]       {"style", "class"}},
			{"pre", new string[]       {"style", "class"}},

            {"a", new string[]          {"style", "class", "href", "title"}},
            {"img", new string[]        {"style", "class", "src", "height",
				"width", "alt", "title", "hspace", "vspace", "border"}},

            {"table", new string[]      {"style", "class"}},
            {"thead", new string[]      {"style", "class"}},
            {"tbody", new string[]      {"style", "class"}},
            {"tfoot", new string[]      {"style", "class"}},
            {"th", new string[]         {"style", "class", "scope"}},
            {"tr", new string[]         {"style", "class"}},
            {"td", new string[]         {"style", "class", "colspan"}},

            {"q", new string[]          {"style", "class", "cite"}},
            {"cite", new string[]       {"style", "class"}},
            {"abbr", new string[]       {"style", "class"}},
            {"acronym", new string[]    {"style", "class"}},
            {"del", new string[]        {"style", "class"}},
            {"ins", new string[]        {"style", "class"}}
        };

		/// <summary>
		/// Takes raw HTML input and cleans against a whitelist
		/// </summary>
		/// <param name="source">Html source</param>
		/// <returns>Clean output</returns>
		public string SanitizeHtml(string source)
		{
			HtmlDocument html = GetHtml(source);
			if (html == null) return String.Empty;

			// All the nodes
			HtmlNode allNodes = html.DocumentNode;

			// Select whitelist tag names
			string[] whitelist = (from kv in ValidHtmlTags
								  select kv.Key).ToArray();

			// Scrub tags not in whitelist
			CleanNodes(allNodes, whitelist);

			// Filter the attributes of the remaining
			foreach (KeyValuePair<string, string[]> tag in ValidHtmlTags)
			{
				IEnumerable<HtmlNode> nodes = (from n in allNodes.DescendantsAndSelf()
											   where n.Name == tag.Key
											   select n);

				// No nodes? Skip.
				if (nodes == null) continue;

				foreach (var n in nodes)
				{
					// No attributes? Skip.
					if (!n.HasAttributes) continue;

					// Get all the allowed attributes for this tag
					HtmlAttribute[] attr = n.Attributes.ToArray();
					foreach (HtmlAttribute a in attr)
					{
						if (!tag.Value.Contains(a.Name))
						{
							a.Remove(); // Attribute wasn't in the whitelist
						}
						else
						{
							// *** New workaround. This wasn't necessary with the old library
							if (a.Name == "href" || a.Name == "src") {
								a.Value = (!string.IsNullOrEmpty(a.Value))? a.Value.Replace("\r", "").Replace("\n", "") : "";
								a.Value =
									(!string.IsNullOrEmpty(a.Value) &&
									(a.Value.IndexOf("javascript") < 10 || a.Value.IndexOf("eval") < 10)) ?
									a.Value.Replace("javascript", "").Replace("eval", "") : a.Value;
							}
							else if (a.Name == "class" || a.Name == "style")
							{
								a.Value =
									Microsoft.Security.Application.Encoder.CssEncode(a.Value);
							}
							else
							{
								a.Value =
									Microsoft.Security.Application.Encoder.HtmlAttributeEncode(a.Value);
							}
						}
					}
				}
			}

			// *** New workaround (DO NOTHING HAHAHA! Fingers crossed)
			return allNodes.InnerHtml;

			// *** Original code below

			/*
			// Anything we missed will get stripped out
			return
				Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(allNodes.InnerHtml);
			 */
		}

		/// <summary>
		/// Takes a raw source and removes all HTML tags
		/// </summary>
		/// <param name="source"></param>
		/// <returns></returns>
		public string StripHtml(string source)
		{
			source = SanitizeHtml(source);

			// No need to continue if we have no clean Html
			if (String.IsNullOrEmpty(source))
				return String.Empty;

			HtmlDocument html = GetHtml(source);
			StringBuilder result = new StringBuilder();

			// For each node, extract only the innerText
			foreach (HtmlNode node in html.DocumentNode.ChildNodes)
				result.Append(node.InnerText);

			return result.ToString();
		}

		/// <summary>
		/// Recursively delete nodes not in the whitelist
		/// </summary>
		private static void CleanNodes(HtmlNode node, string[] whitelist)
		{
			if (node.NodeType == HtmlNodeType.Element)
			{
				if (!whitelist.Contains(node.Name))
				{
					node.ParentNode.RemoveChild(node);
					return; // We're done
				}
			}

			if (node.HasChildNodes)
				CleanChildren(node, whitelist);
		}

		/// <summary>
		/// Apply CleanNodes to each of the child nodes
		/// </summary>
		private static void CleanChildren(HtmlNode parent, string[] whitelist)
		{
			for (int i = parent.ChildNodes.Count - 1; i >= 0; i--)
				CleanNodes(parent.ChildNodes[i], whitelist);
		}

		/// <summary>
		/// Helper function that returns an HTML document from text
		/// </summary>
		private static HtmlDocument GetHtml(string source)
		{
			HtmlDocument html = new HtmlDocument();
			html.OptionFixNestedTags = true;
			html.OptionAutoCloseOnEnd = true;
			html.OptionDefaultStreamEncoding = Encoding.UTF8;

			html.LoadHtml(source);

			// Encode any code blocks independently so they won't
			// be stripped out completely when we do a final cleanup
			foreach (var n in html.DocumentNode.DescendantNodesAndSelf())
			{
				if (n.Name == "code") {
					//** Code tag attribute vulnerability fix 28-9-12 (thanks to Natd)
					HtmlAttribute[] attr = n.Attributes.ToArray();
					foreach (HtmlAttribute a in attr) {
						if (a.Name != "style" && a.Name != "class")  { a.Remove(); }
					} //** End fix
					n.InnerHtml =
						Microsoft.Security.Application.Encoder.HtmlEncode(n.InnerHtml);
				}
			}

			return html;
		}
	}
}

This is a singleton class, so you need to call Instance to initiate.

E.G.

HtmlUtility util = HtmlUtility.Instance;

7:40AM… Bedtime!

Update : September 28.

Natd discovered a vulnerability in this code that allowed onclick attributes to be added to the code tag itself. Fixed.

Whitelist santize with HtmlAgilityPack

For some time now, I’ve been using Robert Beal’s excellent HTML sanitizer both in my personal work and a couple of client projects and I’ve been very happy with it.

However, there were a few instances where I felt there were some potential hiccups in the implementation, namely the Regular Expressions, and I thought that Robert’s original reason to use TidyNet (that it was more mature) is now not an issue. I also felt that by getting rid of Regular Expressions, the code would be more readable and HtmlAgilityPack introduces a lot more features and essentially allows fine grained manipulation of HTML if necessary.

There is also the benefit of using Linq, which simplifies things even more.

With the exception of Robert’s snippet of the safe list above (which I turned to string arrays instead of lists for a tiny performance gain), the rest of this code is in the public domain :

Update…
Some improvements

Update 6/16
Changed the tag stripping to two seperate functions as it was only removing the first instance of an invalid tag, not any nested ones. The change was courtesy of “Meltdown” at the HtmlAgilityPack forum.

Update 11/23/2013
There’s now a new version of this available for PHP. I will no longer be maintaining this branch.

public static class HtmlUtility
{
    // Original list courtesy of Robert Beal :
    // http://www.robertbeal.com/37/sanitising-html

    private static readonly Dictionary<string, string[]> ValidHtmlTags =
        new Dictionary<string, string[]>
        {
            {"p", new string[]          {"style", "class", "align"}},
            {"div", new string[]        {"style", "class", "align"}},
            {"span", new string[]       {"style", "class"}},
            {"br", new string[]         {"style", "class"}},
            {"hr", new string[]         {"style", "class"}},
            {"label", new string[]      {"style", "class"}},

            {"h1", new string[]         {"style", "class"}},
            {"h2", new string[]         {"style", "class"}},
            {"h3", new string[]         {"style", "class"}},
            {"h4", new string[]         {"style", "class"}},
            {"h5", new string[]         {"style", "class"}},
            {"h6", new string[]         {"style", "class"}},

            {"font", new string[]       {"style", "class", "color", "face", "size"}},
            {"strong", new string[]     {"style", "class"}},
            {"b", new string[]          {"style", "class"}},
            {"em", new string[]         {"style", "class"}},
            {"i", new string[]          {"style", "class"}},
            {"u", new string[]          {"style", "class"}},
            {"strike", new string[]     {"style", "class"}},
            {"ol", new string[]         {"style", "class"}},
            {"ul", new string[]         {"style", "class"}},
            {"li", new string[]         {"style", "class"}},
            {"blockquote", new string[] {"style", "class"}},
            {"code", new string[]       {"style", "class"}},

            {"a", new string[]          {"style", "class", "href", "title"}},
            {"img", new string[]        {"style", "class", "src", "height", "width",
                "alt", "title", "hspace", "vspace", "border"}},

            {"table", new string[]      {"style", "class"}},
            {"thead", new string[]      {"style", "class"}},
            {"tbody", new string[]      {"style", "class"}},
            {"tfoot", new string[]      {"style", "class"}},
            {"th", new string[]         {"style", "class", "scope"}},
            {"tr", new string[]         {"style", "class"}},
            {"td", new string[]         {"style", "class", "colspan"}},

            {"q", new string[]          {"style", "class", "cite"}},
            {"cite", new string[]       {"style", "class"}},
            {"abbr", new string[]       {"style", "class"}},
            {"acronym", new string[]    {"style", "class"}},
            {"del", new string[]        {"style", "class"}},
            {"ins", new string[]        {"style", "class"}}
        };

    /// <summary>
    /// Takes raw HTML input and cleans against a whitelist
    /// </summary>
    /// <param name="source">Html source</param>
    /// <returns>Clean output</returns>
    public static string SanitizeHtml(string source)
    {
        HtmlDocument html = GetHtml(source);
        if (html == null) return String.Empty;

        // All the nodes
        HtmlNode allNodes = html.DocumentNode;

        // Select whitelist tag names
        string[] whitelist = (from kv in ValidHtmlTags
                              select kv.Key).ToArray();

        // Scrub tags not in whitelist
        CleanNodes(allNodes, whitelist);

        // Filter the attributes of the remaining
        foreach (KeyValuePair<string, string[]> tag in ValidHtmlTags)
        {
            IEnumerable<HtmlNode> nodes = (from n in allNodes.DescendantsAndSelf()
                                           where n.Name == tag.Key
                                           select n);

            if (nodes == null) continue;

            foreach (var n in nodes)
            {
                if (!n.HasAttributes) continue;

                // Get all the allowed attributes for this tag
                HtmlAttribute[] attr = n.Attributes.ToArray();
                foreach (HtmlAttribute a in attr)
                {
                    if (!tag.Value.Contains(a.Name))
                    {
                        a.Remove(); // Wasn't in the list
                    }
                    else
                    {
                        // AntiXss
                        a.Value =
                            Microsoft.Security.Application.Encoder.UrlPathEncode(a.Value);
                    }
                }
            }
        }

        return allNodes.InnerHtml;
    }

    /// <summary>
    /// Takes a raw source and removes all HTML tags
    /// </summary>
    /// <param name="source"></param>
    /// <returns></returns>
    public static string StripHtml(string source)
    {
        source = SanitizeHtml(source);

        // No need to continue if we have no clean Html
        if (String.IsNullOrEmpty(source))
            return String.Empty;

        HtmlDocument html = GetHtml(source);
        StringBuilder result = new StringBuilder();

        // For each node, extract only the innerText
        foreach (HtmlNode node in html.DocumentNode.ChildNodes)
            result.Append(node.InnerText);

        return result.ToString();
    }

    /// <summary>
    /// Recursively delete nodes not in the whitelist
    /// </summary>
    private static void CleanNodes(HtmlNode node, string[] whitelist)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            if (!whitelist.Contains(node.Name))
            {
                node.ParentNode.RemoveChild(node);
                return; // We're done
            }
        }

        if (node.HasChildNodes)
            CleanChildren(node, whitelist);
    }

    /// <summary>
    /// Apply CleanNodes to each of the child nodes
    /// </summary>
    private static void CleanChildren(HtmlNode parent, string[] whitelist)
    {
        for (int i = parent.ChildNodes.Count - 1; i >= 0; i--)
            CleanNodes(parent.ChildNodes[i], whitelist);
    }

    /// <summary>
    /// Helper function that returns an HTML document from text
    /// </summary>
    private static HtmlDocument GetHtml(string source)
    {
        HtmlDocument html = new HtmlDocument();
        html.OptionFixNestedTags = true;
        html.OptionAutoCloseOnEnd = true;
        html.OptionDefaultStreamEncoding = Encoding.UTF8;

        html.LoadHtml(source);

        return html;
    }
}