AntiXss 4.2 Breaks everything

This is one of those situations where none of your available options are good and your least harmful alternative is to shoot yourself in the foot at a slightly odd angle so as to only lose the little toe and not the big one.

All of this happened when Microsoft revealed in January that their AntiXss library, now known as the Microsoft Web Protection Library (never seen a more ironic combination of words), had a vulnerability, and like all obedient drones, we must update immediately to avoid shooting ourselves in the big toe. The problem is that updating will cause you to lose your little toe.

You see, the new library BREAKS EVERYTHING and eats your children.

Update 11/14/2013:
A new HTML sanitizer is now available for PHP.

I WILL EAT ALL YOUR TAGS!!!

I think the problem is best described by someone who left a comment on the project discussion board.

I was using an old version of Anti-XSS with a rich text editor (CkEditor). It was working very great. But when upgrading to latest version, I discovered the new sanitized is way too much aggressive and is removing almost everything “rich” in the rich editor, specially colors, backgrounds, font size, etc… It’s a disaster for my CMS!

Is there any migration path I can use to keep some of the features of the rich text editor and having at least minimal XSS protection ?

Lovely eh?

Here’s the response from the coordinator.

CSS will always be stripped now – it’s too dangerous, but in other cases it is being too greedy, dropping hrefs from a tags for example. That is being looked at.

I know this may be a strange concept for the good folks who developed the library to grasp, but out here in the civilized world, many people use WYSIWYG editors in their projects so as to not burden their users with tags. These days more people are familiar with rudimentary HTML, but when you just want to quickly make a post, comment or otherwise share something, it’s nice to know there’s an editor that can accommodate rich formatting. This is especially true on a mobile device, where switching from text to special characters for tags is still annoying.

Those WYSIWYG editors invariably use CSS and inline styles to accomplish that rich formatting, which makes the assertion above ridiculous and this library completely impractical.

A very quick test on the 4.2 Sanitizer shows that it totally removes strong, h1 and section tags, and, as mentioned above, strips href attributes from anchor tags. At this rate the output will soon be string.Empty. I hope the next version will allow basic markup tags and restore the href on anchors.
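
For anyone who wants to reproduce that quick test, it’s nothing fancier than feeding the 4.2 Sanitizer a fragment and eyeballing what comes back. A throwaway console sketch (the sample fragment is mine, and the comments describe the behaviour I saw, not anything the library promises):

using Microsoft.Security.Application;

class SanitizerSmokeTest
{
	static void Main()
	{
		// The sort of fragment a WYSIWYG editor might realistically produce
		string input =
			"<h1>Title</h1>" +
			"<p style=\"color:red\"><strong>Bold</strong> text</p>" +
			"<a href=\"http://example.com\">a link</a>";

		// With 4.2 this came back with the h1 and strong tags gone, the
		// style attribute stripped and the href dropped from the anchor.
		System.Console.WriteLine(Sanitizer.GetSafeHtmlFragment(input));
	}
}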

So in other words, AntiXss is now like an antidepressant. You’ll feel a lot better after taking it, but you may end up killing yourself.

And that’s not all…

I would have kept my mouth shut about this, even though I’ve had my doubts about depending on the library over something DIY, but since I work with a bunch of copycat monkeys, I have to use whatever everyone else deems worthy of being included in a project (common sense be damned). I thought surely the older versions would at least still be available, but no…

It’s company policy I’m afraid. The source will remain though, so if you desperately wanted you could download and compile your own versions of older releases.

Of course, I lost my temper at that. Since I’m forced to use this library, and one of the devs went ahead and upgraded without backing up the old version or finding out exactly how the vulnerability would affect us, I had to go treasure hunting across three computers to find 4.0 right after getting home.

AntiXss 4.2 is stupid and so is Microsoft.

Here’s my current workaround until MS comes up with a usable alternative. I’m also using the HtmlAgilityPack which at the moment hasn’t contracted rabies, thankfully, and the 4.0 library.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace Arcturus.Helpers
{
	/// <summary>
	/// This is an HTML cleanup utility combining the benefits of the
	/// HtmlAgilityPack to parse raw HTML and the AntiXss library
	/// to remove potentially dangerous user input.
	///
	/// Additionally it uses a list created by Robert Beal to limit
	/// the number of allowed tags and attributes to a sensible level
	/// </summary>
	public sealed class HtmlUtility
	{
		private static volatile HtmlUtility _instance;
		private static object _root = new object();

		private HtmlUtility() { }

		public static HtmlUtility Instance
		{
			get
			{
				if (_instance == null)
					lock (_root)
						if (_instance == null)
							_instance = new HtmlUtility();

				return _instance;
			}
		}

		// Original list courtesy of Robert Beal :
		// http://www.robertbeal.com/

		private static readonly Dictionary<string, string[]> ValidHtmlTags =
			new Dictionary<string, string[]>
        {
            {"p", new string[]          {"style", "class", "align"}},
            {"div", new string[]        {"style", "class", "align"}},
            {"span", new string[]       {"style", "class"}},
            {"br", new string[]         {"style", "class"}},
            {"hr", new string[]         {"style", "class"}},
            {"label", new string[]      {"style", "class"}},

            {"h1", new string[]         {"style", "class"}},
            {"h2", new string[]         {"style", "class"}},
            {"h3", new string[]         {"style", "class"}},
            {"h4", new string[]         {"style", "class"}},
            {"h5", new string[]         {"style", "class"}},
            {"h6", new string[]         {"style", "class"}},

            {"font", new string[]       {"style", "class",
				"color", "face", "size"}},
            {"strong", new string[]     {"style", "class"}},
            {"b", new string[]          {"style", "class"}},
            {"em", new string[]         {"style", "class"}},
            {"i", new string[]          {"style", "class"}},
            {"u", new string[]          {"style", "class"}},
            {"strike", new string[]     {"style", "class"}},
            {"ol", new string[]         {"style", "class"}},
            {"ul", new string[]         {"style", "class"}},
            {"li", new string[]         {"style", "class"}},
            {"blockquote", new string[] {"style", "class"}},
            {"code", new string[]       {"style", "class"}},
			{"pre", new string[]       {"style", "class"}},

            {"a", new string[]          {"style", "class", "href", "title"}},
            {"img", new string[]        {"style", "class", "src", "height",
				"width", "alt", "title", "hspace", "vspace", "border"}},

            {"table", new string[]      {"style", "class"}},
            {"thead", new string[]      {"style", "class"}},
            {"tbody", new string[]      {"style", "class"}},
            {"tfoot", new string[]      {"style", "class"}},
            {"th", new string[]         {"style", "class", "scope"}},
            {"tr", new string[]         {"style", "class"}},
            {"td", new string[]         {"style", "class", "colspan"}},

            {"q", new string[]          {"style", "class", "cite"}},
            {"cite", new string[]       {"style", "class"}},
            {"abbr", new string[]       {"style", "class"}},
            {"acronym", new string[]    {"style", "class"}},
            {"del", new string[]        {"style", "class"}},
            {"ins", new string[]        {"style", "class"}}
        };

		/// <summary>
		/// Takes raw HTML input and cleans against a whitelist
		/// </summary>
		/// <param name="source">Html source</param>
		/// <returns>Clean output</returns>
		public string SanitizeHtml(string source)
		{
			HtmlDocument html = GetHtml(source);
			if (html == null) return String.Empty;

			// All the nodes
			HtmlNode allNodes = html.DocumentNode;

			// Select whitelist tag names
			string[] whitelist = (from kv in ValidHtmlTags
								  select kv.Key).ToArray();

			// Scrub tags not in whitelist
			CleanNodes(allNodes, whitelist);

			// Filter the attributes of the remaining whitelisted nodes
			foreach (KeyValuePair<string, string[]> tag in ValidHtmlTags)
			{
				IEnumerable<HtmlNode> nodes = (from n in allNodes.DescendantsAndSelf()
											   where n.Name == tag.Key
											   select n);

				// No nodes? Skip.
				if (nodes == null) continue;

				foreach (var n in nodes)
				{
					// No attributes? Skip.
					if (!n.HasAttributes) continue;

					// Snapshot this node's attributes so we can remove them safely while iterating
					HtmlAttribute[] attr = n.Attributes.ToArray();
					foreach (HtmlAttribute a in attr)
					{
						if (!tag.Value.Contains(a.Name))
						{
							a.Remove(); // Attribute wasn't in the whitelist
						}
						else
						{
							// *** New workaround. This wasn't necessary with the old library
							if (a.Name == "href" || a.Name == "src") {
								a.Value = (!string.IsNullOrEmpty(a.Value))? a.Value.Replace("\r", "").Replace("\n", "") : "";
								a.Value =
									(!string.IsNullOrEmpty(a.Value) &&
									(a.Value.IndexOf("javascript") < 10 || a.Value.IndexOf("eval") < 10)) ?
									a.Value.Replace("javascript", "").Replace("eval", "") : a.Value;
							}
							else if (a.Name == "class" || a.Name == "style")
							{
								a.Value =
									Microsoft.Security.Application.Encoder.CssEncode(a.Value);
							}
							else
							{
								a.Value =
									Microsoft.Security.Application.Encoder.HtmlAttributeEncode(a.Value);
							}
						}
					}
				}
			}

			// *** New workaround (DO NOTHING HAHAHA! Fingers crossed)
			return allNodes.InnerHtml;

			// *** Original code below

			/*
			// Anything we missed will get stripped out
			return
				Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(allNodes.InnerHtml);
			 */
		}

		/// <summary>
		/// Takes a raw source and removes all HTML tags
		/// </summary>
		/// <param name="source"></param>
		/// <returns></returns>
		public string StripHtml(string source)
		{
			source = SanitizeHtml(source);

			// No need to continue if we have no clean Html
			if (String.IsNullOrEmpty(source))
				return String.Empty;

			HtmlDocument html = GetHtml(source);
			StringBuilder result = new StringBuilder();

			// For each node, extract only the innerText
			foreach (HtmlNode node in html.DocumentNode.ChildNodes)
				result.Append(node.InnerText);

			return result.ToString();
		}

		/// <summary>
		/// Recursively delete nodes not in the whitelist
		/// </summary>
		private static void CleanNodes(HtmlNode node, string[] whitelist)
		{
			if (node.NodeType == HtmlNodeType.Element)
			{
				if (!whitelist.Contains(node.Name))
				{
					node.ParentNode.RemoveChild(node);
					return; // We're done
				}
			}

			if (node.HasChildNodes)
				CleanChildren(node, whitelist);
		}

		/// <summary>
		/// Apply CleanNodes to each of the child nodes
		/// </summary>
		private static void CleanChildren(HtmlNode parent, string[] whitelist)
		{
			for (int i = parent.ChildNodes.Count - 1; i >= 0; i--)
				CleanNodes(parent.ChildNodes[i], whitelist);
		}

		/// <summary>
		/// Helper function that returns an HTML document from text
		/// </summary>
		private static HtmlDocument GetHtml(string source)
		{
			HtmlDocument html = new HtmlDocument();
			html.OptionFixNestedTags = true;
			html.OptionAutoCloseOnEnd = true;
			html.OptionDefaultStreamEncoding = Encoding.UTF8;

			html.LoadHtml(source);

			// Encode any code blocks independently so they won't
			// be stripped out completely when we do a final cleanup
			foreach (var n in html.DocumentNode.DescendantNodesAndSelf())
			{
				if (n.Name == "code") {
					//** Code tag attribute vulnerability fix 28-9-12 (thanks to Natd)
					HtmlAttribute[] attr = n.Attributes.ToArray();
					foreach (HtmlAttribute a in attr) {
						if (a.Name != "style" && a.Name != "class")  { a.Remove(); }
					} //** End fix
					n.InnerHtml =
						Microsoft.Security.Application.Encoder.HtmlEncode(n.InnerHtml);
				}
			}

			return html;
		}
	}
}

This is a singleton class, so you go through the Instance property to get hold of it.

E.G.

HtmlUtility util = HtmlUtility.Instance;
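
From there, call SanitizeHtml or StripHtml on whatever raw input you’re handling. A quick sketch (the sample markup and variable names are only for illustration):

string dirty =
	"<p onclick=\"alert('xss')\">Hello <script>evil()</script><strong>world</strong></p>";

// Keeps whitelisted tags and attributes; the script tag and the onclick go away
string clean = HtmlUtility.Instance.SanitizeHtml(dirty);

// Runs the same cleanup, then strips the remaining tags, leaving plain text
string text = HtmlUtility.Instance.StripHtml(dirty);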

7:40AM… Bedtime!

Update: September 28.

Natd discovered a vulnerability in this code that allowed onclick attributes to be added to the code tag itself. Fixed.
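
For anyone curious, the hole was in the territory of the sample below, and the attribute-stripping loop in GetHtml (the bit marked with the 28-9-12 comment) is what closes it. The exact payload here is my own illustration, not Natd’s original report:

// A code tag smuggling in an event handler attribute
string payload = "<code onclick=\"alert('xss')\">var x = 1;</code>";

// With the fix, GetHtml() removes every attribute on code tags except style
// and class before the tag's contents are HTML-encoded.
string clean = HtmlUtility.Instance.SanitizeHtml(payload);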

26 thoughts on “AntiXss 4.2 Breaks everything”

  1. Although I guess I don’t understand the most technical parts of your post, I feel you are illustrating it in a funny way, lol.

  2. Hey eksith, thanks for posting this! I thought I was going mental, I mean… H1 tags!? I was convinced that I was doing something wrong if simple H1 tag was confusingly being removed. I only just started using WPL tonight, sooo glad I’m not more entangled in it….

    • Most welcome!

      Chris, count your lucky stars you haven’t started heavily depending on this yet. Talk about a bait-and-switch; can you imagine what a nightmare it has been for people who only rely on AntiXss for their filtering? It really has been a disaster and they still don’t have a timeline for fixing all the problems.

      From another thread on their discussion board:

      Yes, the way the framework did it changed, and I didn’t keep up. Strictly speaking I’m right, it’s just you’re not passing in a path, you’re passing in a full URL. Unfortunately some of the web forms controls are making assumptions about what UrlEncoding does (it’s badly named, I realise). I have a fix for this and I’ll be merged into the next version.

      As ever we don’t comment on time scales.

      So not only is quite literally everything broken, they also have no comment on when things will be fixed. Go figure.

  3. Pingback: Microsoft has a different definition for “Open Source” « This page intentionally left ugly

  4. Pingback: Discussion Forum Update (tables and classes) | This page intentionally left ugly

  5. Hi,
    I’ve tried the above code with AntiXss version 4.2.1, but all of my inline CSS styles are full of escape sequences like \0002D (the - character); e.g. text-decoration comes out as text\0002Ddecoration.
    Should I be using Version 4.0? If so, do you have a copy of it? Could you put it up on the internet somewhere?
    I’ve replaced the Encoder.CssEncode with an HtmlAttributeEncode for the time being, as it returns valid HTML, but I am worried that doing this could cause problems.
    Thanks,
    Edward

    • Hi Edward,

      Unfortunately, 4.0 and 4.2 both behave the same way. I can email you 4.0, but you won’t be happy with it. HtmlAttributeEncode should be fine as long as “eval” and “javascript” are taken out. Those are the really evil bits that some versions of Internet Explorer would execute even when they turned up in otherwise benign attributes.
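
      If you do go that route, the swap inside SanitizeHtml would look something like this (my sketch of the change Edward describes, not code he posted):

      else if (a.Name == "class" || a.Name == "style")
      {
          // Attribute-encode instead of CSS-encode, but strip "javascript"
          // and "eval" out of the value first so they can never survive.
          string v = (a.Value ?? "").Replace("javascript", "").Replace("eval", "");
          a.Value = Microsoft.Security.Application.Encoder.HtmlAttributeEncode(v);
      }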

  6. What should this line be?
    a.Value = (!string.IsNullOrEmpty(a.Value)? : a.Value.Replace("\r", "").Replace("\n", "");

  7. Pingback: Giving up on ASP.Net | This page intentionally left ugly

    • Oh, I’ve left AntiXSS out of any of the new projects. It’s pretty much a dead library now.

      We’re using something else at work, which sadly I can’t release here because it’s in-house and therefore proprietary, but using HtmlAgilityPack gets you 95% of the way to filter out unnecessary tags and attributes. The rest is easily DIY.

      Best of all, unlike AntiXSS, HAP actually is open source as in you can view the bloody source!

  8. Great job. Thanks for posting this! I had a quick question. Would it be a bad idea to, instead of removing tags and attributes that are not white listed, simply encode them?

    Eg, line 204 becomes: node.ParentNode.ReplaceChild(HtmlNode.CreateNode(Microsoft.Security.Application.Encoder.HtmlEncode(node.OuterHtml)), node);

    • Sure, that would work. I’m using encoding only for the contents of code tags here, since that was what I needed at the time.

      I do have to caution you that there may be times when you don’t want the content shown at all, e.g. when it’s a malicious script that included links to sites hosting even nastier stuff. To avoid displaying links that visitors might navigate to manually, sometimes it’s best not to show the content at all.

      You can’t completely prevent users from harming their own systems if they insist, but you can make it harder for them to do so.
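
      And if you do want the encode-instead-of-remove behaviour, the change stays confined to CleanNodes. A rough, untested sketch built around the one-liner above:

      private static void CleanNodes(HtmlNode node, string[] whitelist)
      {
          if (node.NodeType == HtmlNodeType.Element && !whitelist.Contains(node.Name))
          {
              // Swap the offending node for an encoded copy of its markup
              // instead of deleting it outright.
              HtmlNode encoded = HtmlNode.CreateNode(
                  Microsoft.Security.Application.Encoder.HtmlEncode(node.OuterHtml));
              node.ParentNode.ReplaceChild(encoded, node);
              return;
          }

          if (node.HasChildNodes)
              CleanChildren(node, whitelist);
      }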

  9. Thanks for the post.
    One little remark: you don’t clean the HTML attributes on the “code” tag itself, which wraps the source code we don’t want sanitized. That’s harmful; imagine the wrapping code tag carrying a malicious attribute around something like:

    our encoded html...

    So I added one more dictionary of tags that should be skipped but still have their attributes validated:
    private Dictionary<string, List<string>> SourceTagList = new Dictionary<string, List<string>>();
    SourceTagList.Add("code", new List<string>());
    SourceTagList.Add("source", new List<string>());

    and a check for whether the node is in that list:
    if (!SourceTagList.ContainsKey(node.Name))
    {
        // sanitize!
    }
    else
    {
        if (node.HasAttributes)
        {
            // clean attributes only on the wrapping tag
            CleanAttributes(node);
        }
        // encode everything inside the wrapping tag
        node.InnerHtml = Microsoft.Security.Application.Encoder.HtmlEncode(node.InnerHtml);
    }

  10. This post is getting a large number of spam comments that I’ve been trying to keep on top of. Until the spam attack dies down, I’m temporarily closing comments.

    Edit November 16, 2012 : OK comments are re-enabled.

  11. Thank you for taking the time to write up this solution. I think the solution in AJAX Control Toolkit must have been based on this, if anyone was considering using that. I didn’t need all the extra stuff that comes with ACT so I was very happy to be able to use this solution.

    • Hi Simon, Thanks! Glad you found it useful.

      Still haven’t heard back from MS on any permanent fixes to AntiXSS, but it looks like the library will be part of .NET 4.5 as well. On their CodePlex page, there’s a post from the coordinator saying it’s being “worked on”, but I’m not holding my breath.

  12. OK, I’m gonna have to close the comments on this post again, because I’m getting a ton of spam on it (1000+) as of December 3, 2012.

    I don’t think I’ll reopen comments, so if you need to contact me about it, drop me an email or comment on another post (doesn’t matter if they’re unrelated).

Comments are closed.