This is one of those situations where none of your available options are good and your least harmful alternative is to shoot yourself in the foot at a slightly odd angle so as to only lose the little toe and not the big one.
All of this happened when Microsoft revealed January that their AntiXss library, now known as the Microsoft Web Protection Library (never seen a more ironic combination of words), had a vulnerability and like all obedient drones, we must update immediately to avoid shooting ourselves in our big toe. The problem is that updating will cause you to loose your little toe.
You see, the new library BREAKS EVERYTHING and eats your children.
Update 11/14/2013:
A new HTML sanitizer is now available for PHP.
I think the problem is best described by someone who left a comment at the project discussion board.
I was using an old version of Anti-XSS with a rich text editor (CkEditor). It was working very great. But when upgrading to latest version, I discovered the new sanitized is way too much aggressive and is removing almost everything “rich” in the rich editor, specially colors, backgrounds, font size, etc… It’s a disaster for my CMS!
Is there any migration path I can use to keep some of the features of the rich text editor and having at least minimal XSS protection ?
Lovely eh?
Here’s the response from the coordinator.
CSS will always be stripped now – it’s too dangerous, but in other cases it is being too greedy, dropping hrefs from a tags for example. That is being looked at.
I know this may be a strange idea to comprehend for the good folks who developed the library, but you see in the civilized world, many people tend to use WYSIWYG in their projects so as to not burden their users with tags. These days more people are familiar with rudimentary HTML, but when you just want to quickly make a post, comment or otherwise share something, it’s nice to know there’s an editor that can accommodate rich formatting. This is especially true on a mobile device, where switching from text to special characters for tags is still annoying.
Those WYSIWYGs invariably use CSS and inline styles to accomplish this rich formatting, thereby making your assertion ridiculous and this library now completely impractical.
A very quick test on the 4.2 Sanitizer shows that it totally removes strong tags, h1 tags, section tags and as mentioned above strips href attributes from anchor tags. At this rate the output will soon be string.Empty. I hope that the next version will allow basic markup tags and restore the href to anchors.
So in other words, AntiXss is now like an antidepressant. You’ll feel a lot better after taking it, but you may end up killing yourself.
And that’s not all…
I would have kept my mouth shut about this even though I’ve had my doubts about depending on the library over something DIY, but since I work with a bunch of copycat monkeys, I have to use whatever everyone else deems worthy of being included in a project (common sense be damned). I thought, surely there would at least be the older versions available, but no…
It’s company policy I’m afraid. The source will remain though, so if you desperately wanted you could download and compile your own versions of older releases.
Of course, I lost my temper at that. Since I’m forced to use this library and one of the devs went ahead and upgraded without backing up the old version or finding out exactly how the vulnerability would affect us. I now had to go treasure hunting across three computers to find 4.0 after just getting home.
AntiXss 4.2 is stupid and so is Microsoft.
Here’s my current workaround until MS comes up with a usable alternative. I’m also using the HtmlAgilityPack which at the moment hasn’t contracted rabies, thankfully, and the 4.0 library.
using System; using System.Collections.Generic; using System.Linq; using System.Text; using HtmlAgilityPack; namespace Arcturus.Helpers { /// <summary> /// This is an HTML cleanup utility combining the benefits of the /// HtmlAgilityPack to parse raw HTML and the AntiXss library /// to remove potentially dangerous user input. /// /// Additionally it uses a list created by Robert Beal to limit /// the number of allowed tags and attributes to a sensible level /// </summary> public sealed class HtmlUtility { private static volatile HtmlUtility _instance; private static object _root = new object(); private HtmlUtility() { } public static HtmlUtility Instance { get { if (_instance == null) lock (_root) if (_instance == null) _instance = new HtmlUtility(); return _instance; } } // Original list courtesy of Robert Beal : // http://www.robertbeal.com/ private static readonly Dictionary<string, string[]> ValidHtmlTags = new Dictionary<string, string[]> { {"p", new string[] {"style", "class", "align"}}, {"div", new string[] {"style", "class", "align"}}, {"span", new string[] {"style", "class"}}, {"br", new string[] {"style", "class"}}, {"hr", new string[] {"style", "class"}}, {"label", new string[] {"style", "class"}}, {"h1", new string[] {"style", "class"}}, {"h2", new string[] {"style", "class"}}, {"h3", new string[] {"style", "class"}}, {"h4", new string[] {"style", "class"}}, {"h5", new string[] {"style", "class"}}, {"h6", new string[] {"style", "class"}}, {"font", new string[] {"style", "class", "color", "face", "size"}}, {"strong", new string[] {"style", "class"}}, {"b", new string[] {"style", "class"}}, {"em", new string[] {"style", "class"}}, {"i", new string[] {"style", "class"}}, {"u", new string[] {"style", "class"}}, {"strike", new string[] {"style", "class"}}, {"ol", new string[] {"style", "class"}}, {"ul", new string[] {"style", "class"}}, {"li", new string[] {"style", "class"}}, {"blockquote", new string[] {"style", "class"}}, {"code", new string[] {"style", "class"}}, {"pre", new string[] {"style", "class"}}, {"a", new string[] {"style", "class", "href", "title"}}, {"img", new string[] {"style", "class", "src", "height", "width", "alt", "title", "hspace", "vspace", "border"}}, {"table", new string[] {"style", "class"}}, {"thead", new string[] {"style", "class"}}, {"tbody", new string[] {"style", "class"}}, {"tfoot", new string[] {"style", "class"}}, {"th", new string[] {"style", "class", "scope"}}, {"tr", new string[] {"style", "class"}}, {"td", new string[] {"style", "class", "colspan"}}, {"q", new string[] {"style", "class", "cite"}}, {"cite", new string[] {"style", "class"}}, {"abbr", new string[] {"style", "class"}}, {"acronym", new string[] {"style", "class"}}, {"del", new string[] {"style", "class"}}, {"ins", new string[] {"style", "class"}} }; /// <summary> /// Takes raw HTML input and cleans against a whitelist /// </summary> /// <param name="source">Html source</param> /// <returns>Clean output</returns> public string SanitizeHtml(string source) { HtmlDocument html = GetHtml(source); if (html == null) return String.Empty; // All the nodes HtmlNode allNodes = html.DocumentNode; // Select whitelist tag names string[] whitelist = (from kv in ValidHtmlTags select kv.Key).ToArray(); // Scrub tags not in whitelist CleanNodes(allNodes, whitelist); // Filter the attributes of the remaining foreach (KeyValuePair<string, string[]> tag in ValidHtmlTags) { IEnumerable<HtmlNode> nodes = (from n in allNodes.DescendantsAndSelf() where n.Name == tag.Key select n); // No nodes? Skip. if (nodes == null) continue; foreach (var n in nodes) { // No attributes? Skip. if (!n.HasAttributes) continue; // Get all the allowed attributes for this tag HtmlAttribute[] attr = n.Attributes.ToArray(); foreach (HtmlAttribute a in attr) { if (!tag.Value.Contains(a.Name)) { a.Remove(); // Attribute wasn't in the whitelist } else { // *** New workaround. This wasn't necessary with the old library if (a.Name == "href" || a.Name == "src") { a.Value = (!string.IsNullOrEmpty(a.Value))? a.Value.Replace("\r", "").Replace("\n", "") : ""; a.Value = (!string.IsNullOrEmpty(a.Value) && (a.Value.IndexOf("javascript") < 10 || a.Value.IndexOf("eval") < 10)) ? a.Value.Replace("javascript", "").Replace("eval", "") : a.Value; } else if (a.Name == "class" || a.Name == "style") { a.Value = Microsoft.Security.Application.Encoder.CssEncode(a.Value); } else { a.Value = Microsoft.Security.Application.Encoder.HtmlAttributeEncode(a.Value); } } } } } // *** New workaround (DO NOTHING HAHAHA! Fingers crossed) return allNodes.InnerHtml; // *** Original code below /* // Anything we missed will get stripped out return Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(allNodes.InnerHtml); */ } /// <summary> /// Takes a raw source and removes all HTML tags /// </summary> /// <param name="source"></param> /// <returns></returns> public string StripHtml(string source) { source = SanitizeHtml(source); // No need to continue if we have no clean Html if (String.IsNullOrEmpty(source)) return String.Empty; HtmlDocument html = GetHtml(source); StringBuilder result = new StringBuilder(); // For each node, extract only the innerText foreach (HtmlNode node in html.DocumentNode.ChildNodes) result.Append(node.InnerText); return result.ToString(); } /// <summary> /// Recursively delete nodes not in the whitelist /// </summary> private static void CleanNodes(HtmlNode node, string[] whitelist) { if (node.NodeType == HtmlNodeType.Element) { if (!whitelist.Contains(node.Name)) { node.ParentNode.RemoveChild(node); return; // We're done } } if (node.HasChildNodes) CleanChildren(node, whitelist); } /// <summary> /// Apply CleanNodes to each of the child nodes /// </summary> private static void CleanChildren(HtmlNode parent, string[] whitelist) { for (int i = parent.ChildNodes.Count - 1; i >= 0; i--) CleanNodes(parent.ChildNodes[i], whitelist); } /// <summary> /// Helper function that returns an HTML document from text /// </summary> private static HtmlDocument GetHtml(string source) { HtmlDocument html = new HtmlDocument(); html.OptionFixNestedTags = true; html.OptionAutoCloseOnEnd = true; html.OptionDefaultStreamEncoding = Encoding.UTF8; html.LoadHtml(source); // Encode any code blocks independently so they won't // be stripped out completely when we do a final cleanup foreach (var n in html.DocumentNode.DescendantNodesAndSelf()) { if (n.Name == "code") { //** Code tag attribute vulnerability fix 28-9-12 (thanks to Natd) HtmlAttribute[] attr = n.Attributes.ToArray(); foreach (HtmlAttribute a in attr) { if (a.Name != "style" && a.Name != "class") { a.Remove(); } } //** End fix n.InnerHtml = Microsoft.Security.Application.Encoder.HtmlEncode(n.InnerHtml); } } return html; } } }
This is a singleton class, so you need to call Instance to initiate.
E.G.
HtmlUtility util = HtmlUtility.Instance;
7:40AM… Bedtime!
Update : September 28.
Natd discovered a vulnerability in this code that allowed onclick attributes to be added to the code tag itself. Fixed.