Whitelist santize with HtmlAgilityPack

For some time now, I’ve been using Robert Beal’s excellent HTML sanitizer both in my personal work and a couple of client projects and I’ve been very happy with it.

However, there were a few instances where I felt there were some potential hiccups in the implementation, namely the Regular Expressions, and I thought that Robert’s original reason to use TidyNet (that it was more mature) is now not an issue. I also felt that by getting rid of Regular Expressions, the code would be more readable and HtmlAgilityPack introduces a lot more features and essentially allows fine grained manipulation of HTML if necessary.

There is also the benefit of using Linq, which simplifies things even more.

With the exception of Robert’s snippet of the safe list above (which I turned to string arrays instead of lists for a tiny performance gain), the rest of this code is in the public domain :

Update…
Some improvements

Update 6/16
Changed the tag stripping to two seperate functions as it was only removing the first instance of an invalid tag, not any nested ones. The change was courtesy of “Meltdown” at the HtmlAgilityPack forum.

Update 11/23/2013
There’s now a new version of this available for PHP. I will no longer be maintaining this branch.

public static class HtmlUtility
{
    // Original list courtesy of Robert Beal :
    // http://www.robertbeal.com/37/sanitising-html

    private static readonly Dictionary<string, string[]> ValidHtmlTags =
        new Dictionary<string, string[]>
        {
            {"p", new string[]          {"style", "class", "align"}},
            {"div", new string[]        {"style", "class", "align"}},
            {"span", new string[]       {"style", "class"}},
            {"br", new string[]         {"style", "class"}},
            {"hr", new string[]         {"style", "class"}},
            {"label", new string[]      {"style", "class"}},

            {"h1", new string[]         {"style", "class"}},
            {"h2", new string[]         {"style", "class"}},
            {"h3", new string[]         {"style", "class"}},
            {"h4", new string[]         {"style", "class"}},
            {"h5", new string[]         {"style", "class"}},
            {"h6", new string[]         {"style", "class"}},

            {"font", new string[]       {"style", "class", "color", "face", "size"}},
            {"strong", new string[]     {"style", "class"}},
            {"b", new string[]          {"style", "class"}},
            {"em", new string[]         {"style", "class"}},
            {"i", new string[]          {"style", "class"}},
            {"u", new string[]          {"style", "class"}},
            {"strike", new string[]     {"style", "class"}},
            {"ol", new string[]         {"style", "class"}},
            {"ul", new string[]         {"style", "class"}},
            {"li", new string[]         {"style", "class"}},
            {"blockquote", new string[] {"style", "class"}},
            {"code", new string[]       {"style", "class"}},

            {"a", new string[]          {"style", "class", "href", "title"}},
            {"img", new string[]        {"style", "class", "src", "height", "width",
                "alt", "title", "hspace", "vspace", "border"}},

            {"table", new string[]      {"style", "class"}},
            {"thead", new string[]      {"style", "class"}},
            {"tbody", new string[]      {"style", "class"}},
            {"tfoot", new string[]      {"style", "class"}},
            {"th", new string[]         {"style", "class", "scope"}},
            {"tr", new string[]         {"style", "class"}},
            {"td", new string[]         {"style", "class", "colspan"}},

            {"q", new string[]          {"style", "class", "cite"}},
            {"cite", new string[]       {"style", "class"}},
            {"abbr", new string[]       {"style", "class"}},
            {"acronym", new string[]    {"style", "class"}},
            {"del", new string[]        {"style", "class"}},
            {"ins", new string[]        {"style", "class"}}
        };

    /// <summary>
    /// Takes raw HTML input and cleans against a whitelist
    /// </summary>
    /// <param name="source">Html source</param>
    /// <returns>Clean output</returns>
    public static string SanitizeHtml(string source)
    {
        HtmlDocument html = GetHtml(source);
        if (html == null) return String.Empty;

        // All the nodes
        HtmlNode allNodes = html.DocumentNode;

        // Select whitelist tag names
        string[] whitelist = (from kv in ValidHtmlTags
                              select kv.Key).ToArray();

        // Scrub tags not in whitelist
        CleanNodes(allNodes, whitelist);

        // Filter the attributes of the remaining
        foreach (KeyValuePair<string, string[]> tag in ValidHtmlTags)
        {
            IEnumerable<HtmlNode> nodes = (from n in allNodes.DescendantsAndSelf()
                                           where n.Name == tag.Key
                                           select n);

            if (nodes == null) continue;

            foreach (var n in nodes)
            {
                if (!n.HasAttributes) continue;

                // Get all the allowed attributes for this tag
                HtmlAttribute[] attr = n.Attributes.ToArray();
                foreach (HtmlAttribute a in attr)
                {
                    if (!tag.Value.Contains(a.Name))
                    {
                        a.Remove(); // Wasn't in the list
                    }
                    else
                    {
                        // AntiXss
                        a.Value =
                            Microsoft.Security.Application.Encoder.UrlPathEncode(a.Value);
                    }
                }
            }
        }

        return allNodes.InnerHtml;
    }

    /// <summary>
    /// Takes a raw source and removes all HTML tags
    /// </summary>
    /// <param name="source"></param>
    /// <returns></returns>
    public static string StripHtml(string source)
    {
        source = SanitizeHtml(source);

        // No need to continue if we have no clean Html
        if (String.IsNullOrEmpty(source))
            return String.Empty;

        HtmlDocument html = GetHtml(source);
        StringBuilder result = new StringBuilder();

        // For each node, extract only the innerText
        foreach (HtmlNode node in html.DocumentNode.ChildNodes)
            result.Append(node.InnerText);

        return result.ToString();
    }

    /// <summary>
    /// Recursively delete nodes not in the whitelist
    /// </summary>
    private static void CleanNodes(HtmlNode node, string[] whitelist)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            if (!whitelist.Contains(node.Name))
            {
                node.ParentNode.RemoveChild(node);
                return; // We're done
            }
        }

        if (node.HasChildNodes)
            CleanChildren(node, whitelist);
    }

    /// <summary>
    /// Apply CleanNodes to each of the child nodes
    /// </summary>
    private static void CleanChildren(HtmlNode parent, string[] whitelist)
    {
        for (int i = parent.ChildNodes.Count - 1; i >= 0; i--)
            CleanNodes(parent.ChildNodes[i], whitelist);
    }

    /// <summary>
    /// Helper function that returns an HTML document from text
    /// </summary>
    private static HtmlDocument GetHtml(string source)
    {
        HtmlDocument html = new HtmlDocument();
        html.OptionFixNestedTags = true;
        html.OptionAutoCloseOnEnd = true;
        html.OptionDefaultStreamEncoding = Encoding.UTF8;

        html.LoadHtml(source);

        return html;
    }
}

24 thoughts on “Whitelist santize with HtmlAgilityPack

  1. Pingback: ASP.Net BBCode (C#) « This page intentionally left ugly

  2. Hi Eksith,

    Do I understand correctly that your whitelist approach aggressively strips out an invalid tag, and all of it’s children whether valid or not?

    For example:

    string dirtyInput = “Freaky word tag”;
    Expected: “Freaky word tag”
    But was: “\r\n”

    Is there a way to change so that it continues to process child nodes?

    Cheers,

    Barry.

  3. Ah – I notice in the above my words tags were stripped! ha. Input should’ve been wrapped in “‘<o:p>'”

    I changed CleanNodes() to this:

    private static void CleanNodes(HtmlNode node, string[] whitelist) {
    if (node.NodeType == HtmlNodeType.Element) {
    if (!whitelist.Contains(node.Name)) {
    node.ParentNode.AppendChildren(node.ChildNodes);
    node.ParentNode.RemoveChild(node);
    }
    }

    if (node.HasChildNodes)
    CleanChildren(node, whitelist);
    }

    • Hi Barry,

      Yes, my approach strips out all chid tags as well. For my case, this was appropriate, but your approach would be very useful for someone who wants to keep going to the child nodes.

      The one hiccup I can see is that if the child tags are valid, but the parent wasn’t, this may mess up your page formatting.

      I.E. If a user misspelled <table> as <tbale>, any <tr> tags would still remain. If the output was inside a grid, the other rows would be affected.

      But, thanks for sharing!

    • For one of my projects I also needed to leave child nodes even if tag is not in white list. In white list I included only and tags. When I tried to apply your variant of CleanNodes() method for input html:
      Some text – another text More Text
      I got:
      Some – another More Texttexttext
      As you can see, there is some problem with your solution.
      But eventually, I found working solution made using Regexp here: http://code.commongroove.com/2012/06/05/c-regular-expressions-to-filter-html-to-a-whitelist-of-allowable-tags/.
      For mentioned input it gave me the following expected output:
      Some text – another text More Text

  4. There is a “problem” with the class. If I have a string like that : “Test 1 2 Test Hello World” ” will be deleted =/…

    • Hi Mickael,

      I think some of that formatting got lost to the WordPress filter. If it’s a plain sentence with extra quotes, there shouldn’t be a problem. If there are broken tags that are causing this, then make sure you have the latest version of the HtmlAgilityPack. That is what’s used to parse the whole page for tag validity before filtering.

  5. Hi Eksith. Very helpful code. But i’m try modify your code and have couple questions.
    How i can add verification if node have parrent node with some name or type?

    Some people wont post a code in comments, and code or html should not be sanitized.
    We can chek if node has parrent node with some name
    if (n.ParentNode.Name != “code”){
    //ok sanitize here
    }

    but it’s work only in one nested level – here example http://pastebin.com/8wcueMUn

    briefly we need skip sinitize all in code and source html tags.
    Any suggestion?

    • Hi Natd, glad you found it helpful.

      Ironically, I ran into the exact same problem, so I updated the code in another post here. It also has some… er… “feedback” about AntiXSS, so you’ll need to scroll to the bottom of the post ;)

  6. Pingback: Обрезка HTML тегов с фильтрцией по белому списку в C# | Нюансы разработки

  7. Hi
    A big Thank for your work. I really appreciate it.
    I have a Question. How can I Add some Tags to the white list. I added rowspan right after colspan but it throws an error.
    here is the changed part:
    {“td”, new string[] {“style”, “class”, “colspan”}},
    {“td”, new string[] {“style”, “class”, “rowspan”}},

    what i’m doing wrong?
    thanks

    • Hi Mahdi,

      That’s two of the same ;)
      You can replace both with just one of this :

      {"td", new string[] {"style", "class", "colspan", "rowspan"}},

  8. Pingback: How to save HTML to database and retrieve it properly - Tech Forum Network

  9. Pingback: Sanitizing HTML input with .NET | Andrew Olson

  10. allNodes.InnerHtml on line 107 didn’t seem to be properly saving all the changes made. Switched it to allNodes.WriteContentTo() and it works like a charm now. May be a bug with the newest HtmlAgilityPack but just an FYI.

    • Thank you for this!

      I really need rewrite this whole thing when I have the time. There are a few new things introduced in newer HtmlAgilityPack versions that I didn’t get to use when I first put this together.

      If you come across any other bugs/improvements, please don’t hesitate to share.

      Thanks for dropping by.

  11. You should seriously consider if it is safe to have class and style in there.

    Style’s issue is easy:
    content: url(‘badguys.com’);
    background-image: url(‘google.com/tracker’);
    position: fixed;
    top 0;
    right 0;

    Class is because of this:
    $(‘.delete’).click(item.doDelete);

    That is — if the page contains scripts that are using jQuery to match based on class names, those will match in user content also. And so the attacker can add controls on the page. Later, those controls can be clicked by a higher privileged user.

    If you do allow these — you need to be extremely cautious. You more or less need a full CSS parser, and you need to white list the specific properties you’re going to allow to be set as well as the values. As the CSS working group continuously expands syntax for the properties, and valid values for the properties, there’s no way to assume what is safe today will be safe tomorrow.

    Which is the point in the white list in the first place.

    • Hi Dave, thanks for dropping by.

      Unfortunately that’s thoroughly beyond the purview of this class which I threw together for a personal project of mine. It seems MS still hasn’t come up with a suitable alternative since they broke the AntiXSS suite, which is why I had to write this and use HtmlAgilityPack. And I have a newer version of this class since this was posted

      It’s true a lot of things can change in the future, which is why the list can be replaced by anyone using the class (many already have it in production with a far fewer subset of attributes and tags).

  12. This code has a vulnerability when used in combination with IE9 or earlier: It doesn’t strip out conditional comments, so you can just stuff in some script tags inside the conditional comment. I’ll try pasting in some example markup; hopefully it won’t get mangled:

    <!–[if IE]><script type=’text/javascript’>alert(‘hacked’);</script><![endif]–>

    Of course, the problem is the crazy behaviour of IE to execute comments, but this sanitisation code should still work to remove the threat.

    Plus, the a tag’s href attribute and the img tag’s src attribute should both be sanitised to ensure they don’t have script in them.

    • Thanks for pointing that out, Trent.

      Yes this is a vulnerability that I need to fix, but it’s going to be part of a laundry list that I need to reevaluate and post sometime. Meanwhile, there’s a newer version of this since AntiXSS is now broken.

      Unfortunately, I’ve moved on from most Microsoft technologies by and large so I don’t have as much time as I used to in order to revisit all the code I’ve posted here.

  13. Pingback: No luck with HTML decoding based on safe HTML tags (vb.net or c#) - BlogoSfera

  14. The code where you got this doesnt exists anymore, is there anyway to get the original code? and see if the license is for everyone to use?

    I would like to use your code in my projects, but I dont know if you will allow it, or is all your code under the MIT license? will there be any issues?

  15. Pingback: Comment enregistrer HTML à la base de données et le récupérer correctement

Leave a comment