Whitelist sanitize with HtmlAgilityPack

June 14, 2011 by eksith

For some time now, I’ve been using Robert Beal’s excellent HTML sanitizer both in my personal work and a couple of client projects and I’ve been very happy with it.

However, I felt there were a few potential hiccups in the implementation, namely the regular expressions, and Robert’s original reason for using TidyNet (that it was more mature) is no longer an issue. Getting rid of the regular expressions also makes the code more readable, and HtmlAgilityPack offers a lot more features, allowing fine-grained manipulation of the HTML if necessary.

There is also the benefit of using LINQ, which simplifies things even more.

With the exception of Robert’s safe list at the top of the code below (which I turned into string arrays instead of lists for a tiny performance gain), the rest of this code is in the public domain:

Update…
Some improvements

Update 6/16
Changed the tag stripping to two separate functions, as it was only removing the first instance of an invalid tag, not any nested ones. The change was courtesy of “Meltdown” at the HtmlAgilityPack forum.

Update 11/23/2013
There’s now a new version of this available for PHP. I will no longer be maintaining this branch.

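using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

public static class HtmlUtility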
{
    // Original list courtesy of Robert Beal :
    // http://www.robertbeal.com/37/sanitising-html

    private static readonly Dictionary<string, string[]> ValidHtmlTags =
        new Dictionary<string, string[]>
        {
            {"p", new string[]          {"style", "class", "align"}},
            {"div", new string[]        {"style", "class", "align"}},
            {"span", new string[]       {"style", "class"}},
            {"br", new string[]         {"style", "class"}},
            {"hr", new string[]         {"style", "class"}},
            {"label", new string[]      {"style", "class"}},

            {"h1", new string[]         {"style", "class"}},
            {"h2", new string[]         {"style", "class"}},
            {"h3", new string[]         {"style", "class"}},
            {"h4", new string[]         {"style", "class"}},
            {"h5", new string[]         {"style", "class"}},
            {"h6", new string[]         {"style", "class"}},

            {"font", new string[]       {"style", "class", "color", "face", "size"}},
            {"strong", new string[]     {"style", "class"}},
            {"b", new string[]          {"style", "class"}},
            {"em", new string[]         {"style", "class"}},
            {"i", new string[]          {"style", "class"}},
            {"u", new string[]          {"style", "class"}},
            {"strike", new string[]     {"style", "class"}},
            {"ol", new string[]         {"style", "class"}},
            {"ul", new string[]         {"style", "class"}},
            {"li", new string[]         {"style", "class"}},
            {"blockquote", new string[] {"style", "class"}},
            {"code", new string[]       {"style", "class"}},

            {"a", new string[]          {"style", "class", "href", "title"}},
            {"img", new string[]        {"style", "class", "src", "height", "width",
                "alt", "title", "hspace", "vspace", "border"}},

            {"table", new string[]      {"style", "class"}},
            {"thead", new string[]      {"style", "class"}},
            {"tbody", new string[]      {"style", "class"}},
            {"tfoot", new string[]      {"style", "class"}},
            {"th", new string[]         {"style", "class", "scope"}},
            {"tr", new string[]         {"style", "class"}},
            {"td", new string[]         {"style", "class", "colspan"}},

            {"q", new string[]          {"style", "class", "cite"}},
            {"cite", new string[]       {"style", "class"}},
            {"abbr", new string[]       {"style", "class"}},
            {"acronym", new string[]    {"style", "class"}},
            {"del", new string[]        {"style", "class"}},
            {"ins", new string[]        {"style", "class"}}
        };

    /// <summary>
    /// Takes raw HTML input and cleans against a whitelist
    /// </summary>
    /// <param name="source">Html source</param>
    /// <returns>Clean output</returns>
    public static string SanitizeHtml(string source)
    {
        HtmlDocument html = GetHtml(source);
        if (html == null) return String.Empty;

        // All the nodes
        HtmlNode allNodes = html.DocumentNode;

        // Select whitelist tag names
        string[] whitelist = (from kv in ValidHtmlTags
                              select kv.Key).ToArray();

        // Scrub tags not in whitelist
        CleanNodes(allNodes, whitelist);

        // Filter the attributes of the remaining
        foreach (KeyValuePair<string, string[]> tag in ValidHtmlTags)
        {
            IEnumerable<HtmlNode> nodes = (from n in allNodes.DescendantsAndSelf()
                                           where n.Name == tag.Key
                                           select n);

            if (nodes == null) continue;

            foreach (var n in nodes)
            {
                if (!n.HasAttributes) continue;

                // Copy the node's attributes so entries can be removed while iterating
                HtmlAttribute[] attr = n.Attributes.ToArray();
                foreach (HtmlAttribute a in attr)
                {
                    if (!tag.Value.Contains(a.Name))
                    {
                        a.Remove(); // Wasn't in the list
                    }
                    else
                    {
                        // AntiXss
                        a.Value =
                            Microsoft.Security.Application.Encoder.UrlPathEncode(a.Value);
                    }
                }
            }
        }

        return allNodes.InnerHtml;
    }

    /// <summary>
    /// Takes a raw source and removes all HTML tags
    /// </summary>
    /// <param name="source"></param>
    /// <returns></returns>
    public static string StripHtml(string source)
    {
        source = SanitizeHtml(source);

        // No need to continue if we have no clean Html
        if (String.IsNullOrEmpty(source))
            return String.Empty;

        HtmlDocument html = GetHtml(source);
        StringBuilder result = new StringBuilder();

        // For each node, extract only the innerText
        foreach (HtmlNode node in html.DocumentNode.ChildNodes)
            result.Append(node.InnerText);

        return result.ToString();
    }

    /// <summary>
    /// Recursively delete nodes not in the whitelist
    /// </summary>
    private static void CleanNodes(HtmlNode node, string[] whitelist)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            if (!whitelist.Contains(node.Name))
            {
                node.ParentNode.RemoveChild(node);
                return; // We're done
            }
        }

        if (node.HasChildNodes)
            CleanChildren(node, whitelist);
    }

    /// <summary>
    /// Apply CleanNodes to each of the child nodes
    /// </summary>
    private static void CleanChildren(HtmlNode parent, string[] whitelist)
    {
        for (int i = parent.ChildNodes.Count - 1; i >= 0; i--)
            CleanNodes(parent.ChildNodes[i], whitelist);
    }

    /// <summary>
    /// Helper function that returns an HTML document from text
    /// </summary>
    private static HtmlDocument GetHtml(string source)
    {
        HtmlDocument html = new HtmlDocument();
        html.OptionFixNestedTags = true;
        html.OptionAutoCloseOnEnd = true;
        html.OptionDefaultStreamEncoding = Encoding.UTF8;

        html.LoadHtml(source);

        return html;
    }
}

24 thoughts on “Whitelist sanitize with HtmlAgilityPack”

Barry says:

March 14, 2012 at 7:46 pm

Hi Eksith,

Do I understand correctly that your whitelist approach aggressively strips out an invalid tag and all of its children, whether valid or not?

For example:

string dirtyInput = “Freaky word tag”;
Expected: “Freaky word tag”
But was: “\r\n”

Is there a way to change so that it continues to process child nodes?

Cheers,

Barry.

Barry says:

March 14, 2012 at 9:15 pm

Ah – I notice that in the above my Word tags were stripped! Ha. The input should’ve been wrapped in “<o:p>” tags.

I changed CleanNodes() to this:

private static void CleanNodes(HtmlNode node, string[] whitelist) {
    if (node.NodeType == HtmlNodeType.Element) {
        if (!whitelist.Contains(node.Name)) {
            node.ParentNode.AppendChildren(node.ChildNodes);
            node.ParentNode.RemoveChild(node);
        }
    }

    if (node.HasChildNodes)
        CleanChildren(node, whitelist);
}
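
A possible caveat with the change above: AppendChildren adds the children at the end of the parent, so text from a dropped tag can end up after its later siblings. The following is only a minimal sketch of an alternative that keeps children in their original position; the class and method names (UnwrapSketch, UnwrapDisallowed) are made up for illustration, and it assumes a HtmlAgilityPack version that provides HtmlNode.Remove() and InsertBefore().

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class UnwrapSketch
{
    // Hypothetical helper: drops tags that are not in the whitelist
    // but keeps their children where the tag used to be.
    public static void UnwrapDisallowed(HtmlNode root, string[] whitelist)
    {
        // Snapshot first, so editing the tree does not disturb the enumeration
        List<HtmlNode> disallowed = root.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element
                        && !whitelist.Contains(n.Name))
            .ToList();

        foreach (HtmlNode node in disallowed)
        {
            HtmlNode parent = node.ParentNode;

            // Move each child in front of the tag being dropped, in order
            foreach (HtmlNode child in node.ChildNodes.ToList())
            {
                child.Remove();
                parent.InsertBefore(child, node);
            }

            parent.RemoveChild(node);
        }
    }
}
```

Note that unwrapping keeps the text of every dropped tag, including script and style blocks, so tags like those are probably still better removed together with their contents.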

- eksith says:
  
  March 15, 2012 at 10:23 pm
  
  Hi Barry,
  
  Yes, my approach strips out all child tags as well. For my case, this was appropriate, but your approach would be very useful for someone who wants to keep processing the child nodes.
  
  The one hiccup I can see is that if the child tags are valid, but the parent wasn’t, this may mess up your page formatting.
  
  I.E. If a user misspelled <table> as <tbale>, any <tr> tags would still remain. If the output was inside a grid, the other rows would be affected.
  
  But, thanks for sharing!
  
- Aleksey says:
  
  October 3, 2012 at 4:14 am
  
  For one of my projects I also needed to leave child nodes even if tag is not in white list. In white list I included only and tags. When I tried to apply your variant of CleanNodes() method for input html:
  Some text – another text More Text
  I got:
  Some – another More Texttexttext
  As you can see, there is some problem with your solution.
  But eventually, I found working solution made using Regexp here: http://code.commongroove.com/2012/06/05/c-regular-expressions-to-filter-html-to-a-whitelist-of-allowable-tags/.
  For mentioned input it gave me the following expected output:
  Some text – another text More Text
  
Mickael says:

August 30, 2012 at 8:41 am

There is a “problem” with the class. If I have a string like that : “Test 1 2 Test Hello World” ” will be deleted =/…

- eksith says:
  
  August 31, 2012 at 12:19 pm
  
  Hi Mickael,
  
  I think some of that formatting got lost to the WordPress filter. If it’s a plain sentence with extra quotes, there shouldn’t be a problem. If there are broken tags that are causing this, then make sure you have the latest version of the HtmlAgilityPack. That is what’s used to parse the whole page for tag validity before filtering.
  
Natd says:

September 4, 2012 at 12:47 pm

Hi Eksith. Very helpful code. I’m trying to modify it and have a couple of questions.
How can I check whether a node has a parent node with a certain name or type?

Some people want to post code in comments, and code or HTML inside those blocks should not be sanitized.
We can check whether a node has a parent node with a certain name:
if (n.ParentNode.Name != "code"){
//ok sanitize here
}

but that only works for one level of nesting; here is an example: http://pastebin.com/8wcueMUn

Briefly, we need to skip sanitizing everything inside code and source HTML tags.
Any suggestions?
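
One way to look past the immediate parent is to walk all of a node's ancestors. The following is only a sketch, assuming HtmlAgilityPack's HtmlNode.Ancestors() method; the class and method names (AncestorCheck, IsInside) are made up for illustration.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class AncestorCheck
{
    // Hypothetical helper: true when the node sits inside any of the
    // named tags, at any nesting depth (not just the direct parent).
    public static bool IsInside(HtmlNode node, params string[] tagNames)
    {
        return node.Ancestors().Any(a => tagNames.Contains(a.Name));
    }
}
```

A call such as `if (!AncestorCheck.IsInside(n, "code", "source")) { /* sanitize here */ }` would then skip everything nested under those tags.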

- eksith says:
  
  September 5, 2012 at 9:00 pm
  
  Hi Natd, glad you found it helpful.
  
  Ironically, I ran into the exact same problem, so I updated the code in another post here. It also has some… er… “feedback” about AntiXSS, so you’ll need to scroll to the bottom of the post ;)
  
mahdi says:

April 14, 2013 at 1:04 pm

Hi
A big thanks for your work. I really appreciate it.
I have a question: how can I add to the whitelist? I added rowspan right after colspan, but it throws an error.
Here is the changed part:
{"td", new string[] {"style", "class", "colspan"}},
{"td", new string[] {"style", "class", "rowspan"}},

What am I doing wrong?
thanks
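
The error most likely comes from the collection initializer itself: each `{key, value}` pair is passed to Dictionary.Add, which throws an ArgumentException when a key is added twice. A small sketch of both the failure and a single merged entry follows; the class and method names (WhitelistEntryDemo, DuplicateKeyThrows, MergedEntry) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class WhitelistEntryDemo
{
    // Returns true when building a dictionary with two "td" keys throws.
    public static bool DuplicateKeyThrows()
    {
        try
        {
            var tags = new Dictionary<string, string[]>
            {
                {"td", new string[] {"style", "class", "colspan"}},
                {"td", new string[] {"style", "class", "rowspan"}}
            };
            return false;
        }
        catch (ArgumentException)
        {
            return true; // the second "td" key was rejected by Add
        }
    }

    // One entry per tag, with every allowed attribute listed together.
    public static Dictionary<string, string[]> MergedEntry()
    {
        return new Dictionary<string, string[]>
        {
            {"td", new string[] {"style", "class", "colspan", "rowspan"}}
        };
    }
}
```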

- eksith says:
  
  April 14, 2013 at 4:52 pm
  
  Hi Mahdi,
  
  That’s two of the same ;)
  You can replace both with just this one:
  
  {"td", new string[] {"style", "class", "colspan", "rowspan"}},
  
aer0 says:

June 24, 2014 at 12:06 pm

allNodes.InnerHtml on line 107 didn’t seem to be properly saving all the changes made. Switched it to allNodes.WriteContentTo() and it works like a charm now. May be a bug with the newest HtmlAgilityPack but just an FYI.

- eksith says:
  
  June 25, 2014 at 11:40 pm
  
  Thank you for this!
  
  I really need to rewrite this whole thing when I have the time. There are a few new things introduced in newer HtmlAgilityPack versions that I didn’t get to use when I first put this together.
  
  If you come across any other bugs/improvements, please don’t hesitate to share.
  
  Thanks for dropping by.
  
davebac says:

August 8, 2014 at 6:27 pm

You should seriously consider if it is safe to have class and style in there.

Style’s issue is easy:
content: url('badguys.com');
background-image: url('google.com/tracker');
position: fixed;
top: 0;
right: 0;

Class is because of this:
$('.delete').click(item.doDelete);

That is — if the page contains scripts that are using jQuery to match based on class names, those will match in user content also. And so the attacker can add controls on the page. Later, those controls can be clicked by a higher privileged user.

If you do allow these — you need to be extremely cautious. You more or less need a full CSS parser, and you need to white list the specific properties you’re going to allow to be set as well as the values. As the CSS working group continuously expands syntax for the properties, and valid values for the properties, there’s no way to assume what is safe today will be safe tomorrow.

Which is the point in the white list in the first place.
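
As a rough illustration of whitelisting both properties and values, here is a deliberately naive sketch. It is not a CSS parser: it splits on `;` and `:` and rejects any value containing characters outside a small safe set, so `url(...)` and similar constructs are dropped. The property list and the names (StyleFilter, FilterStyle) are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StyleFilter
{
    // Hypothetical allowlist; adjust to what your pages actually need.
    private static readonly HashSet<string> AllowedProperties =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "color", "font-weight", "font-style", "text-align", "text-decoration" };

    public static string FilterStyle(string style)
    {
        if (String.IsNullOrWhiteSpace(style)) return String.Empty;

        var kept = new List<string>();
        foreach (string declaration in style.Split(';'))
        {
            int colon = declaration.IndexOf(':');
            if (colon <= 0) continue; // not of the form "property: value"

            string property = declaration.Substring(0, colon).Trim();
            string value = declaration.Substring(colon + 1).Trim();

            if (!AllowedProperties.Contains(property)) continue;

            // Conservative value check: reject anything with quotes,
            // parentheses, slashes, backslashes, and so on.
            if (value.Length == 0 || !value.All(IsSafeValueChar)) continue;

            kept.Add(property.ToLowerInvariant() + ": " + value);
        }
        return String.Join("; ", kept);
    }

    private static bool IsSafeValueChar(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == ' ' || c == '#' || c == '%' || c == '.' || c == '-';
    }
}
```

The class attribute could be handled the same way, by keeping only class names from a known list.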

- eksith says:
  
  August 10, 2014 at 12:20 am
  
  Hi Dave, thanks for dropping by.
  
  Unfortunately, that’s thoroughly beyond the purview of this class, which I threw together for a personal project of mine. It seems MS still hasn’t come up with a suitable alternative since they broke the AntiXSS suite, which is why I had to write this and use HtmlAgilityPack. I also have a newer version of this class, written since this was posted.
  
  It’s true that a lot of things can change in the future, which is why the list can be replaced by anyone using the class (many already have it in production with a much smaller subset of attributes and tags).
  
Trent Kerin says:

August 26, 2014 at 9:21 pm

This code has a vulnerability when used in combination with IE9 or earlier: It doesn’t strip out conditional comments, so you can just stuff in some script tags inside the conditional comment. I’ll try pasting in some example markup; hopefully it won’t get mangled:

<!--[if IE]><script type='text/javascript'>alert('hacked');</script><![endif]-->

Of course, the problem is the crazy behaviour of IE to execute comments, but this sanitisation code should still work to remove the threat.

Plus, the a tag’s href attribute and the img tag’s src attribute should both be sanitised to ensure they don’t have script in them.
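
For the href and src point, one possible approach is to accept only relative URLs and a short list of schemes. The sketch below is an assumption-laden example, not a complete defence: the scheme list and the names (UrlFilter, IsSafeUrl) are made up, and it expects the raw attribute text, which it entity-decodes once before checking.

```csharp
using System;
using System.Linq;
using System.Net;

public static class UrlFilter
{
    // Hypothetical allowlist of schemes
    private static readonly string[] AllowedSchemes = { "http", "https", "mailto" };

    public static bool IsSafeUrl(string value)
    {
        if (String.IsNullOrWhiteSpace(value)) return false;

        // Decode entities such as &#58; so an encoded colon cannot hide a scheme
        string decoded = WebUtility.HtmlDecode(value);

        // Browsers tend to ignore tabs, newlines and other control characters
        // inside URLs, so strip them before looking for the scheme
        string cleaned = new string(decoded.Where(c => !Char.IsControl(c)).ToArray()).Trim();
        if (cleaned.Length == 0) return false;

        int colon = cleaned.IndexOf(':');
        if (colon < 0) return true; // no scheme at all: a relative URL

        // A ':' after the first '/', '?' or '#' is part of the path or query
        int delimiter = cleaned.IndexOfAny(new[] { '/', '?', '#' });
        if (delimiter >= 0 && delimiter < colon) return true;

        string scheme = cleaned.Substring(0, colon).ToLowerInvariant();
        return AllowedSchemes.Contains(scheme);
    }
}
```

Attributes that fail the check would simply be removed, in the same place where the class already removes attributes that are not in the list.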

- eksith says:
  
  August 27, 2014 at 10:02 am
  
  Thanks for pointing that out, Trent.
  
  Yes this is a vulnerability that I need to fix, but it’s going to be part of a laundry list that I need to reevaluate and post sometime. Meanwhile, there’s a newer version of this since AntiXSS is now broken.
  
  Unfortunately, I’ve moved on from most Microsoft technologies by and large so I don’t have as much time as I used to in order to revisit all the code I’ve posted here.
  
Luis Torres says:

January 31, 2018 at 6:08 pm

The page where you got this code doesn’t exist anymore. Is there any way to get the original code and see whether its license allows everyone to use it?

I would like to use your code in my projects, but I don’t know whether you allow it. Is all your code under the MIT license? Will there be any issues?

- Luis Torres says:
  
  January 31, 2018 at 6:13 pm
  
  I know you don’t maintain this anymore, but if you could add the license text to the code, that would be wonderful :)
  