Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<th> is stripped even though it's an allowed tag if not part of full table markup #110

Open
robertc opened this issue Mar 2, 2017 · 6 comments

Comments

@robertc
Copy link

robertc commented Mar 2, 2017

sanitizer.Sanitize("<th>Header</th>")

Result is Header. However:

sanitizer.Sanitize("<table><tr><th>Header</th></tr></table>")

Result is <table><tbody><tr><th>Header</th></tr></tbody></table>

My expectation was that when parsing a fragment rather than a document the full structure would not be required.

@mganss
Copy link
Owner

mganss commented Mar 2, 2017

In the first example, the th is not stripped by the sanitizer but by the HTML parser because it's invalid HTML. This has nothing to do with document vs. fragment, i.e. it would be the same situation if you had a stray th in a full document. The parser behaves identically as in a browser:

var e = document.createElement("div");
e.innerHTML = "<th>Header</th>";
// e.innerHTML === "Header"

@panetta-net-au
Copy link

Are you saying that HtmlSanitizer isn't intended to parse fragments?

The context matters for a fragment because your example works fine with a tr instead of div

var e = document.createElement("tr");
e.innerHTML = "<th>Header</th>";
// e.innerHTML === "<th>Header</th>"

If fragment parsing is intended, then you would probably need to add a context parameter to the Santize method.

sanitizer.Sanitize("<th>Header</th>", context: "tr")

or

sanitizer.Sanitize("<th>Header</th>", context: new [] { "table", "tr" })

Or, a user would have to workaround by manually adding the full context to the string before calling the Sanitize method and removing it after.

I have no idea if this is feasible - just thought I'd make a comment from an API perspective.
Obviously if the parser isn't intended to handle fragments, that's a design decision and the workaround would be the way to go.

@tiesont
Copy link

tiesont commented Apr 21, 2017

@panetta-net-au Well, the HTML parser is AngleSharp, which in theory is supposed to give you the same "corrected" markup as most browsers (meaning it will try to correct fragments if possible). A <th> by itself would not be valid markup, so it gets stripped. I assume you'd see the same behavior with orphaned <li>, <dt>, or <dd> elements, for example.

@panetta-net-au
Copy link

I suppose it depends how you define a fragment. If a fragment is a node or nodes that can exist as a direct descendent of body - then yeah, I agree - but if a fragment is a node that can exist anywhere in the tree, then I'd say they're not currently supported.

Regardless, the workaround is probably to wrap your orphaned fragments in the context they'll be used in before sending it through AngleSharp.

Whether that's the responsibility of the user or this library - I think that is the debate.

IMHO, if the user has access to the anglesharp document before and after sanitisation, that's probably enough to do what I suggested earlier.

@mganss
Copy link
Owner

mganss commented Apr 21, 2017

@panetta-net-au You're right, currently the context of fragments is always <body>...</body>. If you want another context, the workaround is to wrap the input with your context before sanitization and then strip it afterwards:

var sanitizer = new HtmlSanitizer();
var html = @"<th>Header</th>";
var actual = sanitizer.Sanitize("<table>" + html + "</table>");
actual = actual.Remove(0, "<table><tbody><tr>".Length);
actual = actual.Remove(actual.Length - "</tr></tbody></table>".Length);

I have no plans to add a context feature to the library, but PRs are always welcome 😉 It might just be a couple of lines of code. I know AngleSharp accepts a context parameter when parsing fragments.

@Xyncgas
Copy link

Xyncgas commented Apr 15, 2021

I mean you can just use regex, regardless, I fee like this behavior was to prevent tag poisoning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants