Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python Markdown turns </#rrggbb> into HTML comment, even while in inline code #1425

Closed
Andre601 opened this issue Jan 1, 2024 · 11 comments · Fixed by #1426
Closed

Python Markdown turns </#rrggbb> into HTML comment, even while in inline code #1425

Andre601 opened this issue Jan 1, 2024 · 11 comments · Fixed by #1426
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code.

Comments

@Andre601
Copy link

Andre601 commented Jan 1, 2024

When you put </#rrggbb> as-is inside a inline code block will Python Markdown convert it into a HTML comment (<!--#rrggbb-->), which feels odd.

Is there a way to prevent this? Using HTML escape methods such as &lt;/#rrggbb&gt; will not work, as Python Markdown will then treat the content literally and show it as-is.
Additionally can and will this cause issues with inline-code highlighting used.

@Andre601
Copy link
Author

Andre601 commented Jan 1, 2024

For context am I using Python Markdown through MkDocs, but this seems to be generally an issue as pointed out in this comment.

The full discussion for context: mkdocs/mkdocs#3537

@facelessuser
Copy link
Collaborator

I imagine this is a quirk with the HTML parser. I'm not sure off the top of my head what specifically in the parser is causing this, but I would imagine this could be considered a bug.

@Andre601
Copy link
Author

Andre601 commented Jan 1, 2024

My only guess is that the parser turns invalid HTML tags into comments to avoid possible breaking of the content.
Either that or </#...> is for whatever reason considered a format for HTML comments and the parser converts it.

I really hope this can be fixed, should it be a bug, or that there is a workaround for this.
Escaping does not work as mentioned, since the output would be given as-is then, meaning the <> wouldn't be displayed.

@facelessuser
Copy link
Collaborator

What I find most concerning is the content isn't preserved for inline code:

>>> markdown.markdown('`</#rrggbb/>`')
'<p><code>&lt;!--#rrggbb/--&gt;</code></p>'

The parser should not be making changes in this way.

@mondeja
Copy link
Contributor

mondeja commented Jan 2, 2024

This is treated as a bogus comment by the HTML parser of Python:

from html.parser import HTMLParser

class Parser(HTMLParser):
    def handle_comment(self, v):
        print("Comment:", v)
    def handle_data(self, v):
        print("Data:", v)

Parser().feed('<#rgba>Foo</#rgba>')
# prints:
# Data: <
# Data: #rgba>Foo
# Comment: #rgba

See these lines: https://github.com/python/cpython/blob/936135bb97fe04223aa30ca6e98eac8f3ed6b349/Lib/html/parser.py#L277C15-L286

Perhaps Python-Markdown should parse backticks before html blocks, doing replacements as needed. If not, you can't put lazy bogus HTML comments inside code blocks:

>>> markdown.markdown("`</? p>Hi</p>`\n\n```\n</+ ups>\n```")
'<p><code>&lt;!--? p--&gt;Hi&lt;/p&gt;</code></p>\n<p>```</p>\n<!--+ ups-->\n\n<p>```</p>'

@waylan
Copy link
Member

waylan commented Jan 2, 2024

There are multiple different factors which are resulting in this unexpected output.

First, the Markdown parser passes the entire document to an HTML parser to identify and extract all raw HTML blocks. Anything that is not a raw HTML block (as defined by Markdown) is (supposedly) passed through unalerted. Later, the Markdown parser then handles anything which looks like HTML tags in the inline parser after code spans are parsed. In other words, the HTML block parser passes inline tags through as unaltered plain text, and that is parsed a second time by a much simpler parser (which just finds all content between opening and closed angle brackets). Unfortunately, in a few rare occasions, invalid HTML can cause the HTML block parser to alter the text passed through in some odd way, which appears to be the case here.

It has been our position that it is not our responsibility to properly handle invalid HTML. Bad input results in bad output, even if that output is in an unexpected format. Of course, the issue here is that the invalid HTML is in a code span, and so a user should be able to presume that it would not be parsed by any HTML parser and be preserved as-is. Unfortunately, the way the current HTML block parser is implemented, it's not possible to completely avoid passing any part of the document through the HTML block parser. All we can do is try to properly identify which parts should be passed through unaltered.

So, now lets take a look as what the HTML block parser is doing. The string </#rrggbb> is invalid HTML. Technically, the upstream HTML parser is handling the error correctly. See the HTML spec at 13.2.5.7 End tag open state. The # character is not valid here, which puts the parser into bogus comment state. So the question is what should we be doing with the bogus comment state? There are various different ways the HTML parser can get into that state, so the same solution may not work for each. That may require some additional investigation.

As noted by @mondeja above, the bogus comment state is handled by the HTML parser here:

https://github.com/python/cpython/blob/936135bb97fe04223aa30ca6e98eac8f3ed6b349/Lib/html/parser.py#L277-L286

Note that the parse_bogus_comment function accepts a report argument which, when False (or anything which may equate to False such as 0) simply returns the content of the tag as plain data rather than as a comment. In this specific case, that is likely the desired behavior. However, none of the calls to parse_bogus_comment within the HTML parser actually set report to False and I'm not sure why that is. In any event, it could be that our subclass could define this method, change the default value to False and then call the method of the parent class. I'm assuming that that should cause the HTML parser to simply pass any "bogus comments" through as plain text. The question is if that is the correct behavior for every instance of bogus comment state or only this one. If that's the correct behavior for every case, then that is an easy fix. If not, then I'm not sure how to proceed.

@waylan waylan added bug Bug report. core Related to the core parser code. confirmed Confirmed bug report or approved feature request. labels Jan 2, 2024
@waylan
Copy link
Member

waylan commented Jan 2, 2024

Looking through the HTML parser code, there are only two situations in which parse_bogus_comment gets called (the HTML spec lists a couple more, but those seem to be handled in a different way in the code so we can ignore them here).

  1. <! is followed by an invalid character (in parse_html_declaration), or
  2. </ is followed by an invalid character (in parse_endtag).

The first is obvious. The presumption is that a comment was intended, but the required -- is missing. Of course, it could be an invalid HTML declaration rather than a comment. But that is not a distinction we need to make. However, it is not clear to me how anyone would think that </ was intended to begin a comment.

In any event, when I originally worked on this, I missed the second case. My thinking was that the if we added -- to the occasional invalid comment it would do no harm as that is probably what the user intended anyway. And in the case where it was supposed to be a declaration, well that would clue the user in to the fact that they have something invalid in their declaration.

On further reflection, I think it is better to just pass either case through as plain text. In that way we don't make any modifications to the user's input. An added benefit is that when the invalid HTML is in a code span, we aren't messing with it. And for the few occasions out there where a user was taking advantage of the existing behavior to get a comment where they wanted one, well, they were abusing a bug so they shouldn't expect it to continue to work with future updates.

@waylan
Copy link
Member

waylan commented Jan 2, 2024

The solution to #1066 has a side effect of preventing various edge cases of invalid comments (tags that start with <!) from being treated as comments. So my proposed change would result in consistent behavior for all "bogus comments."

@Andre601
Copy link
Author

Andre601 commented Jan 2, 2024

However, it is not clear to me how anyone would think that </ was intended to begin a comment.

I think that's related to the spec you linked before, where it states to define this as a comment, should it be an invalid opening/closing tag from what I read.

Tho, in all honesty do I wonder why this behaviour was chosen for that spec instead of something like escaping the tag itself (so turning </#invalid> into &lt;/#invalid&gt;)... I feel like this behaviour destroys a lot more than it protects... But it's HTML, so what else should you expect here?

waylan added a commit to waylan/markdown that referenced this issue Jan 3, 2024
As with most implementations, we now pass through bogus comments (as
defined by the HTML Spec) unaltered except that they are HTML escaped.
This deviates from the reference implementation which compeltely ignores
them. As the reference implementation seems to not have even contemplated
their existance, it is not being used as a reference in this instance.
Fixes Python-Markdown#1425.
@waylan
Copy link
Member

waylan commented Jan 3, 2024

I really hope this can be fixed, should it be a bug, or that there is a workaround for this.

I have a fix in #1426. However, in the meantime, you can use raw HTML as a workaround:

<code>&lt;/#rrggbb&gt;</code>

As a reminder, wrapping your escaped HTML in raw code tags will prevent Markdown from further manipulating it. And because you are using raw HTML you need to do your own escaping anyway, which prevents the bug from triggering. As you wanted a code span anyway, this works. Of course, for instances where there is no code span involved, this doesn't help.

waylan added a commit that referenced this issue Jan 3, 2024
As with most implementations, we now pass through bogus comments (as
defined by the HTML Spec) unaltered except that they are HTML escaped.
This deviates from the reference implementation which completely ignores
them. As the reference implementation seems to not have even contemplated
their existence, it is not being used as a reference in this instance.
Fixes #1425.
@Andre601
Copy link
Author

Andre601 commented Jan 5, 2024

Something small I want to mention here real quick is an interesting exception to that behaviour discussed here...

That exception being uppercased letters it seems.
On one of the markdown files is </#RRGGBB> used, which gets parsed unchanged.... It's the exact same situation (In inline code), so not sure what the HTML parsing does different here...

Just wanted to mention this. Doesn't really change much I think.

EDIT: Actually, I was wrong. That above case was in a code block, so surrounded by pre tags in the end...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants