-
Notifications
You must be signed in to change notification settings - Fork 874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stand-alone * or _ lead to incorrect results #1300
Comments
I'm curious if the behavior changed here in #792 (version 3.2) or #629 (version 3.0). In other words, did we introduce this behavior in either of those changes or has it always been this way? @facelessuser you are the most familiar with the current cade as you made those changes. Any thoughts? I won't have time to take a closer look until next week at the earliest. |
This has nothing to do with the surrounding #629 did not cause this, I am certain of that. Could #805 (the fix for #792) have caused this? It is tough to tell without knowing what it was doing prior to this. No tests were changed in that review, only new ones added. I'm skeptical whether this changed the behavior, but I'd have to check.
Anyways, the first task would be to see if we broke any previous behavior. If we did not break existing behavior, then this would be a new request. Then we would need to determine specifically what rule we are breaking (if any), and if so, see if we could add the necessary intelligence. |
As far as #805 (the fix for #792) being the cause, I see no evidence. I reverted the changes related to ➜ markdown git:(master) ✗ python3
Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import markdown
>>> markdown.markdown('**_ _ _ N _ _ _ _ _ _ _ R S _ _ N _**')
'<p><strong><em> _ </em> N _ <em> _ </em> _ <em> _ R S _ </em> N _</strong></p>'
>>> exit()
➜ markdown git:(master) ✗ python3
Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import markdown
>>> markdown.markdown('**_ _ _ N _ _ _ _ _ _ _ R S _ _ N _**')
'<p><strong><em> _ </em> N _ <em> _ </em> _ <em> _ R S _ </em> N _</strong></p>' |
It is highly doubtful that #629 is the culprit either. The only extensions that were drastically changed were the link patterns. The bold and emphasis extensions began using the new processor, but their fundamental patterns did not change, nor any sequencing of the extensions. |
So bold and italic are handled in separate passes as they've always been done. At no point as Python Markdown ever not parsed |
I finally had some time to look at this. It appears that we are ignoring this provision of the rules:
And that is a bug. That is also why I wondered if we had introduced a change in any of the more recent changes. Regardless of how the current behavior was introduced, we do need to fix this. That said, the fix is not exactly clear. For example, the first and last underscore in the example input are not 'surrounded' by spaces as there is no space between the asterisk and the underscore in each instance. However, there are clearly instances where underscores which are surrounded by spaces are not being treated as literal underscores in the example. To further complicate matters, we do sometimes properly ignore underscores surrounded by spaces. Consider this simple input:
... which properly renders as...
I haven't looked at the code to try to understand what is going on exactly, but something is amiss. |
This is not a problem with my >>> markdown.markdown('**_ _ _ N _ _ _ _ _ _ _ R S _ _ N _**', extensions=['pymdownx.betterem'])
'<p><strong>_ _ _ N _ _ _ _ _ _ _ R S _ _ N _</strong></p>' I know in my package, this is done by rejecting a |
I only checked the code once. It may sound a bit cheeky, but a single underscore is no problem.
renders as
1. Thought:I wonder why in the handler of the SimpleTextInlineProcessor match group 1 (
However, this modification gives this result for the example above:
Now the first and last underscore after the double asterisk is interpreted as a markdown sign, although this does not represent the beginning and end of a text. 2. Thought:The original code renders
to
which is also wrong and leads me to suspect that something might be wrong with the joined |
The extension is found in this extension pack: https://pypi.org/project/pymdown-extensions/. Keep in mind, it may not match all corner cases if compared to some other parser, but it does aim to parse strong and italic cases in as sane a way as possible using the current methodology that Python Markdown employs. |
Yeah, this is how most implementations actually work in practice. In fact, the Commonmark spec codifies this behavior in great detail. However, the original rules explicitly state that the character should be rejected as a token if it 'surrounded' by spaces. Of course, the rule is vague and leaves out a bunch of edge cases (white space other than spaces, punctuation, etc). That said, we can't get too hung up on the original rules as we explicitly don't follow them anyway with underscores to avoid middle-word emphasis. In the end we should probably strive to follow the behavior of the reference implementation with asterisks and copy that behavior to underscores minus middle-word emphasis. |
I think it will be key to identify which missing rules specifically we are going to implement. Once we've identified that, I think we can move forward with a solution. |
The simple solution would be to adjust the markdown/markdown/inlinepatterns.py Lines 156 to 157 in e97ffeb
Note that that regex consumes the space both before and after the standalone character. That explains why the example input seems to match every other underscore (the space before was consumed by the previous character match). However, if it did not consume the space (maybe by using a negative lookahead assertion; i.e.: Another approach would be to build the whitespace checking into the various regexs directly (and eliminating the need for Upon further reflection, I think that perhaps the requirement that the character is not to be recognized as a token if it is 'surrounded' by whitespace is a red herring. The opening token not followed by whitespace and closing token not preceded by whitespace rule is going to cover any case in normal prose. Edge cases outside of that should be relatively rare and can use backslash escapes to avoid a match. And there is no backward compatibility issue here as users will have needed to use more backslash escapes in documents authored prior to any change we make. There is the issue of document authors having left extra white space in an older document which previously rendered with emphasis but will not after the change. I think we can ignore that issue as the document was not properly formatted. The user only didn't notice because of a bug in our parser. However, as our parser is clearly acting in opposition to the rules, the author should have expected a future update to alter the behavior and edited their document accordingly. So we have two possible paths forward:
I think the deciding factor here is what set of rules we want to follow. If we want to strictly match the original rules as written ('surround'), then we should go with option 1, but if we want to more closely match all other implementations (only disallow whitespace on one side of the token), then option 2 makes more sense. In the real world, I think either approach would be indistinguishable from the other in all but the most unusual edge cases. Therefore, I am inclined to accept whichever one is submitted in a PR so long as the tests demonstrate that it covers all of the expected cases. |
Btw, the minimum failing test is the following:
which renders as <p>Foo * <em> * </em></p>
<p>Bar _ <em> _ </em></p> while is should render as <p>Foo * * * *</p>
<p>Bar _ _ _ _</p> By way of explanation, we must include something on the line to ensure it is not a horizonal rule (I used the words Given how narrow of an edge case it is, I think probably the simple fix of updating |
The `NOT_STRONG_RE` regex matchs 1, 2, or 3 * or _ which are surrounded by white space to prevent them from being parsed as tokens. However, the surrounding white space should not be consumed by the regex, which is why lookhead and lookbehind assertions are used. As `^` cannot be matched in a lookbehind assertion, it is left outside the assertion, but as it is zero length, that should not matter. Tests added and/or updated to cover various edge cases. Fixes Python-Markdown#1300.
The `NOT_STRONG_RE` regex matchs 1, 2, or 3 * or _ which are surrounded by white space to prevent them from being parsed as tokens. However, the surrounding white space should not be consumed by the regex, which is why lookhead and lookbehind assertions are used. As `^` cannot be matched in a lookbehind assertion, it is left outside the assertion, but as it is zero length, that should not matter. Tests added and/or updated to cover various edge cases. Fixes #1300.
A text with several stand-alone underscores is not converted correctly.
input
output
expected
It seems that underscores with the surrounding spaces are replaced by the
placeholder
.In the original on https://daringfireball.net/projects/markdown/dingus it is converted correctly.
The text was updated successfully, but these errors were encountered: