More carefully separate inline text from block content #120

jsm28 · 2024-04-09T16:59:38Z

There are various cases in which inline text fails to be separated by (sufficiently many) newlines from adjacent block content. A paragraph needs a blank line (two newlines) separating it from prior text, as does an underlined header; an ATX header needs a single newline separating it from prior text. A list needs at least one newline separating it from prior text, but in general two newlines (for an ordered list starting other than at 1, which will only be recognized given a blank line before).

To avoid accumulation of more newlines than necessary, take care when concatenating the results of converting consecutive tags to remove redundant newlines (keeping the greater of the number ending the prior text and the number starting the subsequent text).

This is thus an alternative to #108 that tries to avoid the excess newline accumulation that was a concern there, as well as fixing more cases than just paragraphs, and updating tests.

Fixes #92

Fixes #98

There are various cases in which inline text fails to be separated by (sufficiently many) newlines from adjacent block content. A paragraph needs a blank line (two newlines) separating it from prior text, as does an underlined header; an ATX header needs a single newline separating it from prior text. A list needs at least one newline separating it from prior text, but in general two newlines (for an ordered list starting other than at 1, which will only be recognized given a blank line before). To avoid accumulation of more newlines than necessary, take care when concatenating the results of converting consecutive tags to remove redundant newlines (keeping the greater of the number ending the prior text and the number starting the subsequent text). This is thus an alternative to matthewwithanm#108 that tries to avoid the excess newline accumulation that was a concern there, as well as fixing more cases than just paragraphs, and updating tests. Fixes matthewwithanm#92 Fixes matthewwithanm#98

alexei-osipov · 2024-05-27T17:25:52Z

Not sure that all suggested changes are good.

For instance, currently

md('<h3>\n\nHello</h3>') == '### Hello\n\n'

new behavior with that MR:

md('<h3>\n\nHello</h3>') == '\n### Hello\n\n'

I don't think that this is good approach. That "\n" values should be removed. There are no reason to preserve them in such case.

However the situation should be different if something precedes <h3> tag.

md('abc<h3>Hello</h3>') == 'abc\n\n### Hello\n\n'

Please not that leading "\n" here are not significant.

jsm28 · 2024-05-27T17:32:35Z

If you want to avoid leading newlines at the start of the overall output, that could be done in convert_soup.

This fixes various issues relating to how input whitespace is handled and how wrapping handles whitespace resulting from hard line breaks. This PR uses a branch based on that for matthewwithanm#120 to avoid conflicts with the fixes and associated test changes there. My suggestion is thus first to merge matthewwithanm#120 (which fixes two open issues), then to merge the remaining changes from this PR. Wrapping paragraphs has the effect of losing all newlines including those from `<br>` tags, contrary to HTML semantics (wrapping should be a matter of pretty-printing the output; input whitespace from the HTML input should be normalized, but `<br>` should remain as a hard line break). To fix this, we need to wrap the portions of a paragraph between hard line breaks separately. For this to work, ensure that when wrapping, all input whitespace is normalized at an early stage, including turning newlines into spaces. (Only ASCII whitespace is handled this way; `\s` is not used as it's not clear Unicode whitespace should get such normalization.) When not wrapping, there is still too much input whitespace preservation. If the input contains a blank line, that ends up as a paragraph break in the output, or breaks the header formatting when appearing in a header tag, though in terms of HTML semantics such a blank line is no different from a space. In the case of an ATX header, even a single newline appearing in the output breaks the Markdown. Thus, when not wrapping, arrange for input whitespace containing at least one `\r` or `\n` to be normalized to a single newline, and in the ATX header case, normalize to a space. Fixes matthewwithanm#130 (probably, not sure exactly what the HTML input there is) Fixes matthewwithanm#88 (a related case, anyway; the actual input in matthewwithanm#88 has already been fixed)

This improves the markdownify logic for cleaning up input whitespace that has no semantic significance in HTML. This PR uses a branch based on that for matthewwithanm#150 (which in turn is based on that for matthewwithanm#120) to avoid conflicts with those fixes. The suggested order of merging is just first to merge matthewwithanm#120, then the rest of matthewwithanm#150, then the rest of this PR. Whitespace in HTML input isn't generally significant before or after block-level elements, or at the start of end of such an element other than `<pre>`. There is some limited logic in markdownify for removing it, (a) for whitespace-only nodes in conjunction with a limited list of elements (and with questionable logic that ony removes whitespace adjacent to such an element when also inside such an element) and (b) only for trailing whitespace, in certain places in relation to lists. Replace both those places with more thorough logic using a common list of block-level elements (which could be expanded more). In general, this reduces the number of unnecessary blank lines in output from markdownify (sometimes lines with just a newline, sometimes lines containing a space as well as that newline). There are open issues about cases where propagating such input whitespace to the output actually results in badly formed Markdown output (wrongly indented output), but matthewwithanm#120 (which this builds on) fixes those issues, sometimes leaving unnecessary lines with just a space on them in the output, which are dealt with fully by the present PR. There are a few testcases that are affected because they were relying on such whitespace for good output from bad HTML input that used `<p>` or `<blockquote>` inside header tags. To keep reasonable output in those cases of bad input now input whitespace adjacent to those two tags is ignored, make the `<p>` and `<blockquote>` output explicitly include leading and trailing spaces if `convert_as_inline`; such explicit spaces seem the best that can be done for such bad input. Given those fixes, all the remaining changes needed to the expectations of existing tests seem like improvements (removing useless spaces or newlines from the output).

This fixes problems with the markdownify logic for indentation inside list items. This PR uses a branch building on that for matthewwithanm#120, matthewwithanm#150 and matthewwithanm#151, so those three PRs should be merged first before merging this one. There is limited logic in markdownify for handling indentation in the case of nested lists. There are two major problems with this logic: * As it's in `convert_list`, causing a list to be indented when inside another list, it does not add indentation for any other elements such as paragraphs that may be found inside list items (or `<pre>`, `<blockquote>`, etc.), so such elements are wrongly not indented and terminate the list in the output. * It uses fixed indentation of one tab. Following CommonMark, a tab in Markdown is considered equivalent to four spaces, which is not sufficient indentation in ordered list items with a number of three or more digits. Fix both of these issues by making `convert_li` handle indentation for the contents of `<li>`, based on the length of the list item marker, rather than doing it in `convert_list` at all.

jsm28 mentioned this pull request May 24, 2024

Headers are not handled well when combined with details/summary tags #129

Closed

jsm28 mentioned this pull request Jun 12, 2024

fix unnecessary /n newlines resulting in a lot of white area in the markdown #131

Closed

jsm28 mentioned this pull request Sep 9, 2024

Space before <ol> or <ul> element incorrectly indents the first list item #144

Closed

Merge branch 'develop' into para-newlines-92-98

Loading
Loading status checks…

4399ee7

jsm28 mentioned this pull request Oct 3, 2024

Fix whitespace issues around wrapping #150

Merged

jsm28 mentioned this pull request Oct 3, 2024

More thorough cleanup of input whitespace #151

Merged

jsm28 mentioned this pull request Oct 3, 2024

Fix logic for indentation inside list items #152

Merged

AlexVonB merged commit 4399ee7 into matthewwithanm:develop Nov 24, 2024
1 check passed

AlexVonB mentioned this pull request Nov 24, 2024

ensure paragraph start tags begin a paragraph #108

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More carefully separate inline text from block content #120

More carefully separate inline text from block content #120

jsm28 commented Apr 9, 2024

alexei-osipov commented May 27, 2024 •

edited

Loading

jsm28 commented May 27, 2024

More carefully separate inline text from block content #120

More carefully separate inline text from block content #120

Conversation

jsm28 commented Apr 9, 2024

alexei-osipov commented May 27, 2024 • edited Loading

jsm28 commented May 27, 2024

alexei-osipov commented May 27, 2024 •

edited

Loading