Improperly formatting text within a <pre> tag #1891

Closed
NiccoMlt opened this issue Feb 3, 2023 · 3 comments
Labels: bug (Confirmed bug that we should fix), fixed

NiccoMlt commented Feb 3, 2023

Hi,
jsoup apparently reformats the content inside a <pre> tag, so the output no longer renders the same as the input.
Given the following HTML:

<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>
    <div>
        <pre><span><b><u><span>TEST</span></u></b></span></pre>
    </div>
</body>
</html>

and running the following Java code:

import org.jsoup.Jsoup;

// Parse the document and serialize it with jsoup's default (pretty-print) output settings
final String html =
        "<!DOCTYPE html>\n"
        + "<html lang=\"en\">\n"
        + "<head><title>Test</title></head>\n"
        + "<body>\n"
        + "    <div>\n"
        + "        <pre><span><b><u><span>TEST</span></u></b></span></pre>\n"
        + "    </div>\n"
        + "</body>\n"
        + "</html>";
final String parsed = Jsoup.parse(html).toString();
System.out.println(parsed);

the result is

<!DOCTYPE html> 
<html lang="en"> 
<head>
 <title>Test</title>
</head> 
<body> 
 <div> 
  <pre><span><b><u>
      <span>TEST</span>
     </u></b></span></pre> 
 </div>  
</body>
</html>

I'm using the latest version, 1.15.3.

jhy (Owner) commented Feb 20, 2023

I'm not able to repro this in 1.15.4. See this example, which returns:

  <div>
   <pre><span><b><u><span>TEST</span></u></b></span></pre>
  </div>

jsoup does test whether an element is inside a <pre> (in Element#preserveWhitespace()) and will preserve text node formatting; it should not otherwise be reformatting those elements. As an optimization for serialization time, that check only walks a limited number of levels (6) up the ancestor stack, but that wouldn't have an impact in this instance. I guess this issue was resolved by one of the pretty-print fixes in 1.15.4, but I haven't checked yet.

Can you review with 1.15.4? If you find other cases where it's not working as desired, I'm happy to take a look.
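
The depth-limited ancestor check described above can be sketched as follows. This is an illustration only, not jsoup's source: the `Node` class, `insidePre`, and `chain` are hypothetical names, and whether the limit counts the starting element itself is an assumption made here.

```java
/** Illustrative sketch of a depth-limited "am I inside a <pre>?" check. Not jsoup code. */
class PreDepthSketch {

    /** Hypothetical stand-in for a DOM element: a tag name plus a parent link. */
    static class Node {
        final String tag;
        final Node parent;

        Node(String tag, Node parent) {
            this.tag = tag;
            this.parent = parent;
        }
    }

    /**
     * Returns true if the node, or one of its ancestors, is a pre element.
     * At most maxLevels nodes are inspected, starting with the node itself
     * (assumption), so a deeply nested descendant can miss its pre ancestor.
     */
    static boolean insidePre(Node node, int maxLevels) {
        Node el = node;
        for (int i = 0; el != null && i < maxLevels; i++) {
            if (el.tag.equals("pre")) {
                return true;
            }
            el = el.parent;
        }
        return false;
    }

    /** Builds a parent-to-child chain from the given tag names and returns the innermost node. */
    static Node chain(String... tags) {
        Node node = null;
        for (String tag : tags) {
            node = new Node(tag, node);
        }
        return node;
    }

    public static void main(String[] args) {
        // Nesting from the report: div > pre > span > b > u > span
        Node inner = chain("div", "pre", "span", "b", "u", "span");
        System.out.println(insidePre(inner, 6)); // true: pre is the 5th node inspected
        System.out.println(insidePre(inner, 3)); // false: the walk stops at b
    }
}
```

With the nesting from the report, the `<pre>` sits well inside a 6-level limit, which is consistent with the limit not being the cause here.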

jhy added the no-repro label Feb 20, 2023

NiccoMlt (Author) commented Feb 20, 2023

Hi, thank you for your answer. You are right about the minimal example; it seems to be fixed.

Sadly, I'm still experiencing the problem with my actual document. I cannot provide the full document, but I can provide another example:

<div>
    <pre><span><b><u><o:p>TEST</o:p></u></b></span></pre>
</div>

Under jsoup 1.15.4, the HTML above is formatted as:

<html>
 <head></head>
 <body>
  <div>
   <pre><span><b><u>
       <o:p>TEST
       </o:p></u></b></span></pre>
  </div>
 </body>
</html>

Note that I replaced the inner <span> tag with an Office-namespaced paragraph tag, <o:p>.

HTML documents with these tags are usually produced by tools like Microsoft Word and Microsoft Outlook.
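
To show why the formatted output is not equivalent, here is a minimal standard-library sketch that compares the text inside the `<pre>` before and after. The class `PreTextCheck` and its `preText` helper are hypothetical, the regex-based tag stripping is a rough stand-in for real HTML parsing (it does not decode entities or handle nested `<pre>` blocks), and the `reported` string is transcribed from the output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Rough check that the whitespace-significant text inside a pre block is unchanged. */
class PreTextCheck {
    // Captures the body of the first <pre>...</pre> block, including newlines
    private static final Pattern PRE =
            Pattern.compile("<pre\\b[^>]*>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // Matches any start or end tag so it can be stripped from the body
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the text of the first pre block with all tags removed, or null if there is none. */
    static String preText(String html) {
        Matcher m = PRE.matcher(html);
        if (!m.find()) {
            return null;
        }
        return TAG.matcher(m.group(1)).replaceAll("");
    }

    public static void main(String[] args) {
        String input =
                "<div>\n"
                + "    <pre><span><b><u><o:p>TEST</o:p></u></b></span></pre>\n"
                + "</div>";
        String reported =
                "  <div>\n"
                + "   <pre><span><b><u>\n"
                + "       <o:p>TEST\n"
                + "       </o:p></u></b></span></pre>\n"
                + "  </div>";

        // The input text is "TEST"; the reported output adds newlines and indentation around it
        System.out.println(preText(input).equals(preText(reported))); // false
    }
}
```

A comparison like this could serve as a quick regression check when upgrading jsoup versions, since any whitespace added inside a `<pre>` changes what the browser renders.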

jhy closed this as completed in 2f48a61 Mar 9, 2023
jhy added the bug (Confirmed bug that we should fix) and fixed labels and removed the no-repro label Mar 9, 2023
jhy added this to the 1.16.1 milestone Mar 9, 2023

jhy (Owner) commented Mar 9, 2023

Thanks for the updated detail -- fixed
