Nested body and html tags trip jsoup into closing the stack too early #1851

Closed
0xabadea opened this issue Oct 25, 2022 · 1 comment

For the attached eisenachonline-de.html.gz, jsoup 1.15.3 returns a tree that differs from the one Firefox and Chrome show in their dev tools. The document contains nested body and, in particular, html tags, which appear to make the parser close the stack of open elements prematurely.

The relevant HTML section looks like this:

<header class="entry-header">
   <h1 class="entry-title">Unfall zwischen PKW und Fußgänger</h1>
   <div class="entry-meta">...</div>
   <figure class="post-thumbnail"><img ... /><figcaption>
       <?xml encoding="utf-8" ?><html><body><p>Bildquelle: &copy; Comofoto &ndash; stock.adobe.com<br>
           Symbolbild</p>
       </body></html>
   </figcaption></figure>
   <div class="entry-date">
       <time datetime="2022-10-20 14:24">20. Oktober 2022</time>
   </div>    </header>

For this HTML, the following code:

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse(new File("eisenachonline-de.html"));
        System.out.println(doc.selectFirst("header[class=entry-header]").selectFirst("div[class=entry-date]"));
    }

should print

<div class="entry-date"><time datetime="2022-10-20 14:24">20. Oktober 2022</time>
</div>

but it prints null instead. The entry-date div is present in the parsed document, but as a child of the body element rather than of the header.

This looks like a deviation from the HTML spec in the jsoup implementation. In the AfterBody state, when jsoup processes an html end tag, it pops the stack of open elements up to the html element. For the "after body" insertion mode, the HTML spec only says to switch to the "after after body" insertion mode on an html end tag; it does not say to pop anything from the stack. So it appears to me that the stack is closed too early, and elements that follow the stray end tag can no longer be inserted under their real ancestors.
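
To make the difference concrete, here is a minimal toy model of the stack of open elements. It is a sketch only and is not jsoup's tree builder: the class name `AfterBodySketch`, the method `parentOfEntryDate`, the simplified token list, and the fallback to body when the stack is empty are all assumptions made for illustration (the fallback is chosen to match the symptom reported above). The nested `<html>` and `<body>` start tags are left out because they do not push new elements in "in body" mode.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Toy model of the stack of open elements; NOT jsoup's actual implementation. */
public class AfterBodySketch {

    /**
     * Runs a simplified token stream and returns the parent of the entry-date div.
     *
     * @param popOnHtmlEnd true models popping the stack on the html end tag (the
     *                     behaviour described in the report); false models the
     *                     spec, which only switches the insertion mode.
     */
    public static String parentOfEntryDate(boolean popOnHtmlEnd) {
        // Plain names open an element; names starting with "/" close one.
        List<String> tokens = List.of(
                "html", "body", "header", "figure", "figcaption",
                "/body", "/html",          // stray end tags from the nested document
                "/figcaption", "/figure",
                "div.entry-date", "/div.entry-date", "/header");

        Deque<String> stack = new ArrayDeque<>();
        String parent = null;
        for (String token : tokens) {
            if (token.equals("/body")) {
                continue; // "in body": only the insertion mode changes, nothing is popped
            }
            if (token.equals("/html")) {
                if (popOnHtmlEnd) {
                    stack.clear(); // models popping everything up to and including html
                }
                continue; // spec behaviour: only switch to "after after body"
            }
            if (token.startsWith("/")) {
                String name = token.substring(1);
                if (stack.contains(name)) {
                    // Pop up to and including the matching open element.
                    while (!stack.pop().equals(name)) {
                        // keep popping
                    }
                }
                continue; // end tags with no matching open element are ignored
            }
            // Start tag: insert under the current node. Falling back to body when
            // the stack is empty is an assumption of this model.
            String current = stack.isEmpty() ? "body" : stack.peek();
            if (token.equals("div.entry-date")) {
                parent = current;
            }
            stack.push(token);
        }
        return parent;
    }

    public static void main(String[] args) {
        System.out.println("spec behaviour:     " + parentOfEntryDate(false));
        System.out.println("pop on html end tag: " + parentOfEntryDate(true));
    }
}
```

In this model, leaving the stack alone keeps figcaption, figure, and header open, so the later end tags close them normally and the div lands under header. Clearing the stack causes those end tags to be ignored and the div to be attached to body.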

jhy self-assigned this on Mar 27, 2023
jhy closed this as completed in dea4969 on Mar 27, 2023
jhy added the fixed label on Mar 27, 2023
jhy added this to the 1.16.1 milestone on Mar 27, 2023
jhy (Owner) commented Mar 27, 2023

Thanks, fixed.
