Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault when adding fragment in HTML::Builder and document has an xmlns attribute #3112

Closed
bringel opened this issue Jan 28, 2024 · 8 comments · Fixed by #3116
Closed

Segfault when adding fragment in HTML::Builder and document has an xmlns attribute #3112

bringel opened this issue Jan 28, 2024 · 8 comments · Fixed by #3116
Labels
topic/memory Segfaults, memory leaks, valgrind testing, etc. upstream/libxml2

Comments

@bringel
Copy link

bringel commented Jan 28, 2024

Please describe the bug
I'm trying to build up a document to be added to an ePub file using Nokogiri::HTML::Builder. When I try to add the string of html inside the body doing doc << article[:text], the program crashes if I've added an xmlns attribute to root html node, but not if I leave that off. It's not super important to me that the namespace be there, but it is very odd

Help us reproduce what you're seeing
I'm adding the article text from this page. If I create my document like this:

Nokgiri::HTML::Builder do |doc|
  doc.html(xmlns: "http://www.w3.org/1999/xhtml") {
     ...
    doc.body {
      doc << article[:text]
    }
  }
end

then the program crashes, if I leave off the xmlns attribute, then it works fine.

Expected behavior

Environment

# Nokogiri (1.16.0)
    ---
    warnings: []
    nokogiri:
      version: 1.16.0
      cppflags:
      - "-I/Users/bringel/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/nokogiri-1.16.0-x86_64-darwin/ext/nokogiri"
      - "-I/Users/bringel/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/nokogiri-1.16.0-x86_64-darwin/ext/nokogiri/include"
      - "-I/Users/bringel/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/nokogiri-1.16.0-x86_64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.1.2
      platform: x86_64-darwin21
      gem_platform: x86_64-darwin-21
      description: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-darwin21]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      - 0010-update-config.guess-and-config.sub-for-libxml2.patch
      - 0011-rip-out-libxml2-s-libc_single_threaded-support.patch
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.12.3
      loaded: 2.12.3
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-config.guess-and-config.sub-for-libxslt.patch
      datetime_enabled: true
      compiled: 1.1.39
      loaded: 1.1.39
    other_libraries:
      zlib: '1.3'
      libiconv: '1.17'
      libgumbo: 1.0.0-nokogiri

Additional context

ruby-2024-01-28-125803.txt

@bringel bringel added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jan 28, 2024
@flavorjones
Copy link
Member

flavorjones commented Jan 29, 2024

@bringel Thanks for reporting this!

I've reproduced with this script:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

doc = Nokogiri::HTML4::Builder.new do |doc|
  doc.html(xmlns: "http://www.w3.org/1999/xhtml") {
    doc.body {
      doc << "asdf"
    }
  }
end

I'll certainly fix this.

Although, I'm curious why you're adding a namespace to an HTML document. ePUB is an XHTML/XML standard. Can you say more about what you're doing here? Why not use the Nokogiri::XML::Builder?

@flavorjones
Copy link
Member

Stack walkback is

#0  0x00007ffff226e399 in xmlParserNsGrow (ctxt=0x555555d5d5c0) at parser.c:1642
#1  xmlParserNsPush (ctxt=ctxt@entry=0x555555d5d5c0, prefix=prefix@entry=0x7fffffffc000, 
    uri=uri@entry=0x7fffffffc010, saxData=saxData@entry=0x555555e52d90, defAttr=defAttr@entry=1) at parser.c:1679
#2  0x00007ffff227f4fc in xmlParseInNodeContext (node=node@entry=0x555555e76ce0, data=<optimized out>, 
    datalen=datalen@entry=15, options=<optimized out>, options@entry=4196449, lst=lst@entry=0x7fffffffc080)
    at parser.c:13267
#3  0x00007ffff224e7a8 in in_context (self=<optimized out>, _str=<optimized out>, _options=<optimized out>)
    at ../../../../ext/nokogiri/xml_node.c:2185

@flavorjones
Copy link
Member

Looks like a libxml2 bug. Doesn't reproduce with libxml 2.9.13, but does reproduce with the packaged 2.12.4. Will bisect.

@flavorjones
Copy link
Member

git bisect shows the first problematic commit is https://gitlab.gnome.org/GNOME/libxml2/-/commit/e0dd330b8fc299b26f7f7f1a3c853daee56a9987

This commit is first in libxml 2.12.0, which was first packaged in nokogiri v1.16.0. You should be able to use an older version of nokogiri to work around this for now.

I'll file a bug report upstream and attempt to provide a patch fix.

@flavorjones flavorjones added topic/memory Segfaults, memory leaks, valgrind testing, etc. upstream/libxml2 and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jan 29, 2024
@bringel
Copy link
Author

bringel commented Jan 29, 2024

Hi @flavorjones, thanks for the quick work! I was adding the namespace mostly because a lot of the example content I had seen for ePub and documentation made me think that I needed to be using a stricter xhtml document than I think you really need to. Certainly seems to be okay for my testing at the moment without the namespace. I will probably wind up going to XML::Builder instead anyways to fix some other validation issues that I'm seeing, but I'm glad to see that this was not just me doing something strange

@flavorjones
Copy link
Member

Fun story: I already reported this bug, for XML documents, back in Sept 2023: https://gitlab.gnome.org/GNOME/libxml2/-/issues/597

The fix is at https://gitlab.gnome.org/GNOME/libxml2/-/commit/eb69c1d39d9175779844d4460e9a6afb74a14a2d but seems like it may not be complete (since it doesn't address HTML documents).

@flavorjones
Copy link
Member

Upstream bug report is https://gitlab.gnome.org/GNOME/libxml2/-/issues/672

@flavorjones
Copy link
Member

While we're here, a simpler reproduction that doesn't require Builder:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

doc = Nokogiri::HTML5::Document.parse("<html><body><math>")
math = doc.at_css("math")
math.parse("mrow")

The key bit is that we have an HTML node with a namespace, and try to parse some new markup "in the context of" that node. (In HTML5, the MathML foreign context is represented with a math node that has a default namespace of http://www.w3.org/1998/Math/MathML.)

flavorjones added a commit that referenced this issue Jan 30, 2024
**What problem is this PR intended to solve?**

Apply upstream fix for #3112

- upstream bug report
https://gitlab.gnome.org/GNOME/libxml2/-/issues/672
- upstream fix
https://gitlab.gnome.org/GNOME/libxml2/-/commit/95f2a17440568694a6df6a326c5b411e77597be2

Fixes #3112


**Have you included adequate test coverage?**

Yes

**Does this change affect the behavior of either the C or the Java
implementations?**

Only affects the C implementation.
flavorjones added a commit that referenced this issue Feb 4, 2024
- fix is in libxml 2.12.5
- also update test to not rely on the presence of the patch
flavorjones added a commit that referenced this issue Feb 4, 2024
- fix is in libxml 2.12.5
- also update test to not rely on the presence of the patch

(cherry picked from commit f33a25f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic/memory Segfaults, memory leaks, valgrind testing, etc. upstream/libxml2
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants