[bug] Nokogiri::XML::Reader.from_io.each misidentifies character encoding? #2882

Closed
koshigoe opened this issue May 18, 2023 · 6 comments · Fixed by #2883

@koshigoe

Please describe the bug

Nokogiri::XML::Reader.from_io.each raises Nokogiri::XML::SyntaxError when an XML node contains a long run of non-ASCII characters.
The node contains only valid UTF-8 characters, yet parsing fails with the error FATAL: Input is not proper UTF-8, indicate encoding !.

Help us reproduce what you're seeing

require 'nokogiri'
require 'stringio'

NON_ASCII = "\u{3042}"
XML_TEMPLATE =<<~XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <a>%<content>s</a>
</root>
XML

[325, 326].each do |length|
  io = StringIO.new(format(XML_TEMPLATE, content: NON_ASCII * length))
  Nokogiri::XML::Reader.from_io(io).tap { |x| pp x  }.each { }
  puts "OK: #{length}"
rescue => e
  puts "NG: #{length} (#{e.inspect})"
end

__END__

ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]

### nokogiri 1.14.4

OK: 325
OK: 326

### nokogiri 1.15.0

OK: 325
NG: 326 (#<Nokogiri::XML::SyntaxError: 3:332: FATAL: Input is not proper UTF-8, indicate encoding !>)

Expected behavior

No error should be raised; the input is valid UTF-8.

Environment

# Nokogiri (1.15.0)
    ---
    warnings: []
    nokogiri:
      version: 1.15.0
      cppflags:
      - "-I/Users/koshigoe/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/nokogiri-1.15.0-arm64-darwin/ext/nokogiri"
      - "-I/Users/koshigoe/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/nokogiri-1.15.0-arm64-darwin/ext/nokogiri/include"
      - "-I/Users/koshigoe/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/nokogiri-1.15.0-arm64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.1.4
      platform: arm64-darwin22
      gem_platform: arm64-darwin-22
      description: ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      - 0010-update-config.guess-and-config.sub-for-libxml2.patch
      - 0011-rip-out-libxml2-s-libc_single_threaded-support.patch
      libxml2_path: "/Users/koshigoe/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/nokogiri-1.15.0-arm64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.11.3
      loaded: 2.11.3
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-config.guess-and-config.sub-for-libxslt.patch
      datetime_enabled: true
      compiled: 1.1.38
      loaded: 1.1.38
    other_libraries:
      zlib: 1.2.13
      libiconv: '1.17'
      libgumbo: 1.0.0-nokogiri

Additional context

ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]
@koshigoe koshigoe added the state/needs-triage Inbox for non-installation-related bug reports or help requests label May 18, 2023
@flavorjones
Member

@koshigoe Thank you for reporting this! This error message is being generated by libxml2. I have reproduced the issue and will investigate.

@flavorjones
Member

Git bisect shows that this is the commit that introduced the new behavior:

https://gitlab.gnome.org/GNOME/libxml2/-/commit/3582b07bd24d438be7dd08ab57e3f9e635373e32

commit 3582b07bd24d438be7dd08ab57e3f9e635373e32
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Sun Nov 13 22:57:32 2022 +0100

    parser: Fix content parser progress checks
    
    This is another attempt at fixing parser progress checks. Instead of
    relying on in->consumed, which could overflow, change some content
    parser functions to make guaranteed progress on certain byte sequences.

@flavorjones flavorjones added upstream/libxml2 and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels May 18, 2023
@flavorjones
Member

I've narrowed this down to specific changes in libxml2 chunk parsing that may be a bug. I'll open an issue upstream and link to it here.
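
For readers unfamiliar with why chunked parsing and multibyte text can interact badly, here is a minimal sketch in plain Ruby. It is not libxml2's code and is not a diagnosis of the upstream bug, whose mechanism this thread does not spell out. It only shows the general hazard: a valid UTF-8 string cut at a byte offset that falls inside a character yields pieces that each look like invalid UTF-8. The helper `split_at` and the chosen offsets are hypothetical.

```ruby
# U+3042 takes three bytes in UTF-8, so 326 copies occupy 978 bytes.
text = "\u{3042}" * 326

# Hypothetical helper: cut a string into two pieces at a raw byte offset,
# the way a buffer boundary might, without regard for character boundaries.
def split_at(str, byte_offset)
  head = str.byteslice(0, byte_offset)
  tail = str.byteslice(byte_offset, str.bytesize - byte_offset)
  [head, tail]
end

aligned    = split_at(text, 975) # offset is a multiple of 3: between characters
misaligned = split_at(text, 976) # offset falls inside a 3-byte character

puts text.valid_encoding?                      # whole input is valid
puts aligned.map(&:valid_encoding?).inspect    # both pieces valid
puts misaligned.map(&:valid_encoding?).inspect # both pieces invalid on their own
```

A parser that validates each buffer in isolation has to carry an incomplete trailing sequence over to the next buffer; the misaligned pieces above become valid again once they are rejoined.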

@flavorjones
Member

flavorjones commented May 18, 2023

Neat! This was already reported upstream at https://gitlab.gnome.org/GNOME/libxml2/-/issues/542 and was fixed about an hour ago in https://gitlab.gnome.org/GNOME/libxml2/-/commit/e0f3016f71297314502a3620a301d7e064cbb612

I expect it'll be fixed shortly in a libxml2 release. I'll leave this open until that happens and I can ship a new nokogiri release.

@flavorjones
Member

libxml2 v2.11.4 is out with the fix: https://gitlab.gnome.org/GNOME/libxml2/-/releases/v2.11.4

I'll try to get a release out in the next day.

@flavorjones
Member

Nokogiri v1.15.1 is out with this upstream fix. https://github.com/sparklemotion/nokogiri/releases/tag/v1.15.1
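
To check whether a given environment might still carry the regression, a rough sketch like the following compares a libxml2 version string against the fixed release. The thread confirms only that 2.11.3 is affected and 2.11.4 is fixed; treating every 2.11.x release before 2.11.4 as suspect is an assumption, and the helper name `possibly_affected?` and both constants are hypothetical.

```ruby
# Gem::Version is normally available without a require; this makes it explicit.
require "rubygems"

# First libxml2 release containing the upstream fix, per this thread.
FIXED_LIBXML2 = Gem::Version.new("2.11.4")

# ASSUMPTION: the regression is treated as present from 2.11.0 onward.
# The thread only confirms 2.11.3 as affected.
SUSPECT_SERIES_START = Gem::Version.new("2.11.0")

# Hypothetical helper: true when the version falls in the suspect range.
def possibly_affected?(libxml2_version)
  version = Gem::Version.new(libxml2_version)
  version >= SUSPECT_SERIES_START && version < FIXED_LIBXML2
end

puts possibly_affected?("2.11.3") # version reported in the Environment section
puts possibly_affected?("2.11.4") # release that carries the fix

# With Nokogiri installed, the loaded version shown under "libxml: loaded:"
# in the Environment section can be read at runtime:
#   possibly_affected?(Nokogiri::VERSION_INFO["libxml"]["loaded"])
```

If the check reports a suspect version with the packaged libxml2, upgrading to Nokogiri v1.15.1 or later is the fix described above.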
