Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Nokogiri::XML::Reader#inner_xml returns NCR encoded attributes even if the encoding is set to utf-8 in #from_memory call. #2891

Closed
torce opened this issue May 24, 2023 · 5 comments · Fixed by #3084

Comments

@torce
Copy link

torce commented May 24, 2023

Please describe the bug

Nokogiri::XML::Reader#inner_xml returns NCR encoded attributes even if the encoding is set to utf-8 in #from_memory call.
It does not happen if the XML input sets the encoding with <?xml version="1.0" encoding="UTF-8"?>.
It only happens to attributes, elements and text nodes are correctly encoded.

Help us reproduce what you're seeing

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'nokogiri', '1.13.8'
end

require 'nokogiri'

xml = <<~XML
  <test><anotación tipo="inspiración">(inspiración)</anotación></test>
XML

reader = Nokogiri::XML::Reader.from_memory(xml, nil, 'utf-8')
puts reader.inner_xml while reader.read && reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

# Output with NCR encoded attributes:
# <anotación tipo="inspiraci&#xF3;n">(inspiración)</anotación>
# (inspiración)

xml_with_encoding = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <test><anotación tipo="inspiración">(inspiración)</anotación></test>
XML

reader = Nokogiri::XML::Reader.from_memory(xml_with_encoding, nil, 'utf-8')
puts reader.inner_xml while reader.read && reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

# Output with correct encoding:
# <anotación tipo="inspiración">(inspiración)</anotación>
# (inspiración)

Environment

# Nokogiri (1.13.8)
    ---
    warnings: []
    nokogiri:
      version: 1.13.8
      cppflags:
      - "-I/home/david/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.8-x86_64-linux/ext/nokogiri"
      - "-I/home/david/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.8-x86_64-linux/ext/nokogiri/include"
      - "-I/home/david/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.8-x86_64-linux/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.1.2
      platform: x86_64-linux
      gem_platform: x86_64-linux
      description: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - '0008-htmlParseComment-handle-abruptly-closed-comments.patch'
      - '0009-allow-wildcard-namespaces.patch'
      libxml2_path: "/home/david/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.8-x86_64-linux/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.14
      loaded: 2.9.14
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.35
      loaded: 1.1.35
    other_libraries:
      zlib: 1.2.12
      libgumbo: 1.0.0-nokogiri

Additional context

@torce torce added the state/needs-triage Inbox for non-installation-related bug reports or help requests label May 24, 2023
@flavorjones
Copy link
Member

flavorjones commented May 25, 2023

Thanks for reporting this. The behavior of the DOM parsing methods is slightly different, which is interesting:

#! /usr/bin/env ruby

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'nokogiri', '1.13.8'
end

require 'nokogiri'

xml = <<~XML
  <test><anotación tipo="inspiración">(inspiración)</anotación></test>
XML

Nokogiri::XML::Document.parse(xml).to_xml
# => "<?xml version=\"1.0\"?>\n" +
#    "<test>\n" +
#    "  <anotación tipo=\"inspiraci&#xF3;n\">(inspiraci&#xF3;n)</anotación>\n" +
#    "</test>\n"

Nokogiri::XML::Document.parse(xml, nil, "UTF-8").to_xml
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
#    "<test>\n" +
#    "  <anotación tipo=\"inspiración\">(inspiración)</anotación>\n" +
#    "</test>\n"

I'll investigate!

@flavorjones flavorjones added topic/encoding and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels May 26, 2023
@flavorjones
Copy link
Member

Sorry for not updating sooner. This appears to be intentional libxml2 behavior when a document does not declare an encoding.

However, I have some changes on a branch to work around this behavior within Nokogiri. Just need an hour or two to explore edge cases and build confidence that it's not breaking anything else encoding-related.

@nwellnhof
Copy link

The behavior of the DOM parsing methods is slightly different, which is interesting:

The issue is in libxml2's serializer. If it wasn't told to use a specific encoding, it will use the one from the XML declaration. If there's no encoding in the XML declaration, it will encode non-ASCII characters with NCRs. (This is confusing since UTF-8 is the XML default and shouldn't make a difference. It probably comes from a time when UTF-8 wasn't as ubiquitous.) The best solution is use "UTF-8" instead of NULL when serializing documents with doc->encoding == NULL.

@flavorjones
Copy link
Member

@nwellnhof Thanks for validating -- my working branch is doing exactly that: setting doc->encoding = xmlStrdup(BAD_CAST "UTF-8") if it's NULL. I'll rebase and merge it!

flavorjones added a commit that referenced this issue Dec 28, 2023
when it's not specified either as a method param or in the document

Fixes #2891
@flavorjones flavorjones added this to the v1.16.x patch releases milestone Dec 28, 2023
flavorjones added a commit that referenced this issue Dec 28, 2023
when it's not specified either as a method param or in the document

Fixes #2891
@flavorjones
Copy link
Member

See #3084

flavorjones added a commit that referenced this issue Dec 28, 2023
when it's not specified either as a method param or in the document

Fixes #2891
flavorjones added a commit that referenced this issue Dec 28, 2023
when it's not specified either as a method param or in the document

Fixes #2891
flavorjones added a commit that referenced this issue Dec 28, 2023
when it's not specified either as a method param or in the document

Fixes #2891
flavorjones added a commit that referenced this issue Dec 28, 2023
when it's not specified either as a method param or in the document

Fixes #2891
flavorjones added a commit that referenced this issue Dec 29, 2023
**What problem is this PR intended to solve?**

default Reader node encoding to UTF-8 when it's not specified either as
a method param or in the document

Fixes #2891


**Have you included adequate test coverage?**

Yes, I've added coverage.


**Does this change affect the behavior of either the C or the Java
implementations?**

Yes, this updates the C implementation but does not update the Java
implementation because Reader encoding is already wonky there in a few
edge cases.
@flavorjones flavorjones removed this from the v1.16.x patch releases milestone Jan 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants