Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Empty HTML4 DocumentFragment serialization doesn't respect encoding #2649

Closed
sgoedecke opened this issue Sep 19, 2022 · 5 comments · Fixed by #2650
Closed

[bug] Empty HTML4 DocumentFragment serialization doesn't respect encoding #2649

sgoedecke opened this issue Sep 19, 2022 · 5 comments · Fixed by #2650

Comments

@sgoedecke
Copy link

Please describe the bug

Parsing and serializing a HTML4::DocumentFragment will produce a UTF-8 string by default. However, when the input string is empty, it produces a US-ASCII encoded string instead, regardless of the passed encoding option.

Let me know if this is actually expected behaviour and something I should be working around!

Help us reproduce what you're seeing

This script produces a failing test:

#! /usr/bin/env ruby

require 'nokogiri'
require 'minitest/autorun'

class Test < MiniTest::Spec
  describe "HTML4::DocumentFragment" do
    it "should respect UTF-8 encoding for empty strings" do
      doc = Nokogiri::HTML::DocumentFragment.parse("", "UTF-8")
      
      assert_equal "UTF-8", doc.to_html.encoding
    end
  end
end

Output:

# Running:

F

Finished in 0.001808s, 553.0974 runs/s, 553.0974 assertions/s.

  1) Failure:
HTML4::DocumentFragment#test_0001_should respect UTF-8 encoding for empty strings [test.rb:11]:
Expected: "UTF-8"
  Actual: #<Encoding:US-ASCII>


Environment

# Nokogiri (1.12.5)
    ---
    warnings: []
    nokogiri:
      version: 1.12.5
      cppflags:
      - "-I/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
      - "-I/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include"
      - "-I/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.6.3
      platform: universal.x86_64-darwin20
      gem_platform: universal-darwin-20
      description: ruby 2.6.3p62 (2019-04-16 revision 67580) [universal.x86_64-darwin20]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - 0007-Fix-XPath-recursion-limit.patch
      libxml2_path: "/Library/Ruby/Gems/2.6.0/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.12
      loaded: 2.9.12
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      - 0002-Fix-xml2-config-check-in-configure-script.patch
      datetime_enabled: true
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      zlib: 1.2.11
      libiconv: '1.15'
      libgumbo: 1.0.0-nokogiri

Additional information

From what I can tell, the issue comes from here: https://github.com/sparklemotion/nokogiri/blob/main/lib/nokogiri/xml/node_set.rb#L283

When there's a single node (which there always is for proper documents), that node sets the encoding properly. But when there are no nodes, we're effectively returning [].join, which is US-ASCII encoding.

@sgoedecke sgoedecke added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Sep 19, 2022
@flavorjones
Copy link
Member

Hello! Thank you for opening this issue, this is most certainly a bug and I appreciate you reporting it and diagnosing it.

I'll schedule some time to fix it!

@flavorjones flavorjones added topic/encoding and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Sep 19, 2022
@flavorjones flavorjones added this to the v1.14.0 milestone Sep 19, 2022
@flavorjones
Copy link
Member

And your diagnosis seems right on:

>> "".encoding
=> #<Encoding:UTF-8>
>> [].join.encoding
=> #<Encoding:US-ASCII>

@flavorjones
Copy link
Member

flavorjones added a commit that referenced this issue Sep 19, 2022
and improve test coverage around fragment encoding

Closes #2649
@flavorjones
Copy link
Member

See #2650 for a proposed fix.

flavorjones added a commit that referenced this issue Sep 19, 2022
and improve test coverage around fragment encoding

Closes #2649
flavorjones added a commit that referenced this issue Sep 19, 2022
and improve test coverage around fragment encoding

Closes #2649
@sgoedecke
Copy link
Author

Wow, so quick! Thank you ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants