
Normalization differences between IDNA::Native and IDNA::Pure #408

Closed
brasic opened this issue Feb 9, 2021 · 13 comments

@brasic (Contributor) commented Feb 9, 2021

Hello and thanks for your work building and maintaining this useful library!

We use Addressable and IDNA::Pure at GitHub for a number of URL parsing and generating tasks. The pure-Ruby IDN implementation is a bottleneck in some areas (see #407), so we are currently evaluating a switch to libidn via IDNA::Native. Our test suite found a few interesting differences between the two implementations when it comes to path normalization of percent-encoded NUL bytes. Here's an example:

Addressable::URI.parse("http://github.com/foo/bar/.%00./lol").normalized_path # Addressable::IDNA::Pure
# => "/foo/bar/.%00./lol"

Addressable::URI.parse("http://github.com/foo/bar/.%00./lol").normalized_path # Addressable::IDNA::Native
# => "/foo/bar/lol"

The behavior change is ultimately due to the following lower-level difference:

irb(main):004:0> Addressable::IDNA.unicode_normalize_kc(".\u0000.") # libidn
=> "."
irb(main):006:0> Addressable::IDNA.unicode_normalize_kc(".\u0000.") # pure
=> ".\u0000."

Unfortunately, in our testing it seems browsers are split on the right way to deal with NUL bytes. RFC 3986 discusses %00 but leaves the handling up to the application (emphasis mine):

Percent-encoded octets must be decoded at some point during the dereference process. Applications must split the URI into its components and subcomponents prior to decoding the octets, as otherwise the decoded octets might be mistaken for delimiters. Security checks of the data within a URI should be applied after decoding the octets. Note, however, that the "%00" percent-encoding (NUL) may require special handling and should be rejected if the application is not expecting to receive raw data within a component.
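In other words, the rejection the RFC suggests is an application-level choice made before (or independently of) normalization. A minimal sketch of such a guard, purely illustrative (the method name is made up; nothing like it exists in Addressable):

# Reject "%00" up front when raw data is not expected in any component,
# per the RFC passage quoted above.
def reject_nul_octets!(uri_string)
  raise ArgumentError, "unexpected %00 in URI" if uri_string.match?(/%00/i)
  uri_string
end

reject_nul_octets!("http://github.com/foo/bar/.%00./lol")
# => ArgumentError: unexpected %00 in URI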

Are you interested in harmonizing this difference in normalization between the two IDNA backends and which do you think is the appropriate behavior?

@sporkmonger (Owner) commented Feb 11, 2021

These should definitely be made consistent. Since we likely can't easily influence the behavior of IDNA::Native, I'm inclined to believe it should be considered the canonical result where behavior is undefined.

@brasic (Contributor, Author) commented Sep 1, 2021

@sporkmonger @dentarg gentle ping. Have you given any more thought to this? If you're interested in a PR to resolve the discrepancies we found between IDNA::Pure and IDNA::Native by changing IDNA::Pure I'm happy to open one, but IMO it would necessitate a major version bump.

@dentarg (Collaborator) commented Sep 1, 2021

@brasic I interpret @sporkmonger's answer as meaning a PR resolving this would be accepted! I agree that it should be a major version bump when released.

@jarthod (Contributor) commented Jan 18, 2023

Quick question: I encountered a similar issue where IDNA::Pure normalization did not match browser behavior (and, after testing, did not match IDNA::Native either):

irb(main):004:0> s1 = "https://l♥️h.ws"
=> "https://l♥️h.ws"
irb(main):005:0> s2 = "https://l♥h.ws"
=> "https://l♥h.ws"
irb(main):006:0> Addressable::URI.parse(s1).normalize
=> #<Addressable::URI:0x243d8 URI:https://xn--lh-t0xz926h.ws/>
irb(main):007:0> Addressable::URI.parse(s2).normalize
=> #<Addressable::URI:0x25b5c URI:https://xn--lh-t0x.ws/>
irb(main):008:0> s1.codepoints
=> [104, 116, 116, 112, 115, 58, 47, 47, 108, 9829, 65039, 104, 46, 119, 115]
irb(main):009:0> s2.codepoints
=> [104, 116, 116, 112, 115, 58, 47, 47, 108, 9829, 104, 46, 119, 115]

These two strings may look alike depending on your OS/browser, but as we can see there's a difference:
while the second one has the plain heart character (U+2665), the first one has a variation selector attached (U+2665, U+FE0F).

But in browsers both URLs load https://xn--lh-t0x.ws/ (the modifier is ignored), so I tried IDNA::Native too (with the idn-ruby gem installed) and, sure enough, the modifier is ignored this time:

irb(main):004:0> Addressable::URI.parse(s1).normalize
=> #<Addressable::URI:0x6130 URI:https://xn--lh-t0x.ws/>

This behavior is apparently indicated in the official IDNA mapping table: https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt at this line:

FE00..FE0F    ; ignored                                # 3.2  VARIATION SELECTOR-1..VARIATION SELECTOR-16

So from what I read above and my analysis, I suppose we should fix this in IDNA::Pure and a PR would be appreciated?
If so, I'm not sure what the best way to fix this is. I saw the UNICODE_TABLE file which seems to list character behavior. I hoped it would be an outdated/shortened IdnaMappingTable.txt that I would simply have to modify, but unfortunately it's not ^^ It's a custom marshalled array, which does have a column named UNICODE_DATA_EXCLUSION, so maybe this is what I should use to add an ignored range of characters? But if that's the case, how shall I modify and regenerate this file? I couldn't find any rake task or script to do that. Shall I just do it manually in a console?

Thanks for your help; if you could just confirm the best place to do it, I'll write the PR.

PS: 🤔 Would it make sense to use the full, official IDNA mapping file in this library, maybe? It looks easy to parse, would be very easy to update when new versions are released (the latest is v15.1.0), and would greatly help reduce the gap with the Native implementation. I'm not saying we should do it nor that I want to do it, just asking in case. If you think that is a good idea, I might give it a try too (separately).

@dentarg (Collaborator) commented Jan 22, 2023

It's a custom marshalled array, which does have a column named UNICODE_DATA_EXCLUSION, so maybe this is what I should use to add an ignored range of characters?

Nice spelunking. That does sound like the solution to me.

But if that's the case, how shall I modify and regenerate this file? I couldn't find any rake task or script to do that. Shall I just do it manually in a console?

I have no idea, but yes, unless @sporkmonger has some old scripts lying around, I very much think we need to do it manually. :)
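If it comes to that, the manual route is presumably just a Marshal round-trip in a console, something like the sketch below (the path to the marshalled table is a guess on my part; double-check where the gem actually keeps it):

# Hypothetical console session; table_path is an assumption, not verified.
table_path = "data/unicode.data"
table = File.open(table_path, "rb") { |f| Marshal.load(f.read) }

# ... adjust the entries needed (e.g. mark FE00..FE0F as excluded) ...

File.open(table_path, "wb") { |f| f.write(Marshal.dump(table)) }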

It is pretty cool that we have been using this file with serialised data for 10+ years 😄 Here's how it looked when it was still in Ruby (added in 3db3329); then it was in YAML for a short while before being marshalled:

# This is a sparse Unicode table. Codepoints without entries are
# assumed to have the value: [0, 0, nil, nil, nil, nil, nil]
UNICODE_DATA = {

would it make sense to use the full, official IDNA mapping file in this library, maybe? It looks easy to parse, would be very easy to update when new versions are released (the latest is v15.1.0), and would greatly help reduce the gap with the Native implementation. I'm not saying we should do it nor that I want to do it, just asking in case. If you think that is a good idea, I might give it a try too (separately).

I think that would be worthwhile to try.

@dentarg (Collaborator) commented Jan 22, 2023

Your comment above prompted me to look around a bit ("isn't someone else doing this?") and I found some libraries I've looked at in the past but had forgotten about: https://github.com/janlelis/unicode-x#unicodex-15-

I'm not saying that addressable should use a third party for this, but I think it is interesting to peek elsewhere. The author, Jan Lelis, seems to have been doing Unicode things for a good while. I also found his site https://character.construction/emoji-vs-text, which explains emoji and variation selectors. Thought I should share that (for other readers and my future self).

@dentarg (Collaborator) commented Jan 22, 2023

would it make sense to use the full, official IDNA mapping file in this library, maybe? It looks easy to parse

I suspect we would take a performance hit (a bit slower to load addressable, and the gem takes up more storage), but I wonder if it would matter these days.

@jarthod (Contributor) commented Jan 23, 2023

@dentarg thanks for your input 🙇‍♂️

So I started adding the spec for my case and then looked at the code to find where to... OH MY GOD WHAT THE HELL 😱
I know this is ported from C, so it's obviously much harder to read, with leftover ceremony like manual array allocation, etc. That's fine for me though, because I know C. But this encoding/decoding algorithm made no sense to me even after reading it 20 times, so then I looked at https://en.wikipedia.org/wiki/Punycode and I understood 🤦‍♂️ Who the hell thought this encoding was a good idea 😢 for a public-facing thing, on top of that, which many people will have to reimplement.

Anyway, after spending a couple of hours reading the same code and the wiki page over and over again, I kind of understand how it works now. But then I noticed there's actually no place in this code where characters can be skipped, and the UNICODE_DATA_EXCLUSION column I saw before is only used in the COMPOSITION_TABLE for unicode_compose_pair (which defaults to simply returning the characters unchanged when there's nothing in the table), so it doesn't matter whether I add the characters I'm talking about to the table; they won't be removed from the Punycode output...
The only way to remove the modifiers with the existing code, by just touching the table, would be to list every possible emoji+modifier combination and return the bare emoji as the canonical version (which is obviously tremendously inefficient).

So I am down the rabbit hole at this point 🐇
I had a look at https://github.com/janlelis/unicode-x#unicodex-15- too; it's great stuff, but I couldn't find anything related to IDNA in there, nor the mapping table. So I'm not sure we can use it even if we wanted to.
I also found https://github.com/mmriis/simpleidn/blob/master/lib/simpleidn.rb, which has a much simpler and much more readable Ruby implementation. It also happens to use the official mapping table as I suggested: https://github.com/mmriis/simpleidn/tree/master/tables, with a script generating a static and optimized Ruby mapping from it: https://github.com/mmriis/simpleidn/blob/master/lib/simpleidn/uts46mapping.rb. And it does give the correct encoding for the string I had a problem with:

> SimpleIDN.to_ascii("l♥️h.ws")
=> "xn--lh-t0x.ws"

It uses a compiled dependency (unf) for Unicode normalization though 😐 which is not great, but then I thought: Ruby does that well natively nowadays, doesn't it? Yes, it has for a while apparently: https://idiosyncratic-ruby.com/73-unicode-version-mapping.html. And then I thought: if Ruby does this, it must have some kind of Unicode table built in, maybe we can use that? Yes it does: https://github.com/ruby/ruby/blob/master/lib/unicode_normalize/tables.rb, but this does not include the IDNA mapping part, so we can't use it to fix this problem.
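For reference, here's a quick way to see what the built-in support gives us (the RbConfig key exists on reasonably recent Rubies; treat the exact values as examples):

# Built-in Unicode support check; values vary by Ruby version.
require "rbconfig"

RbConfig::CONFIG["UNICODE_VERSION"]  # e.g. "15.0.0" on Ruby 3.2
"l♥️h.ws".unicode_normalize(:nfkc)    # built-in NFKC, no native extension needed
# note: NFKC alone does not drop U+FE0F; that is an IDNA/UTS#46 mapping step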

So at this point I am thinking:

A. I see a lot of complexity around Unicode normalization and downcasing which appears to be properly supported natively by Ruby now; I understand that wasn't the case before, which may explain the history. Shall we try to use the Ruby version now? It'll probably be better, and more importantly it will keep being updated in future Ruby versions, as we can see in https://idiosyncratic-ruby.com/73-unicode-version-mapping.html. This is an unrelated problem, but as I dived into it, better to list it now while it's in my head. FTR I had a look at the special case reported by @brasic above with ".\u0000.", and using the Ruby version instead of IDNA::Pure here wouldn't change anything: the Ruby version returns the string unchanged, like IDNA::Pure. So not breaking, and not helping either, as it's still different from IDNA::Native.

  1. About the implementation, I can probably add some code to clean up the ignored characters in a quick and dirty™ way (e.g. a regexp with just the range I'm talking about, and then we add more when we encounter the case; see the sketch after this list). That would probably be the smallest-diff way to fix this, but it certainly won't be very future-proof or complete. Or I can do something slightly better and generate a complete regexp with a script from the official mapping file, including all ignored ranges. It doesn't have to be a regexp; if the list is huge I can use a Set instead.

  2. Or we could use simpleidn as a dependency so we don't repeat the logic and can benefit from their updates, but I understand this can bring other problems (the dependency lagging, being abandoned, etc.), especially if it brings along other dependencies with native extensions. Maybe we could get simpleidn to drop the unf dependency though, if this path is an option.

  3. Finally, we could keep this code dependency-free but use simpleidn as an example to greatly simplify the code and use the official mapping table like they do.
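For option 1, the quick-and-dirty version would look something like the sketch below (the range is taken from the IdnaMappingTable line quoted earlier; this is purely illustrative and not wired into Addressable):

# Strip codepoints the IDNA mapping table marks as "ignored" (only the
# variation selectors FE00..FE0F for now; more ranges could be added later).
IGNORED = /[\u{FE00}-\u{FE0F}]/

"l♥️h.ws".gsub(IGNORED, "")
# => "l♥h.ws"  (U+FE0F removed, so it now encodes to xn--lh-t0x.ws)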

I am fine with all 3 options, but I don't want to spend too much time on this if the PR is never going to be merged, so I would like your opinion on which you would consider. And any other suggestions you have, of course.

Sorry for the long message, I knew I shouldn't have looked into this 🤣 but now that I'm in it, I might as well try to help more users.

@dentarg (Collaborator) commented Feb 1, 2023

Thanks for going down the 🐰 hole @jarthod 😄 I have not been able to do so myself (even this short reply took some time to compose; there's always another thing to look into...) but thought I should reply here sooner rather than later. There's a lot going on here, and I don't have all (or any?) answers. To be honest, I'm not familiar with large parts of the Addressable code base. It is the work of others before me. I'm mainly here tending the garden, or something... :-)

I do have opinions though. To keep it short, I like alternatives 1 and 3 more than 2. It looks like unf isn't really maintained (knu/ruby-unf#13) and it also sounds like Ruby has the same capabilities (although slower?).

What are we trying to solve though? Backing up a bit, I'm wondering about what @sporkmonger wrote at #408 (comment)

These should definitely be made consistent. Since we likely can't easily influence the behavior of IDNA::Native, I'm inclined to believe it should be considered the canonical result where behavior is undefined.

Is that possible? Looks like "native" is IDNA 2003 and "pure" is IDNA 2008? #247 (comment) (I'm basing this off an example I just did.) I wonder what the intention of "pure" is: is it IDNA 2008?

Though I guess those questions don't have that much to do with the things pointed out here. I think it would be great to find a way for "pure" to do the right thing with regard to the emoji variation selector.

Re: the NUL that @brasic pointed out, wouldn't it be interesting to understand why libidn normalizes differently?

irb(main):015:0> IDN::Stringprep.nfkc_normalize(".\u0000.")
=> "."
irb(main):001:0> ".\u0000.".unicode_normalize
=> ".\u0000."
irb(main):002:0> ".\u0000.".unicode_normalize :nfc
=> ".\u0000."
irb(main):003:0> ".\u0000.".unicode_normalize :nfd
=> ".\u0000."
irb(main):004:0> ".\u0000.".unicode_normalize :nfkc
=> ".\u0000."
irb(main):005:0> ".\u0000.".unicode_normalize :nfkd
=> ".\u0000."

@jarthod (Contributor) commented Feb 7, 2023

I really shouldn't have opened this Pandora's box 😅

Sooooo, having spent the weekend reading the UTS#46 specification, I'll try to summarise the situation.

  • IDNA2003 is the older standard (designed for Unicode 3.2); it's pretty permissive.
  • IDNA2008 is the newer one, which supports all Unicode versions up to now (more characters supported, but it also makes a lot of strings invalid that IDNA2003 allowed, to reduce confusion/phishing risk and the like).
  • UTS#46 is a standard character mapping used in conjunction with IDNA2008, in order to avoid breaking compatibility with older domains and make the transition smoother.
  • There are unfortunately 4 "deviation" characters between IDNA2003 and IDNA2008 which can't be smoothly transitioned; they are simply converted differently by the two versions. So clients using IDNA2008+UTS#46 have to choose between the old (called "Transitional") and new (called "Non-Transitional") behavior.

AFAICS the expectation is that, while all registrars upgrade to IDNA2008 and only allow valid hostnames (which will take approximately forever), browsers and web clients in general are encouraged to use IDNA2008+UTS#46 to widen support. So basically, unless you're running a registrar, IDNA2008+UTS#46 is the target.

That's why libidn2 implements IDNA2008 + UTS#46 and defaults to the Non-Transitional (= new) mode, which is also used by curl, for example, and probably many other web clients. Firefox and Safari also seem to do IDNA2008+UTS#46 Non-Transitional. Chrome was lagging a bit as it was still using Transitional mode until very recently; apparently they juuust changed this to Non-Transitional in Chrome 110. I can't verify this yet as I only have Chrome 109 on Linux ^^

As you said, libidn (the current "native") implements the IDNA2003 standard (the "older" one). So IMO we should upgrade to libidn2, yes, as this implementation is now incompatible with every major browser. We can discuss that further in #247; I'll post some suggestions there.

The "pure" implementation is IDNA2008iiiisssshhhhh, but not compliant of course. As we can see in my example with the emoji modifier, if we compare that to the official Unicode test website):
image
https://xn--lh-t0xz926h.ws (returned by current "pure" implementation) is not even an option, no matter what standard we use, it's either xn--lh-t0x.ws or invalid (IDNA2008)

So overall, what we are trying to solve here is to go from IDNA2003 and IDNA2008ish to both implementations being IDNA2008+UTS#46 compliant. For native it'll be through libidn2 (#247), and for pure we'll have to rewrite some of it (or bring in a dependency). I want to bring one piece of good news though: the Unicode team provides an awesome conformance testing file with thousands of input strings and the desired output for IDNA2008+UTS#46, for every version of Unicode, for example: https://www.unicode.org/Public/idna/15.0.0/IdnaTestV2.txt

This means we should be able to easily add a very extensive test suite to make sure our pure implementation is (and stays) compliant. I'll create another ticket to discuss this (edit: #491), actually, because I kind of hijacked this one thinking it was the same problem, but it's not really (see the next part).
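A rough idea of what consuming that conformance file could look like (just a sketch of the shape; the column meanings need to be checked against the header comments in IdnaTestV2.txt itself):

# Sketch only: iterate the conformance file, skip comments, and split each
# test vector on ";" into its columns.
File.foreach("IdnaTestV2.txt") do |line|
  line = line.sub(/#.*/, "").strip   # drop trailing comments
  next if line.empty?
  source, to_unicode, *rest = line.split(";").map(&:strip)
  # feed `source` through the pure implementation and compare with expectations
end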


Now, getting back to the initial NUL that @brasic pointed out, because it's actually a different problem: AFAIK this is "simply" a C extension (or libidn) shortcoming. The C string is probably NUL-terminated (instead of the length being passed), so injecting a NUL inside it is not accounted for and libidn considers the string finished early:

> IDN::Stringprep.nfkc_normalize("abcd\u0000efgh")
=> "abcd"

The Ruby implementation is more correct in this regard, and considering it handles much more recent versions of Unicode, IMO we should switch to using Ruby's normalize instead of this one. Or at least use it for the other parts of the URL (it's the path that gets broken for @brasic, which doesn't even have anything to do with IDNA) and only use the native libidn normalize function for the hostname.

I believe this is also likely to be much faster and more complete than the IDNA::Pure normalize implementation. I can make a first, separate PR for this to eliminate one problem while we discuss the others.
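Concretely, the direction I have in mind is roughly the sketch below (not an actual patch; unicode_normalize_kc is the existing method name seen earlier in this thread, and the body just delegates to Ruby's built-in normalizer):

require "addressable"

module Addressable
  module IDNA
    # Sketch: delegate NFKC to Ruby's built-in normalizer instead of the
    # table-driven pure port. Keeps NUL bytes intact, unlike libidn.
    def self.unicode_normalize_kc(value)
      value.to_s.unicode_normalize(:nfkc)
    end
  end
end

Addressable::IDNA.unicode_normalize_kc(".\u0000.")
# => ".\u0000."  (matches Ruby's behaviour shown above, not libidn's)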

@dentarg (Collaborator) commented Feb 7, 2023

Thanks for really going deep on these topics @jarthod 🙏

Re: NUL – I agree with you and any PR would be appreciated ❤️

@brasic are you able to share anything about what you do at GitHub today? Still using Addressable? Still using pure or did you make the switch to native and libidn? Using something else entirely?

@jarthod (Contributor) commented Mar 14, 2023

@brasic as #492 has just been merged, this issue is now fixed in the main branch.

@dentarg closed this as completed Apr 1, 2023

@dentarg (Collaborator) commented Apr 1, 2023

Re: older comments from me and @brasic above about this change being a major version bump

If you're interested in a PR to resolve the discrepancies we found between IDNA::Pure and IDNA::Native by changing IDNA::Pure I'm happy to open one, but IMO it would necessitate a major version bump.

and my reply

@brasic I interpret @sporkmonger's answer as meaning a PR resolving this would be accepted! I agree that it should be a major version bump when released.

@jarthod and I touched on that in the PR at #492 (comment) and concluded that this is a bug fix, which means it will go out in the next patch version.
