fix invalid byte sequence in UTF-8 exception when unencoding URLs containing non UTF-8 characters #459
Did you see #224? Sounds (a bit) like the same issue?
Oops, no, I didn't see that one. It does look a bit like the same issue, although #224 is about the hostname part. My PR here does not fix #224 because there are other potential encoding problems in the code handling the hostname part:
My PR helps with the path and query parts, which in my experience are more likely to contain non UTF-8 characters.
Looks good to me, I will merge if you can please the 🐶 and rebase :)
5ab1213 to aef7db0
…ontaining non UTF-8 characters
aef7db0 to e3b04eb
OK, I think I have pleased the 🐶 (although I'm not happy with the result), rebased the branch on main, and squashed the multiple fixes into a single commit that is ready to merge.
I'm not sure what's going on with GitHub Actions, but I can't merge this because status hasn't been reported for a number of jobs (same problem in #469, but there it makes a bit of sense). Do you mind opening this as a new PR?
Maybe wait with that... I saw this now
Indeed, according to actions/runner-images#5583 the brownout ends in 4 hours, so retrying the run after that might pass. The image will have to be updated though, because at the end of August macOS 10.15 will be completely gone ^^ Edit 10h later: I just tried but I don't have the rights to restart the build, so I'll let you do it.
Thanks for being thorough and checking 150k extra URL parses. 😁
Thanks @sporkmonger & @dentarg! Happy to do it again if you need some validation for other changes.
Since `PublicSuffix` v4.0.3, it is possible to parse the URL `http://+%D5d.some_site.net`. It is also possible to normalize the URL since `Addressable` v2.8.1, which includes this fix: sporkmonger/addressable#459. Hence, this is now a valid URL, which means that it should be moved to the `valid_urls` array in the specs. `PublicSuffix` 4.x is supported by `Addressable` since v2.7.0 (see https://github.com/sporkmonger/addressable/blob/main/CHANGELOG.md#addressable-270).
Hi 👋
First of all, I've been using `addressable` for some time in my product to deal with complex URL transformations and it's been super helpful. Thanks 🙇

Recently I started getting an `invalid byte sequence in UTF-8` exception in `gsub` when parsing and normalizing some weird URLs containing non UTF-8 compatible characters (ISO-8859-1). So I looked at the code and found that it is supposed to change the encoding back to ASCII-8BIT during this phase to avoid any encoding issue (good idea): https://github.com/sporkmonger/addressable/blob/addressable-2.8.0/lib/addressable/uri.rb#L576-L580

BUT a couple of lines later, in the `unencode` method, it actually forces the string back to UTF-8 right BEFORE the `gsub`: https://github.com/sporkmonger/addressable/blob/addressable-2.8.0/lib/addressable/uri.rb#L472-L480

(This comes from this change: e4f2bd6, following this issue: #154)
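For readers who want to see the failure mode in isolation, here is a minimal sketch in plain Ruby. It is not Addressable's code: the sample bytes and variable names are made up for illustration. It only shows that a regexp-based `gsub` on a string tagged as UTF-8 but holding ISO-8859-1 bytes raises, while the same call on the binary (ASCII-8BIT) string goes through.

```ruby
# Hypothetical sample: "café" encoded as ISO-8859-1, so "é" is the single byte 0xE9.
raw = "caf\xE9".b                                   # ASCII-8BIT (binary) copy
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)   # same bytes, tagged as UTF-8

# 0xE9 alone is not a complete UTF-8 sequence, so the string is invalid.
utf8_is_valid = as_utf8.valid_encoding?

# A regexp-based gsub refuses to scan a string with a broken encoding.
error_message = nil
begin
  as_utf8.gsub(/f/, "F")
rescue ArgumentError => e
  error_message = e.message
end

# The same substitution on the binary string treats the data as raw bytes.
binary_result = raw.gsub(/f/, "F")
```

The point of the sketch is only that the encoding tag, not the bytes themselves, decides whether `gsub` fails.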
So this change back to UTF-8 before the `gsub` breaks this workflow again. The spec I added in this PR gives this failure using the original code:
My fix simply removes some of the `force_encoding` calls and slightly changes the one inside the `gsub` (to avoid breaking the other issue fixed before). The test suite passes entirely, and I have also checked this version against the 150k+ URLs present in my product (parse + normalize) without any error. I have already been using this version in production for about a week.

Let me know if you have any doubts or questions.
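As a rough illustration of the general idea (decode percent-escapes on raw bytes, and only tag the result as UTF-8 afterwards), here is a small standalone sketch. It is not the patch from this PR: `unencode_bytes` is a hypothetical helper, and falling back to the binary string when the bytes are not valid UTF-8 is an assumption made for the example.

```ruby
# Hypothetical helper, not part of Addressable.
def unencode_bytes(input)
  bytes = input.b  # binary copy, so gsub never validates the bytes as UTF-8

  # Replace each %XX escape with the single byte it stands for.
  decoded = bytes.gsub(/%[0-9a-f]{2}/i) { |seq| seq[1, 2].hex.chr }

  # Assumption for this sketch: tag as UTF-8 only when the bytes are valid UTF-8,
  # otherwise hand back the raw binary string.
  utf8 = decoded.dup.force_encoding(Encoding::UTF_8)
  utf8.valid_encoding? ? utf8 : decoded
end

latin1_result = unencode_bytes("caf%E9")     # ISO-8859-1 "é", not valid UTF-8
utf8_result   = unencode_bytes("caf%C3%A9")  # UTF-8 "é"
```

With this shape, an ISO-8859-1 escape such as `%E9` comes back as a binary string instead of raising, while a well-formed UTF-8 escape sequence comes back as a normal UTF-8 string.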
Suggested line for the changelog: