[BUG] Incorrect encoding detected in 3.3.1 #371

Closed
jefferyto opened this issue Oct 26, 2023 · 2 comments · Fixed by #378
Labels
bug Something isn't working detection Related to the charset detection mechanism, chaos/mess/coherence

Comments

@jefferyto

I'm updating the charset-normalizer package in OpenWrt (with Python 3.11.6) and tried the example in https://charset-normalizer.readthedocs.io/en/latest/user/handling_result.html#handling-result:

from charset_normalizer import from_bytes

my_byte_str = 'Bсеки човек има право на образование.'.encode('cp1251')

# Assign return value so we can fully exploit result
result = from_bytes(
    my_byte_str
).best()

print(result.encoding)  # cp1251

In 3.3.0 this would print cp1251 but in 3.3.1 this prints cp1257 (str(result) returns 'Bńåźč ÷īāåź čģą ļšąāī ķą īįšąēīāąķčå.').

I also tried the French phrase from https://charset-normalizer.readthedocs.io/en/latest/index.html#introduction:

my_byte_str = 'Bonjour, je suis à la recherche d\'une aide sur les étoiles'.encode('cp1252')

and from_bytes(my_byte_str).best() also has the encoding cp1257.

I have compiled the package for arm, aarch64 and x86_64 and I get the same results.
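
For context, here is a minimal sketch of why a mix-up like this is possible at all. It is not part of the original report and uses only Python's built-in codecs, not charset-normalizer. cp1251, cp1252 and cp1257 are all single-byte code pages, so the sample bytes decode without error under each of them, and a detector has to rank the candidates by how plausible the decoded text looks. The helper name `decodes_cleanly` is hypothetical and exists only for this illustration.

```python
# Illustration only: standard-library codecs, no charset-normalizer involved.
cyrillic = 'Bсеки човек има право на образование.'.encode('cp1251')
french = "Bonjour, je suis à la recherche d'une aide sur les étoiles".encode('cp1252')


def decodes_cleanly(payload: bytes, encoding: str) -> bool:
    """Return True if every byte of payload is defined in the given code page."""
    try:
        payload.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


for label, payload in (("cyrillic", cyrillic), ("french", french)):
    for encoding in ("cp1251", "cp1252", "cp1257"):
        if decodes_cleanly(payload, encoding):
            # Each candidate that decodes yields *some* text; only a
            # plausibility heuristic can tell which one is the real language.
            print(f"{label:8} {encoding}: {payload.decode(encoding)}")
        else:
            print(f"{label:8} {encoding}: <invalid byte sequence>")
```

For the French sample, the cp1252 and cp1257 decodings differ only in the single byte for "à", which leaves very little signal to separate the two candidates.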

@jefferyto jefferyto added bug Something isn't working help wanted Extra attention is needed labels Oct 26, 2023
@Ousret
Owner

Ousret commented Oct 31, 2023

I can reproduce this, and I am working on a fix.
It is caused by the newly supported encodings.

@Ousret Ousret added detection Related to the charset detection mechanism, chaos/mess/coherence and removed help wanted Extra attention is needed labels Oct 31, 2023
Ousret added a commit that referenced this issue Oct 31, 2023
)

and added noise (md) probe that identify malformed arabic representation due to the presence of letters in isolated form (credit to my wife, thanks!)
@Ousret
Owner

Ousret commented Oct 31, 2023

A solution was found for the first one. The second one is a little more problematic, but it no longer returns cp1257.
I will pin it for later.

Ousret added a commit that referenced this issue Oct 31, 2023
) (#378)

and added noise (md) probe that identify malformed arabic representation due to the presence of letters in isolated form (credit to my wife, thanks!)