[BUG] Incorrect encoding detected in 3.3.1 #371

jefferyto · 2023-10-26T15:22:32Z

I'm updating the charset-normalizer package in OpenWrt (with Python 3.11.6) and tried the example in https://charset-normalizer.readthedocs.io/en/latest/user/handling_result.html#handling-result:

from charset_normalizer import from_bytes

my_byte_str = 'Bсеки човек има право на образование.'.encode('cp1251')

# Assign return value so we can fully exploit result
result = from_bytes(
    my_byte_str
).best()

print(result.encoding)  # cp1251

In 3.3.0 this would print cp1251 but in 3.3.1 this prints cp1257 (str(result) returns 'Bńåźč ÷īāåź čģą ļšąāī ķą īįšąēīāąķčå.').
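To illustrate why a detector can confuse the two code pages here, the following is a minimal standard-library sketch (it does not call charset-normalizer, and the variable names are only illustrative). It assumes nothing beyond Python's built-in codecs and shows that the same bytes decode without error under both cp1251 and cp1257, so the choice between them has to come from heuristics rather than from decode failures.

```python
# Bytes from the report: Bulgarian text encoded as cp1251.
text = 'Bсеки човек има право на образование.'
payload = text.encode('cp1251')

# Both code pages are single-byte, and every byte used here is assigned
# in both of them, so neither decode raises; only the characters differ.
as_cp1251 = payload.decode('cp1251')
as_cp1257 = payload.decode('cp1257')

print(as_cp1251)  # the original Cyrillic text
print(as_cp1257)  # Baltic letters in place of the Cyrillic ones
```

Because both decodes succeed, a wrong guess surfaces only as mojibake in the decoded string, which matches the behaviour described above.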

I also tried the French phrase from https://charset-normalizer.readthedocs.io/en/latest/index.html#introduction:

my_byte_str = 'Bonjour, je suis à la recherche d\'une aide sur les étoiles'.encode('cp1252')

and from_bytes(my_byte_str).best() also has the encoding cp1257.
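As a rough illustration of how close the two candidates are for the French phrase, here is a small standard-library sketch (again not using charset-normalizer; the `diffs` helper list is hypothetical). It decodes the cp1252 bytes as cp1257 and lists the positions where the result differs from the original text.

```python
# French phrase from the report, encoded as cp1252.
phrase = "Bonjour, je suis à la recherche d'une aide sur les étoiles"
payload = phrase.encode('cp1252')

# Decode the same bytes as cp1257 and compare character by character.
as_cp1257 = payload.decode('cp1257')
diffs = [
    (index, original, decoded)
    for index, (original, decoded) in enumerate(zip(phrase, as_cp1257))
    if original != decoded
]

# Only the accented letters can differ; the ASCII bytes decode identically.
for index, original, decoded in diffs:
    print(f"position {index}: {original!r} became {decoded!r}")
```

With so few non-ASCII bytes in the input, there is very little evidence to separate the two encodings, which is one plausible reason a short Latin-script sample is harder to classify than the Cyrillic one.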

I have compiled the package for arm, aarch64 and x86_64 and I get the same results.


Ousret · 2023-10-31T19:46:06Z

I can reproduce this, and I am working on a fix.
This is due to new encodings being supported.


Ousret · 2023-10-31T20:15:01Z

A solution was found for the first case. The second one is a little more problematic, but it no longer returns cp1257.
I will pin it for later.

) (#378) and added noise (md) probe that identify malformed arabic representation due to the presence of letters in isolated form (credit to my wife, thanks!)

jefferyto added the bug and help wanted labels Oct 26, 2023

eclipseo mentioned this issue Oct 27, 2023

Test failures in 2.5.0 with Python 3.12 pudo/normality#20

Open

Ousret added the detection label and removed the help wanted label Oct 31, 2023

Ousret mentioned this issue Oct 31, 2023

🐛 Regression on some detection case showcased in the documentation (#371) #378

Merged

Ousret closed this as completed in #378 Oct 31, 2023
