Replace cChardet with chardetng_py. #7559

Closed · wants to merge 6 commits

Conversation

@john-parton (Contributor) commented Aug 26, 2023

What do these changes do?

This removes cChardet as an optional speed-up dependency and replaces it with chardetng_py, a Python binding to Mozilla's chardetng (chardet Next Generation) library.

Advantages over cChardet:

  1. Support for newer versions of Python. cChardet is not compatible with the last few versions of Python, but chardetng_py is.
  2. Support for CPU architectures other than x86/amd64. In particular, this brings aarch64 support, but also s390x, ppc64le, and armv7l. Full list here: https://pypi.org/project/chardetng-py/#files
  3. Support for macOS 11 (including arm64)
  4. Better compatibility with browser behaviour, since it uses Firefox's implementation
  5. PyPy compatibility

Other notes

i. chardetng is as fast as or faster than cChardet in my testing
ii. encoding detection is as good as or better than cChardet's

Are there changes in behavior for the user?

The exact encoding returned by chardetng might not match what cchardet returned, but chardetng is in production use in Firefox. If you have specific questions about how the library operates, you should read this blog post: https://hsivonen.fi/chardetng/

Related issue number

Should close #7126

chardetng_py also supports incremental encoding detection, where the buffer can be fed into the detector in chunks.

Implementing that would solve #4112.
Docs: https://chardetng-py.readthedocs.io/en/latest/class_reference.html
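
For illustration, here is a rough sketch of how incremental detection could be used. It assumes the EncodingDetector class documented in the class reference above exposes feed() and guess() methods mirroring the Rust chardetng API; the exact names and signatures should be checked against those docs.

import asyncio
import aiohttp
from chardetng_py import EncodingDetector  # import path assumed; see the class reference above

async def detect_streaming(url: str) -> str:
    # Feed the body into the detector chunk by chunk instead of buffering it all.
    detector = EncodingDetector()
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            async for chunk in resp.content.iter_chunked(64 * 1024):
                detector.feed(chunk, False)  # False: more data may follow
    detector.feed(b"", True)                 # True: end of input
    # Argument names/order assumed from the Rust chardetng guess() API.
    return detector.guess(tld=None, allow_utf8=True)

print(asyncio.run(detect_streaming("https://example.com")))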

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new news fragment into the CHANGES folder
    • name it <issue_id>.<type>, for example: 588.bugfix
    • if you don't have an issue_id, use the PR id after creating the PR
    • ensure type is one of the following:
      • .feature: Signifying a new feature.
      • .bugfix: Signifying a bug fix.
      • .doc: Signifying a documentation improvement.
      • .removal: Signifying a deprecation or removal of public API.
      • .misc: A ticket has been closed, but it is not of interest to users.
    • Make sure to use full sentences with correct case and punctuation, for example: "Fix issue with non-ascii contents in doctest text files."

@psf-chronographer bot added the bot:chronographer:provided label (There is a change note present in this PR) on Aug 26, 2023
@Dreamsorcerer (Member)

I think it's still going to be preferable to remove chardet completely; we can add chardetng to the docs for the few users that might want it.

Also, should the detect() method accept the tld parameter? The article seems to suggest it's reasonably important for accuracy, and that should almost always be available to the developer.

@Dreamsorcerer (Member) commented Aug 27, 2023

Also, the article on chardetng says that it explicitly doesn't support utf-8 and will never select it as an encoding. I see, for example, that GitHub serves text/css without a charset. With the current aiohttp code, such a file would end up being decoded as something other than utf-8, which is going to cause more problems than it solves.

@john-parton (Contributor, Author)

  1. I can add the tld parameter to the detect function. I didn't do it here because I was trying to keep my pull request as small as possible, with limited scope.
  2. Regarding chardetng not returning utf-8 as a valid codec: that's actually configurable! There's an allow_utf8 boolean which changes the behavior.
  3. Even if we don't use the allow_utf8 flag, it might be best to try decoding with utf-8 first, and then fall back to character detection if that fails. Thoughts?
  4. I believe character detection has a place in this library. It's a very difficult problem to get right, and all major browsers and many other client libraries implement this feature. In a perfect world, sure, I would love it if we could rely on charset declarations or default to utf-8, but it's just not feasible.

@Dreamsorcerer (Member)

I agree it's difficult to get right (actually, probably impossible), but I'm just suggesting we move it to documentation and tell the user to use this library if relevant to them. We should be able to provide a simple copy/paste bit of code that shows how to do it (with allow_utf8 and tld included for best accuracy).

@john-parton (Contributor, Author)

@Dreamsorcerer

I've updated my PR with the following changes:

  1. The tld is passed to the chardetng function.
  2. The chardet fallback imported from charset-normalizer is wrapped to ignore the tld and allow_utf8 parameters (a rough sketch of what such a wrapper could look like follows this list).
  3. I refactored get_encoding so that it has one fewer level of nesting.
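
For item 2, here is a minimal sketch of what such a wrapper could look like (hypothetical code, not necessarily what the PR does); it assumes charset-normalizer's chardet-compatible detect() helper:

from typing import Optional
import charset_normalizer

def detect(data: bytes, *, tld: Optional[str] = None, allow_utf8: bool = True) -> str:
    # tld and allow_utf8 are accepted only so the signature matches
    # chardetng_py.detect(); charset-normalizer itself ignores them.
    result = charset_normalizer.detect(data)  # chardet-compatible API
    return result["encoding"] or "utf-8"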

Is there a separate PR which drops character detection completely? I can't find it. I can go ahead and work on opening a pull request which does that, along with adding the documentation you suggested.

@Dreamsorcerer (Member) commented Aug 27, 2023

Is there a separate PR which drops character detection completely? I can't find it. I can go ahead and work on opening a pull request which does that, along with adding the documentation you suggested.

Not yet, so feel free. I was going to look at it after a couple of other tasks I'm dealing with currently.

I assume something like this would be sufficient for the documentation:

body = await resp.read()
tld = resp.url.host.rsplit(".")[-1]
text = body.decode(chardetng_py.detect(body, allow_utf8=True, tld=tld))

Although I'm not sure how well it'll play with TLDs that have multiple parts (but that's another hard problem, one that requires an up-to-date list of TLDs).
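
If the multi-part case ever needs handling, one option (purely illustrative, not part of this PR) would be a public-suffix library such as tldextract:

import tldextract  # third-party; bundles a snapshot of the public suffix list

def tld_for(host: str) -> str:
    # "www.example.co.uk" -> "co.uk"; fall back to the naive last-label split.
    suffix = tldextract.extract(host).suffix
    return suffix or host.rsplit(".")[-1]

print(tld_for("www.example.co.uk"))  # co.uk
print(tld_for("docs.aiohttp.org"))   # org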

@john-parton (Contributor, Author) commented Aug 27, 2023

I assume something like this would be sufficient for the documentation:

body = await resp.read()
tld = resp.url.host.rsplit(".")[-1]
text = body.decode(chardetng_py.detect(body, allow_utf8=True, tld=tld))

That's likely wrong, unfortunately. You probably want to look at the CONTENT_TYPE header first.

I believe this is closer:

try:
    text = await resp.text()
except UnicodeDecodeError:
    tld = resp.url.host.rsplit(".")[-1]
    body = await resp.read()
    text = body.decode(chardetng_py.detect(body, allow_utf8=True, tld=tld))

I'll see if I can get that other PR opened.

@Dreamsorcerer (Member)

Right, maybe then it's worth adding a parameter to text(), so we can do something like:

def chardet(body, resp):
    tld = resp.url.host.rsplit(".")[-1]
    return chardetng_py.detect(body, allow_utf8=True, tld=tld)

...
await resp.text(fallback_encoding=chardet)

Could also include the parameter in ClientSession, so it only has to be set once. At that point there's really no convenience lost: just copy/paste a couple of lines of code at setup time and you have the charset behaviour back. Users are also in full control over which library to use, so we don't have to worry about cchardet being abandoned, etc.

@Dreamsorcerer (Member)

Could also include the parameter in ClientSession

Maybe it should only be set in ClientSession and we can skip the parameter in .text() (we can always add it later if users ask for it).
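
A rough sketch of how that could look from the user's side (the fallback_encoding parameter name is hypothetical; no such parameter exists in aiohttp at the time of writing, and the chardetng_py.detect() call follows the snippets above):

import asyncio
import chardetng_py
from aiohttp import ClientSession, ClientResponse

def charset_fallback(body: bytes, resp: ClientResponse) -> str:
    tld = resp.url.host.rsplit(".")[-1]
    return chardetng_py.detect(body, allow_utf8=True, tld=tld)

async def fetch_text(url: str) -> str:
    # Hypothetical parameter: the session would call charset_fallback() only
    # when the Content-Type header gives no usable charset.
    async with ClientSession(fallback_encoding=charset_fallback) as session:
        async with session.get(url) as resp:
            return await resp.text()

print(asyncio.run(fetch_text("https://example.com")))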

@john-parton (Contributor, Author)

So a few more things:

You don't have to worry about cchardet being abandoned if you accept this pull request. It's not based on chardet at all; it's a different library, written in Rust. In terms of maintenance burden, it's much easier to keep Rust bindings updated with maturin/pyo3 than to maintain cChardet as it's currently written. I can pinky promise I'll keep it updated if that makes you feel better. 😆

I have opened the pull request to remove charset detection completely here: #7560

Regarding setting a fallback character set at the session level: I'm not sure I love that either. As soon as the session accesses multiple URLs from multiple domains, it gets confusing whether you've picked the right fallback or not.

@Dreamsorcerer (Member)

Regarding setting a fallback character set at the session level: I'm not sure I love that either. As soon as the session accesses multiple URLs from multiple domains, it gets confusing whether you've picked the right fallback or not.

Not sure what you mean; my example was using a user-supplied function to reimplement the current behaviour. If a mimetype isn't present in Content-Type, then we call that function (which can just default to lambda b, r: "utf-8").

@john-parton (Contributor, Author) commented Aug 27, 2023

Oh, I understand now. You were recommending setting a callable at the client session level.

An alternative approach would be to direct users to subclass ClientSession:

class ClientSessionWithCharsetDetection(ClientSession):
    async def text(self, *args, **kwargs):
        try:
            return await super().text(*args, **kwargs)
        except UnicodeDecodeError:
            tld = self.url.host.rsplit(".")[-1]
            return self._body.decode(
                chardetng_py.detect(self._body, allow_utf8=True, tld=tld)
            )

I think that achieves the same thing as a callable. I generally try to avoid inheritance, but it looks clean there.

@Dreamsorcerer (Member)

ClientSession is final (it's decorated with @final), so a callable would be needed.

@john-parton (Contributor, Author) commented Aug 27, 2023

Wow, I had no idea about final. I've subclassed ClientSession in my own code. I'll have to read more about the rationale for that.

I do have one more proposal:

Change the behavior of charset detection so that the encoding specified in the header (or utf-8 by default) is tried first, and character set detection is only attempted when that decode fails. This will actually improve performance, and then you can put a warning in the character-set-detection branch so that users of the library get a more visible heads-up that their code won't work when character detection is removed at a later time. Does that sound reasonable?

Presumably a DeprecationWarning.
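
A minimal sketch of the proposed flow (not aiohttp's actual implementation; the chardetng_py.detect() call follows the snippets earlier in this thread):

import warnings
from typing import Optional
import chardetng_py

def decode_body(body: bytes, declared_encoding: Optional[str], tld: str) -> str:
    # Try the declared encoding (or utf-8) first; only run detection on failure.
    try:
        return body.decode(declared_encoding or "utf-8")
    except (UnicodeDecodeError, LookupError):
        warnings.warn(
            "Falling back to charset detection; this behaviour is deprecated.",
            DeprecationWarning,
        )
        detected = chardetng_py.detect(body, allow_utf8=True, tld=tld)
        return body.decode(detected)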

EDIT

Unrelated, but for anyone reading: it looks like subclassing ClientSession is forbidden in 4.0 now because of a __slots__ optimization. Guess I'll have some code to refactor later.

@Dreamsorcerer (Member)

Change the behavior of charset detection so that the encoding specified in the header (or utf-8 by default) is tried first, and character set detection is only attempted when that decode fails. This will actually improve performance, and then you can put a warning in the character-set-detection branch so that users of the library get a more visible heads-up that their code won't work when character detection is removed at a later time. Does that sound reasonable?

Could be an improvement; let's take a look at it in a PR.

@john-parton (Contributor, Author)

@Dreamsorcerer Pull request is here: #7561

I included the chardetng_py changes in there because I really do think that if character set detection is being offered, even in a deprecated form, it's a nice addition.
