Replace cChardet with chardetng_py. #7559

Closed · wants to merge 6 commits

Conversation

@john-parton (Contributor) commented Aug 26, 2023

What do these changes do?

This removes cChardet as an optional speed-up dependency and replaces it with chardetng_py, a Python binding to Mozilla's chardetng (chardet Next Generation) library.

Advantages over cChardet:

  1. Support for newer versions of Python. cChardet is not compatible with the last few versions of Python, but chardetng_py is.
  2. Support for CPU architectures other than x86/amd64. In particular, this brings aarch64 support, but also s390x, ppc64le, and armv7l. Full list here: https://pypi.org/project/chardetng-py/#files
  3. Support for macOS 11 (including arm64)
  4. Better compatibility with browser behaviour, since it uses Firefox's implementation
  5. PyPy compatibility

Other notes

i. chardetng is as fast as or faster than cChardet in my testing
ii. encoding detection is as good as or better than cChardet's

Are there changes in behavior for the user?

The exact encoding returned by chardetng might not match what cchardet returned, but chardetng is in production use in Firefox. If you have specific questions about how the library operates, you should read this blog post: https://hsivonen.fi/chardetng/

Related issue number

Should close #7126

chardetng_py also supports incremental encoding detection, where the buffer can be fed into the detector in chunks.

Implementing that would solve #4112.
Docs: https://chardetng-py.readthedocs.io/en/latest/class_reference.html
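
For illustration, here is a rough sketch of how incremental detection could be used. It assumes the EncodingDetector class documented in the class reference above exposes feed() and guess() methods mirroring the Rust chardetng API; the exact names and signatures should be checked against those docs.

import asyncio
import aiohttp
from chardetng_py import EncodingDetector  # import path assumed; see the class reference above

async def detect_streaming(url: str) -> str:
    # Feed the body into the detector chunk by chunk instead of buffering it all.
    detector = EncodingDetector()
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            async for chunk in resp.content.iter_chunked(64 * 1024):
                detector.feed(chunk, False)  # False: more data may follow
    detector.feed(b"", True)                 # True: end of input
    # Argument names/order assumed from the Rust chardetng guess() API.
    return detector.guess(tld=None, allow_utf8=True)

print(asyncio.run(detect_streaming("https://example.com")))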

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new news fragment into the CHANGES folder
    • name it <issue_id>.<type>, for example: 588.bugfix
    • if you don't have an issue_id, use the PR id after creating the PR
    • ensure type is one of the following:
      • .feature: Signifying a new feature.
      • .bugfix: Signifying a bug fix.
      • .doc: Signifying a documentation improvement.
      • .removal: Signifying a deprecation or removal of public API.
      • .misc: A ticket has been closed, but it is not of interest to users.
    • Make sure to use full sentences with correct case and punctuation, for example: "Fix issue with non-ascii contents in doctest text files."

@psf-chronographer bot added the bot:chronographer:provided label (There is a change note present in this PR) on Aug 26, 2023
@Dreamsorcerer (Member)

I think it's still going to be preferable to remove chardet completely; we can add chardetng to the docs for the few users that might want it.

Also, should the detect() method accept the tld parameter? The article seems to suggest it's reasonably important for accuracy, and that should almost always be available to the developer.

@Dreamsorcerer (Member) commented Aug 27, 2023

Also, the article on chardetng says that it explicitly doesn't support utf-8 and will never select it as an encoding. I see, for example, that GitHub serves text/css without a charset. With the current aiohttp code, such a file would end up being decoded as something other than utf-8, which is going to cause more problems than it solves.

@john-parton (Contributor, Author)

  1. I can add the tld parameter to the detect function. I didn't do it here because I was trying to keep my pull request as small as possible, with limited scope.
  2. Regarding chardetng not returning utf-8 as a valid codec: that's actually configurable! There's an allow_utf8 boolean which changes the behavior.
  3. Even if we don't use the allow_utf8 flag, it might be best to try decoding with utf-8 first, and then fall back to character detection if that fails. Thoughts?
  4. I believe character detection has a place in this library. It's a very difficult problem to get right, and all major browsers and many other client libraries implement this feature. In a perfect world, sure, I would love it if we could rely on charset declarations or default to utf-8, but it's just not feasible.

@Dreamsorcerer (Member)

I agree it's difficult to get right (actually, probably impossible), but I'm just suggesting we move it to documentation and tell the user to use this library if relevant to them. We should be able to provide a simple copy/paste bit of code that shows how to do it (with allow_utf8 and tld included for best accuracy).

@john-parton (Contributor, Author)

@Dreamsorcerer

I've updated my PR with the following changes:

  1. The tld is passed to the chardetng function.
  2. The chardet fallback imported from charset-normalizer is wrapped to ignore the tld and allow_utf8 parameters (a rough sketch of what such a wrapper could look like follows this list).
  3. I refactored get_encoding so that it has one fewer level of nesting.
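
For item 2, here is a minimal sketch of what such a wrapper could look like (hypothetical code, not necessarily what the PR does); it assumes charset-normalizer's chardet-compatible detect() helper:

from typing import Optional
import charset_normalizer

def detect(data: bytes, *, tld: Optional[str] = None, allow_utf8: bool = True) -> str:
    # tld and allow_utf8 are accepted only so the signature matches
    # chardetng_py.detect(); charset-normalizer itself ignores them.
    result = charset_normalizer.detect(data)  # chardet-compatible API
    return result["encoding"] or "utf-8"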

Is there a separate PR which drops character detection completely? I can't find it. I can go ahead and work on opening a pull request which does that, along with adding the documentation you suggested.

@Dreamsorcerer (Member) commented Aug 27, 2023

Is there a separate PR which drops character detection completely? I can't find it. I can go ahead and work on opening a pull request which does that, along with adding the documentation you suggested.

Not yet, so feel free. I was going to look at it after a couple of other tasks I'm dealing with currently.

I assume something like this would be sufficient for the documentation:

body = await resp.read()
tld = resp.url.host.rsplit(".")[-1]
text = body.decode(chardetng_py.detect(body, allow_utf8=True, tld=tld))

Although I'm not sure how well it'll play with TLDs that have multiple parts (but that's another hard problem, one that requires an up-to-date list of TLDs).
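
If the multi-part case ever needs handling, one option (purely illustrative, not part of this PR) would be a public-suffix library such as tldextract:

import tldextract  # third-party; bundles a snapshot of the public suffix list

def tld_for(host: str) -> str:
    # "www.example.co.uk" -> "co.uk"; fall back to the naive last-label split.
    suffix = tldextract.extract(host).suffix
    return suffix or host.rsplit(".")[-1]

print(tld_for("www.example.co.uk"))  # co.uk
print(tld_for("docs.aiohttp.org"))   # org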

@john-parton (Contributor, Author) commented Aug 27, 2023

I assume something like this would be sufficient for the documentation:

body = await resp.read()
tld = resp.url.host.rsplit(".")[-1]
text = body.decode(chardetng_py.detect(body, allow_utf8=True, tld=tld))

That's likely wrong, unfortunately. You probably want to look at the CONTENT_TYPE header first.

I believe this is closer:

try:
    text = await resp.text()
except UnicodeDecodeError:
    tld = resp.url.host.rsplit(".")[-1]
    body = await resp.read()
    text = body.decode(chardetng_py.detect(body, allow_utf8=True, tld=tld))

I'll see if I can get that other PR opened.

@Dreamsorcerer (Member)

Right, maybe then it's worth adding a parameter to text(), so we can do something like:

def chardet(body, resp):
    tld = resp.url.host.rsplit(".")[-1]
    return chardetng_py.detect(body, allow_utf8=True, tld=tld)

...
await resp.text(fallback_encoding=chardet)

Could also include the parameter in ClientSession, so it only has to be set once. At that point there's really no convenience lost: just copy/paste a couple of lines of code at setup time and you have the charset behaviour back. Users are also in full control over which library to use, so we don't have to worry about cchardet being abandoned, etc.

@Dreamsorcerer (Member)

Could also include the parameter in ClientSession

Maybe it should only be set in ClientSession and we can skip the parameter in .text() (we can always add it later if users ask for it).
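
A rough sketch of how that could look from the user's side (the fallback_encoding parameter name is hypothetical; no such parameter exists in aiohttp at the time of writing, and the chardetng_py.detect() call follows the snippets above):

import asyncio
import chardetng_py
from aiohttp import ClientSession, ClientResponse

def charset_fallback(body: bytes, resp: ClientResponse) -> str:
    tld = resp.url.host.rsplit(".")[-1]
    return chardetng_py.detect(body, allow_utf8=True, tld=tld)

async def fetch_text(url: str) -> str:
    # Hypothetical parameter: the session would call charset_fallback() only
    # when the Content-Type header gives no usable charset.
    async with ClientSession(fallback_encoding=charset_fallback) as session:
        async with session.get(url) as resp:
            return await resp.text()

print(asyncio.run(fetch_text("https://example.com")))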

@john-parton (Contributor, Author)

So a few more things:

You don't have to worry about cchardet being abandoned if you accept this pull request. It's not based on chardet at all; it's a different library, written in Rust. In terms of maintenance burden, it's much easier to keep Rust bindings updated with maturin/pyo3 than to maintain cChardet as it's currently written. I can pinky promise I'll keep it updated if that makes you feel better. 😆

I have opened the pull request to remove charset detection completely here: #7560

Regarding setting a fallback character set at the session level: I'm not sure I love that either. As soon as the session accesses multiple URLs from multiple domains, it gets confusing whether you've picked the right fallback or not.

@Dreamsorcerer (Member)

Regarding setting a fallback character set at the session level: I'm not sure I love that either. As soon as the session accesses multiple URLs from multiple domains, it gets confusing whether you've picked the right fallback or not.

Not sure what you mean; my example was using a user-supplied function to reimplement the current behaviour. If a mimetype isn't present in Content-Type, then we call that function (which can just default to lambda b, r: "utf-8").

@john-parton (Contributor, Author) commented Aug 27, 2023

Oh, I understand now. You were recommending setting a callable at the client session level.

An alternative approach would be to direct users to subclass ClientSession:

class ClientSessionWithCharsetDetection(ClientSession):
    async def text(self, *args, **kwargs):
        try:
            return await super().text(*args, **kwargs)
        except UnicodeDecodeError:
            tld = self.url.host.rsplit(".")[-1]
            return self._body.decode(
                chardetng_py.detect(self._body, allow_utf8=True, tld=tld)
            )

I think that achieves the same thing as a callable. I generally try to avoid inheritance, but it looks clean there.

@Dreamsorcerer (Member)

ClientSession is final (it's decorated with @final), so a callable would be needed.

@john-parton (Contributor, Author) commented Aug 27, 2023

Wow, I had no idea about final. I've subclassed ClientSession in my own code. I'll have to read more about the rationale for that.

I do have one more proposal:

Change the behavior of charset detection so that the encoding specified in the header (or utf-8 by default) is tried first, and character set detection is only attempted when that decode fails. This will actually improve performance, and then you can put a warning in the character-set-detection branch so that users of the library get a more visible heads-up that their code won't work when character detection is removed at a later time. Does that sound reasonable?

Presumably a DeprecationWarning.
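
A minimal sketch of the proposed flow (not aiohttp's actual implementation; the chardetng_py.detect() call follows the snippets earlier in this thread):

import warnings
from typing import Optional
import chardetng_py

def decode_body(body: bytes, declared_encoding: Optional[str], tld: str) -> str:
    # Try the declared encoding (or utf-8) first; only run detection on failure.
    try:
        return body.decode(declared_encoding or "utf-8")
    except (UnicodeDecodeError, LookupError):
        warnings.warn(
            "Falling back to charset detection; this behaviour is deprecated.",
            DeprecationWarning,
        )
        detected = chardetng_py.detect(body, allow_utf8=True, tld=tld)
        return body.decode(detected)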

EDIT

Unrelated, but for anyone reading: it looks like subclassing ClientSession is forbidden in 4.0 now because of a __slots__ optimization. Guess I'll have some code to refactor later.

@Dreamsorcerer (Member)

Change the behavior of charset detection so that the encoding specified in the header (or utf-8 by default) is tried first, and character set detection is only attempted when that decode fails. This will actually improve performance, and then you can put a warning in the character-set-detection branch so that users of the library get a more visible heads-up that their code won't work when character detection is removed at a later time. Does that sound reasonable?

Could be an improvement; let's take a look at it in a PR.

@john-parton (Contributor, Author)

@Dreamsorcerer Pull request is here: #7561

I included the chardetng_py changes in there because I really do think that if character set detection is being offered, even in a deprecated form, it's a nice addition.
