Skip to content

Commit

Permalink
Remove chardet/charset-normalizer. Add fallback_charset_resolver Clie…
Browse files Browse the repository at this point in the history
…ntSession parameter. (#7561)

Co-authored-by: Sam Bull <git@sambull.org>
(cherry picked from commit 6755796)
  • Loading branch information
john-parton authored and Dreamsorcerer committed Sep 7, 2023
1 parent 28010cd commit eacb758
Show file tree
Hide file tree
Showing 21 changed files with 104 additions and 190 deletions.
3 changes: 0 additions & 3 deletions .mypy.ini
Original file line number Diff line number Diff line change
Expand Up @@ -35,9 +35,6 @@ ignore_missing_imports = True
[mypy-brotli]
ignore_missing_imports = True

[mypy-cchardet]
ignore_missing_imports = True

[mypy-gunicorn.*]
ignore_missing_imports = True

Expand Down
2 changes: 2 additions & 0 deletions CHANGES/7561.feature
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Replace automatic character set detection with a `fallback_charset_resolver` parameter
in `ClientSession` to allow user-supplied character set detection functions.
1 change: 1 addition & 0 deletions CONTRIBUTORS.txt
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,7 @@ Jesus Cea
Jian Zeng
Jinkyu Yi
Joel Watts
John Parton
Jon Nabozny
Jonas Krüger Svensson
Jonas Obrist
Expand Down
6 changes: 1 addition & 5 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -159,22 +159,18 @@ Requirements

- async-timeout_
- attrs_
- charset-normalizer_
- multidict_
- yarl_
- frozenlist_

Optionally you may install the cChardet_ and aiodns_ libraries (highly
recommended for sake of speed).
Optionally you may install the aiodns_ library (highly recommended for sake of speed).

.. _charset-normalizer: https://pypi.org/project/charset-normalizer
.. _aiodns: https://pypi.python.org/pypi/aiodns
.. _attrs: https://github.com/python-attrs/attrs
.. _multidict: https://pypi.python.org/pypi/multidict
.. _frozenlist: https://pypi.org/project/frozenlist/
.. _yarl: https://pypi.python.org/pypi/yarl
.. _async-timeout: https://pypi.python.org/pypi/async_timeout
.. _cChardet: https://pypi.python.org/pypi/cchardet

License
=======
Expand Down
5 changes: 5 additions & 0 deletions aiohttp/client.py
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,7 @@ class ClientTimeout:
DEFAULT_TIMEOUT: Final[ClientTimeout] = ClientTimeout(total=5 * 60)

_RetType = TypeVar("_RetType")
_CharsetResolver = Callable[[ClientResponse, bytes], str]


class ClientSession:
Expand Down Expand Up @@ -194,6 +195,7 @@ class ClientSession:
"_read_bufsize",
"_max_line_size",
"_max_field_size",
"_resolve_charset",
]
)

Expand Down Expand Up @@ -230,6 +232,7 @@ def __init__(
read_bufsize: int = 2**16,
max_line_size: int = 8190,
max_field_size: int = 8190,
fallback_charset_resolver: _CharsetResolver = lambda r, b: "utf-8",
) -> None:
if loop is None:
if connector is not None:
Expand Down Expand Up @@ -325,6 +328,8 @@ def __init__(
for trace_config in self._trace_configs:
trace_config.freeze()

self._resolve_charset = fallback_charset_resolver

def __init_subclass__(cls: Type["ClientSession"]) -> None:
warnings.warn(
"Inheritance class {} from ClientSession "
Expand Down
54 changes: 27 additions & 27 deletions aiohttp/client_reqrep.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
from typing import (
TYPE_CHECKING,
Any,
Callable,
Dict,
Iterable,
List,
Expand Down Expand Up @@ -70,11 +71,6 @@
ssl = None # type: ignore[assignment]
SSLContext = object # type: ignore[misc,assignment]

try:
import cchardet as chardet
except ImportError: # pragma: no cover
import charset_normalizer as chardet


__all__ = ("ClientRequest", "ClientResponse", "RequestInfo", "Fingerprint")

Expand Down Expand Up @@ -742,8 +738,8 @@ class ClientResponse(HeadersMixin):
_raw_headers: RawHeaders = None # type: ignore[assignment]

_connection = None # current connection
_source_traceback = None
# setted up by ClientRequest after ClientResponse object creation
_source_traceback: Optional[traceback.StackSummary] = None
# set up by ClientRequest after ClientResponse object creation
# post-init stage allows to not change ctor signature
_closed = True # to allow __del__ for non-initialized properly response
_released = False
Expand Down Expand Up @@ -780,6 +776,15 @@ def __init__(
self._loop = loop
# store a reference to session #1985
self._session: Optional[ClientSession] = session
# Save reference to _resolve_charset, so that get_encoding() will still
# work after the response has finished reading the body.
if session is None:
# TODO: Fix session=None in tests (see ClientRequest.__init__).
self._resolve_charset: Callable[
["ClientResponse", bytes], str
] = lambda *_: "utf-8"
else:
self._resolve_charset = session._resolve_charset
if loop.get_debug():
self._source_traceback = traceback.extract_stack(sys._getframe(1))

Expand Down Expand Up @@ -1070,27 +1075,22 @@ def get_encoding(self) -> str:

encoding = mimetype.parameters.get("charset")
if encoding:
try:
codecs.lookup(encoding)
except LookupError:
encoding = None
if not encoding:
if mimetype.type == "application" and (
mimetype.subtype == "json" or mimetype.subtype == "rdap"
):
# RFC 7159 states that the default encoding is UTF-8.
# RFC 7483 defines application/rdap+json
encoding = "utf-8"
elif self._body is None:
raise RuntimeError(
"Cannot guess the encoding of " "a not yet read body"
)
else:
encoding = chardet.detect(self._body)["encoding"]
if not encoding:
encoding = "utf-8"
with contextlib.suppress(LookupError):
return codecs.lookup(encoding).name

if mimetype.type == "application" and (
mimetype.subtype == "json" or mimetype.subtype == "rdap"
):
# RFC 7159 states that the default encoding is UTF-8.
# RFC 7483 defines application/rdap+json
return "utf-8"

if self._body is None:
raise RuntimeError(
"Cannot compute fallback encoding of a not yet read body"
)

return encoding
return self._resolve_charset(self, self._body)

async def text(self, encoding: Optional[str] = None, errors: str = "strict") -> str:
"""Read response payload and decode."""
Expand Down
5 changes: 0 additions & 5 deletions docs/_snippets/cchardet-unmaintained-admonition.rst

This file was deleted.

30 changes: 30 additions & 0 deletions docs/client_advanced.rst
Original file line number Diff line number Diff line change
Expand Up @@ -653,3 +653,33 @@ are changed so that aiohttp itself can wait on the underlying
connection to close. Please follow issue `#1925
<https://github.com/aio-libs/aiohttp/issues/1925>`_ for the progress
on this.


Character Set Detection
-----------------------

If you encounter a :exc:`UnicodeDecodeError` when using :meth:`ClientResponse.text()`
this may be because the response does not include the charset needed
to decode the body.

If you know the correct encoding for a request, you can simply specify
the encoding as a parameter (e.g. ``resp.text("windows-1252")``).

Alternatively, :class:`ClientSession` accepts a ``fallback_charset_resolver`` parameter which
can be used to introduce charset guessing functionality. When a charset is not found
in the Content-Type header, this function will be called to get the charset encoding. For
example, this can be used with the ``chardetng_py`` library.::

from chardetng_py import detect

def charset_resolver(resp: ClientResponse, body: bytes) -> str:
tld = resp.url.host.rsplit(".", maxsplit=1)[-1]
return detect(body, allow_utf8=True, tld=tld)

ClientSession(fallback_charset_resolver=charset_resolver)

Or, if ``chardetng_py`` doesn't work for you, then ``charset-normalizer`` is another option::

from charset_normalizer import detect

ClientSession(fallback_charset_resolver=lamba r, b: detect(b)["encoding"] or "utf-8")
59 changes: 22 additions & 37 deletions docs/client_reference.rst
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,8 @@ The client session supports the context manager protocol for self closing.
read_bufsize=2**16, \
requote_redirect_url=True, \
trust_env=False, \
trace_configs=None)
trace_configs=None, \
fallback_charset_resolver=lambda r, b: "utf-8")

The class for creating client sessions and making requests.

Expand Down Expand Up @@ -226,6 +227,16 @@ The client session supports the context manager protocol for self closing.
disabling. See :ref:`aiohttp-client-tracing-reference` for
more information.

:param Callable[[ClientResponse,bytes],str] fallback_charset_resolver:
A :term:`callable` that accepts a :class:`ClientResponse` and the
:class:`bytes` contents, and returns a :class:`str` which will be used as
the encoding parameter to :meth:`bytes.decode()`.

This function will be called when the charset is not known (e.g. not specified in the
Content-Type header). The default function simply defaults to ``utf-8``.

.. versionadded:: 3.8.6

.. attribute:: closed

``True`` if the session has been closed, ``False`` otherwise.
Expand Down Expand Up @@ -1424,12 +1435,8 @@ Response object
Read response's body and return decoded :class:`str` using
specified *encoding* parameter.

If *encoding* is ``None`` content encoding is autocalculated
using ``Content-Type`` HTTP header and *charset-normalizer* tool if the
header is not provided by server.

:term:`cchardet` is used with fallback to :term:`charset-normalizer` if
*cchardet* is not available.
If *encoding* is ``None`` content encoding is determined from the
Content-Type header, or using the ``fallback_charset_resolver`` function.

Close underlying connection if data reading gets an error,
release connection otherwise.
Expand All @@ -1438,35 +1445,21 @@ Response object
``None`` for encoding autodetection
(default).

:return str: decoded *BODY*

:raise LookupError: if the encoding detected by cchardet is
unknown by Python (e.g. VISCII).

.. note::
:raises: :exc:`UnicodeDecodeError` if decoding fails. See also
:meth:`get_encoding`.

If response has no ``charset`` info in ``Content-Type`` HTTP
header :term:`cchardet` / :term:`charset-normalizer` is used for
content encoding autodetection.

It may hurt performance. If page encoding is known passing
explicit *encoding* parameter might help::

await resp.text('ISO-8859-1')
:return str: decoded *BODY*

.. method:: json(*, encoding=None, loads=json.loads, \
content_type='application/json')
:async:

Read response's body as *JSON*, return :class:`dict` using
specified *encoding* and *loader*. If data is not still available
a ``read`` call will be done,
a ``read`` call will be done.

If *encoding* is ``None`` content encoding is autocalculated
using :term:`cchardet` or :term:`charset-normalizer` as fallback if
*cchardet* is not available.

if response's `content-type` does not match `content_type` parameter
If response's `content-type` does not match `content_type` parameter
:exc:`aiohttp.ContentTypeError` get raised.
To disable content type check pass ``None`` value.

Expand Down Expand Up @@ -1498,17 +1491,9 @@ Response object

.. method:: get_encoding()

Automatically detect content encoding using ``charset`` info in
``Content-Type`` HTTP header. If this info is not exists or there
are no appropriate codecs for encoding then :term:`cchardet` /
:term:`charset-normalizer` is used.

Beware that it is not always safe to use the result of this function to
decode a response. Some encodings detected by cchardet are not known by
Python (e.g. VISCII). *charset-normalizer* is not concerned by that issue.

:raise RuntimeError: if called before the body has been read,
for :term:`cchardet` usage
Retrieve content encoding using ``charset`` info in ``Content-Type`` HTTP header.
If no charset is present or the charset is not understood by Python, the
``fallback_charset_resolver`` function associated with the ``ClientSession`` is called.

.. versionadded:: 3.0

Expand Down
16 changes: 0 additions & 16 deletions docs/glossary.rst
Original file line number Diff line number Diff line change
Expand Up @@ -45,22 +45,6 @@
Any object that can be called. Use :func:`callable` to check
that.

charset-normalizer

The Real First Universal Charset Detector.
Open, modern and actively maintained alternative to Chardet.

https://pypi.org/project/charset-normalizer/

cchardet

cChardet is high speed universal character encoding detector -
binding to charsetdetect.

https://pypi.python.org/pypi/cchardet/

.. include:: _snippets/cchardet-unmaintained-admonition.rst

gunicorn

Gunicorn 'Green Unicorn' is a Python WSGI HTTP Server for
Expand Down
24 changes: 3 additions & 21 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,15 +33,6 @@ Library Installation
$ pip install aiohttp
You may want to install *optional* :term:`cchardet` library as faster
replacement for :term:`charset-normalizer`:

.. code-block:: bash
$ pip install cchardet
.. include:: _snippets/cchardet-unmaintained-admonition.rst

For speeding up DNS resolving by client API you may install
:term:`aiodns` as well.
This option is highly recommended:
Expand All @@ -53,9 +44,9 @@ This option is highly recommended:
Installing all speedups in one command
--------------------------------------

The following will get you ``aiohttp`` along with :term:`cchardet`,
:term:`aiodns` and ``Brotli`` in one bundle. No need to type
separate commands anymore!
The following will get you ``aiohttp`` along with :term:`aiodns` and ``Brotli`` in one
bundle.
No need to type separate commands anymore!

.. code-block:: bash
Expand Down Expand Up @@ -158,17 +149,8 @@ Dependencies

- *async_timeout*
- *attrs*
- *charset-normalizer*
- *multidict*
- *yarl*
- *Optional* :term:`cchardet` as faster replacement for
:term:`charset-normalizer`.

Install it explicitly via:

.. code-block:: bash
$ pip install cchardet

- *Optional* :term:`aiodns` for fast DNS resolving. The
library is highly recommended.
Expand Down

0 comments on commit eacb758

Please sign in to comment.