Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Massively reduce EncodingWarning spam in tests #4234

Closed
wants to merge 15 commits into from

Conversation

Avasam
Copy link
Contributor

@Avasam Avasam commented Feb 23, 2024

Summary of changes

Explicitly add encoding=locale.getpreferredencoding(False) to files opened as string. Starting Python 3.10 this can become encoding="locale"
Done by running ruff check . --select=PLW1514 --fix --unsafe-fixes (no manual code change)
I also added the check to Ruff config
Edit: This was changed by a constant that's "locale" when supported. PLW1514 also wasn't added to the Ruff config because it requires the preview flag which itself causes new lint errors unrelated to this PR

I didn't go for "utf-8" because it could be a breaking change for some projects on Windows.

Works toward #3810 (comment) (I don't think it closes, idk if there's more warnings, probably from vendored packages)

Pull Request Checklist

  • Changes have tests (all existing tests should pass, but with a lot less warnings now)
  • News fragment added in newsfragments/.
    (See documentation for details)

Copy link
Contributor

@abravalheri abravalheri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Avasam, I would like to thank you for all the contributions you are doing to setuptools. You have been very nice to the project and put a lot of effort in tackling very hard and time consuming tasks.


My minor comment about this are the following:

  • When dealing with .pth files specifically, the advice we got from CPython maintainers is to omit the encoding parameter and ignore the warning specifically, instead of getpreferredencoding: .pth files cannot contain folders with utf-8 names python/cpython#77102 (comment).

    I while ago I left a comment in editable_wheel to remind myself about that in the future.

  • .py files should always use UTF-8 right? (Because we only support Python3)

  • In the tests folder, for everything else that is not .pth or is explicitly trying to test encoding, we probably can risk utf-8.

@Avasam
Copy link
Contributor Author

Avasam commented Feb 23, 2024

Alright so doing this made me notice another issue that leads me to believe we can't completely hide these warnings in a non-breaking way to stop to spam in tests (until Python 3.9 support is dropped so encoding="locale" can be used everywhere it should):

In diffcov (which runs Python 3.12), the warning is simply changed to EncodingWarning: UTF-8 Mode affects locale.getpreferredencoding(). Consider locale.getencoding() instead. But since I'm only using getpreferredencoding for open, may as well use encoding="locale" at that point. (both available at Python 3.11). Which defeats the purpose of my attempted changes here.

Edit: Unless getting the default encoding is extracted to a helper method, so the getpreferredencoding warning appears only a few times.


  • In the tests folder, for everything else that is not .pth or is explicitly trying to test encoding, we probably can risk utf-8.

I was thinking that too. Some or most tests could probably still be changed to use "utf-8" directly.

But we could also just blanket ignore this warning (if that's doable) given the special situation setuptools always has. And it would achieve the same result of improving test results readability.


  • .py files should always use UTF-8 right? (Because we only support Python3)

I have come across python files with non-utf-8 characters in the wild, specifically this one with Windows 1252 chars that I remember tripping up Ruff:
https://github.com/mhammond/pywin32/blob/main/Pythonwin/pywin/test/_dbgscript.py

>>> open("./Pythonwin/pywin/test/_dbgscript.py").readlines()
['# -*- coding: latin-1 -*-\n', '#\n', '# Script for testing pywin.debugger & interactive exec from test_pywin\n', '\n', '# Umlauts for encoding test: áéúäöü\n', '\n', 'aa = 11\n', 'aa = 22\n', '\n', '\n', 'class CC:\n', '    cc = 44\n', '\n', '\n', 'def ff(bb=55):\n', '    global aa\n', '    aa = 77\n', '    return aa + bb\n', '\n', '\n', 'ff()\n', 'aa = 33\n']
>>> open("./Pythonwin/pywin/test/_dbgscript.py", encoding="utf-8").readlines()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python39\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 129: invalid continuation byte

And pywin32 also has these files with a different top-of-the-file encoding declaration:
https://github.com/mhammond/pywin32/blob/main/com/win32com/demos/iebutton.py
https://github.com/mhammond/pywin32/blob/main/com/win32com/demos/ietoolbar.py

Granted these are Python 2 & Internet Explorer era files, and the one with non-UTF-8 chars is in a test. But apparently they exist?

@Avasam Avasam changed the title Fix EncodingWarning: 'encoding' argument not specified spam in tests Fix EncodingWarning spam in tests for Python 3.8 & 3.9 Feb 23, 2024
@Avasam Avasam force-pushed the EncodingWarning branch 5 times, most recently from 41d7d29 to a460f2c Compare February 24, 2024 04:36
@Avasam Avasam force-pushed the EncodingWarning branch 4 times, most recently from 544aed8 to ae88944 Compare February 24, 2024 05:43
@Avasam Avasam marked this pull request as draft February 24, 2024 05:55
@Avasam
Copy link
Contributor Author

Avasam commented Feb 24, 2024

All tests pass (except coverage), and you can see using the diffcov test (because they only appear in failing test) that there's 4 EncodingWarning left down from ~124 ! (2 of which are from pip and I believe are happening outside pytest, ref.: pypa/pip#12070)

That doesn't mean this is done though, I still need to:

  1. Validate with you, the maintainers, whether the name and location of the new utilities is good
    2. Manually search the code for .pth to try to set the encoding explicitly to None (unless you tell me that passing tests is enough)
    3. Should I use encoding_for_open_for_mode everywhere where the mode argument to open is dynamic and could potentially be b (bytes read/write)? Rn I only added it where tests fail.

Finally, once I have a full complete solution, this could likely be split up in smaller PRs. For example to start with all changes in tests that certainly won't affect end-users.

@Avasam Avasam changed the title Fix EncodingWarning spam in tests for Python 3.8 & 3.9 Massively reduce EncodingWarning spam in tests Feb 24, 2024
@Avasam Avasam force-pushed the EncodingWarning branch 6 times, most recently from fca4dc1 to acd2a0b Compare February 24, 2024 18:18
@Avasam Avasam force-pushed the EncodingWarning branch 3 times, most recently from 35ac604 to 2bb129d Compare February 24, 2024 19:06
@Avasam Avasam marked this pull request as ready for review February 25, 2024 19:10
Copy link
Member

@jaraco jaraco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for doing this work. I know it's a lot.

I'd like to explore adopting the future default UTF-8 if possible.

pytest.ini Outdated Show resolved Hide resolved
pytest.ini Outdated Show resolved Hide resolved
from typing import Optional


locale_encoding = "locale" if sys.version_info >= (3, 10) else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my understanding that encoding='locale' is essentially locking in the legacy behavior, which is UTF-8 on most systems, but cp1252 or similar on Windows. I believe the intention of the EncodingWarnings is to eventually default to UTF-8 regardless of the platform or environment. It would be my preference too if Setuptools could adopt UTF-8 and avoid the platform variance or at the very least minimize the variance to places where it's absolutely necessary.
When I made similar changes to distutils (pypa/distutils#232), I simply hard-coded "utf-8".
I still need to absorb the discussion, which I haven't the time right now, but I wanted to get this review in.

Copy link
Contributor Author

@Avasam Avasam Mar 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding of encoding='locale' seems to align with mine. As a point of precision for this PR's goal: It's not to be a long-term fix that is forward-compatible with Python 3.15 (current ETA of UTF-8 as default), but rather just to stop the warning spam whilst keeping the current behaviour + backwards compatibility (no possible breaking change) and keeping it easy to find where the encoding is set explicitly for an eventual update to UTF-8.

I'd like to explore adopting the future default UTF-8 if possible.
It would be my preference too if Setuptools could adopt UTF-8 and avoid the platform variance or at the very least minimize the variance to places where it's absolutely necessary.

As far as going full utf-8 goes, @abravalheri suggested starting with tests, where we couldn't directly affect any user of setuptools.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is to suppress the warnings, I would simply suppress them in the pytest.ini rather than coding in the less desirable behavior. At the very least, if this code is undesirable/transitional, I'd like to see a comment (probably linked to an open issue) indicating that there's anticipated work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discovered pretty late into doing this how the pytest.ini file could be used to suppress these warnings. If you're fine suppressing them there, with an appropriate comment, to make current failed tests more readable, then I'll try to make it work that way instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this #4255 ?
If so, I'll immediately close this PR in favor of that one.

@abravalheri
Copy link
Contributor

In #4261, I am testing out "utf-8" for setuptools/tests discussed in #4234 (review).

@Avasam Avasam marked this pull request as draft March 7, 2024 17:23
@Avasam
Copy link
Contributor Author

Avasam commented Mar 9, 2024

As mentioned in #4255 (comment) , the vast majority of the spam has been greatly reduced. And about half of what's left should be fixed upstream (and can be ignored by that PR). Let's try to use UTF-8 where applicable for non-test changes.

@Avasam Avasam closed this Mar 9, 2024
@Avasam Avasam deleted the EncodingWarning branch May 9, 2024 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants