Fix `--debug` and setting of `PYTEST_CURRENT_TEST` in non-UTF8 locales #7787

asottile · 2020-09-23T18:51:12Z

Resolves #7781
Resolves #7786

The-Compiler

I'm not sure if I agree with the approach here.

For the debug file, I guess forcing it to be UTF-8 isn't too bad - even on Windows, I suppose most text editors will be able to detect and open UTF-8 (though I'm not sure if Windows Notepad does?)

But for the environment variable, this will just shift the error to the reader of the variable - surely the reader won't expect an UTF-8 encoded variable when the system encoding isn't UTF-8. Also I'm not sure how the situation is on Windows, are environment variables text/UTF-8 there? Otherwise this would still fail there, no?

The only other solution I can see is something like:

encoding = sys.getfilesystemencoding()
value = value.encode(encoding, errors='replace').decode(encoding)
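
For illustration, a minimal sketch of that round trip. The helper name `to_encodable` and its explicit `encoding` parameter are hypothetical and not part of the PR; passing the encoding in only makes the behaviour reproducible regardless of the locale the example runs under.

```python
import sys


def to_encodable(value: str, encoding: str = "") -> str:
    """Replace characters that `encoding` cannot represent with '?'."""
    # Hypothetical helper: fall back to the filesystem encoding,
    # as in the two-line snippet above.
    encoding = encoding or sys.getfilesystemencoding()
    return value.encode(encoding, errors="replace").decode(encoding)


name = "t.py::test_\u2603"  # test id containing a snowman

# Under ASCII the snowman cannot be represented and becomes '?'.
print(to_encodable(name, "ascii"))

# Under UTF-8 nothing is lost, so the value comes back unchanged.
print(to_encodable(name, "utf-8") == name)
```

The result is always a `str` that the chosen encoding can represent, at the cost of losing the original characters.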

asottile · 2020-09-24T15:04:08Z

> I'm not sure if I agree with the approach here.
>
> For the debug file, I guess forcing it to be UTF-8 isn't too bad - even on Windows, I suppose most text editors will be able to detect and open UTF-8 (though I'm not sure if Windows Notepad does?)
>
> But for the environment variable, this will just shift the error to the reader of the variable - surely the reader won't expect an UTF-8 encoded variable when the system encoding isn't UTF-8. Also I'm not sure how the situation is on Windows, are environment variables text/UTF-8 there? Otherwise this would still fail there, no?
>
> The only other solution I can see is something like:
>
> encoding = sys.getfilesystemencoding()
> value = value.encode(encoding, errors='replace').decode(encoding)

windows doesn't support bytes environ, so it is always text there
I know of no modern system which doesn't use UTF-8 as a default locale

nicoddemus · 2020-09-24T18:51:46Z

> windows doesn't support bytes environ, so it is always text there

True:

Python 3.6.7 (default, Mar  4 2020, 17:08:00) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.supports_bytes_environ
False

> I know of no modern system which doesn't use UTF-8 as a default locale

I think so too, but shouldn't we use the default locale then?

nicoddemus · 2020-09-24T18:55:28Z

> For the debug file, I guess forcing it to be UTF-8 isn't too bad - even on Windows, I suppose most text editors will be able to detect and open UTF-8 (though I'm not sure if Windows Notepad does?)

I agree. The debug file is meant for advanced users, so it is fair to encode it using a known encoding and expect those advanced users to be able to open it. Even in situations where the user can't use a text editor that supports UTF-8, all that will happen is a few wrong characters if the output contains non-ASCII text, which shouldn't be too bad.
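
A minimal sketch of what a fixed encoding means in practice, assuming the debug log is written through Python's built-in `open`. The function name `write_debug_line` and the temporary path are made up for this example and are not pytest's actual code.

```python
import os
import tempfile


def write_debug_line(path: str, line: str) -> None:
    # An explicit encoding makes the write independent of the locale;
    # without it, open() falls back to the locale's preferred encoding.
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


# Illustrative location only.
path = os.path.join(tempfile.mkdtemp(), "pytestdebug.log")
write_debug_line(path, "test_\u2603 PASSED")

# Read the raw bytes back and decode them as UTF-8.
with open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-8").strip()
print(decoded == "test_\u2603 PASSED")
```

A reader that assumes a different encoding would see a few wrong characters where the snowman is, but the surrounding ASCII text stays readable.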

asottile · 2020-09-24T23:29:38Z

> I think so too, but shouldn't we use the default locale then?

if we use the default locale then there's information loss

nicoddemus · 2020-09-25T00:33:43Z

> if we use the default locale then there's information loss

Hmm, TBH I'm not sure I follow the underlying semantics here, but I will defer to your expertise that this is the correct approach. Other than that, the rest looks great. 👍

bluetech · 2020-09-27T10:16:41Z

Agree with the others that UTF-8 is good for --debug.

As for PYTEST_CURRENT_TEST I think UTF-8 is not really appropriate, because anyone reading the value would not do so correctly. You mentioned you don't want information loss but I think given how PYTEST_CURRENT_TEST is described in the docs, specifically

> The contents of PYTEST_CURRENT_TEST is meant to be human readable and the actual format can be changed between releases (even bug fixes) so it shouldn’t be relied on for scripting or automation.

that information loss is preferable to bad encoding. So my suggestion would be to use encode(sys.getfilesystemencoding(), "replace"). For ascii it will be b't.py::test4_????_????'.

(getfilesystemencoding seems to be what python uses for envvars when supports_bytes_environ is true).
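
To make the trade-off concrete, here is a small sketch comparing the two outcomes. It hard-codes `"ascii"` and `"latin-1"` instead of calling `sys.getfilesystemencoding()` so the result does not depend on the machine it runs on; the variable names are illustrative only.

```python
name = "t.py::test4_\u0447\u045b\u0448\u0452_\u010d\u0107\u0161\u0111"

# Lossy but honest: every character ASCII cannot represent becomes '?'.
replaced = name.encode("ascii", "replace")
print(replaced)  # b't.py::test4_????_????'

# Lossless bytes, but a reader that assumes Latin-1 sees mojibake
# instead of the original test id.
misread = name.encode("utf-8").decode("latin-1")
print(ascii(misread))

# The original is still recoverable, but only by a reader that knows
# the bytes were really UTF-8.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == name)
```

With the replacement handler the reader gets a degraded but correctly encoded value; with forced UTF-8 the reader gets every byte, but must already know to decode it as UTF-8.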

asottile · 2020-09-27T17:08:52Z

I'm going to merge this as-is unless there's stronger opposition

python3.8+ in the same situation chooses UTF-8 so I think this choice is consistent and appropriate for cpython -- this is really only a bug in older python versions

the filesystem encoding also isn't appropriate for the environ, it's not a file -- there are however two equally bad choices: locale or filesystem

bluetech · 2020-09-27T17:42:34Z

> python3.8+ in the same situation chooses UTF-8

Do you have a reference for that? Looking at the os.py code it seems the same to me.

asottile · 2020-09-27T18:22:46Z

actually, it's 3.7+

$ LANG= python3.6 -c 'import os; os.environ["x"] = "☃"'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 30-32: surrogates not allowed
$ LANG= python3.7 -c 'import os; os.environ["x"] = "☃"'
$ LANG= python3.8 -c 'import os; os.environ["x"] = "☃"'
$

bluetech · 2020-09-27T18:28:16Z

That's because these versions changed the interpretation of "C" to UTF-8. But that doesn't work for other cases:

$ LANG=en_US.ISO-8859-1 python
Python 3.8.5 (default, Jul 27 2020, 08:42:51) 
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os; os.environ['x'] = 't.py::test4_\u0447\u045b\u0448\u0452_\u010d\u0107\u0161\u0111'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/os.py", line 680, in __setitem__
    value = self.encodevalue(value)
  File "/usr/lib/python3.8/os.py", line 751, in encode
    return value.encode(encoding, 'surrogateescape')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 12-15: ordinal not in range(256)
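
The traceback shows that the `'surrogateescape'` handler used by `os.py` does not rescue this case. A minimal sketch of why, using the codecs directly rather than `os.environ` so it behaves the same under any locale (the sample strings are made up for this example):

```python
name = "test_\u0447"  # Cyrillic letter, not representable in Latin-1

# 'surrogateescape' only smuggles undecodable *bytes* through a str;
# it does not help with real characters the codec cannot encode.
error = None
try:
    name.encode("latin-1", "surrogateescape")
except UnicodeEncodeError as exc:
    error = exc
    print(type(exc).__name__, exc.reason)

# What the handler is designed for: round-tripping raw bytes that the
# codec could not decode in the first place.
raw = b"caf\xe9"
text = raw.decode("ascii", "surrogateescape")
roundtrip = text.encode("ascii", "surrogateescape")
print(roundtrip == raw)
```

So any locale whose encoding cannot represent a character in the test id still fails at assignment time unless the value is encoded explicitly first.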

asottile · 2023-06-21T18:09:20Z

via #10935

asottile added the "needs backport" label (applied to PRs, indicates that it should be ported to the current bug-fix branch) Sep 23, 2020

Fix --debug and setting of PYTEST_CURRENT_TEST in non-UTF8 locales

802182f

The-Compiler reviewed Sep 24, 2020

nicoddemus approved these changes Sep 25, 2020

Base automatically changed from master to main March 9, 2021 20:40

RonnyPfannschmidt added this to To do in unicode-node-ids Aug 26, 2021

DanielNoord mentioned this pull request Feb 9, 2022

Pytest debug option "--debug" has problems to write non ASCII characters #7781

asottile closed this Jun 21, 2023

asottile deleted the fix_non_encodable_test_names branch June 21, 2023 18:09
