Fix --debug and setting of PYTEST_CURRENT_TEST in non-UTF8 locales #7787

Closed

Conversation

asottile
Member

Resolves #7781
Resolves #7786

@asottile added the "needs backport" label (applied to PRs, indicates that it should be ported to the current bug-fix branch) on Sep 23, 2020
Member

@The-Compiler left a comment

I'm not sure if I agree with the approach here.

For the debug file, I guess forcing it to be UTF-8 isn't too bad - even on Windows, I suppose most text editors will be able to detect and open UTF-8 (though I'm not sure if Windows Notepad does?)

But for the environment variable, this will just shift the error to the reader of the variable - surely the reader won't expect a UTF-8 encoded variable when the system encoding isn't UTF-8. Also, I'm not sure what the situation is on Windows: are environment variables text/UTF-8 there? Otherwise this would still fail there, no?

The only other solution I can see is something like:

encoding = sys.getfilesystemencoding()
value = value.encode(encoding, errors='replace').decode(encoding)
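
As a rough illustration of the round-trip suggested above, here is a minimal sketch. The helper name `make_encodable` and its optional `encoding` parameter are hypothetical and not part of pytest; the parameter only exists so the behaviour can be shown without changing the process locale.

```python
import sys


def make_encodable(value: str, encoding: str = "") -> str:
    """Replace characters that `encoding` cannot represent with '?'.

    Hypothetical helper: when no encoding is given it falls back to
    sys.getfilesystemencoding(), as in the suggestion above.
    """
    encoding = encoding or sys.getfilesystemencoding()
    # encode(..., errors="replace") substitutes '?' for unencodable characters,
    # so the following decode() with the same encoding cannot fail.
    return value.encode(encoding, errors="replace").decode(encoding)


# U+2603 (snowman) is not representable in ASCII, so it becomes '?'.
print(make_encodable("t.py::test_\u2603", "ascii"))
```

With a UTF-8 encoding the value passes through unchanged; with a narrower encoding the result is always safe to store, at the cost of losing the original characters.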

@asottile
Member Author

I'm not sure if I agree with the approach here.

For the debug file, I guess forcing it to be UTF-8 isn't too bad - even on Windows, I suppose most text editors will be able to detect and open UTF-8 (though I'm not sure if Windows Notepad does?)

But for the environment variable, this will just shift the error to the reader of the variable - surely the reader won't expect a UTF-8 encoded variable when the system encoding isn't UTF-8. Also, I'm not sure what the situation is on Windows: are environment variables text/UTF-8 there? Otherwise this would still fail there, no?

The only other solution I can see is something like:

encoding = sys.getfilesystemencoding()
value = value.encode(encoding, errors='replace').decode(encoding)
  • windows doesn't support bytes environ, so it is always text there
  • I know of no modern system which doesn't use UTF-8 as a default locale

@nicoddemus
Member

windows doesn't support bytes environ, so it is always text there

True:

Python 3.6.7 (default, Mar  4 2020, 17:08:00) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.supports_bytes_environ
False

I know of no modern system which doesn't use UTF-8 as a default locale

I think so too, but shouldn't we use the default locale then?

@nicoddemus
Member

For the debug file, I guess forcing it to be UTF-8 isn't too bad - even on Windows, I suppose most text editors will be able to detect and open UTF-8 (though I'm not sure if Windows Notepad does?)

I agree. The debug file is meant for advanced users, so it is fair to encode it using a known encoding and expect those advanced users to be able to open it. Even in situations where the user can't use a text editor that supports UTF-8, all that will happen is a few wrong characters if the output contains non-ASCII text, which shouldn't be too bad.
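
For illustration, writing a log file with an explicit encoding (rather than the locale default) might look like the sketch below. This is not pytest's actual implementation; the temporary directory and the sample test id are assumptions made for the example.

```python
import os
import tempfile

# Assumed location for the example; pytest writes its own debug file elsewhere.
path = os.path.join(tempfile.mkdtemp(), "pytestdebug.log")

# Passing encoding="utf-8" makes the write independent of the current locale,
# so non-ASCII test ids cannot raise UnicodeEncodeError here.
with open(path, "w", encoding="utf-8") as f:
    f.write("t.py::test4_\u0447\u045b\u0448\u0452\n")

# Read the raw bytes back to confirm what ended up on disk.
with open(path, "rb") as f:
    raw = f.read()
```

A reader that also opens the file as UTF-8 recovers the original text exactly; a reader that assumes another encoding sees a few wrong characters, as described above.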

@asottile
Member Author

I think so too, but shouldn't we use the default locale then?

if we use the default locale then there's information loss
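
A small sketch of the trade-off being discussed, using plain `str.encode` so that no locale has to be changed. ASCII stands in for a restrictive locale encoding here, which is an assumption made purely for illustration.

```python
name = "t.py::test4_\u0447\u045b\u0448\u0452_\u010d\u0107\u0161\u0111"

# Encoding with a restrictive encoding and errors="replace" never fails,
# but the original characters cannot be recovered afterwards.
lossy = name.encode("ascii", errors="replace")
assert lossy == b"t.py::test4_????_????"
assert lossy.decode("ascii") != name

# Encoding as UTF-8 keeps every character...
lossless = name.encode("utf-8")
assert lossless.decode("utf-8") == name

# ...but a reader that assumes a different encoding gets mojibake instead.
misread = lossless.decode("latin-1")
assert misread != name
```

Both options have a cost: replacement loses information at write time, while UTF-8 preserves it only for readers that also decode as UTF-8.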

@nicoddemus
Member

if we use the default locale then there's information loss

Hmm, TBH I'm not sure I follow the underlying semantics here, but I will defer to your expertise that this is the correct approach. Other than that, the rest looks great. 👍

@bluetech
Member

Agree with the others that UTF-8 is good for --debug.

As for PYTEST_CURRENT_TEST I think UTF-8 is not really appropriate, because anyone reading the value would not do so correctly. You mentioned you don't want information loss but I think given how PYTEST_CURRENT_TEST is described in the docs, specifically

The contents of PYTEST_CURRENT_TEST is meant to be human readable and the actual format can be changed between releases (even bug fixes) so it shouldn’t be relied on for scripting or automation.

that information loss is preferable to bad encoding. So my suggestion would be to use encode(sys.getfilesystemencoding(), "replace"). For ascii it will be b't.py::test4_????_????'.

(getfilesystemencoding seems to be what python uses for envvars when supports_bytes_environ is true).

@asottile
Member Author

I'm going to merge this as-is unless there's stronger opposition

python3.8+ in the same situation chooses UTF-8 so I think this choice is consistent and appropriate for cpython -- this is really only a bug in older python versions

the filesystem encoding also isn't appropriate for the environ, it's not a file -- there are however two equally bad choices: locale or filesystem

@bluetech
Member

python3.8+ in the same situation chooses UTF-8

Do you have a reference for that? Looking at the os.py code it seems the same to me.

@asottile
Member Author

actually, it's 3.7+

$ LANG= python3.6 -c 'import os; os.environ["x"] = "☃"'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 30-32: surrogates not allowed
$ LANG= python3.7 -c 'import os; os.environ["x"] = "☃"'
$ LANG= python3.8 -c 'import os; os.environ["x"] = "☃"'
$

@bluetech
Member

That's because these versions changed the interpretation of "C" to UTF-8. But it doesn't work for other cases:

$ LANG=en_US.ISO-8859-1 python
Python 3.8.5 (default, Jul 27 2020, 08:42:51) 
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os; os.environ['x'] = 't.py::test4_\u0447\u045b\u0448\u0452_\u010d\u0107\u0161\u0111'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/os.py", line 680, in __setitem__
    value = self.encodevalue(value)
  File "/usr/lib/python3.8/os.py", line 751, in encode
    return value.encode(encoding, 'surrogateescape')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 12-15: ordinal not in range(256)
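
The failure in the traceback above can be reproduced without changing the locale by calling the same `str.encode` step directly. This is a sketch that assumes, as the traceback indicates, that `os.environ` encodes values with the `surrogateescape` error handler; the encoding is hard-coded to `latin-1` here for illustration.

```python
value = "t.py::test4_\u0447\u045b\u0448\u0452_\u010d\u0107\u0161\u0111"

error = None
try:
    # surrogateescape only handles lone surrogates; ordinary characters
    # outside latin-1 still raise UnicodeEncodeError.
    value.encode("latin-1", "surrogateescape")
except UnicodeEncodeError as exc:
    error = exc

# The error covers the first run of unencodable characters (indices 12 to 15).
print(error.start, error.end)
```

The reported range matches the "position 12-15" in the traceback: `end` is exclusive, so the first four Cyrillic characters are the ones flagged.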

Base automatically changed from master to main March 9, 2021 20:40
@asottile
Member Author

via #10935

@asottile closed this Jun 21, 2023
@asottile deleted the fix_non_encodable_test_names branch June 21, 2023 18:09
Labels
needs backport applied to PRs, indicates that it should be ported to the current bug-fix branch
4 participants