
Test failures with Python 3.12.0b1 #1005

Closed

mgorny opened this issue May 25, 2023 · 8 comments

@mgorny (Contributor) commented May 25, 2023

Overview Description

The test suite fails when run with Python 3.12.0b1:

FAILED tests/messages/test_extract.py::ExtractPythonTestCase::test_utf8_message_with_utf8_bom -   File "<string>", line 1
FAILED tests/messages/test_extract.py::ExtractPythonTestCase::test_utf8_message_with_utf8_bom_and_magic_comment -   File "<string>", line 1
FAILED tests/messages/test_extract.py::ExtractPythonTestCase::test_utf8_raw_strings_match_unicode_strings -   File "<string>", line 1
FAILED tests/messages/test_extract.py::ExtractTestCase::test_f_strings - AssertionError: assert 3 == 4
FAILED tests/messages/test_extract.py::ExtractTestCase::test_f_strings_non_utf8 - assert 0 == 1

Furthermore, tox -e py312 fails by default because of the missing distutils module (installing setuptools works around that, but the distutils usage should be removed altogether).

Steps to Reproduce

  1. tox -e py312

Actual Results

________________________________________ ExtractPythonTestCase.test_utf8_message_with_utf8_bom ________________________________________

self = <tests.messages.test_extract.ExtractPythonTestCase testMethod=test_utf8_message_with_utf8_bom>

        def test_utf8_message_with_utf8_bom(self):
            buf = BytesIO(codecs.BOM_UTF8 + """
    # NOTE: hello
    msg = _('Bonjour à tous')
    """.encode('utf-8'))
>           messages = list(extract.extract_python(buf, ('_',), ['NOTE:'], {}))

tests/messages/test_extract.py:367: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
babel/messages/extract.py:500: in extract_python
    for tok, value, (lineno, _), _, _ in tokens:
/usr/lib/python3.12/tokenize.py:451: in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

source = "\ufeff\n# NOTE: hello\nmsg = _('Bonjour à tous')\n", extra_tokens = True

    def _generate_tokens_from_c_tokenizer(source, extra_tokens=False):
        """Tokenize a source reading Python code as unicode strings using the internal C tokenizer"""
        import _tokenize as c_tokenizer
>       for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
E         File "<string>", line 1
E           
E           ^
E       SyntaxError: invalid non-printable character U+FEFF

/usr/lib/python3.12/tokenize.py:542: SyntaxError
_______________________________ ExtractPythonTestCase.test_utf8_message_with_utf8_bom_and_magic_comment _______________________________

self = <tests.messages.test_extract.ExtractPythonTestCase testMethod=test_utf8_message_with_utf8_bom_and_magic_comment>

        def test_utf8_message_with_utf8_bom_and_magic_comment(self):
            buf = BytesIO(codecs.BOM_UTF8 + """# -*- coding: utf-8 -*-
    # NOTE: hello
    msg = _('Bonjour à tous')
    """.encode('utf-8'))
>           messages = list(extract.extract_python(buf, ('_',), ['NOTE:'], {}))

tests/messages/test_extract.py:376: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
babel/messages/extract.py:500: in extract_python
    for tok, value, (lineno, _), _, _ in tokens:
/usr/lib/python3.12/tokenize.py:451: in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

source = "\ufeff# -*- coding: utf-8 -*-\n# NOTE: hello\nmsg = _('Bonjour à tous')\n", extra_tokens = True

    def _generate_tokens_from_c_tokenizer(source, extra_tokens=False):
        """Tokenize a source reading Python code as unicode strings using the internal C tokenizer"""
        import _tokenize as c_tokenizer
>       for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
E         File "<string>", line 1
E           # -*- coding: utf-8 -*-
E           ^
E       SyntaxError: invalid non-printable character U+FEFF

/usr/lib/python3.12/tokenize.py:542: SyntaxError
__________________________________ ExtractPythonTestCase.test_utf8_raw_strings_match_unicode_strings __________________________________

self = <tests.messages.test_extract.ExtractPythonTestCase testMethod=test_utf8_raw_strings_match_unicode_strings>

        def test_utf8_raw_strings_match_unicode_strings(self):
            buf = BytesIO(codecs.BOM_UTF8 + """
    msg = _('Bonjour à tous')
    msgu = _(u'Bonjour à tous')
    """.encode('utf-8'))
>           messages = list(extract.extract_python(buf, ('_',), ['NOTE:'], {}))

tests/messages/test_extract.py:393: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
babel/messages/extract.py:500: in extract_python
    for tok, value, (lineno, _), _, _ in tokens:
/usr/lib/python3.12/tokenize.py:451: in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

source = "\ufeff\nmsg = _('Bonjour à tous')\nmsgu = _(u'Bonjour à tous')\n", extra_tokens = True

    def _generate_tokens_from_c_tokenizer(source, extra_tokens=False):
        """Tokenize a source reading Python code as unicode strings using the internal C tokenizer"""
        import _tokenize as c_tokenizer
>       for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
E         File "<string>", line 1
E           
E           ^
E       SyntaxError: invalid non-printable character U+FEFF

/usr/lib/python3.12/tokenize.py:542: SyntaxError
___________________________________________________ ExtractTestCase.test_f_strings ____________________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings>

        def test_f_strings(self):
            buf = BytesIO(br"""
    t1 = _('foobar')
    t2 = _(f'spameggs' f'feast')  # should be extracted; constant parts only
    t2 = _(f'spameggs' 'kerroshampurilainen')  # should be extracted (mixing f with no f)
    t3 = _(f'''whoa! a '''  # should be extracted (continues on following lines)
    f'flying shark'
        '... hello'
    )
    t4 = _(f'spameggs {t1}')  # should not be extracted
    """)
            messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>           assert len(messages) == 4
E           AssertionError: assert 3 == 4
E            +  where 3 = len([(2, 'foobar', [], None), (4, 'kerroshampurilainen', [], None), (5, '... hello', [], None)])

tests/messages/test_extract.py:544: AssertionError
_______________________________________________ ExtractTestCase.test_f_strings_non_utf8 _______________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings_non_utf8>

        def test_f_strings_non_utf8(self):
            buf = BytesIO(b"""
    # -- coding: latin-1 --
    t2 = _(f'\xe5\xe4\xf6' f'\xc5\xc4\xd6')
    """)
            messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>           assert len(messages) == 1
E           assert 0 == 1
E            +  where 0 = len([])

tests/messages/test_extract.py:556: AssertionError

Expected Results

Passing tests (or at least passing as well as py3.11 did).

Reproducibility

Always.

Additional Information

Confirmed with git 8b152db.

@mgorny (Contributor, Author) commented May 28, 2023

I've dug into this a bit, since there were some regressions in Python 3.12's tokenizer, but this doesn't seem to be one of them. From what I can see, Babel decodes the BOM into U+FEFF and then passes it on to generate_tokens().

Note that in Python 3.11 this returned ERRORTOKEN:

>>> list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline))
[TokenInfo(type=60 (ERRORTOKEN), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

whereas in Python 3.12 it raises a SyntaxError:

>>> list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/tokenize.py", line 451, in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
  File "/usr/lib/python3.12/tokenize.py", line 542, in _generate_tokens_from_c_tokenizer
    for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
  File "<string>", line 1
    
    ^
SyntaxError: invalid non-printable character U+FEFF

CPython itself strips the BOM as part of encoding detection, before it starts decoding the source for tokenization. Babel probably needs to do the same.
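
A minimal sketch of that approach (not Babel's actual fix; tokenize_without_bom is a name made up for illustration): strip the UTF-8 BOM from the raw bytes before decoding, so that U+FEFF never reaches the tokenizer.

import codecs
import io
import tokenize

def tokenize_without_bom(raw, encoding='utf-8'):
    # Mirror CPython's encoding detection: drop a leading UTF-8 BOM
    # before the decoded text is handed to the tokenizer.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode(encoding)
    return list(tokenize.generate_tokens(io.StringIO(text).readline))

# The BOM-prefixed source from the failing tests now tokenizes cleanly:
for tok in tokenize_without_bom(codecs.BOM_UTF8 + b"msg = _('Bonjour \xc3\xa0 tous')\n"):
    print(tok)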

@vstinner

Are you still able to reproduce the issue with the just-released Python 3.12.0rc3? The issue was created on May 25, so I suppose Python 3.12.0 beta 1 was what was tested, but bugs have been fixed in the meantime.

I get different behavior with the #1005 (comment) example and Python 3.12.0rc2.

bug.py:

import io, tokenize
print(list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline)))

Output:

$ python3.11 bug.py 
[TokenInfo(type=60 (ERRORTOKEN), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

$ python3.12 bug.py 
[TokenInfo(type=1 (NAME), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

$ python3.11 -VV
Python 3.11.5 (main, Aug 28 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)]

$ python3.12 -VV
Python 3.12.0rc2 (main, Sep  6 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)]

I don't get a SyntaxError.

Using the REPL:

$ python3.12
Python 3.12.0rc2 (main, Sep  6 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tokenize, io
>>> list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline))
[TokenInfo(type=1 (NAME), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

@mgorny (Contributor, Author) commented Sep 20, 2023

The first three failures seem to be gone. These two seem to remain (plus the missing setuptools dependency):

___________________________________________________ ExtractTestCase.test_f_strings ____________________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings>

        def test_f_strings(self):
            buf = BytesIO(br"""
    t1 = _('foobar')
    t2 = _(f'spameggs' f'feast')  # should be extracted; constant parts only
    t2 = _(f'spameggs' 'kerroshampurilainen')  # should be extracted (mixing f with no f)
    t3 = _(f'''whoa! a '''  # should be extracted (continues on following lines)
    f'flying shark'
        '... hello'
    )
    t4 = _(f'spameggs {t1}')  # should not be extracted
    """)
            messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>           assert len(messages) == 4
E           AssertionError: assert 3 == 4
E            +  where 3 = len([(2, 'foobar', [], None), (4, 'kerroshampurilainen', [], None), (5, '... hello', [], None)])

tests/messages/test_extract.py:544: AssertionError
_______________________________________________ ExtractTestCase.test_f_strings_non_utf8 _______________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings_non_utf8>

        def test_f_strings_non_utf8(self):
            buf = BytesIO(b"""
    # -- coding: latin-1 --
    t2 = _(f'\xe5\xe4\xf6' f'\xc5\xc4\xd6')
    """)
            messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>           assert len(messages) == 1
E           assert 0 == 1
E            +  where 0 = len([])

tests/messages/test_extract.py:556: AssertionError

@vstinner

To make distutils available on Python 3.12, you can use this change:

diff --git a/tox.ini b/tox.ini
index 11cca0c..7c4d56a 100644
--- a/tox.ini
+++ b/tox.ini
@@ -11,6 +11,7 @@ deps =
     backports.zoneinfo;python_version<"3.9"
     tzdata;sys_platform == 'win32'
     pytz: pytz
+    setuptools;python_version>="3.12"
 allowlist_externals = make
 commands = make clean-cldr test
 setenv =

@encukou (Contributor) commented Sep 21, 2023

Here's a PR for the f-string parsing: #1027

@akx (Member) commented Oct 1, 2023

#1027 was just merged and we're now running CI on 3.12 too as of #1028. Thanks all! ❤️

akx closed this as completed Oct 1, 2023

@akx (Member) commented Oct 3, 2023

Released in https://pypi.org/project/Babel/2.13.0/ just now 🎉

@oprypin (Contributor) commented Oct 7, 2023

Regarding #1005 (comment), adding the "setuptools" dependency only for CI was not the correct solution: it is the package itself that depends on it, so the CI of other projects (and actual local usage) will still break. I opened issue #1031 and a pull request accordingly.
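
A hypothetical sketch of declaring that runtime dependency (the actual pull request may well solve it differently, e.g. by removing the distutils usage entirely): a PEP 508 environment marker in setup.py limits the dependency to Python 3.12+, where distutils is no longer in the stdlib.

from setuptools import setup

setup(
    name='Babel',
    install_requires=[
        # distutils was removed from the stdlib in Python 3.12;
        # setuptools provides a drop-in replacement for it.
        'setuptools; python_version >= "3.12"',
    ],
)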
