
tokenizer: skip lines that are just slash and whitespace #4343

Open
wants to merge 8 commits into main

Conversation

@tusharsadhwani (Contributor) commented Apr 30, 2024

Description

Resolves #4261

Checklist - did you ...

  • Add an entry in CHANGES.md if necessary?
  • Add / update tests if necessary?
  • Add new / update outdated documentation?

@tusharsadhwani changed the title from "tokenizer: skip lines that are just whitespace" to "tokenizer: skip lines that are just slash and whitespace" on Apr 30, 2024

github-actions bot commented Apr 30, 2024

diff-shades reports zero changes comparing this PR (2313537) to main (f22b243).



@tusharsadhwani (Contributor, Author) commented:

Ah, it affects multiline strings. Should be an easy fix...
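(A minimal sketch, assuming only the standard tokenize module, of why multiline strings are affected: a line consisting of just a backslash can be the body of a triple-quoted string, where it is content rather than a line continuation, hence the not contstr guard in the code below.)

import io
import tokenize

# The middle line of this source is just a backslash, but it lives
# *inside* a triple-quoted string, so the tokenizer must not skip it.
src = 'x = """a\n\\\nb"""\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# The STRING token comes out as '"""a\n\\\nb"""', backslash line included.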


# skip lines that are just a slash, to avoid storing that line's
# indent information.
if not contstr and line.rstrip("\n").strip(" \t") == "\\":
Collaborator commented:

Is this equivalent to just line.strip() == "\\"? Or do we need to care about exotic whitespace characters that are not newline/space/tab?
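(A minimal sketch, assuming CPython's str semantics, of where the two checks diverge: str.strip() with no arguments also removes "exotic" whitespace such as the vertical tab.)

line = "\\\x0b\n"  # backslash, vertical tab, newline
print(line.strip() == "\\")                    # True: strip() removes \x0b and \n
print(line.rstrip("\n").strip(" \t") == "\\")  # False: the \x0b survives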

@tusharsadhwani (Author) commented:

I'm not entirely sure, so I tried to be conservative. Does Python normally treat characters like \r or \v as part of indentation?

Collaborator commented:

Not too sure; there is code in our tokenizer that deals with \f at least. In CPython (Parser/lexer/lexer.c) I see some code dealing with \r and with \f (\014).

@tusharsadhwani (Author) commented:

$ printf 'def foo():\n \t pass' | python -m tokenize
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'foo'
1,7-1,8:            OP             '('
1,8-1,9:            OP             ')'
1,9-1,10:           OP             ':'
1,10-1,11:          NEWLINE        '\n'
2,0-2,3:            INDENT         ' \t '
2,3-2,7:            NAME           'pass'
2,7-2,8:            NEWLINE        ''
3,0-3,0:            DEDENT         ''
3,0-3,0:            ENDMARKER      ''

$ printf 'def foo():\n \f pass' | python -m tokenize
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'foo'
1,7-1,8:            OP             '('
1,8-1,9:            OP             ')'
1,9-1,10:           OP             ':'
1,10-1,11:          NEWLINE        '\n'
2,0-2,3:            INDENT         ' \x0c '
2,3-2,7:            NAME           'pass'
2,7-2,8:            NEWLINE        ''
3,0-3,0:            DEDENT         ''
3,0-3,0:            ENDMARKER      ''

$ printf 'def foo():\n \r pass' | python -m tokenize
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'foo'
1,7-1,8:            OP             '('
1,8-1,9:            OP             ')'
1,9-1,10:           OP             ':'
1,10-1,11:          NEWLINE        '\n'
2,1-2,3:            OP             '\r '
2,3-2,7:            NAME           'pass'
2,7-2,8:            NEWLINE        ''
3,0-3,0:            ENDMARKER      ''

So \f is in fact legitimate indentation, how interesting.

@tusharsadhwani (Author) commented:

$ python -c 'print("def foo():\n \v pass")' | python -m tokenize
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'foo'
1,7-1,8:            OP             '('
1,8-1,9:            OP             ')'
1,9-1,10:           OP             ':'
1,10-1,11:          NEWLINE        '\n'
2,0-2,1:            INDENT         ' '
<stdin>:2:2: error: invalid non-printable character U+000B

\v is unparseable.

So editing the PR to use .lstrip(' \t\f') should take care of all cases, I believe.
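(A minimal sketch of the check this implies; the helper name is hypothetical, not the exact PR diff.)

def is_backslash_only(line: str) -> bool:
    # Hypothetical helper: strip only *leading* whitespace that Python
    # accepts as indentation (space, tab, form feed). Anything between
    # the backslash and the newline would be a syntax error, so nothing
    # may be stripped from the right apart from the newline itself.
    return line.lstrip(" \t\f") in ("\\\n", "\\")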

@tusharsadhwani (Author) commented:

Made changes. The reason I can't do .strip() is that \ must be at the end of the line; if there are spaces after the backslash, it no longer escapes the newline:

$ printf 'print(2 + \\\n3)'
print(2 + \
3)

$ printf 'print(2 + \\\n3)' | python3
5

$ printf 'print(2 + \\ \n3)' | python3
  File "<stdin>", line 1
    print(2 + \
               ^
SyntaxError: unexpected character after line continuation character

@tusharsadhwani (Author) commented:

form_feeds.py contains a \f\ (line 42) that is now preserved, whereas before this change the whole line was deleted. Which I think is fine.

@@ -156,6 +156,7 @@ def something(self):

#


@tusharsadhwani (Author) commented:

this is just a \n

Successfully merging this pull request may close these issues.

Fails to parse backslash on line by itself