
tokenizer: skip lines that are just slash and whitespace #4343

Open
wants to merge 8 commits into main

Conversation

@tusharsadhwani (Contributor) commented Apr 30, 2024

Description

Resolves #4261

Checklist - did you ...

  • Add an entry in CHANGES.md if necessary?
  • Add / update tests if necessary?
  • Add new / update outdated documentation?

@tusharsadhwani changed the title from "tokenizer: skip lines that are just whitespace" to "tokenizer: skip lines that are just slash and whitespace" on Apr 30, 2024

github-actions bot commented Apr 30, 2024

diff-shades reports zero changes comparing this PR (2313537) to main (f22b243).



@tusharsadhwani (Contributor, Author) commented:

Ah, it affects multiline strings. Should be an easy fix...
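(A minimal sketch, assuming only the standard tokenize module, of why multiline strings are affected: a line consisting of just a backslash can be the body of a triple-quoted string, where it is content rather than a line continuation, hence the not contstr guard in the code below.)

import io
import tokenize

# The middle line of this source is just a backslash, but it lives
# *inside* a triple-quoted string, so the tokenizer must not skip it.
src = 'x = """a\n\\\nb"""\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# The STRING token comes out as '"""a\n\\\nb"""', backslash line included.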


# skip lines that are just a slash, to avoid storing that line's
# indent information.
if not contstr and line.rstrip("\n").strip(" \t") == "\\":
Collaborator commented:

Is this equivalent to just line.strip() == "\\"? Or do we need to care about exotic whitespace characters that are not newline/space/tab?
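(A minimal sketch, assuming CPython's str semantics, of where the two checks diverge: str.strip() with no arguments also removes "exotic" whitespace such as the vertical tab.)

line = "\\\x0b\n"  # backslash, vertical tab, newline
print(line.strip() == "\\")                    # True: strip() removes \x0b and \n
print(line.rstrip("\n").strip(" \t") == "\\")  # False: the \x0b survives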

@tusharsadhwani (Author) commented:

I'm not entirely sure, so I tried to be conservative. Does Python normally treat characters like \r or \v as part of indentation?

Collaborator commented:

Not too sure; there is code in our tokenizer that deals with \f at least. In CPython (Parser/lexer/lexer.c) I see some code dealing with \r and with \f (\014).

@tusharsadhwani (Author) commented:

$ printf 'def foo():\n \t pass' | python -m tokenize
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'foo'
1,7-1,8:            OP             '('
1,8-1,9:            OP             ')'
1,9-1,10:           OP             ':'
1,10-1,11:          NEWLINE        '\n'
2,0-2,3:            INDENT         ' \t '
2,3-2,7:            NAME           'pass'
2,7-2,8:            NEWLINE        ''
3,0-3,0:            DEDENT         ''
3,0-3,0:            ENDMARKER      ''

$ printf 'def foo():\n \f pass' | python -m tokenize
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'foo'
1,7-1,8:            OP             '('
1,8-1,9:            OP             ')'
1,9-1,10:           OP             ':'
1,10-1,11:          NEWLINE        '\n'
2,0-2,3:            INDENT         ' \x0c '
2,3-2,7:            NAME           'pass'
2,7-2,8:            NEWLINE        ''
3,0-3,0:            DEDENT         ''
3,0-3,0:            ENDMARKER      ''

$ printf 'def foo():\n \r pass' | python -m tokenize
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'foo'
1,7-1,8:            OP             '('
1,8-1,9:            OP             ')'
1,9-1,10:           OP             ':'
1,10-1,11:          NEWLINE        '\n'
2,1-2,3:            OP             '\r '
2,3-2,7:            NAME           'pass'
2,7-2,8:            NEWLINE        ''
3,0-3,0:            ENDMARKER      ''

So \f is in fact legitimate indentation, how interesting.

@tusharsadhwani (Author) commented:

$ python -c 'print("def foo():\n \v pass")' | python -m tokenize
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'foo'
1,7-1,8:            OP             '('
1,8-1,9:            OP             ')'
1,9-1,10:           OP             ':'
1,10-1,11:          NEWLINE        '\n'
2,0-2,1:            INDENT         ' '
<stdin>:2:2: error: invalid non-printable character U+000B

\v is unparseable.

So editing the PR to use .lstrip(' \t\f') should take care of all cases, I believe.
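(A minimal sketch of the check this implies; the helper name is hypothetical, not the exact PR diff.)

def is_backslash_only(line: str) -> bool:
    # Hypothetical helper: strip only *leading* whitespace that Python
    # accepts as indentation (space, tab, form feed). Anything between
    # the backslash and the newline would be a syntax error, so nothing
    # may be stripped from the right apart from the newline itself.
    return line.lstrip(" \t\f") in ("\\\n", "\\")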

@tusharsadhwani (Author) commented:

Made changes. The reason I can't do .strip() is that \ must be at the end of the line; if there are spaces after the backslash, it no longer escapes the newline:

$ printf 'print(2 + \\\n3)'
print(2 + \
3)

$ printf 'print(2 + \\\n3)' | python3
5

$ printf 'print(2 + \\ \n3)' | python3
  File "<stdin>", line 1
    print(2 + \
               ^
SyntaxError: unexpected character after line continuation character

@tusharsadhwani (Author) commented:

form_feeds.py contains a \f\ (line 42) that is now preserved, whereas before this change the whole line was deleted. Which I think is fine.

@@ -156,6 +156,7 @@ def something(self):

#


@tusharsadhwani (Author) commented:

this is just a \n

Successfully merging this pull request may close these issues.

Fails to parse backslash on line by itself