
[syntax-errors] PEP 701 f-strings before Python 3.12 #16543

Merged (35 commits) Mar 18, 2025
Conversation

@ntBre (Contributor) commented Mar 6, 2025

Summary

This PR detects the use of PEP 701 f-strings before 3.12. This one sounded difficult and ended up being pretty easy, so I think there's a good chance I've over-simplified things. However, from experimenting in the Python REPL and checking with pyright, I think this is correct. pyright actually doesn't even flag the comment case, but Python does.

I also checked pyright's implementation for quotes and escapes and think I've approximated how they do it.

Python's error messages also point to the simple approach, since these characters are simply not allowed:

Python 3.11.11 (main, Feb 12 2025, 14:51:05) [Clang 19.1.6 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f'''multiline {
... expression # comment
... }'''
  File "<stdin>", line 3
    }'''
        ^
SyntaxError: f-string expression part cannot include '#'
>>> f'''{not a line \
... continuation}'''
  File "<stdin>", line 2
    continuation}'''
                    ^
SyntaxError: f-string expression part cannot include a backslash
>>> f'hello {'world'}'
  File "<stdin>", line 1
    f'hello {'world'}'
              ^^^^^
SyntaxError: f-string: expecting '}'

And since escapes aren't allowed, I don't think there are any tricky cases where nested quotes or comments can sneak in.
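The "these characters simply aren't allowed" approach described above can be sketched as a small Python helper. The function and parameter names here are hypothetical illustrations, not ruff's actual API, and the `#` check is deliberately naive (as later comments in this thread note, `#` inside a nested string literal is actually valid):

```python
# A deliberately naive sketch, assuming we already have the text of one
# f-string expression part and the outer f-string's quote sequence.
def violates_pre_312(expr_text: str, outer_quote: str) -> bool:
    return (
        "\\" in expr_text            # any backslash: escape or line continuation
        or "#" in expr_text          # comment marker (naive: also hits '#' in nested strings)
        or outer_quote in expr_text  # reusing the outer quote ends the string early
    )

# Mirrors the REPL errors above:
assert violates_pre_312("not a line \\", "'''")         # backslash
assert violates_pre_312("expression # comment", "'''")  # comment
assert violates_pre_312("'world'", "'")                 # nested same quote
assert not violates_pre_312("1 + 1", '"')               # plain expression is fine
```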

It's also slightly annoying that the error is repeated for every nested quote character, but that also mirrors pyright, although they highlight the whole nested string, which is a little nicer. However, their check is in the analysis phase, so I don't think we have such easy access to the quoted range, at least without adding another mini visitor.

Test Plan

New inline tests

ntBre added 7 commits March 6, 2025 14:17

@ntBre added the parser (Related to the parser) and preview (Related to preview mode features) labels Mar 6, 2025
github-actions bot commented Mar 6, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@ntBre (Contributor, Author) commented Mar 7, 2025

I thought of some additional test cases tonight:

Python 3.11.11 (main, Feb 12 2025, 14:51:05) [Clang 19.1.6 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f"{"""x"""}"
  File "<stdin>", line 1
    f"{"""x"""}"
          ^
SyntaxError: f-string: expecting '}'
>>> f'{'''x'''}'
  File "<stdin>", line 1
    f'{'''x'''}'
          ^
SyntaxError: f-string: expecting '}'
>>> f"""{"x"}"""
'x'
>>> f'''{'x'}'''
'x'

I'm pretty sure the code here handles these but it might be nice to add them as tests. I was especially concerned about the first two but checking for the outer quote_str should capture the right behavior.
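The "outer quote_str" idea can be illustrated with a short sketch: comparing against the full outer quote sequence (which may be a triple quote) rather than a single quote character captures the behavior in the REPL session above. The helper name here is hypothetical, not ruff's actual code:

```python
# Sketch: an expression part is an error before 3.12 only if it contains the
# *outer* quote sequence, including triple quotes.
def reuses_outer_quote(expr_text: str, quote_str: str) -> bool:
    return quote_str in expr_text

# f"{"""x"""}" : outer quote is ", expression text contains " -> error pre-3.12
assert reuses_outer_quote('"""x"""', '"')
# f"""{"x"}""" : outer quote is """, expression text is "x" -> fine
assert not reuses_outer_quote('"x"', '"""')
```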

@MichaReiser (Member) left a comment

Maybe take a look at https://github.com/astral-sh/ruff/blob/1ecb7ce6455a4a9a134fe8536625e89f74e3ec5b/crates/ruff_python_formatter/resources/test/fixtures/ruff/expression/fstring.py

and

https://github.com/astral-sh/ruff/blob/82faa9bb62e66a562f8a7ad81a645162ca558a08/crates/ruff_python_formatter/resources/test/fixtures/ruff/expression/fstring_preview.py

they contain a good enumeration of the tricky cases.

It does make me slightly nervous that the current approach does a lot of operations on the source text directly instead of analyzing the tokens, but accessing the tokens might require making this an analyzer (linter) check.

@dhruvmanila (Member) commented:

> It does make me slightly nervous that the current approach does a lot of operations on the source text directly instead of analyzing the tokens, but accessing the tokens might require making this an analyzer (linter) check.

Yeah, and f-strings are tricky because there's a lot more involved here.

Another approach would be to use the tokens either in the parser or in the analyzer (which you've mentioned); I'd lean towards the parser, mainly because it already has the surrounding context, i.e., are we in a nested f-string, or are we in an f-string expression?

Maybe we could do this in the lexer itself and use FStringErrorType to emit the errors, which the parser would then convert into UnsupportedSyntaxError, but I haven't explored this option. In the lexer, it would be easier to just check for Comment and Newline tokens when in f-string expression mode and emit the errors. My main worry with the lexer would be any performance implications.

@ntBre (Contributor, Author) commented Mar 7, 2025

Oof, thanks for the reviews. I had a feeling I over-simplified things, but these false positives look quite obvious in hindsight. I'll mark this as a draft for now and take a deeper look at this today.

@ntBre ntBre marked this pull request as draft March 7, 2025 15:01
@ntBre (Contributor, Author) commented Mar 7, 2025

I still need to look for more tricky cases in the formatter fixtures, but I checked on the suggested escape and quote test cases, and I believe those are true positives (I also added them as tests). So the main issues here are around comments, which might be quite tricky (maybe this is why pyright doesn't flag them?) and around inspecting the source text directly.

@MichaReiser (Member) commented:

I think it would be helpful to summarize the invalid patterns that we need to detect. It will help us decide:

  • How best to detect those (tokens, AST pass, parser, lexer, all of it?)
  • Which patterns are easy/hard to detect

Based on this, we can decide on the approach as well as the prioritisation of what the check should detect, and we can even split it up into multiple PRs.

@ntBre (Contributor, Author) commented Mar 10, 2025

> I think it would be helpful to summarize the invalid patterns that we need to detect. It will help us decide:
>
>   • How best to detect those (tokens, AST pass, parser, lexer, all of it?)
>   • Which patterns are easy/hard to detect
>
> Based on this, we can decide on the approach as well as the prioritisation of what the check should detect, and we can even split it up into multiple PRs.

That's a good idea, thanks. The three main cases I took away from the PEP were:

  1. Nested quotes
  2. Escape sequences
  3. Comments

Escape sequences seem to be the easiest because, as far as I can tell, CPython throws an error for any \ in an f-string expression part, whether it's part of an escape sequence (\n) or looks like a line-continuation character.

I think quotes are also easy because any nested quote_str (in our parlance) ends the string. That still feels oversimplified but I haven't seen any cases to the contrary. The PEP also includes this example:

> In fact, this is the most nested f-string that can be written:
>
> >>> f"""{f'''{f'{f"{1+1}"}'}'''}"""
> '2'

Comments are the hardest because, as Dhruv pointed out, you can't just check for #, since # is a valid character inside strings within the f-string.

Those are the three cases I attempted to fix here.

I see now in PEP 498 that "Expressions cannot contain ':' or '!' outside of strings or parentheses, brackets, or braces. The exception is that the '!=' operator is allowed as a special case." So that might be a fourth case we'd want to consider. At least initially it sounds roughly as complex as detecting comments.
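The `:`/`!` restriction from PEP 498 can be seen directly in valid code that runs on every supported Python version. This is a small illustration of the rule, not anything from the PR itself:

```python
# A top-level ':' would otherwise start the format spec, so a lambda must be
# parenthesized inside an f-string expression (on every Python version):
assert f"{(lambda x: x + 1)(41)}" == "42"

# ':' nested inside brackets or parentheses needs no extra wrapping:
d = {"k": 1}
assert f"{d['k']}" == "1"

# '!=' is the special-cased exception allowed at the top level:
assert f"{1 != 2}" == "True"
```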

@MichaReiser (Member) commented:

We discussed a possible approach in our 1:1. @ntBre, let me know if that doesn't work and I can take another look.

@ntBre (Contributor, Author) commented Mar 14, 2025

Thanks for the in_range suggestion! I factored out part of Tokens::in_range to reuse in the new TokenSource::in_range method, which made things much simpler.

I tried applying a similar strategy to quotes, but FStringStart, FStringEnd, and FStringMiddle all carry their own string flags, so it's not easy to differentiate between the inner and outer f-strings. Maybe I could bring back the stack from the previous implementation to track that, though.

I still think comparing the quote_str gets the correct answer because it includes triple quotes, but I'm still open to reworking that if you prefer. I could at least use memmem and memchr for the searches.

Similarly, I don't think \ is a token, so we pretty much have to do a text search for that, as far as I can tell.

I also looked into the : and ! mention from PEP 498 again, but I can't come up with anything that is valid syntax after 3.12 either. So I think it's okay not to check for those specially.

@ntBre ntBre marked this pull request as ready for review March 14, 2025 22:37
@ntBre ntBre marked this pull request as draft March 15, 2025 04:54
@MichaReiser (Member) commented:

> I still think comparing the quote_str gets the correct answer because it includes triple quotes, but I'm still open to reworking that if you prefer. I could at least use memmem and memchr for the searches.

Yeah, that could work. An alternative is to inspect the parsed AST. What's important is that we only run the search over expression parts (e.g. f"test\"abcd" is valid).
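The point about restricting the search to expression parts can be demonstrated with a quick sketch (illustrative only, not the PR's implementation):

```python
# Backslashes in the *literal* part of an f-string are fine on every version;
# only expression parts reject them before 3.12.
ok = f"test\"abcd"           # escape in the literal part: always valid
assert ok == 'test"abcd'

# A naive search over the whole f-string source text would false-positive here:
src = r'f"test\"abcd"'
assert "\\" in src           # whole-text search flags it...
# ...which is why the backslash search must be limited to expression-part ranges.
```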

> I also looked into the : and ! mention from PEP 498 again, but I can't come up with anything that is valid syntax after 3.12 either. So I think it's okay not to check for those specially.

Do you have a reference from PEP 701 suggesting that anything changed related to : and ! handling?

@ntBre (Contributor, Author) commented Mar 17, 2025

> Do you have a reference from PEP 701 suggesting that anything changed related to : and ! handling?

No, it sounds like the same restrictions are in place in PEP 701:

  • We have decided not to lift the restriction that some expression portions need to wrap : and ! in parentheses at the top level

They were just mentioned along with comments and backslashes as receiving special treatment in PEP 498, so I was worried that they could have changed, but this sounds pretty conclusive after looking again, thanks!

I left this in draft because I wanted to run the new code on the formatter tests you linked above. I'll do that now and then open it for review again.

I also just added your f"test\"abcd" case as a test and will try out memchr.

@ntBre (Contributor, Author) commented Mar 17, 2025

I manually tested these out on the formatter test fixtures and all of the errors looked like true positives. Would it be worth adding those as permanent parser tests? It seemed weird to refer to test files in a different crate, but I could duplicate them. Hopefully I've already captured the key subset in the inline tests, though.

@ntBre ntBre marked this pull request as ready for review March 17, 2025 15:03
@dhruvmanila (Member) left a comment

Looks good! I just have a couple of minor comments; feel free to make any relevant changes if required, otherwise merge it as is.

Comment on lines +169 to +182

/// Returns a slice of [`Token`] that are within the given `range`.
pub(crate) fn in_range(&self, range: TextRange) -> &[Token] {
    let start = self
        .tokens
        .iter()
        .rposition(|tok| tok.start() == range.start());
    let end = self.tokens.iter().rposition(|tok| tok.end() == range.end());

    let (Some(start), Some(end)) = (start, end) else {
        return &self.tokens;
    };

    &self.tokens[start..=end]
}
@dhruvmanila (Member) commented:

Do you plan on using this method elsewhere?

If not, we could inline the logic in check_fstring_comments and simplify it to avoid the iteration for the end variable, since I think the parser is already at that position. So, something like what Micha suggested in #16543 (comment), i.e., just iterate over the tokens in reverse order until we reach the f-string start and report an error for all the Comment tokens found.
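The reverse-scan suggestion can be sketched in Python over a simplified token list. The token kinds here are hypothetical stand-ins for ruff's Rust token types, not the actual implementation:

```python
# Walk the token stream backwards from the current position until the
# FStringStart token, collecting every Comment token found along the way.
def fstring_comments(tokens: list[tuple[str, str]]) -> list[str]:
    found = []
    for kind, text in reversed(tokens):
        if kind == "FStringStart":
            break
        if kind == "Comment":
            found.append(text)
    return found

toks = [
    ("FStringStart", "f'''"),
    ("LBrace", "{"),
    ("Name", "expression"),
    ("Comment", "# comment"),
    ("RBrace", "}"),
    ("FStringEnd", "'''"),
]
assert fstring_comments(toks) == ["# comment"]
```

As the follow-up comments note, the catch is that by the time this processing runs, the parser may already have bumped past the FStringEnd and any trailing trivia, so the scan's end point matters too.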

@ntBre (Contributor, Author) replied:

I think we need/want a method of some kind because TokenSource::tokens is a private field. I could just add a tokens getter though, of course.

I also tried this without end, but cases like

f'Magic wand: { bag['wand'] }'     # nested quotes

caught new errors on the trailing comment. At the point we do this processing, we've bumped past the FStringEnd and any trivia tokens after it, so I think we do need to find the end point as well.

Hmm, maybe a tokens getter would be nicest. Then I could do all of the processing on a single iterator in check_fstring_comments at least.

@dhruvmanila (Member) commented Mar 18, 2025:

Can we not use the f-string range directly? Or, is there something else I'm missing? I don't think the comment is part of the f-string range.

@dhruvmanila (Member) commented:

So, the node_range calculation avoids any trailing trivia tokens like the one that you've mentioned in the example above. This is done by keeping track of the end of the previous token which excludes some tokens like comment. Here, when you call node_range, then it will give you the range which doesn't include the trailing comment. If it wouldn't then the f-string range would be incorrect here.

@dhruvmanila (Member) commented Mar 18, 2025:

Oh, shoot, I think the tokens field should still include the trailing comment. Happy to go with what you think is best here.

@ntBre (Contributor, Author) replied:

Yeah I think that's a good summary. We have the exact f-string range but need to match that up with the actual Tokens in the tokens field, which includes trailing comments.

I tried the tokens getter and moving the logic into check_fstring_comments, but I do aesthetically prefer how it looked with self.tokens.in_range... even if the in_range method itself looks a little weird. So I might just leave it alone for now. Thanks for double checking!

@ntBre ntBre merged commit dcf31c9 into main Mar 18, 2025
22 checks passed
@ntBre ntBre deleted the brent/syn-f-strings branch March 18, 2025 15:12
dcreager added a commit that referenced this pull request Mar 18, 2025
* main:
  [playground] Avoid concurrent deployments (#16834)
  [red-knot] Infer `lambda` return type as `Unknown` (#16695)
  [red-knot] Move `name` field on parameter kind (#16830)
  [red-knot] Emit errors for more AST nodes that are invalid (or only valid in specific contexts) in type expressions (#16822)
  [playground] Use cursor for clickable elements (#16833)
  [red-knot] Deploy playground on main (#16832)
  Red Knot Playground (#12681)
  [syntax-errors] PEP 701 f-strings before Python 3.12 (#16543)