
Add support for PEP 701 #7376

Merged
merged 45 commits on Sep 29, 2023

Commits on Sep 29, 2023

  1. Add support for the new f-string tokens per PEP 701 (#6659)

    This PR adds support in the lexer for the newly added f-string tokens as
    per PEP 701. The following new tokens are added:
    * `FStringStart`: Token value for the start of an f-string. This
    includes the `f`/`F`/`fr` prefix and the opening quote(s).
    * `FStringMiddle`: Token value that includes the portion of text inside
    the f-string that's not part of the expression part and isn't an opening
    or closing brace.
    * `FStringEnd`: Token value for the end of an f-string. This includes
    the closing quote.
    
    Additionally, a new `Exclamation` token is added for conversion
    (`f"{foo!s}"`) as that's part of an expression.
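    To illustrate how these tokens decompose an f-string, here's a minimal, hypothetical Python sketch (not the actual Rust lexer); it assumes a single-level f-string with no nesting, no escapes, and a one-character quote:

    ```python
    # Hypothetical sketch, not ruff's lexer: split a simple, single-level
    # f-string into the PEP 701 token kinds. Assumes no nesting, no
    # escapes, and a single-character quote.
    def lex_simple_fstring(source):
        quote = '"' if '"' in source else "'"
        start_end = source.index(quote) + 1
        tokens = [("FStringStart", source[:start_end])]  # prefix + open quote
        i, middle = start_end, ""
        while i < len(source) - 1:  # the last character is the closing quote
            if source[i] == "{":
                if middle:
                    tokens.append(("FStringMiddle", middle))
                    middle = ""
                close = source.index("}", i)
                tokens.append(("Expression", source[i + 1 : close]))
                i = close + 1
            else:
                middle += source[i]
                i += 1
        if middle:
            tokens.append(("FStringMiddle", middle))
        tokens.append(("FStringEnd", source[-1]))
        return tokens
    ```

    For `f"foo {bar} baz"` this yields `FStringStart` (`f"`), `FStringMiddle` (`foo `), the expression `bar`, `FStringMiddle` (` baz`), and `FStringEnd` (`"`); the real lexer emits proper expression tokens (and `Exclamation`) inside the braces.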
    
    New test cases are added for various possibilities using snapshot
    testing. The output has been verified using python/cpython@f2cc00527e.
    
    _I've put the number of f-strings for each of the following files after
    the file name_
    
    ```
    lexer/large/dataset.py (1)       1.05   612.6±91.60µs    66.4 MB/sec    1.00   584.7±33.72µs    69.6 MB/sec
    lexer/numpy/ctypeslib.py (0)     1.01    131.8±3.31µs   126.3 MB/sec    1.00    130.9±5.37µs   127.2 MB/sec
    lexer/numpy/globals.py (1)       1.02     13.2±0.43µs   222.7 MB/sec    1.00     13.0±0.41µs   226.8 MB/sec
    lexer/pydantic/types.py (8)      1.13   285.0±11.72µs    89.5 MB/sec    1.00   252.9±10.13µs   100.8 MB/sec
    lexer/unicode/pypinyin.py (0)    1.03     32.9±1.92µs   127.5 MB/sec    1.00     31.8±1.25µs   132.0 MB/sec
    ```
    
    Overall, the lexer seems to have regressed. I profiled every file
    mentioned above and found one improvement, which is done in
    (098ee5d), but otherwise I don't see anything else. A few notes from
    isolating the f-string part in the profile:
    * As we're adding new tokens and functionality to emit them, I expect
    the lexer to take more time because of more code.
    * `lex_fstring_middle_or_end` takes the most time, followed by the
    `current_mut` line when lexing the `:` token. The latter checks whether
    we're at the start of a format spec.
    * In an f-string-heavy file such as
    https://github.com/python/cpython/blob/main/Lib/test/test_fstring.py
    [^1] (293 f-strings), most of the time in `lex_fstring_middle_or_end` is
    accounted for by the string allocation for the literal part of the
    `FStringMiddle` token (https://share.firefox.dev/3ErEa1W)
    
    I don't see anything out of the ordinary in the `pydantic/types` profile
    (https://share.firefox.dev/45XcLRq)
    
    fixes: #7042
    
    [^1]: We could add this in lexer and parser benchmark
    dhruvmanila committed Sep 29, 2023
    Commit: 0ce342b
  2. Add support for parsing f-string as per PEP 701 (#7041)

    This PR adds support for PEP 701 in the parser to use the new tokens
    emitted by the lexer to construct the f-string node.
    
    Without an official grammar, f-strings were parsed manually. Now that
    we have the specification, it is used in the LALRPOP grammar to parse
    f-strings.
    
    This file includes the logic for parsing string literals and joining the
    implicit string concatenation. Now that we no longer need to parse
    f-strings manually, a lot of the related code is removed.
    
    Earlier, there were 2 entry points to this module:
    * `parse_string`: Used to parse a single string literal
    * `parse_strings`: Used to parse strings which were implicitly
    concatenated
    
    Now, there are 3 entry points:
    * `parse_string_literal`: Renamed from `parse_string`
    * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
    basically a string literal without the quotes
    * `concatenate_strings`: Renamed from `parse_strings` but now it takes
    the parsed nodes instead. So, we just need to concatenate them into a
    single node.
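    As a reference point for what `concatenate_strings` produces, CPython's own AST merges implicitly concatenated parts into a single node (this shows CPython behavior, not ruff's code):

    ```python
    import ast

    # Implicit concatenation of plain strings yields a single Constant node.
    plain = ast.parse('"foo" "bar" "baz"').body[0].value
    assert isinstance(plain, ast.Constant) and plain.value == "foobarbaz"

    # With an f-string part, the result is a single JoinedStr node whose
    # values interleave Constant and FormattedValue parts.
    joined = ast.parse('"foo" f"{x}" "bar"').body[0].value
    assert isinstance(joined, ast.JoinedStr)
    ```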
    
    > A short primer on `FStringMiddle` token: This includes the portion of
    text inside the f-string that's not part of the expression and isn't an
    opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
    `foo `, `.3f` and ` bar` are `FStringMiddle` token content.
    
    Discussion in the official implementation:
    python/cpython#102855 (comment)
    
    There is a change in the AST when unicode strings (prefixed with `u`)
    and f-strings are used in an implicitly concatenated string value. For
    example,
    
    ```python
    u"foo" f"{bar}" "baz" " some"
    ```
    
    Pre Python 3.12, the kind field would be assigned only if the prefix was
    on the first string. So, taking the above example, both `"foo"` and
    `"baz some"` (implicit concatenation) would be given the `u` kind:
    
    <details><summary>Pre 3.12 AST:</summary>
    <p>
    
    ```python
    Constant(value='foo', kind='u'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='baz some', kind='u')
    ```
    
    </p>
    </details>
    
    But, post Python 3.12, only the string with the `u` prefix will be
    assigned the `u` kind:
    
    <details><summary>3.12 AST:</summary>
    <p>
    
    ```python
    Constant(value='foo', kind='u'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='baz some')
    ```
    
    </p>
    </details>
    
    Here are some more iterations around the change:
    
    1. `"foo" f"{bar}" u"baz" "no"`
    
    <details><summary>Pre 3.12</summary>
    <p>
    
    ```python
    Constant(value='foo'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='bazno')
    ```
    
    </p>
    </details>
    
    <details><summary>3.12</summary>
    <p>
    
    ```python
    Constant(value='foo'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='bazno', kind='u')
    ```
    
    </p>
    </details>
    
    2. `"foo" f"{bar}" "baz" u"no"`
    
    <details><summary>Pre 3.12</summary>
    <p>
    
    ```python
    Constant(value='foo'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='bazno')
    ```
    
    </p>
    </details>
    
    <details><summary>3.12</summary>
    <p>
    
    ```python
    Constant(value='foo'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='bazno')
    ```
    
    </p>
    </details>
    
    3. `u"foo" f"bar {baz} realy" u"bar" "no"`
    
    <details><summary>Pre 3.12</summary>
    <p>
    
    ```python
    Constant(value='foobar ', kind='u'),
    FormattedValue(
      value=Name(id='baz', ctx=Load()),
      conversion=-1),
    Constant(value=' realybarno', kind='u')
    ```
    
    </p>
    </details>
    
    <details><summary>3.12</summary>
    <p>
    
    ```python
    Constant(value='foobar ', kind='u'),
    FormattedValue(
      value=Name(id='baz', ctx=Load()),
      conversion=-1),
    Constant(value=' realybarno')
    ```
    
    </p>
    </details>
    
    With the hand-written parser, we were able to provide better error
    messages for cases such as the following, but those are now removed and
    LALRPOP raises an "unexpected token" error instead:
    * A closing delimiter was not opened properly
    * An opening delimiter was not closed properly
    * Empty expression not allowed
    
    The "Too many nested expressions in an f-string" error was removed;
    instead, we can create a lint rule for that.
    
    And, "The f-string expression cannot include the given character" was
    removed because f-strings now support those characters, mainly the same
    quotes as the outer ones, escape sequences, comments, etc.
    
    1. Refactor existing test cases to use `parse_suite` instead of
    `parse_fstrings` (which doesn't exist anymore)
    2. Additional test cases are added as required
    
    Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
    means that the snapshot would produce the module node instead of just a
    list of f-string parts. I've manually verified that the parts are still
    the same along with the node ranges.
    
    #7263 (comment)
    
    fixes: #7043
    fixes: #6835
    dhruvmanila committed Sep 29, 2023
    Commit: 2e9ea6f
  3. Use narrow type for string parsing patterns (#7211)

    This PR adds a new enum type `StringType` which is either a string
    literal, byte literal or f-string. The motivation behind this is to have
    a narrow type which is accepted in `concatenate_strings` as that
    function is only applicable for the mentioned 3 types. This makes the
    code more readable and easy to reason about.
    
    A future improvement (which was prototyped here and removed) is to split
    the current string literal pattern in the LALRPOP definition into two
    parts:
    1. A single string literal or an f-string: This means no checking for
    bytes / non-bytes and other unnecessary computation
    2. Two or more of string/byte/f-string: This will call the
    `concatenate_strings` function.
    
    The second change was removed because of how ranges work. The range for
    an individual string/byte is the entire range, including the quotes, but
    if the same string/byte is part of an f-string, then it only includes
    the range for the content (without the quotes / the inner range). The
    current string parser returns the former range.
    
    To give an example, for `"foo"`, the range of the string would be
    `0..5`, but for `f"foo"` the range of the string would be `2..5` while
    the range for the f-string expression would be `0..6`. The ranges are
    correct, but they differ depending on the context in which the string
    constant is used: is it part of an f-string or a standalone string?
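    A tiny sketch of the arithmetic (illustrative only; ruff works with ranges produced by the lexer rather than computing them like this):

    ```python
    # Illustrative only: compute the outer range of an f-string and the
    # inner range of its literal content from the prefix and quote lengths.
    def fstring_ranges(prefix, quote, content):
        start = len(prefix) + len(quote)  # offset just after the open quote
        outer = (0, start + len(content) + len(quote))
        inner = (start, start + len(content))
        return outer, inner

    # f"foo": outer range 0..6, inner (content) range 2..5
    assert fstring_ranges("f", '"', "foo") == ((0, 6), (2, 5))
    # plain "foo": the string's own range includes the quotes, 0..5
    assert fstring_ranges("", '"', "foo")[0] == (0, 5)
    ```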
    
    `cargo test`
    dhruvmanila committed Sep 29, 2023
    Commit: d9876fc
  4. Disallow non-parenthesized lambda expr in f-string (#7263)

    This PR updates the handling of disallowing non-parenthesized lambda
    expr in f-strings.
    
    Previously, the lexer was used to emit an empty `FStringMiddle` token in
    certain cases for which there's no pattern in the parser to match. That would
    then raise an unexpected token error while parsing.
    
    This PR adds a new f-string error type `LambdaWithoutParentheses`. In
    cases where the parser still can't detect the error, it's guaranteed to be
    caught by the fact that there's no `FStringMiddle` token in the pattern.
    
    Add test cases wherever we throw the `LambdaWithoutParentheses` error.
    
    As this is the final PR for the parser, I'm putting the parser benchmarks here:
    
    ```
    group                         fstring-parser                         main
    -----                         --------------                         ----
    parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
    parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
    parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
    parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
    parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
    ```
    dhruvmanila committed Sep 29, 2023
    Commit: e249840
  5. Fix curly brace escape handling in f-strings (#7331)

    ## Summary
    
    This PR fixes the escape handling of curly braces inside an f-string.
    There are 2 main changes:
    
    ### Lexer
    
    The lexer change actually fixes a bug. Instead of breaking as soon as we
    find a curly brace after the `\` character, we continue and let the next
    iteration handle it in the curly brace branch. This fixes the following case:
    
    ```python
    f"\{{foo}}"
    #  ^ use the curly brace branch to handle this character instead of breaking
    ```
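    For reference, CPython accepts this case too; a quick check of its behavior (the `\{` sequence draws a SyntaxWarning in recent versions but still parses):

    ```python
    # CPython reference behavior: the backslash stays a literal backslash
    # and the doubled braces are escaped literal braces.
    value = f"\{{foo}}"
    assert value == "\\{foo}"
    assert len(value) == 6  # the six characters: \ { f o o }
    ```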
    
    ### Parser
    
    We can encounter a `\` as the last character in a `FStringMiddle` token
    which is valid in this context[^1]. For example,
    
    ```python
    f"\{foo} \{bar:\}"
    # ^     ^^     ^
    # The marked characters are part of 3 different `FStringMiddle` tokens
    ```
    
    Here, the `FStringMiddle` token contents will be `"\"` and `" \"`, which
    would be invalid in a regular string literal. However, they're valid here
    because they're substrings of an f-string. Even though curly braces cannot
    be escaped, this is valid syntax.
    
    [^1]: Refer to point 3 in https://peps.python.org/pep-0701/#rejected-ideas
    
    ## Test Plan
    
    Verified that existing test cases pass and added new test cases for
    the lexer and parser.
    dhruvmanila committed Sep 29, 2023
    Commit: fc28174
  6. Allow NUL character in f-strings (#7378)

    ## Summary
    
    This PR fixes a bug to allow the `NUL` (`\0`) character inside f-strings.
    
    ## Test Plan
    
    Add test case with `NUL` character inside f-string.
    dhruvmanila committed Sep 29, 2023
    Commit: 3f81cb1
  7. Update Stylist quote detection with new f-string token (#7328)

    ## Summary
    
    This PR updates `Stylist` quote detection to include the f-string
    tokens.
    
    As f-strings cannot be used as docstrings, we'll skip the check for
    triple-quoted f-strings.
    
    ## Test Plan
    
    Add new test cases with f-strings.
    
    fixes: #7293
    dhruvmanila committed Sep 29, 2023
    Commit: deee2df
  8. Update PLE2510, PLE2512-2515 to check in f-strings (#7387)

    This PR updates `PLE2510`, `PLE2512-2515` to check in f-strings.
    
    > ### Reference:
    > * `PLE2510`: Invalid unescaped character backspace, use "\b" instead
    > * `PLE2512`: Invalid unescaped character SUB, use "\x1A" instead
    > * `PLE2513`: Invalid unescaped character ESC, use "\x1B" instead
    > * `PLE2514`: Invalid unescaped character NUL, use "\0" instead
    > * `PLE2515`: Invalid unescaped character zero-width-space, use
    > "\u200B" instead
    
    Add test cases for f-strings.
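    A minimal sketch of the character check these rules perform (names and structure are illustrative, not ruff's implementation):

    ```python
    # Map each disallowed raw character to its suggested escape, per the
    # rule messages above; scan any string or FStringMiddle text for them.
    SUGGESTED_ESCAPE = {
        "\b": "\\b",          # PLE2510: backspace
        "\x1a": "\\x1A",      # PLE2512: SUB
        "\x1b": "\\x1B",      # PLE2513: ESC
        "\0": "\\0",          # PLE2514: NUL
        "\u200b": "\\u200B",  # PLE2515: zero-width space
    }

    def invalid_characters(text):
        return [(i, SUGGESTED_ESCAPE[ch]) for i, ch in enumerate(text) if ch in SUGGESTED_ESCAPE]
    ```

    For example, `invalid_characters("a\x1bb")` reports the ESC character at offset 1 with the suggested `"\x1B"` escape.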
    dhruvmanila committed Sep 29, 2023
    Commit: 37b8a93
  9. Update F541 to use new f-string tokens (#7327)

    ## Summary
    
    This PR updates the `F541` rule to use the new f-string tokens.
    
    ## Test Plan
    
    Add new test case and uncomment a broken test case.
    
    fixes: #7292
    dhruvmanila committed Sep 29, 2023
    Commit: 049c8d5
  10. Update Indexer to use new f-string tokens (#7325)

    ## Summary
    
    This PR updates the `Indexer` to use the new f-string tokens to compute
    the `f_string_ranges` for f-strings. It adds a new abstraction which
    exposes two methods to support extracting the range for the surrounding
    innermost and outermost f-string. It uses the builder pattern to build
    the f-string ranges which is similar to how the comment ranges are
    built.
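    The two lookups can be sketched like this (hypothetical shape; the real abstraction works on text ranges built from the f-string tokens):

    ```python
    # Given the offsets of all detected f-string ranges, find the
    # innermost and outermost ranges that enclose a given offset.
    def enclosing_fstrings(offset, ranges):
        containing = [r for r in ranges if r[0] <= offset < r[1]]
        if not containing:
            return None, None
        innermost = min(containing, key=lambda r: r[1] - r[0])
        outermost = max(containing, key=lambda r: r[1] - r[0])
        return innermost, outermost
    ```

    With a nested f-string spanning `4..8` inside an outer one spanning `0..11`, an offset of 5 maps to innermost `(4, 8)` and outermost `(0, 11)`.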
    
    ## Test Plan
    
    Add new test cases for f-strings for:
    * Tab indentation rule
    * Line continuation detection in the indexer
    * To get the innermost / outermost f-string range
    * All detected f-string ranges
    
    fixes: #7290
    dhruvmanila committed Sep 29, 2023
    Commit: 57fac75
  11. Update RUF001, RUF003 to check in f-strings (#7477)

    ## Summary
    
    This PR updates the `RUF001` and `RUF003` rules to check in f-strings using
    the `FStringMiddle` token, which contains the non-expression part of an
    f-string.
    
    For reference,
    | Code | Name | Message|
    | --- | --- | --- |
    | RUF001 | ambiguous-unicode-character-string | String contains ambiguous {}. Did you mean {}? |
    | RUF003 | ambiguous-unicode-character-comment | Comment contains ambiguous {}. Did you mean {}? |
    
    ## Test Plan
    
    `cargo test`
    dhruvmanila committed Sep 29, 2023
    Commit: 21ca907
  12. Update W605 to check in f-strings (#7329)

    This PR updates `W605` (invalid-escape-sequence) to check inside
    f-strings. It also adds support for reporting violations on invalid
    escape sequences involving the curly braces within f-strings. So, the
    following cases will be identified:
    
    ```python
    f"\{1}"
    f"\{{1}}"
    f"{1:\}"
    ```
    
    The new CPython parser also gives out a syntax warning for such cases:
    
    ```
    fstring.py:1: SyntaxWarning: invalid escape sequence '\{'
      f"\{1}"
    fstring.py:2: SyntaxWarning: invalid escape sequence '\{'
      f"\{{1}}"
    fstring.py:3: SyntaxWarning: invalid escape sequence '\}'
      f"{1:\}"
    ```
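    The detection itself reduces to finding `\{` or `\}` in the f-string's inner text; a hedged sketch (not ruff's implementation):

    ```python
    import re

    # Illustrative check: report the offset and text of each invalid
    # `\{` / `\}` escape in an f-string's inner text.
    def invalid_brace_escapes(text):
        return [(m.start(), m.group()) for m in re.finditer(r"\\[{}]", text)]

    # The inner texts of f"\{1}" and f"{1:\}" contain one violation each.
    assert invalid_brace_escapes(r"\{1}") == [(0, "\\{")]
    assert invalid_brace_escapes("\\}") == [(0, "\\}")]
    ```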
    
    Nested f-strings are supported here, so the generated fix is aware of that
    and will create an edit for the proper f-string.
    
    Add new test cases for f-strings.
    
    fixes: #7295
    dhruvmanila committed Sep 29, 2023
    Commit: 19dc285
  13. Update ISC001, ISC002 to check in f-strings (#7515)

    ## Summary
    
    This PR updates the implicit string concatenation rules, specifically
    `ISC001` and `ISC002`, to account for the new f-string tokens. `ISC003`
    checks for explicit string concatenation and is not affected by PEP 701
    because it is AST-based.
    
    ### Implementation
    
    The implementation is based on the boundary tokens of the f-string which are
    `FStringStart` and `FStringEnd`. There are 4 cases to look for:
    1. `String` followed by `FStringStart`
    2. `FStringEnd` followed by `String`
    3. `FStringEnd` followed by `FStringStart`
    4. `String` followed by `String`
    
    For f-string tokens, we use the `Indexer` to get the entire range of the f-string.
    This is the range of the innermost f-string.
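    The four cases reduce to a membership test on adjacent token kinds; a minimal sketch (token kinds as plain strings here, not ruff's token enum):

    ```python
    # A concatenation is flagged whenever a token that can *end* a string
    # is immediately followed by one that can *start* a string.
    ENDS_STRING = {"String", "FStringEnd"}
    STARTS_STRING = {"String", "FStringStart"}

    def implicit_concatenations(token_kinds):
        return [
            (a, b)
            for a, b in zip(token_kinds, token_kinds[1:])
            if a in ENDS_STRING and b in STARTS_STRING
        ]
    ```

    For a token stream like `String, FStringStart, FStringMiddle, FStringEnd, String`, this flags the `String`/`FStringStart` and `FStringEnd`/`String` boundaries.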
    
    ## Test Plan
    
    Add new test cases for nested f-strings.
    dhruvmanila committed Sep 29, 2023
    Commit: 363aeb9
  14. Detect noqa directives for multi-line f-strings (#7326)

    ## Summary
    
    This PR updates the NoQA directive detection to consider the new
    f-string tokens.
    
    The reason is that there can now be multi-line f-strings without
    triple quotes:
    
    ```python
    f"{
    x
    	*
    		y
    }"
    ```
    
    Here, the `noqa` directive should go at the end of the last line.
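    The placement logic can be sketched as a small lookup (names are mine, not ruff's; the real implementation works with the token-derived f-string ranges):

    ```python
    # If a violation's line falls inside a multi-line f-string, the noqa
    # comment must be placed on the line where that f-string ends.
    def noqa_target_line(violation_line, fstring_line_spans):
        for start, end in fstring_line_spans:
            if start <= violation_line <= end:
                return end
        return violation_line

    # A violation on line 2 of an f-string spanning lines 1-5 maps to line 5.
    assert noqa_target_line(2, [(1, 5)]) == 5
    assert noqa_target_line(7, [(1, 5)]) == 7
    ```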
    
    ## Test Plan
    
    * Add new test cases for f-strings
    * Tested with `--add-noqa` using the following command with the above
    code snippet:
    ```console
    $ cargo run --bin ruff -- check --select=F821 --no-cache --isolated ~/playground/ruff/fstring.py --add-noqa
    Added 1 noqa directive.
    ```

    Output:

    ```python
    f"{
    x
    	*
    		y
    }"  # noqa: F821
    ```
    Running the same command again doesn't add another `noqa` directive,
    and without the `--add-noqa` flag, the violation isn't reported.
    
    fixes: #7291
    dhruvmanila committed Sep 29, 2023
    Commit: 3aceb6b
  15. Use the new f-string tokens in string formatting (#7586)

    ## Summary
    
    This PR updates the string formatter to account for the new f-string
    tokens.
    
    The formatter uses the full lexer to handle comments around implicitly
    concatenated strings. It uses the lexer because the AST merges the parts
    into a single node, so the boundaries aren't preserved.
    
    For f-strings, this creates some complexity now that they aren't
    represented as a single `String` token. A single f-string emits at least
    3 tokens (`FStringStart`, `FStringMiddle`, `FStringEnd`), and if it
    contains expressions, it emits the respective tokens for them as well.
    In our case, we're currently only interested in the outermost f-string
    range, for which I've introduced a new `FStringRangeBuilder` that builds
    the outermost f-string range by tracking the start and end tokens and
    the nesting level.
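    A hedged sketch of what such a builder does (the token tuples and names here are illustrative, not ruff's types):

    ```python
    # Track nesting depth on FStringStart/FStringEnd tokens and record a
    # range only when the depth returns to zero, i.e. the outermost f-string.
    def outermost_fstring_ranges(tokens):
        ranges, depth, start = [], 0, None
        for kind, lo, hi in tokens:
            if kind == "FStringStart":
                if depth == 0:
                    start = lo
                depth += 1
            elif kind == "FStringEnd":
                depth -= 1
                if depth == 0:
                    ranges.append((start, hi))
        return ranges
    ```

    For a nested f-string whose inner start/end tokens sit between the outer pair, only the single outer range is recorded.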
    
    Note that this doesn't support nested f-strings, which are out of scope
    for this PR. This means that if there are nested f-strings, especially
    ones using the same quotes, the formatter will escape the inner quotes:
    
    ```python
    f"hello world {
        x
            +
                f\"nested {y}\"
    }"
    ```
    
    ## Test plan
    
    ```
    cargo test --package ruff_python_formatter
    ```
    dhruvmanila committed Sep 29, 2023
    Commit: 97a5e35
  16. Ignore quote escapes in expression part of f-string (#7597)

    This PR fixes the following issues w.r.t. the PEP 701 changes:
    1. Mark all unformatted comments inside f-strings as formatted only _after_ the
       f-string has been formatted.
    2. Do not escape or remove the quote escape when normalizing the expression
       part of an f-string.
    
    This PR also updates the `--files-with-errors` number to be one less.
    This is because we can now parse the
    [`test_fstring.py`](https://discord.com/channels/1039017663004942429/1082324263199064206/1154633274887516254)
    file in the CPython repository, which contains the new f-string syntax.
    This is also the file that changes the similarity index for CPython
    compared to main.
    
    `cargo test -p ruff_python_formatter`
    
    | project | similarity index | total files | changed files |
    |--------------|------------------:|------------------:|------------------:|
    | cpython | 0.76051 | 1789 | 1632 |
    | django | 0.99983 | 2760 | 36 |
    | transformers | 0.99963 | 2587 | 323 |
    | twine | 1.00000 | 33 | 0 |
    | typeshed | 0.99979 | 3496 | 22 |
    | warehouse | 0.99967 | 648 | 15 |
    | zulip | 0.99972 | 1437 | 21 |
    
    | project | similarity index | total files | changed files |
    |--------------|------------------:|------------------:|------------------:|
    | cpython | 0.76083 | 1789 | 1631 |
    | django | 0.99983 | 2760 | 36 |
    | transformers | 0.99963 | 2587 | 323 |
    | twine | 1.00000 | 33 | 0 |
    | typeshed | 0.99979 | 3496 | 22 |
    | warehouse | 0.99967 | 648 | 15 |
    | zulip | 0.99972 | 1437 | 21 |
    dhruvmanila committed Sep 29, 2023
    Commit: 01123d5
  17. Commit: e9a2595
  18. Separate Q003 to accommodate f-string context (#7588)

    This PR updates the `Q003` rule to accommodate the new f-string context.
    The logic here takes into consideration the nested f-strings and the
    configured target version.
    
    The rule checks for escaped quotes within a string and determines if
    they are avoidable or not. It is avoidable if:
    1. Outer quote matches the user preferred quote
    2. Not a raw string
    3. Not a triple-quoted string
    4. String content contains the same quote as the outer one
    5. String content _doesn't_ contain the opposite quote
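    The five conditions combine into a single predicate; an illustrative sketch (parameter names are mine, not ruff's):

    ```python
    # An escaped quote is avoidable iff all five conditions above hold.
    def escape_is_avoidable(outer_quote, preferred_quote, is_raw, is_triple, body):
        opposite = "'" if outer_quote == '"' else '"'
        return (
            outer_quote == preferred_quote       # 1. matches the preference
            and not is_raw                        # 2. not a raw string
            and not is_triple                     # 3. not triple-quoted
            and ("\\" + outer_quote) in body      # 4. has an escaped outer quote
            and opposite not in body              # 5. opposite quote is absent
        )
    ```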
    
    For f-strings, the way it works is by using a context stack to keep
    track of certain things, mainly the text range (`FStringMiddle`) where
    the escapes exist. It contains the following:

    1. Do we want to check for escaped quotes in the current f-string? This
    is required to:
       * Preserve the context for `FStringMiddle` tokens where we need to
       check for escaped quotes. The answer to whether we need to check or
       not lies with the `FStringStart` token which contains the quotes, so
       when the context starts, we'll store this information.
       * Disallow nesting for pre-3.12 target versions
    2. Store the `FStringStart` token range. This is required to create the
    edit to replace the quote if this f-string contains escaped quote(s).
    3. All the `FStringMiddle` ranges where there are escaped quote(s).
    
    * Add new test cases for nested f-strings.
    * Write new tests for old Python versions as existing ones test it on
    the latest version by default which is 3.12 as of this writing.
    * Verify the snapshots
    dhruvmanila committed Sep 29, 2023
    Commit: 2ba96ee
  19. Update Q000, Q001 with the new f-string tokens (#7589)

    ## Summary
    
    This PR updates the `Q000`, and `Q001` rules to consider the new
    f-string tokens. The docstring rule (`Q002`) doesn't need to be updated
    because f-strings cannot be used as docstrings.
    
    I tried implementing nested f-string support, but there are still some
    edge cases in my current implementation, so I've decided to pause it for
    now and pick it up soon. For now, this doesn't support nested f-strings.
    
    ### Implementation
    
    The implementation uses the same `FStringRangeBuilder` introduced in
    #7586 to build up the outermost f-string range. The same implementation
    is used because this is a temporary solution until we add support for
    nested f-strings.
    
    ## Test Plan
    
    `cargo test`
    dhruvmanila committed Sep 29, 2023
    Commit: 2d8270f
  20. Fix clippy, cargo fmt

    dhruvmanila committed Sep 29, 2023
    Commit: 0563dcb
  21. Commit: 6114f61
  22. Fix rebase

    dhruvmanila committed Sep 29, 2023
    Commit: 57501e2
  23. Add support for the new f-string tokens per PEP 701 (#6659)

    dhruvmanila committed Sep 29, 2023
    Commit: 3839819
  24. Add support for parsing f-string as per PEP 701 (#7041)

    This PR adds support for PEP 701 in the parser to use the new tokens
    emitted by the lexer to construct the f-string node.
    
    Without an official grammar, the f-strings were parsed manually. Now
    that we've the specification, that is being used in the LALRPOP to parse
    the f-strings.
    
    This file includes the logic for parsing string literals and joining the
    implicit string concatenation. Now that we don't require parsing
    f-strings manually a lot of code involving the same is removed.
    
    Earlier, there were 2 entry points to this module:
    * `parse_string`: Used to parse a single string literal
    * `parse_strings`: Used to parse strings which were implicitly
    concatenated
    
    Now, there are 3 entry points:
    * `parse_string_literal`: Renamed from `parse_string`
    * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
    basically a string literal without the quotes
    * `concatenate_strings`: Renamed from `parse_strings` but now it takes
    the parsed nodes instead. So, we just need to concatenate them into a
    single node.
    
    > A short primer on the `FStringMiddle` token: it covers the portion of
    text inside the f-string that's not part of an expression and isn't an
    opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
    `foo `, `.3f` and ` bar` parts are `FStringMiddle` token contents.
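
    These text segments also surface in CPython's AST as `Constant` nodes,
    which makes the example easy to verify with the stdlib `ast` module
    (independent of Ruff's parser):

    ```python
    import ast

    # The non-expression segments of the f-string ("foo ", ".3f", " bar")
    # appear as Constant nodes, including the one nested in the format spec.
    tree = ast.parse('f"foo {bar:.3f{x}} bar"')
    parts = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
    print(parts)  # [' bar', '.3f', 'foo ']
    ```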
    
    Discussion in the official implementation:
    python/cpython#102855 (comment)
    
    This change in the AST is when unicode strings (prefixed with `u`) and
    f-strings are used in an implicitly concatenated string value. For
    example,
    
    ```python
    u"foo" f"{bar}" "baz" " some"
    ```
    
    Before Python 3.12, the `kind` field would be assigned to every part of an
    implicitly concatenated string as long as the prefix was on the first
    string. So, taking the above example, both `"foo"` and `"baz some"`
    (implicit concatenation) would be given the `u` kind:
    
    <details><summary>Pre 3.12 AST:</summary>
    <p>
    
    ```python
    Constant(value='foo', kind='u'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='baz some', kind='u')
    ```
    
    </p>
    </details>
    
    But, from Python 3.12 onwards, only the string with the `u` prefix will be
    assigned the kind:
    
    <details><summary>3.12 AST:</summary>
    <p>
    
    ```python
    Constant(value='foo', kind='u'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='baz some')
    ```
    
    </p>
    </details>
    
    Here are some more iterations around the change:
    
    1. `"foo" f"{bar}" u"baz" "no"`
    
    <details><summary>Pre 3.12</summary>
    <p>
    
    ```python
    Constant(value='foo'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='bazno')
    ```
    
    </p>
    </details>
    
    <details><summary>3.12</summary>
    <p>
    
    ```python
    Constant(value='foo'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='bazno', kind='u')
    ```
    
    </p>
    </details>
    
    2. `"foo" f"{bar}" "baz" u"no"`
    
    <details><summary>Pre 3.12</summary>
    <p>
    
    ```python
    Constant(value='foo'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='bazno')
    ```
    
    </p>
    </details>
    
    <details><summary>3.12</summary>
    <p>
    
    ```python
    Constant(value='foo'),
    FormattedValue(
      value=Name(id='bar', ctx=Load()),
      conversion=-1),
    Constant(value='bazno')
    ```
    
    </p>
    </details>
    
    3. `u"foo" f"bar {baz} realy" u"bar" "no"`
    
    <details><summary>Pre 3.12</summary>
    <p>
    
    ```python
    Constant(value='foobar ', kind='u'),
    FormattedValue(
      value=Name(id='baz', ctx=Load()),
      conversion=-1),
    Constant(value=' realybarno', kind='u')
    ```
    
    </p>
    </details>
    
    <details><summary>3.12</summary>
    <p>
    
    ```python
    Constant(value='foobar ', kind='u'),
    FormattedValue(
      value=Name(id='baz', ctx=Load()),
      conversion=-1),
    Constant(value=' realybarno')
    ```
    
    </p>
    </details>
    
    With the hand-written parser, we were able to provide better error
    messages for cases such as the following, but those messages are all
    removed now, and in those cases a generic "unexpected token" error will be
    thrown by LALRPOP:
    * A closing delimiter was not opened properly
    * An opening delimiter was not closed properly
    * Empty expression not allowed
    
    The "Too many nested expressions in an f-string" error was removed;
    instead, we can create a lint rule for that.
    
    And "The f-string expression cannot include the given character" was
    removed because f-strings now support those characters, which are mainly
    the same quotes as the outer ones, escape sequences, comments, etc.
    
    1. Refactor existing test cases to use `parse_suite` instead of
    `parse_fstrings` (which doesn't exist anymore)
    2. Additional test cases are added as required
    
    Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
    means that the snapshot would produce the module node instead of just a
    list of f-string parts. I've manually verified that the parts are still
    the same along with the node ranges.
    
    #7263 (comment)
    
    fixes: #7043
    fixes: #6835
    dhruvmanila committed Sep 29, 2023 (94b0b52)
  25. Use narrow type for string parsing patterns (#7211)

    This PR adds a new enum type `StringType` which is either a string
    literal, byte literal or f-string. The motivation behind this is to have
    a narrow type which is accepted in `concatenate_strings` as that
    function is only applicable for the mentioned 3 types. This makes the
    code more readable and easy to reason about.
    
    A future improvement (which was prototyped here and removed) is to split
    the current string literal pattern in LALRPOP definition into two parts:
    1. A single string literal or an f-string: This means no checking for
    bytes / non-bytes and other unnecessary computation
    2. Two or more of string/byte/f-string: This will call the
    `concatenate_strings` function.
    
    The second change was removed because of how ranges work. The range for an
    individual string/byte is the entire range, which includes the quotes as
    well, but if the same string/byte is part of an f-string, then it only
    includes the range for the content (without the quotes / the inner range).
    The current string parser returns the former range.
    
    To give an example: for `"foo"`, the range of the string would be `0..5`,
    but for `f"foo"` the range of the string would be `2..5` while the range
    of the f-string expression would be `0..6`. The ranges are correct, but
    they differ depending on the context the string constant is used in: is it
    part of an f-string or a standalone string?
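
    The standalone and outer f-string ranges can be sanity-checked against
    CPython's AST offsets (a small stdlib `ast` illustration; the inner `2..5`
    content range is the one the string parser reports for the constant inside
    the f-string):

    ```python
    import ast

    # Standalone string literal: the node's range includes the quotes -> 0..5.
    string = ast.parse('"foo"').body[0].value
    print(string.col_offset, string.end_col_offset)  # 0 5

    # The f-string expression spans the prefix and the quotes -> 0..6.
    fstring = ast.parse('f"foo"').body[0].value
    print(fstring.col_offset, fstring.end_col_offset)  # 0 6
    ```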
    
    `cargo test`
    dhruvmanila committed Sep 29, 2023 (4521198)
  26. Disallow non-parenthesized lambda expr in f-string (#7263)

    This PR updates the handling of disallowing non-parenthesized lambda
    expr in f-strings.
    
    Previously, the lexer was used to emit an empty `FStringMiddle` token in
    certain cases for which there's no pattern in the parser to match. That would
    then raise an unexpected token error while parsing.
    
    This PR adds a new f-string error type `LambdaWithoutParentheses`. In
    cases where the parser still can't detect the error, it's guaranteed to be
    caught by the fact that there's no `FStringMiddle` token in the pattern.
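
    CPython rejects the same construct, so the expected behavior can be
    sanity-checked with `compile` (independent of Ruff's parser):

    ```python
    # A bare lambda is ambiguous because its ":" looks like the start of a
    # format spec, so it's a syntax error; parenthesizing it is fine.
    try:
        compile('f"{lambda x: x}"', "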
    
    Add test cases wherever we throw the `LambdaWithoutParentheses` error.
    
    As this is the final PR for the parser, I'm putting the parser benchmarks here:
    
    ```
    group                         fstring-parser                         main
    -----                         --------------                         ----
    parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
    parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
    parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
    parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
    parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
    ```
    dhruvmanila committed Sep 29, 2023 (3cc5455)
  27. Fix curly brace escape handling in f-strings (#7331)

    ## Summary
    
    This PR fixes the escape handling of curly braces inside a f-string.
    There are 2 main changes:
    
    ### Lexer
    
    The lexer change was actually a bug. Instead of breaking as soon as we
    find a curly brace after the `\` character, we'll continue and let the next
    iteration handle it in the curly brace branch. This fixes the following case:
    
    ```python
    f"\{{foo}}"
    #  ^ use the curly brace branch to handle this character instead of breaking
    ```
    
    ### Parser
    
    We can encounter a `\` as the last character in a `FStringMiddle` token
    which is valid in this context[^1]. For example,
    
    ```python
    f"\{foo} \{bar:\}"
    # ^     ^^     ^
    # The marked characters are part of 3 different `FStringMiddle` token
    ```
    
    Here, the `FStringMiddle` token content will be `"\"` and `" \"`, which
    is invalid in a regular string literal. However, it's valid here because
    it's a substring of an f-string. Even though curly braces cannot be
    escaped, it's valid syntax.
    
    [^1]: Refer to point 3 in https://peps.python.org/pep-0701/#rejected-ideas
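
    The retained backslash can also be observed at runtime in CPython (a
    hedged example; CPython emits a warning for the invalid `\{` escape but
    still accepts the syntax):

    ```python
    foo = 1

    # "\{{foo}}": the backslash stays literal, "{{" / "}}" are escaped braces.
    assert f"\{{foo}}" == "\\{foo}"

    # "\{foo}": a literal backslash followed by a replacement field.
    assert f"\{foo}" == "\\1"
    ```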
    
    ## Test Plan
    
    Verified that existing test cases are passing and add new test cases for
    the lexer and parser.
    dhruvmanila committed Sep 29, 2023 (63b03b7)
  28. Allow NUL character in f-strings (#7378)

    ## Summary
    
    This PR fixes the bug to allow `NUL` (`\0`) character inside f-strings.
    
    ## Test Plan
    
    Add test case with `NUL` character inside f-string.
    dhruvmanila committed Sep 29, 2023 (846948f)
  29. Update Stylist quote detection with new f-string token (#7328)

    ## Summary
    
    This PR updates `Stylist` quote detection to include the f-string
    tokens.
    
    As f-strings cannot be used as docstrings, we'll skip the check for
    triple-quoted f-strings.
    
    ## Test Plan
    
    Add new test cases with f-strings.
    
    fixes: #7293
    dhruvmanila committed Sep 29, 2023 (658435e)
  30. Update PLE2510, PLE2512-2515 to check in f-strings (#7387)

    This PR updates `PLE2510`, `PLE2512-2515` to check in f-strings.
    
    > ### Reference:
    > * `PLE2510`: Invalid unescaped character backspace, use "\b" instead
    > * `PLE2512`: Invalid unescaped character SUB, use "\x1A" instead
    > * `PLE2513`: Invalid unescaped character ESC, use "\x1B" instead
    > * `PLE2514`: Invalid unescaped character NUL, use "\0" instead
    > * `PLE2515`: Invalid unescaped character zero-width-space, use "\u200B" instead
    
    Add test cases for f-strings.
    dhruvmanila committed Sep 29, 2023 (402ac49)
  31. Update F541 to use new f-string tokens (#7327)

    ## Summary
    
    This PR updates the `F541` rule to use the new f-string tokens.
    
    ## Test Plan
    
    Add new test case and uncomment a broken test case.
    
    fixes: #7292
    dhruvmanila committed Sep 29, 2023 (4481558)
  32. Update Indexer to use new f-string tokens (#7325)

    ## Summary
    
    This PR updates the `Indexer` to use the new f-string tokens to compute
    the `f_string_ranges` for f-strings. It adds a new abstraction which
    exposes two methods to support extracting the range for the surrounding
    innermost and outermost f-string. It uses the builder pattern to build
    the f-string ranges which is similar to how the comment ranges are
    built.
    
    ## Test Plan
    
    Add new test cases for f-strings for:
    * Tab indentation rule
    * Line continuation detection in the indexer
    * To get the innermost / outermost f-string range
    * All detected f-string ranges
    
    fixes: #7290
    dhruvmanila committed Sep 29, 2023 (124cd4a)
  33. Update RUF001, RUF003 to check in f-strings (#7477)

    ## Summary
    
    This PR updates the rule `RUF001` and `RUF003` to check in f-strings using the
    `FStringMiddle` token which contains the non-expression part of a f-string.
    
    For reference,
    | Code | Name | Message|
    | --- | --- | --- |
    | RUF001 | ambiguous-unicode-character-string | String contains ambiguous {}. Did you mean {}? |
    | RUF003 | ambiguous-unicode-character-comment | Comment contains ambiguous {}. Did you mean {}? |
    
    ## Test Plan
    
    `cargo test`
    dhruvmanila committed Sep 29, 2023 (49ea2e5)
  34. Update W605 to check in f-strings (#7329)

    This PR updates the `W605` (invalid-escape-sequence) to check inside
    f-strings. It also adds support to report violation on invalid escape
    sequence within f-strings w.r.t. the curly braces. So, the following
    cases will be identified:
    
    ```python
    f"\{1}"
    f"\{{1}}"
    f"{1:\}"
    ```
    
    The new CPython parser also gives out a syntax warning for such cases:
    
    ```
    fstring.py:1: SyntaxWarning: invalid escape sequence '\{'
      f"\{1}"
    fstring.py:2: SyntaxWarning: invalid escape sequence '\{'
      f"\{{1}}"
    fstring.py:3: SyntaxWarning: invalid escape sequence '\}'
      f"{1:\}"
    ```
    
    Nested f-strings are supported here, so the generated fix is aware of that
    and will create an edit for the proper f-string.
    
    Add new test cases for f-strings.
    
    fixes: #7295
    dhruvmanila committed Sep 29, 2023 (26d5daf)
  35. Update ISC001, ISC002 to check in f-strings (#7515)

    ## Summary
    
    This PR updates the implicit string concatenation rules, specifically
    `ISC001` and `ISC002`, to account for the new f-string tokens. `ISC003`
    checks for explicit string concatenation and is not affected by PEP 701
    because it is based on the AST.
    
    ### Implementation
    
    The implementation is based on the boundary tokens of the f-string which are
    `FStringStart` and `FStringEnd`. There are 4 cases to look for:
    1. `String` followed by `FStringStart`
    2. `FStringEnd` followed by `String`
    3. `FStringEnd` followed by `FStringStart`
    4. `String` followed by `String`
    
    For f-string tokens, we use the `Indexer` to get the entire range of the f-string.
    This is the range of the innermost f-string.
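
    A minimal sketch of this boundary-pair check (hypothetical token kinds and
    helper names, not Ruff's actual types):

    ```python
    # Adjacent token-kind pairs that indicate implicit string concatenation.
    IMPLICIT_CONCAT_PAIRS = {
        ("String", "FStringStart"),
        ("FStringEnd", "String"),
        ("FStringEnd", "FStringStart"),
        ("String", "String"),
    }

    def implicit_concat_indices(kinds):
        """Yield each index i where kinds[i] and kinds[i + 1] are implicitly concatenated."""
        for i, pair in enumerate(zip(kinds, kinds[1:])):
            if pair in IMPLICIT_CONCAT_PAIRS:
                yield i

    # 'foo' f"bar" 'baz': boundaries at index 0 (String/FStringStart)
    # and index 3 (FStringEnd/String).
    kinds = ["String", "FStringStart", "FStringMiddle", "FStringEnd", "String"]
    print(list(implicit_concat_indices(kinds)))  # [0, 3]
    ```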
    
    ## Test Plan
    
    Add new test cases for nested f-strings.
    dhruvmanila committed Sep 29, 2023 (fac0974)
  36. Detect noqa directives for multi-line f-strings (#7326)

    ## Summary
    
    This PR updates the NoQA directive detection to consider the new
    f-string tokens.
    
    The reason being that now there can be multi-line f-strings without
    triple-quotes:
    
    ```python
    f"{
    x
    	*
    		y
    }"
    ```
    
    Here, the `noqa` directive should go at the end of the last line.
    
    ## Test Plan
    
    * Add new test cases for f-strings
    * Tested with `--add-noqa` using the following command with the above
    code snippet:
    
    ```console
    $ cargo run --bin ruff -- check --select=F821 --no-cache --isolated ~/playground/ruff/fstring.py --add-noqa
    Added 1 noqa directive.
    ```
    
    Output:
    
    ```python
    f"{
    x
    	*
    		y
    }"  # noqa: F821
    ```
    Running the same command again doesn't add `noqa` directive and without
    the `--add-noqa` flag, the violation isn't reported.
    
    fixes: #7291
    dhruvmanila committed Sep 29, 2023 (5077513)
  37. Use the new f-string tokens in string formatting (#7586)

    ## Summary
    
    This PR updates the string formatter to account for the new f-string
    tokens.
    
    The formatter uses the full lexer to handle comments around implicitly
    concatenated strings. It uses the lexer because the AST merges the parts
    into a single node, so the boundaries aren't preserved.
    
    For f-strings, it creates some complexity now that an f-string isn't
    represented as a single `String` token. A single f-string will emit at
    least 3 tokens (`FStringStart`, `FStringMiddle`, `FStringEnd`) and, if it
    contains expressions, it'll emit the respective tokens for them as well.
    In our case, we're currently only interested in the outermost f-string
    range, for which I've introduced a new `FStringRangeBuilder` that builds
    the outermost f-string range by considering the start and end tokens and
    the nesting level.
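
    The builder boils down to a depth counter over the boundary tokens; a
    hedged Python sketch of the idea (illustrative names and token tuples, not
    the Rust implementation):

    ```python
    def outermost_fstring_ranges(tokens):
        """Collect (start, end) ranges of outermost f-strings from (kind, start, end) tokens."""
        ranges = []
        depth = 0
        start = None
        for kind, tok_start, tok_end in tokens:
            if kind == "FStringStart":
                if depth == 0:
                    start = tok_start
                depth += 1
            elif kind == "FStringEnd":
                depth -= 1
                if depth == 0:
                    ranges.append((start, tok_end))
        return ranges

    # An f-string nested inside another only contributes to the outer range.
    tokens = [
        ("FStringStart", 0, 2),
        ("FStringMiddle", 2, 14),
        ("FStringStart", 20, 22),
        ("FStringEnd", 30, 31),
        ("FStringEnd", 40, 41),
    ]
    print(outermost_fstring_ranges(tokens))  # [(0, 41)]
    ```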
    
    Note that this doesn't support nested f-strings in any way, which is out
    of scope for this PR. This means that if there are nested f-strings,
    especially ones using the same quote, the formatter will escape the
    inner quotes:
    
    ```python
    f"hello world {
        x
            +
                f\"nested {y}\"
    }"
    ```
    
    ## Test plan
    
    ```
    cargo test --package ruff_python_formatter
    ```
    dhruvmanila committed Sep 29, 2023 (e032429)
  38. Ignore quote escapes in expression part of f-string (#7597)

    This PR fixes the following issues w.r.t. the PEP 701 changes:
    1. Mark all unformatted comments inside f-strings as formatted only _after_ the
       f-string has been formatted.
    2. Do not escape or remove the quote escape when normalizing the expression
       part of a f-string.
    
    This PR also updates the `--files-with-errors` number to be 1 less. This is
    because we can now parse the
    [`test_fstring.py`](https://discord.com/channels/1039017663004942429/1082324263199064206/1154633274887516254)
    file in the CPython repository which contains the new f-string syntax. This is
    also the file which updates the similarity index for CPython compared to main.
    
    `cargo test -p ruff_python_formatter`
    
    | project | similarity index | total files | changed files |
    |--------------|------------------:|------------------:|------------------:|
    | cpython | 0.76051 | 1789 | 1632 |
    | django | 0.99983 | 2760 | 36 |
    | transformers | 0.99963 | 2587 | 323 |
    | twine | 1.00000 | 33 | 0 |
    | typeshed | 0.99979 | 3496 | 22 |
    | warehouse | 0.99967 | 648 | 15 |
    | zulip | 0.99972 | 1437 | 21 |
    
    | project | similarity index | total files | changed files |
    |--------------|------------------:|------------------:|------------------:|
    | cpython | 0.76083 | 1789 | 1631 |
    | django | 0.99983 | 2760 | 36 |
    | transformers | 0.99963 | 2587 | 323 |
    | twine | 1.00000 | 33 | 0 |
    | typeshed | 0.99979 | 3496 | 22 |
    | warehouse | 0.99967 | 648 | 15 |
    | zulip | 0.99972 | 1437 | 21 |
    dhruvmanila committed Sep 29, 2023 (e4e2b45)
  39. c61d134
  40. Separate Q003 to accommodate f-string context (#7588)

    This PR updates the `Q003` rule to accommodate the new f-string context.
    The logic here takes into consideration the nested f-strings and the
    configured target version.
    
    The rule checks for escaped quotes within a string and determines if
    they are avoidable or not. It is avoidable if:
    1. Outer quote matches the user preferred quote
    2. Not a raw string
    3. Not a triple-quoted string
    4. String content contains the same quote as the outer one
    5. String content _doesn't_ contain the opposite quote
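
    A hedged sketch of these five conditions as a predicate (illustrative
    names, not Ruff's actual API):

    ```python
    def escaped_quotes_avoidable(outer, preferred, is_raw, is_triple, content):
        """Return True if the escaped quotes in `content` are avoidable."""
        opposite = "'" if outer == '"' else '"'
        return (
            outer == preferred              # 1. outer quote matches the preference
            and not is_raw                  # 2. not a raw string
            and not is_triple               # 3. not triple-quoted
            and ("\\" + outer) in content   # 4. contains the escaped outer quote
            and opposite not in content     # 5. opposite quote is absent
        )

    print(escaped_quotes_avoidable('"', '"', False, False, 'say \\"hi\\"'))  # True
    print(escaped_quotes_avoidable('"', '"', False, False, "it's \\\"x\\\""))  # False
    ```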
    
    For f-strings, this works by using a context stack to keep track of
    certain things, mainly the text ranges (`FStringMiddle`) where the
    escapes exist. It contains the following:
    
    1. Whether we want to check for escaped quotes in the current f-string.
    This is required to:
        * Preserve the context for `FStringMiddle` tokens where we need to
    check for escaped quotes. The answer to whether we need to check or not
    lies with the `FStringStart` token, which contains the quotes, so we
    store this information when the context starts.
        * Disallow nesting for pre-3.12 target versions
    2. The `FStringStart` token range. This is required to create the edit
    that replaces the quote if this f-string contains escaped quote(s).
    3. All the `FStringMiddle` ranges where there are escaped quote(s).
    
    * Add new test cases for nested f-strings.
    * Write new tests for older Python versions, as the existing ones test on
    the latest version by default, which is 3.12 as of this writing.
    * Verify the snapshots
    dhruvmanila committed Sep 29, 2023 (48d4f23)
  41. Update Q000, Q001 with the new f-string tokens (#7589)

    ## Summary
    
    This PR updates the `Q000`, and `Q001` rules to consider the new
    f-string tokens. The docstring rule (`Q002`) doesn't need to be updated
    because f-strings cannot be used as docstrings.
    
    I tried implementing nested f-string support, but there are still some
    edge cases in my current implementation, so I've decided to pause it for
    now and pick it up sometime soon. For now, this doesn't support nested
    f-strings.
    
    ### Implementation
    
    The implementation uses the same `FStringRangeBuilder` introduced in
    #7586 to build up the outermost f-string range. The same implementation is
    used because this is a temporary solution until we add support for nested
    f-strings.
    
    ## Test Plan
    
    `cargo test`
    dhruvmanila committed Sep 29, 2023 (a04d4c1)
  42. Fix clippy, cargo fmt

    dhruvmanila committed Sep 29, 2023 (0b676cb)
  43. 68e7018
  44. b8b5131
  45. 204a62c