Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for parsing f-string as per PEP 701 #7041

Merged
merged 17 commits into from
Sep 14, 2023

Conversation

dhruvmanila
Copy link
Member

@dhruvmanila dhruvmanila commented Sep 1, 2023

Summary

This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node.

Grammar

Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings.

string.rs

This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:

  • parse_string: Used to parse a single string literal
  • parse_strings: Used to parse strings which were implicitly concatenated

Now, there are 3 entry points:

  • parse_string_literal: Renamed from parse_string
  • parse_fstring_middle: Used to parse a FStringMiddle token which is basically a string literal without the quotes
  • concatenate_strings: Renamed from parse_strings but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node.

A short primer on FStringMiddle token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in f"foo {bar:.3f{x}} bar", the foo , .3f and bar are FStringMiddle token content.

Constant::kind changed in the AST

Discussion in the official implementation: python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with u) and f-strings are used in an implicitly concatenated string value. For example,

u"foo" f"{bar}" "baz" " some"

Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both "foo" and "baz some" (implicit concatenation) would be given the u kind:

Pre 3.12 AST:

Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')

But, post Python 3.12, only the string with the u prefix will be assigned the value:

Pre 3.12 AST:

Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')

Here are some more iterations around the change:

  1. "foo" f"{bar}" u"baz" "no"
Pre 3.12

Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')

3.12

Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')

  1. "foo" f"{bar}" "baz" u"no"
Pre 3.12

Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')

3.12

Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')

  1. u"foo" f"bar {baz} realy" u"bar" "no"
Pre 3.12

Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')

3.12

Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')

Errors

With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop:

  • A closing delimiter was not opened properly
  • An opening delimiter was not closed properly
  • Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc.

Test Plan

  1. Refactor existing test cases to use parse_suite instead of parse_fstrings (doesn't exists anymore)
  2. Additional test cases are added as required

Updated the snapshots. The change from parse_fstrings to parse_suite means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges.

Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835

@codspeed-hq
Copy link

codspeed-hq bot commented Sep 1, 2023

CodSpeed Performance Report

Merging #7041 will degrade performances by 9.7%

⚠️ No base runs were found

Falling back to comparing dhruv/fstring-parser (cd2e18b) with main (04183b0)

Summary

❌ 8 regressions
✅ 17 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main dhruv/fstring-parser Change
lexer[numpy/globals.py] 233.2 µs 252.9 µs -7.8%
lexer[unicode/pypinyin.py] 620.2 µs 673.4 µs -7.89%
lexer[large/dataset.py] 9.8 ms 10.7 ms -8.42%
parser[numpy/ctypeslib.py] 12.4 ms 12.7 ms -2.35%
lexer[numpy/ctypeslib.py] 2 ms 2.1 ms -7.65%
parser[large/dataset.py] 68.4 ms 70.1 ms -2.34%
lexer[pydantic/types.py] 4.1 ms 4.6 ms -9.7%
parser[unicode/pypinyin.py] 4.3 ms 4.4 ms -2.6%

@dhruvmanila dhruvmanila linked an issue Sep 1, 2023 that may be closed by this pull request
@dhruvmanila dhruvmanila linked an issue Sep 2, 2023 that may be closed by this pull request
@dhruvmanila dhruvmanila force-pushed the dhruv/fstring-parser branch 3 times, most recently from ff7024a to d34b4d3 Compare September 5, 2023 02:49
@dhruvmanila dhruvmanila marked this pull request as ready for review September 5, 2023 04:19
@MichaReiser MichaReiser added accepted Ready for implementation parser Related to the parser and removed accepted Ready for implementation labels Sep 5, 2023
Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Overall this is looking good to me. It will be interesting to get some ecosystem checks once ruff compiles.

Most of my comments are nits but I think there's potential to clean up string.rs further (and hopefully improving performance at the same time). I leave it up to you if you want to do this as part of this or a follow up PR.

crates/ruff/src/linter.rs Show resolved Hide resolved
/// assert!(expr.is_ok());
/// ```
pub fn parse_tokens(
lxr: impl IntoIterator<Item = LexResult>,
source: &str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to start grouping source, mode and source_path by an abstraction but we don't have to do this as part of this PR. But we should probably do it before introducing more arguments.

crates/ruff_python_parser/src/lexer.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/python.lalrpop Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/python.lalrpop Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/string.rs Outdated Show resolved Hide resolved
Comment on lines 257 to 280
// This is to account for the empty `FStringMiddle` token that is created
// to check for non-parenthesized lambda expressions.
if source.is_empty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand this comment with an example and why it is necessary? I know we talked about it on Discord but new and external maintainers won't know about the conversation or don't have access to it.

Overall: this is unfortunate because it requires moving all parts into a new Vec (flatten, then collect). I wonder if a dedicated token would be a better solution

Comment on lines +312 to 338
for parenthesized_expr in strings {
match parenthesized_expr.expr {
Expr::Constant(ast::ExprConstant {
value: Constant::Bytes(BytesConstant { value, .. }),
..
}) => content.extend(value),
_ => unreachable!("Unexpected non-bytes expression."),
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add a fast path (for the whole method) that handles the most common case of a single string:

  • No need to test if it is a mixed byte / non-byte concatenation
  • No need to copy the value (value.into_bytes() or use the original string)

crates/ruff_python_parser/src/string.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/string.rs Outdated Show resolved Hide resolved
Copy link
Member

@konstin konstin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The infinite loop for something that succesfully fails in the unit test is strange

crates/ruff_python_parser/src/lexer.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/parser.rs Show resolved Hide resolved
crates/ruff_python_parser/src/python.lalrpop Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/string.rs Outdated Show resolved Hide resolved
Base automatically changed from dhruv/fstring-tokens to dhruv/pep-701 September 14, 2023 01:46
@dhruvmanila
Copy link
Member Author

Thanks for this PR!

Hey, thanks for the comment. We're in the final phase of making the necessary changes across other parts of the codebase (linter and formatter) and then we'll merge this.

@dhruvmanila dhruvmanila merged commit 0dada9f into dhruv/pep-701 Sep 14, 2023
10 of 14 checks passed
@dhruvmanila dhruvmanila deleted the dhruv/fstring-parser branch September 14, 2023 02:00
dhruvmanila added a commit that referenced this pull request Sep 18, 2023
## Summary

This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

### `string.rs`

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

### `Constant::kind` changed in the AST

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details> 

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details> 

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details> 

### Errors

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 19, 2023
## Summary

This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

### `string.rs`

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

### `Constant::kind` changed in the AST

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details> 

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details> 

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details> 

### Errors

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 20, 2023
## Summary

This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

### `string.rs`

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

### `Constant::kind` changed in the AST

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details> 

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details> 

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details> 

### Errors

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 26, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 27, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 28, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 29, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 29, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Related to the parser python312 Related to Python 3.12
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for PEP 701 in the parser [Linter panic] failing to parse odd f-string f"{x, y=}"
4 participants