Add support for PEP 701 in the parser #7043

Closed
Tracked by #6502
dhruvmanila opened this issue Sep 1, 2023 · 0 comments · Fixed by #7041, #7211, #7263 or #7331
dhruvmanila commented Sep 1, 2023

The task is to use the new tokens emitted by the lexer (#7042) to construct the `ExprFString` node. This will require a major refactor of `string.rs`.

Relevant PRs:

dhruvmanila added parser Related to the parser python312 Related to Python 3.12 labels Sep 1, 2023
dhruvmanila self-assigned this Sep 1, 2023
dhruvmanila linked a pull request Sep 1, 2023 that will close this issue
dhruvmanila added a commit that referenced this issue Sep 14, 2023
## Summary

This PR adds PEP 701 support to the parser, which now uses the new
tokens emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, f-strings were previously parsed by hand.
Now that a specification exists, the LALRPOP grammar uses it to parse
f-strings.

### `string.rs`

This file contains the logic for parsing string literals and joining
implicitly concatenated strings. Since f-strings no longer need to be
parsed manually, most of the code related to that is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token, which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings`, but it now takes
the already parsed nodes, so it only needs to concatenate them into a
single node.

> A short primer on the `FStringMiddle` token: it covers the portion of
text inside the f-string that's not part of an expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`,
`foo `, `.3f` and ` bar` are the contents of the `FStringMiddle` tokens.
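
To make the primer concrete, here is a toy Python sketch that splits the body of an f-string (the text between the quotes) into its literal chunks. It is not the lexer from this PR: `fstring_middles` is a hypothetical helper, and it deliberately ignores strings, `!r` conversions, `!=` and nested brackets inside expressions.

```python
def fstring_middles(body: str) -> list[str]:
    """Return the literal ("middle") chunks of an f-string body.

    Simplified illustration only: expressions are skipped character by
    character and the first ':' inside one starts its format spec.
    """
    middles: list[str] = []
    buf: list[str] = []
    # Mode stack: "lit" = top-level text, "expr" = inside {...},
    # "spec" = format spec after ':' in an expression.
    stack = ["lit"]

    def flush() -> None:
        if buf:
            middles.append("".join(buf))
            buf.clear()

    i = 0
    while i < len(body):
        ch = body[i]
        mode = stack[-1]
        if mode == "lit":
            if body.startswith("{{", i) or body.startswith("}}", i):
                buf.append(ch)  # escaped brace stays literal text
                i += 2
                continue
            if ch == "{":
                flush()
                stack.append("expr")
            else:
                buf.append(ch)
        elif mode == "spec":
            if ch == "{":
                flush()
                stack.append("expr")  # nested replacement field
            elif ch == "}":
                flush()
                stack.pop()  # leave the format spec
                stack.pop()  # and the expression that owns it
            else:
                buf.append(ch)
        else:  # "expr": skip the expression text itself
            if ch == ":":
                stack.append("spec")
            elif ch == "}":
                stack.pop()
        i += 1
    flush()
    return middles


print(fstring_middles("foo {bar:.3f{x}} bar"))
```

For the primer's example, the sketch yields the same three chunks that the primer lists.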

### `Constant::kind` changed in the AST

Discussion in the official implementation:
python/cpython#102855 (comment)

The AST changes when unicode strings (prefixed with `u`) and f-strings
are used in an implicitly concatenated string value. For example,

```python
u"foo" f"{bar}" "baz" " some"
```

Before Python 3.12, the `kind` field was assigned only if the prefix was
on the first string. So, in the example above, both `"foo"` and
`"baz some"` (implicit concatenation) are given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details> 

But, from Python 3.12 onwards, only the string with the `u` prefix is
assigned the kind:

<details><summary>3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>
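
The difference can be checked against CPython directly. The following is a small sketch using only the standard `ast` module; `describe` is a helper name introduced here for illustration. The `kind` printed for the last constant depends on the interpreter version that runs it.

```python
import ast

source = 'u"foo" f"{bar}" "baz" " some"'
# In "eval" mode the body is the expression itself, here a JoinedStr.
joined = ast.parse(source, mode="eval").body


def describe(node: ast.AST) -> tuple:
    """Summarise one part of the JoinedStr as (node type, value, kind)."""
    if isinstance(node, ast.Constant):
        return ("Constant", node.value, node.kind)
    return (type(node).__name__, None, None)


parts = [describe(value) for value in joined.values]
for part in parts:
    print(part)
```

Running the same script under an older and a newer interpreter should show the `kind` of `'baz some'` changing while the values stay the same.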

Here are some more variations of the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details> 

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details> 
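
The examples above are consistent with a simple rule: before 3.12 every constant inherits the kind of the very first string, while in 3.12 each merged run of literals takes the kind of its own first part. Below is a minimal Python sketch of that rule as inferred from the examples. It is not the Rust `concatenate_strings` from this PR; `Literal`, `Expression` and `concatenate` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Literal:
    value: str
    kind: Optional[str] = None  # "u" for a u-prefixed string, else None


@dataclass
class Expression:
    source: str  # stands in for a FormattedValue node


Part = Union[Literal, Expression]


def concatenate(parts: list[Part], py312: bool = True) -> list[Part]:
    """Merge adjacent literals; the kind rule is inferred from the examples."""
    first_kind = parts[0].kind if parts and isinstance(parts[0], Literal) else None
    merged: list[Part] = []
    for part in parts:
        if isinstance(part, Literal) and merged and isinstance(merged[-1], Literal):
            # Continue the current run; its kind was fixed by its first literal.
            merged[-1].value += part.value
        elif isinstance(part, Literal):
            # Start a new run with a copy, so the input is left untouched.
            kind = part.kind if py312 else first_kind
            merged.append(Literal(part.value, kind))
        else:
            merged.append(part)
    return merged


# Example 3: u"foo" f"bar {baz} realy" u"bar" "no"
example = [
    Literal("foo", "u"),
    Literal("bar "),
    Expression("baz"),
    Literal(" realy"),
    Literal("bar", "u"),
    Literal("no"),
]
print(concatenate(example, py312=False))
print(concatenate(example, py312=True))
```

With `py312=False` both constants end up with kind `u`; with `py312=True` only `'foobar '` does, matching the two ASTs shown for example 3.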

### Errors

With the hand-written parser, we were able to provide better error
messages for cases such as the following. These messages are now
removed, and LALRPOP raises an "unexpected token" error in those cases
instead:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" error was removed; we
can create a lint rule for that instead.

The "The f-string expression cannot include the given character" error
was removed because f-strings now support those characters, mainly the
same quotes as the outer ones, escape sequences, and comments.
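
As a rough way to see which of these forms a given CPython accepts, the sketch below feeds a few f-strings to the built-in `compile`. The helper `is_accepted` and the sample strings are assumptions made for illustration; the results depend on the interpreter version and say nothing about this parser's own behaviour.

```python
def is_accepted(source: str) -> bool:
    """Return True if the running interpreter can compile `source`."""
    try:
        compile(source, "<demo>", "eval")
    except SyntaxError:
        return False
    return True


samples = {
    # Same quote character inside the expression as around the f-string.
    "same quotes": 'f"{"a" + "b"}"',
    # A backslash escape inside the expression part.
    "escape sequence": "f\"{'\\n'.join(['a', 'b'])}\"",
    # A comment inside a multi-line replacement field.
    "comment": 'f"""{\n    1  # a comment\n}"""',
}

for name, source in samples.items():
    print(f"{name}: {is_accepted(source)}")
```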

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (which no longer exists)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshots now contain the module node instead of just a
list of f-string parts. I've manually verified that the parts and the
node ranges are unchanged.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila linked a pull request Sep 14, 2023 that will close this issue
dhruvmanila removed a link to a pull request Sep 15, 2023
dhruvmanila added a commit that referenced this issue Sep 18, 2023
dhruvmanila added a commit that referenced this issue Sep 19, 2023
dhruvmanila added a commit that referenced this issue Sep 20, 2023
dhruvmanila added a commit that referenced this issue Sep 22, 2023
dhruvmanila added a commit that referenced this issue Sep 22, 2023
dhruvmanila added a commit that referenced this issue Sep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings`, but it now takes
the already parsed nodes, so it only needs to concatenate them into a
single node.

> A short primer on the `FStringMiddle` token: it covers the portion of
text inside the f-string that's not part of an expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are the contents of `FStringMiddle` tokens.
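
The primer above can be checked against CPython's own tokenizer, which exposes an equivalent `FSTRING_MIDDLE` token type from Python 3.12 onwards. This is only a sketch using the standard `tokenize` module, not Ruff's lexer; the helper name `fstring_middles` is made up for illustration, and on older interpreters it simply returns an empty list.

```python
import io
import token
import tokenize


def fstring_middles(source):
    """Return the text of every non-empty FSTRING_MIDDLE token in `source`."""
    # FSTRING_MIDDLE only exists on Python 3.12+, where PEP 701 is implemented.
    middle = getattr(token, "FSTRING_MIDDLE", None)
    if middle is None:
        return []
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        # Skip any empty middle tokens so only real text is reported.
        if tok.type == middle and tok.string
    ]


print(fstring_middles('f"foo {bar:.3f{x}} bar"'))
```

On Python 3.12 or later this should list the same three pieces named in the primer.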

Discussion in the official implementation:
python/cpython#102855 (comment)

The AST changes when unicode strings (prefixed with `u`) and f-strings
are used together in an implicitly concatenated string. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Before Python 3.12, the `kind` field was assigned only if the prefix was
on the first string. So, in the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, from Python 3.12 onwards, only the string with the `u` prefix is
assigned the `u` kind:

<details><summary>3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>
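
To see which behaviour a given interpreter has, the `kind` of each string part can be read back with the standard `ast` module. This is a small sketch against CPython's AST rather than Ruff's parser, and `constant_kinds` is a hypothetical helper name.

```python
import ast


def constant_kinds(source):
    """Return (value, kind) for each string Constant in an f-string expression."""
    expr = ast.parse(source, mode="eval").body
    if not isinstance(expr, ast.JoinedStr):
        raise ValueError("expected an f-string expression")
    return [
        (node.value, node.kind)
        for node in expr.values
        if isinstance(node, ast.Constant)
    ]


# The trailing part should report kind 'u' before 3.12 and None from 3.12 on.
for value, kind in constant_kinds('u"foo" f"{bar}" "baz" " some"'):
    print(repr(value), kind)
```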

Here are some more variations of the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand-written parser, we were able to provide better error
messages for cases such as the following. These messages are now
removed, and in those cases an "unexpected token" error will be
thrown by LALRPOP instead:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed
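
For reference, one input per case listed above is sketched here and run through CPython's parser via `ast.parse`. The inputs and the `is_syntax_error` helper are illustrative assumptions; the exact message differs between Python versions and from what Ruff reports, so the sketch only checks that each input is rejected.

```python
import ast

# One malformed f-string per error case (illustrative inputs).
MALFORMED = {
    "closing delimiter was not opened": 'f"a}"',
    "opening delimiter was not closed": 'f"{a"',
    "empty expression": 'f"{}"',
}


def is_syntax_error(source):
    """Return True if CPython refuses to parse `source`."""
    try:
        ast.parse(source)
    except SyntaxError:
        return True
    return False


for label, source in MALFORMED.items():
    print(label, "->", is_syntax_error(source))
```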

The "Too many nested expressions in an f-string" error was removed; a
lint rule can be created for that instead.

The "The f-string expression cannot include the given character" error
was also removed because f-strings now support those characters, mainly
the same quotes as the outer ones, escape sequences, comments, etc.
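
The newly accepted characters can be tried out with a few sample f-strings. The sketch below uses CPython's `ast.parse`, so the samples are expected to parse only on Python 3.12 or later; the sample strings and the `parses` helper are assumptions made for illustration.

```python
import ast

# Each sample relies on something PEP 701 newly allows inside `{...}`.
SAMPLES = {
    "same quotes as the outer string": 'f"{d["key"]}"',
    "backslash inside the expression": r'f"{"\n".join(items)}"',
    "comment inside the expression": 'f"""{\n    x  # a comment\n}"""',
}


def parses(source):
    """Return True if the running interpreter accepts `source`."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


for label, source in SAMPLES.items():
    print(label, "->", parses(source))
```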

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (which no longer exists)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshots now contain the module node instead of just a
list of f-string parts. I've manually verified that the parts and the
node ranges are still the same.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this issue Sep 26, 2023
dhruvmanila added a commit that referenced this issue Sep 27, 2023
dhruvmanila added a commit that referenced this issue Sep 28, 2023
dhruvmanila added a commit that referenced this issue Sep 29, 2023
dhruvmanila added a commit that referenced this issue Sep 29, 2023