Add support for PEP 701 in the lexer #7042

Closed
Tracked by #6502
dhruvmanila opened this issue Sep 1, 2023 · 0 comments · Fixed by #6659, #7378 or #7331
Labels: parser (Related to the parser), python312 (Related to Python 3.12)

dhruvmanila commented Sep 1, 2023

The task is to update the lexer to emit the new tokens for PEP 701: `FStringStart`, `FStringMiddle`, and `FStringEnd`. In addition, a new `Exclamation` token needs to be added for the conversion flag (as in `f"{foo!s}"`), because it is now part of the expression.

Some of the error handling that was previously done in the parser will need to be moved into the lexer.

@dhruvmanila dhruvmanila changed the title Update the lexer to emit new tokens (, , ) Add support for PEP 701 in the lexer Sep 1, 2023
@dhruvmanila dhruvmanila self-assigned this Sep 1, 2023
@dhruvmanila dhruvmanila linked a pull request Sep 1, 2023 that will close this issue
@dhruvmanila dhruvmanila added parser Related to the parser python312 Related to Python 3.12 labels Sep 1, 2023
dhruvmanila added a commit that referenced this issue Sep 14, 2023
## Summary

This PR adds support in the lexer for the newly added f-string tokens as
per PEP 701. The following new tokens are added:
* `FStringStart`: Token value for the start of an f-string. This
includes the `f`/`F`/`fr` prefix and the opening quote(s).
* `FStringMiddle`: Token value that includes the portion of text inside
the f-string that's not part of the expression part and isn't an opening
or closing brace.
* `FStringEnd`: Token value for the end of an f-string. This includes
the closing quote.

Additionally, a new `Exclamation` token is added for conversion
(`f"{foo!s}"`) as that's part of an expression.
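
To make the token boundaries described above concrete, here is a minimal
Python sketch of a toy lexer for a single f-string literal. It is not the
lexer from this PR: the function name `lex_fstring` and the `Expression`
token kind are hypothetical, the expression text is kept as one opaque
chunk instead of being tokenized, and nested string literals or nested
replacement fields inside a format spec are assumed not to occur.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, source text)


def lex_fstring(src: str) -> List[Token]:
    """Split a single f-string literal into PEP 701-style tokens (toy version)."""
    # Everything before the first quote character is the prefix (f, F, fr, ...).
    start = 0
    while start < len(src) and src[start] not in "\"'":
        start += 1
    prefix = src[:start]
    if start == len(src) or "f" not in prefix.lower():
        raise ValueError("not an f-string literal")
    quote = src[start] * 3 if src.startswith(src[start] * 3, start) else src[start]
    tokens: List[Token] = [("FStringStart", prefix + quote)]

    i = start + len(quote)
    literal = ""      # pending FStringMiddle text
    expr = ""         # pending expression text (kept as one opaque chunk)
    mode = "literal"  # "literal", "expr" or "spec"
    depth = 0         # bracket nesting inside the current expression

    while i < len(src):
        ch = src[i]
        if mode == "expr":
            if ch in "([{":
                depth += 1
            elif ch in ")]}" and depth > 0:
                depth -= 1
            elif depth == 0 and ch in "!:}" and src[i:i + 2] != "!=":
                if expr.strip():
                    tokens.append(("Expression", expr.strip()))
                expr = ""
                kind = {"!": "Exclamation", ":": "Colon", "}": "Rbrace"}[ch]
                tokens.append((kind, ch))
                # A top-level ":" starts the format spec; "}" ends the field.
                mode = {"!": "expr", ":": "spec", "}": "literal"}[ch]
                i += 1
                continue
            expr += ch
            i += 1
            continue

        # Literal text: either the f-string body or a format spec.
        if mode == "literal" and src.startswith(quote, i):
            if literal:
                tokens.append(("FStringMiddle", literal))
            tokens.append(("FStringEnd", quote))
            return tokens
        if mode == "literal" and src[i:i + 2] in ("{{", "}}"):
            literal += ch  # escaped brace: keep a single brace in this sketch
            i += 2
            continue
        if ch == "{" and mode == "spec":
            raise ValueError("nested fields in a format spec are not handled here")
        if ch == "}" and mode == "literal":
            raise ValueError("single '}' is not allowed in an f-string")
        if ch in "{}":
            if literal:
                tokens.append(("FStringMiddle", literal))
            literal = ""
            tokens.append(("Lbrace" if ch == "{" else "Rbrace", ch))
            mode = "expr" if ch == "{" else "literal"
            i += 1
            continue
        literal += ch
        i += 1

    raise ValueError("unterminated f-string")
```

Calling `lex_fstring('f"hello {name!r:>10} world"')` yields the start token,
a middle token for `hello `, the replacement field split at `!` and `:`, a
middle token for the format spec `>10`, a middle token for ` world`, and the
end token. Treating a top-level `:` as the start of a format spec is the
same decision the `:` handling in the real lexer has to make.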

## Test Plan

New test cases are added for various possibilities using snapshot
testing. The output has been verified using python/cpython@f2cc00527e.

## Benchmarks

_I've put the number of f-strings for each of the following files after
the file name_

```
lexer/large/dataset.py (1)       1.05   612.6±91.60µs    66.4 MB/sec    1.00   584.7±33.72µs    69.6 MB/sec
lexer/numpy/ctypeslib.py (0)     1.01    131.8±3.31µs   126.3 MB/sec    1.00    130.9±5.37µs   127.2 MB/sec
lexer/numpy/globals.py (1)       1.02     13.2±0.43µs   222.7 MB/sec    1.00     13.0±0.41µs   226.8 MB/sec
lexer/pydantic/types.py (8)      1.13   285.0±11.72µs    89.5 MB/sec    1.00   252.9±10.13µs   100.8 MB/sec
lexer/unicode/pypinyin.py (0)    1.03     32.9±1.92µs   127.5 MB/sec    1.00     31.8±1.25µs   132.0 MB/sec
```
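
For readers who want to produce similar per-file f-string counts, the
following is one possible approach using Python's `ast` module. It is an
assumption, not necessarily how the numbers above were obtained; the helper
name `count_fstrings` is hypothetical, and implicitly concatenated
f-strings are counted as a single f-string because they form one
`JoinedStr` node.

```python
import ast


def count_fstrings(source: str) -> int:
    """Count f-string expressions in Python source, ignoring format specs."""
    tree = ast.parse(source)
    # A format spec such as ">10" in f"{x:>10}" is itself stored as a JoinedStr
    # node, so collect those nodes first and exclude them from the count.
    spec_ids = {
        id(node.format_spec)
        for node in ast.walk(tree)
        if isinstance(node, ast.FormattedValue) and node.format_spec is not None
    }
    return sum(
        1
        for node in ast.walk(tree)
        if isinstance(node, ast.JoinedStr) and id(node) not in spec_ids
    )
```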

It seems that the lexer has regressed overall. I profiled every file
listed above and found one improvement, which is made in
(098ee5d). Otherwise, I don't see
anything else. A few notes from isolating the f-string part of the
profile:
* As we're adding new tokens and the functionality to emit them, I
expect the lexer to take more time because of the additional code.
* `lex_fstring_middle_or_end` takes the most time, followed by the
`current_mut` line when lexing the `:` token. The latter checks whether
we're at the start of a format spec.
* In an f-string-heavy file such as
https://github.com/python/cpython/blob/main/Lib/test/test_fstring.py
[^1] (293), most of the time in `lex_fstring_middle_or_end` is accounted
for by string allocation for the string literal part of the
`FStringMiddle` token (https://share.firefox.dev/3ErEa1W).

I don't see anything out of the ordinary in the `pydantic/types` profile
(https://share.firefox.dev/45XcLRq).

fixes: #7042

[^1]: We could add this file to the lexer and parser benchmarks.
@dhruvmanila dhruvmanila linked a pull request Sep 14, 2023 that will close this issue
@dhruvmanila dhruvmanila removed a link to a pull request Sep 15, 2023
dhruvmanila added a commit that referenced this issue Sep 18, 2023
dhruvmanila added a commit that referenced this issue Sep 19, 2023
dhruvmanila added a commit that referenced this issue Sep 20, 2023
dhruvmanila added a commit that referenced this issue Sep 22, 2023
dhruvmanila added a commit that referenced this issue Sep 22, 2023
dhruvmanila added a commit that referenced this issue Sep 26, 2023
dhruvmanila added a commit that referenced this issue Sep 27, 2023
dhruvmanila added a commit that referenced this issue Sep 28, 2023
dhruvmanila added a commit that referenced this issue Sep 29, 2023
dhruvmanila added a commit that referenced this issue Sep 29, 2023