Add support for the new f-string tokens per PEP 701 #6659

Merged (9 commits into dhruv/pep-701 on Sep 14, 2023)

Conversation

@dhruvmanila (Member) commented on Aug 17, 2023

Summary

This PR adds support in the lexer for the newly added f-string tokens as per PEP 701. The following new tokens are added:

  • FStringStart: Token value for the start of an f-string. This includes the f/F/fr prefix and the opening quote(s).
  • FStringMiddle: Token value that includes the portion of text inside the f-string that's not part of the expression part and isn't an opening or closing brace.
  • FStringEnd: Token value for the end of an f-string. This includes the closing quote.

Additionally, a new Exclamation token is added for conversion (f"{foo!s}") as that's part of an expression.
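
To make the new token scheme concrete, below is a minimal sketch that lexes a single f-string and prints the token stream. It assumes the crate's public lexer::lex entry point and the Mode re-export; the expected tokens in the comments follow the scheme described above, and the non-f-string token spellings (Lbrace, Name, Colon, ...) are the lexer's existing ones.

```rust
use ruff_python_parser::{lexer::lex, Mode};

fn main() {
    // With this PR, `f"hello {name!s:>8}"` should lex roughly as:
    //   FStringStart    f"
    //   FStringMiddle   "hello "
    //   Lbrace          {
    //   Name            name
    //   Exclamation     !
    //   Name            s
    //   Colon           :
    //   FStringMiddle   ">8"        (the format spec text)
    //   Rbrace          }
    //   FStringEnd      "
    for result in lex(r#"f"hello {name!s:>8}""#, Mode::Module) {
        match result {
            Ok((tok, range)) => println!("{range:?} {tok:?}"),
            Err(err) => eprintln!("lex error: {err:?}"),
        }
    }
}
```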

Test Plan

New test cases are added for various possibilities using snapshot testing. The output has been verified using python/cpython@f2cc00527e.

Benchmarks

I've put the number of f-strings in each of the following files after the file name:

```
lexer/large/dataset.py (1)       1.05   612.6±91.60µs    66.4 MB/sec    1.00   584.7±33.72µs    69.6 MB/sec
lexer/numpy/ctypeslib.py (0)     1.01    131.8±3.31µs   126.3 MB/sec    1.00    130.9±5.37µs   127.2 MB/sec
lexer/numpy/globals.py (1)       1.02     13.2±0.43µs   222.7 MB/sec    1.00     13.0±0.41µs   226.8 MB/sec
lexer/pydantic/types.py (8)      1.13   285.0±11.72µs    89.5 MB/sec    1.00   252.9±10.13µs   100.8 MB/sec
lexer/unicode/pypinyin.py (0)    1.03     32.9±1.92µs   127.5 MB/sec    1.00     31.8±1.25µs   132.0 MB/sec
```

It seems that overall the lexer has regressed. I profiled every file mentioned above and found one improvement, which is done in 098ee5d; otherwise I don't see anything else obvious. A few notes from isolating the f-string parts of the profile:

  • As we're adding new tokens and functionality to emit them, I expect the lexer to take more time because of more code.
  • The lex_fstring_middle_or_end function takes the most time, followed by the current_mut line when lexing the : token; the latter checks whether we're at the start of a format spec.
  • In an f-string-heavy file such as https://github.com/python/cpython/blob/main/Lib/test/test_fstring.py [1] (293 f-strings), most of the time in lex_fstring_middle_or_end is accounted for by the string allocation for the string-literal part of the FStringMiddle token (https://share.firefox.dev/3ErEa1W).

I don't see anything out of the ordinary in the pydantic/types profile (https://share.firefox.dev/45XcLRq).
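
To make the allocation note above concrete, here's a tiny standalone sketch: the FStringMiddle variant shape mirrors this PR's token (see the diff excerpt further down), while everything else is illustrative.

```rust
// Each FStringMiddle owns its text as a String, so an f-string-heavy file
// pays one heap allocation per literal/format-spec segment the lexer emits.
#[derive(Debug)]
enum Tok {
    FStringMiddle { value: String, is_raw: bool },
}

fn main() {
    let source = r#"f"hello {name}""#;
    // The token type owns its text, so the segment is copied out of the
    // source rather than borrowed:
    let middle = Tok::FStringMiddle {
        value: source[2..8].to_string(), // "hello "
        is_raw: false,
    };
    println!("{middle:?}");
}
```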

fixes: #7042

Footnotes

  1. We could add this file to the lexer and parser benchmarks.

@MichaReiser added the parser (Related to the parser) label on Aug 18, 2023

@dhruvmanila (Member, Author):

@MichaReiser Thanks for your initial review even though the PR is still in draft :)

Base automatically changed from dhruv/lexer-tests to main August 22, 2023 18:23
Comment on lines 661 to 714:

```rust
if kind.is_any_fstring() {
    return Ok(self.lex_fstring_start(quote, triple_quoted, kind.is_raw()));
}
```

@dhruvmanila (Member, Author):

I'm thinking of removing StringKind::FString and StringKind::RawFString, as they're an invalid representation now that we'll be emitting different tokens for f-strings. Instead, I'm thinking of updating lex_identifier to directly check for f (and the related F, r, R) and call lex_fstring_start directly.

@dhruvmanila (Member, Author):

This will be done after the linter changes are complete.

@dhruvmanila marked this pull request as ready for review on August 23, 2023 05:39

@dhruvmanila (Member, Author):

(This is ready for review, but not to be merged)

@MichaReiser (Member) left a comment:

This is impressive work. I left a few follow-up questions, and I want to hold off approving this PR until we have the first benchmarks in.

```
@@ -44,6 +44,14 @@ pub enum Tok {
        /// Whether the string is triple quoted.
        triple_quoted: bool,
    },
    /// Token value for the start of an f-string. This includes the `f`/`F`/`fr` prefix
    /// and the opening quote(s).
    FStringStart,
```

@MichaReiser (Member):

We'll need to go through all logical line rules to make sure they handle f-strings correctly

Comment on lines 1333 to 1433:

```rust
parentheses_count: u32,
format_spec_count: u32,
```

@MichaReiser (Member):

Can we document these fields and why tracking them in the state is necessary? E.g., why do we need to track both parentheses_count and format_spec_count (should it be format_spec_depth?)? What are the different possible states?

@dhruvmanila (Member, Author):

There are 3 reasons to track these fields:

  1. To check if we're in an f-string expression: f"{<here>}"
  2. To check if we're in a format spec within an f-string expression: f"{foo:<here>}"
  3. To check whether the : starts a format spec or is part of, for example, a named expression (:=):
    • For f"{x:=1}", the colon indicates the start of the format spec.
    • For f"{(x:=1)}", the colon is part of the := named expression because it is not at the same level of parentheses as the opening brace. Another example is f"{x,{y:=1}}", where the colon is part of the := named expression.

From PEP 701, "How to produce these new tokens":

  1. [..] This mode tokenizes as the “Regular Python tokenization” until a : or a } character is encountered with the same level of nesting as the opening bracket token that was pushed when we enter the f-string part. [..]

With this explanation, do you think _depth is a better suffix?
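
To make the disambiguation concrete, here's a standalone toy model (the field names come from the diff above; the counting details are simplified and not the PR's actual code):

```rust
// Toy model: `parentheses_count` tracks how many (, [ or { are currently
// open *inside* the replacement field; `format_spec_count` tracks how many
// nested format specs we're in.
struct FStringContext {
    parentheses_count: u32,
    format_spec_count: u32,
}

impl FStringContext {
    /// A `:` starts a format spec only at the same nesting level as the
    /// `{` that opened the replacement field, i.e. with no brackets open.
    /// `f"{x:=1}"`: 0 open brackets, so the colon starts the format spec.
    /// `f"{(x:=1)}"`: 1 open paren, so the colon belongs to the walrus.
    fn colon_starts_format_spec(&self) -> bool {
        self.parentheses_count == 0
    }
}

fn main() {
    let top_level = FStringContext { parentheses_count: 0, format_spec_count: 0 };
    let in_parens = FStringContext { parentheses_count: 1, format_spec_count: 0 };
    assert!(top_level.colon_starts_format_spec());
    assert!(!in_parens.colon_starts_format_spec());
    // On a real format-spec start, `format_spec_count` would be bumped:
    println!("specs entered: {}", top_level.format_spec_count + 1);
}
```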

```
@@ -579,6 +718,19 @@ impl<'source> Lexer<'source> {
    // This is the main entry point. Call this function to retrieve the next token.
    // This function is used by the iterator implementation.
    pub fn next_token(&mut self) -> LexResult {
        if let Some(fstring_context) = self.fstring_stack.last() {
```

@MichaReiser (Member):

I dislike adding a stack lookup to the hot path, but I don't see how we could avoid it :(

@dhruvmanila (Member, Author):

Maybe we could utilize State? The problem is that it would get updated while inside an f-string expression, as it's not persistent.

@zanieb added the python312 (Related to Python 3.12) label on Aug 24, 2023

Comment on lines 732 to 743:

```rust
if !fstring_context.is_in_expression()
    // Avoid lexing f-string middle/end if we're sure that this is
    // the start of a f-string expression i.e., `f"{foo}"` and not
    // `f"{{foo}}"`.
    && (self.cursor.first() != '{' || self.cursor.second() == '{')
    // Avoid lexing f-string middle/end if we're sure that this is
    // the end of a f-string expression. This is only for when
    // the `}` is after the format specifier i.e., `f"{foo:.3f}"`
    // because the `.3f` is lexed as `FStringMiddle` and thus not
    // in a f-string expression.
    && (!fstring_context.is_in_format_spec() || self.cursor.first() != '}')
{
```

@dhruvmanila (Member, Author):

I don't really like this that much; I think we're sacrificing readability a bit. Without this, the lex_fstring_middle_or_end function would return Result<Option<Tok>, ...> and this condition would be checked at the start of the function's loop instead.

@dhruvmanila (Member, Author):

I've reverted this change although I'm open to suggestions.

@dhruvmanila force-pushed the dhruv/fstring-tokens branch 5 times, most recently from 2a75ece to 62e052d, on August 29, 2023 02:46

Comment on lines +178 to +183:

```rust
('f' | 'F', quote @ ('\'' | '"')) => {
    self.cursor.bump();
    return Ok(self.lex_fstring_start(quote, false));
}
('r' | 'R', 'f' | 'F') | ('f' | 'F', 'r' | 'R') if is_quote(self.cursor.second()) => {
    self.cursor.bump();
    let quote = self.cursor.bump().unwrap();
    return Ok(self.lex_fstring_start(quote, true));
}
```

@dhruvmanila (Member, Author):

This separation is because the StringKind::FString and StringKind::RawFString variants will be removed after all the linter changes.

```rust
};
Ok(Some(Tok::FStringMiddle {
    value,
    is_raw: context.is_raw_string,
```

@dhruvmanila (Member, Author):

This is required for the parser changes.

Comment on lines 716 to 704:

```rust
// When we are in an f-string, check whether the initial quote
// matches the f-string's quote; if it does, this must be a
// missing '}' token, so raise the proper error.
```

@dhruvmanila changed the base branch from main to dhruv/eat-char2 on August 29, 2023 04:53
Base automatically changed from dhruv/eat-char2 to main on August 29, 2023 11:48
@dhruvmanila changed the base branch from main to dhruv/pep-701 on September 14, 2023 01:37
@dhruvmanila merged commit 9820c04 into dhruv/pep-701 on Sep 14, 2023
10 of 15 checks passed
@dhruvmanila deleted the dhruv/fstring-tokens branch on September 14, 2023 01:46
dhruvmanila added a commit that referenced this pull request on Sep 18, 2023
dhruvmanila added a commit that referenced this pull request on Sep 19, 2023
dhruvmanila added a commit that referenced this pull request on Sep 20, 2023
dhruvmanila added a commit that referenced this pull request on Sep 22, 2023
dhruvmanila added a commit that referenced this pull request on Sep 22, 2023
dhruvmanila added a commit that referenced this pull request on Sep 26, 2023
dhruvmanila added a commit that referenced this pull request on Sep 27, 2023
dhruvmanila added a commit that referenced this pull request on Sep 28, 2023
dhruvmanila added a commit that referenced this pull request on Sep 29, 2023
dhruvmanila added a commit that referenced this pull request on Sep 29, 2023

Labels: parser (Related to the parser), python312 (Related to Python 3.12)

Successfully merging this pull request may close these issues: Add support for PEP 701 in the lexer

4 participants