Skip to content

Commit

Permalink
Add support for the new f-string tokens per PEP 701 (#6659)
Browse files Browse the repository at this point in the history
## Summary

This PR adds support in the lexer for the newly added f-string tokens as
per PEP 701. The following new tokens are added:
* `FStringStart`: Token value for the start of an f-string. This
includes the `f`/`F`/`fr` prefix and the opening quote(s).
* `FStringMiddle`: Token value that includes the portion of text inside
the f-string that's not part of the expression part and isn't an opening
or closing brace.
* `FStringEnd`: Token value for the end of an f-string. This includes
the closing quote.

Additionally, a new `Exclamation` token is added for conversion
(`f"{foo!s}"`) as that's part of an expression.

## Test Plan

New test cases are added to for various possibilities using snapshot
testing. The output has been verified using python/cpython@f2cc00527e.

## Benchmarks

_I've put the number of f-strings for each of the following files after
the file name_

```
lexer/large/dataset.py (1)       1.05   612.6±91.60µs    66.4 MB/sec    1.00   584.7±33.72µs    69.6 MB/sec
lexer/numpy/ctypeslib.py (0)     1.01    131.8±3.31µs   126.3 MB/sec    1.00    130.9±5.37µs   127.2 MB/sec
lexer/numpy/globals.py (1)       1.02     13.2±0.43µs   222.7 MB/sec    1.00     13.0±0.41µs   226.8 MB/sec
lexer/pydantic/types.py (8)      1.13   285.0±11.72µs    89.5 MB/sec    1.00   252.9±10.13µs   100.8 MB/sec
lexer/unicode/pypinyin.py (0)    1.03     32.9±1.92µs   127.5 MB/sec    1.00     31.8±1.25µs   132.0 MB/sec
```

It seems that overall the lexer has regressed. I profiled every file
mentioned above and I saw one improvement which is done in
(098ee5d). But otherwise I don't see
anything else. A few notes by isolating the f-string part in the
profile:
* As we're adding new tokens and functionality to emit them, I expect
the lexer to take more time because of more code.
* The `lex_fstring_middle_or_end` takes the most amount of time followed
by the `current_mut` line when lexing the `:` token. The latter is to
check if we're at the start of a format spec or not.
* In a f-string heavy file such as
https://github.com/python/cpython/blob/main/Lib/test/test_fstring.py
[^1] (293), most of the time in `lex_fstring_middle_or_end` is accounted
by string allocation for the string literal part of `FStringMiddle`
token (https://share.firefox.dev/3ErEa1W)

I don't see anything out of ordinary for `pydantic/types` profile
(https://share.firefox.dev/45XcLRq)

fixes: #7042

[^1]: We could add this in lexer and parser benchmark
  • Loading branch information
dhruvmanila committed Sep 22, 2023
1 parent 82978ac commit a1a08e6
Show file tree
Hide file tree
Showing 24 changed files with 2,317 additions and 11 deletions.
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions crates/ruff_python_parser/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ ruff_python_ast = { path = "../ruff_python_ast" }
ruff_text_size = { path = "../ruff_text_size" }

anyhow = { workspace = true }
bitflags = { workspace = true }
is-macro = { workspace = true }
itertools = { workspace = true }
lalrpop-util = { version = "0.20.0", default-features = false }
Expand Down
422 changes: 412 additions & 10 deletions crates/ruff_python_parser/src/lexer.rs

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions crates/ruff_python_parser/src/lexer/cursor.rs
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,18 @@ impl<'a> Cursor<'a> {
}
}

pub(super) fn eat_char3(&mut self, c1: char, c2: char, c3: char) -> bool {
let mut chars = self.chars.clone();
if chars.next() == Some(c1) && chars.next() == Some(c2) && chars.next() == Some(c3) {
self.bump();
self.bump();
self.bump();
true
} else {
false
}
}

pub(super) fn eat_if<F>(&mut self, mut predicate: F) -> Option<char>
where
F: FnMut(char) -> bool,
Expand Down
158 changes: 158 additions & 0 deletions crates/ruff_python_parser/src/lexer/fstring.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,158 @@
use bitflags::bitflags;

use ruff_text_size::TextSize;

bitflags! {
#[derive(Debug)]
pub(crate) struct FStringContextFlags: u8 {
/// The current f-string is a triple-quoted f-string i.e., the number of
/// opening quotes is 3. If this flag is not set, the number of opening
/// quotes is 1.
const TRIPLE = 1 << 0;

/// The current f-string is a double-quoted f-string. If this flag is not
/// set, the current f-string is a single-quoted f-string.
const DOUBLE = 1 << 1;

/// The current f-string is a raw f-string i.e., prefixed with `r`/`R`.
/// If this flag is not set, the current f-string is a normal f-string.
const RAW = 1 << 2;
}
}

/// The context representing the current f-string that the lexer is in.
#[derive(Debug)]
pub(crate) struct FStringContext {
flags: FStringContextFlags,

/// The level of nesting for the lexer when it entered the current f-string.
/// The nesting level includes all kinds of parentheses i.e., round, square,
/// and curly.
nesting: u32,

/// The current depth of format spec for the current f-string. This is because
/// there can be multiple format specs nested for the same f-string.
/// For example, `{a:{b:{c}}}` has 3 format specs.
format_spec_depth: u32,
}

impl FStringContext {
pub(crate) const fn new(flags: FStringContextFlags, nesting: u32) -> Self {
Self {
flags,
nesting,
format_spec_depth: 0,
}
}

pub(crate) const fn nesting(&self) -> u32 {
self.nesting
}

/// Returns the quote character for the current f-string.
pub(crate) const fn quote_char(&self) -> char {
if self.flags.contains(FStringContextFlags::DOUBLE) {
'"'
} else {
'\''
}
}

/// Returns the number of quotes for the current f-string.
pub(crate) const fn quote_size(&self) -> TextSize {
if self.is_triple_quoted() {
TextSize::new(3)
} else {
TextSize::new(1)
}
}

/// Returns the triple quotes for the current f-string if it is a triple-quoted
/// f-string, `None` otherwise.
pub(crate) const fn triple_quotes(&self) -> Option<&'static str> {
if self.is_triple_quoted() {
if self.flags.contains(FStringContextFlags::DOUBLE) {
Some(r#"""""#)
} else {
Some("'''")
}
} else {
None
}
}

/// Returns `true` if the current f-string is a raw f-string.
pub(crate) const fn is_raw_string(&self) -> bool {
self.flags.contains(FStringContextFlags::RAW)
}

/// Returns `true` if the current f-string is a triple-quoted f-string.
pub(crate) const fn is_triple_quoted(&self) -> bool {
self.flags.contains(FStringContextFlags::TRIPLE)
}

/// Calculates the number of open parentheses for the current f-string
/// based on the current level of nesting for the lexer.
const fn open_parentheses_count(&self, current_nesting: u32) -> u32 {
current_nesting.saturating_sub(self.nesting)
}

/// Returns `true` if the lexer is in a f-string expression i.e., between
/// two curly braces.
pub(crate) const fn is_in_expression(&self, current_nesting: u32) -> bool {
self.open_parentheses_count(current_nesting) > self.format_spec_depth
}

/// Returns `true` if the lexer is in a f-string format spec i.e., after a colon.
pub(crate) const fn is_in_format_spec(&self, current_nesting: u32) -> bool {
self.format_spec_depth > 0 && !self.is_in_expression(current_nesting)
}

/// Returns `true` if the context is in a valid position to start format spec
/// i.e., at the same level of nesting as the opening parentheses token.
/// Increments the format spec depth if it is.
///
/// This assumes that the current character for the lexer is a colon (`:`).
pub(crate) fn try_start_format_spec(&mut self, current_nesting: u32) -> bool {
if self
.open_parentheses_count(current_nesting)
.saturating_sub(self.format_spec_depth)
== 1
{
self.format_spec_depth += 1;
true
} else {
false
}
}

/// Decrements the format spec depth unconditionally.
pub(crate) fn end_format_spec(&mut self) {
self.format_spec_depth = self.format_spec_depth.saturating_sub(1);
}
}

/// The f-strings stack is used to keep track of all the f-strings that the
/// lexer encounters. This is necessary because f-strings can be nested.
#[derive(Debug, Default)]
pub(crate) struct FStrings {
stack: Vec<FStringContext>,
}

impl FStrings {
pub(crate) fn push(&mut self, context: FStringContext) {
self.stack.push(context);
}

pub(crate) fn pop(&mut self) -> Option<FStringContext> {
self.stack.pop()
}

pub(crate) fn current(&self) -> Option<&FStringContext> {
self.stack.last()
}

pub(crate) fn current_mut(&mut self) -> Option<&mut FStringContext> {
self.stack.last_mut()
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
---
source: crates/ruff_python_parser/src/lexer.rs
expression: lex_source(source)
---
[
(
FStringStart,
0..2,
),
(
FStringEnd,
2..3,
),
(
String {
value: "",
kind: String,
triple_quoted: false,
},
4..6,
),
(
FStringStart,
7..9,
),
(
FStringEnd,
9..10,
),
(
FStringStart,
11..13,
),
(
FStringEnd,
13..14,
),
(
String {
value: "",
kind: String,
triple_quoted: false,
},
15..17,
),
(
FStringStart,
18..22,
),
(
FStringEnd,
22..25,
),
(
FStringStart,
26..30,
),
(
FStringEnd,
30..33,
),
(
Newline,
33..33,
),
]
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
---
source: crates/ruff_python_parser/src/lexer.rs
expression: lex_source(source)
---
[
(
FStringStart,
0..2,
),
(
FStringMiddle {
value: "normal ",
is_raw: false,
},
2..9,
),
(
Lbrace,
9..10,
),
(
Name {
name: "foo",
},
10..13,
),
(
Rbrace,
13..14,
),
(
FStringMiddle {
value: " {another} ",
is_raw: false,
},
14..27,
),
(
Lbrace,
27..28,
),
(
Name {
name: "bar",
},
28..31,
),
(
Rbrace,
31..32,
),
(
FStringMiddle {
value: " {",
is_raw: false,
},
32..35,
),
(
Lbrace,
35..36,
),
(
Name {
name: "three",
},
36..41,
),
(
Rbrace,
41..42,
),
(
FStringMiddle {
value: "}",
is_raw: false,
},
42..44,
),
(
FStringEnd,
44..45,
),
(
Newline,
45..45,
),
]

0 comments on commit a1a08e6

Please sign in to comment.