Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start tracking quoting style in the AST #10298

Merged
merged 3 commits into from Mar 8, 2024

Conversation

AlexWaygood
Copy link
Member

@AlexWaygood AlexWaygood commented Mar 8, 2024

Summary

This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information:

  • The quoting style used (double or single quotes)
  • Whether the string is triple-quoted or not
  • Whether the string is raw or not

This PR is a followup to #10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the ExprGenerator in a followup PR, which will allow us to provide a fix for #7799.

The PR should be easiest to review commit-by-commit:

  1. The first commit makes the necessary modifications for tracking this information in ast::StringLiteral nodes
  2. The second commit does the same for ast::BytesLiteral nodes
  3. The third commit does the same for ast::FString nodes

The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on Tok::String, Tok::FStringStart and Tok::FStringMiddle tokens in #10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node.

Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a u prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.

Test Plan

cargo test

@AlexWaygood AlexWaygood added the internal An internal refactor or improvement label Mar 8, 2024
@AlexWaygood
Copy link
Member Author

Performance is neutral on all benchmarks according to Codspeed.

Copy link
Contributor

github-actions bot commented Mar 8, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. It would be nice to use the new string flags in the formatter to avoid parsing them from source. But that's nothing for this PR.

pub(crate) fn parse(input: &str) -> StringPrefix {
let chars = input.chars();
let mut prefix = StringPrefix::empty();
for c in chars {
let flag = match c {
'u' | 'U' => StringPrefix::UNICODE,
'f' | 'F' => StringPrefix::F_STRING,
'b' | 'B' => StringPrefix::BYTE,
'r' => StringPrefix::RAW,
'R' => StringPrefix::RAW_UPPER,
'\'' | '"' => break,
c => {
unreachable!(
"Unexpected character '{c}' terminating the prefix of a string literal"
);
}
};
prefix |= flag;
}
prefix
}

I posted a solution that avoids the need to change FStringStart. I would prefer if we keep it as is to reduce the diff size.

crates/ruff_python_ast/src/nodes.rs Outdated Show resolved Hide resolved
crates/ruff_python_ast/src/nodes.rs Outdated Show resolved Hide resolved
crates/ruff_python_ast/src/nodes.rs Outdated Show resolved Hide resolved
}

/// Does the string have a `u` or `U` prefix?
pub const fn is_u_string(&self) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit,

Suggested change
pub const fn is_u_string(&self) -> bool {
pub const fn is_unicode(&self) -> bool {

Potentially without the string postfix because that's given by the enclosing type. E.g it's also is_triple_quoted and not is_triple_quoted_string. E.g let's compare some call-sites: string_literal.is_unicode_string() seems repetivie in comparison to string_literal.is_unicode()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same response as #10298 (comment)

crates/ruff_python_ast/src/node.rs Outdated Show resolved Hide resolved
Comment on lines +105 to 112
impl From<ruff_python_ast::str::QuoteStyle> for Quote {
fn from(value: ruff_python_ast::str::QuoteStyle) -> Self {
match value {
ruff_python_parser::QuoteStyle::Double => Self::Double,
ruff_python_parser::QuoteStyle::Single => Self::Single,
ruff_python_ast::str::QuoteStyle::Double => Self::Double,
ruff_python_ast::str::QuoteStyle::Single => Self::Single,
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace the Quote type with QuoteStyle? I would prefer to minimize the number of Quotelike types we have.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do, but it would require making the ruff_python_literal crate have a dependency on the ruff_python_ast crate (or vice versa), due to this:

impl From<Quote> for StrQuote {
fn from(val: Quote) -> Self {
match val {
Quote::Single => StrQuote::Single,
Quote::Double => StrQuote::Double,
}
}
}

If we used ruff_python_ast::str::QuoteStyle there instead of ruff_python_codegen::stylist::Quote, we wouldn't be able to have the implementation of that trait in the ruff_python_codegen crate, since the QuoteStyle enum comes from the ruff_python_literal crate. ruff_python_literal does not currently depend on ruff_python_ast (nor vice versa).

crates/ruff_python_parser/src/token.rs Outdated Show resolved Hide resolved
.with_value(capital_env_var.into_boxed_str())
.with_prefix({
if env_var.is_unicode() {
StringLiteralPrefix::UString
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think I would prefer Unicode, it avoids the string repetition

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike Unicode, because all strings in Python are unicode... the u prefix does nothing at runtime, it's purely there for Python 2 compatibility. (In fact, it was originally removed in Python 3.0, but they added it back in a later Python 3 version because it was so difficult to write code that worked on both Python 2 and Python 3 if the prefix was gone in Python 3.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. But it is the unicode flag ;) Can you do a quick check on what we call it in other places in the code base? It would be nice if it is consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a grep -- there's a pyupgrade rule called UnicodeKindPrefix, but other than that, I can't see much use of the term Unicode across the codebase to describe this prefix...

I'd still rather keep this as-is -- I personally would find it really confusing to refer to these strings as "unicode strings", given that all strings are unicode strings, even without the prefix

@AlexWaygood AlexWaygood force-pushed the quotestyle-ast branch 8 times, most recently from c8bdaf0 to 7ab7c5a Compare March 8, 2024 18:41
@AlexWaygood AlexWaygood merged commit 1d97f27 into astral-sh:main Mar 8, 2024
17 checks passed
@AlexWaygood AlexWaygood deleted the quotestyle-ast branch March 8, 2024 19:11
nkxxll pushed a commit to nkxxll/ruff that referenced this pull request Mar 10, 2024
This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information:
- The quoting style used (double or single quotes)
- Whether the string is triple-quoted or not
- Whether the string is raw or not

This PR is a followup to astral-sh#10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the `ExprGenerator` in a followup PR, which will allow us to provide a fix for astral-sh#7799.

The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on `Tok::String`, `Tok::FStringStart` and `Tok::FStringMiddle` tokens in astral-sh#10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node.

Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a `u` prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants