
Apply NFKC normalization to unicode identifiers in the lexer #10412

Merged: 6 commits into astral-sh:main on Mar 18, 2024

Conversation

@AlexWaygood (Member) commented Mar 14, 2024

Summary

A second attempt to fix #5003, hopefully without the performance problems that #10381 suffered from.

Python applies NFKC normalization to identifiers that use Unicode characters. That means F821 should not be emitted when ruff encounters the following snippet (on main, it currently is), because from Python's perspective these are all the same identifier:

𝒞 = 500
print(𝒞)
print(C + 𝒞)  # ruff says `C` isn't defined
print(C / 𝒞)
print(C == 𝑪 == 𝒞 == 𝓒 == 𝕮)  # ruff says `C`, `𝑪`, ... isn't defined

This PR fixes that false positive by NFKC-normalizing identifiers as they are encountered in the lexer.
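For illustration, here is a minimal sketch of the normalization itself (not the code added in this PR), assuming the unicode-normalization crate, whose UnicodeNormalization trait provides an nfkc() adapter; normalize_identifier is a hypothetical helper:

// A minimal sketch of NFKC-normalizing an identifier, assuming the
// `unicode-normalization` crate. This is not the exact code added to the
// lexer; it only demonstrates the equivalence that makes `𝒞` and `C` the
// same identifier from Python's point of view.
use unicode_normalization::UnicodeNormalization;

fn normalize_identifier(raw: &str) -> String {
    // `nfkc()` yields the NFKC-normalized form of the input characters.
    raw.nfkc().collect()
}

fn main() {
    assert_eq!(normalize_identifier("𝒞"), "C");
    assert_eq!(normalize_identifier("𝕮"), "C");
    // ASCII identifiers are already in NFKC form and pass through unchanged.
    assert_eq!(normalize_identifier("C"), "C");
}

Because ASCII-only identifiers are unchanged by NFKC, an implementation can skip normalization for them entirely, which keeps the cost negligible for typical code.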

Test Plan

cargo test

@AlexWaygood changed the title from "fix the bug" to "Apply NFKC normalization to unicode identifiers in the lexer" on Mar 14, 2024
github-actions bot commented Mar 14, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@dhruvmanila (Member) left a comment


This is great, especially with such a minimal change. Can you please write or update the documentation around this change? Maybe on lex_identifier or the Name token?

It's good to go from my side but I'll let @MichaReiser give the final approval.

crates/ruff_python_parser/src/lexer.rs: two review threads resolved (outdated)
@AlexWaygood (Member, Author)

> Can you please write or update the documentation around this change? Maybe on lex_identifier or the Name token?

Thanks! I had a go at writing some docs in d785be5 -- how does that look to you?

@dhruvmanila (Member)

> Thanks! I had a go at writing some docs in d785be5 -- how does that look to you?

Looks good. Thank you!

@@ -16,6 +16,9 @@ pub enum Tok {
     /// Token value for a name, commonly known as an identifier.
     Name {
         /// The name value.
+        ///
+        /// Unicode names are NFKC-normalized by the lexer,
+        /// matching [the behaviour of Python's lexer](https://docs.python.org/3/reference/lexical_analysis.html#identifiers)
         name: Box<str>,
A member commented on this hunk:

I was eventually hoping to remove all owned data from the lexer tokens (e.g., prior to this change, we could've conceivably removed this field altogether; if we remove more similar fields from other tokens, we can eventually reduce the size of the Tok enum, which could be great for performance). This now precludes us from doing so. But I don't have enough context on the future design of the lexer-parser to know if it matters.
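To make the size concern concrete (a toy illustration with hypothetical enums, not ruff's actual Tok definition): any variant that owns data grows every variant of the enum, which std::mem::size_of makes visible.

// Hypothetical toy enums, not ruff's real `Tok`: every variant pays for
// the largest one, so one owned field inflates the whole enum.
#[allow(dead_code)]
enum KindOnly {
    Name,
    Int,
    Newline,
}

#[allow(dead_code)]
enum WithValue {
    Name { name: Box<str> }, // owns heap data: pointer + length
    Int,
    Newline,
}

fn main() {
    // On a typical 64-bit target, `KindOnly` is 1 byte, while `WithValue`
    // is around 16 bytes because of the `Box<str>` (fat pointer) payload.
    println!("kind-only: {} bytes", std::mem::size_of::<KindOnly>());
    println!("with value: {} bytes", std::mem::size_of::<WithValue>());
}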

Another member replied:

I think we can still accomplish this; Name isn't the only token that must return the parsed value (e.g., f-strings).

What I have in mind is to:

  • Replace Tok with TokenKind (that holds no data)
  • Add a take_value() method on Lexer that "takes" the current value (enum over Name, Int etc.).

This design also fits better into our new parser, which already does exactly this internally (except that it "takes" the value from Tok). The advantage is that we only pay the overhead of reading or writing a value when it is a value token (and we're interested in the value).
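A minimal sketch of that proposed split (the TokenKind and take_value names come from the comment above; the value representation and field names are assumptions, not ruff's real types): TokenKind carries no data, and the parser asks the lexer for the current value only when it needs it.

// Illustrative sketch of the `TokenKind` + `take_value()` idea.
#[derive(Copy, Clone, Debug, PartialEq)]
#[allow(dead_code)]
enum TokenKind {
    Name,
    Int,
    Newline,
}

#[derive(Debug, Default)]
#[allow(dead_code)]
enum TokenValue {
    #[default]
    None,
    Name(Box<str>),
    Int(i64),
}

struct Lexer {
    current_kind: TokenKind,
    current_value: TokenValue,
}

impl Lexer {
    /// The kind of the current token: cheap to copy, carries no data.
    fn current_kind(&self) -> TokenKind {
        self.current_kind
    }

    /// Moves the current token's value out of the lexer. The parser calls
    /// this only for tokens that actually carry a value (Name, Int, ...).
    fn take_value(&mut self) -> TokenValue {
        std::mem::take(&mut self.current_value)
    }
}

fn main() {
    let mut lexer = Lexer {
        current_kind: TokenKind::Name,
        current_value: TokenValue::Name("C".into()),
    };
    if lexer.current_kind() == TokenKind::Name {
        // Only now do we pay for moving the value out of the lexer.
        println!("{:?}", lexer.take_value());
    }
}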

A member replied:

This is a good point. We could potentially move this to the parse_identifier method in the parser, as that's the final destination for this value. I could come back to this once the new parser is merged and I'm looking into the feedback loop between the lexer and parser.
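For context, a sketch of what normalizing at the parser boundary could look like; the parse_identifier name comes from the comment above, while the Identifier type and the function signature are assumptions, not ruff's actual parser code:

// Hypothetical sketch: normalize when the parser materializes an identifier,
// rather than in the lexer. `Identifier` and `parse_identifier` here are
// stand-ins, not ruff's real API.
use unicode_normalization::UnicodeNormalization;

struct Identifier {
    name: Box<str>,
}

fn parse_identifier(raw: &str) -> Identifier {
    // ASCII identifiers are already in NFKC form, so skip the work for the
    // overwhelmingly common case.
    let name = if raw.is_ascii() {
        raw.into()
    } else {
        raw.nfkc().collect::<String>().into_boxed_str()
    };
    Identifier { name }
}

fn main() {
    assert_eq!(&*parse_identifier("𝒞").name, "C");
    assert_eq!(&*parse_identifier("snake_case").name, "snake_case");
}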

@charliermarsh (Member) left a comment:

Looks reasonable to me, I'll defer to Micha and Dhruv to approve.

@MichaReiser added the parser label (Related to the parser) on Mar 18, 2024
@AlexWaygood merged commit 92e6026 into astral-sh:main on Mar 18, 2024 (17 checks passed).
@AlexWaygood deleted the unicode-normalization-2 branch on Mar 18, 2024 at 11:56.
@AlexWaygood (Member, Author)

Thanks all!

Labels: parser (Related to the parser)

May close: F821 False Positive when using equivalent script characters

4 participants