
Apply NFKC normalization to unicode identifiers in the lexer #10412

Merged: 6 commits into astral-sh:main on Mar 18, 2024

Conversation

@AlexWaygood (Member) commented Mar 14, 2024

Summary

A second attempt to fix #5003, hopefully without the performance problems that #10381 suffered from.

Python applies NFKC normalization to identifiers that use Unicode characters. That means F821 should not be emitted when ruff encounters the following snippet (on main, it currently is), because from Python's perspective these are all the same identifier:

𝒞 = 500
print(𝒞)
print(C + 𝒞)  # ruff says `C` isn't defined
print(C / 𝒞)
print(C == 𝑪 == 𝒞 == 𝓒 == 𝕮)  # ruff says `C`, `𝑪`, ... isn't defined

This PR fixes that false positive by NFKC-normalizing identifiers as they are encountered in the lexer.
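For illustration, here is a minimal sketch of the normalization itself (not the code added in this PR), assuming the unicode-normalization crate, whose UnicodeNormalization trait provides an nfkc() adapter; normalize_identifier is a hypothetical helper:

// A minimal sketch of NFKC-normalizing an identifier, assuming the
// `unicode-normalization` crate. This is not the exact code added to the
// lexer; it only demonstrates the equivalence that makes `𝒞` and `C` the
// same identifier from Python's point of view.
use unicode_normalization::UnicodeNormalization;

fn normalize_identifier(raw: &str) -> String {
    // `nfkc()` yields the NFKC-normalized form of the input characters.
    raw.nfkc().collect()
}

fn main() {
    assert_eq!(normalize_identifier("𝒞"), "C");
    assert_eq!(normalize_identifier("𝕮"), "C");
    // ASCII identifiers are already in NFKC form and pass through unchanged.
    assert_eq!(normalize_identifier("C"), "C");
}

Because ASCII-only identifiers are unchanged by NFKC, an implementation can skip normalization for them entirely, which keeps the cost negligible for typical code.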

Test Plan

cargo test

@AlexWaygood changed the title from "fix the bug" to "Apply NFKC normalization to unicode identifiers in the lexer" on Mar 14, 2024
github-actions bot commented Mar 14, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@dhruvmanila (Member) left a comment


This is great, especially with such a minimal change. Can you please write or update the documentation around this change? Maybe on lex_identifier or the Name token?

It's good to go from my side but I'll let @MichaReiser give the final approval.

crates/ruff_python_parser/src/lexer.rs: two review threads resolved (outdated)
@AlexWaygood (Member, Author)

> Can you please write or update the documentation around this change? Maybe on lex_identifier or the Name token?

Thanks! I had a go at writing some docs in d785be5 -- how does that look to you?

@dhruvmanila (Member)

> Thanks! I had a go at writing some docs in d785be5 -- how does that look to you?

Looks good. Thank you!

@@ -16,6 +16,9 @@ pub enum Tok {
     /// Token value for a name, commonly known as an identifier.
     Name {
         /// The name value.
+        ///
+        /// Unicode names are NFKC-normalized by the lexer,
+        /// matching [the behaviour of Python's lexer](https://docs.python.org/3/reference/lexical_analysis.html#identifiers)
         name: Box<str>,
A member commented on this hunk:

I was eventually hoping to remove all owned data from the lexer tokens (e.g., prior to this change, we could've conceivably removed this field altogether; if we remove more similar fields from other tokens, we can eventually reduce the size of the Tok enum, which could be great for performance). This now precludes us from doing so. But I don't have enough context on the future design of the lexer-parser to know if it matters.
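To make the size concern concrete (a toy illustration with hypothetical enums, not ruff's actual Tok definition): any variant that owns data grows every variant of the enum, which std::mem::size_of makes visible.

// Hypothetical toy enums, not ruff's real `Tok`: every variant pays for
// the largest one, so one owned field inflates the whole enum.
#[allow(dead_code)]
enum KindOnly {
    Name,
    Int,
    Newline,
}

#[allow(dead_code)]
enum WithValue {
    Name { name: Box<str> }, // owns heap data: pointer + length
    Int,
    Newline,
}

fn main() {
    // On a typical 64-bit target, `KindOnly` is 1 byte, while `WithValue`
    // is around 16 bytes because of the `Box<str>` (fat pointer) payload.
    println!("kind-only: {} bytes", std::mem::size_of::<KindOnly>());
    println!("with value: {} bytes", std::mem::size_of::<WithValue>());
}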

Another member replied:

I think we can still accomplish this; Name isn't the only token that must return the parsed value (e.g., f-strings).

What I have in mind is to:

  • Replace Tok with TokenKind (that holds no data)
  • Add a take_value() method on Lexer that "takes" the current value (enum over Name, Int etc.).

This design also fits better into our new parser, which already does exactly this internally (except that it "takes" the value from Tok). The advantage is that we only pay the overhead of reading or writing a value when it is a value token (and we're interested in the value).
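A minimal sketch of that proposed split (the TokenKind and take_value names come from the comment above; the value representation and field names are assumptions, not ruff's real types): TokenKind carries no data, and the parser asks the lexer for the current value only when it needs it.

// Illustrative sketch of the `TokenKind` + `take_value()` idea.
#[derive(Copy, Clone, Debug, PartialEq)]
#[allow(dead_code)]
enum TokenKind {
    Name,
    Int,
    Newline,
}

#[derive(Debug, Default)]
#[allow(dead_code)]
enum TokenValue {
    #[default]
    None,
    Name(Box<str>),
    Int(i64),
}

struct Lexer {
    current_kind: TokenKind,
    current_value: TokenValue,
}

impl Lexer {
    /// The kind of the current token: cheap to copy, carries no data.
    fn current_kind(&self) -> TokenKind {
        self.current_kind
    }

    /// Moves the current token's value out of the lexer. The parser calls
    /// this only for tokens that actually carry a value (Name, Int, ...).
    fn take_value(&mut self) -> TokenValue {
        std::mem::take(&mut self.current_value)
    }
}

fn main() {
    let mut lexer = Lexer {
        current_kind: TokenKind::Name,
        current_value: TokenValue::Name("C".into()),
    };
    if lexer.current_kind() == TokenKind::Name {
        // Only now do we pay for moving the value out of the lexer.
        println!("{:?}", lexer.take_value());
    }
}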

A member replied:

This is a good point. We could potentially move this to the parse_identifier method in the parser, as that's the final destination for this value. I could come back to this once the new parser is merged and I'm looking into the feedback loop between the lexer and parser.
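For context, a sketch of what normalizing at the parser boundary could look like; the parse_identifier name comes from the comment above, while the Identifier type and the function signature are assumptions, not ruff's actual parser code:

// Hypothetical sketch: normalize when the parser materializes an identifier,
// rather than in the lexer. `Identifier` and `parse_identifier` here are
// stand-ins, not ruff's real API.
use unicode_normalization::UnicodeNormalization;

struct Identifier {
    name: Box<str>,
}

fn parse_identifier(raw: &str) -> Identifier {
    // ASCII identifiers are already in NFKC form, so skip the work for the
    // overwhelmingly common case.
    let name = if raw.is_ascii() {
        raw.into()
    } else {
        raw.nfkc().collect::<String>().into_boxed_str()
    };
    Identifier { name }
}

fn main() {
    assert_eq!(&*parse_identifier("𝒞").name, "C");
    assert_eq!(&*parse_identifier("snake_case").name, "snake_case");
}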

@charliermarsh (Member) left a comment:

Looks reasonable to me, I'll defer to Micha and Dhruv to approve.

@MichaReiser added the parser label (Related to the parser) on Mar 18, 2024
@AlexWaygood merged commit 92e6026 into astral-sh:main on Mar 18, 2024 (17 checks passed).
@AlexWaygood deleted the unicode-normalization-2 branch on Mar 18, 2024 at 11:56.
@AlexWaygood (Member, Author)

Thanks all!

Labels: parser (Related to the parser)

May close: F821 False Positive when using equivalent script characters

4 participants