Apply NFKC normalization to unicode identifiers in the lexer #10412

Merged · 6 commits · Mar 18, 2024
1 change: 1 addition & 0 deletions Cargo.lock

(Generated file; diff not rendered.)

1 change: 1 addition & 0 deletions Cargo.toml
@@ -108,6 +108,7 @@ unic-ucd-category = { version = "0.9" }
unicode-ident = { version = "1.0.12" }
unicode-width = { version = "0.1.11" }
unicode_names2 = { version = "1.2.2" }
unicode-normalization = { version = "0.1.23" }
ureq = { version = "2.9.6" }
url = { version = "2.5.0" }
uuid = { version = "1.6.1", features = ["v4", "fast-rng", "macro-diagnostics", "js"] }
@@ -0,0 +1,9 @@
"""Test that unicode identifiers are NFKC-normalised"""

𝒞 = 500
print(𝒞)
print(C + 𝒞) # 2 references to the same variable due to NFKC normalization
print(C / 𝒞)
print(C == 𝑪 == 𝒞 == 𝓒 == 𝕮)

print(𝒟) # F821
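
For context, the fixture above relies on the fact that NFKC normalization folds each of these mathematical capital-C code points down to the plain ASCII letter `C`, so every assignment and reference in the file targets the same identifier. A minimal sketch of that folding using the unicode-normalization crate this PR adds (a standalone illustration, not part of the diff):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // 𝒞, 𝑪, 𝓒 and 𝕮 are "mathematical" capital Cs with a compatibility
    // decomposition to U+0043, so NFKC maps each of them to plain "C".
    for fancy_c in ["𝒞", "𝑪", "𝓒", "𝕮"] {
        assert_eq!(fancy_c.nfkc().collect::<String>(), "C");
    }
    // 𝒟 folds to "D", which the fixture never defines, hence the F821.
    assert_eq!("𝒟".nfkc().collect::<String>(), "D");
}
```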
1 change: 1 addition & 0 deletions crates/ruff_linter/src/rules/pyflakes/mod.rs
@@ -156,6 +156,7 @@ mod tests {
#[test_case(Rule::UndefinedName, Path::new("F821_26.py"))]
#[test_case(Rule::UndefinedName, Path::new("F821_26.pyi"))]
#[test_case(Rule::UndefinedName, Path::new("F821_27.py"))]
#[test_case(Rule::UndefinedName, Path::new("F821_28.py"))]
#[test_case(Rule::UndefinedExport, Path::new("F822_0.py"))]
#[test_case(Rule::UndefinedExport, Path::new("F822_0.pyi"))]
#[test_case(Rule::UndefinedExport, Path::new("F822_1.py"))]
@@ -0,0 +1,10 @@
---
source: crates/ruff_linter/src/rules/pyflakes/mod.rs
---
F821_28.py:9:7: F821 Undefined name `𝒟`
|
7 | print(C == 𝑪 == 𝒞 == 𝓒 == 𝕮)
8 |
9 | print(𝒟) # F821
| ^ F821
|
17 changes: 6 additions & 11 deletions crates/ruff_python_formatter/src/expression/expr_name.rs
@@ -1,4 +1,4 @@
use ruff_formatter::{write, FormatContext};
use ruff_formatter::write;
use ruff_python_ast::AnyNodeRef;
use ruff_python_ast::ExprName;

@@ -11,16 +11,11 @@ pub struct FormatExprName;

impl FormatNodeRule<ExprName> for FormatExprName {
fn fmt_fields(&self, item: &ExprName, f: &mut PyFormatter) -> FormatResult<()> {
let ExprName { id, range, ctx: _ } = item;

debug_assert_eq!(
id.as_str(),
f.context()
.source_code()
.slice(*range)
.text(f.context().source_code())
);

let ExprName {
id: _,
range,
ctx: _,
} = item;
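// The lexer may have NFKC-normalized the name, so `id` no longer necessarily
// matches the underlying source text; the formatter now ignores `id` and
// emits the source slice for `range` directly.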
write!(f, [source_text_slice(*range)])
}

1 change: 1 addition & 0 deletions crates/ruff_python_parser/Cargo.toml
@@ -28,6 +28,7 @@ rustc-hash = { workspace = true }
static_assertions = { workspace = true }
unicode-ident = { workspace = true }
unicode_names2 = { workspace = true }
unicode-normalization = { workspace = true }

[dev-dependencies]
insta = { workspace = true }
40 changes: 36 additions & 4 deletions crates/ruff_python_parser/src/lexer.rs
@@ -32,6 +32,7 @@ use std::iter::FusedIterator;
use std::{char, cmp::Ordering, str::FromStr};

use unicode_ident::{is_xid_continue, is_xid_start};
use unicode_normalization::UnicodeNormalization;

use ruff_python_ast::{Int, IpyEscapeKind};
use ruff_text_size::{TextLen, TextRange, TextSize};
@@ -197,10 +198,25 @@ impl<'source> Lexer<'source> {
_ => {}
}

self.cursor.eat_while(is_identifier_continuation);
// Keep track of whether the identifier is ASCII-only or not.
//
// This is important because Python applies NFKC normalization to
// identifiers: https://docs.python.org/3/reference/lexical_analysis.html#identifiers.
// We therefore need to do the same in our lexer, but applying NFKC normalization
// unconditionally is extremely expensive. If we know an identifier is ASCII-only
// (by far the most common case), we can skip NFKC normalization of the identifier.
let mut is_ascii = first.is_ascii();
self.cursor
.eat_while(|c| is_identifier_continuation(c, &mut is_ascii));

let text = self.token_text();

if !is_ascii {
return Ok(Tok::Name {
name: text.nfkc().collect::<String>().into_boxed_str(),
});
}

let keyword = match text {
"False" => Tok::False,
"None" => Tok::None,
@@ -1583,14 +1599,19 @@ fn is_unicode_identifier_start(c: char) -> bool {
is_xid_start(c)
}

// Checks if the character c is a valid continuation character as described
// in https://docs.python.org/3/reference/lexical_analysis.html#identifiers
fn is_identifier_continuation(c: char) -> bool {
/// Checks if the character c is a valid continuation character as described
/// in <https://docs.python.org/3/reference/lexical_analysis.html#identifiers>.
///
/// Additionally, this function keeps track of whether the identifier as a whole
/// is ASCII-only, by mutating the boolean value passed in by reference.
fn is_identifier_continuation(c: char, identifier_is_ascii_only: &mut bool) -> bool {
// Arrange things such that ASCII codepoints never
// result in the slower `is_xid_continue` getting called.
if c.is_ascii() {
matches!(c, 'a'..='z' | 'A'..='Z' | '_' | '0'..='9')
} else {
*identifier_is_ascii_only = false;
is_xid_continue(c)
}
}
@@ -2042,6 +2063,17 @@ def f(arg=%timeit a = b):
assert_debug_snapshot!(lex_source(source));
}

fn get_tokens_only(source: &str) -> Vec<Tok> {
lex_source(source).into_iter().map(|(tok, _)| tok).collect()
}

#[test]
fn test_nfkc_normalization() {
let source1 = "𝒞 = 500";
let source2 = "C = 500";
assert_eq!(get_tokens_only(source1), get_tokens_only(source2));
}

fn triple_quoted_eol(eol: &str) -> Vec<Spanned> {
let source = format!("\"\"\"{eol} test string{eol} \"\"\"");
lex_source(&source)
3 changes: 3 additions & 0 deletions crates/ruff_python_parser/src/token.rs
@@ -16,6 +16,9 @@ pub enum Tok {
/// Token value for a name, commonly known as an identifier.
Name {
/// The name value.
///
/// Unicode names are NFKC-normalized by the lexer,
/// matching [the behaviour of Python's lexer](https://docs.python.org/3/reference/lexical_analysis.html#identifiers).
name: Box<str>,

Member:

I was eventually hoping to remove all owned data from the lexer tokens (e.g., prior to this change, we could've conceivably removed this field altogether; if we remove more similar fields from other tokens, we can eventually reduce the size of the Tok enum, which could be great for performance). This now precludes us from doing so. But I don't have enough context on the future design of the lexer-parser to know if it matters.

Member:

I think we can still accomplish this, and Name isn't the only token that must return the parsed value (e.g. FStrings, etc.).

What I have in mind is to:

  • Replace Tok with TokenKind (that holds no data)
  • Add a take_value() method on Lexer that "takes" the current value (enum over Name, Int etc.).

This design also fits better into our new parser that already does exactly this internally (except that it "takes" the value from Tok). The advantage of this is that we only pay the overhead of reading or writing a value if it is a value token (and we're interested in the value).
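
A rough sketch of that shape (illustrative names only, not the actual ruff API): the token stream carries a data-free kind, and the lexer hands out the associated value on request.

```rust
/// Data-free token discriminant: cheap to copy and keeps tokens small.
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
enum TokenKind {
    Name,
    Int,
    // ...
}

/// Owned payload for the tokens that carry one.
enum TokenValue {
    None,
    Name(Box<str>),
    Int(i64),
}

struct Lexer {
    current_value: TokenValue,
    // ...rest of the lexer state
}

impl Lexer {
    /// "Takes" the value associated with the current token, leaving a
    /// placeholder behind. Callers only pay this cost for value tokens
    /// they actually care about.
    fn take_value(&mut self) -> TokenValue {
        std::mem::replace(&mut self.current_value, TokenValue::None)
    }
}
```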

Member:

This is a good point. We could potentially move this into the parser's parse_identifier method, since that's the final destination for this value. I can come back to this once the new parser is merged and I'm looking into the feedback loop between the lexer and the parser.

},
/// Token value for an integer.