Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of builtin symbols in linter rules #10919

Merged
merged 9 commits into from Apr 16, 2024

Conversation

AlexWaygood
Copy link
Member

Summary

A pattern that shows up a lot in the ruff_linter crate currently is to do something like this to determine whether an ast::Expr node refers to a builtin symbol:

fn do_the_check(checker: &mut Checker, expr: &ast::Expr) {
    let ast::Expr::Name(ast::ExprName { id, .. }) = expr else {
        return;
    }
    if id != "zip" {
        return;
    }
    if !checker.semantic().is_builtin("zip") {
        return;
    }
    // and now we do the check that only applies to the builtin zip function...
}

This has two problems:

  1. It's pretty verbose and repetitive, and lots of our checks have some interaction with a builtin function or class in some way
  2. This pattern doesn't account for the fact that it's perfectly possible to explicitly import the builtins module and e.g. refer to zip by using builtins.zip rather than just zip

This PR adds a new method to the semantic model that means that this kind of logic becomes both more concise and more accurate. The above logic can now be replaced with this, which will also (unlike the old function) recognise builtins.zip as being a reference to the builtin zip function:

fn do_the_check(checker: &mut Checker, expr: &ast::Expr) {
    if !checker.semantic().match_builtin_expr(expr, "zip") {
        return;
    }
    // and now we do the check that only applies to the builtin zip function...
}

A new method is also added to crates/ruff_linter/src/importer/mod.rs to enable us to provide autofixes involving builtin functions/classes, even when they've been shadowed by a user-defined function or class in the current scope.

Test Plan

cargo test. Several fixtures have been extended with new examples to test the new functionality.

This PR should be easiest to review commit-by-commit: each commit passes the full test suite by itself.

@AlexWaygood AlexWaygood changed the title Add a new method to check whether an Expr refers to a builtin symbol Improve handling of builtin symbols in linter rules Apr 13, 2024
@AlexWaygood AlexWaygood added the linter Related to the linter label Apr 13, 2024
Copy link
Contributor

github-actions bot commented Apr 13, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@AlexWaygood
Copy link
Member Author

AlexWaygood commented Apr 13, 2024

Ah, that's a regression: something can be imported explicitly from builtins even if it's not an ast::Name node. My fast path doesn't work as it's currently written.

@AlexWaygood
Copy link
Member Author

AlexWaygood commented Apr 13, 2024

Seems like this leads to a 1% perf regression overall, with regressions of 2-3% on some linter benchmarks: https://codspeed.io/astral-sh/ruff/branches/AlexWaygood:improve-builtins-handling

I don't know whether that's worth it or not — it's less repetitive code for us and it's more principled in that we're handling the semantics of Python more accurately, but importing builtins explicitly also isn't that common.

@charliermarsh
Copy link
Member

Would you have expected ~no change here for benchmarks that don't import builtins?

@AlexWaygood
Copy link
Member Author

Would you have expected ~no change here for benchmarks that don't import builtins?

yeah. I would have hoped that the fast path from the if self.seen_modules(Modules::BUILTINS) check would mean that performance would be basically unchanged for those benchmarks.

Any idea what might be the cause of the slowdown? pyflakes::rules::unused_import seems to be showing up in a bunch of the codspeed flamegraphs, which is weird, because I haven't touched that function in this PR

@charliermarsh
Copy link
Member

No, I'm not sure. I looked through the changes again and my expectation would've been no change. You can try just pushing again and see if the benchmarks move in the other direction, since it might just be noise. I wouldn't read into the flamegraph diff though -- I think CodSpeed is just confused.

@charliermarsh
Copy link
Member

The other thing you might consider doing is benchmarking locally with Hyperfine. E.g., you could build a release build on main, stash it (mv ./target/release/ruff ./target/release/main), then build on your branch and do something like:

hyperfine --warmup 10 --runs 100 \
    "./target/release/main ./crates/ruff/resources/test/cpython/ --no-cache --silent -e" \ 
    "./target/release/ruff ./crates/ruff/resources/test/cpython/ --no-cache --silent -e" 

I'd suggest running it a few times and switching the order of main and ruff around to reach a safe conclusion.

@AlexWaygood
Copy link
Member Author

AlexWaygood commented Apr 13, 2024

since it might just be noise.

I'm getting similar results from running cargo benchmark locally (tried it a few times earlier), so I don't think it's just noise :/ I'll try tomorrow with hyperfine to see if the macrobenchmark shows anything different to the microbenchmarks...

@charliermarsh
Copy link
Member

If you run the micro-benchmarks locally, I suggest running them a few times consecutively for each PR. At least on my machine, the second run of the same binary is always significant faster.

@carljm
Copy link
Contributor

carljm commented Apr 14, 2024

It seems like this PR is doing two separable things: improving detection of what is a builtin, and allowing auto-fixes that want to use a builtin to import it via builtins module if the name is shadowed.

Have you benchmarked those changes in isolation, to clarify which one the regression is coming from? IIUC we always eagerly generate fixes, so it seems like a change that allows more fixes in more cases could easily cause a regression?

@charliermarsh
Copy link
Member

Yeah, although we should only be generating more diagnostics (and therefore more fixes) for files that import builtins, and we don’t have any such files in our benchmarks (IIUC).

It could be that computing the fixes become more expensive (the use of get_or_import_symbol will be more expensive) even in the case that the file doesn’t import builtins, though I don’t have intuition on whether that could account for the delta here. You could try removing the changes that involve modifying the fixes, to see if the benchmarks return to normal as Carl suggests.

@carljm
Copy link
Contributor

carljm commented Apr 14, 2024

It looks to me like all the rules that are changed to use try_set_fix with the new get_or_import_builtin_symbol (e.g. unreliable_call_check is one example) will now generate more fixes than before in any file where the builtin they want to add a reference to is shadowed. E.g. in a module that contains callable = ..., unreliable_call_check will now generate a fix that imports builtins and uses builtins.callable, rather than bailing and not generating a fix at all. Similar for a number of other rules. So this would generate more fixes even in files that don't currently import builtins.

I haven't checked whether the benchmarks in question do this, but this change seems entirely separable from the better detection of imported builtins, so if there are any mysteries it seems like a good first step is to separate the changes and narrow down the cause. (And if it's still a mystery, perhaps further separate by only changing one rule at a time.)

@charliermarsh
Copy link
Member

Oh, that's true! Good call. I would be surprised if we had any instances of those in our benchmark files (and especially not in all of them) but it's possible there are a few and it's definitely a good idea to isolate the changes.

@AlexWaygood
Copy link
Member Author

AlexWaygood commented Apr 14, 2024

Yet another possibility: the change to the fixes also involves changing a bunch of fixes to use Fix::safe_edits instead of Fix::safe_edit, even if a builtin symbol is not overridden in the relevant scope. It's possible safe_edits is slower than safe_edit, even if only a single edit is passed? Anyway, I didn't get to it today, but I'll do some benchmarking tomorrow without the changes to the fixes, as @carljm suggests :-)

crates/ruff_python_semantic/src/model.rs Outdated Show resolved Hide resolved
crates/ruff_python_semantic/src/model.rs Outdated Show resolved Hide resolved
if name_expr.id != "dict" {
return;
}
if !checker.semantic().is_builtin("dict") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional that is_builtin uses lookup_symbol but the new implementation now uses resolve_qualified_name (which also works for member expression which the old implementation didn't)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a need for the new abstraction to work with member expressions. I just used resolve_qualified_name because it seemed like the simplest way to implement the new abstraction. Do you think using lookup_symbol would be more efficient here? (Goes to compare the two definitions...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how we would be able to use lookup_symbol here, as lookup_symbol takes &str (representing the name of the symbol) rather than an AST node. And we don't know the name of the symbol -- even if it's a builtin, as you can do from builtins import open as o -- unless we resolve the qualified name

crates/ruff_linter/src/rules/pylint/rules/bad_open_mode.rs Outdated Show resolved Hide resolved
crates/ruff_linter/src/rules/pylint/rules/bad_open_mode.rs Outdated Show resolved Hide resolved
Comment on lines 72 to 75
match qualified_name.segments() {
["" | "builtins", "min"] => Some(MinMax::Min),
["" | "builtins", "max"] => Some(MinMax::Max),
_ => None,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Consider adding a method is_builtin to QualifiedName that tests the segments

if qualified_name.is_builtin("min") {
    Some(MinMax::Min)
} else if qualified_name.is_builtin("max") {
    Some(MinMax::Max)
} else {
    None
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QualifiedName already has an is_builtin method, which does something slightly different right now:

/// If the first segment is empty, the `CallPath` is that of a builtin.
/// Ex) `["", "bool"]` -> `"bool"`
pub fn is_builtin(&self) -> bool {
matches!(self.segments(), ["", ..])
}

/// Return `true` if `member` is a reference to `builtins.$target`,
/// i.e. either `object` (where `object` is not overridden in the global scope),
/// or `builtins.object` (where `builtins` is imported as a module at the top level)
pub fn match_builtin_expr(&self, expr: &Expr, symbol: &str) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could instead change the method to return a QualifiedName where constructing the QualifiedName short-circuits if Modules::BUILTINS isn't imported.

I'm bringing this up because QualifiedName already has methods for handling builtins (that might be incorrect?)

pub fn is_builtin(&self) -> bool {
    matches!(self.segments(), ["", ..])
}

So what I would have in mind is

model.resolve_builtin_name(expr).is_some_and(|builtin| builtin.is_builtin_name("open"))

We could still have a helper as you have today to avoid some of the repetition (is_some_and is somewhat annoying). But it would build up on the same concepts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 4f62717 goes some way to addressing this comment, though I haven't implemented your suggestion in exactly the way you suggested. What do you think of that solution?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it.

@AlexWaygood AlexWaygood force-pushed the improve-builtins-handling branch 2 times, most recently from b8bfb9c to 4e3c7c0 Compare April 15, 2024 08:46
@AlexWaygood
Copy link
Member Author

@MichaReiser's suggestions to move the symbol lookups lower down the function for each rule got rid of the performance regression 🥳🥳

@AlexWaygood AlexWaygood force-pushed the improve-builtins-handling branch 3 times, most recently from 513de18 to 08c5e30 Compare April 15, 2024 11:39
@charliermarsh
Copy link
Member

That's great! In my head those lookups didn't matter since they were only occurring when we saw the "right" name (e.g., open for a rule that cares about open), but of course that's not true -- we were doing them far more frequently on this branch.

@@ -75,7 +69,7 @@ pub(crate) fn getattr_with_constant(
if is_mangled_private(value.to_str()) {
return;
}
if !checker.semantic().is_builtin("getattr") {
if !checker.semantic().match_builtin_expr(func, "getattr") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The counterpoint is that we're now doing other work that isn't strictly necessary (for example, if this is a name, and builtins isn't imported, and the name isn't getattr, then the call to is_mangled_private wasn't necessary -- we could've exited much earlier; similarly, we risk calling is_identifier on many more function calls that won't ever match).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is clearly still the right tradeoff, but that's the tradeoff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way around this would be to, like, have some could_be_builtin("getattr") method you call at the top that returns false if builtins isn't imported, it's not a name, or the name isn't getattr, but IDK, that seems not great / worthwhile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These things (like is_identifier) do show up in traces sometimes because we tend to be calling these rules for every function call, which is why being able to gate on something as cheap as "the name of the function" is useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to change anything here but wanted to raise visibility on it.

@AlexWaygood
Copy link
Member Author

Benchmarks are now showing an... improvement? Not sure how that's possible, but I'll take it https://codspeed.io/astral-sh/ruff/branches/AlexWaygood:improve-builtins-handling

@AlexWaygood AlexWaygood force-pushed the improve-builtins-handling branch 2 times, most recently from 03dada4 to 4f62717 Compare April 15, 2024 13:22
@AlexWaygood AlexWaygood merged commit f779bab into astral-sh:main Apr 16, 2024
17 checks passed
@AlexWaygood AlexWaygood deleted the improve-builtins-handling branch April 16, 2024 10:37
AlexWaygood added a commit that referenced this pull request Apr 17, 2024
…ltin_symbol` or `match_builtin_expr` (#10982)

## Summary

This PR switches more callsites of `SemanticModel::is_builtin` to move
over to the new methods I introduced in #10919, which are more concise
and more accurate. I missed these calls in the first PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
linter Related to the linter
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants