Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace LALRPOP parser with hand-written parser #10036

Merged
merged 114 commits into from Apr 18, 2024
Merged

Replace LALRPOP parser with hand-written parser #10036

merged 114 commits into from Apr 18, 2024

Conversation

dhruvmanila
Copy link
Member

@dhruvmanila dhruvmanila commented Feb 19, 2024

(Supersedes #9152, authored by @LaBatata101)

Summary

This PR replaces the current parser generated from LALRPOP to a hand-written recursive descent parser.

It also updates the grammar for PEP 646 so that the parser outputs the correct AST. For example, in data[*x], the index expression is now a tuple with a single starred expression instead of just a starred expression.

Beyond the performance improvements, the parser is also error resilient and can provide better error messages. The behavior as seen by any downstream tools isn't changed. That is, the linter and formatter can still assume that the parser will stop at the first syntax error. This will be updated in the following months.

For more details about the change here, refer to the PR corresponding to the individual commits and the release blog post.

Test Plan

Write lots and lots of tests for both valid and invalid syntax and verify the output.

Acknowledgements

Copy link

codspeed-hq bot commented Mar 7, 2024

CodSpeed Performance Report

Merging #10036 will improve performances by ×2.4

Comparing dhruv/parser (98a7669) with main (e09180b)

Summary

⚡ 7 improvements
✅ 23 untouched benchmarks

Benchmarks breakdown

Benchmark main dhruv/parser Change
linter/all-rules[unicode/pypinyin.py] 9.6 ms 8.6 ms +11.63%
linter/all-with-preview-rules[unicode/pypinyin.py] 11.1 ms 10.1 ms +9.95%
parser[large/dataset.py] 63.6 ms 26.6 ms ×2.4
parser[numpy/ctypeslib.py] 10.8 ms 5 ms ×2.2
parser[numpy/globals.py] 964.6 µs 424.9 µs ×2.3
parser[pydantic/types.py] 24.4 ms 10.9 ms ×2.2
parser[unicode/pypinyin.py] 3.8 ms 1.7 ms ×2.2

Copy link
Contributor

github-actions bot commented Mar 12, 2024

ruff-ecosystem results

Linter (stable)

ℹ️ ecosystem check detected linter changes. (+1 -1 violations, +0 -0 fixes in 1 projects; 43 projects unchanged)

demisto/content (+1 -1 violations, +0 -0 fixes)

+ Packs/ThreatQ/Integrations/ThreatQ/ThreatQ.py:106:12: E999 SyntaxError: Multiple exception types must be parenthesized
- Packs/ThreatQ/Integrations/ThreatQ/ThreatQ.py:106:21: E999 SyntaxError: Unexpected token ','

Changes by rule (1 rules affected)

code total + violation - violation + fix - fix
E999 2 1 1 0 0

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+1 -1 violations, +0 -0 fixes in 1 projects; 43 projects unchanged)

demisto/content (+1 -1 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

+ Packs/ThreatQ/Integrations/ThreatQ/ThreatQ.py:106:12: E999 SyntaxError: Multiple exception types must be parenthesized
- Packs/ThreatQ/Integrations/ThreatQ/ThreatQ.py:106:21: E999 SyntaxError: Unexpected token ','

Changes by rule (1 rules affected)

code total + violation - violation + fix - fix
E999 2 1 1 0 0

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

dhruvmanila and others added 20 commits April 18, 2024 15:40
## Summary

This PR reviews the logic for IPython escape command parsing and makes
the required changes to almost match the old parsing logic.

~One change I've not incorporated is the `unparse_expr`. The new parser
can use `src_text` directly but this means that any whitespace is
preserved~ Migrated the `unparse_expr` logic

## Test plan

Use the new parser for testing escape commands and verify that there are
no snapshot changes.
## Summary

This PR updates the `ruff_new_parser_equiv` fuzzer target to reject the
corpus which produces an empty range for the AST from LALRPOP. If there
are no nodes, then LALRPOP produces an empty range while the new parser
uses the correct range of the source itself.
Co-authored-by: Micha Reiser <micha@reiser.io>
## Summary

This PR updates the LALRPOP to use the complete source range for module
by looking at the tokens before the parsing begins. This means the
module range now corresponds to the actual source even if it only
contains trivia tokens. This is mainly so that the fuzzer doesn't crash
because the AST doesn't match.
> `simple` is a boolean integer set to `True` for a
[Name](https://docs.python.org/3/library/ast.html#ast.Name) node in
`target` that do not appear in between parenthesis and are hence pure
names and not expressions.

Reference: https://docs.python.org/3/library/ast.html#ast.AnnAssign
## Summary

Dotted names aren't allowed in `from ... import` statement. They're only
allowed in `import` statement.

### Alternative

Another solution would be to parse the dotted name, check if there's a
`.` in the parsed string and raise an error.

I choose not to do this because it didn't make sense to do `contains`
for every import name.

## Test Plan

Add invalid syntax test cases to verify the logic.
## Summary

This PR fixes a bug in parenthesized with-items parsing where the parser
should expect a comma but it didn't.

This is the case in the speculative parsing loop of parenthesized
with-items. Once the parser finds any token other than comma, it breaks
out of the loop. Then, once the parser has determined the with item
kind, it needs to expect a comma if it's not the end of parenthesized
with-items.

## Test Plan

Add inline test cases, verify the snapshots.
## Summary

This PR avoids the panic when trying to recover from an invalid pattern.

The panic occurs because there's no way to represent `x as y` syntax as
a Python expression. There were few cases which were not accounted for
when working on this.

The cases where it would panic are the ones where only a subset of
pattern is valid in a position but all patterns are valid in a nested
position of the mentioned top-level pattern.

For example, consider a mapping pattern and the following points:
* Key position can only be a `MatchValue` and `MatchSingleton`
* `MatchClass` can contain a `MatchAs` as an argument
* `MatchAs` pattern with both pattern and name (`a as b`) cannot be
converted to a valid expression

Considering this, if a mapping pattern has a class pattern with an `as`
pattern as one of it's argument, the parser would panic.

```py
match subject:
	case {Foo(a as b): 1}: ...
#        ^^^^^^^^^^^^^^^^ mapping pattern
#         ^^^^^^^^^^^     mapping pattern key, class pattern
#             ^^^^^^      as pattern with both pattern and name
```

The same case is for complex literal patterns as well.

I'm not sure what the recovery should look like but for now it's better
to avoid the panic and create an invalid expression node in it's place.

## Test Plan

Add new test cases and update the snapshots.
## Summary

This fixes a bug in the parser when `Mode::Expression` is used where the
parser didn't recover from unexpected token after the expression.

The current solution isn't ideal as the parser will drop all the
remaining tokens after the initial expression has been parsed. This is
to avoid panicking especially in the context of forward annotation like:

```python
x: Literal["foo", "bar baz"]
```

Here, the parser is being given the `bar baz` as the source to be parsed
using `Mode::Expression`. The parser will parse `bar` as a name
expression and then stop because the expression ends there. But, it
seems like there's more tokens remaining. This specific case could
potentially be recovered by using a `ExprTuple` and raising a "missing
comma" error but I've avoided that complexity for now.

## Test Plan

Add new test cases for `Mode::Expression` and verified the snapshots.
## Summary

This PR fixes the bug where the parser would fail to parse the following
code:

```python
for (x in y)[0] in z: ...
```

This is because when parsing the target of a `for` statement, the
context flag for the parser is set to `FOR_TARGET` and this is used by
binary expression parsing to stop the expression parsing when it sees
the `in` keyword. But, in parenthesized context, this is allowed.

The solution implemented in this PR is to reset the parser context when
parsing a parenthesized expression and set it back to the original
context later.

An alternative solution would be to introduce
[`ExpressionContext`](https://github.com/biomejs/biome/blob/838ccb442370e72197e625a7d0e4456e6e77e498/crates/biome_js_parser/src/syntax/expr.rs#L42-L43)
but that would require a lot of updates as almost all of expression
parsing function would need to be updated to take this context. A
benefit of this would be that both `AllowStarredExpression` and
`AllowNamedExpression` can be merged in the context and everything
around `ParserCtxFlags` can be removed. It could possibly also simplify
certain parsing functions.

## Test Plan

Add test cases and verified the snapshots.
## Summary

This PR removes a test case from the formatter which uses a named
expression as `del` target. This is an invalid syntax in the new parser.
## Summary

This fixes a bug for code like:

```python
match subject:
    case a, b if expr: ...
```

Previously, the `:` was the only terminator token so the list parsing
wouldn't stop after `b`. The `if` token has been added to the terminator
list to fix this.

## Test Plan

Add test cases and verified the snapshots.
## Summary

There have been some grammar changes in [PEP 646] which were not
accounted for in the old parser. The new parser has been updated with
the correct AST. This is the case when there's a starred expression
inside a subscript expression like the following example:

```python
data[*x]
```

This gives us the AST where the slice element is actually a tuple
expression with one element (starred expression) instead of just a
starred expression. Now, the formatter's current behavior is to always
add a trailing comma in a tuple with a single element.

This PR updates the formatter to use the "preserve" behavior in trailing
comma as well. So, trailing comma will not be added in the above example
and if there's a trailing comma in the above example, it'll be
preserved. This retains the current behavior without the AST change.

[PEP 646]:
https://peps.python.org/pep-0646/#change-1-star-expressions-in-indexes

## Test Plan

Run `cargo insta test -p ruff_python_formatter` and make sure there are
no changes.
@dhruvmanila dhruvmanila marked this pull request as ready for review April 18, 2024 10:26
crates/ruff_python_ast/src/comparable.rs Outdated Show resolved Hide resolved
@MichaReiser
Copy link
Member

One obligatory comment :P Excellent work. Looking forward to my next git pull

@addisoncrump
Copy link
Contributor

🚀 Let me know if I can help with further testing.

@dhruvmanila dhruvmanila merged commit 13ffb5b into main Apr 18, 2024
17 checks passed
@dhruvmanila dhruvmanila deleted the dhruv/parser branch April 18, 2024 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Related to the parser performance Potential performance improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ruff fails to parse WithItem with a starred expression as optional_vars
4 participants