Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite the regex crate #978

Merged
merged 20 commits into from Jul 5, 2023
Merged

rewrite the regex crate #978

merged 20 commits into from Jul 5, 2023

Conversation

BurntSushi
Copy link
Member

@BurntSushi BurntSushi commented Apr 18, 2023

This effectively replaces the existing regex crate internals with public APIs (that anyone can use) from regex-automata.

I'll fill our this PR a little bit more, but I don't intend to merge this for a bit until current master has been released. I have this up in a PR primarily to make sure CI is working as intended, and in particular, to ensure that the 1.8 release is actually sufficient preparation for the 1.9 release. (It would be bad to put out 1.8 and then realize I need another breaking change in regex-syntax for example in order to wire everything up.)

For those who want to browse the regex-automata API, you can do so here (top-level docs haven't been written yet): https://burntsushi.net/stuff/tmp-do-not-link-me/regex/regex_automata/

@BurntSushi BurntSushi force-pushed the ag/regex-automata branch 22 times, most recently from b534e18 to 0eb9aee Compare April 21, 2023 00:48
@lopopolo
Copy link
Contributor

@BurntSushi I'm not sure if this is something you're interested in, but using git subtree you can import regex-automata and all of its history into the regex repository: https://manpages.ubuntu.com/manpages/bionic/man1/git-subtree.1.html

@BurntSushi
Copy link
Member Author

@lopopolo Aye thanks for that idea. I'm going to pass for now primarily because the regex-automata I'm importing here is basically a total rewrite. It doesn't really have much meaningful history going back to 0.1. It's a nearly totally different thing.

@BurntSushi BurntSushi force-pushed the ag/regex-automata branch 5 times, most recently from d12ba51 to 93f3f6e Compare April 27, 2023 13:52
BurntSushi and others added 13 commits June 5, 2023 14:03
We are going to remove the old benchmark harness, but it seems like a
good idea to save the old measurements.

In the future, benchmarks will be maintained by rebar:
https://github.com/BurntSushi/rebar
As stated in a previous commit, we'll be moving to rebar. (rebar isn't
actually published at time of writing, but it's essentially ready to
go.)
It's still not as good as it could be, but we add fuzz targets for
regex-lite and DFA deserialization in regex-automata.
This feature makes all of the AST types derive the 'Arbitrary' trait,
which is in turn quite useful for fuzz testing.
Basically, whenever a counted repetition is applied to a sub-expression
that can only ever match the empty string, the counted repetition can be
reduced to 1. We can achieve that optimization very easily via the
Hir::repetition smart constructor.

This is somewhat important to do because otherwise one can write
something like \B{10000}. The higher level infrastructure is somewhat
dumb about this and will happily try to match \B over and over again. We
should probably improve the higher level aspects of this (because this
is not the only case that can cause the same assertions being repeatedly
evaluated at the same position), but this fixes the most obvious ones at
the HIR level.
This makes a couple of the fuzzer targets a bit nicer by just asking for
structured data instead of trying to manifest it ourselves out of a
&[u8].

Closes #821
The fuzzer sometimes runs into situations where it builds regexes that
can take a while to execute, such as `\B{10000}`. They fit within the
default size limit, but the search times aren't great. But it's not a
bug. So try to decrease the size limit a bit to try and prevent
timeouts.

We might consider trying to optimize cases like `\B{10000}`. A naive
optimization would be to remove any redundant conditional epsilon
transitions within a single epsilon closure, but that can be tricky to
do a priori. The case of `\B{100000}` is probably easy to detect, but
they can be arbitrarily complex.

Another way to attack this would be to modify, say, the PikeVM to only
compute whether a conditional epsilon transition should be followed once
per haystack position. Right now, I think it is re-computing them even
though it doesn't have to.
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
This adds a regression test for a bug found in the *old* regex crate
that isn't present with the regex-automata rewrite. I discovered this
while doing differential fuzzing. I didn't do a root cause analysis of
the bug, but my guess is a literal optimization problem.
This makes it a little easier to introspect what the regex crate is
doing. Just pass RUST_LOG=debug to get a general sense of things, and
RUST_LOG=trace to get a lot more output.
... and add some more fuzz testing based on it.

Closes #991
@BurntSushi BurntSushi force-pushed the ag/regex-automata branch 3 times, most recently from 893c7cd to b4a8948 Compare June 20, 2023 01:36
@BurntSushi BurntSushi force-pushed the ag/regex-automata branch 7 times, most recently from b34bf39 to 73e45ea Compare July 4, 2023 22:27
This commit grew into a monster. I ran out of energy trying to split
everything up. For the most part, this commit is about polishing and
writing docs.
I usually close tickets on a commit-by-commit basis, but this refactor
was so big that it wasn't feasible to do that. So ticket closures are
marked here.

Closes #244
Closes #259
Closes #476
Closes #644
Closes #675
Closes #824
Closes #961

Closes #68
Closes #510
Closes #787
Closes #891

Closes #429
Closes #517
Closes #579
Closes #779
Closes #850
Closes #921
Closes #976
Closes #1002

Closes #656
@BurntSushi BurntSushi merged commit aa64e6d into master Jul 5, 2023
15 checks passed
@BurntSushi BurntSushi deleted the ag/regex-automata branch July 5, 2023 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants