rewrite the regex crate #978
Commits on Jun 5, 2023
-
impl: initial import of regex-automata
This effectively copies my regex-automata work into this crate and does a bunch of rejiggering to make it work. In particular, we wire up its new test harness to the public regex crate API. In this commit, that means the regex crate API is being simultaneously tested using both the old and new test suites. This does *not* get rid of the old regex crate implementation. That will happen in a subsequent commit. This is just a staging commit to prepare for that.
Commit d0cc048
-
Commit 48f42e1
-
scripts: remove 'frequencies' script
If we need this again, we should just rewrite it in Rust and put it in 'regex-cli'.
Commit 3df1b6b
-
All of the old tests should be covered by either porting them over explicitly, or in the TOML test suite.
Commit 777ac9b
-
bench: record last results with old benchmark suite
We're going to drop the old benchmark suite in favor of rebar, but it's worth recording some final results. This ensures we get a fair comparison with the regex crate before and after its internals have been rewritten.
Commit 2d0f824
-
bench: move the old recordings to 'record' directory
We are going to remove the old benchmark harness, but it seems like a good idea to save the old measurements. In the future, benchmarks will be maintained by rebar: https://github.com/BurntSushi/rebar
Commit e9ffe3f
-
As stated in a previous commit, we'll be moving to rebar. (rebar isn't actually published at time of writing, but it's essentially ready to go.)
Commit 3cc7a3a
-
Commit 75366f9
-
It's still not as good as it could be, but we add fuzz targets for regex-lite and DFA deserialization in regex-automata.
Commit 7cf6102
-
syntax: add new 'arbitrary' crate feature
This feature makes all of the AST types derive the 'Arbitrary' trait, which is in turn quite useful for fuzz testing.
Commit 03eb47f
-
Basically, whenever a counted repetition is applied to a sub-expression that can only ever match the empty string, the repetition count can be capped at 1. We can achieve that optimization very easily via the Hir::repetition smart constructor. This is somewhat important to do because otherwise one can write something like \B{10000}. The higher level infrastructure is somewhat dumb about this and will happily try to match \B over and over again. We should probably improve the higher level aspects of this (because this is not the only case that can cause the same assertions to be repeatedly evaluated at the same position), but this fixes the most obvious ones at the HIR level.
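The rewrite described above can be sketched with a toy HIR. This is only an illustration under stated assumptions: the `Hir` enum, its variants, and `only_matches_empty` below are hypothetical stand-ins, not the actual regex-syntax types, and only the idea of a `repetition` smart constructor comes from the commit message.

```rust
/// A toy HIR, far smaller than the real one in regex-syntax (hypothetical).
#[derive(Clone, Debug, PartialEq)]
enum Hir {
    Empty,
    Literal(char),
    /// Stands in for the `\B` assertion.
    NotWordBoundary,
    Repetition { min: u32, max: Option<u32>, sub: Box<Hir> },
}

impl Hir {
    /// Returns true when this expression can only ever match the empty string.
    fn only_matches_empty(&self) -> bool {
        match self {
            Hir::Empty | Hir::NotWordBoundary => true,
            Hir::Literal(_) => false,
            Hir::Repetition { max, sub, .. } => {
                *max == Some(0) || sub.only_matches_empty()
            }
        }
    }

    /// Smart constructor: repeating something that only matches the empty
    /// string more than once adds nothing, so both bounds are capped at 1.
    fn repetition(mut min: u32, mut max: Option<u32>, sub: Hir) -> Hir {
        if sub.only_matches_empty() {
            min = min.min(1);
            max = Some(max.map_or(1, |n| n.min(1)));
        }
        match (min, max) {
            // Zero repetitions of anything is the empty expression.
            (0, Some(0)) => Hir::Empty,
            // Exactly one repetition is just the sub-expression itself.
            (1, Some(1)) => sub,
            _ => Hir::Repetition { min, max, sub: Box::new(sub) },
        }
    }
}
```

In this sketch, `Hir::repetition(10000, Some(10000), Hir::NotWordBoundary)` collapses to a single `Hir::NotWordBoundary`, while a repetition over a literal is left unchanged.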
Commit 385c681
-
fuzz: use structured fuzzer input
This makes a couple of the fuzzer targets a bit nicer by just asking for structured data instead of trying to manifest it ourselves out of a &[u8]. Closes #821
Commit 05b50be
-
fuzz: add size limit to regex building
The fuzzer sometimes runs into situations where it builds regexes that can take a while to execute, such as `\B{10000}`. They fit within the default size limit, but the search times aren't great. But it's not a bug. So try to decrease the size limit a bit to try and prevent timeouts. We might consider trying to optimize cases like `\B{10000}`. A naive optimization would be to remove any redundant conditional epsilon transitions within a single epsilon closure, but that can be tricky to do a priori. The case of `\B{100000}` is probably easy to detect, but they can be arbitrarily complex. Another way to attack this would be to modify, say, the PikeVM to only compute whether a conditional epsilon transition should be followed once per haystack position. Right now, I think it is re-computing them even though it doesn't have to.
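The second idea above, evaluating a conditional epsilon transition only once per haystack position, might look roughly like the following. This is a sketch under assumptions, not how the PikeVM is implemented: `LookCache`, its fields, and the ASCII-only word boundary check are all hypothetical names introduced for illustration.

```rust
/// Caches the result of the `\B` assertion for one haystack position
/// (hypothetical helper, not part of regex-automata).
struct LookCache {
    /// The position the cached value belongs to.
    at: usize,
    /// Cached result of `\B` at `at`, if already computed.
    not_word_boundary: Option<bool>,
    /// Counts how often the assertion was actually evaluated.
    evaluations: usize,
}

impl LookCache {
    fn new() -> LookCache {
        LookCache { at: usize::MAX, not_word_boundary: None, evaluations: 0 }
    }

    /// Returns whether `\B` holds at `at`, computing it at most once
    /// for as long as the position does not change.
    fn not_word_boundary(&mut self, haystack: &[u8], at: usize) -> bool {
        if self.at != at {
            // Moving to a new position invalidates the cached value.
            self.at = at;
            self.not_word_boundary = None;
        }
        if let Some(cached) = self.not_word_boundary {
            return cached;
        }
        self.evaluations += 1;
        let result = !is_word_boundary(haystack, at);
        self.not_word_boundary = Some(result);
        result
    }
}

/// ASCII-only notion of a word byte, used to keep the sketch small.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// A word boundary exists where exactly one neighbor is a word byte.
fn is_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_word_byte(haystack[at]);
    before != after
}
```

With a cache like this, following the 10000 copies of `\B` in `\B{10000}` at one position would cost one real evaluation plus cheap lookups, at the price of resetting the cache whenever the search advances.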
Commit cb436fd
-
fuzz: add syntactic structurally aware fuzzers
This makes use of the new 'arbitrary' feature in 'regex-syntax' to make fuzzing much more targeted and complete. Closes #848
Commit bc1f1af
-
regex-automata: fix bug in DFA quit behavior
It turns out that the way we were dealing with quit states in the DFA was not quite right. Basically, if we entered a quit state and a match had been found, then we were returning the match instead of the error. But the match might not be the correct leftmost-first match, and so, we really shouldn't return it. Otherwise a regex like '\B.*' could match much less than it should. This was caught by a differential fuzzer developed in #848.
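The corrected behavior can be illustrated with a toy search loop. This is a minimal sketch, assuming a simplified model in which the search is given the sequence of DFA states it visits; `State`, `QuitError`, and `search` are hypothetical names and not the regex-automata API.

```rust
/// The kinds of DFA states the toy search distinguishes (hypothetical).
#[derive(Debug, PartialEq)]
enum State {
    Normal,
    Match,
    Dead,
    Quit,
}

/// Reported when the DFA gives up; the caller is expected to retry
/// with a different regex engine.
#[derive(Debug, PartialEq)]
struct QuitError {
    offset: usize,
}

/// Walks the visited states and reports the last match offset seen.
///
/// Entering a quit state is an error even if a match was already
/// recorded, because that match is not guaranteed to be the correct
/// leftmost-first match.
fn search(visited: &[State]) -> Result<Option<usize>, QuitError> {
    let mut last_match = None;
    for (offset, state) in visited.iter().enumerate() {
        match state {
            State::Normal => {}
            State::Match => last_match = Some(offset),
            // A dead state means no further match is possible, so the
            // last recorded match (if any) is final.
            State::Dead => return Ok(last_match),
            // The buggy behavior would have returned `Ok(last_match)`
            // here whenever `last_match` was `Some`.
            State::Quit => return Err(QuitError { offset }),
        }
    }
    Ok(last_match)
}
```

In this model, a run that records a match and then hits a quit state yields an error instead of the possibly too-short match, while hitting a dead state still returns the recorded match.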
Commit 9afca7d
-
This adds a regression test for a bug found in the *old* regex crate that isn't present with the regex-automata rewrite. I discovered this while doing differential fuzzing. I didn't do a root cause analysis of the bug, but my guess is a literal optimization problem.
Commit 7bcca88
-
fuzz: add logging support to the fuzzer
This makes it a little easier to introspect what the regex crate is doing. Just pass RUST_LOG=debug to get a general sense of things, and RUST_LOG=trace to get a lot more output.
Commit 4d4f011
-
fuzz: improve Arbitrary impl for Unicode classes
... and add some more fuzz testing based on it. Closes #991
Commit e34b914
Commits on Jul 5, 2023
-
This commit grew into a monster. I ran out of energy trying to split everything up. For the most part, this commit is about polishing and writing docs.
Commit e96bf8f
-
I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here.
Closes #244
Closes #259
Closes #476
Closes #644
Closes #675
Closes #824
Closes #961
Closes #68
Closes #510
Closes #787
Closes #891
Closes #429
Closes #517
Closes #579
Closes #779
Closes #850
Closes #921
Closes #976
Closes #1002
Closes #656
Commit 8513751