rewrite the regex crate #978

Merged
merged 20 commits into from
Jul 5, 2023

Commits on Jun 5, 2023

  1. impl: initial import of regex-automata

    This effectively copies my regex-automata work into this crate and does
    a bunch of rejiggering to make it work. In particular, we wire up its
    new test harness to the public regex crate API. In this commit, that
    means the regex crate API is being simultaneously tested using both the
    old and new test suites.
    
    This does *not* get rid of the old regex crate implementation. That will
    happen in a subsequent commit. This is just a staging commit to prepare
    for that.
    BurntSushi committed Jun 5, 2023
    d0cc048
  2. 48f42e1
  3. scripts: remove 'frequencies' script

    If we need this again, we should just rewrite it in Rust and put it in
    'regex-cli'.
    BurntSushi committed Jun 5, 2023
    3df1b6b
  4. tests: drop old tests

    All of the old tests should be covered by either porting them over
    explicitly, or in the TOML test suite.
    BurntSushi committed Jun 5, 2023
    777ac9b
  5. bench: record last results with old benchmark suite

    We're going to drop the old benchmark suite in favor of rebar, but it's
    worth recording some final results. This ensures we get a fair
    comparison with the regex crate before and after its internals have been
    rewritten.
    BurntSushi committed Jun 5, 2023
    2d0f824
  6. bench: move the old recordings to 'record' directory

    We are going to remove the old benchmark harness, but it seems like a
    good idea to save the old measurements.
    
    In the future, benchmarks will be maintained by rebar:
    https://github.com/BurntSushi/rebar
    BurntSushi committed Jun 5, 2023
    e9ffe3f
  7. bench: remove the old harness

    As stated in a previous commit, we'll be moving to rebar. (rebar isn't
    actually published at time of writing, but it's essentially ready to
    go.)
    BurntSushi committed Jun 5, 2023
    3cc7a3a
  8. api: introduce new regex-lite crate

    Closes #961
    BurntSushi committed Jun 5, 2023
    75366f9
  9. fuzz: improve fuzz testing

    It's still not as good as it could be, but we add fuzz targets for
    regex-lite and DFA deserialization in regex-automata.
    BurntSushi committed Jun 5, 2023
    7cf6102
  10. syntax: add new 'arbitrary' crate feature

    This feature makes all of the AST types derive the 'Arbitrary' trait,
    which is in turn quite useful for fuzz testing.
    BurntSushi committed Jun 5, 2023
    03eb47f
  11. syntax: optimize \B{10000}

    Basically, whenever a counted repetition is applied to a sub-expression
    that can only ever match the empty string, the counted repetition can be
    reduced to 1. We can achieve that optimization very easily via the
    Hir::repetition smart constructor.
    
    This is somewhat important to do because otherwise one can write
    something like \B{10000}. The higher level infrastructure is somewhat
    dumb about this and will happily try to match \B over and over again. We
    should probably improve the higher level aspects of this (because this
    is not the only case that can cause the same assertions to be repeatedly
    evaluated at the same position), but this fixes the most obvious ones at
    the HIR level.
    BurntSushi committed Jun 5, 2023
    385c681
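The reduction described in the `\B{10000}` commit above can be sketched with a toy smart constructor. This is a minimal illustration, not the regex-syntax implementation: `ToyHir`, `only_matches_empty` and the `repetition` signature are hypothetical names invented for this sketch. The only idea taken from the commit message is that a counted repetition over a sub-expression that can only match the empty string needs at most one iteration.

```rust
/// A toy stand-in for a high-level IR. This is NOT the regex-syntax `Hir` type.
#[derive(Clone, Debug, PartialEq, Eq)]
enum ToyHir {
    /// Matches the empty string unconditionally.
    Empty,
    /// A zero-width assertion, e.g. `Look('B')` for `\B`.
    Look(char),
    /// A single literal character.
    Literal(char),
    /// A counted repetition `sub{min,max}`; `max == None` means unbounded.
    Repetition { min: u32, max: Option<u32>, sub: Box<ToyHir> },
}

impl ToyHir {
    /// Returns true if this expression can only ever match the empty string.
    fn only_matches_empty(&self) -> bool {
        match self {
            ToyHir::Empty | ToyHir::Look(_) => true,
            ToyHir::Literal(_) => false,
            ToyHir::Repetition { max, sub, .. } => {
                *max == Some(0) || sub.only_matches_empty()
            }
        }
    }

    /// Hypothetical smart constructor: builds `sub{min,max}`, simplifying
    /// where the repetition count cannot change what is matched.
    fn repetition(mut min: u32, mut max: Option<u32>, sub: ToyHir) -> ToyHir {
        if sub.only_matches_empty() {
            // Repeating an empty-only match more than once is redundant,
            // so cap both bounds at 1.
            min = min.min(1);
            max = Some(max.map_or(1, |n| n.min(1)));
        }
        match (min, max) {
            // `x{0}` is equivalent to the empty regex.
            (0, Some(0)) => ToyHir::Empty,
            // `x{1}` is equivalent to `x` itself.
            (1, Some(1)) => sub,
            _ => ToyHir::Repetition { min, max, sub: Box::new(sub) },
        }
    }
}
```

Under these assumptions, `ToyHir::repetition(10000, Some(10000), ToyHir::Look('B'))` collapses to a single `ToyHir::Look('B')`, while a repetition over a literal is left as written.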
  12. fuzz: use structured fuzzer input

    This makes a couple of the fuzzer targets a bit nicer by just asking for
    structured data instead of trying to manifest it ourselves out of a
    &[u8].
    
    Closes #821
    5225225 authored and BurntSushi committed Jun 5, 2023
    05b50be
  13. fuzz: add size limit to regex building

    The fuzzer sometimes runs into situations where it builds regexes that
    can take a while to execute, such as `\B{10000}`. They fit within the
    default size limit, but the search times aren't great. But it's not a
    bug. So decrease the size limit a bit to try to prevent timeouts.
    
    We might consider trying to optimize cases like `\B{10000}`. A naive
    optimization would be to remove any redundant conditional epsilon
    transitions within a single epsilon closure, but that can be tricky to
    do a priori. The case of `\B{100000}` is probably easy to detect, but
    such cases can be arbitrarily complex.
    
    Another way to attack this would be to modify, say, the PikeVM to only
    compute whether a conditional epsilon transition should be followed once
    per haystack position. Right now, I think it is re-computing them even
    though it doesn't have to.
    BurntSushi committed Jun 5, 2023
    cb436fd
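The second idea in the size-limit commit above, evaluating a conditional epsilon transition only once per haystack position, can be sketched as a one-entry cache. This is a rough illustration under stated assumptions: `LookCache` and its fields are hypothetical names, the assertion is an ASCII-only `\B`, and the cache assumes it is only ever used with a single haystack. It is not how the PikeVM in regex-automata is written.

```rust
/// ASCII-only notion of a word byte, used for this sketch.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Evaluates an ASCII `\B` (not a word boundary) at offset `at`.
fn not_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_word_byte(haystack[at]);
    before == after
}

/// Hypothetical one-entry cache: remembers the result of `\B` for the most
/// recently queried position. Assumes every call uses the same haystack.
struct LookCache {
    cached: Option<(usize, bool)>,
    /// Number of times the assertion was actually evaluated.
    evaluations: usize,
}

impl LookCache {
    fn new() -> LookCache {
        LookCache { cached: None, evaluations: 0 }
    }

    /// Returns the `\B` result at `at`, evaluating it only when the
    /// position differs from the cached one.
    fn not_word_boundary(&mut self, haystack: &[u8], at: usize) -> bool {
        if let Some((pos, value)) = self.cached {
            if pos == at {
                return value;
            }
        }
        self.evaluations += 1;
        let value = not_word_boundary(haystack, at);
        self.cached = Some((at, value));
        value
    }
}
```

With such a cache, following 10,000 chained `\B` transitions at one position would evaluate the assertion once and reuse the result for the remaining lookups.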
  14. fuzz: add syntactic structurally aware fuzzers

    This makes use of the new 'arbitrary' feature in 'regex-syntax' to make
    fuzzing much more targeted and complete.
    
    Closes #848
    addisoncrump authored and BurntSushi committed Jun 5, 2023
    bc1f1af
  15. regex-automata: fix bug in DFA quit behavior

    It turns out that the way we were dealing with quit states in the DFA
    was not quite right. Basically, if we entered a quit state and a match
    had been found, then we were returning the match instead of the error.
    But the match might not be the correct leftmost-first match, and so, we
    really shouldn't return it. Otherwise a regex like '\B.*' could match
    much less than it should.
    
    This was caught by a differential fuzzer developed in #848.
    BurntSushi committed Jun 5, 2023
    9afca7d
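The corrected quit behavior described in the commit above can be sketched with a toy anchored search loop. Everything here is hypothetical and for illustration only: `State`, `QuitError`, `step` and `find_end` are invented names, the automaton recognizes `a+` at the start of the haystack, and the assumption that any non-ASCII byte triggers a quit is a simplification. The point shown is that entering a quit state reports an error even when a match was already recorded, since that match may not be the correct leftmost-first one.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Scanning,
    Matched,
    Dead,
    Quit,
}

/// Error reported when the search gives up at `offset`.
#[derive(Debug, PartialEq, Eq)]
struct QuitError {
    offset: usize,
}

/// Toy transition function for an anchored `a+`.
/// Assumption for this sketch: every non-ASCII byte leads to the quit state.
fn step(state: State, byte: u8) -> State {
    if !byte.is_ascii() {
        return State::Quit;
    }
    match (state, byte) {
        (State::Scanning, b'a') | (State::Matched, b'a') => State::Matched,
        _ => State::Dead,
    }
}

/// Returns the end offset of the match, if any, or an error on quit.
fn find_end(haystack: &[u8]) -> Result<Option<usize>, QuitError> {
    let mut state = State::Scanning;
    let mut last_match = None;
    for (i, &byte) in haystack.iter().enumerate() {
        state = step(state, byte);
        match state {
            State::Matched => last_match = Some(i + 1),
            // No further match is possible; report what was found so far.
            State::Dead => return Ok(last_match),
            // Deliberately ignore `last_match`: the search could not be
            // completed, so the recorded match cannot be trusted.
            State::Quit => return Err(QuitError { offset: i }),
            State::Scanning => {}
        }
    }
    Ok(last_match)
}
```

In this sketch, `find_end(b"aab")` yields `Ok(Some(2))`, while `find_end("aaé".as_bytes())` yields `Err(QuitError { offset: 2 })` instead of the partial match. A caller would then typically retry with an engine that can handle the input.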
  16. fuzz: add regression test

    This adds a regression test for a bug found in the *old* regex crate
    that isn't present with the regex-automata rewrite. I discovered this
    while doing differential fuzzing. I didn't do a root cause analysis of
    the bug, but my guess is a literal optimization problem.
    BurntSushi committed Jun 5, 2023
    7bcca88
  17. fuzz: add logging support to the fuzzer

    This makes it a little easier to introspect what the regex crate is
    doing. Just pass RUST_LOG=debug to get a general sense of things, and
    RUST_LOG=trace to get a lot more output.
    BurntSushi committed Jun 5, 2023
    4d4f011
  18. fuzz: improve Arbitrary impl for Unicode classes

    ... and add some more fuzz testing based on it.
    
    Closes #991
    addisoncrump authored and BurntSushi committed Jun 5, 2023
    e34b914

Commits on Jul 5, 2023

  1. *: lots of polish

    This commit grew into a monster. I ran out of energy trying to split
    everything up. For the most part, this commit is about polishing and
    writing docs.
    BurntSushi committed Jul 5, 2023
    e96bf8f
  2. changelog: 1.9.0

    I usually close tickets on a commit-by-commit basis, but this refactor
    was so big that it wasn't feasible to do that. So ticket closures are
    marked here.
    
    Closes #244
    Closes #259
    Closes #476
    Closes #644
    Closes #675
    Closes #824
    Closes #961
    
    Closes #68
    Closes #510
    Closes #787
    Closes #891
    
    Closes #429
    Closes #517
    Closes #579
    Closes #779
    Closes #850
    Closes #921
    Closes #976
    Closes #1002
    
    Closes #656
    BurntSushi committed Jul 5, 2023
    8513751