first phase of migrating to regex-automata #977

Merged
merged 79 commits into from
Apr 17, 2023

Commits on Apr 17, 2023

  1. msrv: set to Rust 1.60.0

    This sets 'rust-version' to 1.60 and also increases the pinned Rust
    version that we test against in CI to 1.60.0.
    
    Rust 1.60.0 was released over a year ago and contains some important
    stuff. Notably, it includes namespaced and weak dependency features that
    are used in the (soon to be) released aho-corasick 1.0. They will also
    be extensively used in regex-automata 0.3, which is coming to a
    rust-lang/regex repository near you Real Soon Now.
    BurntSushi committed Apr 17, 2023
    f43d745
  2. capi: add missing void

    Apparently in C, an empty parameter list means "the function takes an
    unspecified number of arguments." (lol.) But an explicit void means
    "the function takes zero arguments." The latter is indeed what we want
    here.
    
    Ref: https://softwareengineering.stackexchange.com/questions/286490/what-is-the-difference-between-function-and-functionvoid
    
    Closes #942
    thechampagne authored and BurntSushi committed Apr 17, 2023
    b68896d
  3. api: impl Default for RegexSet

    This is justified by the fact that a RegexSet is, after all, a set. And
    a set has a very obvious default value: the empty set. Plus, this is
    exactly what you get by passing a default `Vec` or an empty iterator to
    the `RegexSet::new` constructor.
    
    We specifically do not add a `Default` impl for Regex because it has no
    obvious default value.
    
    Fixes #905, Closes #906
    sourcefrog authored and BurntSushi committed Apr 17, 2023
    caf0141
  4. regex-debug: this removes regex-debug

    There will be a new 'regex-cli' tool that will supplant this (and more).
    BurntSushi committed Apr 17, 2023
    544374b
  5. syntax: \p{Sc} should map to \p{Currency_Symbol}

    'sc' refers to the 'Currency_Symbol' general category, but is also
    the abbreviation for the 'Script' property. So when going through the
    canonicalization process, it would get normalized to 'Script' before
    being checked as a general category. We fix it by special casing it.
    
    See also #719
    
    Fixes #835, #899
    snsmac authored and BurntSushi committed Apr 17, 2023
    345f18a
  6. syntax: \p{Lc} should map to \p{Cased_Letter}

    This is more similar to the \p{Cf} bug than the \p{Sc} bug, but
    basically, 'lc' is an abbreviation for both 'Cased_Letter' and
    'Lowercase_Mapping'. Since we don't support the latter (currently), we
    make 'lc' map to 'Cased_Letter'.
    
    If we do ever add 'Lowercase_Mapping' in the future, then we will just
    require users to type out its full form.
    
    Fixes #965
    BurntSushi committed Apr 17, 2023
    6bbb064
  7. syntax: add 'try_case_fold_simple' to 'Class'

    Previously this was only defined on 'ClassUnicode', but since 'Class'
    might contain a 'ClassUnicode', it should be defined here too.
    
    We don't need to update any call sites since this crate doesn't
    actually use 'Class::case_fold_simple' directly, and instead
    manipulates the underlying 'ClassUnicode' or 'ClassBytes'.
    BurntSushi committed Apr 17, 2023
    906d149
  8. syntax: switch to Rust 2021

    This effectively bumps the MSRV of 'regex' to Rust 1.56, which was
    released in Oct 2021. It's not quite a year at the time of writing, but
    I expect it will be a year by the time this change is released.
    BurntSushi committed Apr 17, 2023
    f59ebfa
  9. syntax: remove all uses of 'as'

    It turns out that all uses of 'as' in the regex-syntax crate can be
    replaced with either explicitly infallible routines (like
    'u32::from(char)'), or with routines that will panic on failure. These
    panics are strictly better than truncating casts that might otherwise
    lead to subtle bugs in the context of this crate. (Namely, we never
    really care about the perf effects here, since regex parsing is just
    never a bottleneck.)
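
As a rough illustration of the trade-off described above (this is not code from regex-syntax, and the helper names are made up for the sketch), compare an infallible conversion, a panicking conversion, and a truncating 'as' cast:

```rust
use std::convert::TryFrom;

/// Infallible widening: every `char` is a valid `u32`, so no cast is needed.
fn scalar_value(c: char) -> u32 {
    u32::from(c)
}

/// Fallible narrowing that panics instead of silently truncating.
/// `to_byte` is a hypothetical helper name used only in this sketch.
fn to_byte(n: u32) -> u8 {
    u8::try_from(n).expect("value must fit in a single byte")
}

/// What an `as` cast does instead: it keeps only the low 8 bits.
fn truncating_cast(n: u32) -> u8 {
    n as u8
}
```

With the 'as' version, a snowman (U+2603) quietly becomes the byte 0x03, which is exactly the kind of subtle bug a panic would surface immediately.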
    BurntSushi committed Apr 17, 2023
    a23911d
  10. syntax: remove 'std::error::Error::description' impls

    This method was deprecated a while ago, but we kept it around because it
    wasn't worth a breaking release to remove it.
    
    This also simplifies some imports.
    BurntSushi committed Apr 17, 2023
    b147fe3
  11. syntax: remove '__Nonexhaustive' hack, use #[non_exhaustive]

    This marks the various error types as '#[non_exhaustive]' instead of
    using a __Nonexhaustive variant hack.
    
    Closes #884
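
For readers unfamiliar with the two approaches, here is a minimal sketch of the difference. The enum and variant names are invented for illustration and are not the actual error types in regex-syntax:

```rust
/// The old hack: a hidden variant that callers are asked never to match on.
#[derive(Debug)]
pub enum OldErrorKind {
    UnclosedGroup,
    #[doc(hidden)]
    __Nonexhaustive,
}

/// The replacement: the attribute forces downstream crates to add a
/// wildcard arm, so new variants can be added without a breaking release.
#[derive(Debug)]
#[non_exhaustive]
pub enum NewErrorKind {
    UnclosedGroup,
}

/// Downstream-style matching. The wildcard arm is what keeps this compiling
/// when variants are added later. (Inside the defining crate the compiler
/// does not require it, hence the `allow`.)
#[allow(unreachable_patterns)]
pub fn describe(kind: &NewErrorKind) -> &'static str {
    match kind {
        NewErrorKind::UnclosedGroup => "unclosed group",
        _ => "unknown error",
    }
}
```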
    BurntSushi committed Apr 17, 2023
    06df9ac
  12. syntax: permit empty character classes

    An empty character class is effectively a way to write something that
    can never match anything. The regex crate has pretty much always
    returned an error for such things because it was never taught how to
    handle "always fail" states. Partly because I just didn't think about it
    when initially writing the regex engines and partly because it isn't
    often useful.
    
    With that said, it should be supported for completeness and because
    there is no real reason to not support it. Moreover, it can be useful in
    certain contexts where regexes are generated and you want to insert an
    expression that can never match. It's somewhat contrived, but it
    happens when the interface is a regex pattern.
    
    Previously, the ban on empty character classes was implemented in the
    regex-syntax crate. But with the rewrite in #656 getting closer and
    closer to landing, it's now time to relax this restriction. However, we
    do keep the overall restriction in the 'regex' API by returning an error
    in the NFA compiler. Once #656 is done, the new regex engines will
    permit this case.
    BurntSushi committed Apr 17, 2023
    5a770dc
  13. syntax: reject '(?-u)\W' when UTF-8 mode is enabled

    When Unicode mode is disabled (i.e., (?-u)), the Perl character classes
    (\w, \d and \s) revert to their ASCII definitions. The negated forms
    of these classes are also derived from their ASCII definitions, and this
    means that they may actually match bytes outside of ASCII and thus
    possibly invalid UTF-8. For this reason, when the translator is
    configured to only produce HIR that matches valid UTF-8, '(?-u)\W'
    should be rejected.
    
    Previously, it was not being rejected, which could actually lead to
    matches that produced offsets that split codepoints, and thus lead to
    panics when match offsets are used to slice a string. For example, this
    code
    
      fn main() {
          let re = regex::Regex::new(r"(?-u)\W").unwrap();
          let haystack = "☃";
          if let Some(m) = re.find(haystack) {
              println!("{:?}", &haystack[m.range()]);
          }
      }
    
    panics with
    
      byte index 1 is not a char boundary; it is inside '☃' (bytes 0..3) of `☃`
    
    That is, it reports a match at 0..1, which is technically correct, but
    the regex itself should have been rejected in the first place since the
    top-level Regex API always has UTF-8 mode enabled.
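
The panic itself can be reproduced with the standard library alone. The sketch below (the function name is hypothetical) shows why an offset of 1 is not usable for slicing "☃", and how a checked slice reports that instead of panicking:

```rust
/// Returns the slice for `start..end`, or `None` if either offset would
/// split a UTF-8 encoded codepoint (the case that makes `&haystack[..]` panic).
fn checked_slice(haystack: &str, start: usize, end: usize) -> Option<&str> {
    haystack.get(start..end)
}
```

The snowman is three bytes long in UTF-8, so offsets 1 and 2 fall inside the codepoint; only 0 and 3 are valid boundaries.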
    
    Also, many of the replacement tests were using '(?-u)\W' (or similar)
    for some reason. I'm not sure why, so I just removed the '(?-u)' to make
    those tests pass. Whether Unicode is enabled or not doesn't seem to be
    an interesting detail for those tests. (All haystacks and replacements
    appear to be ASCII.)
    
    Fixes #895, Partially addresses #738
    BurntSushi committed Apr 17, 2023
    2b2e20a
  14. syntax: add 'std' feature

    In effect, this adds support for no_std by depending on only core and
    alloc. There is still currently some benefit to enabling std support,
    namely, getting the 'std::error::Error' trait impls for the various
    error types. (Although, it seems like the 'Error' trait is going to get
    moved to 'core' finally.) Otherwise, the only 'std' things we use are in
    tests for tweaking stack sizes.
    
    This is the first step in an effort to make 'regex' itself work without
    depending on 'std'. 'regex' itself will be more precarious since it uses
    things like HashMap and Mutex that we'll need to find a way around.
    Getting around HashMap is easy (just use BTreeMap), but figuring out how
    to synchronize the threadpool will be interesting.
    
    Ref #476, Ref #477
    BurntSushi committed Apr 17, 2023
    377232b
  15. syntax: enable 'doc_auto_cfg'

    I wish this feature were stable and enabled by default. I suspect that
    it maybe doesn't work correctly 100% of the time, but it's super useful.
    And manually annotating APIs is a huge pain, so it's worth at least
    attempting.
    BurntSushi committed Apr 17, 2023
    5d9746d
  16. syntax: switch to rustdoc intra links

    Get rid of those old crusty HTML links!
    
    Also, if an intradoc link is used that is bunk, fail the build.
    BurntSushi committed Apr 17, 2023
    7bd2d9a
  17. syntax: simplify hir::GroupKind

    I'm not sure exactly why I used three variants instead of two like how
    I've defined it in this patch. Possibly because the AST uses three
    variants? (The AST needs to do a little more work to store a span
    associated with where the name actually is in the expression, so it
    maybe makes a little more sense there.)
    
    In any case, this is the first step of many in simplifying the HIR.
    BurntSushi committed Apr 17, 2023
    52d5393
  18. syntax: remove WordBoundary::is_negated method

    This is apparently not used anywhere. So drop it.
    
    Also motivated by wanting to squash look-around assertions into a single
    enum. So 'is_negated' won't make sense on its own anymore.
    BurntSushi committed Apr 17, 2023
    00ea571
  19. syntax: flatten look-around assertions

    Instead of having both 'HirKind::Anchor' and 'HirKind::WordBoundary',
    this patch flattens them into one 'HirKind::Look'.
    
    Why do this? I think they make more sense grouped together. Namely, they
    are all simplistic look-around assertions and they all tend to be
    handled with very similar logic.
    BurntSushi committed Apr 17, 2023
    1f707e7
  20. syntax: simplify hir::Repetition

    This greatly simplifies how repetitions are represented in the HIR from
    a sprawling set of variants down to just a simple `(u32, Option<u32>)`.
    This is much simpler and still permits us to specialize all of the cases
    we did before if necessary.
    
    This also simplifies some of the HIR printer's output. e.g., 'a{1}' is
    just 'a'.
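
To make the `(u32, Option<u32>)` idea concrete, here is a small sketch of how the usual repetition operators could map onto a minimum and an optional maximum. The `Bounds` type and the function names are assumptions made for this example and are not the crate's actual API:

```rust
/// Hypothetical stand-in for the simplified representation: a minimum count
/// and an optional maximum (`None` means unbounded).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Bounds {
    min: u32,
    max: Option<u32>,
}

/// Maps the shorthand repetition operators onto `(min, max)` pairs.
fn bounds_for(op: &str) -> Option<Bounds> {
    match op {
        "?" => Some(Bounds { min: 0, max: Some(1) }),
        "*" => Some(Bounds { min: 0, max: None }),
        "+" => Some(Bounds { min: 1, max: None }),
        _ => None,
    }
}

/// `a{n}` is exactly `n` repetitions.
fn exactly(n: u32) -> Bounds {
    Bounds { min: n, max: Some(n) }
}

/// `a{n,}` is at least `n` repetitions.
fn at_least(n: u32) -> Bounds {
    Bounds { min: n, max: None }
}

/// `a{n,m}` is between `n` and `m` repetitions, inclusive.
fn between(n: u32, m: u32) -> Bounds {
    Bounds { min: n, max: Some(m) }
}
```

Every case that previously needed its own variant is still recoverable by inspecting the pair, e.g. `min == 0 && max == None` for '*'.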
    BurntSushi committed Apr 17, 2023
    aa0c117
  21. syntax: fix HIR printer

    This fixes some corner cases in the HIR printer where it would print the
    concrete syntax of a regex that does not match the natural
    interpretation of the HIR. One such example of this is:
    
        concat(a, alt(b, c))
    
    This would get printed as
    
        ab|c
    
    But clearly, it should be printed as:
    
        a(?:b|c)
    
    The issue here is that the printer only considers the current HirKind
    when determining how to print it. Sometimes a group is needed to print
    an alt (and even a concat, in the case of 'rep(+, concat(a, b))'), but
    sometimes it isn't.
    
    We could address this in a few different ways:
    
    1) Always print concats and alts inside a non-capturing group.
    2) Make the printer detect precisely the cases where a non-capturing
       group is needed.
    3) Make the HIR smart constructors insert non-capturing groups when
       needed.
    4) Do some other thing to change the HIR to prevent these sorts of
       things by construction.
    
    This patch goes with (1). The reason in favor of it is that HIR printer
    was always about printing an equivalent regex and never about trying to
    print a "nice" regex. Indeed, the HIR printer can't print a nice regex,
    because the HIR represents a rigorously simplified view of a regex to
    make analysis easier. (The most obvious such example is Unicode
    character classes. For example, the HIR printer never prints '\w'.) So
    inserting some extra groups (which it already does) even when they
    aren't strictly needed is perfectly okay.
    
    But still, it's useful to say why we didn't do the other choices:
    
    2) Modifying the printer to only print groups when they're actually
       needed is pretty difficult. I tried this briefly, and handling this
       case requires some notion of what the parent expression is. This
       winds up being a possible but hairy change.
    3) Making the HIR more complicated to make the printer correct seems
       like it's optimizing for the wrong thing. Inserting extra groups in
       places just obfuscates HIR values that already have clear semantics.
       That is, use concat(a, alt(b, c)) over concat(a, group(alt(b, c))).
    4) It's not clear how we would change the HIR to guarantee this sort of
       thing wouldn't happen. At the very least, it seems likely it would
       require a more complex data type.
    
    At first, I had thought (1) seemed inelegant. But the more I thought
    about it, the more it seemed quite consistent with how the HIR printer
    already worked. So that's the path I took here.
    
    Closes #516, Closes #731
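
A minimal sketch of option (1), assuming a toy three-variant HIR that is far simpler than the real one (the type and function names are invented here, and literal escaping is ignored):

```rust
/// A tiny stand-in for the HIR, with just enough structure for the example.
enum Hir {
    Literal(char),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
}

/// Prints a regex equivalent to `hir`, wrapping every concatenation and
/// every alternation in a non-capturing group, whether or not it is needed.
fn to_pattern(hir: &Hir) -> String {
    match *hir {
        Hir::Literal(c) => c.to_string(),
        Hir::Concat(ref subs) => {
            let parts: Vec<String> = subs.iter().map(to_pattern).collect();
            format!("(?:{})", parts.concat())
        }
        Hir::Alt(ref subs) => {
            let parts: Vec<String> = subs.iter().map(to_pattern).collect();
            format!("(?:{})", parts.join("|"))
        }
    }
}
```

For concat(a, alt(b, c)) this yields '(?:a(?:b|c))'. The outer group is redundant, but the output is always equivalent to the HIR, and the printer never needs to know anything about the parent expression.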
    BurntSushi committed Apr 17, 2023
    6e59f32
  22. syntax: 'a{0}' should compile to Hir::empty

    No matter what 'a' is, 'a{0}' is always equivalent to an empty regex.
    BurntSushi committed Apr 17, 2023
    62802fa
  23. syntax: switch to 'Vec<u8>' to represent literals

    This gets rid of the old 'Literal' type:
    
      enum Literal {
        Unicode(char),
        Byte(u8),
      }
    
    and replaces it with
    
      struct Literal(Box<[u8]>);
    
    I did this primarily because I perceive the new version to be a bit
    simpler, and it is very likely to be more space efficient given some of the
    changes I have in mind (upcoming in subsequent commits). Namely, I want
    to include more analysis information beyond just simply booleans, and
    this means using up more space. Putting that analysis information on
    every single byte/char seems gratuitous. But putting it on every single
    sequence of byte/chars seems more justifiable.
    
    I also have a hand-wavy idea that this might make analysis a bit easier.
    And another hand-wavy idea that debug-printing such an HIR will make it
    a bit more comprehensible.
    
    Overall, this isn't a completely obvious win and I do wonder whether
    I'll regret this. For one thing, the translator is now a fair bit
    more complicated in exchange for not creating a 'Vec<u8>' for every
    'ast::Literal' node.
    
    This also gives up the Unicode vs byte distinction and just commits to "all
    bytes." Instead, we do a UTF-8 check on every 'Hir::literal' call, and
    that in turn sets the UTF-8 property. This does seem a bit wasteful, and
    indeed, we do another UTF-8 check in the compiler (even though we could
    use 'unsafe' correctly and avoid it). However, once the new NFA compiler
    lands from regex-automata, it operates purely in byte-land and will not
    need to do another UTF-8 check. Moreover, a UTF-8 check, even on every
    literal, is likely barely measurable in the grand scheme of things.
    
    I do also worry that this is overwrought. In particular, the AST creates
    a node for each character. Then the HIR smooths them out to sequences of
    characters (that is, Vec<u8>). And then NFA compilation splits them back
    out into states where a state handles at most one character (or range of
    characters). But, I am taking somewhat of a leap-of-judgment here that
    this will make analysis easier and will overall use less space. But
    we'll see.
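
As a rough sketch of the "all bytes plus a UTF-8 check" approach described above: the `literal` function below is a hypothetical stand-in for a smart constructor, not the real 'Hir::literal', and it returns the UTF-8 flag directly rather than storing it in a properties value.

```rust
/// Stand-in for the new literal representation: an owned run of bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Literal(Box<[u8]>);

/// Hypothetical constructor that also reports whether the bytes are valid
/// UTF-8. This is the check that would set the UTF-8 property.
fn literal(bytes: &[u8]) -> (Literal, bool) {
    let is_utf8 = std::str::from_utf8(bytes).is_ok();
    (Literal(bytes.to_vec().into_boxed_slice()), is_utf8)
}
```

A literal such as "☃" is stored as its three UTF-8 bytes and flagged as valid UTF-8, while a lone '\xFF' byte is stored the same way but flagged as not UTF-8.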
    BurntSushi committed Apr 17, 2023
    9c2b01e
  24. syntax: improve Debug impls

    This makes the Debug impls for Literal and ClassRangeBytes a bit better.
    The former in particular. Instead of just printing a sequence of decimal
    numbers, we now print them as characters.
    
    Given the lackluster support for Vec<u8> as a string in the standard
    library, we copy a little bit of code from regex-automata to make the
    debug print for the Vec<u8> basically as nice as a String.
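
The code copied from regex-automata is not shown here. The following is only a sketch of the general idea, written against the standard library, with a made-up `Bytes` wrapper: print valid UTF-8 as text and fall back to hex escapes for bytes that are not.

```rust
use std::fmt;

/// Wrapper that debug-prints a byte slice as text where possible.
struct Bytes<'a>(&'a [u8]);

impl<'a> fmt::Debug for Bytes<'a> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "\"")?;
        let mut rest = self.0;
        while !rest.is_empty() {
            match std::str::from_utf8(rest) {
                Ok(s) => {
                    // Everything left is valid UTF-8: print it and stop.
                    write!(f, "{}", s.escape_debug())?;
                    break;
                }
                Err(err) => {
                    let (valid, after) = rest.split_at(err.valid_up_to());
                    // `valid` is guaranteed to be UTF-8 by `valid_up_to`.
                    let s = std::str::from_utf8(valid).unwrap();
                    write!(f, "{}", s.escape_debug())?;
                    // Print the first offending byte as a hex escape and
                    // continue after it.
                    write!(f, "\\x{:02X}", after[0])?;
                    rest = &after[1..];
                }
            }
        }
        write!(f, "\"")
    }
}
```

With this, a literal prints like a string rather than a list of decimal numbers, and invalid bytes remain visible as escapes.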
    BurntSushi committed Apr 17, 2023
    9f6f367
  25. syntax: replace HirInfo with new Properties type

    This commit completely rewrites how HIR properties are computed
    inductively.
    
    Firstly, 'Properties' is now boxed, so that it contributes less space to
    each HIR value. This does add an allocation for each HIR expression, but
    most HIR expressions already require at least one alloc anyway. And
    there should be far fewer of them now that we collapse literals
    together.
    
    Secondly, 'Properties' now computes far more general attributes instead
    of hyper-specific things. For example, instead of 'is_match_empty', we
    now have 'minimum_len' and 'maximum_len'. Similarly, instead of
    'is_anchored_start' and 'is_anchored_end', we now compute sets of
    look-around assertions found anywhere, only as a prefix and only as a
    suffix.
    
    We also remove 'is_line_anchored_{start,end}'. These were only used in
    the 'grep-regex' crate and they seem unnecessary. They were otherwise
    fairly weird properties to compute.
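
To illustrate what computing such properties inductively can look like, here is a sketch over a toy HIR. The types, the use of plain `usize` for the minimum, and the handling of edge cases are all assumptions of this example; the real 'Properties' type covers more cases.

```rust
/// Minimal HIR stand-in; names are illustrative, not the crate's API.
enum Hir {
    Empty,
    Literal(Vec<u8>),
    Concat(Vec<Hir>),
    /// Assumed non-empty in this sketch.
    Alt(Vec<Hir>),
    /// `sub` repeated at least `min` and at most `max` times (`None` = no limit).
    Repeat { min: usize, max: Option<usize>, sub: Box<Hir> },
}

/// Smallest number of bytes any match can have.
fn minimum_len(hir: &Hir) -> usize {
    match *hir {
        Hir::Empty => 0,
        Hir::Literal(ref b) => b.len(),
        Hir::Concat(ref subs) => subs.iter().map(minimum_len).sum::<usize>(),
        Hir::Alt(ref subs) => subs.iter().map(minimum_len).min().unwrap_or(0),
        Hir::Repeat { min, ref sub, .. } => min * minimum_len(sub),
    }
}

/// Largest number of bytes any match can have, or `None` if unbounded.
fn maximum_len(hir: &Hir) -> Option<usize> {
    match *hir {
        Hir::Empty => Some(0),
        Hir::Literal(ref b) => Some(b.len()),
        Hir::Concat(ref subs) => {
            let mut total = 0usize;
            for sub in subs {
                total += maximum_len(sub)?;
            }
            Some(total)
        }
        Hir::Alt(ref subs) => {
            let mut longest = 0usize;
            for sub in subs {
                longest = longest.max(maximum_len(sub)?);
            }
            Some(longest)
        }
        Hir::Repeat { max, ref sub, .. } => match (max, maximum_len(sub)) {
            // Zero repetitions, or repeating something of length zero.
            (Some(0), _) | (_, Some(0)) => Some(0),
            (Some(n), Some(len)) => Some(n * len),
            _ => None,
        },
    }
}
```

The old 'is_match_empty' question then falls out as `minimum_len(hir) == 0`, which is the sense in which the new attributes are more general.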
    BurntSushi committed Apr 17, 2023
    c2daa3b
  26. syntax: rejigger Hir::{dot,any}

    Instead of using a boolean parameter, we just split them into dot_char,
    dot_byte, any_char, any_byte.
    
    Another path would be to use an enum, but this appeals to me a little
    more.
    BurntSushi committed Apr 17, 2023
    2c119ea
  27. syntax: remove non-capturing groups from HIR

    It turns out they are completely superfluous in the HIR, so we can drop
    them completely. We only need to explicitly represent capturing groups.
    BurntSushi committed Apr 17, 2023
    3a8313e
  28. syntax: small HIR simplifications

    This makes it so 'a{1}' is rewritten as 'a' and '[a]' is rewritten as
    'a'.
    
    A lot of the tests expected '[a]' to get preserved as a class in the
    HIR, so this required a bit of surgery.
    BurntSushi committed Apr 17, 2023
    05cf861
  29. syntax: add 'Hir::dot' method to replace 'Hir::{any,dot}_{char,byte}'

    In a previous commit, I replaced 'Hir::{any,dot}' with a total of four
    methods. Essentially, I expanded out the boolean parameter to
    'Hir::{any,dot}'.
    
    I later realized that we'll probably need a "dot except for CR and LF"
    too. And having four methods all for the same 'dot' construct seemed a
    bit much. So I've turned it into one method with a new 'Dot' enum.
    Eventually, that enum should grow two more variants: 'AnyCharExceptCRLF'
    and 'AnyByteExceptCRLF'. That sort of expansion would have been pretty
    annoying to do (because of naming) in the prior scheme.
    BurntSushi committed Apr 17, 2023
    22a3612
  30. syntax: tweak concat and alternation construction

    We simplify construction a bit to prepare for bigger simplifications.
    
    We also fix a bug in 'Hir::alternation' where it would incorrectly
    return 'Hir::empty()' when given an empty alternation. That's correct
    for an empty concatenation, but an alternation with no branches is
    equivalent to an expression that never matches anything.
    
    To fix that, we create a new 'Hir::fail' that canonicalizes the HIR
    value used to indicate "impossible to match."
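
The distinction can be sketched with a toy HIR. The constructors below are simplified stand-ins that only handle the zero- and one-element cases; they are not the real 'Hir::alternation' and 'Hir::concat':

```rust
#[derive(Debug, PartialEq, Eq)]
enum Hir {
    /// Matches the empty string (and therefore matches everywhere).
    Empty,
    /// Matches nothing at all.
    Fail,
    Literal(char),
    Alt(Vec<Hir>),
    Concat(Vec<Hir>),
}

/// An alternation with no branches has nothing it could match.
fn alternation(mut subs: Vec<Hir>) -> Hir {
    match subs.len() {
        0 => Hir::Fail,
        1 => subs.pop().unwrap(),
        _ => Hir::Alt(subs),
    }
}

/// A concatenation of nothing matches the empty string.
fn concat(mut subs: Vec<Hir>) -> Hir {
    match subs.len() {
        0 => Hir::Empty,
        1 => subs.pop().unwrap(),
        _ => Hir::Concat(subs),
    }
}
```

This mirrors the usual identities: the empty string is the identity for concatenation, while "never matches" is the identity for alternation.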
    
    Thankfully this bug was unlikely to be observed unless one was
    constructing HIR values manually. Namely, it is impossible to spell
    "empty alternation" in the concrete syntax of a regex.
    BurntSushi committed Apr 17, 2023
    a5ee3cc
  31. syntax: tweak Debug impl for Hir

    The default derive(Debug) impl for Hir is very noisy because it lists
    out the properties for every Hir value. We change the default to just
    print out the actual expressions and omit the properties. But one can
    opt back into seeing the properties via the "alternate" impl. i.e.,
    {:#?} instead of {:?}.
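
For anyone who hasn't used the "alternate" flag in a manual Debug impl, the mechanism looks roughly like this. The `Hir` struct here is a made-up two-field stand-in, not the real type:

```rust
use std::fmt;

/// Stand-in for an HIR node that carries derived properties.
struct Hir {
    kind: String,
    minimum_len: usize,
}

impl fmt::Debug for Hir {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        if f.alternate() {
            // `{:#?}`: include the properties.
            write!(f, "{} (minimum_len: {})", self.kind, self.minimum_len)
        } else {
            // `{:?}`: just the expression.
            write!(f, "{}", self.kind)
        }
    }
}
```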
    BurntSushi committed Apr 17, 2023
    7e41247
  32. syntax: flatten concatenations

    This makes the Hir::concat constructor a bit smarter by combining
    adjacent literals and flattening child concatenations into the parent
    concatenation.
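
A sketch of what such a smart constructor might do, using a toy HIR (the types and the `concat` function are assumptions of this example, not the crate's implementation):

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum Hir {
    Empty,
    Literal(Vec<u8>),
    Class(Vec<(char, char)>),
    Concat(Vec<Hir>),
}

/// Hypothetical smart constructor: splices child concatenations into the
/// parent and merges runs of adjacent literals. Only one layer is inspected,
/// which suffices when every concat is built through this function.
fn concat(subs: Vec<Hir>) -> Hir {
    let mut flat: Vec<Hir> = Vec::new();
    for sub in subs {
        // A child concat contributes its children; anything else, itself.
        let children = match sub {
            Hir::Concat(inner) => inner,
            other => vec![other],
        };
        for child in children {
            match child {
                Hir::Literal(bytes) => {
                    // Append to the previous literal if there is one.
                    let merged = match flat.last_mut() {
                        Some(Hir::Literal(prev)) => {
                            prev.extend_from_slice(&bytes);
                            true
                        }
                        _ => false,
                    };
                    if !merged {
                        flat.push(Hir::Literal(bytes));
                    }
                }
                other => flat.push(other),
            }
        }
    }
    match flat.len() {
        0 => Hir::Empty,
        1 => flat.pop().unwrap(),
        _ => Hir::Concat(flat),
    }
}
```

So concat(a, concat(b, [0-9]), c) ends up as a single three-element concatenation: the literal 'ab', the class, and the literal 'c'.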
    BurntSushi committed Apr 17, 2023
    73518e9
  33. syntax: tweak Hir's debug impl again

    Just always strip Properties. It's so annoying to see it when you really
    just want to see the syntax.
    BurntSushi committed Apr 17, 2023
    d9922cc
  34. syntax: simplify alternations

    This commit simplifies alternations by flattening them, similar
    to how a recent commit flattened concatenations. Although, this is
    simpler than concatenations, because we can't do anything with
    literals.
    
    Like concatenations, we only need to look one layer deep, since
    this is applied inductively.
    BurntSushi committed Apr 17, 2023
    232256e
  35. syntax: simplify single char alternations

    In short, simplify 'a|b|..|z' to '[a-z]'.
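
One way to picture the rewrite is as collapsing the alternated characters into merged ranges. The function below is a hypothetical helper written for this example, not the code in the commit:

```rust
/// Collapses the branches of a single-character alternation into sorted,
/// merged inclusive ranges, e.g. `c|a|b|x` becomes the ranges of `[a-cx]`.
fn chars_to_ranges(mut chars: Vec<char>) -> Vec<(char, char)> {
    chars.sort();
    chars.dedup();
    let mut ranges: Vec<(char, char)> = Vec::new();
    for c in chars {
        if let Some(last) = ranges.last_mut() {
            // Extend the previous range when `c` directly follows its end.
            if u32::from(last.1) + 1 == u32::from(c) {
                last.1 = c;
                continue;
            }
        }
        ranges.push((c, c));
    }
    ranges
}
```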
    BurntSushi committed Apr 17, 2023
    05f38ba
  36. syntax: fix empty char class bug in HIR printer

    When a character class is empty, the HIR printer would emit '[]', which
    is not a valid regex. (Since if a ']' immediately follows an opening
    '[', then the ']' is interpreted literally and not a closing bracket.)
    
    Instead, we write '[a&&b]'. We could also do things like '(?u:\P{any})'
    or '(?-u:[\x00-\xFF])', but '[a&&b]' doesn't require any flags and also
    seems really obvious: the intersection of two distinct characters is
    obviously empty.
    BurntSushi committed Apr 17, 2023
    25d103d
  37. syntax: add some 'inline' annotations

    Since these functions are tiny and not polymorphic, we should permit
    them to be inlined across crate boundaries.
    BurntSushi committed Apr 17, 2023
    d92cb55
  38. syntax: fix utf-8 decoder

    We need to know the length of the next codepoint we want to debug,
    otherwise it's possible for a naive 'slice[..4]' to fail if the end of
    the slice happens to split a codepoint.
    BurntSushi committed Apr 17, 2023
    561ed40
  39. syntax: add new LookSet::contains_word convenience routine

    And also add some inline annotations on non-generic but tiny
    functions.
    BurntSushi committed Apr 17, 2023
    781d264
  40. syntax: rewrite literal extraction

    After years of saying "literal extraction needs to be rewritten," I've
    finally gathered up the courage to do it.
    
    While this commit doesn't show it, this is actually now the third time
    I rewrote it. I rewrote it a second time about a week prior to this and
    got close to the finish line when I realized I had to throw it away. In
    that approach, I tried to abandon the "mark each individual literal as
    exact" idea in the original literal extraction code and instead treat
    the entire set of literals as "exact" or not. (I also changed the
    terminology from "complete" to "exact," which I think is maybe a bit
    better. I also got rid of "cut" and instead use "inexact.")
    
    The main problem with not marking each individual literal as exact or
    not is that it potentially inhibits longer literal extraction. For
    example, in the regex 'ab*cd', with individual literals marked as exact,
    we can extract the sequence [inexact(ab), exact(acd)]. But with the
    entire set being all exact or all inexact, there's no real way to let
    extraction continue through the empty string produced by the '*'
    repetition operator.
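
The 'ab*cd' example can be worked through with a small sketch of per-literal exactness. Everything here is an assumption made for illustration: the `Literal` struct, the `cross` function, and in particular the choice to model the prefixes of 'b*' as [inexact(b), exact("")]. The real extractor is considerably more involved.

```rust
/// A literal plus a flag saying whether matching it means the whole
/// (sub)expression seen so far has matched.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Literal {
    bytes: Vec<u8>,
    exact: bool,
}

fn exact(s: &str) -> Literal {
    Literal { bytes: s.as_bytes().to_vec(), exact: true }
}

fn inexact(s: &str) -> Literal {
    Literal { bytes: s.as_bytes().to_vec(), exact: false }
}

/// Concatenates two prefix-literal sequences. Exact literals are extended by
/// every literal from `right`; inexact ones are kept as they are, since
/// nothing is known about what follows them.
fn cross(left: &[Literal], right: &[Literal]) -> Vec<Literal> {
    let mut out = Vec::new();
    for l in left {
        if !l.exact {
            out.push(l.clone());
            continue;
        }
        for r in right {
            let mut bytes = l.bytes.clone();
            bytes.extend_from_slice(&r.bytes);
            out.push(Literal { bytes, exact: r.exact });
        }
    }
    out
}
```

Crossing [exact(a)] with [inexact(b), exact("")], then with [exact(c)] and [exact(d)], gives [inexact(ab), exact(acd)]: the exact empty string from the repetition is what lets extraction continue past 'b*'. With a single exactness flag for the whole set, the set would become inexact at 'b*' and extraction would have to stop there.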
    
    There were some other problems with my second rewrite around
    short-circuiting concats/alternations when sequences got too big, but I
    think I could have resolved them.
    
    In the end, the third rewrite is quite good. It actually roughly
    corresponds to the original code, but is cleaned up and much more
    principled. The original code didn't do these things for example:
    
    1. Didn't care about order and thus didn't correctly produce literals in
       a sequence for which leftmost-first match semantics were preserved.
    2. Didn't differentiate between "empty set" and "infinite set." These
       are two pretty subtle cases and them not being distinct in the code
       was really quite messy.
    3. The old code tried to carry a literal set throughout extraction and
       this has the effect of forcing every part of extraction to care about
       concatenation. But now we just force a stronger separation of
       responsibility. We might wind up with a few more allocs, but the
       in-practice small set size limits and short circuiting means that it
       usually doesn't matter relative to the other costs of parsing,
       translating and compiling regexes.
    
    I ported over pretty much all of the older tests and added more of my
    own. Overall, I feel much more confident about this new literal
    extraction than I do the old.
    
    We do also insert some heuristics for trimming literal sets in
    src/exec.rs that didn't exist before. This is because the new extraction
    code tends to respect the limits a bit more faithfully and sometimes
    returned bigger sets than the old code. This is bad because more
    literals means prefilters are probably less effective. So we write a
    little bit of code to mitigate this.
    
    We also do let a few cases get slower for the time being. The suffix
    handling is not quite ideal, so many of the easy/medium/hard benchmarks
    are now a little slower.
    
    The name_alt3_nocase benchmark is also slower because the new extraction
    code notices that the literals blow the limits and only returns an
    infinite sequence. The old extraction code had (some in practice and
    unprincipled) techniques for shrinking its set as it went, and this
    caused literals to get extracted for it. We can fix this, but it will
    take a little more effort that I don't want to spend right now.
    
    In any case, the hope is to smooth out any issues as we head towards
    bringing regex-automata in.
    BurntSushi committed Apr 17, 2023
    c15240b
  41. syntax: add --lib to syntax tests

    I couldn't figure out how to easily make doc tests run with 'no_std'
    enabled, which regex-syntax now does. The '?' in particular was tripping
    me up.
    
    We still get doctest coverage from the top-level 'cargo test'.
    BurntSushi committed Apr 17, 2023
    724ae3e
  42. syntax: rewrite 'cls1|..|clsN' as '[cls1..clsN]'

    Whenever we have an alternation where each of its branches is just a
    class, we can always combine them into a single class. Single classes
    are generally going to be cheaper to process further down the pipeline.
    Namely, instead of needing to branch between them at a higher level in
    an NFA graph, they can be handled as one single unit.
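    
    As a rough illustration of the rewrite (not the code in regex-syntax),
    here is a minimal sketch that unions the classes of an alternation such
    as '[a-c]|[b-f]|x' into one class. The 'Class' alias and the
    'union_classes' function are hypothetical names made up for this sketch.

```rust
/// A character class as a list of inclusive ranges. This is a hypothetical
/// stand-in for the real HIR class types.
type Class = Vec<(char, char)>;

/// Union the classes of an alternation into one class whose ranges are
/// sorted, with overlapping or directly adjacent ranges merged.
fn union_classes(branches: &[Class]) -> Class {
    let mut ranges: Vec<(char, char)> =
        branches.iter().flatten().copied().collect();
    ranges.sort();
    let mut merged: Class = Vec::new();
    for (start, end) in ranges {
        match merged.last_mut() {
            // Overlapping or adjacent: extend the previous range.
            Some(last)
                if (start as u32) <= (last.1 as u32).saturating_add(1) =>
            {
                if end > last.1 {
                    last.1 = end;
                }
            }
            // Otherwise this range starts a new, disjoint interval.
            _ => merged.push((start, end)),
        }
    }
    merged
}
```

    With this sketch, the three branches '[a-c]', '[b-f]' and 'x' collapse
    into the two ranges 'a-f' and 'x', i.e., the single class '[a-fx]'.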
    BurntSushi committed Apr 17, 2023
    60b9a6c
  43. syntax: remove 'deny(warnings)'

    The 'deny(warnings)' attribute is generally pretty annoying.
    BurntSushi committed Apr 17, 2023
    6d254aa
  44. syntax: add Properties::{union,captures_len}

    This factors out the constructor for properties for an alternation into
    a public API method called "union." This is useful for collapsing the
    properties for multiple regexes down into one analyzable unit.
    
    The 'captures_len' method is also useful for making decisions like "if
    this regex has no captures and is all literals, then we don't ever need
    to use a regex engine under any circumstance."
    BurntSushi committed Apr 17, 2023
    01a89b6
  45. syntax: add more convenience routines to LookSet

    This makes it a little terser to check different types of
    word boundaries in the lookset.
    BurntSushi committed Apr 17, 2023
    e995b73
  46. syntax: move some things around

    This gets rid of the AsRef<[u8]> FromIterator impl for Seq,
    which is unfortunate, but it lets us provide an AsRef<[u8]>
    impl for Literal. The latter ends up being quite useful to
    avoid copying and/or extra allocs.
    BurntSushi committed Apr 17, 2023
    72187cb
  47. syntax: add 'optimize' routines to 'hir::literal::Seq'

    Their docs explain their utility. In the old literal extraction
    setup, some (but not all) of this "optimization" was somewhat
    baked into the extraction itself, but now we codify it a bit
    more explicitly.
    BurntSushi committed Apr 17, 2023
    7a75222
  48. syntax: rename 'allow_invalid_utf8' to 'utf8'

    This also inverts its meaning, i.e., utf8=!allow_invalid_utf8. This
    naming is consistent with the naming used in regex-automata. In general,
    I find names without negations in them to be clearer, since they
    avoid double negations.
    BurntSushi committed Apr 17, 2023
    c5754ef
  49. syntax: trim literal sequence if necessary

    On some occasions, it can make sense to trim the current
    literal sequences before doing a 'union' IF doing that
    union would cause the sequences to become infinite because
    of a blown limit. If we can keep literal extraction going
    by trimming things down, that's usually beneficial.
    
    For now, we just kind of guess that '3' is a good sweet
    spot for this.
    BurntSushi committed Apr 17, 2023
    a0454c2
  50. syntax: factor out common prefixes of alternations

    It is generally quite subtle to reason clearly about how this
    actually helps things in a finite automata based regex engine, but this
    sort of factoring can lead to lots of improvements:
    
    * We do use a bounded backtracker, so "pushing branches" down will help
    things there, just like it would with a classical backtracker.
    * It may lead to better literal extraction due to the simpler regex.
    Whether prefix factoring is really to blame here is somewhat unclear,
    but some downstream optimizations are more brittle than others. For
    example, the "reverse inner" optimization requires examining a "top
    level" concatenation to find literals to search for. By factoring out a
    common prefix, we potentially expand the number of regexes that have a
    top-level concat. For example, `\wfoo|\wbar` has no top-level concat but
    `\w(?:foo|bar)` does.
    * It should lead to faster matching even in finite automata oriented
    engines like the PikeVM, and also faster construction of DFAs (lazy or
    not). Namely, by pushing the branches down, we make it so they are
    visited less frequently, and thus the constant state shuffling caused by
    branches is reduced.
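    
    To make the `\wfoo|\wbar` example concrete, here is a minimal sketch of
    prefix factoring. It is not the translator's actual code: each branch is
    modeled as a list of sub-expression strings, and 'factor_common_prefix'
    is a hypothetical name used only for illustration.

```rust
/// Split off the longest prefix shared by every branch of an alternation.
/// Each branch is a concatenation, modeled here as a list of sub-expression
/// strings. Returns the shared prefix and the remaining suffix of each branch.
fn factor_common_prefix<'a>(
    branches: &[Vec<&'a str>],
) -> (Vec<&'a str>, Vec<Vec<&'a str>>) {
    let first = match branches.first() {
        Some(first) => first,
        None => return (Vec::new(), Vec::new()),
    };
    // Shrink the candidate prefix length against every other branch.
    let mut len = first.len();
    for branch in &branches[1..] {
        let common = first
            .iter()
            .zip(branch.iter())
            .take_while(|(a, b)| a == b)
            .count();
        len = len.min(common);
    }
    let prefix = first[..len].to_vec();
    let suffixes = branches.iter().map(|b| b[len..].to_vec()).collect();
    (prefix, suffixes)
}
```

    Given the branches ['\w', 'foo'] and ['\w', 'bar'], this returns the
    prefix ['\w'] and the suffixes ['foo'] and ['bar'], which corresponds to
    `\w(?:foo|bar)`.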
    
    The prefix extraction could be better, as mentioned in the comments, but
    this is a good start.
    BurntSushi committed Apr 17, 2023
    557f0ea
  51. syntax: support (?< syntax for named groups

    It turns out that both '(?P<name>...)' and '(?<name>...)' are rather
    common among regex engines. There are several that support just one or
    the other. Until this commit, the regex crate only supported the former,
    along with RE2, RE2/J and Go's regexp package. There are also several
    regex engines that only support the latter, such as Onigmo, Oniguruma,
    Java, Ruby, Boost, .NET and JavaScript. To decrease friction,
    and because there is somewhat little cost to doing so, we elect to
    support both.
    
    It looks like perhaps RE2 and Go's regexp package will go the same
    route, but it isn't fully decided yet:
    golang/go#58458
    
    Closes #955, Closes #956
    01mf02 authored and BurntSushi committed Apr 17, 2023
    541aa42
  52. dfa: fix approximate cache size

    Unbelievably, this was using the size of the compiled prog *and* the
    heap memory used by the cache to compute the total memory used by the
    cache. The effect of this is that the reported size might be much bigger
    than what is actually used by the cache. This in turn would result in
    the lazy DFA thrashing the cache and going quite slow.
    BurntSushi committed Apr 17, 2023
    19b29cf
  53. impl: switch to aho-corasick 1.0

    This is a transitory commit that will need to be updated once
    aho-corasick 1.0 is actually released. Its purpose is to make it so the
    regex crate, the "old" regex crate and regex-automata all agree on the
    same version of aho-corasick to use while in development.
    BurntSushi committed Apr 17, 2023
    ca03c73
  54. syntax: rename 'Group' to 'Capture'

    Now that it *only* represents a capturing group, it makes sense to give
    it a more specific name.
    BurntSushi committed Apr 17, 2023
    99d8436
  55. syntax: rename 'hir' to 'sub'

    Where 'sub' is short for 'sub-expression.'
    BurntSushi committed Apr 17, 2023
    6cb02d9
  56. syntax: add support for CRLF-aware line anchors

    This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
    'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
    causes '.' to *not* match \r in addition to \n (unless the 's' flag is
    enabled of course).
    
    The intended semantics are that CRLF mode makes \r\n, \r and \n line
    terminators but with one key property: \r\n is treated as a single line
    terminator. That is, ^/$ do not match between \r and \n.
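    
    The intended semantics can be sketched as two predicates over a byte
    haystack. This is only an illustration of the rules above and not
    necessarily how any engine implements them. The function names are
    assumptions, and both require 'at <= haystack.len()'.

```rust
/// Sketch of `^` in CRLF-aware multi-line mode: matches at the start of the
/// haystack, after `\n`, or after a `\r` that is not followed by `\n`.
/// Requires `at <= haystack.len()`.
fn is_start_crlf(haystack: &[u8], at: usize) -> bool {
    at == 0
        || haystack[at - 1] == b'\n'
        || (haystack[at - 1] == b'\r'
            && (at >= haystack.len() || haystack[at] != b'\n'))
}

/// Sketch of `$` in CRLF-aware multi-line mode: matches at the end of the
/// haystack, before `\r`, or before a `\n` that is not preceded by `\r`.
/// Requires `at <= haystack.len()`.
fn is_end_crlf(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
        || haystack[at] == b'\r'
        || (haystack[at] == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}
```

    For the haystack "a\r\nb", neither predicate holds at offset 2, which is
    the position between \r and \n.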
    
    This partially addresses #244 by adding syntax support. Currently, if
    you try to use this new flag, the regex compiler will report an error.
    We intend to finish support for this once #656 is complete. (Indeed, at
    time of writing, CRLF matching works in regex-automata.)
    BurntSushi committed Apr 17, 2023
    0114235
  57. syntax: polish and doc updates

    This updates docs in a number of places, including adding examples.
    
    We also make it so zero-width matches never impact the 'utf8' property.
    In practice, this means '(?-u:\B)' is now considered to match valid
    UTF-8, which is consistent with the fact that 'a*' is considered to
    match valid UTF-8 too.
    
    We also do a refresh of the 'Look' and 'LookSet' APIs.
    BurntSushi committed Apr 17, 2023
    e4006af
  58. syntax: permit most no-op escape sequences

    This resolves a long-standing (but somewhat minor) complaint that folks
    have with the regex crate: it does not permit escaping punctuation
    characters in cases where those characters do not need to be escaped. So
    things like \/, \" and \! would result in parse errors. Most other regex
    engines permit these, even in cases where they aren't needed.
    
    I had been against doing this for future evolution purposes, but it's
    incredibly unlikely that we're ever going to add a new meta character to
    the syntax. I literally cannot think of any conceivable future in which
    that might happen.
    
    However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
    conceivable that we might add new escape sequences for those characters.
    (And 0-9 are already banned by virtue of them looking too much like
    backreferences, which aren't supported.) For example, we could add
    \Q...\E literal syntax. Or \< and \> as start and end word boundaries,
    as found in POSIX regex engines.
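    
    A minimal sketch of the rule, for illustration only. The function name
    is hypothetical, and restricting the check to ASCII punctuation is an
    assumption of this sketch rather than something stated above.

```rust
/// Sketch: may `\c` be written as a no-op escape for the character `c`?
/// ASCII letters and digits are never punctuation, so they are rejected,
/// and `<` and `>` are rejected explicitly because they stay reserved for
/// possible future escape sequences.
fn is_noop_escape_permitted(c: char) -> bool {
    c.is_ascii_punctuation() && c != '<' && c != '>'
}
```

    Under this sketch, \/, \" and \! are accepted, while \Q, \1, \< and \>
    are still rejected.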
    
    Fixes #501
    BurntSushi committed Apr 17, 2023
    fbdc4a9
  59. syntax: allow Unicode in capture names

    This changes the rules for capture names to be much less restrictive.
    Namely, the requirements are now:
    
    1. Must begin with an `_` or any alphabetic codepoint.
    2. After the first codepoint, the name may contain any sequence of
       alpha-numeric codepoints along with `_`, `.`, `[` and `]`.
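    
    The two rules can be sketched as a small validator. This is an
    illustration and not the parser's code: 'is_valid_capture_name' is a
    hypothetical name, and using the standard library's 'char::is_alphabetic'
    and 'char::is_alphanumeric' as the definitions of "alphabetic" and
    "alpha-numeric" is an assumption.

```rust
/// Sketch of the capture name rules: the first codepoint must be `_` or
/// alphabetic, and every later codepoint must be alphanumeric or one of
/// `_`, `.`, `[` and `]`. The empty name is rejected.
fn is_valid_capture_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c == '_' || c.is_alphabetic() => {}
        _ => return false,
    }
    chars.all(|c| c.is_alphanumeric() || matches!(c, '_' | '.' | '[' | ']'))
}
```

    Under these rules, names like 'año' and '_x.y[0]' are accepted, while
    '1abc' and 'a-b' are not.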
    
    Closes #595
    BurntSushi committed Apr 17, 2023
    0732763
  60. api: add new 'Regex::static_captures_len' method

    This adds a new routine for computing the static number of capture
    groups that will appear in every match. If the number of groups is not
    invariant across all matches, then there is no static capture length.
    
    This is meant to help implement higher level convenience APIs for
    extracting capture groups, such as the one described in #824. We may
    wind up including such APIs in the regex crate itself, but this commit
    stops short of that. Instead, we just add this new property which should
    permit those APIs to exist outside of this crate for now.
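    
    To illustrate what "static" means here, below is a toy computation over
    a made-up expression tree. The 'ToyHir' type and the function name are
    hypothetical, the real analysis lives in regex-syntax and handles more
    cases, and this sketch counts only explicit groups.

```rust
/// A toy expression tree, standing in for the real HIR.
enum ToyHir {
    Literal,
    Capture(Box<ToyHir>),
    Concat(Vec<ToyHir>),
    Alternation(Vec<ToyHir>),
    /// A repetition with a minimum count, e.g. `*` has min 0 and `+` min 1.
    Repetition { min: u32, sub: Box<ToyHir> },
}

/// Returns the number of explicit groups that participate in every match,
/// or None if that number can differ from one match to another.
fn static_explicit_captures_len(hir: &ToyHir) -> Option<usize> {
    match hir {
        ToyHir::Literal => Some(0),
        ToyHir::Capture(sub) => Some(1 + static_explicit_captures_len(sub)?),
        ToyHir::Concat(subs) => {
            let mut total = 0;
            for sub in subs {
                total += static_explicit_captures_len(sub)?;
            }
            Some(total)
        }
        ToyHir::Alternation(subs) => {
            // Every branch must agree on the count.
            let mut lens = subs.iter().map(static_explicit_captures_len);
            let first = match lens.next() {
                None => return Some(0),
                Some(len) => len?,
            };
            for len in lens {
                if len? != first {
                    return None;
                }
            }
            Some(first)
        }
        ToyHir::Repetition { min, sub } => {
            let len = static_explicit_captures_len(sub)?;
            // An optional sub-expression with groups may not participate.
            if *min == 0 && len > 0 {
                None
            } else {
                Some(len)
            }
        }
    }
}
```

    In this toy model, '(a)|(b)' has a static length of 1, while '(a)|b' and
    '(b)*' have none.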
    
    Closes #908
    BurntSushi committed Apr 17, 2023
    8a0bf38
  61. syntax: rename 'captures_len' to 'explicit_captures_len'

    And do the same for 'static_captures_len'.
    
    The motivation for this is that the top-level Regex API had equivalently
    named methods 'captures_len' and 'static_captures_len', except those
    included the implicit group and were therefore always 1 more than the
    same APIs on Hir. We distinguish them by renaming the routines on HIR.
    BurntSushi committed Apr 17, 2023
    cf34553
  62. syntax: optimize case folding

    It turns out that it's not too hard to get HIR translation to run pretty
    slowly with some carefully crafted regexes. For example:
    
        (?i:[[:^space:]------------------------------------------------------------------------])
    
    This regex is actually a [:^space:] class that has an empty class
    subtracted from it 36 times. For each subtraction, the resulting class,
    despite not having changed, goes through Unicode case folding again.
    This in turn slows things way down.
    
    We introduce a fairly basic optimization that basically keeps track of
    whether an interval set has been folded or not. The idea was taken from
    PR #893, but was tweaked slightly. The magic of how it works is that if
    two interval sets have already been folded, then they retain that
    property after any of the set operations: negation, union, difference,
    intersection and symmetric difference. So case folding should generally
    only need to be run once for each "base" class, but then not again as
    operations are performed.
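    
    Here is a toy model of the idea. It is heavily simplified: it stores
    individual characters instead of intervals, folds ASCII case only, and
    implements just one set operation. 'ToyClass' and its 'fold_runs'
    counter are hypothetical and exist only to show that the expensive
    step is skipped.

```rust
use std::collections::BTreeSet;

/// A toy class that remembers whether it is closed under case folding.
struct ToyClass {
    chars: BTreeSet<char>,
    folded: bool,
    /// How many times the expensive folding step actually ran.
    fold_runs: usize,
}

impl ToyClass {
    fn new(chars: impl IntoIterator<Item = char>) -> ToyClass {
        let chars: BTreeSet<char> = chars.into_iter().collect();
        // An empty set is trivially closed under case folding.
        let folded = chars.is_empty();
        ToyClass { chars, folded, fold_runs: 0 }
    }

    /// Add case variants (ASCII only in this sketch), unless the set is
    /// already known to be folded.
    fn case_fold(&mut self) {
        if self.folded {
            return;
        }
        self.fold_runs += 1;
        let extra: Vec<char> = self
            .chars
            .iter()
            .flat_map(|c| [c.to_ascii_lowercase(), c.to_ascii_uppercase()])
            .collect();
        self.chars.extend(extra);
        self.folded = true;
    }

    /// Set difference. The result stays folded if both operands were.
    fn difference(&mut self, other: &ToyClass) {
        self.chars.retain(|c| !other.chars.contains(c));
        self.folded = self.folded && other.folded;
    }
}
```

    With this flag in place, subtracting an empty class and re-folding any
    number of times performs the folding work only once.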
    
    Some benchmarks were added to rebar (which isn't public yet at time of
    writing).
    
    Closes #893
    BurntSushi committed Apr 17, 2023
    7212a03
  63. syntax: drop some Result type aliases

    I'm overall coming around to the opinion that these tend to make the
    code harder to read. So I've been steadily dropping the Result aliases.
    BurntSushi committed Apr 17, 2023
    0dd5853
  64. syntax: refactor and optimize case folding

    This rewrites how Unicode simple case folding works. Instead of just
    defining a single function and expecting callers to deal with the
    fallout, we now define a stateful type that "knows" about the structure
    of the case folding table. For example, it now knows enough to avoid
    binary search lookups in most cases. All we really have to do is require
    that callers look up codepoints in sequence, which is perfectly fine for
    our use case.
    
    Ref #893
    BurntSushi committed Apr 17, 2023
    627c997
  65. syntax: improve Debug impl for Class

    Previously, classes would show up in the debug representation as very
    deeply nested things, making them more difficult to read than they need
    to be. This removes at least a few pretty redundant layers and uses a
    more compact range notation.
    BurntSushi committed Apr 17, 2023
    3c49615
  66. bug: fix CaptureLocations::get to handle invalid offsets

    The contract of this function says that any invalid group offset should
    result in a return value of None. In general, it worked fine, unless the
    offset was so big that some internal multiplication overflowed. That
    could in turn produce an incorrect result or a panic. So we fix that
    here with checked arithmetic.
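    
    The shape of the fix can be sketched as follows. This assumes a slot
    layout in which group i stores its start at slot 2*i and its end at
    slot 2*i+1. The 'group_span' function and that layout are assumptions
    for illustration, not the crate's actual internals.

```rust
/// Look up the span of a capture group in a flat slot table. Any overflow
/// in the index arithmetic, any out-of-bounds slot, and any unset slot all
/// yield None instead of a panic or a wrong answer.
fn group_span(slots: &[Option<usize>], group: usize) -> Option<(usize, usize)> {
    let start_slot = group.checked_mul(2)?;
    let end_slot = start_slot.checked_add(1)?;
    let start = (*slots.get(start_slot)?)?;
    let end = (*slots.get(end_slot)?)?;
    Some((start, end))
}
```

    With unchecked arithmetic, an index like usize::MAX would either panic
    (in debug builds) or wrap around to a small, valid-looking slot index.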
    
    Fixes #738, Fixes #950
    BurntSushi committed Apr 17, 2023
    b8ab381
  67. doc: add wording about Unicode scalar values

    This makes it clearer that the regex engine works by *logically*
    treating a haystack as a sequence of codepoints. Or more specifically,
    Unicode scalar values.
    
    Fixes #854
    BurntSushi committed Apr 17, 2023
    e65ba17
  68. doc: add more explanation to 'CompiledTooBig' error

    The existing docs were pretty paltry, and it turns out we can be a bit
    more helpful for folks when they hit this error.
    
    Fixes #846
    BurntSushi committed Apr 17, 2023
    d04ea10
  69. api: add Match::{is_empty, len}

    Adding these methods has almost no cost and they can be convenient to
    have in some cases.
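    
    For illustration, here is what the two methods amount to on a toy match
    type. 'ToyMatch' is a hypothetical stand-in and not the crate's 'Match'.

```rust
/// A toy match: a haystack plus the byte offsets of the matched span.
struct ToyMatch<'h> {
    haystack: &'h str,
    start: usize,
    end: usize,
}

impl<'h> ToyMatch<'h> {
    /// The length of the match in bytes.
    fn len(&self) -> usize {
        self.end - self.start
    }

    /// True when the match is zero-width, e.g. `a*` matching at the
    /// start of "bbb".
    fn is_empty(&self) -> bool {
        self.start == self.end
    }

    /// The matched text.
    fn as_str(&self) -> &'h str {
        &self.haystack[self.start..self.end]
    }
}
```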
    
    Closes #810
    BurntSushi committed Apr 17, 2023
    07c453d
  70. doc: tweak docs for 'shortest_match'

    The name is somewhat unfortunate, but it's actually kind of difficult to
    capture the right semantics in the name. The key bit is that the
    function returns the offset at the point at which a match is known, and
    that point might vary depending on which internal regex engine was used.
    
    Fixes #747
    BurntSushi committed Apr 17, 2023
    be2afa1
  71. doc: clarify verbose mode

    This clarifies that `x` is "verbose mode," and that whitespace becomes
    insignificant everywhere, including in character classes. We also add
    guidance for how to insert a space: either escape it or use a hex
    literal.
    
    Fixes #660
    BurntSushi committed Apr 17, 2023
    061dd68
  72. doc: clarify meaning of SetMatches::len

    It is really unfortunate, but SetMatches::len and
    SetMatches::iter().count() do not correspond to the same thing. It's not
    clear why I even added the SetMatches::len method in the first place,
    but it always returns the number of regexes in the set, and not the
    number of regexes that matched.
    
    We can't change the name (or remove the method) obviously, but we do add
    a warning to the docs.
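    
    The difference can be shown with a toy model that keeps one flag per
    regex in the set. 'ToySetMatches' is a hypothetical type for this
    sketch and not the crate's implementation.

```rust
/// A toy model of set matches: one flag per regex in the set.
struct ToySetMatches {
    matched: Vec<bool>,
}

impl ToySetMatches {
    /// The number of regexes in the set, whether or not they matched.
    fn len(&self) -> usize {
        self.matched.len()
    }

    /// The indices of the regexes that actually matched.
    fn iter(&self) -> impl Iterator<Item = usize> + '_ {
        self.matched
            .iter()
            .enumerate()
            .filter_map(|(i, &m)| if m { Some(i) } else { None })
    }
}
```

    For a set of three regexes where only the first and third match, 'len'
    reports 3 while 'iter().count()' reports 2.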
    
    Fixes #625
    BurntSushi committed Apr 17, 2023
    67a60cf
  73. doc: add example that uses an alternation

    And we make it an interesting example, i.e., one that demonstrates
    preference order semantics.
    
    Closes #610
    BurntSushi committed Apr 17, 2023
    f3de42b
  74. api: add Regex::captures_at

    This isn't *strictly* needed because of the existence of
    Regex::captures_read_at, but it does fill out the singular missing
    method. Namely, all other search routines have an *_at variant, so we
    might as well add it for Regex::captures too.
    
    Closes #547
    BurntSushi committed Apr 17, 2023
    cdf6325
  75. api: improve Debug impl for Match

    This makes it so the Debug impl for Match only shows the actual matched
    text. Otherwise, the Match shows the entire haystack, which is likely to
    be misleading.
    
    Fixes #514
    BurntSushi committed Apr 17, 2023
    3988431
  76. syntax: add 'Repetition::with'

    This is useful when doing structural recursion on a '&Hir' to produce a
    new 'Hir' derived from it.
    BurntSushi committed Apr 17, 2023
    3f4bfa6
  77. syntax: add 'Properties::memory_usage'

    Since it uses heap memory and because it's something you typically hang
    on to in a regex engine, we expose a routine for computing heap memory.
    
    We might consider doing this for other types in regex-syntax, but there
    hasn't been a strong need for it yet.
    BurntSushi committed Apr 17, 2023
    e166658
  78. doc: tweak presentation of \pN syntax

    The wording appears to be a little unclear, so we switch it around a
    bit.
    
    Fixes #975
    BurntSushi committed Apr 17, 2023
    a34c1a7
  79. changelog: add entry for regex 1.8

    This will need to be updated again to add a date (maybe today?), but
    this should cover everything from the commit log.
    BurntSushi committed Apr 17, 2023
    82b0f0d