first phase of migrating to regex-automata #977

This sets 'rust-version' to 1.60 and also increases the pinned Rust version that we test against in CI to 1.60.0. Rust 1.60.0 was released over a year ago and contains some important stuff. Notably, it includes namespaced and weak dependency features that are used in the (soon to be) released aho-corasick 1.0. They will also be extensively used in regex-automata 0.3, which is coming to a rust-lang/regex repository near you Real Soon Now.

Apparently in C, an empty parameter list means "the function takes an unspecified number of arguments." (lol.) But an explicit void means "the function takes zero arguments." The latter is indeed what we want here. Ref: https://softwareengineering.stackexchange.com/questions/286490/what-is-the-difference-between-function-and-functionvoid Closes #942

This is justified by the fact that a RegexSet is, after all, a set. And a set has a very obvious default value: the empty set. Plus, this is exactly what you get by passing a default `Vec` or an empty iterator to the `RegexSet::new` constructor. We specifically do not add a `Default` impl for Regex because it has no obvious default value. Fixes #905, Closes #906

There will be a new 'regex-cli' tool that will supplant this (and more).

'sc' refers to the 'Currency_Symbol' general category, but is also the abbreviation for the 'Script' property. So when going through the canonicalization process, it would get normalized to 'Script' before being checked as a general category. We fix it by special casing it. See also #719 Fixes #835, #899

This is more similar to the \p{Cf} bug than the \p{Sc} bug, but basically, 'lc' is an abbreviation for both 'Cased_Letter' and 'Lowercase_Mapping'. Since we don't support the latter (currently), we make 'lc' map to 'Cased_Letter'. If we do ever add 'Lowercase_Mapping' in the future, then we will just require users to type out its full form. Fixes #965

Previously this was only defined on 'ClassUnicode', but since 'Class' might contain a 'ClassUnicode', it should be defined here too. We don't need to update any call sites since this crate doesn't actually use 'Class::case_fold_simple' directly, and instead manipulates the underlying 'ClassUnicode' or 'ClassBytes'.

This effectively bumps the MSRV of 'regex' to Rust 1.56, which was released in Oct 2021. It's not quite a year at the time of writing, but I expect it will be a year by the time this change is released.

It turns out that all uses of 'as' in the regex-syntax crate can be replaced with either explicitly infallible routines (like 'u32::from(char)'), or with routines that will panic on failure. These panics are strictly better than truncating casts that might otherwise lead to subtle bugs in the context of this crate. (Namely, we never really care about the perf effects here, since regex parsing is just never a bottleneck.)
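
A minimal sketch of the two kinds of replacement described above, using only the standard library. The helper names are hypothetical and are not taken from regex-syntax.

```rust
use std::convert::TryFrom;

/// Infallible widening: every `char` is a valid `u32`, so `u32::from`
/// expresses the conversion without an `as` cast.
fn char_to_u32(c: char) -> u32 {
    u32::from(c)
}

/// Narrowing that panics on failure instead of silently truncating the
/// way `n as u8` would.
fn u32_to_u8(n: u32) -> u8 {
    u8::try_from(n).expect("value must fit in a u8")
}
```

For comparison, `300u32 as u8` silently yields 44, while `u32_to_u8(300)` panics.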

This method was deprecated a while ago, but we kept it around because it wasn't worth a breaking release to remove it. This also simplifies some imports.

This marks the various error types as '#[non_exhaustive]' instead of using a __Nonexhaustive variant hack. Closes #884

An empty character class is effectively a way to write something that can never match anything. The regex crate has pretty much always returned an error for such things because it was never taught how to handle "always fail" states. Partly because I just didn't think about it when initially writing the regex engines and partly because it isn't often useful. With that said, it should be supported for completeness and because there is no real reason to not support it. Moreover, it can be useful in certain contexts where regexes are generated and you want to insert an expression that can never match. It's somewhat contrived, but it happens when the interface is a regex pattern. Previously, the ban on empty character classes was implemented in the regex-syntax crate. But with the rewrite in #656 getting closer and closer to landing, it's now time to relax this restriction. However, we do keep the overall restriction in the 'regex' API by returning an error in the NFA compiler. Once #656 is done, the new regex engines will permit this case.

When Unicode mode is disabled (i.e., (?-u)), the Perl character classes (\w, \d and \s) revert to their ASCII definitions. The negated forms of these classes are also derived from their ASCII definitions, and this means that they may actually match bytes outside of ASCII and thus possibly invalid UTF-8. For this reason, when the translator is configured to only produce HIR that matches valid UTF-8, '(?-u)\W' should be rejected. Previously, it was not being rejected, which could actually lead to matches that produced offsets that split codepoints, and thus lead to panics when match offsets are used to slice a string. For example, this code

    fn main() {
        let re = regex::Regex::new(r"(?-u)\W").unwrap();
        let haystack = "☃";
        if let Some(m) = re.find(haystack) {
            println!("{:?}", &haystack[m.range()]);
        }
    }

panics with

    byte index 1 is not a char boundary; it is inside '☃' (bytes 0..3) of `☃`

That is, it reports a match at 0..1, which is technically correct, but the regex itself should have been rejected in the first place since the top-level Regex API always has UTF-8 mode enabled.

Also, many of the replacement tests were using '(?-u)\W' (or similar) for some reason. I'm not sure why, so I just removed the '(?-u)' to make those tests pass. Whether Unicode is enabled or not doesn't seem to be an interesting detail for those tests. (All haystacks and replacements appear to be ASCII.)

Fixes #895, Partially addresses #738

In effect, this adds support for no_std by depending on only core and alloc. There is still currently some benefit to enabling std support, namely, getting the 'std::error::Error' trait impls for the various error types. (Although, it seems like the 'Error' trait is going to get moved to 'core' finally.) Otherwise, the only 'std' things we use are in tests for tweaking stack sizes. This is the first step in an effort to make 'regex' itself work without depending on 'std'. 'regex' itself will be more precarious since it uses things like HashMap and Mutex that we'll need to find a way around. Getting around HashMap is easy (just use BTreeMap), but figuring out how to synchronize the threadpool will be interesting. Ref #476, Ref #477

I wish this feature were stable and enabled by default. I suspect that it maybe doesn't work correctly 100% of the time, but it's super useful. And manually annotating APIs is a huge pain, so it's worth at least attempting.

Get rid of those old crusty HTML links! Also, if an intradoc link is used that is bunk, fail the build.

I'm not sure exactly why I used three variants instead of two like how I've defined it in this patch. Possibly because the AST uses three variants? (The AST needs to do a little more work to store a span associated with where the name actually is in the expression, so it maybe makes a little more sense there.) In any case, this is the first step of many in simplifying the HIR.

This is apparently not used anywhere. So drop it. Also motivated by wanting to squash look-around assertions into a single enum. So 'is_negated' won't make sense on its own anymore.

Instead of having both 'HirKind::Anchor' and 'HirKind::WordBoundary', this patch flattens them into one 'HirKind::Look'. Why do this? I think they make more sense grouped together. Namely, they are all simplistic look-around assertions and they all tend to be handled with very similar logic.

This greatly simplifies how repetitions are represented in the HIR from a sprawling set of variants down to just a simple `(u32, Option<u32>)`. This is much simpler and still permits us to specialize all of the cases we did before if necessary. This also simplifies some of the HIR printer's output. e.g., 'a{1}' is just 'a'.
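
As a rough illustration of why a single `(u32, Option<u32>)` pair is enough, here is a sketch that maps a set of repetition variants onto `(min, max)` bounds. The `OldRepetition` enum and its variant names are hypothetical stand-ins, not the actual types that were removed.

```rust
/// Hypothetical stand-in for a "sprawling" set of repetition variants.
enum OldRepetition {
    ZeroOrOne,         // a?
    ZeroOrMore,        // a*
    OneOrMore,         // a+
    Exactly(u32),      // a{n}
    AtLeast(u32),      // a{n,}
    Bounded(u32, u32), // a{m,n}
}

/// Every variant collapses to `(min, max)`, where `None` means unbounded.
fn to_bounds(rep: &OldRepetition) -> (u32, Option<u32>) {
    match *rep {
        OldRepetition::ZeroOrOne => (0, Some(1)),
        OldRepetition::ZeroOrMore => (0, None),
        OldRepetition::OneOrMore => (1, None),
        OldRepetition::Exactly(n) => (n, Some(n)),
        OldRepetition::AtLeast(n) => (n, None),
        OldRepetition::Bounded(m, n) => (m, Some(n)),
    }
}
```

Special cases can still be recovered by matching on the pair, e.g. `(0, None)` for a star.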

This fixes some corner cases in the HIR printer where it would print the concrete syntax of a regex that does not match the natural interpretation of the HIR. One such example of this is:

    concat(a, alt(b, c))

This would get printed as

    ab|c

But clearly, it should be printed as:

    a(?:b|c)

The issue here is that the printer only considers the current HirKind when determining how to print it. Sometimes a group is needed to print an alt (and even a concat, in the case of 'rep(+, concat(a, b))'), but sometimes it isn't. We could address this in a few different ways:

1) Always print concats and alts inside a non-capturing group.
2) Make the printer detect precisely the cases where a non-capturing group is needed.
3) Make the HIR smart constructors insert non-capturing groups when needed.
4) Do some other thing to change the HIR to prevent these sorts of things by construction.

This patch goes with (1). The reason in favor of it is that the HIR printer was always about printing an equivalent regex and never about trying to print a "nice" regex. Indeed, the HIR printer can't print a nice regex, because the HIR represents a rigorously simplified view of a regex to make analysis easier. (The most obvious such example is Unicode character classes. For example, the HIR printer never prints '\w'.) So inserting some extra groups (which it already does) even when they aren't strictly needed is perfectly okay.

But still, it's useful to say why we didn't do the other choices:

2) Modifying the printer to only print groups when they're actually needed is pretty difficult. I tried this briefly, and handling this case requires some notion of what the parent expression is. This winds up being a possible but hairy change.
3) Making the HIR more complicated to make the printer correct seems like it's optimizing for the wrong thing. Inserting extra groups in places just obfuscates HIR values that already have clear semantics. That is, use concat(a, alt(b, c)) over concat(a, group(alt(b, c))).
4) It's not clear how we would change the HIR to guarantee this sort of thing wouldn't happen. At the very least, it seems likely it would require a more complex data type.

At first, I had thought (1) seemed inelegant. But the more I thought about it, the more it seemed quite consistent with how the HIR printer already worked. So that's the path I took here. Closes #516, Closes #731
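
Option (1) can be sketched with a toy HIR. The `Hir` enum and `print` function below are simplified illustrations and are not the regex-syntax printer itself.

```rust
/// Toy HIR with just enough variants to show the grouping problem.
enum Hir {
    Lit(char),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
}

/// Prints an equivalent (not "nice") regex by always wrapping concats
/// and alts in a non-capturing group, regardless of the parent.
fn print(hir: &Hir) -> String {
    match hir {
        Hir::Lit(c) => c.to_string(),
        Hir::Concat(subs) => {
            let inner: String = subs.iter().map(print).collect();
            format!("(?:{})", inner)
        }
        Hir::Alt(subs) => {
            let inner: Vec<String> = subs.iter().map(print).collect();
            format!("(?:{})", inner.join("|"))
        }
    }
}
```

With this scheme, concat(a, alt(b, c)) prints as '(?:a(?:b|c))'. The outer group is redundant, but the output can no longer be misread as 'ab|c'.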

No matter what 'a' is, 'a{0}' is always equivalent to an empty regex.

This gets rid of the old 'Literal' type:

    enum Literal {
        Unicode(char),
        Byte(u8),
    }

and replaces it with

    struct Literal(Box<[u8]>);

I did this primarily because I perceive the new version to be a bit simpler, and it is very likely to be more space efficient given some of the changes I have in mind (upcoming in subsequent commits). Namely, I want to include more analysis information beyond just simple booleans, and this means using up more space. Putting that analysis information on every single byte/char seems gratuitous. But putting it on every single sequence of bytes/chars seems more justifiable.

I also have a hand-wavy idea that this might make analysis a bit easier. And another hand-wavy idea that debug-printing such an HIR will make it a bit more comprehensible.

Overall, this isn't a completely obvious win and I do wonder whether I'll regret this. For one thing, the translator is now a fair bit more complicated in exchange for not creating a 'Vec<u8>' for every 'ast::Literal' node.

This also gives up the Unicode vs byte distinction and just commits to "all bytes." Instead, we do a UTF-8 check on every 'Hir::literal' call, and that in turn sets the UTF-8 property. This does seem a bit wasteful, and indeed, we do another UTF-8 check in the compiler (even though we could use 'unsafe' correctly and avoid it). However, once the new NFA compiler lands from regex-automata, it operates purely in byte-land and will not need to do another UTF-8 check. Moreover, a UTF-8 check, even on every literal, is likely barely measurable in the grand scheme of things.

I do also worry that this is overwrought. In particular, the AST creates a node for each character. Then the HIR smooths them out to sequences of characters (that is, Vec<u8>). And then NFA compilation splits them back out into states where a state handles at most one character (or range of characters). But, I am taking somewhat of a leap of judgment here that this will make analysis easier and will overall use less space. But we'll see.

This makes the Debug impls for Literal and ClassRangeBytes a bit better. The former in particular. Instead of just printing a sequence of decimal numbers, we now print them as characters. Given the lackluster support for Vec<u8> as a string in the standard library, we copy a little bit of code from regex-automata to make the debug print for the Vec<u8> basically as nice as a String.

This commit completely rewrites how HIR properties are computed inductively. Firstly, 'Properties' is now boxed, so that it contributes less space to each HIR value. This does add an allocation for each HIR expression, but most HIR expressions already require at least one alloc anyway. And there should be far fewer of them now that we collapse literals together. Secondly, 'Properties' now computes far more general attributes instead of hyper-specific things. For example, instead of 'is_match_empty', we now have 'minimum_len' and 'maximum_len'. Similarly, instead of 'is_anchored_start' and 'is_anchored_end', we now compute sets of look-around assertions found anywhere, only as a prefix and only as a suffix. We also remove 'is_line_anchored_{start,end}'. These were only used in the 'grep-regex' crate and they seem unnecessary. They were otherwise fairly weird properties to compute.
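
To illustrate what computing 'minimum_len' and 'maximum_len' inductively could look like, here is a sketch over a toy HIR. The types are hypothetical (and use 'usize' bounds for brevity), and the real 'Properties' type handles more cases than this.

```rust
/// Toy HIR; `max: None` on a repetition means "unbounded".
enum Hir {
    Lit(Vec<u8>),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
    Repeat { min: usize, max: Option<usize>, sub: Box<Hir> },
}

/// Length in bytes of the shortest possible match.
fn minimum_len(hir: &Hir) -> usize {
    match hir {
        Hir::Lit(bytes) => bytes.len(),
        Hir::Concat(subs) => subs.iter().map(minimum_len).sum::<usize>(),
        Hir::Alt(subs) => subs.iter().map(minimum_len).min().unwrap_or(0),
        Hir::Repeat { min, sub, .. } => min.saturating_mul(minimum_len(sub)),
    }
}

/// Length in bytes of the longest possible match, or `None` if unbounded.
fn maximum_len(hir: &Hir) -> Option<usize> {
    match hir {
        Hir::Lit(bytes) => Some(bytes.len()),
        Hir::Concat(subs) => subs.iter().map(maximum_len).sum::<Option<usize>>(),
        Hir::Alt(subs) => {
            let mut longest = 0;
            for sub in subs {
                longest = longest.max(maximum_len(sub)?);
            }
            Some(longest)
        }
        Hir::Repeat { max, sub, .. } => {
            let n = (*max)?;
            maximum_len(sub)?.checked_mul(n)
        }
    }
}
```

The old 'is_match_empty' question then becomes a special case: an expression can match the empty string exactly when its minimum length is 0.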

Instead of using a boolean parameter, we just split them into dot_char, dot_byte, any_char, any_byte. Another path would be to use an enum, but this appeals to me a little more.

It turns out they are completely superfluous in the HIR, so we can drop them completely. We only need to explicitly represent capturing groups.

This makes it so 'a{1}' is rewritten as 'a' and '[a]' is rewritten as 'a'. A lot of the tests expected '[a]' to get preserved as a class in the HIR, so this required a bit of surgery.

In a previous commit, I replaced 'Hir::{any,dot}' with a total of four methods. Essentially, I expanded out the boolean parameter to 'Hir::{any,dot}'. I later realized that we'll probably need a "dot except for CR and LF" too. And having four methods all for the same 'dot' construct seemed a bit much. So I've turned it into one method with a new 'Dot' enum. Eventually, that enum should grow two more variants: 'AnyCharExceptCRLF' and 'AnyByteExceptCRLF'. That sort of expansion would have been pretty annoying to do (because of naming) in the prior scheme.

We simplify construction a bit to prepare for bigger simplifications. We also fix a bug in 'Hir::alternation' where it would incorrectly return 'Hir::empty()' when given an empty alternation. That's correct for an empty concatenation, but an alternation with no branches is equivalent to an expression that never matches anything. To fix that, we create a new 'Hir::fail' that canonicalizes the HIR value used to indicate "impossible to match." Thankfully this bug was unlikely to be observed unless one was constructing HIR values manually. Namely, it is impossible to spell "empty alternation" in the concrete syntax of a regex.
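
The distinction between the two empty cases can be sketched with toy smart constructors. The names mirror the ones mentioned above, but the types here are simplified illustrations, not the regex-syntax definitions.

```rust
#[derive(Debug, PartialEq)]
enum Hir {
    Empty, // matches only the empty string
    Fail,  // never matches anything
    Lit(char),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
}

/// An empty concatenation matches the empty string.
fn concat(mut subs: Vec<Hir>) -> Hir {
    match subs.len() {
        0 => Hir::Empty,
        1 => subs.pop().unwrap(),
        _ => Hir::Concat(subs),
    }
}

/// An alternation with no branches has nothing it could match, so it
/// canonicalizes to "fail" rather than "empty".
fn alternation(mut subs: Vec<Hir>) -> Hir {
    match subs.len() {
        0 => Hir::Fail,
        1 => subs.pop().unwrap(),
        _ => Hir::Alt(subs),
    }
}
```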

The default derive(Debug) impl for Hir is very noisy because it lists out the properties for every Hir value. We change the default to just print out the actual expressions and omit the properties. But one can opt back into seeing the properties via the "alternate" impl. i.e., {:#?} instead of {:?}.

This makes the Hir::concat constructor a bit smarter by combining adjacent literals and flattening child concatenations into the parent concatenation.
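
A sketch of what such a constructor might do, on a toy HIR. This is an illustration under simplified assumptions (the 'Class' variant stands in for any non-literal expression), not the actual 'Hir::concat' code.

```rust
#[derive(Debug, PartialEq)]
enum Hir {
    Empty,
    Lit(Vec<u8>),
    Class, // placeholder for any non-literal expression
    Concat(Vec<Hir>),
}

fn concat(subs: Vec<Hir>) -> Hir {
    let mut out: Vec<Hir> = Vec::new();
    for sub in subs {
        // Flatten one layer only: children were built by this same
        // constructor, so they are already flat.
        let children = match sub {
            Hir::Concat(children) => children,
            other => vec![other],
        };
        for child in children {
            if let Hir::Lit(next) = child {
                // Merge with the previous literal when there is one.
                if let Some(Hir::Lit(prev)) = out.last_mut() {
                    prev.extend(next);
                } else {
                    out.push(Hir::Lit(next));
                }
            } else {
                out.push(child);
            }
        }
    }
    match out.len() {
        0 => Hir::Empty,
        1 => out.pop().unwrap(),
        _ => Hir::Concat(out),
    }
}
```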

Just always strip Properties. It's so annoying to see it when you really just want to see the syntax.

This commit simplifies alternations by flattening them, similar to how a recent commit flattened concatenations. Although, this is simpler than concatenations, because we can't do anything with literals. Like concatenations, we only need to look one layer deep, since this is applied inductively.

In short, simplify 'a|b|..|z' to '[a-z]'.

When a character class is empty, the HIR printer would emit '[]', which is not a valid regex. (Since if a ']' immediately follows an opening '[', then the ']' is interpreted literally and not a closing bracket.) Instead, we write '[a&&b]'. We could also do things like '(?u:\P{any})' or '(?-u:[\x00-\xFF])', but '[a&&b]' doesn't require any flags and also seems really obvious: the intersection of two distinct characters is obviously empty.

Since these functions are tiny and not polymorphic, we should permit them to be inlined across crate boundaries.

We need to know the length of the next codepoint we want to debug, otherwise it's possible for a naive 'slice[..4]' to fail if the end of the slice happens to split a codepoint.
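
A small sketch of the idea, using only the standard library: work out how long the next codepoint is from its leading byte, and only then slice. The function names are hypothetical.

```rust
/// Number of bytes in a UTF-8 sequence that starts with `lead`, or
/// `None` if `lead` cannot start a sequence. Full validation is left
/// to `std::str::from_utf8` below.
fn utf8_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1),
        0xC0..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF7 => Some(4),
        _ => None,
    }
}

/// Decodes the first codepoint of `bytes`, if it is valid UTF-8.
fn decode_first(bytes: &[u8]) -> Option<char> {
    let len = utf8_len(*bytes.first()?)?;
    let prefix = bytes.get(..len)?;
    std::str::from_utf8(prefix).ok()?.chars().next()
}
```

For a haystack like "☃☃", the first four bytes end in the middle of the second snowman, so validating a fixed four-byte window fails even though the first codepoint is perfectly valid.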

And also add some inline annotations on non-generic but tiny functions.

After years of saying "literal extraction needs to be rewritten," I've finally gathered up the courage to do it. While this commit doesn't show it, this is actually now the third time I rewrote it. I rewrote it a second time about a week prior to this and got close to the finish line when I realized I had to throw it away. In that approach, I tried to abandon the "mark each individual literal as exact" idea in the original literal extraction code and instead treat the entire set of literals as "exact" or not. (I also changed the terminology from "complete" to "exact," which I think is maybe a bit better. I also got rid of "cut" and instead use "inexact.")

The main problem with not marking each individual literal as exact or not is that it potentially inhibits longer literal extraction. For example, in the regex 'ab*cd', with individual literals marked as exact, we can extract the sequence [inexact(ab), exact(acd)]. But with the entire set being all exact or all inexact, there's no real way to let extraction continue through the empty string produced by the '*' repetition operator. There were some other problems with my second rewrite around short-circuiting concats/alternations when sequences got too big, but I think I could have resolved them.

In the end, the third rewrite is quite good. It actually roughly corresponds to the original code, but is cleaned up and much more principled. The original code had these problems, for example:

1. It didn't care about order and thus didn't correctly produce literals in a sequence for which leftmost-first match semantics were preserved.
2. It didn't differentiate between "empty set" and "infinite set." These are two pretty subtle cases and them not being distinct in the code was really quite messy.
3. It tried to carry a literal set throughout extraction, and this has the effect of forcing every part of extraction to care about concatenation. But now we just force a stronger separation of responsibility. We might wind up with a few more allocs, but the in-practice small set size limits and short circuiting mean that it usually doesn't matter relative to the other costs of parsing, translating and compiling regexes.

I ported over pretty much all of the older tests and added more of my own. Overall, I feel much more confident about this new literal extraction than I do the old.

We do also insert some heuristics for trimming literal sets in src/exec.rs that didn't exist before. This is because the new extraction code tends to respect the limits a bit more faithfully and sometimes returns bigger sets than the old code. This is bad because more literals means prefilters are probably less effective. So we write a little bit of code to mitigate this.

We also do let a few cases get slower for the time being. The suffix handling is not quite ideal, so many of the easy/medium/hard benchmarks are now a little slower. The name_alt3_nocase benchmark is also slower because the new extraction code notices that the literals blow the limits and only returns an infinite sequence. The old extraction code had some (practical but unprincipled) techniques for shrinking its set as it went, and this caused literals to get extracted for it. We can fix this, but it will take a little more effort that I don't want to spend right now.

In any case, the hope is to smooth out any issues as we head towards bringing regex-automata in.

I couldn't figure out how to easily make doc tests run with 'no_std' enabled, which regex-syntax now does. The '?' in particular was tripping me up. We still get doctest coverage from the top-level 'cargo test'.

Whenever we have an alternation where each of its branches is just a class, we can always combine them into a single class. Single classes are generally going to be cheaper to process further down the pipeline. Namely, instead of needing to branch between them at a higher level in an NFA graph, they can be handled as one single unit.
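
A sketch of the idea on a toy HIR where a class is a list of inclusive 'char' ranges. The types and the merging strategy here are illustrative assumptions, not the regex-syntax implementation.

```rust
#[derive(Debug, PartialEq)]
enum Hir {
    Class(Vec<(char, char)>), // inclusive ranges
    Other(String),            // placeholder for any non-class expression
    Alt(Vec<Hir>),
}

fn alternation(subs: Vec<Hir>) -> Hir {
    // Only collapse when every branch is a class.
    if !subs.iter().all(|h| matches!(h, Hir::Class(_))) {
        return Hir::Alt(subs);
    }
    let mut ranges: Vec<(char, char)> = Vec::new();
    for sub in subs {
        if let Hir::Class(rs) = sub {
            ranges.extend(rs);
        }
    }
    ranges.sort();
    // Merge ranges that overlap or sit directly next to each other.
    let mut merged: Vec<(char, char)> = Vec::new();
    for (start, end) in ranges {
        if let Some(last) = merged.last_mut() {
            if u32::from(start) <= u32::from(last.1) + 1 {
                if end > last.1 {
                    last.1 = end;
                }
                continue;
            }
        }
        merged.push((start, end));
    }
    Hir::Class(merged)
}
```

Under this scheme, an alternation like 'a|b|c' (each branch a single-character class) collapses to the one class '[a-c]'.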

This is generally overall pretty annoying.

This factors out the constructor for properties for an alternation into a public API method called "union." This is useful for collapsing the properties for multiple regexes down into one analyzable unit. The 'captures_len' method is also useful for making decisions like "if this regex has no captures and is all literals, then we don't ever need to use a regex engine under any circumstance."

This makes it a little terser to check different types of word boundaries in the lookset.

This gets rid of the AsRef<[u8]> FromIterator impl for Seq, which is unfortunate, but it lets us provide an AsRef<[u8]> impl for Literal. The latter ends up being quite useful to avoid copying and/or extra allocs.

Their docs explain their utility. In the old literal extraction setup, some (but not all) of this "optimization" was somewhat baked into the extraction itself, but now we codify it a bit more explicitly.

This also inverts its meaning, i.e., utf8=!allow_invalid_utf8. This naming is consistent with the naming used in regex-automata. In general, I find names without negations in them to be clearer, since they avoid double negations.

On some occasions, it can make sense to trim the current literal sequences before doing a 'union' IF doing that union would cause the sequences to become infinite because of a blown limit. If we can keep literal extraction going by trimming things down, that's usually beneficial. For now, we just kind of guess that '3' is a good sweet spot for this.

It is generally quite subtle to reason clearly about how this actually helps things in a finite automata based regex engine, but this sort of factoring can lead to lots of improvements:

* We do use a bounded backtracker, so "pushing branches" down will help things there, just like it would with a classical backtracker.
* It may lead to better literal extraction due to the simpler regex. Whether prefix factoring is really to blame here is somewhat unclear, but some downstream optimizations are more brittle than others. For example, the "reverse inner" optimization requires examining a "top level" concatenation to find literals to search for. By factoring out a common prefix, we potentially expand the number of regexes that have a top-level concat. For example, `\wfoo|\wbar` has no top-level concat but `\w(?:foo|bar)` does.
* It should lead to faster matching even in finite automata oriented engines like the PikeVM, and also faster construction of DFAs (lazy or not). Namely, by pushing the branches down, we make it so they are visited less frequently, and thus the constant state shuffling caused by branches is reduced.

The prefix extraction could be better, as mentioned in the comments, but this is a good start.
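
The core of the factoring step can be sketched generically: given the branches of an alternation as sequences of sub-expressions, split off the longest prefix they all share. The function below is a hypothetical illustration that works on any comparable element type, not the code in this commit.

```rust
/// Splits `branches` into (common prefix, remaining suffix per branch).
fn factor_prefix<T: PartialEq + Clone>(branches: &[Vec<T>]) -> (Vec<T>, Vec<Vec<T>>) {
    let first = match branches.first() {
        Some(first) => first,
        None => return (Vec::new(), Vec::new()),
    };
    // The shared prefix can be no longer than the shortest agreement
    // between the first branch and every other branch.
    let mut len = first.len();
    for branch in &branches[1..] {
        let common = first
            .iter()
            .zip(branch)
            .take_while(|(a, b)| a == b)
            .count();
        len = len.min(common);
    }
    let prefix = first[..len].to_vec();
    let rest = branches.iter().map(|b| b[len..].to_vec()).collect();
    (prefix, rest)
}
```

Treating `\wfoo|\wbar` as the branches `[\w, foo]` and `[\w, bar]`, this yields the prefix `[\w]` and the suffixes `[foo]` and `[bar]`, which corresponds to `\w(?:foo|bar)`.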

It turns out that both '(?P<name>...)' and '(?<name>...)' are rather common among regex engines. There are several that support just one or the other. Until this commit, the regex crate only supported the former, along with RE2, RE2/J and Go's regexp package. There are also several regex engines that only support the latter, such as Onigmo, Oniguruma, Java, Ruby, Boost, .NET and JavaScript. To decrease friction, and because there is somewhat little cost to doing so, we elect to support both. It looks like perhaps RE2 and Go's regexp package will go the same route, but it isn't fully decided yet: golang/go#58458 Closes #955, Closes #956

Unbelievably, this was using the size of the compiled prog *and* the heap memory used by the cache to compute the total memory used by the cache. The effect of this is that the reported size might be much bigger than what is actually used by the cache. This in turn would result in the lazy DFA thrashing the cache and going quite slow.

This is a transitory commit that will need to be updated once aho-corasick 1.0 is actually released. Its purpose is to make it so the regex crate, the "old" regex crate and regex-automata all agree on the same version of aho-corasick to use while in development.

Now that it *only* represents a capturing group, it makes sense to give it a more specific name.

Where 'sub' is short for 'sub-expression.'

This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
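
The intended semantics for ^/$ in CRLF mode can be sketched as two predicates over a byte haystack. This is only an illustration of the rules stated above (the function names are hypothetical, and 'at' is assumed to be at most 'haystack.len()'); it is not the engine code.

```rust
/// True if `at` is a line start in CRLF mode: the beginning of the
/// haystack, just after `\n`, or just after a `\r` that is not
/// immediately followed by `\n`.
fn is_start_crlf(haystack: &[u8], at: usize) -> bool {
    at == 0
        || haystack[at - 1] == b'\n'
        || (haystack[at - 1] == b'\r'
            && (at >= haystack.len() || haystack[at] != b'\n'))
}

/// True if `at` is a line end in CRLF mode: the end of the haystack,
/// just before `\r`, or just before a `\n` that is not immediately
/// preceded by `\r`.
fn is_end_crlf(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
        || haystack[at] == b'\r'
        || (haystack[at] == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}
```

For the haystack "a\r\nb", neither predicate holds at offset 2, which is the position between \r and \n.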

This updates docs in a number of places, including adding examples. We also make it so zero-width matches never impact the 'utf8' property. In practice, this means '(?-u:\B)' is now considered to match valid UTF-8, which is consistent with the fact that 'a*' is considered to match valid UTF-8 too. We also do a refresh of the 'Look' and 'LookSet' APIs.

This resolves a long-standing (but somewhat minor) complaint that folks have with the regex crate: it does not permit escaping punctuation characters in cases where those characters do not need to be escaped. So things like \/, \" and \! would result in parse errors. Most other regex engines permit these, even in cases where they aren't needed. I had been against doing this for future evolution purposes, but it's incredibly unlikely that we're ever going to add a new meta character to the syntax. I literally cannot think of any conceivable future in which that might happen. However, we do continue to ban escapes for [0-9A-Za-z<>], because it is conceivable that we might add new escape sequences for those characters. (And 0-9 are already banned by virtue of them looking too much like backreferences, which aren't supported.) For example, we could add \Q...\E literal syntax. Or \< and \> as start and end word boundaries, as found in POSIX regex engines. Fixes #501

This changes the rules for capture names to be much less restrictive. Namely, the requirements are now:

1. Must begin with an `_` or any alphabetic codepoint.
2. After the first codepoint, the name may contain any sequence of alpha-numeric codepoints along with `_`, `.`, `[` and `]`.

Closes #595
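
The two rules can be sketched as a small validator. The function name is hypothetical, and using 'char::is_alphabetic' and 'char::is_alphanumeric' as the definition of "alphabetic" and "alpha-numeric" is an assumption of this sketch.

```rust
/// Checks a capture group name against the two rules above.
fn is_valid_capture_name(name: &str) -> bool {
    let mut chars = name.chars();
    // Rule 1: the first codepoint is `_` or alphabetic.
    match chars.next() {
        Some(c) if c == '_' || c.is_alphabetic() => {}
        _ => return false,
    }
    // Rule 2: the rest is alphanumeric or one of `_`, `.`, `[`, `]`.
    chars.all(|c| c.is_alphanumeric() || matches!(c, '_' | '.' | '[' | ']'))
}
```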

This adds a new routine for computing the static number of capture groups that will appear in every match. If the number of groups is not invariant across all matches, then there is no static capture length. This is meant to help implement higher level convenience APIs for extracting capture groups, such as the one described in #824. We may wind up including such APIs in the regex crate itself, but this commit stops short of that. Instead, we just add this new property which should permit those APIs to exist outside of this crate for now. Closes #908
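
One way to picture a "static" capture count is the following sketch over a toy HIR. It counts explicit groups only and uses deliberately simplified rules (for instance, only an optional 'sub?' form of repetition), so it should be read as an illustration of the idea rather than the routine added here.

```rust
enum Hir {
    Lit(char),
    Capture(Box<Hir>),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
    Optional(Box<Hir>), // `sub?`
}

/// Number of capture groups that participate in every match, or `None`
/// if that number can differ from match to match.
fn static_captures_len(hir: &Hir) -> Option<usize> {
    match hir {
        Hir::Lit(_) => Some(0),
        Hir::Capture(sub) => Some(1 + static_captures_len(sub)?),
        Hir::Concat(subs) => {
            let mut total = 0;
            for sub in subs {
                total += static_captures_len(sub)?;
            }
            Some(total)
        }
        Hir::Alt(subs) => {
            // Every branch must agree on the count.
            let mut lens = subs.iter().map(static_captures_len);
            let first = match lens.next() {
                Some(len) => len?,
                None => return Some(0),
            };
            if lens.all(|len| len == Some(first)) {
                Some(first)
            } else {
                None
            }
        }
        // A group inside an optional expression may or may not match.
        Hir::Optional(sub) => match static_captures_len(sub)? {
            0 => Some(0),
            _ => None,
        },
    }
}
```

In this toy model, '(a)|(b)' has a static length of 1, while '(a)|b' and '(a)?' have none.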

And do the same for 'static_captures_len'. The motivation for this is that the top-level Regex API had equivalently named methods 'captures_len' and 'static_captures_len', except those included the implicit group and were therefore always 1 more than the same APIs on Hir. We distinguish them by renaming the routines on HIR.

It turns out that it's not too hard to get HIR translation to run pretty slowly with some carefully crafted regexes. For example:

    (?i:[[:^space:]------------------------------------------------------------------------])

This regex is actually a [:^space:] class that has an empty class subtracted from it 36 times. For each subtraction, the resulting class (despite it not having changed) goes through Unicode case folding again. This in turn slows things way down. We introduce a fairly basic optimization that basically keeps track of whether an interval set has been folded or not. The idea was taken from PR #893, but was tweaked slightly. The magic of how it works is that if two interval sets have already been folded, then they retain that property after any of the set operations: negation, union, difference, intersection and symmetric difference. So case folding should generally only need to be run once for each "base" class, but then not again as operations are performed. Some benchmarks were added to rebar (which isn't public yet at time of writing). Closes #893

I'm overall coming around to the opinion that these tend to make the code harder to read. So I've been steadily dropping the Result aliases.

This rewrites how Unicode simple case folding works. Instead of just defining a single function and expecting callers to deal with the fallout, we now define a stateful type that "knows" about the structure of the case folding table. For example, it now knows enough to avoid binary search lookups in most cases. All we really have to do is require that callers look up codepoints in sequence, which is perfectly fine for our use case. Ref #893

Previously, classes would show up in the debug representation as very deeply nested things, making them more difficult to read than they need to be. This removes at least a few pretty redundant layers and uses a more compact range notation.

The contract of this function says that any invalid group offset should result in a return value of None. In general, it worked fine, unless the offset was so big that some internal multiplication overflowed. That could in turn produce an incorrect result or a panic. So we fix that here with checked arithmetic. Fixes #738, Fixes #950
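
The general shape of such a fix can be sketched with checked arithmetic. The two-slots-per-group layout and the function name below are assumptions made for illustration, not the actual internals.

```rust
/// Returns the (start, end) slot indices for `group`, or `None` if the
/// group is out of range. Assumes (hypothetically) that each group owns
/// two consecutive slots out of `slot_count` total.
fn slot_range(group: usize, slot_count: usize) -> Option<(usize, usize)> {
    // `checked_mul` and `checked_add` turn overflow into `None` instead
    // of a wrapped result (release builds) or a panic (debug builds).
    let start = group.checked_mul(2)?;
    let end = start.checked_add(1)?;
    if end < slot_count {
        Some((start, end))
    } else {
        None
    }
}
```

With plain `group * 2`, a huge offset like `usize::MAX` could wrap around to a small, seemingly valid index.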

This makes it clearer that the regex engine works by *logically* treating a haystack as a sequence of codepoints. Or more specifically, Unicode scalar values. Fixes #854

The existing docs were pretty paltry, and it turns out we can be a bit more helpful for folks when they hit this error. Fixes #846

Adding these methods has almost no cost and they can be convenient to have in some cases. Closes #810

The name is somewhat unfortunate, but it's actually kind of difficult to capture the right semantics in the name. The key bit is that the function returns the offset at the point at which a match is known, and that point might vary depending on which internal regex engine was used. Fixes #747

This clarifies that `x` is "verbose mode," and that whitespace becomes insignificant everywhere, including in character classes. We also add guidance for how to insert a space: either escape it or use a hex literal. Fixes #660

It is really unfortunate, but SetMatches::len and SetMatches::iter().count() do not correspond to the same thing. It's not clear why I even added the SetMatches::len method in the first place, but it always returns the number of regexes in the set, and not the number of regexes that matched. We can't change the name (or remove the method) obviously, but we do add a warning to the docs. Fixes #625

And we make it an interesting example, i.e., one that demonstrates preference order semantics. Closes #610

This isn't *strictly* needed because of the existence of Regex::captures_read_at, but it does fill out the singular missing method. Namely, all other search routines have an *_at variant, so we might as well add it for Regex::captures too. Closes #547

This makes it so the Debug impl for Match only shows the actual matched text. Otherwise, the Match shows the entire haystack, which is likely to be misleading. Fixes #514
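
A sketch of the idea with a hypothetical stand-in type: the Debug impl is written by hand so that it prints the offsets and the matched slice, but never the full haystack. The field names in the output are illustrative.

```rust
use std::fmt;

/// Hypothetical stand-in for a match: a haystack plus byte offsets.
struct Match<'h> {
    haystack: &'h str,
    start: usize,
    end: usize,
}

impl<'h> fmt::Debug for Match<'h> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("Match")
            .field("start", &self.start)
            .field("end", &self.end)
            // Only the matched slice is shown, never the whole haystack.
            .field("string", &&self.haystack[self.start..self.end])
            .finish()
    }
}
```

A derived Debug impl would instead print the 'haystack' field in full, which is the misleading behavior described above.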

This is useful when doing structural recursion on a '&Hir' to produce a new 'Hir' derived from it.

Since it uses heap memory and because it's something you typically hang on to in a regex engine, we expose a routine for computing heap memory. We might consider doing this for other types in regex-syntax, but there hasn't been a strong need for it yet.

The wording appears to be a little unclear, so we switch it around a bit. Fixes #975

This will need to be updated again to add a date (maybe today?), but this should cover everything from the commit log.


Commits on Apr 17, 2023