first phase of migrating to regex-automata #977
Commits on Apr 17, 2023
This sets 'rust-version' to 1.60 and also increases the pinned Rust version that we test against in CI to 1.60.0. Rust 1.60.0 was released over a year ago and contains some important stuff. Notably, it includes namespaced and weak dependency features that are used in the (soon to be) released aho-corasick 1.0. They will also be extensively used in regex-automata 0.3, which is coming to a rust-lang/regex repository near you Real Soon Now.
Commit f43d745
Apparently in C, an empty parameter list means "the function takes an unspecified number of arguments." (lol.) But an explicit void means "the function takes zero arguments." The latter is indeed what we want here. Ref: https://softwareengineering.stackexchange.com/questions/286490/what-is-the-difference-between-function-and-functionvoid Closes #942
Commit b68896d
api: impl Default for RegexSet
This is justified by the fact that a RegexSet is, after all, a set. And a set has a very obvious default value: the empty set. Plus, this is exactly what you get by passing a default `Vec` or an empty iterator to the `RegexSet::new` constructor. We specifically do not add a `Default` impl for Regex because it has no obvious default value. Fixes #905, Closes #906
Commit caf0141
regex-debug: this removes regex-debug
There will be a new 'regex-cli' tool that will supplant this (and more).
Commit 544374b
syntax: \p{Sc} should map to \p{Currency_Symbol}
'sc' refers to the 'Currency_Symbol' general category, but is also the abbreviation for the 'Script' property. So when going through the canonicalization process, it would get normalized to 'Script' before being checked as a general category. We fix it by special casing it. See also #719 Fixes #835, #899
Commit 345f18a
syntax: \p{Lc} should map to \p{Cased_Letter}
This is more similar to the \p{Cf} bug than the \p{Sc} bug, but basically, 'lc' is an abbreviation for both 'Cased_Letter' and 'Lowercase_Mapping'. Since we don't support the latter (currently), we make 'lc' map to 'Cased_Letter'. If we do ever add 'Lowercase_Mapping' in the future, then we will just require users to type out its full form. Fixes #965
Commit 6bbb064
syntax: add 'try_case_fold_simple' to 'Class'
Previously this was only defined on 'ClassUnicode', but since 'Class' might contain a 'ClassUnicode', it should be defined here too. We don't need to update any call sites since this crate doesn't actually use 'Class::case_fold_simple' directly, and instead manipulates the underlying 'ClassUnicode' or 'ClassBytes'.
Commit 906d149
This effectively bumps the MSRV of 'regex' to Rust 1.56, which was released in Oct 2021. It's not quite a year at the time of writing, but I expect it will be a year by the time this change is released.
Commit f59ebfa
syntax: remove all uses of 'as'
It turns out that all uses of 'as' in the regex-syntax crate can be replaced with either explicitly infallible routines (like 'u32::from(char)'), or with routines that will panic on failure. These panics are strictly better than truncating casts that might otherwise lead to subtle bugs in the context of this crate. (Namely, we never really care about the perf effects here, since regex parsing is just never a bottleneck.)
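As a rough illustration of the trade-off described above, here is a std-only sketch. It is not code from regex-syntax, and the helper names are made up for this example. It contrasts an infallible conversion, a conversion that panics on failure, and the truncating cast they replace.

```rust
use std::convert::TryFrom;

/// Infallible: every `char` is a valid `u32`, so no cast is needed.
fn scalar_value(c: char) -> u32 {
    u32::from(c)
}

/// Fallible narrowing: panics loudly instead of silently truncating.
fn to_byte(n: u32) -> u8 {
    u8::try_from(n).expect("value must fit in a byte")
}

/// The truncating cast, shown only for comparison.
fn to_byte_truncating(n: u32) -> u8 {
    n as u8
}
```

With these definitions, `to_byte_truncating(300)` quietly yields 44, while `to_byte(300)` panics, which is the behavior the commit message argues is preferable here.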
Commit a23911d
syntax: remove 'std::error::Error::description' impls
This method was deprecated a while ago, but we kept it around because it wasn't worth a breaking release to remove them. This also simplifies some imports.
Commit b147fe3
syntax: remove '__Nonexhaustive' hack, use #[non_exhaustive]
This marks the various error types as '#[non_exhaustive]' instead of using a __Nonexhaustive variant hack. Closes #884
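For readers unfamiliar with the two approaches, a minimal sketch follows. The enum and variant names are hypothetical and are not the actual regex-syntax error types.

```rust
/// The old hack: a hidden variant that nobody should construct, which
/// forces downstream `match` expressions to carry a wildcard arm.
#[derive(Debug, PartialEq)]
pub enum OldErrorKind {
    UnclosedGroup,
    InvalidEscape,
    #[doc(hidden)]
    __Nonexhaustive,
}

/// The attribute-based replacement: same effect for downstream crates,
/// without a bogus variant in the type.
#[derive(Debug, PartialEq)]
#[non_exhaustive]
pub enum NewErrorKind {
    UnclosedGroup,
    InvalidEscape,
}

/// Downstream crates must include the wildcard arm; inside the defining
/// crate it is unreachable, hence the `allow`.
#[allow(unreachable_patterns)]
pub fn describe(kind: &NewErrorKind) -> &'static str {
    match kind {
        NewErrorKind::UnclosedGroup => "unclosed group",
        NewErrorKind::InvalidEscape => "invalid escape",
        _ => "unknown error",
    }
}
```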
Commit 06df9ac
syntax: permit empty character classes
An empty character class is effectively a way to write something that can never match anything. The regex crate has pretty much always returned an error for such things because it was never taught how to handle "always fail" states. Partly because I just didn't think about it when initially writing the regex engines and partly because it isn't often useful. With that said, it should be supported for completeness and because there is no real reason to not support it. Moreover, it can be useful in certain contexts where regexes are generated and you want to insert an expression that can never match. It's somewhat contrived, but it happens when the interface is a regex pattern. Previously, the ban on empty character classes was implemented in the regex-syntax crate. But with the rewrite in #656 getting closer and closer to landing, it's now time to relax this restriction. However, we do keep the overall restriction in the 'regex' API by returning an error in the NFA compiler. Once #656 is done, the new regex engines will permit this case.
Commit 5a770dc
syntax: reject '(?-u)\W' when UTF-8 mode is enabled
When Unicode mode is disabled (i.e., (?-u)), the Perl character classes (\w, \d and \s) revert to their ASCII definitions. The negated forms of these classes are also derived from their ASCII definitions, and this means that they may actually match bytes outside of ASCII and thus possibly invalid UTF-8. For this reason, when the translator is configured to only produce HIR that matches valid UTF-8, '(?-u)\W' should be rejected.

Previously, it was not being rejected, which could actually lead to matches that produced offsets that split codepoints, and thus lead to panics when match offsets are used to slice a string. For example, this code

    fn main() {
        let re = regex::Regex::new(r"(?-u)\W").unwrap();
        let haystack = "☃";
        if let Some(m) = re.find(haystack) {
            println!("{:?}", &haystack[m.range()]);
        }
    }

panics with

    byte index 1 is not a char boundary; it is inside '☃' (bytes 0..3) of `☃`

That is, it reports a match at 0..1, which is technically correct, but the regex itself should have been rejected in the first place since the top-level Regex API always has UTF-8 mode enabled.

Also, many of the replacement tests were using '(?-u)\W' (or similar) for some reason. I'm not sure why, so I just removed the '(?-u)' to make those tests pass. Whether Unicode is enabled or not doesn't seem to be an interesting detail for those tests. (All haystacks and replacements appear to be ASCII.)

Fixes #895, Partially addresses #738
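The failure mode can be reproduced with the standard library alone. The sketch below does not use the regex crate, and both helper names are invented for illustration. It shows why an ASCII-derived negated word class accepts the first byte of '☃', and why slicing at the resulting offsets cannot succeed.

```rust
/// ASCII definition of `\W`: anything that is not [0-9A-Za-z_].
/// Every byte >= 0x80 satisfies this, including UTF-8 lead bytes.
fn is_ascii_non_word(b: u8) -> bool {
    !(b.is_ascii_alphanumeric() || b == b'_')
}

/// Slices only when both offsets land on char boundaries. Plain indexing
/// (`&haystack[start..end]`) would panic where this returns `None`.
fn checked_slice(haystack: &str, start: usize, end: usize) -> Option<&str> {
    haystack.get(start..end)
}
```

Here `checked_slice("☃", 0, 1)` returns `None`, which corresponds to the panic shown in the commit message when the offsets are used with ordinary indexing.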
Commit 2b2e20a
In effect, this adds support for no_std by depending on only core and alloc. There is still currently some benefit to enabling std support, namely, getting the 'std::error::Error' trait impls for the various error types. (Although, it seems like the 'Error' trait is going to get moved to 'core' finally.) Otherwise, the only 'std' things we use are in tests for tweaking stack sizes. This is the first step in an effort to make 'regex' itself work without depending on 'std'. 'regex' itself will be more precarious since it uses things like HashMap and Mutex that we'll need to find a way around. Getting around HashMap is easy (just use BTreeMap), but figuring out how to synchronize the threadpool will be interesting. Ref #476, Ref #477
Commit 377232b
I wish this feature were stable and enabled by default. I suspect that it maybe doesn't work correctly 100% of the time, but it's super useful. And manually annotating APIs is a huge pain, so it's worth at least attempting.
Commit 5d9746d
syntax: switch to rustdoc intra links
Get rid of those old crusty HTML links! Also, if an intradoc link is used that is bunk, fail the build.
Commit 7bd2d9a
syntax: simplify hir::GroupKind
I'm not sure exactly why I used three variants instead of two like how I've defined it in this patch. Possibly because the AST uses three variants? (The AST needs to do a little more work to store a span associated with where the name actually is in the expression, so it maybe makes a little more sense there.) In any case, this is the first step of many in simplifying the HIR.
Commit 52d5393
syntax: remove WordBoundary::is_negated method
This is apparently not used anywhere. So drop it. Also motivated by wanting to squash look-around assertions into a single enum. So 'is_negated' won't make sense on its own anymore.
Commit 00ea571
syntax: flatten look-around assertions
Instead of having both 'HirKind::Anchor' and 'HirKind::WordBoundary', this patch flattens them into one 'HirKind::Look'. Why do this? I think they make more sense grouped together. Namely, they are all simplistic look-around assertions and they all tend to be handled with very similar logic.
Commit 1f707e7
syntax: simplify hir::Repetition
This greatly simplifies how repetitions are represented in the HIR from a sprawling set of variants down to just a simple `(u32, Option<u32>)`. This is much simpler and still permits us to specialize all of the cases we did before if necessary. This also simplifies some of the HIR printer's output. e.g., 'a{1}' is just 'a'.
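To make the `(u32, Option<u32>)` representation concrete, here is a small sketch that maps the usual repetition operators onto a minimum and an optional maximum. The `Bounds` type and `bounds_for` function are hypothetical and only mirror the shape described above; they are not the regex-syntax API.

```rust
/// A repetition as a minimum count and an optional maximum
/// (`None` means unbounded).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    min: u32,
    max: Option<u32>,
}

/// Maps an operator such as `*`, `+`, `?`, `{n}`, `{n,}` or `{n,m}` to
/// its bounds. Returns `None` for anything it does not understand.
fn bounds_for(op: &str) -> Option<Bounds> {
    match op {
        "?" => return Some(Bounds { min: 0, max: Some(1) }),
        "*" => return Some(Bounds { min: 0, max: None }),
        "+" => return Some(Bounds { min: 1, max: None }),
        _ => {}
    }
    let inner = op.strip_prefix('{')?.strip_suffix('}')?;
    match inner.split_once(',') {
        None => {
            let n = inner.parse().ok()?;
            Some(Bounds { min: n, max: Some(n) })
        }
        Some((lo, "")) => Some(Bounds { min: lo.parse().ok()?, max: None }),
        Some((lo, hi)) => Some(Bounds {
            min: lo.parse().ok()?,
            max: Some(hi.parse().ok()?),
        }),
    }
}
```

Every operator collapses into the same two fields, so later passes can still special-case things like `min == 1 && max == Some(1)` (the 'a{1}' case) without a dedicated variant.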
Commit aa0c117
This fixes some corner cases in the HIR printer where it would print the concrete syntax of a regex that does not match the natural interpretation of the HIR. One such example of this is:

    concat(a, alt(b, c))

This would get printed as

    ab|c

But clearly, it should be printed as:

    a(?:b|c)

The issue here is that the printer only considers the current HirKind when determining how to print it. Sometimes a group is needed to print an alt (and even a concat, in the case of 'rep(+, concat(a, b))'), but sometimes it isn't. We could address this in a few different ways:

1) Always print concats and alts inside a non-capturing group.
2) Make the printer detect precisely the cases where a non-capturing group is needed.
3) Make the HIR smart constructors insert non-capturing groups when needed.
4) Do some other thing to change the HIR to prevent these sorts of things by construction.

This patch goes with (1). The reason in favor of it is that the HIR printer was always about printing an equivalent regex and never about trying to print a "nice" regex. Indeed, the HIR printer can't print a nice regex, because the HIR represents a rigorously simplified view of a regex to make analysis easier. (The most obvious such example is Unicode character classes. For example, the HIR printer never prints '\w'.) So inserting some extra groups (which it already does) even when they aren't strictly needed is perfectly okay.

But still, it's useful to say why we didn't do the other choices:

2) Modifying the printer to only print groups when they're actually needed is pretty difficult. I tried this briefly, and handling this case requires some notion of what the parent expression is. This winds up being a possible but hairy change.
3) Making the HIR more complicated to make the printer correct seems like it's optimizing for the wrong thing. Inserting extra groups in places just obfuscates HIR values that already have clear semantics. That is, use concat(a, alt(b, c)) over concat(a, group(alt(b, c))).
4) It's not clear how we would change the HIR to guarantee this sort of thing wouldn't happen. At the very least, it seems likely it would require a more complex data type.

At first, I had thought (1) seemed inelegant. But the more I thought about it, the more it seemed quite consistent with how the HIR printer already worked. So that's the path I took here.

Closes #516, Closes #731
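Option (1) can be sketched with a toy HIR. This is an illustrative model only: the `Hir` enum and `print` function below are invented for the example and are far simpler than the real regex-syntax printer.

```rust
/// A toy HIR with just enough structure to show the grouping problem.
enum Hir {
    Literal(char),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
}

/// Prints every concat and alt inside a non-capturing group, so the
/// output never depends on what the parent expression is.
fn print(hir: &Hir) -> String {
    match hir {
        Hir::Literal(c) => c.to_string(),
        Hir::Concat(subs) => {
            let body: String = subs.iter().map(print).collect();
            format!("(?:{})", body)
        }
        Hir::Alt(subs) => {
            let parts: Vec<String> = subs.iter().map(print).collect();
            format!("(?:{})", parts.join("|"))
        }
    }
}
```

For concat(a, alt(b, c)) this yields `(?:a(?:b|c))`: noisier than `a(?:b|c)`, but equivalent, and the printer needs no knowledge of context.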
Commit 6e59f32
syntax: 'a{0}' should compile to Hir::empty
No matter what 'a' is, 'a{0}' is always equivalent to an empty regex.
Commit 62802fa
syntax: switch to 'Vec<u8>' to represent literals
This gets rid of the old 'Literal' type:

    enum Literal {
        Unicode(char),
        Byte(u8),
    }

and replaces it with

    struct Literal(Box<[u8]>);

I did this primarily because I perceive the new version to be a bit simpler and it is very likely to be more space efficient given some of the changes I have in mind (upcoming in subsequent commits). Namely, I want to include more analysis information beyond just simple booleans, and this means using up more space. Putting that analysis information on every single byte/char seems gratuitous. But putting it on every single sequence of byte/chars seems more justifiable.

I also have a hand-wavy idea that this might make analysis a bit easier. And another hand-wavy idea that debug-printing such an HIR will make it a bit more comprehensible.

Overall, this isn't a completely obvious win and I do wonder whether I'll regret this. For one thing, the translator is now a fair bit more complicated in exchange for not creating a 'Vec<u8>' for every 'ast::Literal' node.

This also gives up the Unicode vs byte distinction and just commits to "all bytes." Instead, we do a UTF-8 check on every 'Hir::literal' call, and that in turn sets the UTF-8 property. This does seem a bit wasteful, and indeed, we do another UTF-8 check in the compiler (even though we could use 'unsafe' correctly and avoid it). However, once the new NFA compiler lands from regex-automata, it operates purely in byte-land and will not need to do another UTF-8 check. Moreover, a UTF-8 check, even on every literal, is likely barely measurable in the grand scheme of things.

I do also worry that this is overwrought. In particular, the AST creates a node for each character. Then the HIR smooths them out to sequences of characters (that is, Vec<u8>). And then NFA compilation splits them back out into states where a state handles at most one character (or range of characters). But, I am taking somewhat of a leap-of-judgment here that this will make analysis easier and will overall use less space. But we'll see.
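A minimal sketch of the "all bytes, check UTF-8 on construction" idea follows. The `LiteralHir` wrapper and the free function `literal` are hypothetical stand-ins; only `struct Literal(Box<[u8]>)` comes from the commit message.

```rust
/// The byte-oriented literal described above.
#[derive(Debug, PartialEq)]
struct Literal(Box<[u8]>);

/// Hypothetical stand-in for an HIR node carrying a UTF-8 property.
#[derive(Debug, PartialEq)]
struct LiteralHir {
    lit: Literal,
    is_utf8: bool,
}

/// Builds a literal from raw bytes and records, once per sequence rather
/// than once per byte, whether the bytes are valid UTF-8.
fn literal(bytes: &[u8]) -> LiteralHir {
    LiteralHir {
        is_utf8: std::str::from_utf8(bytes).is_ok(),
        lit: Literal(bytes.into()),
    }
}
```

The property is attached to the whole sequence, which is the space argument made in the message: one flag per run of bytes instead of one per char or byte.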
Commit 9c2b01e
This makes the Debug impls for Literal and ClassRangeBytes a bit better. The former in particular. Instead of just printing a sequence of decimal numbers, we now print them as characters. Given the lackluster support for Vec<u8> as a string in the standard library, we copy a little bit of code from regex-automata to make the debug print for the Vec<u8> basically as nice as a String.
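One way to get string-like debug output for raw bytes is sketched below. This is not the code copied from regex-automata; the `Bytes` wrapper is a made-up name, and the escaping policy (valid UTF-8 printed as text, every other byte as `\xNN`) is an assumption for the example.

```rust
use std::fmt;

/// Wrapper that debug-prints bytes as a quoted string where possible.
struct Bytes<'a>(&'a [u8]);

impl<'a> fmt::Debug for Bytes<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "\"")?;
        let mut rest = self.0;
        while !rest.is_empty() {
            match std::str::from_utf8(rest) {
                Ok(s) => {
                    // Everything left is valid UTF-8: print it and stop.
                    write!(f, "{}", s.escape_debug())?;
                    break;
                }
                Err(e) => {
                    // Print the valid prefix, then escape one bad byte.
                    let (valid, invalid) = rest.split_at(e.valid_up_to());
                    let s = std::str::from_utf8(valid).unwrap();
                    write!(f, "{}", s.escape_debug())?;
                    write!(f, "\\x{:02X}", invalid[0])?;
                    rest = &invalid[1..];
                }
            }
        }
        write!(f, "\"")
    }
}
```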
Commit 9f6f367
syntax: replace HirInfo with new Properties type
This commit completely rewrites how HIR properties are computed inductively. Firstly, 'Properties' is now boxed, so that it contributes less space to each HIR value. This does add an allocation for each HIR expression, but most HIR expressions already require at least one alloc anyway. And there should be far fewer of them now that we collapse literals together. Secondly, 'Properties' now computes far more general attributes instead of hyper-specific things. For example, instead of 'is_match_empty', we now have 'minimum_len' and 'maximum_len'. Similarly, instead of 'is_anchored_start' and 'is_anchored_end', we now compute sets of look-around assertions found anywhere, only as a prefix and only as a suffix. We also remove 'is_line_anchored_{start,end}'. They were only used in the 'grep-regex' crate and they seem unnecessary. They were otherwise fairly weird properties to compute.
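The inductive flavor of 'minimum_len' and 'maximum_len' can be sketched on a toy HIR. The types below are invented for illustration and simplify the real 'Properties' type considerably (for instance, an empty alternation is treated as length zero here).

```rust
/// A toy HIR; the real type has many more variants.
enum Hir {
    Literal(Vec<u8>),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
    Repeat { min: usize, max: Option<usize>, sub: Box<Hir> },
}

/// Match length bounds in bytes. `max == None` means unbounded.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Lens {
    min: usize,
    max: Option<usize>,
}

fn mul(a: Option<usize>, b: Option<usize>) -> Option<usize> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x * y),
        _ => None,
    }
}

/// Computes bounds bottom-up from the bounds of the sub-expressions.
fn lens(hir: &Hir) -> Lens {
    match hir {
        Hir::Literal(b) => Lens { min: b.len(), max: Some(b.len()) },
        Hir::Concat(subs) => {
            let mut out = Lens { min: 0, max: Some(0) };
            for l in subs.iter().map(lens) {
                out.min += l.min;
                out.max = match (out.max, l.max) {
                    (Some(a), Some(b)) => Some(a + b),
                    _ => None,
                };
            }
            out
        }
        Hir::Alt(subs) => {
            let all: Vec<Lens> = subs.iter().map(lens).collect();
            let min = all.iter().map(|l| l.min).min().unwrap_or(0);
            let max = if all.iter().any(|l| l.max.is_none()) {
                None
            } else {
                Some(all.iter().filter_map(|l| l.max).max().unwrap_or(0))
            };
            Lens { min, max }
        }
        Hir::Repeat { min, max, sub } => {
            let inner = lens(sub);
            Lens { min: inner.min * *min, max: mul(inner.max, *max) }
        }
    }
}
```

The old 'is_match_empty' question falls out as `lens(&hir).min == 0`, which is the sense in which the new attributes are more general.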
Commit c2daa3b
syntax: rejigger Hir::{dot,any}
Instead of using a boolean parameter, we just split them into dot_char, dot_byte, any_char, any_byte. Another path would be to use an enum, but this appeals to me a little more.
Commit 2c119ea
syntax: remove non-capturing groups from HIR
It turns out they are completely superfluous in the HIR, so we can drop them completely. We only need to explicitly represent capturing groups.
Commit 3a8313e
syntax: small HIR simplifications
This makes it so 'a{1}' is rewritten as 'a' and '[a]' is rewritten as 'a'. A lot of the tests expected '[a]' to get preserved as a class in the HIR, so this required a bit of surgery.
Commit 05cf861
syntax: add 'Hir::dot' method to replace 'Hir::{any,dot}_{char,byte}'
In a previous commit, I replaced 'Hir::{any,dot}' with a total of four methods. Essentially, I expanded out the boolean parameter to 'Hir::{any,dot}'. I later realized that we'll probably need a "dot except for CR and LF" too. And having four methods all for the same 'dot' construct seemed a bit much. So I've turned it into one method with a new 'Dot' enum. Eventually, that enum should grow two more variants: 'AnyCharExceptCRLF' and 'AnyByteExceptCRLF'. That sort of expansion would have been pretty annoying to do (because of naming) in the prior scheme.
Commit 22a3612
syntax: tweak concat and alternation construction
We simplify construction a bit to prepare for bigger simplifications. We also fix a bug in 'Hir::alternation' where it would incorrectly return 'Hir::empty()' when given an empty alternation. That's correct for an empty concatenation, but an alternation with no branches is equivalent to an expression that never matches anything. To fix that, we create a new 'Hir::fail' that canonicalizes the HIR value used to indicate "impossible to match." Thankfully this bug was unlikely to be observed unless one was constructing HIR values manually. Namely, it is impossible to spell "empty alternation" in the concrete syntax of a regex.
Commit a5ee3cc
syntax: tweak Debug impl for Hir
The default derive(Debug) impl for Hir is very noisy because it lists out the properties for every Hir value. We change the default to just print out the actual expressions and omit the properties. But one can opt back into seeing the properties via the "alternate" impl. i.e., {:#?} instead of {:?}.
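The "alternate" switch is available to any hand-written Debug impl through `Formatter::alternate`. The sketch below uses made-up `Node` and `Props` types to show the mechanism; it is not the actual impl for Hir.

```rust
use std::fmt;

/// Hypothetical per-node analysis data.
struct Props {
    min_len: usize,
}

/// Hypothetical HIR-like node: an expression plus its properties.
struct Node {
    expr: String,
    props: Props,
}

impl fmt::Debug for Node {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if f.alternate() {
            // `{:#?}`: include the properties.
            write!(f, "{} (min_len={})", self.expr, self.props.min_len)
        } else {
            // `{:?}`: just the expression.
            write!(f, "{}", self.expr)
        }
    }
}
```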
Commit 7e41247
syntax: flatten concatenations
This makes the Hir::concat constructor a bit smarter by combining adjacent literals and flattening child concatenations into the parent concatenation.
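A smart constructor along these lines might look like the following sketch. The toy `Hir` enum and the free function `concat` are assumptions made for the example and do not reproduce the real implementation.

```rust
/// A toy HIR with one non-literal leaf (`Dot`) for contrast.
#[derive(Clone, Debug, PartialEq)]
enum Hir {
    Empty,
    Literal(Vec<u8>),
    Dot,
    Concat(Vec<Hir>),
}

/// Builds a concatenation, splicing child concatenations into the parent
/// and merging literals that end up adjacent.
fn concat(subs: Vec<Hir>) -> Hir {
    let mut flat: Vec<Hir> = Vec::new();
    for sub in subs {
        // One layer deep suffices if children were built by `concat` too.
        let items = match sub {
            Hir::Concat(inner) => inner,
            other => vec![other],
        };
        for item in items {
            if let Hir::Literal(ref next) = item {
                if let Some(Hir::Literal(prev)) = flat.last_mut() {
                    prev.extend_from_slice(next);
                    continue;
                }
            }
            flat.push(item);
        }
    }
    match flat.len() {
        0 => Hir::Empty,
        1 => flat.pop().unwrap(),
        _ => Hir::Concat(flat),
    }
}
```

For example, concat(a, concat(b, .), c) becomes a three-element concatenation of the literal "ab", the dot, and the literal "c".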
Commit 73518e9
syntax: tweak Hir's debug impl again
Just always strip Properties. It's so annoying to see it when you really just want to see the syntax.
Commit d9922cc
This commit simplifies alternations by flattening them, similar to how a recent commit flattened concatenations. Although, this is simpler than concatenations, because we can't do anything with literals. Like concatenations, we only need to look one layer deep, since this is applied inductively.
Commit 232256e
syntax: simplify single char alternations
In short, simplify 'a|b|..|z' to '[a-z]'.
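The sketch below models this rewrite on a toy HIR. It also flattens nested alternations and returns a dedicated "fail" value for an empty alternation. The types and the `alternation` function are hypothetical and much simpler than the real constructor.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Hir {
    /// Can never match (an alternation with no branches).
    Fail,
    Char(char),
    /// Sorted, non-adjacent inclusive ranges.
    Class(Vec<(char, char)>),
    Alt(Vec<Hir>),
}

/// Collapses sorted, deduplicated chars into inclusive ranges.
fn to_ranges(mut cs: Vec<char>) -> Vec<(char, char)> {
    cs.sort();
    cs.dedup();
    let mut ranges: Vec<(char, char)> = Vec::new();
    for c in cs {
        if let Some(last) = ranges.last_mut() {
            if u32::from(last.1) + 1 == u32::from(c) {
                last.1 = c;
                continue;
            }
        }
        ranges.push((c, c));
    }
    ranges
}

fn alternation(subs: Vec<Hir>) -> Hir {
    // Flatten one layer of nested alternations.
    let mut flat = Vec::new();
    for sub in subs {
        match sub {
            Hir::Alt(inner) => flat.extend(inner),
            other => flat.push(other),
        }
    }
    if flat.is_empty() {
        return Hir::Fail;
    }
    if flat.len() == 1 {
        return flat.pop().unwrap();
    }
    // If every branch is a single char, emit one class instead.
    let chars: Option<Vec<char>> = flat
        .iter()
        .map(|h| match h {
            Hir::Char(c) => Some(*c),
            _ => None,
        })
        .collect();
    match chars {
        Some(cs) => Hir::Class(to_ranges(cs)),
        None => Hir::Alt(flat),
    }
}
```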
Commit 05f38ba
syntax: fix empty char class bug in HIR printer
When a character class is empty, the HIR printer would emit '[]', which is not a valid regex. (Since if a ']' immediately follows an opening '[', then the ']' is interpreted literally and not as a closing bracket.) Instead, we write '[a&&b]'. We could also do things like '(?u:\P{any})' or '(?-u:[^\x00-\xFF])', but '[a&&b]' doesn't require any flags and also seems really obvious: the intersection of two distinct characters is obviously empty.
Commit 25d103d
syntax: add some 'inline' annotations
Since these functions are tiny and not polymorphic, we should permit them to be inlined across crate boundaries.
Commit d92cb55
We need to know the length of the next codepoint we want to debug, otherwise it's possible for a naive 'slice[..4]' to fail if the end of the slice happens to split a codepoint.
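The problem and one possible fix can be sketched as follows. Both functions are illustrative and use hypothetical names; the fix shown (deriving the sequence length from the lead byte) is an assumption about the approach, not a copy of the patch.

```rust
/// Length of a UTF-8 sequence as implied by its lead byte, if valid.
fn utf8_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// Decodes the first codepoint by validating exactly as many bytes as
/// the lead byte calls for.
fn decode_first(bytes: &[u8]) -> Option<char> {
    let len = utf8_len(*bytes.first()?)?;
    let head = bytes.get(..len)?;
    std::str::from_utf8(head).ok()?.chars().next()
}

/// The naive version: validates a fixed window of up to four bytes,
/// which fails when the window ends in the middle of a later codepoint.
fn decode_first_naive(bytes: &[u8]) -> Option<char> {
    let head = &bytes[..bytes.len().min(4)];
    std::str::from_utf8(head).ok()?.chars().next()
}
```

On the input "ab☃" the four-byte window covers 'a', 'b' and only two of the three bytes of '☃', so the naive version fails even though the first codepoint is perfectly valid.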
Commit 561ed40
syntax: add new LookSet::contains_word convenience routine
And also add some inline annotations on non-generic but tiny functions.
Commit 781d264
syntax: rewrite literal extraction
After years of saying "literal extraction needs to be rewritten," I've finally gathered up the courage to do it. While this commit doesn't show it, this is actually now the third time I rewrote it. I rewrote it a second time about a week prior to this and got close to the finish line when I realized I had to throw it away.

In that approach, I tried to abandon the "mark each individual literal as exact" idea in the original literal extraction code and instead treat the entire set of literals as "exact" or not. (I also changed the terminology from "complete" to "exact," which I think is maybe a bit better. I also got rid of "cut" and instead use "inexact.") The main problem with not marking each individual literal as exact or not is that it potentially inhibits longer literal extraction. For example, in the regex 'ab*cd', with individual literals marked as exact, we can extract the sequence [inexact(ab), exact(acd)]. But with the entire set being all exact or all inexact, there's no real way to let extraction continue through the empty string produced by the '*' repetition operator. There were some other problems with my second rewrite around short-circuiting concats/alternations when sequences got too big, but I think I could have resolved them.

In the end, the third rewrite is quite good. It actually roughly corresponds to the original code, but is cleaned up and much more principled. The original code didn't do these things for example:

1. Didn't care about order and thus didn't correctly produce literals in a sequence for which leftmost-first match semantics were preserved.
2. Didn't differentiate between "empty set" and "infinite set." These are two pretty subtle cases and them not being distinct in the code was really quite messy.
3. The old code tried to carry a literal set throughout extraction and this has the effect of forcing every part of extraction to care about concatenation. But now we just force a stronger separation of responsibility. We might wind up with a few more allocs, but the in-practice small set size limits and short circuiting means that it usually doesn't matter relative to the other costs of parsing, translating and compiling regexes.

I ported over pretty much all of the older tests and added more of my own. Overall, I feel much more confident about this new literal extraction than I do the old.

We do also insert some heuristics for trimming literal sets in src/exec.rs that didn't exist before. This is because the new extraction code tends to respect the limits a bit more faithfully and sometimes returned bigger sets than the old code. This is bad because more literals means prefilters are probably less effective. So we write a little bit of code to mitigate this.

We also do let a few cases get slower for the time being. The suffix handling is not quite ideal, so many of the easy/medium/hard benchmarks are now a little slower. The name_alt3_nocase benchmark is also slower because the new extraction code notices that the literals blow the limits and only returns an infinite sequence. The old extraction code had (some in practice and unprincipled) techniques for shrinking its set as it went, and this caused literals to get extracted for it. We can fix this, but it will take a little more effort that I don't want to spend right now.

In any case, the hope is to smooth out any issues as we head towards bringing regex-automata in.
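The 'ab*cd' example can be worked through with a small model of per-literal exactness. Everything below is a sketch under assumptions: the `Lit` type, the `cross` function, and the choice to model 'b*' as [inexact(b), exact("")] are illustrative and not the actual extraction code.

```rust
/// A literal plus a flag saying whether it can still be extended.
#[derive(Clone, Debug, PartialEq)]
struct Lit {
    bytes: Vec<u8>,
    exact: bool,
}

fn exact(s: &str) -> Lit {
    Lit { bytes: s.as_bytes().to_vec(), exact: true }
}

fn inexact(s: &str) -> Lit {
    Lit { bytes: s.as_bytes().to_vec(), exact: false }
}

/// Concatenates two literal sequences. Inexact literals on the left are
/// kept as-is; exact ones are extended by every literal on the right and
/// inherit that literal's exactness.
fn cross(a: &[Lit], b: &[Lit]) -> Vec<Lit> {
    let mut out = Vec::new();
    for x in a {
        if !x.exact {
            out.push(x.clone());
            continue;
        }
        for y in b {
            let mut bytes = x.bytes.clone();
            bytes.extend_from_slice(&y.bytes);
            out.push(Lit { bytes, exact: y.exact });
        }
    }
    out
}
```

Crossing [exact(a)] with [inexact(b), exact("")] gives [inexact(ab), exact(a)], and crossing that with [exact(cd)] gives [inexact(ab), exact(acd)]. The exact empty string is what lets extraction continue past the '*', which a single set-wide flag could not express.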
Commit c15240b
syntax: add --lib to syntax tests
I couldn't figure out how to easily make doc tests run with 'no_std' enabled, which regex-syntax now does. The '?' in particular was tripping me up. We still get doctest coverage from the top-level 'cargo test'.
Commit 724ae3e
syntax: rewrite 'cls1|..|clsN' as '[cls1..clsN]'
Whenever we have an alternation where each of its branches is just a class, we can always combine them into a single class. Single classes are generally going to be cheaper to process further down the pipeline. Namely, instead of needing to branch between them at a higher level in an NFA graph, they can be handled as one single unit.
Commit 60b9a6c
syntax: remove 'deny(warnings)'
This is generally overall pretty annoying.
Commit 6d254aa
syntax: add Properties::{union,captures_len}
This factors out the constructor for the properties of an alternation into a public API method called "union." This is useful for collapsing the properties of multiple regexes down into one analyzable unit. The 'captures_len' method is also useful for making decisions like "if this regex has no captures and is all literals, then we don't ever need to use a regex engine under any circumstance."
Commit 01a89b6
syntax: add more convenience routines to LookSet
This makes it a little terser to check different types of word boundaries in the lookset.
Commit e995b73
syntax: move some things around
This gets rid of the AsRef<[u8]> FromIterator impl for Seq, which is unfortunate, but it lets us provide an AsRef<[u8]> impl for Literal. The latter ends up being quite useful to avoid copying and/or extra allocs.
Commit 72187cb
syntax: add 'optimize' routines to 'hir::literal::Seq'
Their docs explain their utility. In the old literal extraction setup, some (but not all) of this "optimization" was somewhat baked into the extraction itself, but now we codify it a bit more explicitly.
Commit 7a75222
syntax: rename 'allow_invalid_utf8' to 'utf8'
This also inverts its meaning, i.e., utf8=!allow_invalid_utf8. This naming is consistent with the naming used in regex-automata. In general, I find that using names without negations in them to be clearer, since it avoids double negations.
Commit c5754ef
syntax: trim literal sequence if necessary
On some occasions, it can make sense to trim the current literal sequences before doing a 'union' IF doing that union would cause the sequences to become infinite because of a blown limit. If we can keep literal extraction going by trimming things down, that's usually beneficial. For now, we just kind of guess that '3' is a good sweet spot for this.
Commit a0454c2
syntax: factor out common prefixes of alternations
It is generally quite subtle to reason clearly about how this actually helps things in a finite automata based regex engine, but this sort of factoring can lead to lots of improvements:
* We do use a bounded backtracker, so "pushing branches" down will help things there, just like it would with a classical backtracker.
* It may lead to better literal extraction due to the simpler regex. Whether prefix factoring is really to blame here is somewhat unclear, but some downstream optimizations are more brittle than others. For example, the "reverse inner" optimization requires examining a "top level" concatenation to find literals to search for. By factoring out a common prefix, we potentially expand the number of regexes that have a top-level concat. For example, `\wfoo|\wbar` has no top-level concat but `\w(?:foo|bar)` does.
* It should lead to faster matching even in finite automata oriented engines like the PikeVM, and also faster construction of DFAs (lazy or not). Namely, by pushing the branches down, we make it so they are visited less frequently, and thus the constant state shuffling caused by branches is reduced.
The prefix extraction could be better, as mentioned in the comments, but this is a good start.
Commit 557f0ea
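To make the prefix factoring described in this commit concrete, here is a minimal sketch. It is not the code from regex-syntax: it assumes each alternative is modeled as a plain list of sub-expression strings, and the names `factor_common_prefix` and `render` are hypothetical helpers invented for illustration.

```rust
/// Splits alternatives into (shared prefix, per-branch remainders).
/// Each alternative is a concatenation, modeled here as a list of
/// sub-expression strings (an assumption made for this sketch).
fn factor_common_prefix(alts: &[Vec<&str>]) -> (Vec<String>, Vec<Vec<String>>) {
    let first = match alts.first() {
        Some(first) => first,
        None => return (Vec::new(), Vec::new()),
    };
    // Length of the longest prefix shared by every alternative.
    let mut len = first.len();
    for alt in &alts[1..] {
        let common = first
            .iter()
            .zip(alt.iter())
            .take_while(|(a, b)| a == b)
            .count();
        len = len.min(common);
    }
    let prefix = first[..len].iter().map(|s| s.to_string()).collect();
    let suffixes = alts
        .iter()
        .map(|alt| alt[len..].iter().map(|s| s.to_string()).collect())
        .collect();
    (prefix, suffixes)
}

/// Renders the factored form as a pattern string, for display only.
fn render(alts: &[Vec<&str>]) -> String {
    let (prefix, suffixes) = factor_common_prefix(alts);
    let branches: Vec<String> = suffixes.iter().map(|s| s.concat()).collect();
    if prefix.is_empty() {
        // Nothing to factor out: keep the plain alternation.
        return branches.join("|");
    }
    format!("{}(?:{})", prefix.concat(), branches.join("|"))
}
```

Calling `render` on the two alternatives `[r"\w", "foo"]` and `[r"\w", "bar"]` produces a pattern with a top-level concatenation, which is the shape the "reverse inner" optimization looks for.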
syntax: support '(?<name>...)' syntax for named groups
It turns out that both '(?P<name>...)' and '(?<name>...)' are rather common among regex engines. There are several that support just one or the other. Until this commit, the regex crate only supported the former, along with RE2, RE2/J and Go's regexp package. There are also several regex engines that only support the latter, such as Onigmo, Oniguruma, Java, Ruby, Boost, .NET and JavaScript. To decrease friction, and because there is somewhat little cost to doing so, we elect to support both. It looks like perhaps RE2 and Go's regexp package will go the same route, but it isn't fully decided yet: golang/go#58458 Closes #955, Closes #956
Commit 541aa42
dfa: fix approximate cache size
Unbelievably, this was using the size of the compiled prog *and* the heap memory used by the cache to compute the total memory used by the cache. The effect of this is that the reported size might be much bigger than what is actually used by the cache. This in turn would result in the lazy DFA thrashing the cache and going quite slow.
Commit 19b29cf
impl: switch to aho-corasick 1.0
This is a transitory commit that will need to be updated once aho-corasick 1.0 is actually released. Its purpose is to make it so the regex crate, the "old" regex crate and regex-automata all agree on the same version of aho-corasick to use while in development.
Commit ca03c73
syntax: rename 'Group' to 'Capture'
Now that it *only* represents a capturing group, it makes sense to give it a more specific name.
Commit 99d8436
Where 'sub' is short for 'sub-expression.'
Commit 6cb02d9
syntax: add support for CRLF-aware line anchors
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
Commit 0114235
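The intended CRLF semantics for `^` and `$` can be sketched as two byte-level predicates. This is an illustration written from the description in the commit message, not the matching code itself; the function names `is_start_crlf` and `is_end_crlf` are hypothetical, and both assume `at <= haystack.len()`.

```rust
/// Reports whether a CRLF-aware `^` (multi-line mode) matches at `at`.
/// Assumes `at <= haystack.len()`.
fn is_start_crlf(haystack: &[u8], at: usize) -> bool {
    at == 0
        // Directly after \n is always a line start.
        || haystack[at - 1] == b'\n'
        // After \r is a line start, unless we'd be splitting \r\n.
        || (haystack[at - 1] == b'\r'
            && (at == haystack.len() || haystack[at] != b'\n'))
}

/// Reports whether a CRLF-aware `$` (multi-line mode) matches at `at`.
/// Assumes `at <= haystack.len()`.
fn is_end_crlf(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
        // Directly before \r is always a line end.
        || haystack[at] == b'\r'
        // Before \n is a line end, unless we'd be splitting \r\n.
        || (haystack[at] == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}
```

For the haystack `a\r\nb`, neither predicate holds at offset 2, the position between `\r` and `\n`, which is the key property described above.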
syntax: polish and doc updates
This updates docs in a number of places, including adding examples. We also make it so zero-width matches never impact the 'utf8' property. In practice, this means '(?-u:\B)' is now considered to match valid UTF-8, which is consistent with the fact that 'a*' is considered to match valid UTF-8 too. We also do a refresh of the 'Look' and 'LookSet' APIs.
Commit e4006af
syntax: permit most no-op escape sequences
This resolves a long-standing (but somewhat minor) complaint that folks have with the regex crate: it does not permit escaping punctuation characters in cases where those characters do not need to be escaped. So things like \/, \" and \! would result in parse errors. Most other regex engines permit these, even in cases where they aren't needed. I had been against doing this for future evolution purposes, but it's incredibly unlikely that we're ever going to add a new meta character to the syntax. I literally cannot think of any conceivable future in which that might happen. However, we do continue to ban escapes for [0-9A-Za-z<>], because it is conceivable that we might add new escape sequences for those characters. (And 0-9 are already banned by virtue of them looking too much like backreferences, which aren't supported.) For example, we could add \Q...\E literal syntax. Or \< and \> as start and end word boundaries, as found in POSIX regex engines. Fixes #501
Commit fbdc4a9
syntax: allow Unicode in capture names
This changes the rules for capture names to be much less restrictive. Namely, the requirements are now:
1. Must begin with an `_` or any alphabetic codepoint.
2. After the first codepoint, the name may contain any sequence of alpha-numeric codepoints along with `_`, `.`, `[` and `]`.
Closes #595
Commit 0732763
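The two rules for capture names could be checked roughly as follows. This is a sketch, not the parser's actual code: `is_valid_capture_name` is a hypothetical helper, and it assumes the standard library's `char::is_alphabetic` and `char::is_alphanumeric` are close enough to the Unicode definitions the parser uses.

```rust
/// Checks a capture group name against the two rules above.
/// Assumption: std's Unicode predicates approximate the parser's tables.
fn is_valid_capture_name(name: &str) -> bool {
    let mut chars = name.chars();
    // Rule 1: the first codepoint is `_` or alphabetic.
    match chars.next() {
        Some(c) if c == '_' || c.is_alphabetic() => {}
        _ => return false,
    }
    // Rule 2: the rest is alphanumeric or one of `_`, `.`, `[`, `]`.
    chars.all(|c| c.is_alphanumeric() || matches!(c, '_' | '.' | '[' | ']'))
}
```

Under these rules a name such as `a.b[0]` or `año` is accepted, while `1abc`, `a-b` and the empty name are rejected.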
api: add new 'Regex::static_captures_len' method
This adds a new routine for computing the static number of capture groups that will appear in every match. If the number of groups is not invariant across all matches, then there is no static capture length. This is meant to help implement higher level convenience APIs for extracting capture groups, such as the one described in #824. We may wind up including such APIs in the regex crate itself, but this commit stops short of that. Instead, we just add this new property which should permit those APIs to exist outside of this crate for now. Closes #908
Commit 8a0bf38
syntax: rename 'captures_len' to 'explicit_captures_len'
And do the same for 'static_captures_len'. The motivation for this is that the top-level Regex API had equivalently named methods 'captures_len' and 'static_captures_len', except those included the implicit group and were therefore always 1 more than the same APIs on Hir. We distinguish them by renaming the routines on HIR.
Commit cf34553
It turns out that it's not too hard to get HIR translation to run pretty slowly with some carefully crafted regexes. For example: (?i:[[:^space:]------------------------------------------------------------------------]) This regex is actually a [:^space:] class that has an empty class subtracted from it 36 times. For each subtraction, the resulting class--despite it not having changed---goes through Unicode case folding again. This in turn slows things way down. We introduce a fairly basic optimization that basically keeps track of whether an interval set has been folded or not. The idea was taken from PR #893, but was tweaked slightly. The magic of how it works is that if two interval sets have already been folded, then they retain that property after any of the set operations: negation, union, difference, intersection and symmetric difference. So case folding should generally only need to be run once for each "base" class, but then not again as operations are performed. Some benchmarks were added to rebar (which isn't public yet at time of writing). Closes #893
Commit 7212a03
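The "already folded" bookkeeping can be illustrated with a deliberately simplified model. The sketch below is an assumption-heavy stand-in: it uses a hypothetical `AsciiSet` bitset with ASCII-only case folding instead of Unicode interval sets, and shows only the difference operation. The point is just that a set operation on two folded sets yields a folded set, so a later fold request can return immediately.

```rust
/// A toy ASCII-only set that remembers whether it is closed under
/// case folding. Hypothetical type, not the one in regex-syntax.
#[derive(Clone, Copy, Debug, PartialEq)]
struct AsciiSet {
    bits: u128,
    folded: bool,
}

impl AsciiSet {
    /// The empty set is trivially closed under case folding.
    fn empty() -> AsciiSet {
        AsciiSet { bits: 0, folded: true }
    }

    fn from_bytes(bytes: &[u8]) -> AsciiSet {
        let mut bits = 0u128;
        for &b in bytes {
            assert!(b.is_ascii());
            bits |= 1u128 << b;
        }
        AsciiSet { bits, folded: bits == 0 }
    }

    fn contains(&self, b: u8) -> bool {
        b.is_ascii() && self.bits & (1u128 << b) != 0
    }

    /// Adds the other-case variant of every letter in the set.
    /// Returns true only if a folding pass actually ran.
    fn case_fold(&mut self) -> bool {
        if self.folded {
            return false; // nothing to do: the expensive pass is skipped
        }
        for b in b'a'..=b'z' {
            let pair = (1u128 << b) | (1u128 << b.to_ascii_uppercase());
            if self.bits & pair != 0 {
                self.bits |= pair;
            }
        }
        self.folded = true;
        true
    }

    /// Set difference. If both operands are folded, so is the result.
    fn difference(&mut self, other: &AsciiSet) {
        self.bits &= !other.bits;
        self.folded = self.folded && other.folded;
    }
}
```

After one call to `case_fold`, repeatedly subtracting `AsciiSet::empty()` keeps the flag set, so each subsequent `case_fold` call is a no-op, mirroring the repeated empty subtractions in the example regex.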
syntax: drop some Result type aliases
I'm overall coming around to the opinion that these tend to make the code harder to read. So I've been steadily dropping the Result aliases.
Commit 0dd5853
syntax: refactor and optimize case folding
This rewrites how Unicode simple case folding worked. Instead of just defining a single function and expecting callers to deal with the fallout, we now define a stateful type that "knows" about the structure of the case folding table. For example, it now knows enough to avoid binary search lookups in most cases. All we really have to do is require that callers look up codepoints in sequence, which is perfectly fine for our use case. Ref #893
Commit 627c997
syntax: improve Debug impl for Class
Previously, classes would show up in the debug representation as very deeply nested things, making them more difficult to read than they need to be. This removes at least a few pretty redundant layers and uses a more compact range notation.
Commit 3c49615
bug: fix CaptureLocations::get to handle invalid offsets
The contract of this function says that any invalid group offset should result in a return value of None. In general, it worked fine, unless the offset was so big that some internal multiplication overflowed. That could in turn produce an incorrect result or a panic. So we fix that here with checked arithmetic. Fixes #738, Fixes #950
Commit b8ab381
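The overflow hazard and its checked-arithmetic fix can be sketched as below. The `Locations` type and its slot layout (two slots per group, start then end) are assumptions made for illustration and are not the crate's internal representation.

```rust
/// Hypothetical stand-in for capture locations: group `i` is assumed
/// to occupy slots `2*i` (start) and `2*i + 1` (end).
struct Locations {
    slots: Vec<Option<usize>>,
}

impl Locations {
    /// Returns the span of group `i`, or None for any invalid index,
    /// including indices so large that `2*i + 1` would overflow.
    fn get(&self, i: usize) -> Option<(usize, usize)> {
        // checked_* returns None on overflow instead of wrapping
        // (release builds) or panicking (debug builds).
        let start_slot = i.checked_mul(2)?;
        let end_slot = start_slot.checked_add(1)?;
        let start = (*self.slots.get(start_slot)?)?;
        let end = (*self.slots.get(end_slot)?)?;
        Some((start, end))
    }
}
```

With this shape, `get(usize::MAX)` falls out as `None` on the first multiplication rather than indexing with a wrapped value.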
doc: add wording about Unicode scalar values
This makes it clearer that the regex engine works by *logically* treating a haystack as a sequence of codepoints. Or more specifically, Unicode scalar values. Fixes #854
Commit e65ba17
doc: add more explanation to 'CompiledTooBig' error
The existing docs were pretty paltry, and it turns out we can be a bit more helpful for folks when they hit this error. Fixes #846
Commit d04ea10
api: add Match::{is_empty, len}
Adding these methods has almost no cost and they can be convenient to have in some cases. Closes #810
Commit 07c453d
doc: tweak docs for 'shortest_match'
The name is somewhat unfortunate, but it's actually kind of difficult to capture the right semantics in the name. The key bit is that the function returns the offset at the point at which a match is known, and that point might vary depending on which internal regex engine was used. Fixes #747
Commit be2afa1
This clarifies that `x` is "verbose mode," and that whitespace becomes insignificant everywhere, including in character classes. We also add guidance for how to insert a space: either escape it or use a hex literal. Fixes #660
Commit 061dd68
doc: clarify meaning of SetMatches::len
It is really unfortunate, but SetMatches::len and SetMatches::iter().count() do not correspond to the same thing. It's not clear why I even added the SetMatches::len method in the first place, but it always returns the number of regexes in the set, and not the number of regexes that matched. We can't change the name (or remove the method) obviously, but we do add a warning to the docs. Fixes #625
Commit 67a60cf
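The distinction between the two counts is easy to see with a small mock. `MockSetMatches` below is a hypothetical stand-in written only to mirror the behavior described in this commit message (one flag per regex in the set); it is not the regex crate's type.

```rust
/// Hypothetical stand-in for a set-match result: one flag per regex
/// in the set, true if that regex matched.
struct MockSetMatches {
    matched: Vec<bool>,
}

impl MockSetMatches {
    /// Number of regexes in the set, NOT the number that matched.
    fn len(&self) -> usize {
        self.matched.len()
    }

    /// Indices of the regexes that matched.
    fn iter(&self) -> impl Iterator<Item = usize> + '_ {
        self.matched
            .iter()
            .enumerate()
            .filter(|(_, m)| **m)
            .map(|(i, _)| i)
    }
}
```

For a set of three regexes where only the first and third match, `len()` is 3 while `iter().count()` is 2, so code that wants the number of matching regexes should count the iterator.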
doc: add example that uses an alternation
And we make it an interesting example, i.e., one that demonstrates preference order semantics. Closes #610
Commit f3de42b
This isn't *strictly* needed because of the existence of Regex::captures_read_at, but it does fill out the singular missing method. Namely, all other search routines have an *_at variant, so we might as well add it for Regex::captures too. Closes #547
Commit cdf6325
api: improve Debug impl for Match
This makes it so the Debug impl for Match only shows the actual matched text. Otherwise, the Match shows the entire haystack, which is likely to be misleading. Fixes #514
Commit 3988431
syntax: add 'Repetition::with'
This is useful when doing structural recursion on a '&Hir' to produce a new 'Hir' derived from it.
Commit 3f4bfa6
syntax: add 'Properties::memory_usage'
Since it uses heap memory and because it's something you typically hang on to in a regex engine, we expose a routine for computing heap memory. We might consider doing this for other types in regex-syntax, but there hasn't been a strong need for it yet.
Commit e166658
doc: tweak presentation of \pN syntax
The wording appears to be a little unclear, so we switch it around a bit. Fixes #975
Commit a34c1a7
changelog: add entry for regex 1.8
This will need to be updated again to add a date (maybe today?), but this should cover everything from the commit log.
Commit 82b0f0d