Add a onepass dfa matcher. #467

ethanpailes · 2018-04-23T00:21:16Z

This PR adds the onepass DFA that I've been working on. I have marked it as a WIP because there is still one failing test case that I need to dig my teeth into.

Currently the only way to active the onepass matcher is to call .onepass() on ExecBuilder. I was hoping to get this merged in without having it enabled by default at first, then I can do a seperate PR which adds more testing/benchmarking and actually enables it. That approach also provides more time to discuss how exactly the onepass matcher will get used by the execution planner. My guess is that the onepass matcher should always be used when it can be, but it may turn out that shoveling capture groups around is pricier than the DFA startup cost.

We talked about test coverage over in #68, so I did some simple checks and found that about 300/700 of the tests in the test suite trigger the onepass matcher.

Closes #68

BurntSushi · 2018-04-23T13:55:50Z

🎉 🎉 🎉 🎉 🎉

I can't wait to find time to dig into this. It still might be a bit though.

We talked about test coverage over in #68, so I did some simple checks and found that about 300/700 of the tests in the test suite trigger the onepass matcher.

Wow. That's awesome news. One thing that I think is important here is to create an additional test suite that disables the one-pass matcher so that those 300 tests hit the other engines as well.

(How the various engines are tested desperately needs to be improved, but that's well outside the scope here I think. It's definitely slated for a rethink as I refactor regex internals.)

ethanpailes · 2018-04-23T16:06:24Z

I'm probably going to get bit for posting before CI finishes, but that last commit should turn all the tests green. The issue turned out to be a silly config bug.

ethanpailes · 2018-04-30T14:45:08Z

Another possible optimization: don't record last_match until we leave a match state. Search for "another inner loop!" in the src/dfa.rs source code.

ethanpailes · 2018-06-21T14:08:18Z

Given that there have been a number of new features and a new minimum Rust version since I first opened this PR, I've rebased onto the current master. I've also squashed the various polishing patches that I added, so there are now only two patches in the PR. This is still a pretty big diff, so I understand that it will probably take time to digest (if it would help reduce the work at all I would be happy to talk through the design online or in person).

bors · 2018-06-21T21:26:22Z

☔ The latest upstream changes (presumably 77140e7) made this pull request unmergeable. Please resolve the merge conflicts.

bors · 2018-07-18T13:07:59Z

☔ The latest upstream changes (presumably 2918aab) made this pull request unmergeable. Please resolve the merge conflicts.

This patch adds the analysis function, `is_onepass` found in `analysis.rs`, which is required in order to determine if a particular regex can be executed using the onepass DFA. A regex is said to be onepass iff there are no non-deterministic splits in it. An example of a non-determinism in a regex is `/alex|apple/`. Here we can't know which branch to take because both of them start with `a`. A more subtle example is `/(?:alex)*apple/`. After every iteration of the Kleene star, we might branch back to `alex` or continue on to `apple`.

This patch adds a onepass matcher, which is a DFA that has all the abilities of an NFA! There are lots of expressions that a onepass matcher can't handle, namely those cases where a regex contains non-determinism. The general approach we take is as follows: 1. Check if a regex is onepass using `src/onepass.rs::is_onepass`. 2. Compile a new regex program using the compiler with the bytes flag set. 3. Compile a onepass DFA from the program produced in step 2. We will roughly map each instruction to a state in the DFA, though instructions like `split` don't get states. a. Make a new transition table for the first instruction. b. For each child of the first instruction: - If it is a bytes instruction, add a transition to the table for every byte class in the instruction. - If it is an instruction which consumes zero input (like `EmptyLook` or `Save`), emit a job to a DAG asking to forward the first instruction state to the state for the non-consuming instruction. - Push the child instruction to a queue of instructions to process. c. Peel off an instruction from the queue and go back to step a, processing the instruction as if it was the first instruction. If the queue is empty, continue with step d. d. Topologically sort the forwarding jobs, and shuffle the transitions from the forwarding targets to the forwarding sources in topological order. e. Bake the intermediary transition tables down into a single flat vector. States which require some action (`EmptyLook` and `Save`) get an extra entry in the baked transition table that contains metadata instructing them on how to perform their actions. 4. Wait for the user to give us some input. 5. Execute the DFA: - The inner loop is basically: while at < text.len(): state_ptr = baked_table[text[at]] at += 1 - There is a lot of window dressing to handle special states. The idea of a onepass matcher comes from Russ Cox and his RE2 library. I haven't been as good about reading the RE2 source as I should have, but I've gotten the impression that the RE2 onepass matcher is more in the spirit of an NFA simulation without threads than a DFA.

ethanpailes · 2018-09-15T12:23:29Z

I resolved the merge conflicts.

BurntSushi · 2018-09-15T12:26:08Z

@ethanpailes Thanks! I haven't forgotten about this. I'm working up to it. Soon. :)

ethanpailes · 2018-09-15T12:37:31Z

Thanks!

RReverser · 2018-09-15T13:58:47Z

regex-syntax/src/hir/interval.rs

@@ -309,6 +309,11 @@ impl<I: Interval> IntervalSet<I> {
        }
        true
    }
+
+    /// Returns true iff this class is empty.


Typo: if.

I don't think it's a typo. iff is parlance for "if and only if."

RReverser · 2018-09-15T13:58:55Z

regex-syntax/src/hir/mod.rs

@@ -797,6 +797,16 @@ impl ClassUnicode {
    pub fn symmetric_difference(&mut self, other: &ClassUnicode) {
        self.set.symmetric_difference(&other.set);
    }
+
+    /// Returns true iff this character class contains no characters.


RReverser · 2018-09-15T13:59:05Z

regex-syntax/src/hir/mod.rs

@@ -998,6 +1008,16 @@ impl ClassBytes {
    pub fn is_all_ascii(&self) -> bool {
        self.set.intervals().last().map_or(true, |r| r.end <= 0x7F)
    }
+
+    /// Returns true iff this character class contains no characters.


Same (probably copy-paste).

BurntSushi

I'm determined to review this PR. In particular, I want to get this merged before I start overhauling internals. :-)

I haven't review the whole thing yet. It will take me a bit. But I'm going to try and do this incrementally because my previous strategy of finding the time and will to review the whole thing has failed. Sorry! :(

BurntSushi · 2018-09-20T17:11:00Z

src/analysis.rs

+                self.next()
+            }
+            _ => Some(&es[idx]),
+        }


@ethanpailes I believe this routine uses a stack size proportional to the depth of the HIR. It seems like it should be possible to flatten this with a loop.

Nice catch. Thanks! I'll fix this with a loop as you suggest.

Note to future self: The degenerate case is when a bunch of concats are nested, with the child concat always coming first.

BurntSushi · 2018-09-20T17:12:56Z

src/analysis.rs

+    // there is byte-level nondeterminism in the \d character
+    // class, and we care about things in the byte space rather
+    // than the character space. If you do a onepass engine at
+    // the character level, Cox's example is indeed onepass.


Could we add a test for said regex?

IIRC, RE2 uses a byte level representation as well. I wonder if the real reason here is that \d is Unicode aware by default here, but is only ASCII aware in RE2. That is, would (?-u)(\d+)-(\d+) be one pass?

I think overall I would expect to see quite a bit more tests that check the top level is_onepass routine here. :-)

My impression from skimming the RE2 & go onepass code was that it was basically a variant of the PikeVM with some extra operations, and as such operates at a character level. The work on this branch uses a different algorithm because I didn't see a good reason to pay the cost of NFA simulation in the face of an unambiguous regex. So even if \d was Unicode aware in RE2 \d+- would still be onepass in RE2, but not in this branch.

I will go back and try to enumerate a much more cases, this test module is a bit on the sparse side.

(I might be off the mark about the RE2 onepass matcher. To be honest I didn't read the RE2 & go onepass code as closely as I probably should have before I dove into this implementation :P)

Yeah, the RE2 onepass matcher is definitely byte oriented because RE2 doesn't have any character level byte codes. Even in this crate, the pike VM can match Unicode character classes without character level byte codes. For RE2, see: https://github.com/google/re2/blob/e6acac839be126c8a20a20183f20a23fa9f4895e/re2/onepass.cc#L459-L524

The Go onepass matcher---along with all of the matchers in Go---are however character oriented. There is no byte oriented matching in Go's regexp engine IIRC.

ethanpailes · 2018-09-20T22:43:55Z

The work on this branch started off with me under the impression that I was just engaged in porting the onepass matcher from RE2 to this crate, but as I learned more along the way the algorithm I settled on drifted further away from the one used there. In order to avoid confusion it might be worth giving it a new name, such as "unambigious_dfa". "onepass" sounds cooler, but it is probably worth being descriptive.

BurntSushi · 2018-09-20T22:48:07Z

@ethanpailes SGTM. How much work would it be to write up a high level description of how the detection works in a comment in the code? That would help me a lot.

I don't think it needs to be done in this PR, but one thing worth thinking about is whether the algorithm can be built incrementally as HIR are constructed. This would avoid the need for having a second pass over the HIR. You can see how I'm doing this with a variety of other attributes by looking at the various HIR constructors. e.g.,

regex/regex-syntax/src/hir/mod.rs

Line 361 in cd6f1e2

pub fn concat(mut exprs: Vec<Hir>) -> Hir {

--- This is something I'd be happy to adapt the current code to, so I'm in no way asking for you to rewrite it here, but getting your thoughts on the feasibility of the idea would be very useful!

ethanpailes · 2018-09-21T02:22:11Z

I decided to do a bit more due diligence and actually read the RE2 code more closely. It turns out the algorithm is actually pretty much the same. I think I was just thrown off the first time by the references to a "onepass nfa" in the code, even though the RE2's onepass matcher is compiling and executing a dfa just like this branch does.

As for determining is_onepass all in one compiler pass when going from Ast -> Hir, it might be possible, but we might also be able to just determine onepassness by examining the bytecode. This seems to be what RE2 does. Are you just worried about the performance/memory implications of letting the crate sprout a middle end? A separate pass seems more maintainable to me.

RReverser · 2018-09-21T09:38:08Z

Not sure how feasible this would be, but could one pass matcher be also used for regexps which normally are not eligible due to parallel threads with capture groups, but are called with .is_match, .find or similar method which ignores capture groups anyway? I imagine this would increase memory consumption, but might be worth it in terms of speed?

BurntSushi · 2018-09-21T10:54:20Z

As for determining is_onepass all in one compiler pass when going from Ast -> Hir

Hmm, I think you might have misunderstood me. My suggestion isn't really about the translation from AST to HIR, but rather, the construction of any HIR. There is no "pass" here; the idea is that the property is defined inductively on the structure of the HIR. That is, given any two HIRs x and y and a known value for whether each is onepass or not, can you compute the following in time proportional to the size of its non-recursive sub-expressions?

Concat(x, y) => onepass(x) && onepass(y) && no_intersection(tail(x), head(y))
Alternation(x, y) => ...
Repeat(x) => ...
...

and so on. My guess is that there might be additional state required but that it otherwise seems plausible.

Are you just worried about the performance/memory implications of letting the crate sprout a middle end? A separate pass seems more maintainable to me.

Computing properties upon construction of the HIR is what feels like the most ideal solution is, with respect to performance, memory and maintenance. :-) It is explicitly inductive and impossible to forget, because it's baked directly into the construction of the HIR.

I am totally fine starting with a separate pass. I don't think having a separate pass is bad. I can give more thoughts on the feasibility of the explicitly inductive approach once I understand the detection algorithm better.

but we might also be able to just determine onepassness by examining the bytecode

Yeah, that is plausible. For now, I like the idea of using the HIR. I don't have a good feel for what role the bytecode will play in analysis passes. In the past, I've found the bytecode more difficult to do analysis on, but it's been a while. My intent is to rethink the bytecode completely.

…f the input.

BurntSushi

@ethanpailes All righty, I think I've finally made it through the analysis portion of this PR. Unfortunately, I think there are significant issues that need to be addressed there in terms of the runtime complexity of the algorithm. Hopefully I didn't misunderstand!

I will work on the DFA portion soon.

And thanks so much for the doc comment that you wrote. That really helped speed up my review.

BurntSushi · 2018-09-21T17:29:49Z

src/analysis.rs

+/// sets. The first set of an expression is the set of
+/// bytes which might begin a word in the language of that
+/// expression. If the expression can accept the empty string,
+/// the first set takes note of that as well.


Ah! Like Glushkov's construction algorithm. The "first" set is computed in addition to the "last" set (and the set of factors of length 2).

BurntSushi · 2018-09-21T17:34:19Z

src/analysis.rs

+/// We only ever care about the first byte of a particular character,
+/// because the onepass DFA is implemented in the byte space, not the
+/// character space. This means, for example, that a branch between
+/// lowercase delta and uppercase delta is actually non-deterministic.


So I think this means that any regex that uses \d, \w and (probably) \s is also not one pass either, right? But I think all of the ASCII variants are onepass?

(This question is more to just check my understanding here.)

BurntSushi · 2018-09-21T17:47:24Z

src/analysis.rs

+        &HirKind::Alternation(ref es) => {
+            let mut fset = FirstSet::empty();
+            for e in es {
+                fset.union(&fset_of(e));


Yeah... Hmm... So this is also going to use stack size proportional to the depth of the HIR. Basically, fset_of here is implemented using normal structural recursion. Fixing this is pretty tricky, so I recommend that we mush on at this point. In particular, the literal analysis also suffers from this same problem, but it is a TODO item to fix.

The only way to fix this is to wire first set construction into the visitor itself (which is a pain), or to take the approach of building the onepass analysis (and "first" sets) as the HIR is itself constructed.

With all that said, popping up a level, I think the actual algorithm you've implemented here is quadratic in the depth of the HIR. Namely, visit_pre above is called on every single HIR expression, and for every HIR expression with sub-expressions, you in turn call fset_of, which itself does structural induction on the HIR. So I think you wind up repeating the entire analysis quite a bit.

I am OK with using structural induction for now and just documenting the stack growth as a temporary downside, but I do think we need to get rid of the quadratic behavior. For now, I think it would probably be best to abandon the use of the HIR visitor. At least, that feels like the simplest path to me.

Yeah, this is pretty unforgivable on my part. The solution is definitely going to have to include caching results somewhere, so I've started work rewriting this module as an extra thing that gets constructed while building Hirs like you've suggested elsewhere. I should have it fixed soon.

@ethanpailes Awesome! I just want to re-iterate that I am OK with doing structural recursion here for the time being. Literal anaylsis already does this, so it's not introducing a new regression.

The problem that could potentially arise with doing this at HIR construction time is that I think you'll need to keep a copy of fsets for every HIR expression, and that could prove to be quite precarious. This isn't something I realized at the outset when I suggested it. So it's not clear to me that it is definitely the path we want to take.

Sorry for the back and forth on this, but I'm learning as I go. :)

I was kinda worried about the additional memory usage, but the fact that FirstSets are represented with a ClassBytes instead of a [u8; 256] does help a bit. What's the use case to worry about for an expanded Hir memory footprint? I would think that there would be more memory used during execution of the regex (at least in the worst case), and I don't think adding a ClassBytes to each Hir node will take up more memory than the 3 Programs and a OnePass that make up an ExecReadOnly. Does the Hir make it into a Regex object somewhere that I havn't noticed?

If I was trigger happy in doing the re-write I can revert that last patch and just try to either add some more ephemeral caching or ditch the visitor to get rid of the quadratic behavior.

It's certainly hand wavy, sure, but for example, a naive implementation would add an fset to every character in a pattern. Moreover, the size of an fset is proportion to the size of the fsets of its sub-expressions, so the growth there could be worrisome.

But honestly, I don't know. It's just a concern. This is definitely something I'm going to have to play with.

Looking at your latest commit, I don't think the implementation is actually what we want anyway. If you're looking at sub-sub-expressions during construction (such as with NestedConcat), then that's basically wrong and defeats the purpose of computing this property at construction. That is, use of NestedConcat violates the notion that the property is computed inductively because you're using more information than just the previous step, but rather, information from many previous steps.

Perhaps the best path forward here is to just do the straight-forward thing and use structural recursion. No visitors. No construction time properties. I can sort this out later, perhaps as part of sorting out literal analysis.

I think we are saved from storing an fset per character by the Literal hir node, and the size of an fset shouldn't grow any bigger a bit less than 256 bytes. Plus larger fsets should have a smaller memory foot print, unless I'm off the mark on how ClassBytes works.

The NestedConcat issue is a bit trickier. I added NestedConcat while I was working to get the whole test suite to pass. It is only needed for some pretty weird regex, so I can go back to revisit the idea to see if there is a way to get rid of it. (Also isn't strong induction still induction?)

I think we are saved from storing an fset per character by the Literal hir node, and the size of an fset shouldn't grow any bigger a bit less than 256 bytes. Plus larger fsets should have a smaller memory foot print, unless I'm off the mark on how ClassBytes works.

I'm afraid this isn't true. Look at the HIR for a simple expression:

$ regex-debug hir 'abc' Hir { kind: Concat( [ Hir { kind: Literal( Unicode( 'a' ) ), info: HirInfo { bools: 1 } }, Hir { kind: Literal( Unicode( 'b' ) ), info: HirInfo { bools: 1 } }, Hir { kind: Literal( Unicode( 'c' ) ), info: HirInfo { bools: 1 } } ] ), info: HirInfo { bools: 1 } }

Notice that every character has its own HirInfo value. Your commit adds an fset to HirInfo, which means every character will have an fset, which means we have quite a bit of overheard per character. Not only is this problematic in terms of size, but this will actually wind up being a separate independent allocation per character.

Now, this isn't necessarily a death blow to computing the onepass property on construction, but I think it's pretty clear that it requires more work than I thought---potentially even refactoring the HIR itself.

(Also isn't strong induction still induction?)

Only logically. If you implement induction by actually using every previous step, then you've effectively lost one of the benefits of building this stuff by construction because you still wind up doing work proportional to the depth of the HIR, and that in turn means you've implementing a quadratic algorithm. The really key idea here is that the construction of an HIR expression should never take time proportional to the depth of its sub-expressions. It should always take time proportional to the number of direct sub-expressions.

I'd like to nudge you towards moving back to structural recursion where things are a bit simpler to reason about. I'd also like to say that I don't think we can really permit quadratic algorithms anywhere in this code, unless we explicitly bound them by some small (heuristically chosen) constant.

I didn't notice that Literals in Hir only have one character now (I was still thinking of them as the old Expr::Literal). The allocation-per-character thing is an issue. We could solve it by special casing an fset that contains only one byte, but that seems a bit hacky.

I'm going to look into avoiding the use of NestedConcat to fix the worst-case quadratic behavior. I wasn't saying that NestedConcat doesn't make the runtime worst-case quadratic, just quibbling unnecessarily over the word inductive.

I don't think you need to fix the use of nested concat because that isn't enough to make this approach work. Could you please go back to the structural recursion implementation as an explicit analysis pass?

src/analysis.rs

ethanpailes · 2018-09-22T16:31:50Z

@RReverser I'm not sure I understand the optimization you are talking about; maybe you can expand on it a bit. The onepass dfa has full support for capture groups. I think if you call .is_match or .find on a regex with capture groups the normal lazy dfa will get used, but its been a little while since I read the execution planner.

src/onepass.rs

ethanpailes · 2018-09-23T02:44:22Z

src/onepass.rs

+
+        // Iterate over the children, visiting lower priority
+        // children first.
+        let mut resume = match &self.prog[inst_idx] {


If the cost of allocating this resume queue for each Bytes instruction is a problem, this might be a candidate for a SmallVec. Probably not worth the dependency without some measurements.

ethanpailes · 2018-09-23T02:59:41Z

src/onepass.rs

+    /// The arrow flows in a funny direction because we want the jobs
+    /// with no dependencies to live at the roots of the DAG so that
+    /// we can process them first.
+    fn perform_forward(&mut self, fwd: Forward) {


I didn't see this whole forwarding scheme in the RE2 code, and I think that the reason boils down to the fact that there is DFA states in the RE2 encoding occupy a fixed amount of space in the transition table, so you can point to a state that has not yet been computed pretty easily. To keep the DFA state encoding small, RE2 devotes more of its equivalent of StatePointers to flags and control. One result of this is that RE2's onepass engine only supports a finite number of capture groups.

It might be worth trying to improve the OnePassCompiler performance by simplifying our DFA encoding a bit, but I think that would probably fit better in a separate PR.

Another concern with forwarding is the time complexity. The graph algorithms involved are O(N*E) where N is the number of instructions and E is the edges between those instructions. It shouldn't be bad if the number of edges is O(N), which feels like it ought to be true, but I don't have an off the cuff proof about why. Its at least worth thinking about.

This reverts commit 6355ce5.

RReverser · 2018-09-23T12:05:15Z

@ethanpailes Right, I suppose I misread description from the original issue

Detecting whether a regex is deterministic or not. A regex is said to be deterministic if, at any point during the execution of a regex program, at most one thread can lead to a match state

as talking about captures but now I see that it talks about matching in general.

ethanpailes · 2018-09-23T17:27:22Z

I just reverted the commit checking for onepassness at Hir construction time and replaced it with a version of the analysis.rs module that uses direct recursion on the Hir. It still uses strong induction instead of weak induction because of the NestedConcat iterator, but it shouldn't hurt the time complexity because I don't recurse on the nested concatenation Hir nodes, only their children.

BurntSushi · 2018-09-23T17:36:46Z

@ethanpailes Great, thank you!

ethanpailes · 2018-12-08T19:16:56Z

@BurntSushi, I know reviewing this change is asking you to do a lot of work, and even though you mentioned trying to get the review done before launching into a big crate-level refactoring I don't want this PR to block you from working on your ideas. I'm perfectly happy to rebase on top of major changes and re-work this PR. The diff for this PR looks rather invasive because of the integration with the execution planner and whatnot, but most of the work went into onepass.rs, which is fairly self-contained. Instruction format changes would be the only thing that really required big changes to onepass.rs, and even then it should be fairly manageable unless the semantics change a lot.

I'm also just selfishly excited to see how you end up re-working the compiler, because I'm sure that I will learn a lot.

BurntSushi · 2018-12-08T19:23:14Z

@ethanpailes Thanks for the encouragement! That's good to know. I have tried looking more at your onepass DFA a couple times, and I've failed to make good progress reviewing it each time. It's a ton of new code.

The big crate level refactoring I have in mind will probably happen at the time scale of years unfortunately. I'm working on some bits and pieces independently of this crate in order to learn/experiment, and see what I can manage to work back into this crate. But I don't have a super concrete plan yet.

ethanpailes · 2018-12-08T20:52:28Z

Ah, ok. I was worried that you already had a grand vision brewing and this change was getting in the way.

I tried to make the code a clean as possible, but it is just inherently a big ball of low-level state. If you make an attempt at reviewing it and run into trouble because of the volume I would be happy to respond to questions, refactor to make the code clearer, and add comments to explain the guts. Another thing is that if the style/structure of the code is a barrier to understanding I would be happy to re-work it (for example maybe my decision to use the iterator protocol to view a DAG in topological order has a more direct, clearer solution that I didn't think of). I trust your judgement about what makes code grokkable significantly more than my own. I'm happy to respond to partial reviews of that sort for however long it takes for you to either get comfortable with the code or decide that it doesn't fit in with the regex crate, and again I don't want this PR blocking other changes.

bors · 2019-03-30T14:47:54Z

☔ The latest upstream changes (presumably 8f9ca96) made this pull request unmergeable. Please resolve the merge conflicts.

ethanpailes · 2020-10-13T15:51:28Z

@BurntSushi I'm going to close the PR for now, though if you ever want to re-open it later, I'd be happy to work to get it merged.

The code in this PR is very algorithmically dense, and while there are some real performance wins they are only for a subset of cases. The longer a regex gets, the less likely it is to be deterministic. Also, because of the lazy DFA, the onepass matcher would probably only bring big wins when looking for capture groups. On the other hand, re2 does include a onepass matcher, and Russ Cox is pretty clever.

Given that this PR has been open for years and you've attempted to give it a proper review several times without being able to really digest it all, I think it is mostly just a drain on maintainer resources and you are already stretched pretty thin between all your different crates. (Also closing it now would give the PR queue for regex inbox zero for the first time in years 🙂 ).

BurntSushi · 2020-10-13T16:07:15Z

@ethanpailes Thanks! It's probably fine to close this for now, you're right. Basically, I'm still working on #656, and my plan there is still to incorporate a one-pass matcher inside of regex-automata. What I'd like to do is think about the problem from first principles for myself, then review this PR and see what I can pull out and move into regex-automata.

Optimizing the capture group case is rather important and comes up more often than you might think. One common example is parsing out the sections of each line in a log file. Right now, that's a pretty common case that is very slow with the regex crate that I think something like the one-pass matcher could help out with a lot.

Believe me, I've had this on my mind all this time, but you're right, I've consistently bounced off of this. But I'm still hoping to work my way toward it. :-)

(regex-automata is coming along fairly nice. I have almost the entire DFA component done with tons of improvements to the Thompson NFA compiler. That will add another type of matching engine to the regex crate. It will also provide a lower level "expert" API for thornier cases such as stream searching or more powerful RegexSet-like functionality. Namely, when using a full DFA, you can search for multiple regexes simultaneously and get their start/end offsets, just like how the aho-corasick crate works.)

BurntSushi · 2021-07-20T00:11:58Z

@ethanpailes So I've started to let my mind wonder about how to do a one-pass regex engine in regex-automata and finally circled back around to your PR to check out what you did in more detail (particularly with respect to the DFA).

My suspicion is that the technique of using a DFA for this in the context of the regex crate is probably dead-on-arrival unfortunately. I'm sorry if I led you down that path. While I haven't taken any measurements, just looking at the compilation phase, for example, suggests that it's probably going to spend a lot of time and memory building the DFA. The thing with DFAs is that we don't just want to avoid them because of the possibility of exponential state blow-up, but also because they are pigs in general. There's not much you can do about the fact that a DFA state is just plain huge (even with the byte class optimization) without sacrificing some portion of its speed advantages. This is why the lazy DFA has its cache capped at a couple MB (IIRC). We really just do not want to be using that much memory per regex.

While I haven't measured anything here---so I'm mostly just guessing---it kinda looks like the one-pass compiler here is probably way too expensive. To a first approximation, the NFA compiler itself is already almost too expensive, particularly when it comes to dealing with huge Unicode character classes.

I haven't circled back around to the analysis phase yet, so it might still be possible to salvage that component I think.

ethanpailes · 2021-07-20T19:00:36Z

@BurntSushi that makes sense.

OnePass regex tend to be on the smaller side since the more is in a regular expression the more chances of an ambiguous transition. It's been a few years since I wrote this, so I don't have everything swapped in, but I think that despite the complexity here there is a chance the space usage won't be too bad because of the lack of exponential blow up and general small size of regex that manage to qualify as OnePass.

Adding a space limit to the compiler (or maybe trying to do it in the analysis pass) might be a way to salvage this work, but I understand if you don't want to open that can of worms.

If you're just using this as an artefact to study for what you do for real, I think it is worth mentioning that this implementation gets a lot of its complexity from trying to handle arbitrary numbers of capture expressions. To do this, states with control codes take up double the space of normal states, which makes the math for figuring out where a given state starts a little harder than if they were all fixed size. The re2 implementation just packs all its control flags in the high order bits of the address words and does not suffer from this problem and much of the complexity it implies.

BurntSushi · 2021-07-20T19:32:38Z

Aye, thanks for the response. The main problem here is somewhat fundamental. A single state in a DFA takes size_of::<idtype>() * alphabet.len() bytes all on its own, and this adds up pretty quickly unfortunately.

Keep in mind that compilation time is also an issue.

The other thing I'll likely try to address is whether there's a way to make things like Unicode-aware \w fit into the one-pass model. Not sure if it's possible, but if Unicode character classes kill the one-pass optimization, then it makes it a lot less useful unfortunately.

ethanpailes · 2021-07-21T13:10:37Z

The main problem here is somewhat fundamental. A single state in a DFA takes size_of::() * alphabet.len() bytes all on its own, and this adds up pretty quickly unfortunately.

Yeah, I can't think of a way around that except the byte class compression that this already does. I guess maybe you could make it lazy? That seems like a lot of effort for onepass regex though.

What about ditching the idea of a DFA entirely and just writing an NFA VM that just takes advantage of the fact that there can be no non-determinism in one pass regex? I think such an NFA engine would not have to deal with juggling thread state arrays like in the Pike VM or any sort of backtracking. There might be some perf wins to be had there, though it might be that the backtracking VM is good enough since it will just never backtrack and already skips the cost of the thread state arrays.

ethanpailes changed the title ~~[WIP] Add a onepass dfa matcher.~~ Add a onepass dfa matcher. Apr 23, 2018

ethanpailes force-pushed the onepass-dfa branch from 51e06ef to 49df592 Compare April 24, 2018 03:01

ethanpailes force-pushed the onepass-dfa branch 3 times, most recently from d39e254 to ff79b1b Compare June 20, 2018 13:12

ethanpailes force-pushed the onepass-dfa branch from ff79b1b to 5061373 Compare June 21, 2018 23:46

ethanpailes force-pushed the onepass-dfa branch from 5061373 to cc907b8 Compare July 10, 2018 14:38

Ethan Pailes added 2 commits September 15, 2018 01:17

ethanpailes force-pushed the onepass-dfa branch from cc907b8 to 3669881 Compare September 15, 2018 12:22

RReverser reviewed Sep 15, 2018

View reviewed changes

BurntSushi reviewed Sep 20, 2018

View reviewed changes

Add doc comment to is_onepass

b6fab61

rewrite NestedConcat to avoid stack space usage liniear in the size o…

5ff7c80

…f the input.

BurntSushi reviewed Sep 21, 2018

View reviewed changes

refactor is_onepass to be part of Hir construction.

6355ce5

ethanpailes commented Sep 23, 2018

View reviewed changes

src/onepass.rs Outdated Show resolved Hide resolved

ethanpailes commented Sep 23, 2018

View reviewed changes

Ethan Pailes added 3 commits September 23, 2018 03:59

Revert "refactor is_onepass to be part of Hir construction."

6ec93a6

This reverts commit 6355ce5.

switch is_onepass to direct recursion instead of visitor.

b5b8f82

remove one allocation per instruction by directly mutating state_edge

b31e36f

BurntSushi mentioned this pull request Mar 24, 2020

plan: moving regex engines to regex-automata #656

Closed

ethanpailes closed this Oct 13, 2020

BurntSushi mentioned this pull request Dec 29, 2020

Performance issues #389

Closed

Add a onepass dfa matcher. #467

Add a onepass dfa matcher. #467

Conversation

ethanpailes commented Apr 23, 2018

BurntSushi commented Apr 23, 2018

ethanpailes commented Apr 23, 2018

ethanpailes commented Apr 30, 2018

ethanpailes commented Jun 21, 2018

bors commented Jun 21, 2018

bors commented Jul 18, 2018

ethanpailes commented Sep 15, 2018

BurntSushi commented Sep 15, 2018

ethanpailes commented Sep 15, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BurntSushi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ethanpailes commented Sep 20, 2018

BurntSushi commented Sep 20, 2018 • edited

ethanpailes commented Sep 21, 2018

RReverser commented Sep 21, 2018 • edited

BurntSushi commented Sep 21, 2018

BurntSushi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BurntSushi Sep 23, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ethanpailes commented Sep 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RReverser commented Sep 23, 2018

ethanpailes commented Sep 23, 2018

BurntSushi commented Sep 23, 2018

ethanpailes commented Dec 8, 2018

BurntSushi commented Dec 8, 2018

ethanpailes commented Dec 8, 2018

bors commented Mar 30, 2019

ethanpailes commented Oct 13, 2020

BurntSushi commented Oct 13, 2020

BurntSushi commented Jul 20, 2021 • edited

ethanpailes commented Jul 20, 2021

BurntSushi commented Jul 20, 2021

ethanpailes commented Jul 21, 2021

BurntSushi commented Sep 20, 2018 •

edited

RReverser commented Sep 21, 2018 •

edited

BurntSushi Sep 23, 2018 •

edited

BurntSushi commented Jul 20, 2021 •

edited