Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a onepass dfa matcher. #467

Closed
wants to merge 8 commits into from

Conversation

ethanpailes
Copy link
Contributor

This PR adds the onepass DFA that I've been working on. I have marked it as a WIP because there is still one failing test case that I need to dig my teeth into.

Currently the only way to active the onepass matcher is to call .onepass() on ExecBuilder. I was hoping to get this merged in without having it enabled by default at first, then I can do a seperate PR which adds more testing/benchmarking and actually enables it. That approach also provides more time to discuss how exactly the onepass matcher will get used by the execution planner. My guess is that the onepass matcher should always be used when it can be, but it may turn out that shoveling capture groups around is pricier than the DFA startup cost.

We talked about test coverage over in #68, so I did some simple checks and found that about 300/700 of the tests in the test suite trigger the onepass matcher.

Closes #68

@BurntSushi
Copy link
Member

🎉 🎉 🎉 🎉 🎉

I can't wait to find time to dig into this. It still might be a bit though.

We talked about test coverage over in #68, so I did some simple checks and found that about 300/700 of the tests in the test suite trigger the onepass matcher.

Wow. That's awesome news. One thing that I think is important here is to create an additional test suite that disables the one-pass matcher so that those 300 tests hit the other engines as well.

(How the various engines are tested desperately needs to be improved, but that's well outside the scope here I think. It's definitely slated for a rethink as I refactor regex internals.)

@ethanpailes
Copy link
Contributor Author

I'm probably going to get bit for posting before CI finishes, but that last commit should turn all the tests green. The issue turned out to be a silly config bug.

@ethanpailes ethanpailes changed the title [WIP] Add a onepass dfa matcher. Add a onepass dfa matcher. Apr 23, 2018
@ethanpailes
Copy link
Contributor Author

Another possible optimization: don't record last_match until we leave a match state. Search for "another inner loop!" in the src/dfa.rs source code.

@ethanpailes ethanpailes force-pushed the onepass-dfa branch 3 times, most recently from d39e254 to ff79b1b Compare June 20, 2018 13:12
@ethanpailes
Copy link
Contributor Author

Given that there have been a number of new features and a new minimum Rust version since I first opened this PR, I've rebased onto the current master. I've also squashed the various polishing patches that I added, so there are now only two patches in the PR. This is still a pretty big diff, so I understand that it will probably take time to digest (if it would help reduce the work at all I would be happy to talk through the design online or in person).

@bors
Copy link
Contributor

bors commented Jun 21, 2018

☔ The latest upstream changes (presumably 77140e7) made this pull request unmergeable. Please resolve the merge conflicts.

@bors
Copy link
Contributor

bors commented Jul 18, 2018

☔ The latest upstream changes (presumably 2918aab) made this pull request unmergeable. Please resolve the merge conflicts.

Ethan Pailes added 2 commits September 15, 2018 01:17
This patch adds the analysis function, `is_onepass`
found in `analysis.rs`, which is required in order to
determine if a particular regex can be executed using the
onepass DFA. A regex is said to be onepass iff there are
no non-deterministic splits in it. An example of a
non-determinism in a regex is `/alex|apple/`. Here we can't
know which branch to take because both of them start with
`a`. A more subtle example is `/(?:alex)*apple/`. After every
iteration of the Kleene star, we might branch back to `alex`
or continue on to `apple`.
This patch adds a onepass matcher, which is a DFA that
has all the abilities of an NFA! There are lots
of expressions that a onepass matcher can't handle, namely
those cases where a regex contains non-determinism.

The general approach we take is as follows:

    1. Check if a regex is onepass using `src/onepass.rs::is_onepass`.
    2. Compile a new regex program using the compiler with the bytes
        flag set.
    3. Compile a onepass DFA from the program produced in step 2. We
        will roughly map each instruction to a state in the DFA, though
        instructions like `split` don't get states.
        a. Make a new transition table for the first instruction.
        b. For each child of the first instruction:
            - If it is a bytes instruction, add a transition to
                the table for every byte class in the instruction.
            - If it is an instruction which consumes zero input
                (like `EmptyLook` or `Save`), emit a job to a DAG asking to
                forward the first instruction state to the state for
                the non-consuming instruction.
            - Push the child instruction to a queue of instructions to
                process.
        c. Peel off an instruction from the queue and go back to
            step a, processing the instruction as if it was the
            first instruction. If the queue is empty, continue with
            step d.
        d. Topologically sort the forwarding jobs, and shuffle
            the transitions from the forwarding targets to the
            forwarding sources in topological order.
        e. Bake the intermediary transition tables down into a single
            flat vector. States which require some action (`EmptyLook`
            and `Save`) get an extra entry in the baked transition table
            that contains metadata instructing them on how to perform
            their actions.
    4. Wait for the user to give us some input.
    5. Execute the DFA:
        - The inner loop is basically:
            while at < text.len():
                state_ptr = baked_table[text[at]]
                at += 1
        - There is a lot of window dressing to handle special states.

The idea of a onepass matcher comes from Russ Cox and
his RE2 library. I haven't been as good about reading
the RE2 source as I should have, but I've gotten the
impression that the RE2 onepass matcher is more in the
spirit of an NFA simulation without threads than a DFA.
@ethanpailes
Copy link
Contributor Author

I resolved the merge conflicts.

@BurntSushi
Copy link
Member

@ethanpailes Thanks! I haven't forgotten about this. I'm working up to it. Soon. :)

@ethanpailes
Copy link
Contributor Author

Thanks!

@@ -309,6 +309,11 @@ impl<I: Interval> IntervalSet<I> {
}
true
}

/// Returns true iff this class is empty.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: if.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a typo. iff is parlance for "if and only if."

@@ -797,6 +797,16 @@ impl ClassUnicode {
pub fn symmetric_difference(&mut self, other: &ClassUnicode) {
self.set.symmetric_difference(&other.set);
}

/// Returns true iff this character class contains no characters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same typo.

@@ -998,6 +1008,16 @@ impl ClassBytes {
pub fn is_all_ascii(&self) -> bool {
self.set.intervals().last().map_or(true, |r| r.end <= 0x7F)
}

/// Returns true iff this character class contains no characters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same (probably copy-paste).

Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm determined to review this PR. In particular, I want to get this merged before I start overhauling internals. :-)

I haven't review the whole thing yet. It will take me a bit. But I'm going to try and do this incrementally because my previous strategy of finding the time and will to review the whole thing has failed. Sorry! :(

src/analysis.rs Outdated
self.next()
}
_ => Some(&es[idx]),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ethanpailes I believe this routine uses a stack size proportional to the depth of the HIR. It seems like it should be possible to flatten this with a loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Thanks! I'll fix this with a loop as you suggest.

Note to future self: The degenerate case is when a bunch of concats are nested, with the child concat always coming first.

src/analysis.rs Outdated
// there is byte-level nondeterminism in the \d character
// class, and we care about things in the byte space rather
// than the character space. If you do a onepass engine at
// the character level, Cox's example is indeed onepass.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a test for said regex?

IIRC, RE2 uses a byte level representation as well. I wonder if the real reason here is that \d is Unicode aware by default here, but is only ASCII aware in RE2. That is, would (?-u)(\d+)-(\d+) be one pass?

I think overall I would expect to see quite a bit more tests that check the top level is_onepass routine here. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My impression from skimming the RE2 & go onepass code was that it was basically a variant of the PikeVM with some extra operations, and as such operates at a character level. The work on this branch uses a different algorithm because I didn't see a good reason to pay the cost of NFA simulation in the face of an unambiguous regex. So even if \d was Unicode aware in RE2 \d+- would still be onepass in RE2, but not in this branch.

I will go back and try to enumerate a much more cases, this test module is a bit on the sparse side.

(I might be off the mark about the RE2 onepass matcher. To be honest I didn't read the RE2 & go onepass code as closely as I probably should have before I dove into this implementation :P)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the RE2 onepass matcher is definitely byte oriented because RE2 doesn't have any character level byte codes. Even in this crate, the pike VM can match Unicode character classes without character level byte codes. For RE2, see: https://github.com/google/re2/blob/e6acac839be126c8a20a20183f20a23fa9f4895e/re2/onepass.cc#L459-L524

The Go onepass matcher---along with all of the matchers in Go---are however character oriented. There is no byte oriented matching in Go's regexp engine IIRC.

@ethanpailes
Copy link
Contributor Author

The work on this branch started off with me under the impression that I was just engaged in porting the onepass matcher from RE2 to this crate, but as I learned more along the way the algorithm I settled on drifted further away from the one used there. In order to avoid confusion it might be worth giving it a new name, such as "unambigious_dfa". "onepass" sounds cooler, but it is probably worth being descriptive.

@BurntSushi
Copy link
Member

BurntSushi commented Sep 20, 2018

@ethanpailes SGTM. How much work would it be to write up a high level description of how the detection works in a comment in the code? That would help me a lot.

I don't think it needs to be done in this PR, but one thing worth thinking about is whether the algorithm can be built incrementally as HIR are constructed. This would avoid the need for having a second pass over the HIR. You can see how I'm doing this with a variety of other attributes by looking at the various HIR constructors. e.g.,

pub fn concat(mut exprs: Vec<Hir>) -> Hir {
--- This is something I'd be happy to adapt the current code to, so I'm in no way asking for you to rewrite it here, but getting your thoughts on the feasibility of the idea would be very useful!

@ethanpailes
Copy link
Contributor Author

I decided to do a bit more due diligence and actually read the RE2 code more closely. It turns out the algorithm is actually pretty much the same. I think I was just thrown off the first time by the references to a "onepass nfa" in the code, even though the RE2's onepass matcher is compiling and executing a dfa just like this branch does.

As for determining is_onepass all in one compiler pass when going from Ast -> Hir, it might be possible, but we might also be able to just determine onepassness by examining the bytecode. This seems to be what RE2 does. Are you just worried about the performance/memory implications of letting the crate sprout a middle end? A separate pass seems more maintainable to me.

@RReverser
Copy link
Contributor

RReverser commented Sep 21, 2018

Not sure how feasible this would be, but could one pass matcher be also used for regexps which normally are not eligible due to parallel threads with capture groups, but are called with .is_match, .find or similar method which ignores capture groups anyway? I imagine this would increase memory consumption, but might be worth it in terms of speed?

@BurntSushi
Copy link
Member

As for determining is_onepass all in one compiler pass when going from Ast -> Hir

Hmm, I think you might have misunderstood me. My suggestion isn't really about the translation from AST to HIR, but rather, the construction of any HIR. There is no "pass" here; the idea is that the property is defined inductively on the structure of the HIR. That is, given any two HIRs x and y and a known value for whether each is onepass or not, can you compute the following in time proportional to the size of its non-recursive sub-expressions?

Concat(x, y) => onepass(x) && onepass(y) && no_intersection(tail(x), head(y))
Alternation(x, y) => ...
Repeat(x) => ...
...

and so on. My guess is that there might be additional state required but that it otherwise seems plausible.

Are you just worried about the performance/memory implications of letting the crate sprout a middle end? A separate pass seems more maintainable to me.

Computing properties upon construction of the HIR is what feels like the most ideal solution is, with respect to performance, memory and maintenance. :-) It is explicitly inductive and impossible to forget, because it's baked directly into the construction of the HIR.

I am totally fine starting with a separate pass. I don't think having a separate pass is bad. I can give more thoughts on the feasibility of the explicitly inductive approach once I understand the detection algorithm better.

but we might also be able to just determine onepassness by examining the bytecode

Yeah, that is plausible. For now, I like the idea of using the HIR. I don't have a good feel for what role the bytecode will play in analysis passes. In the past, I've found the bytecode more difficult to do analysis on, but it's been a while. My intent is to rethink the bytecode completely.

Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ethanpailes All righty, I think I've finally made it through the analysis portion of this PR. Unfortunately, I think there are significant issues that need to be addressed there in terms of the runtime complexity of the algorithm. Hopefully I didn't misunderstand!

I will work on the DFA portion soon.

And thanks so much for the doc comment that you wrote. That really helped speed up my review.

/// sets. The first set of an expression is the set of
/// bytes which might begin a word in the language of that
/// expression. If the expression can accept the empty string,
/// the first set takes note of that as well.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Like Glushkov's construction algorithm. The "first" set is computed in addition to the "last" set (and the set of factors of length 2).

src/analysis.rs Outdated
/// We only ever care about the first byte of a particular character,
/// because the onepass DFA is implemented in the byte space, not the
/// character space. This means, for example, that a branch between
/// lowercase delta and uppercase delta is actually non-deterministic.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this means that any regex that uses \d, \w and (probably) \s is also not one pass either, right? But I think all of the ASCII variants are onepass?

(This question is more to just check my understanding here.)

src/analysis.rs Outdated
&HirKind::Alternation(ref es) => {
let mut fset = FirstSet::empty();
for e in es {
fset.union(&fset_of(e));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... Hmm... So this is also going to use stack size proportional to the depth of the HIR. Basically, fset_of here is implemented using normal structural recursion. Fixing this is pretty tricky, so I recommend that we mush on at this point. In particular, the literal analysis also suffers from this same problem, but it is a TODO item to fix.

The only way to fix this is to wire first set construction into the visitor itself (which is a pain), or to take the approach of building the onepass analysis (and "first" sets) as the HIR is itself constructed.

With all that said, popping up a level, I think the actual algorithm you've implemented here is quadratic in the depth of the HIR. Namely, visit_pre above is called on every single HIR expression, and for every HIR expression with sub-expressions, you in turn call fset_of, which itself does structural induction on the HIR. So I think you wind up repeating the entire analysis quite a bit.

I am OK with using structural induction for now and just documenting the stack growth as a temporary downside, but I do think we need to get rid of the quadratic behavior. For now, I think it would probably be best to abandon the use of the HIR visitor. At least, that feels like the simplest path to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is pretty unforgivable on my part. The solution is definitely going to have to include caching results somewhere, so I've started work rewriting this module as an extra thing that gets constructed while building Hirs like you've suggested elsewhere. I should have it fixed soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ethanpailes Awesome! I just want to re-iterate that I am OK with doing structural recursion here for the time being. Literal anaylsis already does this, so it's not introducing a new regression.

The problem that could potentially arise with doing this at HIR construction time is that I think you'll need to keep a copy of fsets for every HIR expression, and that could prove to be quite precarious. This isn't something I realized at the outset when I suggested it. So it's not clear to me that it is definitely the path we want to take.

Sorry for the back and forth on this, but I'm learning as I go. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was kinda worried about the additional memory usage, but the fact that FirstSets are represented with a ClassBytes instead of a [u8; 256] does help a bit. What's the use case to worry about for an expanded Hir memory footprint? I would think that there would be more memory used during execution of the regex (at least in the worst case), and I don't think adding a ClassBytes to each Hir node will take up more memory than the 3 Programs and a OnePass that make up an ExecReadOnly. Does the Hir make it into a Regex object somewhere that I havn't noticed?

If I was trigger happy in doing the re-write I can revert that last patch and just try to either add some more ephemeral caching or ditch the visitor to get rid of the quadratic behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's certainly hand wavy, sure, but for example, a naive implementation would add an fset to every character in a pattern. Moreover, the size of an fset is proportion to the size of the fsets of its sub-expressions, so the growth there could be worrisome.

But honestly, I don't know. It's just a concern. This is definitely something I'm going to have to play with.

Looking at your latest commit, I don't think the implementation is actually what we want anyway. If you're looking at sub-sub-expressions during construction (such as with NestedConcat), then that's basically wrong and defeats the purpose of computing this property at construction. That is, use of NestedConcat violates the notion that the property is computed inductively because you're using more information than just the previous step, but rather, information from many previous steps.

Perhaps the best path forward here is to just do the straight-forward thing and use structural recursion. No visitors. No construction time properties. I can sort this out later, perhaps as part of sorting out literal analysis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are saved from storing an fset per character by the Literal hir node, and the size of an fset shouldn't grow any bigger a bit less than 256 bytes. Plus larger fsets should have a smaller memory foot print, unless I'm off the mark on how ClassBytes works.

The NestedConcat issue is a bit trickier. I added NestedConcat while I was working to get the whole test suite to pass. It is only needed for some pretty weird regex, so I can go back to revisit the idea to see if there is a way to get rid of it. (Also isn't strong induction still induction?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are saved from storing an fset per character by the Literal hir node, and the size of an fset shouldn't grow any bigger a bit less than 256 bytes. Plus larger fsets should have a smaller memory foot print, unless I'm off the mark on how ClassBytes works.

I'm afraid this isn't true. Look at the HIR for a simple expression:

$ regex-debug hir 'abc'
Hir {
    kind: Concat(
        [
            Hir {
                kind: Literal(
                    Unicode(
                        'a'
                    )
                ),
                info: HirInfo {
                    bools: 1
                }
            },
            Hir {
                kind: Literal(
                    Unicode(
                        'b'
                    )
                ),
                info: HirInfo {
                    bools: 1
                }
            },
            Hir {
                kind: Literal(
                    Unicode(
                        'c'
                    )
                ),
                info: HirInfo {
                    bools: 1
                }
            }
        ]
    ),
    info: HirInfo {
        bools: 1
    }
}

Notice that every character has its own HirInfo value. Your commit adds an fset to HirInfo, which means every character will have an fset, which means we have quite a bit of overheard per character. Not only is this problematic in terms of size, but this will actually wind up being a separate independent allocation per character.

Now, this isn't necessarily a death blow to computing the onepass property on construction, but I think it's pretty clear that it requires more work than I thought---potentially even refactoring the HIR itself.

(Also isn't strong induction still induction?)

Only logically. If you implement induction by actually using every previous step, then you've effectively lost one of the benefits of building this stuff by construction because you still wind up doing work proportional to the depth of the HIR, and that in turn means you've implementing a quadratic algorithm. The really key idea here is that the construction of an HIR expression should never take time proportional to the depth of its sub-expressions. It should always take time proportional to the number of direct sub-expressions.

I'd like to nudge you towards moving back to structural recursion where things are a bit simpler to reason about. I'd also like to say that I don't think we can really permit quadratic algorithms anywhere in this code, unless we explicitly bound them by some small (heuristically chosen) constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice that Literals in Hir only have one character now (I was still thinking of them as the old Expr::Literal). The allocation-per-character thing is an issue. We could solve it by special casing an fset that contains only one byte, but that seems a bit hacky.

I'm going to look into avoiding the use of NestedConcat to fix the worst-case quadratic behavior. I wasn't saying that NestedConcat doesn't make the runtime worst-case quadratic, just quibbling unnecessarily over the word inductive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to fix the use of nested concat because that isn't enough to make this approach work. Could you please go back to the structural recursion implementation as an explicit analysis pass?

src/analysis.rs Outdated Show resolved Hide resolved
src/analysis.rs Outdated Show resolved Hide resolved
src/analysis.rs Outdated Show resolved Hide resolved
@ethanpailes
Copy link
Contributor Author

@RReverser I'm not sure I understand the optimization you are talking about; maybe you can expand on it a bit. The onepass dfa has full support for capture groups. I think if you call .is_match or .find on a regex with capture groups the normal lazy dfa will get used, but its been a little while since I read the execution planner.

src/onepass.rs Outdated Show resolved Hide resolved

// Iterate over the children, visiting lower priority
// children first.
let mut resume = match &self.prog[inst_idx] {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cost of allocating this resume queue for each Bytes instruction is a problem, this might be a candidate for a SmallVec. Probably not worth the dependency without some measurements.

/// The arrow flows in a funny direction because we want the jobs
/// with no dependencies to live at the roots of the DAG so that
/// we can process them first.
fn perform_forward(&mut self, fwd: Forward) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see this whole forwarding scheme in the RE2 code, and I think that the reason boils down to the fact that there is DFA states in the RE2 encoding occupy a fixed amount of space in the transition table, so you can point to a state that has not yet been computed pretty easily. To keep the DFA state encoding small, RE2 devotes more of its equivalent of StatePointers to flags and control. One result of this is that RE2's onepass engine only supports a finite number of capture groups.

It might be worth trying to improve the OnePassCompiler performance by simplifying our DFA encoding a bit, but I think that would probably fit better in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another concern with forwarding is the time complexity. The graph algorithms involved are O(N*E) where N is the number of instructions and E is the edges between those instructions. It shouldn't be bad if the number of edges is O(N), which feels like it ought to be true, but I don't have an off the cuff proof about why. Its at least worth thinking about.

@RReverser
Copy link
Contributor

@ethanpailes Right, I suppose I misread description from the original issue

Detecting whether a regex is deterministic or not. A regex is said to be deterministic if, at any point during the execution of a regex program, at most one thread can lead to a match state

as talking about captures but now I see that it talks about matching in general.

@ethanpailes
Copy link
Contributor Author

I just reverted the commit checking for onepassness at Hir construction time and replaced it with a version of the analysis.rs module that uses direct recursion on the Hir. It still uses strong induction instead of weak induction because of the NestedConcat iterator, but it shouldn't hurt the time complexity because I don't recurse on the nested concatenation Hir nodes, only their children.

@BurntSushi
Copy link
Member

@ethanpailes Great, thank you!

@ethanpailes
Copy link
Contributor Author

@BurntSushi, I know reviewing this change is asking you to do a lot of work, and even though you mentioned trying to get the review done before launching into a big crate-level refactoring I don't want this PR to block you from working on your ideas. I'm perfectly happy to rebase on top of major changes and re-work this PR. The diff for this PR looks rather invasive because of the integration with the execution planner and whatnot, but most of the work went into onepass.rs, which is fairly self-contained. Instruction format changes would be the only thing that really required big changes to onepass.rs, and even then it should be fairly manageable unless the semantics change a lot.

I'm also just selfishly excited to see how you end up re-working the compiler, because I'm sure that I will learn a lot.

@BurntSushi
Copy link
Member

@ethanpailes Thanks for the encouragement! That's good to know. I have tried looking more at your onepass DFA a couple times, and I've failed to make good progress reviewing it each time. It's a ton of new code.

The big crate level refactoring I have in mind will probably happen at the time scale of years unfortunately. I'm working on some bits and pieces independently of this crate in order to learn/experiment, and see what I can manage to work back into this crate. But I don't have a super concrete plan yet.

@ethanpailes
Copy link
Contributor Author

Ah, ok. I was worried that you already had a grand vision brewing and this change was getting in the way.

I tried to make the code a clean as possible, but it is just inherently a big ball of low-level state. If you make an attempt at reviewing it and run into trouble because of the volume I would be happy to respond to questions, refactor to make the code clearer, and add comments to explain the guts. Another thing is that if the style/structure of the code is a barrier to understanding I would be happy to re-work it (for example maybe my decision to use the iterator protocol to view a DAG in topological order has a more direct, clearer solution that I didn't think of). I trust your judgement about what makes code grokkable significantly more than my own. I'm happy to respond to partial reviews of that sort for however long it takes for you to either get comfortable with the code or decide that it doesn't fit in with the regex crate, and again I don't want this PR blocking other changes.

@bors
Copy link
Contributor

bors commented Mar 30, 2019

☔ The latest upstream changes (presumably 8f9ca96) made this pull request unmergeable. Please resolve the merge conflicts.

@ethanpailes
Copy link
Contributor Author

@BurntSushi I'm going to close the PR for now, though if you ever want to re-open it later, I'd be happy to work to get it merged.

The code in this PR is very algorithmically dense, and while there are some real performance wins they are only for a subset of cases. The longer a regex gets, the less likely it is to be deterministic. Also, because of the lazy DFA, the onepass matcher would probably only bring big wins when looking for capture groups. On the other hand, re2 does include a onepass matcher, and Russ Cox is pretty clever.

Given that this PR has been open for years and you've attempted to give it a proper review several times without being able to really digest it all, I think it is mostly just a drain on maintainer resources and you are already stretched pretty thin between all your different crates. (Also closing it now would give the PR queue for regex inbox zero for the first time in years 🙂 ).

@BurntSushi
Copy link
Member

@ethanpailes Thanks! It's probably fine to close this for now, you're right. Basically, I'm still working on #656, and my plan there is still to incorporate a one-pass matcher inside of regex-automata. What I'd like to do is think about the problem from first principles for myself, then review this PR and see what I can pull out and move into regex-automata.

Optimizing the capture group case is rather important and comes up more often than you might think. One common example is parsing out the sections of each line in a log file. Right now, that's a pretty common case that is very slow with the regex crate that I think something like the one-pass matcher could help out with a lot.

Believe me, I've had this on my mind all this time, but you're right, I've consistently bounced off of this. But I'm still hoping to work my way toward it. :-)

(regex-automata is coming along fairly nice. I have almost the entire DFA component done with tons of improvements to the Thompson NFA compiler. That will add another type of matching engine to the regex crate. It will also provide a lower level "expert" API for thornier cases such as stream searching or more powerful RegexSet-like functionality. Namely, when using a full DFA, you can search for multiple regexes simultaneously and get their start/end offsets, just like how the aho-corasick crate works.)

@BurntSushi BurntSushi mentioned this pull request Dec 29, 2020
@BurntSushi
Copy link
Member

BurntSushi commented Jul 20, 2021

@ethanpailes So I've started to let my mind wonder about how to do a one-pass regex engine in regex-automata and finally circled back around to your PR to check out what you did in more detail (particularly with respect to the DFA).

My suspicion is that the technique of using a DFA for this in the context of the regex crate is probably dead-on-arrival unfortunately. I'm sorry if I led you down that path. While I haven't taken any measurements, just looking at the compilation phase, for example, suggests that it's probably going to spend a lot of time and memory building the DFA. The thing with DFAs is that we don't just want to avoid them because of the possibility of exponential state blow-up, but also because they are pigs in general. There's not much you can do about the fact that a DFA state is just plain huge (even with the byte class optimization) without sacrificing some portion of its speed advantages. This is why the lazy DFA has its cache capped at a couple MB (IIRC). We really just do not want to be using that much memory per regex.

While I haven't measured anything here---so I'm mostly just guessing---it kinda looks like the one-pass compiler here is probably way too expensive. To a first approximation, the NFA compiler itself is already almost too expensive, particularly when it comes to dealing with huge Unicode character classes.

I haven't circled back around to the analysis phase yet, so it might still be possible to salvage that component I think.

@ethanpailes
Copy link
Contributor Author

@BurntSushi that makes sense.

OnePass regex tend to be on the smaller side since the more is in a regular expression the more chances of an ambiguous transition. It's been a few years since I wrote this, so I don't have everything swapped in, but I think that despite the complexity here there is a chance the space usage won't be too bad because of the lack of exponential blow up and general small size of regex that manage to qualify as OnePass.

Adding a space limit to the compiler (or maybe trying to do it in the analysis pass) might be a way to salvage this work, but I understand if you don't want to open that can of worms.

If you're just using this as an artefact to study for what you do for real, I think it is worth mentioning that this implementation gets a lot of its complexity from trying to handle arbitrary numbers of capture expressions. To do this, states with control codes take up double the space of normal states, which makes the math for figuring out where a given state starts a little harder than if they were all fixed size. The re2 implementation just packs all its control flags in the high order bits of the address words and does not suffer from this problem and much of the complexity it implies.

@BurntSushi
Copy link
Member

Aye, thanks for the response. The main problem here is somewhat fundamental. A single state in a DFA takes size_of::<idtype>() * alphabet.len() bytes all on its own, and this adds up pretty quickly unfortunately.

Keep in mind that compilation time is also an issue.

The other thing I'll likely try to address is whether there's a way to make things like Unicode-aware \w fit into the one-pass model. Not sure if it's possible, but if Unicode character classes kill the one-pass optimization, then it makes it a lot less useful unfortunately.

@ethanpailes
Copy link
Contributor Author

The main problem here is somewhat fundamental. A single state in a DFA takes size_of::() * alphabet.len() bytes all on its own, and this adds up pretty quickly unfortunately.

Yeah, I can't think of a way around that except the byte class compression that this already does. I guess maybe you could make it lazy? That seems like a lot of effort for onepass regex though.

What about ditching the idea of a DFA entirely and just writing an NFA VM that just takes advantage of the fact that there can be no non-determinism in one pass regex? I think such an NFA engine would not have to deal with juggling thread state arrays like in the Pike VM or any sort of backtracking. There might be some perf wins to be had there, though it might be that the backtracking VM is good enough since it will just never backtrack and already skips the cost of the thread state arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement one-pass NFA matcher
4 participants