Skip to content

Latest commit

 

History

History
80 lines (52 loc) · 2.91 KB

PERFORMANCE.md

File metadata and controls

80 lines (52 loc) · 2.91 KB

The README has a quick introduction to the performance of this crate. This will look at some examples and compare them to the Oniguruma engine.

Catastrophic backtracking

Backtracking engines can have worst-case performance when the regular expression forces the engine to consider an exponentially increasing number of sub-cases.

For a good explanation of that, read Runaway Regular Expressions: Catastrophic Backtracking.

Let's look at the regex from the README again:

(a|b|ab)*bc

And the input text:

ababababababababababababababababababababababababababababac

Python's engine has exponential runtime. The regex crate and fancy-regex however have no problem with it.

Oniguruma

Oniguruma implements a backtracking engine. So we'd expect it to have a problem with the above regex too.

However, in the above case, it quickly finds that there's no match. How is that possible? The answer is that it has optimizations which sometimes help it avoid having to do any matching at all:

In the pattern (a|b|ab)*bc, you might notice that if the input doesn't contain bc, the pattern will never match. Oniguruma detects that and, before it tries to do any matching, tries to find bc in the input.

But what happens if we add bc at the end of the input, like this:

ababababababababababababababababababababababababababababacbc

Now the optimization doesn't help anymore, and Oniguruma is slow too.

fancy-regex

For (a|b|ab)*bc fancy-regex is fast in all cases because it can delegate to the regex crate which matches it in linear runtime.

Let's look at another regex, one that makes use of a "fancy" look-ahead:

(a|b|ab)*(?=c)

When fancy-regex matches it against this input:

abababababababababababababababababababababababababababab

It's slow! The reason is that (?=c) is not supported by the regex crate, so we need to handle it with backtracking. And because (a|b|ab)* is before it, we need to do it with backtracking as well.

Oniguruma doesn't have a problem with this particular case because its optimization saves it again: It checks if there's a c in the input before doing any matching.

There's nothing preventing fancy-regex from adding similar optimizations in the future, but it's not done yet.

Note that how much fancy-regex can do without backtracking depends on the structure of the regex. For example, with (?=(a|b|ab)*bc), the inner part of the look-ahead can be delegated to regex entirely.

Summary

  • If the regex doesn't use fancy features, fancy-regex should have linear runtime compared to Oniguruma's exponential worst-case.
  • Even if the regex doesn't use any fancy features, Oniguruma can be faster because it is a mature and highly optimized engine.
  • With fancy features, Oniguruma can be faster because of optimizations.