Simple regex like \w{256} take tens of milliseconds to compile #1095

Closed
vxgmichel opened this issue Oct 3, 2023 · 8 comments

@vxgmichel

What version of regex are you using?

Version 1.9.6

Describe the bug at a high level.

A simple regex like \w{256} takes tens of milliseconds to compile in release mode, and hundreds of milliseconds in debug mode.

Consider a code base with a single regex like this and a hundred tests in debug mode. If each test runs in a different process, that's about 30 seconds of CPU time spent solely on building this one regex for every run of the test suite.

What are the steps to reproduce the behavior?

#![feature(test)]

use regex::Regex;

extern crate test;

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn compile_unicode_regex(b: &mut Bencher) {
        b.iter(|| Regex::new(r"\w{256}"));
    }
}

What is the actual behavior?

About 40 milliseconds in release mode and 290 milliseconds in debug mode to build \w{256}:

$ cargo +nightly bench  
    Finished bench [optimized] target(s) in 0.31s
     Running unittests src/lib.rs (target/release/deps/regex_slow-319c2c8ca1c009f8)

running 1 test
test tests::compile_unicode_regex ... bench:  41,179,283 ns/iter (+/- 1,554,197)

$ cargo +nightly bench --profile dev
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running unittests src/lib.rs (target/debug/deps/regex_slow-623fc42fab1b6baf)

running 1 test
test tests::compile_unicode_regex ... bench: 287,807,317 ns/iter (+/- 6,803,723)

What is the expected behavior?

Something similar to Python, where it takes about 50 microseconds to build and match \w{256}:

import re
import time
start = time.perf_counter()
assert re.compile(r"\w{256}").match("é" * 256)
print(f"> {(time.perf_counter() - start) * 1_000_000:.3f} μs")
# Outputs:
# > 50.566 μs
BurntSushi added a commit to BurntSushi/rebar that referenced this issue Oct 3, 2023
@BurntSushi
Member

I've added a benchmark for this to rebar, including a brief analysis: https://github.com/BurntSushi/rebar/blob/9bac8479a550b229e93ea2a7ce71cb6067e05cb2/benchmarks/definitions/reported/i1095-word-repetition.toml

After building the engines (the hard part), collecting measurements for this benchmark is easy:

$ rebar measure -f '^reported/i1095-word-repetition/' | tee i1095-word-repetition.csv

The short answer here is that \w{256} is not really a simple regex. The Unicode-aware \w matches one of over 100,000 possible codepoints, and you're repeating that 256 times over. This particular library uses byte-based finite automata, and most of the time spent during compilation of this regex is indeed in the process of converting a \w to a byte-based UTF-8 automaton over and over again. RE2 suffers from a similar affliction:

$ rebar cmp i1095-word-repetition.csv -e '^(rust/regex.*|re2|go/regexp)$' -f 'unicode-compile'
benchmark                                       go/regexp       re2                 rust/regex          rust/regexold
---------                                       ---------       ---                 ----------          -------------
reported/i1095-word-repetition/unicode-compile  8.21us (1.00x)  59.12ms (7200.97x)  31.43ms (3828.26x)  44.05ms (5365.41x)

This crate is not only faster than RE2 (although they're still in the same ballpark), but it's faster than what it was about 1 year ago. See the analysis I linked to above for why Go's engine compiles this regex so quickly despite being based on finite automata.

As for the rest, most engines are backtracking based. A backtracker does not need to convert large sets of codepoints into UTF-8 automata, and so its compilation phase is usually quite quick. Indeed, there's often little to no difference between compiling the Unicode-aware \w and the ASCII-only \w. You can see that in the results below, which show Unicode-vs-ASCII compilation for many engines (most of which are backtracking; only dotnet/nobacktrack, Hyperscan, Go, RE2 and Rust's regex crate are not):

$ rebar cmp i1095-word-repetition.csv --row engine -f compile
engine              reported/i1095-word-repetition/unicode-compile  reported/i1095-word-repetition/ascii-compile
------              ----------------------------------------------  --------------------------------------------
dotnet              1.15us (15.75x)                                 1.23us (19.84x)
dotnet/compiled     21.20us (290.41x)                               26.53us (427.90x)
dotnet/nobacktrack  537.48us (7362.74x)                             29.99us (483.71x)
go/regexp           8.21us (112.47x)                                4.59us (74.03x)
hyperscan           -                                               848.28us (13681.94x)
icu                 26.19us (358.77x)                               2.20us (35.48x)
java/hotspot        73.00ns (1.00x)                                 62.00ns (1.00x)
javascript/v8       78.00ns (1.07x)                                 90.00ns (1.45x)
pcre2               249.00ns (3.41x)                                347.00ns (5.60x)
pcre2/jit           2.57us (35.21x)                                 2.26us (36.45x)
perl                1.07us (14.66x)                                 1.03us (16.61x)
python/re           -                                               16.96us (273.55x)
python/regex        39.58us (542.19x)                               89.19us (1438.55x)
re2                 59.12ms (809863.01x)                            90.88us (1465.81x)
regress             262.00ns (3.59x)                                833.00ns (13.44x)
rust/regex          31.43ms (430547.95x)                            41.12us (663.23x)
rust/regex/lite     -                                               6.46us (104.19x)
rust/regexold       44.05ms (603424.66x)                            98.90us (1595.16x)

I'm not familiar enough with the dotnet/nobacktrack engine to say much about what it's doing, although my guess is that it's matching one UTF-16 code unit at a time, similarish to what Go does (which matches one Unicode scalar value at a time).

Consider a code base with a single regex like this and a hundred tests in debug mode. If each test runs in a different process, that's about 30 seconds of CPU time spent solely on building this one regex for every run of the test suite.

Yes, Unicode is hard. If you care about CPU time while running tests, it would probably be a good idea to compile the tests with optimizations enabled.
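
If the tests themselves need to stay unoptimized, a narrower option is a per-package profile override in Cargo.toml, so that only the regex crates are built with optimizations in dev/test builds. This is a sketch of the general Cargo mechanism (per-package profile overrides), not something specific to this issue:

# Cargo.toml: optimize only the regex crates in dev (and hence test) builds,
# so runtime regex compilation stays fast while your own code keeps fast
# debug compiles.
[profile.dev.package.regex]
opt-level = 3

[profile.dev.package.regex-automata]
opt-level = 3

[profile.dev.package.regex-syntax]
opt-level = 3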

Sadly, this is a wontfix bug. If there's really no other option for you and the compile times are not bearable, then you'll likely need to use a different regex engine or perhaps no regex at all.

BurntSushi closed this as not planned on Oct 3, 2023
@BurntSushi
Member

It's also worth adding, although perhaps this won't help you, that you can shrink \w substantially if you only need to match on a certain script. For example, [\w&&\p{Greek}] will only include the subset of word codepoints that are also in the \p{Greek} script.
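
A minimal sketch of what that restriction looks like in practice (the pattern and test strings below are just illustrative):

use regex::Regex;

fn main() {
    // Intersection: word characters that are also in the Greek script.
    let greek_word = Regex::new(r"^[\w&&\p{Greek}]+$").unwrap();
    assert!(greek_word.is_match("λόγος"));
    // Latin word characters are excluded by the intersection.
    assert!(!greek_word.is_match("word"));
}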

@vxgmichel
Author

vxgmichel commented Oct 3, 2023

I've added a benchmark for this to rebar

Thanks for the thorough analysis; I really appreciate that you took the time to document that.

If there's really no other option to you and the compile times are not bearable, then you'll likely need to use a different regex engine or perhaps no regex at all.

Yes, sadly my coworkers have started doing just that. For the record, our regex is ^[\w\-]{1,32}$, which typically compiles in 6 milliseconds in release mode. We also compile our project to wasm, where the regex takes about 16 ms to compile (again in release mode). There are 3 types defined with a similar regex, so that's about 50 milliseconds spent compiling regexes for each wasm test that needs to run (and about 350 milliseconds in dev mode).

The short answer here is that \w{256} is not really a simple regex.

Yes, I increased the length of the regex to make the issue more obvious, but note that it only takes 6 of those \w to exceed a millisecond. The benchmark from my first post with \w{6} gives the following output:

test tests::compile_unicode_regex ... bench:   1,201,266 ns/iter (+/- 21,136)

I don't know if it's worth its own benchmark, but I guess \w{6} feels like less of an edge case than \w{256}.

@BurntSushi
Member

No, I don't think it's worth its own benchmark. They are the same regex essentially, with one just being bigger than the other.

Do you really need Unicode here? The PR you linked says it is matching a legacy ID. Is it really true that the IDs can contain any combination of Unicode word characters?

It also seems like each test runs in its own process? That's probably another thing to change. Usually that model winds up not doing so well precisely because of startup costs not being amortized. Regexes are just one of those.

Either way, there really isn't much I can do on my end. Unicode makes stuff like this very difficult in a general purpose finite automata regex engine that prioritizes match performance.
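
For reference, here is a rough sketch (not from the thread) of how one might measure the Unicode cost directly, by timing the Unicode-aware and ASCII-only forms side by side. (?-u:\w) is the ASCII-only word class, which avoids building the large UTF-8 automaton; exact numbers will vary by machine:

use regex::Regex;
use std::time::Instant;

fn main() {
    for pattern in [r"\w{256}", r"(?-u:\w){256}"] {
        let start = Instant::now();
        let _re = Regex::new(pattern).unwrap();
        println!("{pattern}: compiled in {:?}", start.elapsed());
    }
}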

@vxgmichel
Author

Do you really need Unicode here? The PR you linked says it is matching a legacy ID. Is it really true that the IDs can contain any combination of Unicode word characters?

It also seems like each test runs in its own process?

Well, it's a regex already used in production to validate data coming from users, and given we make no assumptions about them, using \w seemed like a sensible choice. Those choices can be debated, but it feels weird to take the compilation time of the regex into account when making those decisions.

Either way, there really isn't much I can do on my end. Unicode makes stuff like this very difficult in a general purpose finite automata regex engine that prioritizes match performance.

I realized an alternative is to use a regex like \w+ or \w* and check the maximum size separately. It does not apply to all cases, but it does help for regexes like ours (e.g. turning ^[\w\-]{1,32}$ into ^[\w\-]+$).
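
Here is a minimal sketch of that workaround with hypothetical names (the 32-character limit mirrors the ^[\w\-]{1,32}$ example above); note that the separate check counts characters, not bytes:

use regex::Regex;

fn is_valid_id(re: &Regex, s: &str) -> bool {
    re.is_match(s) && s.chars().count() <= 32
}

fn main() {
    // Unbounded repetition instead of {1,32}; the length is checked separately.
    let re = Regex::new(r"^[\w\-]+$").unwrap();
    assert!(is_valid_id(&re, "some-id_été"));
    assert!(!is_valid_id(&re, &"x".repeat(33)));
}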

This particular library uses byte based finite automata, and most of the time spent during compilation of this regex is indeed in the process of converting a \w to a byte based UTF-8 automaton over and over again.

I think I just understood this part. What you're saying is that when compiling \w, the first state of the automaton has to consider all of these transitions?

  • byte in a-zA-Z0-9_ -> match
  • otherwise byte of the form 0b0xxxxxxx -> not match
  • otherwise byte in 0b10xxxxxx -> invalid utf-8
  • otherwise byte in 0b110xxxxx -> go to one of 32 intermediate states depending on xxxxx
  • otherwise byte in 0b1110xxxx -> go to one of 16 intermediate states depending on xxxx
    • each of the 16 states transitioning to another group of 64 states
  • otherwise byte in 0b11110xxx -> go to one of 5 intermediate states depending on xxx
    • each of the 5 states in the 0b11110xxx group transitioning to another group of 64 states
      • with each of those 64 states pointing to yet another group of 64 states

I guess many of those 21536 intermediate states can be trimmed, as some of them probably won't lead to any \w codepoint, but it's still a lot of states compared to a codepoint-based automaton that would need no intermediate states at all.

Did I get that right? I don't know much about regex engines but that would definitely explain the 1000x to 400000x time difference shown by your benchmark.

Anyway thank you for your thoughtful replies :)

@BurntSushi
Member

Well, it's a regex already used in production to validate data coming from users, and given we make no assumptions about them, using \w seemed like a sensible choice. Those choices can be debated, but it feels weird to take the compilation time of the regex into account when making those decisions.

Yes it is definitely sensible! I just mean that now that you've seen compile time is an issue, you might reevaluate whether Unicode is truly required.

But converting the bounded repeat to an unbounded one is a standard workaround. Sorry, I should have mentioned that initially. Nice find.

Did I get that right?

For the most part, yes. The NFA compiler uses Daciuk's algorithm to build an approximately minimal FSM. But that only applies to the forward direction. It also needs to build a reverse FSM, and using Daciuk's algorithm there is not enabled by default because it would likely lead to much worse compile times. Instead, a heuristic is used to exploit redundant structure.

Interestingly, in your case, the regex is fully anchored. And that perhaps means that a reverse NFA isn't needed at all. That would be a nice optimization. It won't magically fix compile times, but it should reduce them by approximately half.

@BurntSushi
Member

I created #1097 for tracking the anchored optimization I mentioned above.
