Simple regex like \w{256} take tens of milliseconds to compile #1095

Closed
vxgmichel opened this issue Oct 3, 2023 · 8 comments

@vxgmichel

What version of regex are you using?

Version 1.9.6

Describe the bug at a high level.

A simple regex like \w{256} takes tens of milliseconds to compile in release mode, and hundreds of milliseconds in debug mode.

Consider a code base with a single regex like this and a hundred tests in debug mode. If each test runs in a different process, that's about 30 seconds of CPU time spent solely on building this one regex for every run of the test suite.

What are the steps to reproduce the behavior?

#![feature(test)]

use regex::Regex;

extern crate test;

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn compile_unicode_regex(b: &mut Bencher) {
        b.iter(|| Regex::new(r"\w{256}"));
    }
}

What is the actual behavior?

About 40 milliseconds in release mode and 290 milliseconds in debug mode to build \w{256}:

$ cargo +nightly bench  
    Finished bench [optimized] target(s) in 0.31s
     Running unittests src/lib.rs (target/release/deps/regex_slow-319c2c8ca1c009f8)

running 1 test
test tests::compile_unicode_regex ... bench:  41,179,283 ns/iter (+/- 1,554,197)

$ cargo +nightly bench --profile dev
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running unittests src/lib.rs (target/debug/deps/regex_slow-623fc42fab1b6baf)

running 1 test
test tests::compile_unicode_regex ... bench: 287,807,317 ns/iter (+/- 6,803,723)

What is the expected behavior?

Something similar to Python, where it takes about 50 microseconds to build and match \w{256}:

import re
import time
start = time.perf_counter()
assert re.compile(r"\w{256}").match("é" * 256)
print(f"> {(time.perf_counter() - start) * 1_000_000:.3f} μs")
# Outputs:
# > 50.566 μs
BurntSushi added a commit to BurntSushi/rebar that referenced this issue Oct 3, 2023
@BurntSushi
Member

I've added a benchmark for this to rebar, including a brief analysis: https://github.com/BurntSushi/rebar/blob/9bac8479a550b229e93ea2a7ce71cb6067e05cb2/benchmarks/definitions/reported/i1095-word-repetition.toml

After building the engines (the hard part), collecting measurements for this benchmark is easy:

$ rebar measure -f '^reported/i1095-word-repetition/' | tee i1095-word-repetition.csv

The short answer here is that \w{256} is not really a simple regex. The Unicode-aware \w matches one of over 100,000 possible codepoints, and you're repeating that 256 times over. This particular library uses byte-based finite automata, and most of the time spent during compilation of this regex is indeed in the process of converting a \w to a byte-based UTF-8 automaton over and over again. RE2 suffers from a similar affliction:

$ rebar cmp i1095-word-repetition.csv -e '^(rust/regex.*|re2|go/regexp)$' -f 'unicode-compile'
benchmark                                       go/regexp       re2                 rust/regex          rust/regexold
---------                                       ---------       ---                 ----------          -------------
reported/i1095-word-repetition/unicode-compile  8.21us (1.00x)  59.12ms (7200.97x)  31.43ms (3828.26x)  44.05ms (5365.41x)

This crate is not only faster than RE2 (although they're still in the same ballpark), but it's faster than what it was about 1 year ago. See the analysis I linked to above for why Go's engine compiles this regex so quickly despite being based on finite automata.

As for the rest, most engines are backtracking based. A backtracker does not need to convert large sets of codepoints into UTF-8 automata, and so its compilation phase is usually quite quick. Indeed, there's often little to no difference between compiling the Unicode-aware \w and the ASCII-only \w. You can see that in the results below, which show Unicode-vs-ASCII compilation for many engines (most of which are backtracking; only dotnet/nobacktrack, Hyperscan, Go, RE2 and Rust's regex crate are not):

$ rebar cmp i1095-word-repetition.csv --row engine -f compile
engine              reported/i1095-word-repetition/unicode-compile  reported/i1095-word-repetition/ascii-compile
------              ----------------------------------------------  --------------------------------------------
dotnet              1.15us (15.75x)                                 1.23us (19.84x)
dotnet/compiled     21.20us (290.41x)                               26.53us (427.90x)
dotnet/nobacktrack  537.48us (7362.74x)                             29.99us (483.71x)
go/regexp           8.21us (112.47x)                                4.59us (74.03x)
hyperscan           -                                               848.28us (13681.94x)
icu                 26.19us (358.77x)                               2.20us (35.48x)
java/hotspot        73.00ns (1.00x)                                 62.00ns (1.00x)
javascript/v8       78.00ns (1.07x)                                 90.00ns (1.45x)
pcre2               249.00ns (3.41x)                                347.00ns (5.60x)
pcre2/jit           2.57us (35.21x)                                 2.26us (36.45x)
perl                1.07us (14.66x)                                 1.03us (16.61x)
python/re           -                                               16.96us (273.55x)
python/regex        39.58us (542.19x)                               89.19us (1438.55x)
re2                 59.12ms (809863.01x)                            90.88us (1465.81x)
regress             262.00ns (3.59x)                                833.00ns (13.44x)
rust/regex          31.43ms (430547.95x)                            41.12us (663.23x)
rust/regex/lite     -                                               6.46us (104.19x)
rust/regexold       44.05ms (603424.66x)                            98.90us (1595.16x)

I'm not familiar enough with the dotnet/nobacktrack engine to say much about what it's doing, although my guess is that it's matching one UTF-16 code unit at a time, similarish to what Go does (which matches one Unicode scalar value at a time).

Consider a code base with a single regex like this and a hundred tests in debug mode. If each test runs in a different process, that's about 30 seconds of CPU time spent solely on building this one regex for every run of the test suite.

Yes, Unicode is hard. If you care about CPU time while running tests, it would probably be a good idea to compile the tests with optimizations enabled.
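
If the tests themselves need to stay unoptimized, a narrower option is a per-package profile override in Cargo.toml, so that only the regex crates are built with optimizations in dev/test builds. This is a sketch of the general Cargo mechanism (per-package profile overrides), not something specific to this issue:

# Cargo.toml: optimize only the regex crates in dev (and hence test) builds,
# so runtime regex compilation stays fast while your own code keeps fast
# debug compiles.
[profile.dev.package.regex]
opt-level = 3

[profile.dev.package.regex-automata]
opt-level = 3

[profile.dev.package.regex-syntax]
opt-level = 3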

Sadly, this is a wontfix bug. If there's really no other option for you and the compile times are not bearable, then you'll likely need to use a different regex engine or perhaps no regex at all.

BurntSushi closed this as not planned on Oct 3, 2023
@BurntSushi
Member

It's also worth adding, although perhaps this won't help you, that you can shrink \w substantially if you only need to match on a certain script. For example, [\w&&\p{Greek}] will only include the subset of word codepoints that are also in the \p{Greek} script.
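
A minimal sketch of what that restriction looks like in practice (the pattern and test strings below are just illustrative):

use regex::Regex;

fn main() {
    // Intersection: word characters that are also in the Greek script.
    let greek_word = Regex::new(r"^[\w&&\p{Greek}]+$").unwrap();
    assert!(greek_word.is_match("λόγος"));
    // Latin word characters are excluded by the intersection.
    assert!(!greek_word.is_match("word"));
}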

@vxgmichel
Author

vxgmichel commented Oct 3, 2023

I've added a benchmark for this to rebar

Thanks for the thorough analysis; I really appreciate that you took the time to document that.

If there's really no other option to you and the compile times are not bearable, then you'll likely need to use a different regex engine or perhaps no regex at all.

Yes, sadly my coworkers have started doing just that. For the record, our regex is ^[\w\-]{1,32}$, which typically compiles in 6 milliseconds in release mode. We also compile our project to wasm, where the regex takes about 16 ms to compile (again in release mode). There are 3 types defined with a similar regex, so that's about 50 milliseconds spent compiling regexes for each wasm test that needs to run (and about 350 milliseconds in dev mode).

The short answer here is that \w{256} is not really a simple regex.

Yes, I increased the length of the regex to make the issue more obvious, but note that it only takes 6 of those \w to exceed a millisecond. The benchmark from my first post with \w{6} gives the following output:

test tests::compile_unicode_regex ... bench:   1,201,266 ns/iter (+/- 21,136)

I don't know if it's worth its own benchmark, but I guess \w{6} feels like less of an edge case than \w{256}.

@BurntSushi
Member

No, I don't think it's worth its own benchmark. They are the same regex essentially, with one just being bigger than the other.

Do you really need Unicode here? The PR you linked says it is matching a legacy ID. Is it really true that the IDs can contain any combination of Unicode word characters?

It also seems like each test runs in its own process? That's probably another thing to change. Usually that model winds up not doing so well precisely because of startup costs not being amortized. Regexes are just one of those.

Either way, there really isn't much I can do on my end. Unicode makes stuff like this very difficult in a general purpose finite automata regex engine that prioritizes match performance.
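
For reference, here is a rough sketch (not from the thread) of how one might measure the Unicode cost directly, by timing the Unicode-aware and ASCII-only forms side by side. (?-u:\w) is the ASCII-only word class, which avoids building the large UTF-8 automaton; exact numbers will vary by machine:

use regex::Regex;
use std::time::Instant;

fn main() {
    for pattern in [r"\w{256}", r"(?-u:\w){256}"] {
        let start = Instant::now();
        let _re = Regex::new(pattern).unwrap();
        println!("{pattern}: compiled in {:?}", start.elapsed());
    }
}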

@vxgmichel
Author

Do you really need Unicode here? The PR you linked says it is matching a legacy ID. Is it really true that the IDs can contain any combination of Unicode word characters?

It also seems like each test runs in its own process?

Well, it's a regex already used in production to validate data coming from users, and given we make no assumptions about them, using \w seemed like a sensible choice. Those choices can be debated, but it feels weird to take the compilation time of the regex into account when making those decisions.

Either way, there really isn't much I can do on my end. Unicode makes stuff like this very difficult in a general purpose finite automata regex engine that prioritizes match performance.

I realized an alternative is to use a regex like \w+ or \w* and check the maximum size separately. It does not apply to all cases, but it does help for regexes like ours (e.g. turning ^[\w\-]{1,32}$ into ^[\w\-]+$).
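
Here is a minimal sketch of that workaround with hypothetical names (the 32-character limit mirrors the ^[\w\-]{1,32}$ example above); note that the separate check counts characters, not bytes:

use regex::Regex;

fn is_valid_id(re: &Regex, s: &str) -> bool {
    re.is_match(s) && s.chars().count() <= 32
}

fn main() {
    // Unbounded repetition instead of {1,32}; the length is checked separately.
    let re = Regex::new(r"^[\w\-]+$").unwrap();
    assert!(is_valid_id(&re, "some-id_été"));
    assert!(!is_valid_id(&re, &"x".repeat(33)));
}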

This particular library uses byte based finite automata, and most of the time spent during compilation of this regex is indeed in the process of converting a \w to a byte based UTF-8 automaton over and over again.

I think I just understood this part. What you're saying is that when compiling \w, the first state of the automaton has to consider all of these transitions?

  • byte in a-zA-Z0-9_ -> match
  • otherwise byte of the form 0b0xxxxxxx -> not match
  • otherwise byte in 0b10xxxxxx -> invalid utf-8
  • otherwise byte in 0b110xxxxx -> go to one of 32 intermediate states depending on xxxxx
  • otherwise byte in 0b1110xxxx -> go to one of 16 intermediate states depending on xxxx
    • each of the 16 states transitioning to another group of 64 states
  • otherwise byte in 0b11110xxx -> go to one of 5 intermediate states depending on xxx
    • each of the 5 states in the 0b11110xxx group transitioning to another group of 64 states
      • with each of those 64 states pointing to yet another group of 64 states

I guess many of those 21536 intermediate states can be trimmed, as some of them probably won't lead to any \w codepoint, but it's still a lot of states compared to a codepoint-based automaton that would need no intermediate states at all.

Did I get that right? I don't know much about regex engines but that would definitely explain the 1000x to 400000x time difference shown by your benchmark.

Anyway thank you for your thoughtful replies :)

@BurntSushi
Member

Well, it's a regex already used in production to validate data coming from users, and given we make no assumptions about them, using \w seemed like a sensible choice. Those choices can be debated, but it feels weird to take the compilation time of the regex into account when making those decisions.

Yes it is definitely sensible! I just mean that now that you've seen compile time is an issue, you might reevaluate whether Unicode is truly required.

But converting the bounded repeat to an unbounded one is a standard workaround. Sorry, I should have mentioned that initially. Nice find.

Did I get that right?

For the most part, yes. The NFA compiler uses Daciuk's algorithm to build an approximately minimal FSM. But that only applies to the forward direction. It also needs to build a reverse FSM, and using Daciuk's algorithm there is not enabled by default because it would likely lead to much worse compile times. Instead, a heuristic is used to exploit redundant structure.

Interestingly, in your case, the regex is fully anchored. And that perhaps means that a reverse NFA isn't needed at all. That would be a nice optimization. It won't magically fix compile times, but it should reduce them by approximately half.

@BurntSushi
Member

I created #1097 for tracking the anchored optimization I mentioned above.
