Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix memory usage regression for RegexSet with capture groups #1062

Merged
merged 8 commits into from
Aug 5, 2023

Conversation

BurntSushi
Copy link
Member

Basically, if one used a RegexSet with capturing groups in it, then as of regex 1.9, it would unnecessarily allocate space for those capture groups in the PikeVM. A Thompson NFA (and specifically the PikeVM) does not scale well with respect to the both the size of the NFA and the number of capture groups, and so this resulted in exorbitant memory usage even for medium sized regex sets. This is a bit of weird case, since it requires using capture groups in a regex set even though it's impossible to extract any capture information from the RegexSet API. (The regex-automata crate does expose APIs to do this for multi-regexes, hence the new memory usage requirements.) So normally this wouldn't apply, but it turns out that globset produces regex sets with capture groups in them. However, even after fixing that, memory usage is still exorbitant because of the capture states inserted to track an overall match. Regex sets don't expose a way to get match offsets either, and so even those addition states are superfluous.

This PR introduces a path in regex-automata to get back to the original memory usage by making the PikeVM and bounded backtrackers work with NFAs that have no capture states at all. Previously, constructing such engines would result in an error. Now those constructions will succeed, but asking for the match offsets with result in "no match" being reported. (Since without the capture states in the NFA, the PikeVM can't track the match offsets.)

Fixes #1059

BurntSushi added a commit to BurntSushi/ripgrep that referenced this pull request Aug 4, 2023
We currently implement globs by converting them to regexes, and in doing
so, sometimes use grouping. In all but one case, we used non-capturing
groups. But for alternations, we used capturing groups, which was likely
just an oversight. We don't make use of capture groups at all, and while
they usually don't have any overhead, they lead to weird cases like this
one: rust-lang/regex#1059

That particular issue is also a bug in the regex crate itself, which is
fixed in rust-lang/regex#1062. Note though that
the bug fix in the regex crate is required. Even with this patch to
globset, memory usage is reduced (by about half in rust-lang/regex#1059)
but is not returned to where it was prior to the regex 1.9 release.
@BurntSushi
Copy link
Member Author

Note that this PR changes globset to use non-capture groups in all cases when translating a glob to a regex. While that ameliorates the issue reported in #1059, it does not fully fix it. Without this patch here, the PikeVM would still insert capture states for implicit capture groups, which are only truly needed to track match offsets. Since RegexSet cannot report match offsets, these are superfluous.

This is the first step in fixing a regression in memory usage. The
underlying problem is that regex-automata now natively supports
multi-pattern regexes *with* capturing support. Unfortunately though,
this overall doesn't work too well with the current design of the
PikeVM, because the amount of memory used is `len(captures) *
len(states)`. So basically, as the regex and number of captures
increases, the amount of memory used gets quite high.

This is new functionality that we hope to improve upon over time, so
it's not too big of a deal on its own. But it turns out this impacts
previous uses of RegexSet that have capture groups. The old
implementation just ignored these capture groups because they weren't
supported in a RegexSet, and thus there were no memory problems. But in
the new implementation, nothing tells it that it's okay to ignore the
capture groups. So it winds up allocating space for them even though the
RegexSet APIs don't provide any of that functionality.

So my plan to fix this is to introduce a new configuration knob for
controlling more granularly which capture states are compiled into the
NFA. Previously we only supported "all of them" or "none of them." This
commit adds a new (backwards compatible) knob that also permits "just
implicit groups." That is, one capture group per pattern. This hopefully
leads to less memory usage overall. (Well, it will certaintly be less,
but hopefully it's a big reduction.) We don't actually change anything
here. We just add a new `Config::which_captures` knob, implement the
existing `Config::captures` in terms of `Config::which_captures` and
deprecate `Config::captures`.

If this winds up not being sufficient, then we may need to adapt the
PikeVM to work without any capture groups at all and instead just report
which patterns match. Which is... probably fine?
The NFA compiler now implements the 'All', 'Implicit' and 'None'
options. We also add some targeted unit tests to confirm basic behavior.
This propagates the new Thompson NFA compiler option to the meta regex
config API.
While this reduces memory usage by half, unfortunately, it's still quite
a bit more than memory usage prior to regex 1.9. This is because we are
still allocating room to store two offsets per regex for a rather large
regex.
Previously, construction of these engines checked to make sure the NFA
given had some capture states in it. If the NFA didn't, construction
failed with an error.

To support the case where the NFA has no capture states at all (to avoid
gratuitous memory allocation), we remove this restriction and tweak the
engine implementations to stop assuming that the NFA has capture states.
This turned out to not be too hard, as we only assumed as much in a few
places.

The main reason why this restriction existed in the first place was
semantics. Namely, it's important that the PikeVM remain infallible. But
what happens when you ask for match offsets in a search with an NFA that
has no capture states? The PikeVM just doesn't support that. Previously
it would panic (and thus the reason construction would fail). But now
instead it will just report "no match." It's a little hokey, but we
justify it to ourselves because "simplicity" and "avoids footguns" are
non-goals of this crate.
And this finally resolves the memory usage problem, as the PikeVM cache
used by the RegexSet in #1059 no longer allocates MBs of memory because
of the existence of impossible-to-use capturing groups.

Fixes #1059
BurntSushi added a commit to BurntSushi/ripgrep that referenced this pull request Aug 5, 2023
We currently implement globs by converting them to regexes, and in doing
so, sometimes use grouping. In all but one case, we used non-capturing
groups. But for alternations, we used capturing groups, which was likely
just an oversight. We don't make use of capture groups at all, and while
they usually don't have any overhead, they lead to weird cases like this
one: rust-lang/regex#1059

That particular issue is also a bug in the regex crate itself, which is
fixed in rust-lang/regex#1062. Note though that
the bug fix in the regex crate is required. Even with this patch to
globset, memory usage is reduced (by about half in rust-lang/regex#1059)
but is not returned to where it was prior to the regex 1.9 release.
I originally prided myself on not having a dedicated `is_match` routine
on the meta regex engine's internal `Strategy` trait, and actually spent
a fair amount of attention ensuring that `is_match` and `find` always
returned the same results. That is, `is_match` returns true if and only
if `find` returns a match.

But the fix in the previous commits for #1059 means that a `PikeVM` and
a `BoundedBacktracker` can be used to run a search with an NFA that has
no capture states. Since both engines are implemented to only track
offsets via those capture states, it follows that the only thing that
can be returned in such cases is whether a match occurs (and if so,
which pattern matched). That in turn means that `is_match` can return
`true` while `find` can return `None` for the same search. This is
because the latter returns `None` even when a match is found but there
are no capture states to record the offsets of the match.

This in theory could be resolved by adding APIs to the `PikeVM` and the
`BoundedBacktracker` that return a `HalfMatch` without depending on any
capture states at all. Then `is_match` could be implemented in terms of
those APIs. That is probably the right path, but it's pretty gnarly to
do without breaking changes and I don't want to do any breaking changes
right now.

So instead, we just add a special path to the meta regex engine for
`is_match` and permit some cases to have different results between
`is_match` and `find`. Sigh.
Welp, okay, turns out we do need to know at least the end offset of a
match even when the NFA has no capture states. This is necessary for
correctly handling the case where a regex can match the empty string but
the caller has asked that matches not split a codepoint. If we don't
know the end offset of a match, then we can't correctly determine
whether a match exists or not and are forced to return no match even
when a match exists. We can get away with this I think for `find`-style
APIs where the caller has specifically requested match offsets while
simultaneously configuring the NFA to not track offsets, but with
`is_match`-style APIs, we really should be able to handle it correctly.

We should eventually just expose the `HalfMatch` APIs on `PikeVM` and
`BoundedBacktracker`, but for now we keep them private.
@BurntSushi BurntSushi merged commit d93ddbe into master Aug 5, 2023
15 checks passed
@BurntSushi BurntSushi deleted the ag/fix-1059 branch August 5, 2023 18:32
facebook-github-bot pushed a commit to facebook/errpy that referenced this pull request Aug 5, 2023
Summary:
BurntSushi has fixed a memory usage regression introduced by regex 1.9 which caused Buck to allocate and retain significantly more memory when using moderately sized `buck2_common::ignores::ignore_set::IgnoreSet` objects concurrently from many threads.

- Bug report: **[rust-lang/regex#1059](rust-lang/regex#1059) *"1.9 memory usage: globset-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8"***

- Globset fix: **[BurntSushi/ripgrep#25770](https://github.com/BurntSushi/ripgrep/pull/25770) *"globset: use non-capture groups in regex transform"***

- Regex fix: **[rust-lang/regex#1062](rust-lang/regex#1062) *"fix memory usage regression for RegexSet with capture groups"***

Reviewed By: zertosh

Differential Revision: D48095372

fbshipit-source-id: ec11c2bcaccbd26354d6d0a0398000134eaf3681
facebook-github-bot pushed a commit to facebook/sapling that referenced this pull request Aug 5, 2023
Summary:
BurntSushi has fixed a memory usage regression introduced by regex 1.9 which caused Buck to allocate and retain significantly more memory when using moderately sized `buck2_common::ignores::ignore_set::IgnoreSet` objects concurrently from many threads.

- Bug report: **[rust-lang/regex#1059](rust-lang/regex#1059) *"1.9 memory usage: globset-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8"***

- Globset fix: **[BurntSushi/ripgrep#25770](https://github.com/BurntSushi/ripgrep/pull/25770) *"globset: use non-capture groups in regex transform"***

- Regex fix: **[rust-lang/regex#1062](rust-lang/regex#1062) *"fix memory usage regression for RegexSet with capture groups"***

Reviewed By: zertosh

Differential Revision: D48095372

fbshipit-source-id: ec11c2bcaccbd26354d6d0a0398000134eaf3681
facebook-github-bot pushed a commit to facebook/hhvm that referenced this pull request Aug 5, 2023
Summary:
BurntSushi has fixed a memory usage regression introduced by regex 1.9 which caused Buck to allocate and retain significantly more memory when using moderately sized `buck2_common::ignores::ignore_set::IgnoreSet` objects concurrently from many threads.

- Bug report: **[rust-lang/regex#1059](rust-lang/regex#1059) *"1.9 memory usage: globset-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8"***

- Globset fix: **[BurntSushi/ripgrep#25770](https://github.com/BurntSushi/ripgrep/pull/25770) *"globset: use non-capture groups in regex transform"***

- Regex fix: **[rust-lang/regex#1062](rust-lang/regex#1062) *"fix memory usage regression for RegexSet with capture groups"***

Reviewed By: zertosh

Differential Revision: D48095372

fbshipit-source-id: ec11c2bcaccbd26354d6d0a0398000134eaf3681
facebook-github-bot pushed a commit to facebookincubator/below that referenced this pull request Aug 5, 2023
Summary:
BurntSushi has fixed a memory usage regression introduced by regex 1.9 which caused Buck to allocate and retain significantly more memory when using moderately sized `buck2_common::ignores::ignore_set::IgnoreSet` objects concurrently from many threads.

- Bug report: **[rust-lang/regex#1059](rust-lang/regex#1059) *"1.9 memory usage: globset-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8"***

- Globset fix: **[BurntSushi/ripgrep#25770](https://github.com/BurntSushi/ripgrep/pull/25770) *"globset: use non-capture groups in regex transform"***

- Regex fix: **[rust-lang/regex#1062](rust-lang/regex#1062) *"fix memory usage regression for RegexSet with capture groups"***

Reviewed By: zertosh

Differential Revision: D48095372

fbshipit-source-id: ec11c2bcaccbd26354d6d0a0398000134eaf3681
facebook-github-bot pushed a commit to facebookexperimental/rust-shed that referenced this pull request Aug 5, 2023
Summary:
BurntSushi has fixed a memory usage regression introduced by regex 1.9 which caused Buck to allocate and retain significantly more memory when using moderately sized `buck2_common::ignores::ignore_set::IgnoreSet` objects concurrently from many threads.

- Bug report: **[rust-lang/regex#1059](rust-lang/regex#1059) *"1.9 memory usage: globset-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8"***

- Globset fix: **[BurntSushi/ripgrep#25770](https://github.com/BurntSushi/ripgrep/pull/25770) *"globset: use non-capture groups in regex transform"***

- Regex fix: **[rust-lang/regex#1062](rust-lang/regex#1062) *"fix memory usage regression for RegexSet with capture groups"***

Reviewed By: zertosh

Differential Revision: D48095372

fbshipit-source-id: ec11c2bcaccbd26354d6d0a0398000134eaf3681
facebook-github-bot pushed a commit to facebook/ocamlrep that referenced this pull request Aug 5, 2023
Summary:
BurntSushi has fixed a memory usage regression introduced by regex 1.9 which caused Buck to allocate and retain significantly more memory when using moderately sized `buck2_common::ignores::ignore_set::IgnoreSet` objects concurrently from many threads.

- Bug report: **[rust-lang/regex#1059](rust-lang/regex#1059) *"1.9 memory usage: globset-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8"***

- Globset fix: **[BurntSushi/ripgrep#25770](https://github.com/BurntSushi/ripgrep/pull/25770) *"globset: use non-capture groups in regex transform"***

- Regex fix: **[rust-lang/regex#1062](rust-lang/regex#1062) *"fix memory usage regression for RegexSet with capture groups"***

Reviewed By: zertosh

Differential Revision: D48095372

fbshipit-source-id: ec11c2bcaccbd26354d6d0a0398000134eaf3681
facebook-github-bot pushed a commit to facebookincubator/reindeer that referenced this pull request Aug 5, 2023
Summary:
BurntSushi has fixed a memory usage regression introduced by regex 1.9 which caused Buck to allocate and retain significantly more memory when using moderately sized `buck2_common::ignores::ignore_set::IgnoreSet` objects concurrently from many threads.

- Bug report: **[rust-lang/regex#1059](rust-lang/regex#1059) *"1.9 memory usage: globset-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8"***

- Globset fix: **[BurntSushi/ripgrep#25770](https://github.com/BurntSushi/ripgrep/pull/25770) *"globset: use non-capture groups in regex transform"***

- Regex fix: **[rust-lang/regex#1062](rust-lang/regex#1062) *"fix memory usage regression for RegexSet with capture groups"***

Reviewed By: zertosh

Differential Revision: D48095372

fbshipit-source-id: ec11c2bcaccbd26354d6d0a0398000134eaf3681
facebook-github-bot pushed a commit to facebookexperimental/hermit that referenced this pull request Aug 5, 2023
Summary:
BurntSushi has fixed a memory usage regression introduced by regex 1.9 which caused Buck to allocate and retain significantly more memory when using moderately sized `buck2_common::ignores::ignore_set::IgnoreSet` objects concurrently from many threads.

- Bug report: **[rust-lang/regex#1059](rust-lang/regex#1059) *"1.9 memory usage: globset-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8"***

- Globset fix: **[BurntSushi/ripgrep#25770](https://github.com/BurntSushi/ripgrep/pull/25770) *"globset: use non-capture groups in regex transform"***

- Regex fix: **[rust-lang/regex#1062](rust-lang/regex#1062) *"fix memory usage regression for RegexSet with capture groups"***

Reviewed By: zertosh

Differential Revision: D48095372

fbshipit-source-id: ec11c2bcaccbd26354d6d0a0398000134eaf3681
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1.9 memory usage: globset-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8
1 participant