Increased memory usage when updating to regex 1.10 #1116

Marwes · 2023-10-27T14:13:36Z

What version of regex are you using?

1.10; I was on 1.7 before. The issue seems to be mainly due to the rewrite in 1.9.

Describe the bug at a high level.

After updating to regex 1.10 I am seeing greatly increased memory usage (captured using the dhat crate; see the example below). Part of the issue seems to be due to the use of capture groups in the regex. These captures only serve to group the regex, so they could (and should) be non-capturing groups, and I have fixed this on my end. However, since captures do not seem to matter on 1.7, I guess there may be a missed optimization here? (#1059 comes to mind.)
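For anyone wanting to apply the same workaround mechanically, here is a minimal sketch of rewriting plain capturing groups as non-capturing ones. `to_non_capturing` is a hypothetical helper written for illustration, not part of the regex crate. It assumes simple patterns: it skips escaped characters and flat `[...]` classes, but does not handle nested classes, a literal `]` at the start of a class, or parentheses inside `(?x)` comments.

```rust
/// Rewrites plain capturing groups `(...)` as non-capturing groups `(?:...)`.
/// Hypothetical helper for illustration; not part of the regex crate.
fn to_non_capturing(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len() + 16);
    let mut chars = pattern.chars().peekable();
    let mut in_class = false; // true while inside a flat `[...]` class

    while let Some(c) = chars.next() {
        match c {
            // Copy an escape and the character it escapes verbatim,
            // so `\(` is never treated as a group opener.
            '\\' => {
                out.push(c);
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            '[' if !in_class => {
                in_class = true;
                out.push(c);
            }
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            // `(?` already starts a flag group, a non-capturing group,
            // or a named group; only a bare `(` is rewritten.
            '(' if !in_class && chars.peek() != Some(&'?') => out.push_str("(?:"),
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    let before = r"(?x)(wired\.com$)|(ow\.ly$)";
    println!("{}", to_non_capturing(before));
}
```

This only helps when nothing reads the capture groups afterwards; if any caller uses `captures()` on the pattern, the groups have to stay.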

(The regex in the example has been altered, but it remains the same in spirit and exhibits the same memory increase.)

What are the steps to reproduce the behavior?

The following code can be used to reproduce the behavior by using dhat to track memory and changing the regex version.

// Cargo.toml
// regex = "=1.10"
// dhat = "0.3"

#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    without_captures();
    with_captures();
}

fn without_captures() {
    let _profiler = dhat::Profiler::builder().testing().build();

    let regex = r#"(?ux-mUis)
        (?:craigslist\.org$)|
        (?:utexas\.edu$)|
        (?:blogs\.com$)|
        (?:is\.gd$)|
        (?:vkontakte\.ru$)|
        (?:google\.com\.hk$)|
        (?:vimeo\.com$)|
        (?:simplemachines\.org$)|
        (?:plala\.or\.jp$)|
        (?:npr\.org$)|
        (?:census\.gov$)|
        (?:360\.cn$)|
        (?:wisc\.edu$)|
        (?:princeton\.edu$)|
        (?:addthis\.com$)|
        (?:google\.de$)|
        (?:ox\.ac\.uk$)|
        (?:free13runpool\.com$)|
        (?:berkeley\.edu$)|
        (?:fda\.gov$)|
        (?:soundcloud\.com$)|
        (?:ftc\.gov$)|
        (?:cloudflare\.com$)|
        (?:com\.com$)|
        (?:statcounter\.com$)|
        (?:tumblr\.com$)|
        (?:alexa\.com$)|
        (?:canalblog\.com$)|
        (?:uiuc\.edu$)|
        (?:msu\.edu$)|
        (?:bravesites\.com$)|
        (?:usatoday\.com$)|
        (?:edublogs\.org$)|
        (?:forbes\.com$)|
        (?:patch\.com$)|
        (?:1688\.com$)|
        (?:ihg\.com$)|
        (?:ow\.ly$)|
        (?:usda\.gov$)|
        (?:yellowbook\.com$)|
        (?:wired\.com$)|
        (?:homestead\.com$)|
        (?:state\.tx\.us$)|
        (?:webnode\.com$)|
        (?:123-reg\.co\.uk$)|
        (?:irs\.gov$)|
        (?:yale\.edu$)|
        (?:naver\.com$)|
        (?:elpais\.com$)|
        (?:example\.com$)
    "#;

    let regex = regex::Regex::new(regex).unwrap();

    let m = regex.is_match("webnode.com");
    eprintln!("Match `{m}`, with captures: {:#?}", dhat::HeapStats::get());
}

fn with_captures() {
    let _profiler = dhat::Profiler::builder().testing().build();

    let regex = r#"(?ux-mUis)
        (craigslist\.org$)|
        (utexas\.edu$)|
        (blogs\.com$)|
        (is\.gd$)|
        (vkontakte\.ru$)|
        (google\.com\.hk$)|
        (vimeo\.com$)|
        (simplemachines\.org$)|
        (plala\.or\.jp$)|
        (npr\.org$)|
        (census\.gov$)|
        (360\.cn$)|
        (wisc\.edu$)|
        (princeton\.edu$)|
        (addthis\.com$)|
        (google\.de$)|
        (ox\.ac\.uk$)|
        (free13runpool\.com$)|
        (berkeley\.edu$)|
        (fda\.gov$)|
        (soundcloud\.com$)|
        (ftc\.gov$)|
        (cloudflare\.com$)|
        (com\.com$)|
        (statcounter\.com$)|
        (tumblr\.com$)|
        (alexa\.com$)|
        (canalblog\.com$)|
        (uiuc\.edu$)|
        (msu\.edu$)|
        (bravesites\.com$)|
        (usatoday\.com$)|
        (edublogs\.org$)|
        (forbes\.com$)|
        (patch\.com$)|
        (1688\.com$)|
        (ihg\.com$)|
        (ow\.ly$)|
        (usda\.gov$)|
        (yellowbook\.com$)|
        (wired\.com$)|
        (homestead\.com$)|
        (state\.tx\.us$)|
        (webnode\.com$)|
        (123-reg\.co\.uk$)|
        (irs\.gov$)|
        (yale\.edu$)|
        (naver\.com$)|
        (elpais\.com$)|
        (example\.com$)
    "#;

    let regex = regex::Regex::new(regex).unwrap();

    let m = regex.is_match("webnode.com");
    eprintln!("Match `{m}`, with captures: {:#?}", dhat::HeapStats::get());
}
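The two patterns above differ only in how each alternative is opened, so a shorter reproduction could generate both from one domain list. The following is a sketch under that assumption; `build_pattern` is a hypothetical helper, and it escapes only `.`, which is enough for the domains used here but not for arbitrary input.

```rust
/// Builds an alternation of `domain$` branches, each wrapped in either a
/// capturing `(...)` or a non-capturing `(?:...)` group.
/// Hypothetical helper for illustration only.
fn build_pattern(domains: &[&str], capturing: bool) -> String {
    let open = if capturing { "(" } else { "(?:" };
    let alternatives: Vec<String> = domains
        .iter()
        // Escape the dots so they match literally, then anchor at the end.
        .map(|d| format!("{open}{}$)", d.replace('.', r"\.")))
        .collect();
    // Same flag group as the patterns in the reproduction.
    format!("(?ux-mUis){}", alternatives.join("|"))
}

fn main() {
    let domains = ["craigslist.org", "webnode.com", "example.com"];
    println!("{}", build_pattern(&domains, false));
    println!("{}", build_pattern(&domains, true));
}
```

Either string can then be passed to `regex::Regex::new` exactly as in the functions above.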

Memory stats from running the example

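Most of the stats are similar, but with capturing groups on 1.10.2, max_bytes grows roughly 5x (228,249 to 1,242,568) and curr_bytes almost 8x (160,832 to 1,242,544), while on 1.7.3 the two runs are nearly identical. In each pair below, the first block is from without_captures() and the second from with_captures(); both print the label "with captures" because the two eprintln! format strings are identical.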

https://docs.rs/dhat/latest/dhat/struct.HeapStats.html

1.7.3

Match `true`, with captures: HeapStats {
    total_blocks: 4137,
    total_bytes: 1189678,
    curr_blocks: 48,
    curr_bytes: 114285,
    max_blocks: 212,
    max_bytes: 247538,
}
Match `true`, with captures: HeapStats {
    total_blocks: 4152,
    total_bytes: 1201606,
    curr_blocks: 48,
    curr_bytes: 121921,
    max_blocks: 212,
    max_bytes: 247338,
}

1.10.2


Match `true`, with captures: HeapStats {
    total_blocks: 3486,
    total_bytes: 763125,
    curr_blocks: 221,
    curr_bytes: 160832,
    max_blocks: 1215,
    max_bytes: 228249,
}
Match `true`, with captures: HeapStats {
    total_blocks: 3694,
    total_bytes: 1871135,
    curr_blocks: 221,
    curr_bytes: 1242544,
    max_blocks: 216,
    max_bytes: 1242568,
}
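To put numbers on the difference, the ratios can be computed directly from the stats above. This is just arithmetic on the reported figures; the `ratio` helper and the constant layout are illustrative.

```rust
/// Ratio of the with-captures figure to the without-captures figure.
fn ratio(with_captures: u64, without_captures: u64) -> f64 {
    with_captures as f64 / without_captures as f64
}

fn main() {
    // (version, metric, without captures, with captures), copied from the
    // HeapStats output reported above.
    let rows: [(&str, &str, u64, u64); 4] = [
        ("1.7.3", "curr_bytes", 114_285, 121_921),
        ("1.7.3", "max_bytes", 247_538, 247_338),
        ("1.10.2", "curr_bytes", 160_832, 1_242_544),
        ("1.10.2", "max_bytes", 228_249, 1_242_568),
    ];

    for (version, metric, without, with) in rows {
        println!("{version:>7} {metric:<10} {:.2}x", ratio(with, without));
    }
}
```

On 1.7.3 both ratios stay close to 1, while on 1.10.2 they come out at about 7.7x for curr_bytes and 5.4x for max_bytes.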


BurntSushi mentioned this issue Mar 8, 2024

rg allocates too much memory with: rg --files --ignore-file ~/.ultimate-gitignore BurntSushi/ripgrep#2750

