Increased memory usage when updating to regex 1.10 #1116

Open
Marwes opened this issue Oct 27, 2023 · 0 comments

Marwes commented Oct 27, 2023

What version of regex are you using?

1.10, and I used 1.7 before. The issue seems to be mainly due to the rewrite in 1.9.

Describe the bug at a high level.

After updating to regex 1.10 I am seeing greatly increased memory usage (measured with the dhat crate; see the example below). Part of the issue seems to be due to the use of capture groups in the regex. These captures only serve to group the regex, so they could (and should) be non-capturing groups, and I have fixed this on my end. However, since captures do not seem to matter on 1.7, I guess there may be a missed optimization here (#1059 comes to mind).

(The regex in the example has been altered but it remains the same in spirit and exhibits the same memory increase)

What are the steps to reproduce the behavior?

The following code can be used to reproduce the behavior by using dhat to track memory and changing the regex version.

// Cargo.toml
// regex = "=1.10"
// dhat = "0.3"

#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    without_captures();
    with_captures();
}

fn without_captures() {
    let _profiler = dhat::Profiler::builder().testing().build();

    let regex = r#"(?ux-mUis)
        (?:craigslist\.org$)|
        (?:utexas\.edu$)|
        (?:blogs\.com$)|
        (?:is\.gd$)|
        (?:vkontakte\.ru$)|
        (?:google\.com\.hk$)|
        (?:vimeo\.com$)|
        (?:simplemachines\.org$)|
        (?:plala\.or\.jp$)|
        (?:npr\.org$)|
        (?:census\.gov$)|
        (?:360\.cn$)|
        (?:wisc\.edu$)|
        (?:princeton\.edu$)|
        (?:addthis\.com$)|
        (?:google\.de$)|
        (?:ox\.ac\.uk$)|
        (?:free13runpool\.com$)|
        (?:berkeley\.edu$)|
        (?:fda\.gov$)|
        (?:soundcloud\.com$)|
        (?:ftc\.gov$)|
        (?:cloudflare\.com$)|
        (?:com\.com$)|
        (?:statcounter\.com$)|
        (?:tumblr\.com$)|
        (?:alexa\.com$)|
        (?:canalblog\.com$)|
        (?:uiuc\.edu$)|
        (?:msu\.edu$)|
        (?:bravesites\.com$)|
        (?:usatoday\.com$)|
        (?:edublogs\.org$)|
        (?:forbes\.com$)|
        (?:patch\.com$)|
        (?:1688\.com$)|
        (?:ihg\.com$)|
        (?:ow\.ly$)|
        (?:usda\.gov$)|
        (?:yellowbook\.com$)|
        (?:wired\.com$)|
        (?:homestead\.com$)|
        (?:state\.tx\.us$)|
        (?:webnode\.com$)|
        (?:123-reg\.co\.uk$)|
        (?:irs\.gov$)|
        (?:yale\.edu$)|
        (?:naver\.com$)|
        (?:elpais\.com$)|
        (?:example\.com$)
    "#;

    let regex = regex::Regex::new(regex).unwrap();

    let m = regex.is_match("webnode.com");
    eprintln!("Match `{m}`, with captures: {:#?}", dhat::HeapStats::get());
}

fn with_captures() {
    let _profiler = dhat::Profiler::builder().testing().build();

    let regex = r#"(?ux-mUis)
        (craigslist\.org$)|
        (utexas\.edu$)|
        (blogs\.com$)|
        (is\.gd$)|
        (vkontakte\.ru$)|
        (google\.com\.hk$)|
        (vimeo\.com$)|
        (simplemachines\.org$)|
        (plala\.or\.jp$)|
        (npr\.org$)|
        (census\.gov$)|
        (360\.cn$)|
        (wisc\.edu$)|
        (princeton\.edu$)|
        (addthis\.com$)|
        (google\.de$)|
        (ox\.ac\.uk$)|
        (free13runpool\.com$)|
        (berkeley\.edu$)|
        (fda\.gov$)|
        (soundcloud\.com$)|
        (ftc\.gov$)|
        (cloudflare\.com$)|
        (com\.com$)|
        (statcounter\.com$)|
        (tumblr\.com$)|
        (alexa\.com$)|
        (canalblog\.com$)|
        (uiuc\.edu$)|
        (msu\.edu$)|
        (bravesites\.com$)|
        (usatoday\.com$)|
        (edublogs\.org$)|
        (forbes\.com$)|
        (patch\.com$)|
        (1688\.com$)|
        (ihg\.com$)|
        (ow\.ly$)|
        (usda\.gov$)|
        (yellowbook\.com$)|
        (wired\.com$)|
        (homestead\.com$)|
        (state\.tx\.us$)|
        (webnode\.com$)|
        (123-reg\.co\.uk$)|
        (irs\.gov$)|
        (yale\.edu$)|
        (naver\.com$)|
        (elpais\.com$)|
        (example\.com$)
    "#;

    let regex = regex::Regex::new(regex).unwrap();

    let m = regex.is_match("webnode.com");
    eprintln!("Match `{m}`, with captures: {:#?}", dhat::HeapStats::get());
}

Memory stats from running the example

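For each version, the first block is from `without_captures` and the second from `with_captures` (both print the same "with captures" label). Most of the stats are the same, but with capturing groups on 1.10 we can see a roughly 5x increase in peak memory (`max_bytes`: 228249 without captures vs. 1242568 with captures).

See https://docs.rs/dhat/latest/dhat/struct.HeapStats.html for the meaning of each field.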

1.7.3
Match `true`, with captures: HeapStats {
    total_blocks: 4137,
    total_bytes: 1189678,
    curr_blocks: 48,
    curr_bytes: 114285,
    max_blocks: 212,
    max_bytes: 247538,
}
Match `true`, with captures: HeapStats {
    total_blocks: 4152,
    total_bytes: 1201606,
    curr_blocks: 48,
    curr_bytes: 121921,
    max_blocks: 212,
    max_bytes: 247338,
}

1.10.2

Match `true`, with captures: HeapStats {
    total_blocks: 3486,
    total_bytes: 763125,
    curr_blocks: 221,
    curr_bytes: 160832,
    max_blocks: 1215,
    max_bytes: 228249,
}
Match `true`, with captures: HeapStats {
    total_blocks: 3694,
    total_bytes: 1871135,
    curr_blocks: 221,
    curr_bytes: 1242544,
    max_blocks: 216,
    max_bytes: 1242568,
}
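
As a quick check of the "5x" figure, the ratios can be computed from the numbers above. This is only a sketch: the `Run` struct is a hypothetical stand-in that holds two of the dhat `HeapStats` fields, with values copied by hand from the output above.

```rust
/// Hypothetical holder for two of the reported dhat `HeapStats` fields.
struct Run {
    curr_bytes: u64,
    max_bytes: u64,
}

/// Ratio of `a` to `b` as a floating-point number.
fn ratio(a: u64, b: u64) -> f64 {
    a as f64 / b as f64
}

fn main() {
    // Values copied from the 1.10.2 output above.
    let without = Run { curr_bytes: 160_832, max_bytes: 228_249 };
    let with = Run { curr_bytes: 1_242_544, max_bytes: 1_242_568 };

    println!("max_bytes ratio:  {:.2}", ratio(with.max_bytes, without.max_bytes));
    println!("curr_bytes ratio: {:.2}", ratio(with.curr_bytes, without.curr_bytes));
}
```

On 1.10.2 this works out to about 5.4x for `max_bytes` and about 7.7x for `curr_bytes`, while the same ratios for the 1.7.3 numbers stay close to 1.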
