Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegexSet - enhancement #732

Closed
ohaddahan opened this issue Dec 25, 2020 · 7 comments
Closed

RegexSet - enhancement #732

ohaddahan opened this issue Dec 25, 2020 · 7 comments

Comments

@ohaddahan
Copy link

ohaddahan commented Dec 25, 2020

I tried using RegexSet and I noticed that for my specific usage the performance was significantly slower than merging all my regex into one as in regex1|regex2|...|regexN.

I do know that when you combine it into one regex you lose the ability to get all matches without iterating it.

From my experience, in many applications that rely heavily on MANY regexp, for example a log processing, the match rate is usually very low (most messages are uninteresting info vs. error, as an example) thus this trade off is worth it.

Although it's not really a trade off, since when you find a matching line, you can run the existing RegexSet on it and get all the required info.

There is another caveat which can be resolved easily, named groups cannot be duplicated.
I workaround this before by taking any user input regex and doing two changes to it:

  1. Put the entire regex in a (?P<GROUP_NUM>regexp).
  2. Rename each group inside the user regexp, for example if regexp number 5 is (?P<Name>.*), it will be rewritten as (?P<GROUP_5>(?P<GROUP_5_NAME>.*))`.

I suggest this as an extra mode to enhance the existing RegexSet, I'll be happy to work on implementing it if the maintainers think it's worthwhile.

@ohaddahan ohaddahan changed the title RegexSet RegexSet - enhancement Dec 25, 2020
@BurntSushi
Copy link
Member

Sorry, but I don't understand what you're proposing. Can you make it more concrete in terms of what APIs you would add?

@ohaddahan
Copy link
Author

API wise it basically one optional argument in the constructor. For example:

RegexSet::new(&[
    r"[a-z]+@[a-z]+\.(com|org|net)",
    r"[a-z]+\.(com|org|net)",
    OR_MATCH_MODE
]).unwrap();

The main point is changing the matching algorithm to use the one merged regex instead of the existing flow.
(obviously only under this mode)

@ohaddahan
Copy link
Author

A very basic draft of this, you can run with run_regexp or run_regexp_array, I tested on a large file, seeing significant speed differences (depending on how many on the type of regex).

use regex::Regex;
use regex::RegexSet;
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;
use std::time::Instant;

fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where
    P: AsRef<Path>,
{
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}

fn read_eme(filename: &str) -> Vec<String> {
    let mut regexp_array = Vec::new();
    if let Ok(lines) = read_lines(filename) {
        for line in lines {
            if let Ok(ip) = line {
                let _re = Regex::new(&ip).unwrap();
                regexp_array.push(ip.clone());
            }
        }
    }
    regexp_array
}

fn read_eme2(filename: &str) -> String {
    let mut count = 0;
    let mut regexp = String::new();
    if let Ok(lines) = read_lines(filename) {
        for line in lines {
            if let Ok(ip) = line {
                if count > 0 {
                    regexp.push('|');
                }
                regexp.push_str(&format!("(?P<GROUP_{}>{})", count, &ip));
                count += 1;
            }
        }
    }
    regexp
}


fn run_regexp() {
    let regexp_str = read_eme2("./patterns.eme");
    println!("regexp_str: '{:?}'", regexp_str);
    let regex = Regex::new(&regexp_str).unwrap();


    let mut count = 0;
    let mut t0 = Instant::now();

    if let Ok(lines) = read_lines("./test_100.log") {
        for line in lines {
            if count % 10000 == 0 {
                let t1 = Instant::now();
                println!("Line {:?} , {:?}, {:?}", &line, count, t1 - t0);
                t0 = t1;
            }
            if let Ok(ip) = line {
                for cap in regex.captures_iter(&ip) {
                    println!("cap: '{:?}', line: '{:?}'", cap, ip);
                }
            }
            count += 1;
        }
    }
}


fn run_regexp_array() {
    let regexp_array = read_eme("./patterns.eme");
    let regexp_set = RegexSet::new(&regexp_array).unwrap();
    let mut count = 0;
    let mut t0 = Instant::now();

    if let Ok(lines) = read_lines("./test.log") {
        for line in lines {
            if count % 10000 == 0 {
                let t1 = Instant::now();
                println!("Line {:?} , {:?}, {:?}", &line, count, t1 - t0);
                t0 = t1;
            }
            if let Ok(ip) = line {
                let matches: Vec<_> = regexp_set.matches(&ip).into_iter().collect();
                for m in matches {
                    println!("m: '{:?}', pattern: '{:?}', line: '{:?}'", m, regexp_array[m], ip);
                }
            }
            count += 1;
        }
    }
}

fn main() {
    run_regexp();
    //run_regexp_array();
}

@BurntSushi
Copy link
Member

I don't think API changes are the right path here. If there's a performance difference then we should fix that. I believe there are tickets for that.

@ohaddahan
Copy link
Author

I agree about the API changes.
I'll work on a PR with something more solid for you to provide feedback on (due note I'm not a Rust expert so I expect much criticism on my code).

Also found some related issue #259

@BurntSushi
Copy link
Member

OK, but it would probably be better to discuss what specifically you'll be doing in the PR if possible. Even if it's just a sketch. I may not merge it based purely on its design, for example.

@BurntSushi
Copy link
Member

I think having to preprocess the capture group names on the caller side is probably the right path here. I don't think complicating the regex crate API is the right path here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants