RegexSet - enhancement #732

ohaddahan · 2020-12-25T22:35:50Z

I tried using RegexSet and I noticed that for my specific usage the performance was significantly slower than merging all my regex into one as in regex1|regex2|...|regexN.

I do know that when you combine it into one regex you lose the ability to get all matches without iterating it.

From my experience, in many applications that rely heavily on MANY regexp, for example a log processing, the match rate is usually very low (most messages are uninteresting info vs. error, as an example) thus this trade off is worth it.

Although it's not really a trade off, since when you find a matching line, you can run the existing RegexSet on it and get all the required info.

There is another caveat which can be resolved easily, named groups cannot be duplicated.
I workaround this before by taking any user input regex and doing two changes to it:

Put the entire regex in a (?P<GROUP_NUM>regexp).
Rename each group inside the user regexp, for example if regexp number 5 is (?P<Name>.*), it will be rewritten as (?P<GROUP_5>(?P<GROUP_5_NAME>.*))`.

I suggest this as an extra mode to enhance the existing RegexSet, I'll be happy to work on implementing it if the maintainers think it's worthwhile.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2020-12-25T22:42:07Z

Sorry, but I don't understand what you're proposing. Can you make it more concrete in terms of what APIs you would add?

ohaddahan · 2020-12-25T22:46:02Z

API wise it basically one optional argument in the constructor. For example:

RegexSet::new(&[
    r"[a-z]+@[a-z]+\.(com|org|net)",
    r"[a-z]+\.(com|org|net)",
    OR_MATCH_MODE
]).unwrap();

The main point is changing the matching algorithm to use the one merged regex instead of the existing flow.
(obviously only under this mode)

ohaddahan · 2020-12-25T22:51:40Z

A very basic draft of this, you can run with run_regexp or run_regexp_array, I tested on a large file, seeing significant speed differences (depending on how many on the type of regex).

use regex::Regex;
use regex::RegexSet;
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;
use std::time::Instant;

fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where
    P: AsRef<Path>,
{
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}

fn read_eme(filename: &str) -> Vec<String> {
    let mut regexp_array = Vec::new();
    if let Ok(lines) = read_lines(filename) {
        for line in lines {
            if let Ok(ip) = line {
                let _re = Regex::new(&ip).unwrap();
                regexp_array.push(ip.clone());
            }
        }
    }
    regexp_array
}

fn read_eme2(filename: &str) -> String {
    let mut count = 0;
    let mut regexp = String::new();
    if let Ok(lines) = read_lines(filename) {
        for line in lines {
            if let Ok(ip) = line {
                if count > 0 {
                    regexp.push('|');
                }
                regexp.push_str(&format!("(?P<GROUP_{}>{})", count, &ip));
                count += 1;
            }
        }
    }
    regexp
}


fn run_regexp() {
    let regexp_str = read_eme2("./patterns.eme");
    println!("regexp_str: '{:?}'", regexp_str);
    let regex = Regex::new(&regexp_str).unwrap();


    let mut count = 0;
    let mut t0 = Instant::now();

    if let Ok(lines) = read_lines("./test_100.log") {
        for line in lines {
            if count % 10000 == 0 {
                let t1 = Instant::now();
                println!("Line {:?} , {:?}, {:?}", &line, count, t1 - t0);
                t0 = t1;
            }
            if let Ok(ip) = line {
                for cap in regex.captures_iter(&ip) {
                    println!("cap: '{:?}', line: '{:?}'", cap, ip);
                }
            }
            count += 1;
        }
    }
}


fn run_regexp_array() {
    let regexp_array = read_eme("./patterns.eme");
    let regexp_set = RegexSet::new(&regexp_array).unwrap();
    let mut count = 0;
    let mut t0 = Instant::now();

    if let Ok(lines) = read_lines("./test.log") {
        for line in lines {
            if count % 10000 == 0 {
                let t1 = Instant::now();
                println!("Line {:?} , {:?}, {:?}", &line, count, t1 - t0);
                t0 = t1;
            }
            if let Ok(ip) = line {
                let matches: Vec<_> = regexp_set.matches(&ip).into_iter().collect();
                for m in matches {
                    println!("m: '{:?}', pattern: '{:?}', line: '{:?}'", m, regexp_array[m], ip);
                }
            }
            count += 1;
        }
    }
}

fn main() {
    run_regexp();
    //run_regexp_array();
}

BurntSushi · 2020-12-26T01:16:38Z

I don't think API changes are the right path here. If there's a performance difference then we should fix that. I believe there are tickets for that.

ohaddahan · 2020-12-26T08:37:10Z

I agree about the API changes.
I'll work on a PR with something more solid for you to provide feedback on (due note I'm not a Rust expert so I expect much criticism on my code).

Also found some related issue #259

BurntSushi · 2020-12-26T13:29:26Z

OK, but it would probably be better to discuss what specifically you'll be doing in the PR if possible. Even if it's just a sketch. I may not merge it based purely on its design, for example.

BurntSushi · 2021-07-19T18:26:35Z

I think having to preprocess the capture group names on the caller side is probably the right path here. I don't think complicating the regex crate API is the right path here.

ohaddahan changed the title ~~RegexSet~~ RegexSet - enhancement Dec 25, 2020

BurntSushi closed this as completed Jul 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RegexSet - enhancement #732

RegexSet - enhancement #732

ohaddahan commented Dec 25, 2020 •

edited

BurntSushi commented Dec 25, 2020

ohaddahan commented Dec 25, 2020

ohaddahan commented Dec 25, 2020

BurntSushi commented Dec 26, 2020

ohaddahan commented Dec 26, 2020

BurntSushi commented Dec 26, 2020

BurntSushi commented Jul 19, 2021

RegexSet - enhancement #732

RegexSet - enhancement #732

Comments

ohaddahan commented Dec 25, 2020 • edited

BurntSushi commented Dec 25, 2020

ohaddahan commented Dec 25, 2020

ohaddahan commented Dec 25, 2020

BurntSushi commented Dec 26, 2020

ohaddahan commented Dec 26, 2020

BurntSushi commented Dec 26, 2020

BurntSushi commented Jul 19, 2021

ohaddahan commented Dec 25, 2020 •

edited