reverse inner literal optimization results in incorrect match offsets in some cases #1060

Theta-Dev · 2023-08-03T16:25:10Z

What version of regex are you using?

The bug is present in version 1.9.0 and 1.9.1.
1.8.4 was the last working version.

Describe the bug at a high level.

I am using the regex crate for parsing textual video durations and getting the duration in seconds.
After updating the regex crate to version 1.9.0, my regex fails to parse durations with more than 2 hour digits.

The + operator I use for matching an arbitrary amount of hour digits now only matches 1 or 2 digits.

What are the steps to reproduce the behavior?

Regex: (?:(\d+)[:.])?(\d{1,2})[:.](\d{2})

Here is my parsing function.

/// Parse textual video length (e.g. `0:49`, `2:02` or `1:48:18`)
/// and return the duration in seconds.
pub fn parse_video_length(text: &str) -> Option<u32> {
    static VIDEO_LENGTH_REGEX: Lazy<Regex> =
        Lazy::new(|| Regex::new(r#"(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})"#).unwrap());
    VIDEO_LENGTH_REGEX.captures(text).map(|cap| {
        let hrs = cap
            .get(1)
            .and_then(|x| x.as_str().parse::<u32>().ok())
            .unwrap_or_default();
        let min = cap
            .get(2)
            .and_then(|x| x.as_str().parse::<u32>().ok())
            .unwrap_or_default();
        let sec = cap
            .get(3)
            .and_then(|x| x.as_str().parse::<u32>().ok())
            .unwrap_or_default();

        hrs * 3600 + min * 60 + sec
    })
}

What is the actual behavior?

The hour group only matches the number 2, so the parsed duration is 2:12:39 or 7959 seconds

parse_video_length("102:12:39") == Some(7959)

What is the expected behavior?

parse_video_length("102:12:39") == Some(367959)

The text was updated successfully, but these errors were encountered:

BurntSushi · 2023-08-03T16:53:40Z

CLI reproduction:

$ regex-cli find match meta -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66'
     parse time:  27.056µs
 translate time:  9.404µs
build meta time:  263.862µs
    search time:  28.082µs
  total matches:  1
0:1:9:88:77:66

The match should be 0:9 but 1:9 is reported.

This isn't really about + but about a literal optimization. The regex is conceptually broken into three pieces: (?:(\d+)[:.])?(\d{1,2}), [:.] and (\d{2}). The search looks for matches of the [:.] first, which is very quick by virtue of using SIMD acceleration. That first match occurs at &haystack[3]. Then it tries to run a reverse search for (?:(\d+)[:.])?(\d{1,2}). This is where things go wrong. It finds a reverse match at &haystack[1] because the \d{1,2} matches two digits and since the (?:(\d+)[:.])? is overall optional and the [:.] doesn't match at this point, the match is reported at offset 1. Then a forward search is run from that point to find the real end of the match, which finds 9 because a match at &haystack[1..9] exists, but the correct leftmost match is &haystack[0..9].

I'm going to need to think a little bit about how to fix this. It might be something about characterizing regexes that can't be used with the inner literal optimization. Which is unfortunate, but much better than removing the inner literal optimization altogether... Yikes.

Sadly it seems that my days of squashing optimization bugs are still before me. In this particular case, the reverse inner literal optimization (which is a new optimization introduced in regex 1.9) resulted in reporting incorrect match offsets in some cases. The offending case here is: $ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66' 0:1:9:888:77:66 The above reports a match at 1..9, but the correct match is 0..9. The problem here is that the reverse inner literal optimization is being applied, which splits the regex into three (conceptual) pieces: 1. `(?:(\d+)[:.])?(\d{1,2})` 2. `[:.]` 3. `(\d{2})` The reverse inner optimization works by looking for occurrences of (2) first, then matching (1) in reverse to find the start position of the match and then searching for (3) in the forward direction to find the end of the match. The problem in this particular case is that (2) matches at position `3` in the `888:77:66` haystack. Since the first section of numbers is optional, the reverse inner optimization believes a match exists at offset `1` by virtue of matching (1) in reverse. That is, the `(\d{1,2})` matches at 1..3 while the `(?:(\d+)[:.])?` doesn't match at all. The reverse search here is correct in isolation, but it leads to an overall incorrect result by stopping the search early. The issue is that the true leftmost match requires (2) to match at 6..7, but since it matched at 3..4 first, it is considered first and leads to an incorrect overall match. To fix this, we add another "trip wire" to the reverse inner optimization (of which there are already several) that tries to detect cases where it cannot prove that the match it found is actually the leftmost match. Namely, if it reports a match offset greater than the start of the search and otherwise *could* have kept searching, then we don't know whether we have the true leftmost match. In that case, we bail on the optimization and let a slower path take over. This is yet another example of how the nature of regex searching, and in particular leftmost searching, inhibits the composition of different regex strategies. Or at least, makes them incredibly subtle. Fixes #1060

BurntSushi · 2023-08-05T22:20:24Z

This is fixed in regex 1.9.3 and regex-automata 0.3.6 on crates.io.

BurntSushi added the bug label Aug 3, 2023

BurntSushi changed the title ~~Unexpected + operator behavior in 1.9~~ reverse inner literal optimization results in incorrect match offsets in some cases Aug 5, 2023

BurntSushi mentioned this issue Aug 5, 2023

automata: fix incorrect offsets reported by reverse inner optimization #1063

Merged

BurntSushi closed this as completed in #1063 Aug 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reverse inner literal optimization results in incorrect match offsets in some cases #1060

reverse inner literal optimization results in incorrect match offsets in some cases #1060

Theta-Dev commented Aug 3, 2023 •

edited

BurntSushi commented Aug 3, 2023

BurntSushi commented Aug 5, 2023

reverse inner literal optimization results in incorrect match offsets in some cases #1060

reverse inner literal optimization results in incorrect match offsets in some cases #1060

Comments

Theta-Dev commented Aug 3, 2023 • edited

What version of regex are you using?

Describe the bug at a high level.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?

BurntSushi commented Aug 3, 2023

BurntSushi commented Aug 5, 2023

Theta-Dev commented Aug 3, 2023 •

edited