Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reverse inner literal optimization results in incorrect match offsets in some cases #1060

Closed
Theta-Dev opened this issue Aug 3, 2023 · 2 comments · Fixed by #1063
Closed
Labels

Comments

@Theta-Dev
Copy link

Theta-Dev commented Aug 3, 2023

What version of regex are you using?

The bug is present in version 1.9.0 and 1.9.1.
1.8.4 was the last working version.

Describe the bug at a high level.

I am using the regex crate for parsing textual video durations and getting the duration in seconds.
After updating the regex crate to version 1.9.0, my regex fails to parse durations with more than 2 hour digits.

The + operator I use for matching an arbitrary amount of hour digits now only matches 1 or 2 digits.

What are the steps to reproduce the behavior?

Regex: (?:(\d+)[:.])?(\d{1,2})[:.](\d{2})

Here is my parsing function.

/// Parse textual video length (e.g. `0:49`, `2:02` or `1:48:18`)
/// and return the duration in seconds.
pub fn parse_video_length(text: &str) -> Option<u32> {
    static VIDEO_LENGTH_REGEX: Lazy<Regex> =
        Lazy::new(|| Regex::new(r#"(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})"#).unwrap());
    VIDEO_LENGTH_REGEX.captures(text).map(|cap| {
        let hrs = cap
            .get(1)
            .and_then(|x| x.as_str().parse::<u32>().ok())
            .unwrap_or_default();
        let min = cap
            .get(2)
            .and_then(|x| x.as_str().parse::<u32>().ok())
            .unwrap_or_default();
        let sec = cap
            .get(3)
            .and_then(|x| x.as_str().parse::<u32>().ok())
            .unwrap_or_default();

        hrs * 3600 + min * 60 + sec
    })
}

What is the actual behavior?

The hour group only matches the number 2, so the parsed duration is 2:12:39 or 7959 seconds

parse_video_length("102:12:39") == Some(7959)

What is the expected behavior?

parse_video_length("102:12:39") == Some(367959)
@BurntSushi BurntSushi added the bug label Aug 3, 2023
@BurntSushi
Copy link
Member

CLI reproduction:

$ regex-cli find match meta -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66'
     parse time:  27.056µs
 translate time:  9.404µs
build meta time:  263.862µs
    search time:  28.082µs
  total matches:  1
0:1:9:88:77:66

The match should be 0:9 but 1:9 is reported.

This isn't really about + but about a literal optimization. The regex is conceptually broken into three pieces: (?:(\d+)[:.])?(\d{1,2}), [:.] and (\d{2}). The search looks for matches of the [:.] first, which is very quick by virtue of using SIMD acceleration. That first match occurs at &haystack[3]. Then it tries to run a reverse search for (?:(\d+)[:.])?(\d{1,2}). This is where things go wrong. It finds a reverse match at &haystack[1] because the \d{1,2} matches two digits and since the (?:(\d+)[:.])? is overall optional and the [:.] doesn't match at this point, the match is reported at offset 1. Then a forward search is run from that point to find the real end of the match, which finds 9 because a match at &haystack[1..9] exists, but the correct leftmost match is &haystack[0..9].

I'm going to need to think a little bit about how to fix this. It might be something about characterizing regexes that can't be used with the inner literal optimization. Which is unfortunate, but much better than removing the inner literal optimization altogether... Yikes.

@BurntSushi BurntSushi changed the title Unexpected + operator behavior in 1.9 reverse inner literal optimization results in incorrect match offsets in some cases Aug 5, 2023
BurntSushi added a commit that referenced this issue Aug 5, 2023
Sadly it seems that my days of squashing optimization bugs are still
before me. In this particular case, the reverse inner literal
optimization (which is a new optimization introduced in regex 1.9)
resulted in reporting incorrect match offsets in some cases. The
offending case here is:

    $ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66'
    0:1:9:888:77:66

The above reports a match at 1..9, but the correct match is 0..9. The
problem here is that the reverse inner literal optimization is being
applied, which splits the regex into three (conceptual) pieces:

1. `(?:(\d+)[:.])?(\d{1,2})`
2. `[:.]`
3. `(\d{2})`

The reverse inner optimization works by looking for occurrences of (2)
first, then matching (1) in reverse to find the start position of the
match and then searching for (3) in the forward direction to find the
end of the match.

The problem in this particular case is that (2) matches at position `3`
in the `888:77:66` haystack. Since the first section of numbers is
optional, the reverse inner optimization believes a match exists at
offset `1` by virtue of matching (1) in reverse. That is, the
`(\d{1,2})` matches at 1..3 while the `(?:(\d+)[:.])?` doesn't match at
all. The reverse search here is correct in isolation, but it leads to an
overall incorrect result by stopping the search early. The issue is that
the true leftmost match requires (2) to match at 6..7, but since it
matched at 3..4 first, it is considered first and leads to an incorrect
overall match.

To fix this, we add another "trip wire" to the reverse inner
optimization (of which there are already several) that tries to detect
cases where it cannot prove that the match it found is actually the
leftmost match. Namely, if it reports a match offset greater than the
start of the search and otherwise *could* have kept searching, then we
don't know whether we have the true leftmost match. In that case, we
bail on the optimization and let a slower path take over.

This is yet another example of how the nature of regex searching, and in
particular leftmost searching, inhibits the composition of different
regex strategies. Or at least, makes them incredibly subtle.

Fixes #1060
BurntSushi added a commit that referenced this issue Aug 5, 2023
Sadly it seems that my days of squashing optimization bugs are still
before me. In this particular case, the reverse inner literal
optimization (which is a new optimization introduced in regex 1.9)
resulted in reporting incorrect match offsets in some cases. The
offending case here is:

    $ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66'
    0:1:9:888:77:66

The above reports a match at 1..9, but the correct match is 0..9. The
problem here is that the reverse inner literal optimization is being
applied, which splits the regex into three (conceptual) pieces:

1. `(?:(\d+)[:.])?(\d{1,2})`
2. `[:.]`
3. `(\d{2})`

The reverse inner optimization works by looking for occurrences of (2)
first, then matching (1) in reverse to find the start position of the
match and then searching for (3) in the forward direction to find the
end of the match.

The problem in this particular case is that (2) matches at position `3`
in the `888:77:66` haystack. Since the first section of numbers is
optional, the reverse inner optimization believes a match exists at
offset `1` by virtue of matching (1) in reverse. That is, the
`(\d{1,2})` matches at 1..3 while the `(?:(\d+)[:.])?` doesn't match at
all. The reverse search here is correct in isolation, but it leads to an
overall incorrect result by stopping the search early. The issue is that
the true leftmost match requires (2) to match at 6..7, but since it
matched at 3..4 first, it is considered first and leads to an incorrect
overall match.

To fix this, we add another "trip wire" to the reverse inner
optimization (of which there are already several) that tries to detect
cases where it cannot prove that the match it found is actually the
leftmost match. Namely, if it reports a match offset greater than the
start of the search and otherwise *could* have kept searching, then we
don't know whether we have the true leftmost match. In that case, we
bail on the optimization and let a slower path take over.

This is yet another example of how the nature of regex searching, and in
particular leftmost searching, inhibits the composition of different
regex strategies. Or at least, makes them incredibly subtle.

Fixes #1060
@BurntSushi
Copy link
Member

This is fixed in regex 1.9.3 and regex-automata 0.3.6 on crates.io.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants