Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panicked at 'not a char boundary' when using Regex::replace #1006

Closed
WIZeaz opened this issue Jun 5, 2023 · 5 comments · Fixed by #1007
Closed

panicked at 'not a char boundary' when using Regex::replace #1006

WIZeaz opened this issue Jun 5, 2023 · 5 comments · Fixed by #1007

Comments

@WIZeaz
Copy link

WIZeaz commented Jun 5, 2023

What version of regex are you using?

v1.8.3

Describe the bug at a high level.

I have fuzzed the regex package with afl.rs, and afl reports some cases panicked at "char boundary":

thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'ο' (bytes 0..2) of `ο00000000000`', <my_project_path>/regex/src/re_unicode.rs:574:31

It seems replace function have some problems with utf-8 character. I have found such panic in several fuzz drivers.

What are the steps to reproduce the behavior?

fn main() {
    let re = regex::Regex::new(r"\B|00(?-u)\B").unwrap();
    let text = r"𐾁00000000";
    let rep = r"0𐾁Ű000ο";
    let _ = re.replace(text, rep);
}
fn main(){
    let re = regex::Regex::new(r"\B|00(?-u)0\B").unwrap();
    let text = "ο";
    let rep = "000";
    let _ = re.replace(text, rep);
}
fn main(){
    let re = regex::Regex::new(r"(()(?-u)\B)0|\B").unwrap();
    let _ = re.replace("Ԑ0000000000000" ,"000000000000000");
}
fn main(){
    let re = regex::Regex::new(r"0()|\B|(?-u)\B0").unwrap();
    let _ = re.replace("ⳅ000000000000" ,"000000000000000");
}
fn main(){
    let re = regex::Regex::new(r"\B|(?-u)\B0").unwrap();
    let _ = re.replace_all("00000^睓00" ,"00000000000");
}

I also found that if I replace "replace_all" with "replace", the last case will not panic. The rest of these will panic after being replaced. It seems weird.

What is the actual behavior?

Here is an error message sample; the error messages in these cases are similar.

thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'ο' (bytes 0..2) of `ο00000000000`', <my_project_path>/regex/src/re_unicode.rs:574:31

What is the expected behavior?

Regex::replace should deal with utf-8 character correctly.

@BurntSushi
Copy link
Member

It looks like this is a new bug introduced by regex 1.8. Prior versions return an error for such regexes:

$ cargo run
   Compiling aho-corasick v0.7.20
   Compiling regex-syntax v0.6.29
   Compiling regex v1.7.3
   Compiling i1006 v0.1.0 (/home/andrew/tmp/issues/regex/i1006)
    Finished dev [unoptimized + debuginfo] target(s) in 1.50s
     Running `target/debug/i1006`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    \B|00(?-u)0\B
               ^^
error: pattern can match invalid UTF-8
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)', main.rs:2:50
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This is fixed by #978 more generally by allowing the regex to compile but not reporting a match anywhere. In older versions of this crate, things like (?-u:\B) were forbidden in Unicode regexes because it can match between code units in UTF-8 encoded text. But that's technically also true of .* as well (since it can match any empty string). So #978 takes a more robust perspective on this issue and generally permits regexes that could match between code units, but adjusts the matching engines to not report such matches.

BurntSushi added a commit that referenced this issue Jun 5, 2023
This regex failed to compile in `regex <1.8`, but the migration to
regex-automata tweaked the rules in a subtle way that permitted it
to compile despite the fact that the old/status-quo matching engines
can't handle it correctly. By that, I mean that they may permit the \B
to match between code units. That in turn results in panicking when
slicing a &str.

In `regex 1.9`, this regex will actually be able to be compiled, but
the matching engines will correctly and robustly never report matches
that split UTF-8 code units. For now, we just add code that causes
`regex 1.8` to have the same behavior as previous releases.

Fixes #1006
@BurntSushi
Copy link
Member

I put up #1007 to make regex 1.8 also return a regex compilation error, like in previous versions.

BurntSushi added a commit that referenced this issue Jun 5, 2023
This regex failed to compile in `regex <1.8`, but the migration to
regex-automata tweaked the rules in a subtle way that permitted it
to compile despite the fact that the old/status-quo matching engines
can't handle it correctly. By that, I mean that they may permit the \B
to match between code units. That in turn results in panicking when
slicing a &str.

In `regex 1.9`, this regex will actually be able to be compiled, but
the matching engines will correctly and robustly never report matches
that split UTF-8 code units. For now, we just add code that causes
`regex 1.8` to have the same behavior as previous releases.

Fixes #1006
@addisoncrump
Copy link
Contributor

Can we add replace to the general fuzz harness for this case? Seems to only happen in replace, which I didn't write a fuzzer for.

@BurntSushi
Copy link
Member

Probably just making sure that Match::as_str works is enough. But adding replace seems fine.

I'd prefer is we waited until #978 was merged to do more work on the fuzzers.

@BurntSushi
Copy link
Member

This is fixed in regex 1.8.4 on crates.io.

crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Jun 12, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | patch | `1.8.1` -> `1.8.4` |

---

### Release Notes

<details>
<summary>rust-lang/regex</summary>

### [`v1.8.4`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;184-2023-06-05)

[Compare Source](rust-lang/regex@1.8.3...1.8.4)

\==================
This is a patch release that fixes a bug where `(?-u:\B)` was allowed in
Unicode regexes, despite the fact that the current matching engines can report
match offsets between the code units of a single UTF-8 encoded codepoint. That
in turn means that match offsets that split a codepoint could be reported,
which in turn results in panicking when one uses them to slice a `&str`.

This bug occurred in the transition to `regex 1.8` because the underlying
syntactical error that prevented this regex from compiling was intentionally
removed. That's because `(?-u:\B)` will be permitted in Unicode regexes in
`regex 1.9`, but the matching engines will guarantee to never report match
offsets that split a codepoint. When the underlying syntactical error was
removed, no code was added to ensure that `(?-u:\B)` didn't compile in the
`regex 1.8` transition release. This release, `regex 1.8.4`, adds that code
such that `Regex::new(r"(?-u:\B)")` returns to the `regex <1.8` behavior of
not compiling. (A `bytes::Regex` can still of course compile it.)

Bug fixes:

-   [BUG #&#8203;1006](rust-lang/regex#1006):
    Fix a bug where `(?-u:\B)` was allowed in Unicode regexes, and in turn could
    lead to match offsets that split a codepoint in `&str`.

### [`v1.8.3`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;183-2023-05-25)

[Compare Source](rust-lang/regex@1.8.2...1.8.3)

\==================
This is a patch release that fixes a bug where the regex would report a
match at every position even when it shouldn't. This could occur in a very
small subset of regexes, usually an alternation of simple literals that
have particular properties. (See the issue linked below for a more precise
description.)

Bug fixes:

-   [BUG #&#8203;999](rust-lang/regex#999):
    Fix a bug where a match at every position is erroneously reported.

### [`v1.8.2`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;182-2023-05-22)

[Compare Source](rust-lang/regex@1.8.1...1.8.2)

\==================
This is a patch release that fixes a bug where regex compilation could panic
in debug mode for regexes with large counted repetitions. For example,
`a{2147483516}{2147483416}{5}` resulted in an integer overflow that wrapped
in release mode but panicking in debug mode. Despite the unintended wrapping
arithmetic in release mode, it didn't cause any other logical bugs since the
errant code was for new analysis that wasn't used yet.

Bug fixes:

-   [BUG #&#8203;995](rust-lang/regex#995):
    Fix a bug where regex compilation with large counted repetitions could panic.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNS4xMTEuMCIsInVwZGF0ZWRJblZlciI6IjM1LjExMS4wIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9-->

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1912
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants