Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes_regex should permit generation of byte sequences that are invalid UTF-8 #336

Closed
zackw opened this issue Jun 29, 2023 · 4 comments
Closed
Labels
help-request This issue is asking for advice/help on using proptest

Comments

@zackw
Copy link
Contributor

zackw commented Jun 29, 2023

This strategy function

fn invalid_ts() -> impl Strategy<Value = Vec<u8>> {
    prop::string::bytes_regex(
        r"(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)"
    ).unwrap()
}

is intended to generate, among other things, invalid UTF-8 byte sequences, because it's for testing a parser that works directly from data on disk that cannot be trusted. What it actually does is crash on the unwrap() with

thread 'attrs::test::parse_xattr_ts_invalid' panicked at 'called `Result::unwrap()` on an `Err` value:
    RegexSyntax(Translate(Error {
        kind: InvalidUtf8,
        pattern: "(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\\.[0-9]*[^0-9].*)",
        span: Span(Position(o: 7, l: 1, c: 8), Position(o: 13, l: 1, c: 14))
    }))'

Looking at the code, I believe the change needed is for bytes_regex to have its own version of regex_to_hir that calls ParserBuilder::allow_invalid_utf8(true).

This change should have no visible effect on any existing code that uses bytes_regex, since one must also opt into generation of invalid UTF-8 within the regex itself (that's one of the things the (?s-u: does) and any existing regex that uses that flag must be using it in a way that actually can't generate invalid UTF-8, or else they'd hit the same crash I'm hitting.

zackw added a commit to zackw/proptest that referenced this issue Jun 29, 2023
It is desirable to be able to generate, from a regex, byte sequences
that are not necessarily valid UTF-8.  For example, suppose you have a
parser that accepts any string generated by the regex

```
[0-9]+(\.[0-9]*)?
```

Then, in your test suite, you might want to generate strings from the
complementary regular language, which you could do with the regex

```
(?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)
```

However, this will still only generate valid UTF-8 strings.  Maybe you
are parsing directly from byte sequences read from disk, in which case
you want to test the parser’s ability to reject invalid UTF-8 _as well
as_ valid UTF-8 but not within the accepted language.  Then you want
this slight variation:

```
(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)
```

But this regex will be rejected by `bytes_regex`, because by default
`regex_syntax::Parser` errors out on any regex that potentially
matches invalid UTF-8.  The application — i.e. proptest — must opt
into use of such regexes.  This patch makes proptest do just that, for
`bytes_regex` only.

There should be no change to the behavior of any existing test suite,
because opting to allow use of `(?-u)` does not change the semantics
of any regex that _doesn’t_ contain `(?-u)`, and any existing regex
that _does_ contain `(?-u)` must be incapable of generating invalid
UTF-8 for other reasons, or `regex_syntax::Parser` would be rejecting it.
(For example, `(?-u:[a-z])` cannot generate invalid UTF-8.)

This patch also adds a bunch of tests for `bytes_regex`, which AFAICT
was not being tested at all.  Some of these use the new functionality
and others don’t.  There is quite a bit of code duplication in the
test helper functions — `do_test` and `do_test_bytes` are almost
identical, as are `generate_values_matching_regex` and
`generate_byte_values_matching_regex`.  I am not good enough at
generic metaprogramming in Rust to factor out the duplication.
@matthew-russo
Copy link
Member

Hey sorry for the delay in getting to this. Gave this a look -- bytes_regex doesn't allow this but there is a public method bytes_regex_parsed that just takes in &regex_syntax::hir::Hir, allowing you to generate that yourself with a customized regex_syntax::ParserBuilder which should address your needs. Given the workaround is straightforward and works generally for other regex parsing options, I'm inclined to leave things as is and not introduce a new, specific helper but I'm open to hearing arguments for that.

I'm going to close this but feel free to reopen if there are outstanding questions/discussions to be had

The proptest method:
https://docs.rs/proptest/latest/proptest/string/fn.bytes_regex_parsed.html

A working example with your pattern:

╰─⠠⠵ cat Cargo.toml
[package]
name = "proptest-playground"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
proptest = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
proptest-derive = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
proptest-state-machine = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
regex-syntax = { version = "0.7" }
╰─⠠⠵ cat src/main.rs
fn main() {
    println!("Hello, world!");
}

#[cfg(test)]
mod test {
    use proptest::prelude::*;
    use regex_syntax::hir::Hir;
    use regex_syntax::{ParserBuilder, Error};

    fn regex_to_hir(pattern: &str) -> Result<Hir, Error> {
        let mut parser = ParserBuilder::new().utf8(false).build();
        Ok(parser.parse(pattern)?)
    }

    fn invalid_ts() -> impl Strategy<Value = Vec<u8>> {
        let pattern = r"(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)";
        let hir = regex_to_hir(pattern).unwrap();
        prop::string::bytes_regex_parsed(&hir).unwrap()
    }

    proptest!(
        #![proptest_config(ProptestConfig::with_cases(3))]

        #[test]
        fn simple(bytes in invalid_ts()) {
            println!("{:?}", bytes);
        }
    );
}
╰─⠠⠵ cargo test -- --nocapture
   Compiling proptest-playground v0.1.0 (/Users/matthewrusso/projects/user/proptest-playground)
    Finished test [unoptimized + debuginfo] target(s) in 0.23s
     Running unittests src/main.rs (target/debug/deps/proptest_playground-e32ba8f312713645)

running 1 test
[57, 57, 56, 50, 53, 49, 57, 54, 48, 48, 49, 50, 54, 48, 48, 49, 55, 56, 52, 51, 56, 57, 51, 52, 55, 50, 52, 47, 188, 55, 82, 114, 39, 118, 18, 8, 205, 202, 3, 115, 54, 181, 186, 253, 21, 165, 164, 197, 71, 172, 43, 208, 128, 183]
[54, 54, 56, 57, 55, 53, 55, 48, 52, 51, 48, 52, 51, 56, 50, 51, 55, 56, 51, 53, 49, 56, 51, 56, 46, 53, 55, 35, 35, 11, 196, 222, 134, 231, 202, 233, 40, 95, 229, 91, 72, 4, 119, 112, 41, 25, 251, 237, 184, 36, 157, 51, 128, 215]
[57, 53, 57, 51, 55, 51, 49, 57, 52, 56, 48, 56, 56, 56, 49, 49, 55, 49, 50, 49, 56, 51, 55, 57, 49, 54, 47, 248, 53, 205, 23, 39, 137, 171, 14, 133, 3, 102, 253, 209, 228, 29, 98]
[52, 53, 5, 201, 75, 108, 145, 227, 247, 55, 123]
test test::simple ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

@matthew-russo matthew-russo added the help-request This issue is asking for advice/help on using proptest label Jul 15, 2023
@zackw
Copy link
Contributor Author

zackw commented Jul 18, 2023

I'm going to ask you to reconsider closing this, and instead to merge #337, for two reasons:

  1. bytes_regex already exists. Allowing people to use (?-u) in bytes_regex regexes makes it significantly more useful. I can't presently think of any situation where I would want to use bytes_regex without (?-u), although that may be biased by the kind of tests I'm currently writing (e.g. https://git.sr.ht/~zackw/file-integrity-rs/tree/c8ec4a33d3b4d75648e7568587eeb7a0f4b6c529/item/file_integrity/src/test/attrs.rs#L106).
  2. The authors of the regex_syntax crate describe it as "an implementation detail of the regex crate" and "may experience breaking changes at [an unpredictable] cadence". (See https://github.com/rust-lang/regex/tree/master/regex-syntax#motivation.) proptest is already committed to keeping up with those changes, but users of proptest should not need to grow a dependency on an (allegedly) unstable API for something this simple.

@matthew-russo matthew-russo reopened this Jul 18, 2023
@matthew-russo
Copy link
Member

Thanks for the follow up -- I'll give the PR a look some time this week.

And just for clarity, the intention wasn't to bury any requests, just if there was a satisfactory solution without adding new code, I'll generally opt for that. It's clear that the existing solution is not satisfactory

zackw added a commit to zackw/proptest that referenced this issue Sep 18, 2023
It is desirable to be able to generate, from a regex, byte sequences
that are not necessarily valid UTF-8.  For example, suppose you have a
parser that accepts any string generated by the regex

```
[0-9]+(\.[0-9]*)?
```

Then, in your test suite, you might want to generate strings from the
complementary regular language, which you could do with the regex

```
(?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)
```

However, this will still only generate valid UTF-8 strings.  Maybe you
are parsing directly from byte sequences read from disk, in which case
you want to test the parser’s ability to reject invalid UTF-8 _as well
as_ valid UTF-8 but not within the accepted language.  Then you want
this slight variation:

```
(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)
```

But this regex will be rejected by `bytes_regex`, because by default
`regex_syntax::Parser` errors out on any regex that potentially
matches invalid UTF-8.  The application — i.e. proptest — must opt
into use of such regexes.  This patch makes proptest do just that, for
`bytes_regex` only.

There should be no change to the behavior of any existing test suite,
because opting to allow use of `(?-u)` does not change the semantics
of any regex that _doesn’t_ contain `(?-u)`, and any existing regex
that _does_ contain `(?-u)` must be incapable of generating invalid
UTF-8 for other reasons, or `regex_syntax::Parser` would be rejecting it.
(For example, `(?-u:[a-z])` cannot generate invalid UTF-8.)

This patch also adds a bunch of tests for `bytes_regex`, which AFAICT
was not being tested at all.  Some of these use the new functionality
and others don’t.  There is quite a bit of code duplication in the
test helper functions — `do_test` and `do_test_bytes` are almost
identical, as are `generate_values_matching_regex` and
`generate_byte_values_matching_regex`.  I am not good enough at
generic metaprogramming in Rust to factor out the duplication.
matthew-russo pushed a commit that referenced this issue Sep 24, 2023
* Permit use of (?-u) in byte-regex strategies (#336)

It is desirable to be able to generate, from a regex, byte sequences
that are not necessarily valid UTF-8.  For example, suppose you have a
parser that accepts any string generated by the regex

```
[0-9]+(\.[0-9]*)?
```

Then, in your test suite, you might want to generate strings from the
complementary regular language, which you could do with the regex

```
(?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)
```

However, this will still only generate valid UTF-8 strings.  Maybe you
are parsing directly from byte sequences read from disk, in which case
you want to test the parser’s ability to reject invalid UTF-8 _as well
as_ valid UTF-8 but not within the accepted language.  Then you want
this slight variation:

```
(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)
```

But this regex will be rejected by `bytes_regex`, because by default
`regex_syntax::Parser` errors out on any regex that potentially
matches invalid UTF-8.  The application — i.e. proptest — must opt
into use of such regexes.  This patch makes proptest do just that, for
`bytes_regex` only.

There should be no change to the behavior of any existing test suite,
because opting to allow use of `(?-u)` does not change the semantics
of any regex that _doesn’t_ contain `(?-u)`, and any existing regex
that _does_ contain `(?-u)` must be incapable of generating invalid
UTF-8 for other reasons, or `regex_syntax::Parser` would be rejecting it.
(For example, `(?-u:[a-z])` cannot generate invalid UTF-8.)

This patch also adds a bunch of tests for `bytes_regex`, which AFAICT
was not being tested at all.  Some of these use the new functionality
and others don’t.  There is quite a bit of code duplication in the
test helper functions — `do_test` and `do_test_bytes` are almost
identical, as are `generate_values_matching_regex` and
`generate_byte_values_matching_regex`.  I am not good enough at
generic metaprogramming in Rust to factor out the duplication.

* [to squash] Correct for API change in regex-syntax 0.7

Commit
rust-lang/regex@706b07d
renamed ParserBuilder::allow_invalid_utf8
to ParserBuilder::utf8
and inverted the sense of its argument.

Separate commit for review purposes; should be squashed before landing
to preserve bisectability of trunk.
@matthew-russo
Copy link
Member

Closed with merge of: #337

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help-request This issue is asking for advice/help on using proptest
Projects
None yet
Development

No branches or pull requests

2 participants