Permit use of (?-u) in byte-regex strategies (#336) #337

zackw · 2023-06-29T19:23:20Z

It is desirable to be able to generate, from a regex, byte sequences that are not necessarily valid UTF-8. For example, suppose you have a parser that accepts any string generated by the regex

[0-9]+(\.[0-9]*)?

Then, in your test suite, you might want to generate strings from the complementary regular language, which you could do with the regex

(?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)

However, this will still only generate valid UTF-8 strings. Maybe you are parsing directly from byte sequences read from disk, in which case you want to test the parser’s ability to reject invalid UTF-8 as well as valid UTF-8 but not within the accepted language. Then you want this slight variation:

(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)

But this regex will be rejected by bytes_regex, because by default regex_syntax::Parser errors out on any regex that potentially matches invalid UTF-8. The application — i.e. proptest — must opt into use of such regexes. This patch makes proptest do just that, for bytes_regex only.

There should be no change to the behavior of any existing test suite, because opting to allow use of (?-u) does not change the semantics of any regex that doesn’t contain (?-u), and any existing regex that does contain (?-u) must be incapable of generating invalid UTF-8 for other reasons, or regex_syntax::Parser would be rejecting it. (For example, (?-u:[a-z]) cannot generate invalid UTF-8.)

This patch also adds a bunch of tests for bytes_regex, which AFAICT was not being tested at all. Some of these use the new functionality and others don’t. There is quite a bit of code duplication in the test helper functions — do_test and do_test_bytes are almost identical, as are generate_values_matching_regex and generate_byte_values_matching_regex. I am not good enough at generic metaprogramming in Rust to factor out the duplication.

cameron1024 · 2023-06-29T20:11:42Z

Thanks for the PR 😅 I'll have a look at this more closely when I have time, but upon first inspection it looks sensible.

If I haven't reviewed by the end of this weekend feel free to ping me 😅

tzemanovic

I think this is sound. Even though as @matthew-russo pointed out there is bytes_regex_parsed already, it requires to pull in regex_syntax to be used (side note, this crate should probably be re-exported as it appears in the public api), so extending the fn bytes_regex without a breaking change is a nice solution to allow this to be done directly. Additionally, having the extra doc on how to do this might help someone in future

matthew-russo · 2023-07-28T15:57:32Z

Agh thanks for reminding me I need to look at this -- will do it this evening after work

It is desirable to be able to generate, from a regex, byte sequences that are not necessarily valid UTF-8. For example, suppose you have a parser that accepts any string generated by the regex ``` [0-9]+(\.[0-9]*)? ``` Then, in your test suite, you might want to generate strings from the complementary regular language, which you could do with the regex ``` (?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` However, this will still only generate valid UTF-8 strings. Maybe you are parsing directly from byte sequences read from disk, in which case you want to test the parser’s ability to reject invalid UTF-8 _as well as_ valid UTF-8 but not within the accepted language. Then you want this slight variation: ``` (?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` But this regex will be rejected by `bytes_regex`, because by default `regex_syntax::Parser` errors out on any regex that potentially matches invalid UTF-8. The application — i.e. proptest — must opt into use of such regexes. This patch makes proptest do just that, for `bytes_regex` only. There should be no change to the behavior of any existing test suite, because opting to allow use of `(?-u)` does not change the semantics of any regex that _doesn’t_ contain `(?-u)`, and any existing regex that _does_ contain `(?-u)` must be incapable of generating invalid UTF-8 for other reasons, or `regex_syntax::Parser` would be rejecting it. (For example, `(?-u:[a-z])` cannot generate invalid UTF-8.) This patch also adds a bunch of tests for `bytes_regex`, which AFAICT was not being tested at all. Some of these use the new functionality and others don’t. There is quite a bit of code duplication in the test helper functions — `do_test` and `do_test_bytes` are almost identical, as are `generate_values_matching_regex` and `generate_byte_values_matching_regex`. I am not good enough at generic metaprogramming in Rust to factor out the duplication.

zackw · 2023-09-18T12:34:55Z

Rebased on latest trunk, sorry for the delay.

Commit rust-lang/regex@706b07d renamed ParserBuilder::allow_invalid_utf8 to ParserBuilder::utf8 and inverted the sense of its argument. Separate commit for review purposes; should be squashed before landing to preserve bisectability of trunk.

zackw · 2023-09-18T13:01:04Z

Added another commit to fix up the code for an API change. I cannot run the test suite locally right now due to what looks like a bug in message-io...

error[E0716]: temporary value dropped while borrowed
   --> .../.cargo/registry/src/index.crates.io-6f17d22bba15001f/message-io-0.17.0/src/adapters/udp.rs:241:22
    |
241 |                 &mut [io::IoSliceMut::new(&mut input_buffer)],
    |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ creates a temporary value which is freed while still in use
...
244 |             );
    |              - temporary value is freed at the end of this statement
245 |
246 |             match result {
    |                   ------ borrow later used here
    |
    = note: consider using a `let` binding to create a longer lived value

... but this is sufficient to make proptest build again as a dependency of a larger project that uses the feature this PR adds, and that project's testsuite passes.

The additional commit should be squashed before merging.

zackw · 2023-09-18T13:05:13Z

Fatfingered the close button, sorry for the noise.

zackw mentioned this pull request Jul 18, 2023

bytes_regex should permit generation of byte sequences that are invalid UTF-8 #336

Closed

tzemanovic approved these changes Jul 28, 2023

View reviewed changes

matthew-russo approved these changes Jul 30, 2023

View reviewed changes

zackw force-pushed the master branch from 51ef282 to 5f5b02b Compare September 18, 2023 12:34

zackw closed this Sep 18, 2023

zackw reopened this Sep 18, 2023

matthew-russo merged commit 370b3a0 into proptest-rs:master Sep 24, 2023
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Permit use of (?-u) in byte-regex strategies (#336) #337

Permit use of (?-u) in byte-regex strategies (#336) #337

zackw commented Jun 29, 2023

cameron1024 commented Jun 29, 2023

tzemanovic left a comment

matthew-russo commented Jul 28, 2023

zackw commented Sep 18, 2023

zackw commented Sep 18, 2023 •

edited

zackw commented Sep 18, 2023

Permit use of (?-u) in byte-regex strategies (#336) #337

Permit use of (?-u) in byte-regex strategies (#336) #337

Conversation

zackw commented Jun 29, 2023

cameron1024 commented Jun 29, 2023

tzemanovic left a comment

Choose a reason for hiding this comment

matthew-russo commented Jul 28, 2023

zackw commented Sep 18, 2023

zackw commented Sep 18, 2023 • edited

zackw commented Sep 18, 2023

zackw commented Sep 18, 2023 •

edited