rewrite the regex crate #978

Merged
merged 20 commits into from
Jul 5, 2023
204 changes: 130 additions & 74 deletions .github/workflows/ci.yml
@@ -28,31 +28,25 @@ permissions:
contents: read

jobs:
# This job does our basic build+test for supported platforms.
test:
name: test
env:
# For some builds, we use cross to test on 32-bit and big-endian
# systems.
CARGO: cargo
# When CARGO is set to CROSS, TARGET is set to `--target matrix.target`.
# Note that we only use cross on Linux, so setting a target on a
# different OS will just use normal cargo.
TARGET:
# Bump this as appropriate. We pin to a version to make sure CI
# continues to work as cross releases in the past have broken things
# in subtle ways.
CROSS_VERSION: v0.2.5
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
build:
- pinned
- stable
- stable-32
- stable-mips
- beta
- nightly
- macos
- win-msvc
- win-gnu
include:
- build: pinned
os: ubuntu-latest
rust: 1.60.0
- build: stable
os: ubuntu-latest
rust: stable
@@ -80,98 +74,160 @@ jobs:
os: windows-latest
rust: stable-x86_64-gnu
steps:

- name: Checkout repository
uses: actions/checkout@v3

- name: Install Rust
uses: dtolnay/rust-toolchain@v1
with:
toolchain: ${{ matrix.rust }}

- name: Install and configure Cross
if: matrix.target != ''
if: matrix.os == 'ubuntu-latest' && matrix.target != ''
run: |
# In the past, new releases of 'cross' have broken CI. So for now, we
# pin it. We also use their pre-compiled binary releases because cross
# has over 100 dependencies and takes a bit to compile.
dir="$RUNNER_TEMP/cross-download"
mkdir "$dir"
echo "$dir" >> $GITHUB_PATH
cd "$dir"
curl -LO "https://github.com/cross-rs/cross/releases/download/$CROSS_VERSION/cross-x86_64-unknown-linux-musl.tar.gz"
tar xf cross-x86_64-unknown-linux-musl.tar.gz

# We used to install 'cross' from master, but it kept failing. So now
# we build from a known-good version until 'cross' becomes more stable
# or we find an alternative. Notably, between v0.2.1 and current
# master (2022-06-14), the number of Cross's dependencies has doubled.
cargo install --bins --git https://github.com/rust-embedded/cross --tag v0.2.1
# cargo install --bins --git https://github.com/rust-embedded/cross --tag v0.2.1
echo "CARGO=cross" >> $GITHUB_ENV
echo "TARGET=--target ${{ matrix.target }}" >> $GITHUB_ENV

- name: Show command used for Cargo
run: |
echo "cargo command is: ${{ env.CARGO }}"
echo "target flag is: ${{ env.TARGET }}"

echo "cargo command is: $CARGO"
echo "target flag is: $TARGET"
- name: Show CPU info for debugging
if: matrix.os == 'ubuntu-latest'
run: lscpu

- name: Basic build
run: ${{ env.CARGO }} build --verbose $TARGET

- name: Build docs
run: ${{ env.CARGO }} doc --verbose $TARGET

# Our dev dependencies evolve more rapidly than we'd like, so only run
# tests when we aren't pinning the Rust version.
#
# Also, our "full" test suite does quite a lot of work, so we only run it
# on one build. Otherwise, we just run the "default" set of tests.
- name: Run subset of tests
if: matrix.build != 'pinned' && matrix.build != 'stable'
run: ${{ env.CARGO }} test --verbose --test default $TARGET

- name: Run full test suite
if: matrix.build == 'stable'
# 'stable' is Linux only, so we have bash.
run: ./test

- name: Run randomized tests against regexes from the wild
if: matrix.build == 'stable'
run: |
# We run the tests in release mode since it winds up being faster.
RUST_REGEX_RANDOM_TEST=1 ${{ env.CARGO }} test --release --verbose --test crates-regex $TARGET

run: ${{ env.CARGO }} test --verbose --test integration $TARGET
- name: Build regex-syntax docs
if: matrix.build != 'pinned'
run: |
${{ env.CARGO }} doc --verbose --manifest-path regex-syntax/Cargo.toml $TARGET

run: ${{ env.CARGO }} doc --verbose --manifest-path regex-syntax/Cargo.toml $TARGET
- name: Run subset of regex-syntax tests
if: matrix.build != 'pinned' && matrix.build != 'stable'
run: |
${{ env.CARGO }} test --verbose --manifest-path regex-syntax/Cargo.toml $TARGET
run: ${{ env.CARGO }} test --verbose --manifest-path regex-syntax/Cargo.toml $TARGET
- name: Build regex-automata docs
run: ${{ env.CARGO }} doc --verbose --manifest-path regex-automata/Cargo.toml $TARGET
- name: Run subset of regex-automata tests
if: matrix.build != 'win-gnu' # Just horrifically slow.
run: ${{ env.CARGO }} test --verbose --manifest-path regex-automata/Cargo.toml $TARGET
- name: Run regex-lite tests
run: ${{ env.CARGO }} test --verbose --manifest-path regex-lite/Cargo.toml $TARGET
- name: Run regex-cli tests
run: ${{ env.CARGO }} test --verbose --manifest-path regex-cli/Cargo.toml $TARGET

# This job runs a stripped down version of CI to test the MSRV. The specific
# reason for doing this is that the regex crate's dev-dependencies tend to
# evolve more quickly. There isn't as tight of a control on them because,
# well, they're only used in tests and their MSRV doesn't matter as much.
#
# It is a bit unfortunate that our MSRV test is basically just "build it"
# and pass if that works. But usually MSRV is broken by compilation problems
# and not runtime behavior. So this is in practice good enough.
msrv:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Install Rust
uses: dtolnay/rust-toolchain@v1
with:
toolchain: 1.60.0
- name: Basic build
run: cargo build --verbose
- name: Build docs
run: cargo doc --verbose

# This job runs many more tests for the regex crate proper. Basically,
# it repeats the same test suite for a bunch of different crate feature
# combinations. There are so many features that exhaustive testing isn't
# really possible, but we cover as much as is feasible.
#
# If there is a feature combo that should be tested but isn't, you'll want to
# add it to the appropriate 'test' script in this repo.
testfull-regex:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Install Rust
uses: dtolnay/rust-toolchain@v1
with:
toolchain: stable
- name: Run full test suite
run: ./test

- name: Run full regex-syntax test suite
if: matrix.build == 'stable'
run: |
# 'stable' is Linux only, so we have bash.
cd regex-syntax
./test
# Same as above, but for regex-automata, which has even more crate features!
testfull-regex-automata:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Install Rust
uses: dtolnay/rust-toolchain@v1
with:
toolchain: stable
- name: Run full test suite
run: ./regex-automata/test

- name: Run regex-capi tests
if: matrix.build == 'stable'
run: |
# 'stable' is Linux only, so we have bash.
cd regex-capi
./test
# Same as above, but for regex-syntax.
testfull-regex-syntax:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Install Rust
uses: dtolnay/rust-toolchain@v1
with:
toolchain: stable
- name: Run full test suite
run: ./regex-syntax/test

- if: matrix.build == 'nightly'
name: Run benchmarks as tests
run: |
cd bench
./run rust --no-run --verbose
# Same as above, but for regex-capi.
testfull-regex-capi:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Install Rust
uses: dtolnay/rust-toolchain@v1
with:
toolchain: stable
- name: Run full test suite
run: ./regex-capi/test

- if: matrix.build == 'nightly'
name: Run tests with pattern feature
run: |
cargo test --test default --no-default-features --features 'std pattern unicode-perl'
# Runs miri on regex-automata's test suite. This doesn't quite cover
# everything. Many tests are disabled when building with miri because of
# how slow miri runs. But it still gives us decent coverage.
miri-regex-automata:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Install Rust
uses: dtolnay/rust-toolchain@v1
with:
# We use nightly here so that we can use miri I guess?
# It caught me by surprise that miri seems to only be
# available on nightly.
toolchain: nightly
components: miri
- name: Run full test suite
run: cargo miri test --manifest-path regex-automata/Cargo.toml

# Tests that everything is formatted correctly.
rustfmt:
name: rustfmt
runs-on: ubuntu-latest
steps:
- name: Checkout repository
6 changes: 6 additions & 0 deletions .vim/coc-settings.json
@@ -0,0 +1,6 @@
{
"rust-analyzer.linkedProjects": [
"fuzz/Cargo.toml",
"Cargo.toml"
]
}
77 changes: 77 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,80 @@
1.9.0 (2023-07-05)
==================
This release marks the end of a [years-long rewrite of the regex crate
internals](https://github.com/rust-lang/regex/issues/656). Since this is
such a big release, please report any issues or regressions you find. I would
also love to hear about improvements.

In addition to many internal improvements that should hopefully result in
"my regex searches are faster," there have also been a few API additions:

* A new `Captures::extract` method exists for quickly accessing the substrings
that match each capture group in a regex.
* A new inline flag, `R`, which enables CRLF mode. This makes `.` match any
Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and
`(?m:$)` match after and before both `\r` and `\n`, respectively, but never
between a `\r` and `\n`.
* `RegexBuilder::line_terminator` was added to further customize the line
terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
* The `std` Cargo feature is now actually optional. That is, the `regex` crate
can be used without the standard library.
* Because `regex 1.9` may make binary size and compile times even worse, a
new experimental crate called `regex-lite` has been published. It prioritizes
binary size and compile times over functionality (like Unicode) and
performance. It shares no code with the `regex` crate.

New features:

* [FEATURE #244](https://github.com/rust-lang/regex/issues/244):
One can opt into CRLF mode via the `R` flag.
e.g., `(?mR:$)` matches just before `\r\n`.
* [FEATURE #259](https://github.com/rust-lang/regex/issues/259):
Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
* [FEATURE #476](https://github.com/rust-lang/regex/issues/476):
`std` is now an optional feature. `regex` may be used with only `alloc`.
* [FEATURE #644](https://github.com/rust-lang/regex/issues/644):
`RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
* [FEATURE #675](https://github.com/rust-lang/regex/issues/675):
Anchored search APIs are now available in `regex-automata 0.3`.
* [FEATURE #824](https://github.com/rust-lang/regex/issues/824):
Add new `Captures::extract` method for easier capture group access.
* [FEATURE #961](https://github.com/rust-lang/regex/issues/961):
Add `regex-lite` crate with smaller binary sizes and faster compile times.

Performance improvements:

* [PERF #68](https://github.com/rust-lang/regex/issues/68):
Added a one-pass DFA engine for faster capture group matching.
* [PERF #510](https://github.com/rust-lang/regex/issues/510):
Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan
for `@`.
* [PERF #787](https://github.com/rust-lang/regex/issues/787),
[PERF #891](https://github.com/rust-lang/regex/issues/891):
Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have
specific issues devoted to them.)

Bug fixes:

* [BUG #429](https://github.com/rust-lang/regex/issues/429):
Fix matching bugs related to `\B` and inconsistencies across internal engines.
* [BUG #517](https://github.com/rust-lang/regex/issues/517):
Fix matching bug with capture groups.
* [BUG #579](https://github.com/rust-lang/regex/issues/579):
Fix matching bug with word boundaries.
* [BUG #779](https://github.com/rust-lang/regex/issues/779):
Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
* [BUG #850](https://github.com/rust-lang/regex/issues/850):
Fix matching bug inconsistency between NFA and DFA engines.
* [BUG #921](https://github.com/rust-lang/regex/issues/921):
Fix matching bug where literal extraction got confused by `$`.
* [BUG #976](https://github.com/rust-lang/regex/issues/976):
Add documentation to replacement routines about dealing with fallibility.
* [BUG #1002](https://github.com/rust-lang/regex/issues/1002):
Use corpus rejection in fuzz testing.


1.8.4 (2023-06-05)
==================
This is a patch release that fixes a bug where `(?-u:\B)` was allowed in