add anchored search APIs #675

Closed
Alphare opened this issue May 7, 2020 · 4 comments

Alphare commented May 7, 2020

In Mercurial, we have to use ^(?:<patterns>), where <patterns> is the combination of all patterns (transformed to regex) in the user's .hgignore, to remove the additional .* on each end. The user could input a regex like )(, which is invalid on its own but would be accepted with the workaround, creating a useless capturing group.

Adding an option to RegexBuilder and its bytes cousin seems like a good solution to me.

@BurntSushi
Member

OK, so I have some varied thoughts on this.

I think at a high level, adding an option as described here is probably the wrong path to take. In particular, your problem doesn't really have anything to do with ^ specifically, but rather, it's a problem of composition. ripgrep actually does similar things with regex composition in order to implement its -w/--word-regexp flag, but rg ')(' -w correctly exits with a syntax error. This is because it first attempts to parse the regex as given before trying to compose them. If you're doing regex composition, then I think this is the only correct way to do it. More specifically, you could depend on regex-syntax to run just the parser to check for syntax validity. It's a little extra work, but should overall be pretty cheap compared to the entire regex compilation process.

With that said, an "anchored search" is indeed kind of a special case. I think my plan at the moment is not to surface this as a compile-time option, but rather a search-time option. That is, perhaps in addition to Regex::find there will also be Regex::find_anchored (or whatever name). But that needs at least some API design work and won't happen until at least #656 is done.

@BurntSushi BurntSushi changed the title Add option in RegexBuilder to not add .* around pattern add anchored search APIs May 8, 2020
@Alphare
Author

Alphare commented May 11, 2020

You're right that checking the pattern first with regex-syntax should be pretty inconsequential in terms of runtime compared to building the DFA and even more so compared to the rest of the program. I have to thank whoever decided to split regex in modular crates to make this so easy. ;)

We will be using this "workaround" (if that's really the term) until Regex::find_anchored becomes part of the API, thanks.

@BurntSushi
Member

BurntSushi commented May 11, 2020

No problem. And yeah, it's kind of a work-around for this specific case, but for general composition, I think it's right.

It is plausible that some kind of API for this should/could be exposed in regex proper. Maybe not full syntax parsing, but something like parse_regex(&str) -> Result<(), Error>, which just checks whether the regex is valid without compiling it, would I think be sufficient for composition. Then folks wouldn't need to depend on regex-syntax explicitly. (Which, while convenient, is still primarily supposed to be an implementation detail of regex.)

> to building the DFA

Just to make sure your mental model is right here, the regex crate currently never builds a full DFA ahead of time. It builds an NFA first, and depending on which matching engine is selected, will either execute the search directly with the NFA or will build the DFA lazily one state at a time during a search. This is the same execution model as RE2.

(In the future, I expect there will be some cases where building the DFA ahead of time is done, but only when doing so would be very cheap and use very little space.)

@BurntSushi
Member

BurntSushi commented Mar 6, 2023

I think once #656 lands, it will be possible to achieve this using regex-automata's "meta" regex engine. It will support this sort of flexibility with a richer set of search options.

I'm not sure if it will ever make it into regex proper though, unfortunately, since it would seem to me to require duplicating a lot of the methods.

So for now, I think I'm going to request that folks who need this try out the meta regex engine once regex-automata 0.3 is out. If you run into troubles there, then please file an issue.

BurntSushi added a commit that referenced this issue Jul 5, 2023
I usually close tickets on a commit-by-commit basis, but this refactor
was so big that it wasn't feasible to do that. So ticket closures are
marked here.

Closes #244
Closes #259
Closes #476
Closes #644
Closes #675
Closes #824
Closes #961

Closes #68
Closes #510
Closes #787
Closes #891

Closes #429
Closes #517
Closes #579
Closes #779
Closes #850
Closes #921
Closes #976
Closes #1002

Closes #656
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Jul 18, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex (regex)</summary>

### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07)

[Compare Source](rust-lang/regex@1.9.0...1.9.1)

This is a patch release which fixes a memory usage regression. In the regex
1.9 release, one of the internal engines used a more aggressive allocation
strategy than what was done previously. This patch release reverts to the
prior on-demand strategy.

Bug fixes:

-   [BUG #1027](rust-lang/regex#1027):
    Change the allocation strategy for the backtracker to be less aggressive.

### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05)

[Compare Source](rust-lang/regex@1.8.4...1.9.0)

This release marks the end of a [years long rewrite of the regex crate
internals](rust-lang/regex#656). Since this is
such a big release, please report any issues or regressions you find. We would
also love to hear about improvements as well.

In addition to many internal improvements that should hopefully result in
"my regex searches are faster," there have also been a few API additions:

-   A new `Captures::extract` method for quickly accessing the substrings
    that match each capture group in a regex.
-   A new inline flag, `R`, which enables CRLF mode. This makes `.` match any
    Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and
    `(?m:$)` match after and before both `\r` and `\n`, respectively, but never
    between a `\r` and `\n`.
-   `RegexBuilder::line_terminator` was added to further customize the line
    terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
-   The `std` Cargo feature is now actually optional. That is, the `regex` crate
    can be used without the standard library.
-   Because `regex 1.9` may make binary size and compile times even worse, a
    new experimental crate called `regex-lite` has been published. It prioritizes
    binary size and compile times over functionality (like Unicode) and
    performance. It shares no code with the `regex` crate.

New features:

-   [FEATURE #244](rust-lang/regex#244):
    One can opt into CRLF mode via the `R` flag.
    e.g., `(?mR:$)` matches just before `\r\n`.
-   [FEATURE #259](rust-lang/regex#259):
    Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
-   [FEATURE #476](rust-lang/regex#476):
    `std` is now an optional feature. `regex` may be used with only `alloc`.
-   [FEATURE #644](rust-lang/regex#644):
    `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
-   [FEATURE #675](rust-lang/regex#675):
    Anchored search APIs are now available in `regex-automata 0.3`.
-   [FEATURE #824](rust-lang/regex#824):
    Add new `Captures::extract` method for easier capture group access.
-   [FEATURE #961](rust-lang/regex#961):
    Add `regex-lite` crate with smaller binary sizes and faster compile times.
-   [FEATURE #1022](rust-lang/regex#1022):
    Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

-   [PERF #68](rust-lang/regex#68):
    Added a one-pass DFA engine for faster capture group matching.
-   [PERF #510](rust-lang/regex#510):
    Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan
    for `@`.
-   [PERF #787](rust-lang/regex#787),
    [PERF #891](rust-lang/regex#891):
    Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have
specific issues devoted to them.)

Bug fixes:

-   [BUG #429](rust-lang/regex#429):
    Fix matching bugs related to `\B` and inconsistencies across internal engines.
-   [BUG #517](rust-lang/regex#517):
    Fix matching bug with capture groups.
-   [BUG #579](rust-lang/regex#579):
    Fix matching bug with word boundaries.
-   [BUG #779](rust-lang/regex#779):
    Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
-   [BUG #850](rust-lang/regex#850):
    Fix matching bug inconsistency between NFA and DFA engines.
-   [BUG #921](rust-lang/regex#921):
    Fix matching bug where literal extraction got confused by `$`.
-   [BUG #976](rust-lang/regex#976):
    Add documentation to replacement routines about dealing with fallibility.
-   [BUG #1002](rust-lang/regex#1002):
    Use corpus rejection in fuzz testing.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957
Reviewed-by: crapStone <crapstone01@gmail.com>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>