Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

provide a helper routine for building a static regex using std::lazy #709

Open
hbina opened this issue Sep 17, 2020 · 19 comments
Open

provide a helper routine for building a static regex using std::lazy #709

hbina opened this issue Sep 17, 2020 · 19 comments

Comments

@hbina
Copy link

hbina commented Sep 17, 2020

Introduce a macro that lazily initialise a static regex expression, call it lazy_regex.
This is helpful whenever you want to avoid the compile-at-each-iteration pattern when using regexes across threads.
The code is simply,

macro_rules! lazy_regex_1 {
    ($re:literal $(,)?) => {{
        static RE: SyncOnceCell<regex::Regex> = SyncOnceCell::new();
        RE.get_or_init(|| Regex::new($re).unwrap())
    }};
}

or

macro_rules! lazy_regex_2 {
    ($re:literal $(,)?) => {{
        static RE: SyncLazy<regex::Regex> = SyncLazy::new(|| Regex::new($re).unwrap());
        &RE
    }};
}

Which is only a slight alteration from the one provided in the once_cell documentation. There are many possible variations of this --- there are many ways to lazily initialise a static variable. I think this is something that is common enough that it should be included.

Thank you.

@BurntSushi
Copy link
Member

I don't understand your feature request and I don't understand your diff. Please include more details. Maybe show how the feature you're requesting is intended to be used?

@tesuji
Copy link

tesuji commented Sep 17, 2020

SyncOnceCell is nightly std feature. I don't think regex crate need to introduce a macro that depends on nightly or lazy_static or so.

@BurntSushi
Copy link
Member

OK, so I didn't even know that SyncOnceCell had made its way into std already. @hbina Please include relevant context in feature requests.

In any case, we've gone years without this and I see no reason to get impatient now. Once std::lazy stabilizes, then I'll be happy to revisit this. But I have no plans to add another nightly-only API. In the mean time, I'd encourage folks to brainstorm on what the API should look like. One thing I'd like to know is whether we need a macro or not. I'd rather not.

And to head off the question that I keep getting from people: no const fn can't help us here, or at least, won't be able to help us until the vast majority of Rust is able to be constified and I don't see that happening any time soon. Therefore, I plan to behave as if const fn has no impact.

@BurntSushi BurntSushi changed the title Add a macro to generate a SyncOnceCell regex provide a helper routine for building a static regex using std::lazy Sep 17, 2020
@matklad
Copy link
Member

matklad commented Sep 20, 2020

JFYI, this exists as a crate: https://crates.io/crates/regex-macro

@BurntSushi
Copy link
Member

I'd rather wait to revisit this until std::lazy has been a thing, and unfortunately, it probably needs to be there for at least a year before I'm okay bumping the MSRV to depend on it.

With that said, it's not clear to me that this is something we want to do. It's pretty painless already to compose the Lazy API with Regex. For example:

static RE_GROUP: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"^[-A-Za-z0-9]+$").unwrap());

That's already pretty good IMO. So I think if someone wants to propose a new regex-specific API here, whatever they come up with is going to need to have a meaningful advantage over the above.

@BurntSushi BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale Mar 6, 2023
@matklad
Copy link
Member

matklad commented Mar 6, 2023

That's already pretty good IMO

I'd maybe push back against this tiny bit. To give a concrete example from git-branchless,

Both this:

        lazy_static! {
            static ref RE: Regex = Regex::new(r"^([^ ]+) (.+)$").unwrap();
        };
        match RE.captures(line) {

And this:

        static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"^([^ ]+) (.+)$").unwrap());
        match RE.captures(line) {

do look to me meaningfully worse than a hypothetical:

        match regex!(r"^([^ ]+) (.+)$").captures(line) {

Fore these kind of code, the use of Lazy and the use of static both seem like irrelevant details (which of course is context dependent: in more "production" code you might very well want to be aware when exactly each regex is compiled, but in more "stripty" code you'd rather ignore this, and one often wants to use regex in a scripty code). Similarly, introducing an essentially useless name like RE adds to noise.

At the same time, yes, absolutely, at the moment this issue isn't really actionable on the regex side, because the underlying lazy evaluation primitive isn't yet available in std. And also the implementation is really tiny and is available on crates.io and even in once-cell docs (https://docs.rs/once_cell/latest/once_cell/#lazily-compiled-regex), so there's no practical reason for regex to provide something now.

@BurntSushi
Copy link
Member

Yeah I'd love to re-open this and revisit it. I do personally like the regex!(...) code you've shown. It's nice.

The other angle to put on this here is that regex! actually used to be a thing, and it was for "compile time regex." There is a minute possibility of that coming back some day (years away, if ever). There is also the problem that some folks might interpret it as a compile time regex. (Not necessarily because it used to exist, but because it kind of looks like it. But maybe that's my own bias.)

@matklad
Copy link
Member

matklad commented Jun 3, 2023

std::sync::OnceLock is a thing as of Rust 1.70.0, so in about a year (1.79.0) we could in theory provide this.

@BurntSushi BurntSushi reopened this Jun 4, 2023
@BurntSushi
Copy link
Member

BurntSushi commented Jun 4, 2023

EDIT: I updated this comment on 2024-01-23. I would love to hear folks thoughts on this.

OK, so now that OnceLock is stable, we can revisit this. To summarize the above discussion:

  • A regex! macro can make some uses of regexes much nicer, even when compared to using once_cell::sync::Lazy (which has a std counterpart called LazyLock, but isn't stable yet).
  • The regex crate tries to be somewhat conservative with respect to MSRV. I've somewhat arbitrarily settled on "try to support the last year of Rust releases." Since OnceLock was just stabilized, this means it will probably be a year before we can add it. We could use conditional compilation to sniff the Rust version and only add it for newer versions of Rust, but I've done this sort of thing in the past and I'm kind of "over it." Plus, in this case, it's a public API and I imagine it could be confusing for folks on an older version reading the docs and not being able to use it for non-obvious reasons.
  • Some folks might get the impression that regex!("...") is a compile time regex when in reality it's "just" a convenience. In the past, there was actually a procedural macro named regex! that was a compile time regex.

I believe the macro would be defined as this:

macro_rules! regex {
    ($re:literal $(,)?) => {{
        use {std::sync::OnceLock, regex::Regex};

        static RE: OnceLock<Regex> = OnceLock::new();
        RE.get_or_init(|| Regex::new($re).unwrap())
    }};
}

Its return type is a &'static Regex. It would appear in the crate root. So you'd use it like regex::regex!. I'd also add an analogous version of it to the bytes submodule that creates a regex::bytes::Regex instead.

I have some other concerns too that I've thought of:

  • The macro hides an unwrap call. I don't think this is a huge deal personally since we can define the macro in a way that it only accepts a string literal. We could make the macro return a Result<Regex, Error> and force users to unwrap it, but that seems like a bit of a bummer. Overall I'm personally fine hiding the unwrap as long we clearly document it, but I do think there will be a contingent of users that will be unhappy with hiding the unwrap.
  • The fact that you don't have to unwrap the result of regex!("...") is likely to reinforce the idea that it is a compile time regex. For example, "of course I wouldn't have to unwrap it, since it's a compile time regex my program would fail to compile if the regex were invalid."
  • The macro as proposed unfortunately reinforces a pattern that leads to a pretty unfortunate performance bug in some cases. The underlying performance bug could just be fixed and that would mitigate this concern. (This has been mostly mitigated to the point that I no longer feel too concerned about this.)
  • Using this macro seems to preclude setting any options on a RegexBuilder. I think for the most part that is fine since most options can be set inside the regex syntax itself (.e.g, (?i) instead of RegexBuilder::case_insensitive), but some can't. For example, size limits.
  • The fully qualified name, regex::regex!, seems unfortunately redundant. @matklad suggested some alternatives below, but I think they're too verbose. We could just name it compile! so that its fully qualified use is regex::compile!. But its... unqualified use as just compile! is pretty awful.
  • Last but not least, do we even need a macro for this? I suppose the macro helps us guarantee that the pattern is actually a literal and thus makes the unwrap more palatable, but this can also be written as a function that accepts a &'static str which is pretty close to requiring a literal.

Here's what this would look like as a function (or perhaps a Regex::lazy method or something):

fn regex(pattern: &'static str) -> &'static regex::Regex {
    use std::sync::OnceLock;
    static RE: OnceLock<regex::Regex> = OnceLock::new();
    RE.get_or_init(|| regex::Regex::new(pattern).unwrap())
}

Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=ab56dc43c81e9b16bfbf7fef5ab4805c

(A function doesn't work. This has to be a macro. See @matklad's rebuttal below.)

Note that if you're reading this and wondering "why not just make compile time regex instead," please see #1076 and #1012. This macro is more about codifying a very common pattern that is used to avoid the footgun of compiling a regex in a loop. (A compile time regex would also solve that problem, but as the issues linked explain, there is a lot more to it than the simple macro proposed here that a bunch of folks are already using in one form or another.)

@matklad
Copy link
Member

matklad commented Jun 4, 2023

I was also thinking that we can assuage some semantics concerns here by some creative naming.

Something like

regex::compile_lazy!("[a-z]+")
regex::compile_once!("[a-z]+")

might read nice and self-explanatory.

@BurntSushi
Copy link
Member

Yeah that might be a more precise name, but is annoyingly verbose IMO. :)

@joshtriplett
Copy link
Member

@BurntSushi wrote:

  • We could use conditional compilation to sniff the Rust version and only add it for newer versions of Rust, but I've done this sort of thing in the past and I'm kind of "over it."

What about adding a feature flag to enable the macro, documenting that if you set that feature flag the MSRV increases, and then making the macro unconditionally available once the crate's overall MSRV increases?

@matklad
Copy link
Member

matklad commented Jan 23, 2024

Here's what this would look like as a function

That sadly works for exactly one regex :-)

let foo = regex("foo");
let foo_again = regex("bar");

I could imagine something like

fn regex<const pattern: &str>() -> &'static regex::Regex {
    // Somehow ask the compiler to monomorphise a version of RE for each
    // instantiation of regex
    static RE: OnceLock<regex::Regex> = OnceLock::new(); 
    ...
}

maybe working in Rust one day, but that seems unlikely.

@BurntSushi
Copy link
Member

That sadly works for exactly one regex :-)

Derp. Right.

What about adding a feature flag to enable the macro, documenting that if you set that feature flag the MSRV increases, and then making the macro unconditionally available once the crate's overall MSRV increases?

I suppose, but I'm not sure it's worth it. And that feature will have to stick around forever (or until regex 2, whichever comes first) as a no-op, which is mildly annoying.

@AriaFallah
Copy link

I think this is an awesome idea! With regards to naming, I think regex! is great, and if unwrap is truly a concern there can be a try_regex! since it's a relatively common pattern. I also like @matklad's naming too

Yeah that might be a more precise name, but is annoyingly verbose IMO. :)

Regarding verbosity, I don't think it should be that big of a factor. I mean anything will be better than the current pattern, which is a minimum of a few lines to import OnceCell/OnceLock/lazy_static, set it up, and use it. If more clarity can exist in the api through a descriptive name, then I think prioritizing that over aesthetics isn't so bad.

@LukasKalbertodt
Copy link
Member

"[...] my program would fail to compile if the regex were invalid"

This would be the main feature I would want from a regex! macro. The convenience alone is nice, but not big enough of an advantage to use the name regex! for it, in my opinion. Rust is about compile time checking and has powerful macros, so that compile-time regex validation should be the end goal, I think.

This could be achieved by defining a procedural macro that runs regex_syntax on the input at compile time, but throws away the result, only performing error checking. The regex would still be compiled lazily at runtime, but that is then guaranteed to succeed. To me that would be pretty ideal.

Of course, people always associate long compile times with procedural macros. The described macro could be very light though: the only tricky part is to get the value from the string literal (unescaping), which is implemented in syn or litrs (my crate, but I swear, I didn't comment to promote, I'm just really interested in a compile-time checked regex and saw the call to comment on Twitter 🙂). Of course, regex_syntax would be a dependency of the macro as well. But if not cross compiling, that does not add a compilation step as it's already a dependency of regex.
And finally, we could always hide the proc macro behind a non-default feature flag.

Summary: I think your concerns about "compile time checking is expected" are valid and I think that should be the end-goal of a regex! macro. And I wonder how strong you would oppose this theoretical procedural macro.

@BurntSushi
Copy link
Member

BurntSushi commented Jan 23, 2024

And I wonder how strong you would oppose this theoretical procedural macro.

Pretty strongly opposed. For the following reasons:

  1. Clippy already covers the case of "my regex literal is invalid."
  2. The procedural macro you mention can't actually guarantee that construction succeeds. It can only guarantee that it won't fail from a parse error. It could still fail if the size limit is exceeded.
  3. A procedural macro will add at least one more dependency to the tree. That probably means it won't be enabled by default. Not providing this by default seems like a bummer to me.
  4. My opinion is that "is this literal string a valid regex" is not that useful to guarantee at compile time. It's nice, for sure, but I actually don't think it buys you a lot. The reason is that if the regex is invalid, then you should very quickly discover that fact at runtime.

@LukasKalbertodt
Copy link
Member

4. My opinion is that "is this literal string a valid regex" is not that useful to guarantee at compile time. It's nice, for sure, but I actually don't think it buys you a lot. The reason is that if the regex is invalid, then you should very quickly discover that fact at runtime.

I would disagree. Take this argument a few steps further and it sounds like "why type checking if you have unit tests". Your tests would need to hit the branch using that regex for this to work. When reviewing a PR where regexes are changed, a compile-time-checked regex would give me confidence that all still compile, without having to check whether all these regexes are covered by tests.

But: your other points make total sense to me and I certainly don't want to derail this discussion. I didn't know about the clippy lint and yes, that would give me all the "PR review confidence" I described above. I still think regex!() would somewhat suggest to the user that the syntax is checked at compile time, but sth sth "perfect is the enemy of good". Making the common "static regex" pattern more convenient is useful. The name suggestions of @matklad would potentially already resolve all my concerns.

@BurntSushi
Copy link
Member

BurntSushi commented Jan 23, 2024

Take this argument a few steps further and it sounds like "why type checking if you have unit tests".

But that's not my argument. I didn't take it a few steps further. I looked at this specific thing in context. My argument doesn't generalize. Consider also, for example, "is this regex syntax valid," is usually far less interesting of a question to ask than "does this regex have the semantics I intend." If you're testing out the latter, then you've already automatically cleared the hurdle of the former.

The name suggestions of @matklad would potentially already resolve all my concerns.

I'm still not a fan of a verbose name for this macro. :-\ I like regex::compile! and regex::lazy!, but they only work well if they are always used in their fully qualified form.

Also, every time I define this macro for myself, I call it regex!. That's what I've seen others do as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants