
Stricter syntax around repeat modifiers #765

Closed
marmeladema opened this issue Apr 28, 2021 · 11 comments

@marmeladema
Contributor

Hello and thank you for your work!

I work on different projects that use the regex crate intensively and I recently started doing some comparisons with other regex engines around different aspects, mainly syntax, performance and memory usage.

That comparison brought attention to what could be considered "invalid" syntax that is currently accepted by this crate.
I will leave it to you whether these examples should be accepted or not:

  1. ^*\.google\.com$: repetition of the start anchor: accepted by the golang regex engine, refused by hyperscan, refused by python (the same question applies to other repeat modifiers like ? and +)
  2. a**\.google\.com$: the same repeat modifier given multiple times consecutively (you can specify as many as you like): refused by golang, refused by hyperscan, refused by python (the same question applies to other repeat modifiers like ? and +, except that ?? has a special meaning).
  3. a*+?*+?\.google\.com$: multiple different repeat modifiers in a row: refused by golang, refused by hyperscan, refused by python. According to the documentation, maybe only *?, +? and ?? should be accepted?
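For reference, the first two rejections mentioned above are easy to reproduce with Python's stdlib re. This is a minimal sketch (the compiles() helper is mine, introduced just for illustration):

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Repetition of the start anchor: Python rejects it ("nothing to repeat").
print(compiles(r"^*\.google\.com$"))   # False
# Doubled repeat modifier: Python rejects it ("multiple repeat").
print(compiles(r"a**\.google\.com$"))  # False
```

The Rust regex crate, by contrast, accepts both patterns, which is the behavior this issue is about.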

Version tested:

  • rust-lang/regex 1.4.6 (latest version at the time of creation of this issue)
  • hyperscan 5.4.0
  • Python 3.9.4 (using re.compile)
  • golang: https://regoio.herokuapp.com/

I think PCRE2 will accept two consecutive repeat modifiers like **, but not more; however, I haven't verified these examples against PCRE2 yet.

If these examples are considered bugs, would they be fixable? There might be backward-compatibility concerns here.
If they are not considered bugs, maybe this could be explained/detailed in the docs?

In any case, I would want to refuse such syntax in the systems I operate.
In your opinion, what would be the best approach here? Would you recommend using the regex-syntax crate to parse and analyze the AST to detect such constructs?

Thank you for your time :)

@BurntSushi
Member

In any case, I would want to refuse such syntax in the systems I operate.

This seems like the crux of the matter, but I don't understand why. Could you elaborate?

I'll respond more fully to this issue once I have more time and I'm on my workstation.

@BurntSushi
Member

BurntSushi commented Apr 28, 2021

TL;DR - It's not really clear to me why we should adopt a "stricter" syntax here, other than consistency with other regex engines.

I think the irony of this issue is that the regex crate probably has the strictest regex syntax of any of the libraries you bring up. For example, while this crate rejects a{5, Python, Go and PCRE2 all accept it, as they do "heuristic" parsing of brackets and fall back to a literal interpretation if a fully formed bounded repetition isn't written. Another example is \/. The regex crate rejects it because it's a meaningless escape sequence, but Python, Go and PCRE2 accept it.
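Both of those lenient behaviors can be confirmed directly with Python's stdlib re, as a quick sketch:

```python
import re

# Python falls back to a literal interpretation of the unclosed brace,
# so a{5 matches the three characters "a{5" literally.
assert re.fullmatch(r"a{5", "a{5") is not None

# Python also accepts the superfluous escape \/ as a literal slash.
assert re.fullmatch(r"\/", "/") is not None

print("both accepted by Python")
```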

(I'm not going to bother checking Hyperscan since I think it's supposed to match PCRE2 precisely, but more so because I don't have an easy way of running it right now.)

So with that said, let's go through each of your regexes and I'll say more about them specifically.

^*\.google\.com$

The minimal version of this is ^*. Python, Go and PCRE2 do indeed all reject it. However, all of them accept (?:^)*, which is semantically equivalent! All of them also accept boundary assertions repeated literally, e.g., ^^ is valid.

There is nothing inherently wrong with repeating a boundary assertion. It just doesn't make much sense. However, actually banning all such nonsensical constructions is hard. Go, Python and PCRE2 ban some of the more obvious ones, but the ban is only surface deep.

a**\.google\.com$

Again, the minimal regex here is a**. It's the same deal as ^*. All of Go, Python and PCRE2 accept (?:a*)*. This is another one of those cases where there is nothing inherently invalid about a**. It just doesn't make much sense.

a*+?*+?\.google\.com$

Again, same as above. Although there are some interesting differences here. e.g., PCRE2 accepts a*+ but neither Python nor Go do. But of course, all of them accept (?:a*)+. (EDIT: Ah I think PCRE2 accepts a*+ because the + is a "possessive" modifier on the * IIRC that limits backtracking.)
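To sketch just how surface-deep these bans are, Python rejects the bare forms yet happily compiles the semantically equivalent grouped forms:

```python
import re

# The bare surface forms are rejected by Python's parser...
for bad in (r"a**", r"^*"):
    try:
        re.compile(bad)
        print(f"{bad}: accepted")
    except re.error as err:
        print(f"{bad}: rejected ({err})")

# ...but wrapping the inner repetition in a non-capturing group
# produces an equivalent pattern that compiles fine.
assert re.compile(r"(?:^)*") is not None
assert re.compile(r"(?:a*)*").fullmatch("aaa") is not None
print("grouped equivalents accepted")
```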

If these examples are considered bugs, would they be fixable?

They aren't bugs. I intentionally designed the syntax this way. It never made sense to me why these constructs were disallowed even though they were trivial to work around. It's possible that backtracking engines like Python and PCRE2 disallow things like a** because it might have performance consequences. I'm not sure. But in the regex crate, that shouldn't be true.

There might be backward compatibility concerns here?

Correct. Even if I agree with you that these things should be disabled, they can't be because it would break backcompat. It would have to wait for regex 2, which I have no plans of releasing any time soon.

In any case, I would want to refuse such syntax in the systems I operate.

Again, this is the key point. As I asked above, I have no idea why you would want to refuse the syntax. There's really no practical harm that can come from it, and even if you disabled the surface-level versions like Python, Go and PCRE2 do, it would be trivial to work around.

If they are not considered bugs, maybe it could be explained/detailed in the doc?

I don't think there is anything in the doc that would suggest they are disallowed. It would seem weird to me to call out this specific case. I think it would be better in a document that tried to note the differences between the regex crate and other regex engines. #497 tracks that (and I've just added a line item about nonsensical repetitions).

@BurntSushi
Member

As a side note, I personally would be interested in the history of why some regex engines disallow things like ^* and a**.

@intgr

intgr commented Jul 9, 2021

TL;DR - It's not really clear to me why we should adopt a "stricter" syntax here, other than consistency with other regex engines.

Yes. I agree with OP that the examples have nonsensical constructs. But I agree with BurntSushi that any such decision would be arbitrary.

But what matters a lot more is that I can take a regex used by one system, put it in Rust code and expect it to do the same thing.

Currently this project seems to implement its own Yet Another Dialect. Maybe a reasonable solution is to adopt some existing dialect fully? Or at least in arbitrary questions like these, strive towards compatibility with one particular dialect?

That leads to another arbitrary question: which dialect to follow? I think there should be a preference for a dialect that has a specification and multiple implementations. ECMAScript's RegExp (ECMA-262) is one such spec; are there others?

The ECMA-262 dialect is also used by the JSON Schema IETF draft (https://json-schema.org/understanding-json-schema/reference/regular_expressions.html) and at one point was mentioned in the OpenAPI spec (OAI/OpenAPI-Specification#1725) - although I think they're converging on JSON Schema now.

@BurntSushi
Member

BurntSushi commented Jul 9, 2021

But what matters a lot more is that I can take a regex used by one system, put it in Rust code and expect it to do the same thing.

It would be nice, but in practice, this is almost never true between any two regex engines.

Maybe a reasonable solution is to adopt some existing dialect fully?

No, it's not reasonable. And this suggestion seems like a major scope increase from the original topic of this issue. Questions like these basically amount to, "scrap the entire existing project, scrap the design goals, and do a complete rewrite." (I'm not sure if you realized this implication of your question, but either way, questions like that are frustrating to field.)

If you need a regex engine implemented in Rust that precisely matches a spec, and you hold that goal above all others, then you need to go write your own independent regex engine.

are there others?

Yes. POSIX.

@intgr

intgr commented Jul 11, 2021

scrap the entire existing project, scrap the design goals, and do a complete rewrite.

You're overreacting. My point wasn't that regex must become a fully ECMA-262 compliant regex engine. Sorry if it came across that way. I don't think that regex should be rewritten.

Rather, my point was that nobody benefits from frivolous differences between engines. Whether or not to accept ^* or ** is, I think, frivolous: there is no "correct" answer about whether this syntax should be accepted. If someone wrote that in a regex, it was almost certainly a mistake.

So instead of debating what the correct answer should be, I think it makes more sense to look at what a standard has decided and just go with that.

It would be nice, but in practice, this is almost never true between any two regex engines.

I find that in practice, it's very often true. Obviously there are differences in implementation capabilities -- whether an engine supports features such as backreferences, lookaround, unbounded lookbehind, etc. But if a regex is supported, it generally behaves the same.

This is from experience working with the almost 4000 regexes from https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos on 3 different implementations (Python's re, Python's regex and now Rust's fancy-regex).

@BurntSushi
Member

There is no "correct" answer about whether this syntax should be accepted or not.

I agree.

So instead of debating what the correct answer should be, I think it makes more sense to look at what a standard has decided and just go with that.

This only works if the only question is that a simple decision needs to be made. But as I've said in previous comments in this issue, that isn't the only concern.

I find that in practice, it's very often true.

It's not. I'm sorry that I don't have the time or motivation to prove this statement wrong, but it is. While I did have "capabilities" in mind before, those aren't even remotely close to the only differences.

But if a regex is supported, it generally behaves the same.

Well, hang on, yes, of course I agree! The word "generally" is a weasel word. I do not mean that in a bad way. I mean that, yes, if a regex is supported, most regex engines are going to provide the same answer in the vast majority of cases. That fits with "generally."

This issue is not about what "generally" works. This is very clearly about a corner case of a pathological syntactic difference. The word "generally" does not fit in this issue.

You're overreacting.

If I am, then it's only because I misunderstood you, or because you said something you didn't mean. Emphasis mine:

Maybe a reasonable solution is to adopt some existing dialect fully?

@BurntSushi
Member

I've added the regex2 label to make it clear that the only way this change could land is in a hypothetical regex 2.0 release. I have no specific concrete plans to make a regex 2.0 release, and even if I agreed with making this change, it would not be enough to motivate such a release.

@intgr

intgr commented Jul 11, 2021

If I am, then it's only because I misunderstood you, or because you said something you didn't mean. Emphasis mine:

Maybe a reasonable solution is to adopt some existing dialect fully?

Yes, I did suggest that. Immediately following that, I also offered an alternative suggestion:

Or at least in arbitrary questions like these, strive towards compatibility with one particular dialect?

"Fully" would be my ideal world. But I would also be happy to see a tentative acknowledgement of some standard.


I mean that, yes, if a regex is supported, most regex engines are going to provide the same answer in the vast majority of cases.

This got off track, I think we're in agreement here. Yes, it's likely that if you try hard enough, you can find differences between any two engines. My original statement ...

But what matters a lot more is that I can take a regex used by one system, put it in Rust code and expect it to do the same thing.

... was just trying to make the case that any change that reduces the differences between engines is an improvement for users. Not necessarily that implementations need to be 100% compatible.

@BurntSushi
Member

But I would also be happy to see a tentative acknowledgement of some standard.

I can pretty confidently say that this will never happen.

@BurntSushi
Member

BurntSushi commented Mar 7, 2022

Upon re-visiting this issue in light of #847, I'm going to close it for the following reasons:

  1. I don't think this issue has really evolved beyond my initial set of questions to the OP.
  2. The discussion following my initial set of questions was less about a specific pathology in the syntax and more about a scope increase towards adopting a standard. I am pretty sure that will never happen.
