Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: conditionals #263

Closed
alubbe opened this issue Jul 10, 2016 · 16 comments
Closed

Feature request: conditionals #263

alubbe opened this issue Jul 10, 2016 · 16 comments

Comments

@alubbe
Copy link

alubbe commented Jul 10, 2016

I have some uses cases where it is very useful to have conditionals based on whether previous capture groups were successful, i.e., (?(1)some_regex|another_regex)
Take for example this subject

({a}) = {2}
{(a)} = (2)

I want to match everything between () and {}, whichever closure is relevant. Because the regex describing the content in the closure is very long and complex, I would like to re-use it and have specify different ending characters depending on the starting character, like so:

(?:(\()|({)).*?(?(1)\)|})

Currently, I would have to break it up like this:

\(.*?\)|{.*?}

Which, in this short example, is better - because .*? is very short. But in real life, this is very long and very complex, and I have 5 different kind of starting and ending combinations that I would like to separate. So right now, I'd have to increase the length of the full regex five-fold to get the wanted behaviour. Having conditionals would be really helpful!

@BurntSushi
Copy link
Member

I'll just say this up front: I agree that conditionals can be useful, and that in some circumstances, they may make writing/reading/maintaining regexes easier. (This is also true of many of the bells and whistles you find in other regex libraries, like full blown backreferences and look around.)

However, one design constraint of this library is that matching must have time complexity no greater than linear with respect to the input. In other words, when searching text, every byte of text is examined no more than a constant number of times (where the size of the regex is considered constant).

With that said, this feels an awful lot like backreferences, albeit with a seemingly more limited power. It definitely isn't clear to me how said conditionals could be encoded in a state machine. In particular, a notable property of capturing groups in this library is that they can never impact whether a regex as a whole matches or not. (Certainly, grouping itself can, but the property of whether a group is capturing or not does not.) This feature would change that, such that even if it were possible to do this while maintaining linear search time, it would require substantial changes.

With that said, I do have some suggestions that might help solve your specific problem in a different way:

  1. A common difference between regex engines built on top of finite state machines (like this one) and regex engines that use some variant of backtracking is that the latter engines support look around operators which can often shrink or otherwise reduce redundancy in a regex crafted for the former engine type. Your example seems like it fits in that category. One possible way to ameliorate this would be to treat your .*? as a placeholder and use string interpolation to build up your regex piecewise, which should help with the complexity/redundancy that is plaguing you at the moment (if I'm understanding correctly).
  2. Consider using multiple regexes. This may be slower, but may help simplify things a bit.
  3. If all else fails, there are bindings to other regex engines in Rust (Oniguruma and PCRE) that have lots of bells and whistles but give up guaranteed linear time matching.

@alubbe
Copy link
Author

alubbe commented Jul 10, 2016

Thanks for the detailed response. I was thinking that it might be difficult given the design constraint.

How would you implement 1. for the given regex above? Run [({] and then decide which regex to run next?
2. seems wasteful indeed, and is what I am trying to avoid.

If at any point you see the opportunity to implement some version of this, it would help me a great deal! :)

@BurntSushi
Copy link
Member

BurntSushi commented Jul 11, 2016

For (1), I might not have been clear. I wasn't suggesting using multiple regexes. Instead, I was suggesting to use string interpolation. In particular, you mentioned that this regex \(.*?\)|{.*?} was fine, but only because .*? was short. So what about this instead:

let pat = ".*?";
let re = Regex::new(&format!(r"\({pat}\)|{{{pat}}}", pat)).unwrap();

For (2), whether it's too wasteful or not depends on the problem you're trying to solve and the frequency of matches. For example, if the first regex has a very low false positive rate and the second regex doesn't need to scan the entire input, then the performance could be quite competitive.

If at any point you see the opportunity to implement some version of this, it would help me a great deal!

Sadly, I don't think it's going to happen for reasons stated previously.

@alubbe
Copy link
Author

alubbe commented Jul 11, 2016

Let's say that I have n starting and ending patterns (the example above had
2).

So what you meant by 1) is to chain the n different possibilities using |
into one big regex, right?

How is that different in meaning and performance from 2), where instead of
one big regex you would pass in n smaller regexs to be evaluated at the
same time?

Also, what would be the consequence with regards to memory consumption?
With my limited knowledge it looks as if both of these approaches would
need n times as much memory over a regex using some kind of a conditional,
because the largest part, the middle part of the regex, could be reused
instead of being duplicated n times.
On 11 Jul 2016 11:53 a.m., "Andrew Gallant" notifications@github.com
wrote:

For (1), I might not have been clear. I wasn't suggesting using multiple
regexes. Instead, I was suggesting to use string interpolation. In
particular, you mentioned that this regex (.?)|{.?} was fine, but
only because .*? was short. So what about this instead:

let pat = ".*?";let re = Regex::new(format!(r"({pat})|{{{pat}}}", pat)).unwrap();

For (2), whether it's too wasteful or not depends on the problem you're
trying to solve and the frequency of matches. For example, if the first
regex has a very low false positive rate and the second regex doesn't need
to scan the entire input, then the performance could be quite competitive.

If at any point you see the opportunity to implement some version of this,
it would help me a great deal!

Sadly, I don't think it's going to happen for reasons stated previously.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#263 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AB7yIaInFnT7pkJYhWKgkz9aid2Mj0HEks5qUiCxgaJpZM4JI0Ms
.

@BurntSushi
Copy link
Member

For (1), I was merely suggesting that you use string interpolation to reduce redundancy in writing and maintaining the regular expression. The resulting regex would still be large.

The performance trade offs between one large regex and multiple smaller regexes are not at all clear. One large regex could be just as fast as one small regex. It is not wise to make performance assumptions based on the size of the regex. The only thing you really know for sure is that the bigger the regex the more memory it will use, but you can't really make any conclusions about its match performance. Remember, this library is based on finite state machines, so the way alternations work is very different than how they work in a traditional backtracking engine.

Also, what would be the consequence with regards to memory consumption?
With my limited knowledge it looks as if both of these approaches would
need n times as much memory over a regex using some kind of a conditional,
because the largest part, the middle part of the regex, could be reused
instead of being duplicated n times.

I think the trade offs of conditionals aren't being appropriately account for here. As I said before, it's not clear that a linear time implementation exists. If they turn out to be equivalent to backreferences in expressive power, then it would turn into an NP-complete problem to solve them, which likely takes worst case exponential time.

Whether the additional memory consumption is a problem or not isn't clear. If the regex uses 10KB of heap instead of 1KB, is that an issue?

@alubbe
Copy link
Author

alubbe commented Jul 12, 2016

Fair enough, long regex it is then. There are several capture groups in the middle part and they would now be duplicated as well. I could either use an index offset to identify them or, for readability, use named groups. Are they supported by this crate or would you recommend a different approach?

@BurntSushi
Copy link
Member

@alubbe
Copy link
Author

alubbe commented Jul 12, 2016

Yes, but I couldn't find any documentation on what happens if the same name is used multiple times, as it would be here. Only one capture group will ever be used, i.e., it will not have multiple return values, but I don't think the compiler can deduce that.

@BurntSushi
Copy link
Member

Ah, I see. Yeah, duplicate capture names aren't allowed. A potential alternative behavior is to permit multiple capturing groups with the same name, and fill in the "last" matched value.

@alubbe
Copy link
Author

alubbe commented Jul 12, 2016

Yes, that would be perfectly fine for my use case. That behaviour could live behind a flag, so that you have to explicitly enable multiple named groups.

@BurntSushi
Copy link
Member

Could you say more about the specific use case that benefits from having multiple capture groups with the same name? (It seems like a separate issue than this one.)

@alubbe
Copy link
Author

alubbe commented Jul 12, 2016

We're still trying to do the same thing - imagine .*? was long, complicated and involved 5-10 capture groups. If we multiply it n times, surround each with short regexs and then concatenate them using |, we now have 5n - 10n capture groups. To cope with this, I see two approaches:

  1. Loop over the captures to find the first real capture, and use this as an offset for the other captures
  2. Give the capture groups names, so that we can reference them by that rather than an integer. This would help with the readability and maintainability of the construct

@BurntSushi
Copy link
Member

@alubbe It honestly sounds to me like it would be worth it to build up a good abstraction around the problem you're trying to solve, and in particular, abstract away the details of actually writing the regex in the first place. It seems like you don't actually need duplicate capture names for that.

If you are indeed concatenating a bunch of regexes with |, then you may also find that a RegexSet is useful, which lets you build up a set of regexes, and it will report which ones matched. You'd then have to run the matched regex a second time to extract any capturing groups.

@alubbe
Copy link
Author

alubbe commented Jul 12, 2016

Fair enough, I will try working with an offset on the index to avoid named
captures then.

Would it be possible to configure the RegexSet (by setting a flag or
passing an enum) to return on the first match instead of trying all? In
this case, are there any differences in run time or memory to concatenation
using |?
On 12 Jul 2016 1:36 p.m., "Andrew Gallant" notifications@github.com wrote:

@alubbe https://github.com/alubbe It honestly sounds to me like it
would be worth it to build up a good abstraction around the problem you're
trying to solve, and in particular, abstract away the details of actually
writing the regex in the first place. It seems like you don't actually need
duplicate capture names for that.

If you are indeed concatenating a bunch of regexes with |, then you may
also find that a RegexSet
https://doc.rust-lang.org/regex/regex/struct.RegexSet.html is useful,
which lets you build up a set of regexes, and it will report which ones
matched. You'd then have to run the matched regex a second time to extract
any capturing groups.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#263 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AB7yIYL86D4Lnm7Hg_LjpmziRVAQRYdNks5qU3w7gaJpZM4JI0Ms
.

@BurntSushi
Copy link
Member

@alubbe The point of RegexSet is that it tries all of them simultaneously in a single scan of the text.

There are technically performance differences between "is match," "which matched first" and "which matched." RegexSet currently supports the first and last. #259 is about supporting the second.

are there any differences in run time or memory to concatenation using |?

The match semantics are different. In a RegexSet, multiple regexes can match. When using |, only one of the alternations can match. The performance should be roughly similar.

@alubbe
Copy link
Author

alubbe commented Jul 12, 2016

Awesome, thanks for clarifying!
On 12 Jul 2016 6:04 p.m., "Andrew Gallant" notifications@github.com wrote:

@alubbe https://github.com/alubbe The point of RegexSet is that it
tries all of them simultaneously in a single scan of the text.

There are technically performance differences between "is match," "which
matched first" and "which matched." RegexSet currently supports the first
and last. #259 #259 is
about supporting the second.

are there any differences in run time or memory to concatenation using |?

The match semantics are different. In a RegexSet, multiple regexes can
match. When using |, only one of the alternations can match. The
performance should be roughly similar.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#263 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AB7yIVy8yhvPt7St84zN30wtl-P6gTY2ks5qU7rCgaJpZM4JI0Ms
.

@alubbe alubbe closed this as completed Jul 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants