Invalid back reference error #112

Open
mdecimus opened this issue Aug 8, 2023 · 5 comments

Comments

mdecimus commented Aug 8, 2023

Hi,

I am writing a tool that needs to be able to parse and evaluate regular expressions originally written for Perl. Overall the library works great, but I am getting an "Invalid back reference" error when trying to parse the following regex:

[\042\223\224\262\263\271]{2}\S{0,16}[\042\223\224\262\263\271]{2}

This regex is parsed properly by Perl and also by online tools such as regex101.com. To make it work in fancy-regex, I need to replace the octal escapes with their corresponding Unicode escape sequences:

[\"\u{93}\u{94}\u{B2}\u{B3}\u{B9}]{2}\S{0,16}[\"\u{93}\u{94}\u{B2}\u{B3}\u{B9}]{2}

Not a big deal, but I am opening this issue in case you consider not being able to parse those octal codes a bug in the library.

Thanks.
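The manual rewrite described above can be automated with a small preprocessing pass. The following is a minimal sketch, not part of fancy-regex: `octal_to_unicode_escapes` is a hypothetical helper name, and it assumes that only escapes with exactly three octal digits should be converted, so that shorter forms such as `\1` are left alone as possible back references. It emits `\u{22}` where the hand-written version above uses `\"`.

```rust
/// Hypothetical helper (not a fancy-regex API): rewrites three-digit octal
/// escapes such as `\042` into the `\u{22}` form used in the issue.
/// Assumption: escapes with fewer than three octal digits are left untouched.
fn octal_to_unicode_escapes(pattern: &str) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::with_capacity(pattern.len());
    let mut i = 0;
    while i < chars.len() {
        if chars[i] != '\\' {
            out.push(chars[i]);
            i += 1;
            continue;
        }
        // Collect up to three octal digits that follow the backslash.
        let digits: String = chars[i + 1..]
            .iter()
            .take(3)
            .take_while(|c| ('0'..='7').contains(*c))
            .collect();
        if digits.len() == 3 {
            // Three octal digits always fit in a u32, so this cannot fail.
            let value = u32::from_str_radix(&digits, 8).unwrap();
            out.push_str(&format!("\\u{{{:X}}}", value));
            i += 4;
        } else {
            // Any other escape: copy the backslash and the escaped character
            // unchanged, so that `\\` is never misread as an escape prefix.
            out.push('\\');
            if let Some(&next) = chars.get(i + 1) {
                out.push(next);
            }
            i += 2;
        }
    }
    out
}
```

Applied to the first pattern in this issue, the helper produces `[\u{22}\u{93}\u{94}\u{B2}\u{B3}\u{B9}]{2}\S{0,16}[\u{22}\u{93}\u{94}\u{B2}\u{B3}\u{B9}]{2}`. Note that Perl may treat these escapes as raw bytes rather than Unicode code points, so whether the converted pattern matches the same input depends on how the text is decoded.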

Contributor

robinst commented Aug 9, 2023

We try to be compatible with Oniguruma syntax, which supports octal, so this is something we might add. How does Perl disambiguate between \1 being a backreference vs octal?

Author

mdecimus commented Aug 9, 2023

> We try to be compatible with Oniguruma syntax, which supports octal, so this is something we might add. How does Perl disambiguate between \1 being a backreference vs octal?

I've never used Perl, so I can't say for sure, but this is how I think it works:

  • A back reference refers to a previously matched capturing group.
  • If the character after the backslash is a number and doesn't conform to the octal format (especially if it doesn't start with 0), then Perl interprets it as a back reference. For example, \1 would refer to the first capturing group.
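For comparison, Perl's own documentation (perlrebackslash) describes a slightly different rule set: a leading 0 always means octal, a single digit is always a back reference, and a longer number N is a back reference only if at least N capture groups have already been seen; otherwise its leading octal digits (at most three) form an octal escape. The sketch below encodes that reading. The type and function names are hypothetical, and the rules should be checked against the Perl documentation before relying on them.

```rust
/// Hypothetical classification of a `\<digits>` escape outside a character class.
#[derive(Debug, PartialEq)]
enum DigitEscape {
    BackReference(usize),
    Octal(u32),
}

/// Parses the leading octal digits (at most three) of `digits`, if any.
fn leading_octal(digits: &str) -> Option<u32> {
    let prefix: String = digits
        .chars()
        .take(3)
        .take_while(|c| ('0'..='7').contains(c))
        .collect();
    if prefix.is_empty() {
        None
    } else {
        u32::from_str_radix(&prefix, 8).ok()
    }
}

/// `digits` is the run of decimal digits after the backslash;
/// `groups_seen` is the number of capture groups opened so far.
/// Assumption: this follows the rules as summarized in the lead-in above.
fn classify_digit_escape(digits: &str, groups_seen: usize) -> Option<DigitEscape> {
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // A leading zero always marks an octal escape.
    if digits.starts_with('0') {
        return leading_octal(digits).map(DigitEscape::Octal);
    }
    let n: usize = digits.parse().ok()?;
    // A single digit is always a back reference; a longer number is one
    // only if that many groups have already been seen.
    if digits.len() == 1 || n <= groups_seen {
        return Some(DigitEscape::BackReference(n));
    }
    // Otherwise fall back to an octal escape, if the digits allow it.
    leading_octal(digits).map(DigitEscape::Octal)
}
```

Under this reading, `\223` in a pattern with no capture groups is an octal escape, which is consistent with Perl accepting the regex from the first comment.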

I've just found another valid Perl regular expression that can't be parsed by fancy-regex:

/[({[<][. ]*(?-i:\xbc\xba[. ]*\xc0\xce[. ]*)?(?-i:\xb1\xa4(?:[. ]*|[\x00-\x7f]{0,3})\xb0\xed|\xc1\xa4[. ]*\xba\xb8|\xc8\xab[. ]*\xba\xb8)[. ]*[)}\]>]/

I had to escape every special character so it could be parsed:

/[\(\{\[\<][. ]*(?-i:\xbc\xba[. ]*\xc0\xce[. ]*)?(?-i:\xb1\xa4(?:[. ]*|[\x00-\x7f]{0,3})\xb0\xed|\xc1\xa4[. ]*\xba\xb8|\xc8\xab[. ]*\xba\xb8)[. ]*[\)\}\]\>]/
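The manual escaping shown above can also be done mechanically. This is a minimal sketch that mirrors what was done by hand: `escape_class_metachars` is a hypothetical helper, and the set of characters it escapes is simply the set escaped in this comment; the thread does not establish which of them fancy-regex actually requires. It assumes simple `[...]` classes and does not handle POSIX bracket expressions such as `[[:alpha:]]` or a literal `]` placed first in a class.

```rust
/// Hypothetical helper (not a fancy-regex API): adds a backslash before
/// `(`, `)`, `{`, `}`, `[`, `<` and `>` when they appear unescaped inside
/// a `[...]` character class. Everything else is copied unchanged.
fn escape_class_metachars(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut in_class = false;
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        match c {
            // Copy existing escapes verbatim, including an escaped `]`.
            '\\' => {
                out.push(c);
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            // An unescaped `[` outside a class opens one.
            '[' if !in_class => {
                in_class = true;
                out.push(c);
            }
            // An unescaped `]` inside a class closes it.
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            // Inside a class, escape the characters escaped by hand above.
            '(' | ')' | '{' | '}' | '[' | '<' | '>' if in_class => {
                out.push('\\');
                out.push(c);
            }
            _ => out.push(c),
        }
    }
    out
}
```

For the pattern in this comment, the opening class `[({[<]` becomes `[\(\{\[\<]` and the closing class `[)}\]>]` becomes `[\)\}\]\>]`, while groups and quantifiers outside of classes are left as they are.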

BurntSushi commented Aug 9, 2023

I'm not involved with fancy-regex (but I'm the author of regex on which this crate is built), and I can say that you're going to have a bad time if you assume any two regex engines are going to accept the same syntax. And even if they accept the same syntax, they may still behave differently.

Maybe a case here and there can be smoothed out, but in general, if you need to be able to "parse and match regexes written for Perl," then I think you have three choices:

  1. Use Perl.
  2. Spend an enormous amount of effort translating regexes from the Perl flavor to the fancy-regex flavor. (This may actually be a Sisyphean task.)
  3. Drop the requirement or "be okay" with some regexes not working.

(This same discussion has repeated itself several times in different forms on the regex crate repo.)

Contributor

robinst commented Aug 9, 2023

Looks like onig doesn't require a leading 0. I need to check what it does when the escape is ambiguous: https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md#28-onig_syn_op_esc_octal3-enable-ooo-octal-codes

Author

mdecimus commented Aug 9, 2023

> I'm not involved with fancy-regex (but I'm the author of regex on which this crate is built), and I can say that you're going to have a bad time if you assume any two regex engines are going to accept the same syntax. And even if they accept the same syntax, they may still behave differently.

I am currently porting SpamAssassin to Rust. It relies on hundreds of Perl regular expressions (many of them very inefficient), so my plan is to replace the regexes that don't work in fancy-regex or are inefficient (such as /<!--(?:\s{1,10}[-\w'"]{1,40}){100}/im) with native code.

I have already fixed all the expressions that couldn't be parsed. I only opened this issue in case the author(s) wanted to support a syntax that Perl and other engines consider valid.
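As an illustration of what replacing a regex with native code could look like, here is a minimal sketch for the example pattern `<!--(?:\s{1,10}[-\w'"]{1,40}){100}`. It is not taken from the actual port: all function names are hypothetical, and it assumes that `\s` maps to `char::is_whitespace` and `\w` to alphanumeric characters plus `_`, which may differ from Perl's definitions. Because whitespace and token characters never overlap, each group can be scanned greedily in a single pass without backtracking.

```rust
/// Assumed equivalent of the class `[-\w'"]`.
fn is_token_char(c: char) -> bool {
    c.is_alphanumeric() || matches!(c, '_' | '-' | '\'' | '"')
}

/// Counts the leading characters of `s` that satisfy `pred` and returns
/// that count together with the remainder of the string.
fn split_run(s: &str, pred: impl Fn(char) -> bool) -> (usize, &str) {
    let mut count = 0;
    let mut end = 0;
    for (idx, c) in s.char_indices() {
        if !pred(c) {
            break;
        }
        count += 1;
        end = idx + c.len_utf8();
    }
    (count, &s[end..])
}

/// Checks for 100 consecutive `\s{1,10}[-\w'"]{1,40}` groups at the start of `rest`.
fn matches_from(mut rest: &str) -> bool {
    for group in 1..=100 {
        let (spaces, after_spaces) = split_run(rest, char::is_whitespace);
        if !(1..=10).contains(&spaces) {
            return false;
        }
        let (tokens, after_tokens) = split_run(after_spaces, is_token_char);
        // Every group but the last must be followed by whitespace, so its
        // token run cannot exceed 40. The last group only needs one token
        // character, since the match may end in the middle of a longer run.
        if tokens == 0 || (group < 100 && tokens > 40) {
            return false;
        }
        rest = after_tokens;
    }
    true
}

/// Hypothetical native replacement for the example regex.
fn has_long_comment_run(text: &str) -> bool {
    text.match_indices("<!--")
        .any(|(i, _)| matches_from(&text[i + 4..]))
}
```

Each `<!--` occurrence is checked in time linear in the text that follows it, which avoids the repeated backtracking a regex engine may perform on the nested bounded quantifiers.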
