Invalid back reference error #112

mdecimus · 2023-08-08T15:03:35Z

Hi,

I am writing a tool that needs to be able to parse and evaluate regular expressions originally written for Perl. Overall the library works great, but I am getting an Invalid back reference error when trying to parse the following regex:

[\042\223\224\262\263\271]{2}\S{0,16}[\042\223\224\262\263\271]{2}

This regex is parsed properly by Perl and by online tools such as regex101.com. To make it work with fancy-regex I need to replace the octal escapes with their corresponding Unicode escape sequences:

[\"\u{93}\u{94}\u{B2}\u{B3}\u{B9}]{2}\S{0,16}[\"\u{93}\u{94}\u{B2}\u{B3}\u{B9}]{2}

Not a big deal, but I am opening this issue in case you consider not being able to parse those octal codes a bug in the library.
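For anyone hitting the same error, the manual rewrite can be automated. Below is a minimal sketch, not part of fancy-regex: `octal_escapes_to_hex` is a hypothetical helper name, and it assumes every backslash followed by exactly three octal digits is meant as an octal escape (which holds for this pattern, but would misread a backreference such as `\123` in a pattern with that many groups). It emits `\x{HH}` escapes rather than `\u{HH}`.

```rust
/// Rewrites three-digit octal escapes (`\ooo`) as `\x{HH}` escapes.
/// Hypothetical helper: it assumes `\ooo` is never a backreference.
fn octal_escapes_to_hex(pattern: &str) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::with_capacity(pattern.len());
    let mut i = 0;
    while i < chars.len() {
        if chars[i] != '\\' {
            out.push(chars[i]);
            i += 1;
            continue;
        }
        // Collect up to three octal digits that directly follow the backslash.
        let digits: String = chars[i + 1..]
            .iter()
            .take(3)
            .take_while(|c| ('0'..='7').contains(*c))
            .collect();
        if digits.len() == 3 {
            let value = u32::from_str_radix(&digits, 8).expect("three octal digits");
            out.push_str(&format!("\\x{{{:X}}}", value));
            i += 4; // backslash plus three digits
        } else {
            // Copy the backslash and the escaped character unchanged, so that
            // an escaped backslash followed by digits is not misread.
            out.push('\\');
            if let Some(&next) = chars.get(i + 1) {
                out.push(next);
            }
            i += 2;
        }
    }
    out
}
```

Applied to the pattern above, this turns `\042\223` into `\x{22}\x{93}` and leaves `\S{0,16}` untouched.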

Thanks.


robinst · 2023-08-09T00:52:31Z

We try to be compatible with Oniguruma syntax, which supports octal, so this is something we might add. How does Perl disambiguate between \1 being a backreference vs octal?

mdecimus · 2023-08-09T09:30:27Z

> We try to be compatible with Oniguruma syntax, which supports octal, so this is something we might add. How does Perl disambiguate between \1 being a backreference vs octal?

I've never used Perl, so I can't say for sure, but this is how I think it works:

- A back reference refers to a previously matched capturing group.
- If the character after the backslash is a digit and the sequence doesn't conform to the octal format (especially if it doesn't start with 0), Perl interprets it as a back reference. For example, \1 would refer to the first capturing group.
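To make the question concrete, here is a rough sketch of one possible disambiguation rule. It assumes the rule as I understand it from Perl's perlrebackslash documentation: a single digit is a backreference, a leading 0 means octal, and a longer number N is a backreference only if at least N groups have been seen so far; inside a character class the escape is treated as octal. The names `Escape` and `classify` are made up for illustration, and this is not how fancy-regex or Oniguruma currently behave.

```rust
#[derive(Debug, PartialEq)]
enum Escape {
    Backreference(u32),
    Octal { value: u32, digits_used: usize },
}

/// Classifies the run of decimal digits that follows a backslash.
/// `groups_seen` is the number of capture groups opened so far.
/// Returns None when the digits fit neither interpretation.
fn classify(digits: &str, groups_seen: u32, in_char_class: bool) -> Option<Escape> {
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    if !in_char_class && !digits.starts_with('0') {
        // Parsing fails only on overflow, in which case we fall through to octal.
        if let Ok(n) = digits.parse::<u32>() {
            // Assumed rule: a single digit is always a backreference;
            // a longer number only if that many groups already exist.
            if digits.len() == 1 || n <= groups_seen {
                return Some(Escape::Backreference(n));
            }
        }
    }
    // Otherwise take up to three leading octal digits as an octal escape.
    let octal: String = digits
        .chars()
        .take(3)
        .take_while(|c| ('0'..='7').contains(c))
        .collect();
    if octal.is_empty() {
        return None;
    }
    let value = u32::from_str_radix(&octal, 8).ok()?;
    Some(Escape::Octal { value, digits_used: octal.len() })
}
```

Under these assumptions `\223` in a pattern with two groups comes out as octal 0o223 (U+0093), while `\10` is a backreference only once ten groups have been opened.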

I've just found another valid Perl regular expression that can't be parsed by fancy-regex:

/[({[<][. ]*(?-i:\xbc\xba[. ]*\xc0\xce[. ]*)?(?-i:\xb1\xa4(?:[. ]*|[\x00-\x7f]{0,3})\xb0\xed|\xc1\xa4[. ]*\xba\xb8|\xc8\xab[. ]*\xba\xb8)[. ]*[)}\]>]/

I had to escape every special character so it could be parsed:

/[\(\{\[\<][. ]*(?-i:\xbc\xba[. ]*\xc0\xce[. ]*)?(?-i:\xb1\xa4(?:[. ]*|[\x00-\x7f]{0,3})\xb0\xed|\xc1\xa4[. ]*\xba\xb8|\xc8\xab[. ]*\xba\xb8)[. ]*[\)\}\]\>]/
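Escaping every character was the safe route. My guess, and it is only an assumption, is that the unescaped `[` inside the bracketed class is what trips the parser, since the regex crate documents `[x[^xyz]]` as a nested character class. If that is right, a preprocessing pass along these lines might be enough. `escape_brackets_in_classes` is a hypothetical helper, and it does not handle POSIX bracket expressions such as `[[:alpha:]]`.

```rust
/// Escapes a bare `[` that appears inside a bracketed character class.
/// Hypothetical helper; POSIX classes like `[[:alpha:]]` are NOT handled.
fn escape_brackets_in_classes(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len() + 8);
    let mut chars = pattern.chars().peekable();
    let mut in_class = false;
    while let Some(c) = chars.next() {
        match c {
            '\\' => {
                // Copy an escape sequence through untouched.
                out.push(c);
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            '[' if !in_class => {
                in_class = true;
                out.push(c);
                // A leading `^` negates the class; keep it as is.
                if chars.peek() == Some(&'^') {
                    out.push('^');
                    chars.next();
                }
                // In Perl a `]` placed first in the class is a literal.
                if chars.peek() == Some(&']') {
                    out.push_str("\\]");
                    chars.next();
                }
            }
            // A bare `[` inside a class is a literal in Perl, so escape it.
            '[' => out.push_str("\\["),
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            _ => out.push(c),
        }
    }
    out
}
```

On the opening class of the regex above, this changes `[({[<]` to `[({\[<]` and leaves the already escaped `\]` in the closing class alone.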

BurntSushi · 2023-08-09T09:38:47Z

I'm not involved with fancy-regex (but I'm the author of regex on which this crate is built), and I can say that you're going to have a bad time if you assume any two regex engines are going to accept the same syntax. And even if they accept the same syntax, they may still behave differently.

Maybe a case here and there can be smoothed out, but in general, if you need to be able to "parse and match regexes written for Perl," then I think you have three choices:

1. Use Perl.
2. Spend an enormous amount of effort translating regexes from the Perl flavor to the fancy-regex flavor. (This may actually be a Sisyphean task.)
3. Drop the requirement or "be okay" with some regexes not working.

(This same discussion has repeated itself several times in different forms on the regex crate repo.)

robinst · 2023-08-09T09:39:18Z

Looks like onig doesn't require a leading 0, need to check what it does when it's ambiguous: https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md#28-onig_syn_op_esc_octal3-enable-ooo-octal-codes

mdecimus · 2023-08-09T09:52:43Z

> I'm not involved with fancy-regex (but I'm the author of regex on which this crate is built), and I can say that you're going to have a bad time if you assume any two regex engines are going to accept the same syntax. And even if they accept the same syntax, they may still behave differently.

I am currently porting SpamAssassin to Rust, which relies on hundreds of Perl regular expressions (many of them very inefficient), so my plan is to replace those regexes that don't work with fancy-regex or are inefficient (such as /<!--(?:\s{1,10}[-\w'"]{1,40}){100}/im) with native code.

I have already fixed all the expressions that couldn't be parsed; I just opened this issue in case the author(s) wanted to support a syntax that Perl and other engines consider valid.

HerringtonDarkholme mentioned this issue Dec 12, 2023

[feature] consider using fancy-regex to support look-around and backtracking ast-grep/ast-grep#763
