Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite loop in Pattern#compile on certain case-insensitive patterns #168

Open
fmeum opened this issue Aug 24, 2023 · 15 comments
Open

Infinite loop in Pattern#compile on certain case-insensitive patterns #168

fmeum opened this issue Aug 24, 2023 · 15 comments

Comments

@fmeum
Copy link

fmeum commented Aug 24, 2023

Pattern.compile("(?i)ᲀ") hangs in an infinite loop with JDK 11 and 17, but not 8. Other characters from the Cyrillic Extended-C block trigger the same behavior.

Stack trace captured with jcmd:

at java.lang.Character.toLowerCase(java.base@19.0.2/Character.java:10710)
at com.google.re2j.Characters.toLowerCase(Characters.java:14)
at com.google.re2j.Unicode.simpleFold(Unicode.java:118)
at com.google.re2j.Parser.minFoldRune(Parser.java:200)
at com.google.re2j.Parser.newLiteral(Parser.java:187)
at com.google.re2j.Parser.literal(Parser.java:211)
at com.google.re2j.Parser.parseInternal(Parser.java:807)
at com.google.re2j.Parser.parse(Parser.java:790)
at com.google.re2j.RE2.compileImpl(RE2.java:185)
at com.google.re2j.Pattern.compile(Pattern.java:151)
at com.google.re2j.Pattern.compile(Pattern.java:109)

I first submitted this as a security issue since I understood the README to indicate that re2j should be considered safe to use on untrusted regexes of bounded length, but was told that this is not the case and that there are many known ways to "crash" re2j with crafted regular expressions.

I would thus welcome a clear statement on the security guarantees provided by re2j in the README.

Edit: Looks like https://groups.google.com/g/re2j-discuss/c/9XkVsnxngjc could resolve this issue.

@junyer
Copy link
Contributor

junyer commented Aug 30, 2023

RE2/J currently uses tables generated for Unicode 6.0 whereas U+1C80 was introduced in Unicode 9.0 and folds to U+0412, which folds to U+0432, which folds to U+0412, …

The Go regexp package uses the Go unicode package; the version of Unicode is inherently tied to the version of Go. RE2/J needs the version of Unicode to be tied to the version of Java as @sjamesr said; regenerating the UnicodeTables class may kick the can down the road, but I think it's worth finding a workaround for the infinite loop.

@rsc
Copy link

rsc commented Aug 30, 2023

To elaborate a bit, this is happening because the orbit tables in simpleFold must match the calls to toLower/toUpper in simpleFold. Otherwise simpleFold does not provide its documented guarantee that repeated calls will cycle through all the characters in the fold set. In particular if you start with something that is new (like U+1C80) but missing from the orbit table then toLower/toUpper/toLower/toUpper never goes back to U+1C80 so minFoldRune never terminates.

It is a bug to make RE2/J use the JDK's Unicode tables without some strategy for the orbit tables. The easiest fix is probably to check in simpleFold at the fallback that after l = toLower(r), l != r implies toUpper(l) == r. If this is not true, then the code should probably stop and compute new orbit tables by iterating over all possible runes and calling toLower. This will be slightly expensive but it need only happen once in the execution of a given program.

@sjamesr
Copy link
Contributor

sjamesr commented Aug 30, 2023

I've been thinking about how to eliminate the need for built-in Unicode tables. I was wondering if you think it would be a good idea to introduce some more instructions into the RE2J machine. For example, if we have

\p{Braille}

This turns into

rune cc{0x2800-0x28ff}

What if we introduced a series of new instructions, e.g.

script Braille
block Whitespace
category Separator

which would match the respective Unicode scripts, blocks or general categories. This might eliminate the need to generate and embed Unicode tables. JDK and RE2J would never again disagree about Unicode properties of codepoints.

On the other hand, the current architecture is nice: it is easy to generate arbitrarily selective character classes. E.g., if we have:

[abc\p{Greek}]

this currently translates into a character class containing all the Greek characters, plus a, b and c. In the new world, we'd have to rewrite this to be something like

(?:[abc]|\p{Greek})

@junyer
Copy link
Contributor

junyer commented Aug 30, 2023

[^abc\p{Greek}] and [^abc\P{Greek}] would be even trickier, I imagine. 🤔

Looking at simpleFold(), there are at least three loops in callers where it should Just Work™: this in minFoldRune() as per the stack trace above; this in appendFoldedRange(), which I discovered via a stack trace from a "stuck" unit test; and this in equalsIgnoreCase(). There's some non-loop usage as well. Any alternative would need to be able to handle all such cases.

@sjamesr
Copy link
Contributor

sjamesr commented Aug 30, 2023

Negative matches like \P{Greek} could be handled by inverse versions of the proposed instructions, e.g. not_script, not_block, not_category.

Eliminating the built-in tables seems desirable to me. Having built-in tables in any form mean that RE2/J may differ from JDK in its understanding of Unicode, potentially leading to bugs/surprises beyond just case folding.

@junyer
Copy link
Contributor

junyer commented Aug 30, 2023

For sure, but making scripts, blocks, categories et cetera first-class citizens doesn't seem like it solves the problem at hand? Specifically, how would (?i:\x{1C80}) be handled (i.e. while parsing, compiling and executing) without the UnicodeTables class? (I had been wondering how to avoid taking a dependency on ICU at runtime in order to have UCharacter.foldCase(), but I'm seeing claims that Character.toLowerCase(Character.toUpperCase(…)) is an alternative...)

@sjamesr
Copy link
Contributor

sjamesr commented Aug 30, 2023

I think that the JDK regex implementation would compare the input to the lower-, title- and upper-case versions of the pattern rune. If we did this, at least we'd be consistent with JDK.

@junyer
Copy link
Contributor

junyer commented Aug 30, 2023

I'm not seeing any use of titlecase for case-insensitive matching in the JDK implementation, but this and this corroborate the Character.toLowerCase(Character.toUpperCase(…)) claims that I saw. RE2/J could imitate this, but I haven't considered the ramifications for most of the callers of simpleFold() or, indeed, any of the code that checks for RE2.FOLD_CASE.

@rsc
Copy link

rsc commented Aug 30, 2023

According to Unicode definition of simple folding, (?i)k needs to match k (U+006B), K (U+004B), and K (U+212A, Kelvin sign). There is no way to start at lowercase k and use lower-, title- or upper-case to get to the Kelvin sign. Are you saying the JDK regex implementation does not follow the Unicode spec as far as folding?

Personally, I believe that RE2/J being consistent with RE2, Go, and Rust is more important than consistency with JDK regex.

If you need to not have precompiled tables, you could have RE2/J compile the orbit table on first use of (?i). As I noted, the way to compile it is to call unicode.toLowerCase on every rune 0..10FFFF and also unicode.toUpperCase on each of those results, discard the trivial results, and then group the non-trivial inputs by their results to form orbits. The time for this will be dominated by the 1 million toLowerCase calls and the 1 million toUpperCase calls. I don't know how long that takes in Java, so I can't say what kind of latency that would add to RE2/J's first use. In Go, those calls take a total of 45ms on my laptop.

@sjamesr
Copy link
Contributor

sjamesr commented Aug 31, 2023

There is no way to start at lowercase k and use lower-, title- or upper-case to get to the Kelvin sign.

If the pattern is (?i)k and the input is \u+212a (Kelvin sign), Java will call Character.toLowerCase(Character.toUpperCase(input)), which yields k (0x006B), so it will match correctly. If the pattern is the Kelvin sign and the input is lowercase k, it will be lowercase the pattern (again yielding lowercase k), so it matches correctly.

If java.util.regex can get away without iterating over all codepoints (which would impose a latency that could be surprising), why can't we?

@rsc
Copy link

rsc commented Aug 31, 2023

RE2/J can avoid calculating orbits if it gives up on complex character classes being precompiled down to basic ranges. It would instead have to represent character classes as more complex parse trees and execute those parse trees on each character. It's viable, but it may be slower.

@rsc
Copy link

rsc commented Aug 31, 2023

There are also other rewriting optimizations that have to be disabled if you don't have orbits. For example [Aa][Bb][Cc] optimizes to (?i)abc, but if you don't have orbits, you can't distinguish that case from [Jj][Kk][Ll], which must not optimize to (?i)jkl.

@sjamesr
Copy link
Contributor

sjamesr commented Aug 31, 2023

I've been working on a prototype of what the script and not_script instructions may look like, please see master...sjamesr:re2j:add_script_instruction

The parser is changed to deal with Unicode classes a bit differently:

  • \p{Greek} -> script{Greek} (previously, it would parse to cc{... range of Greek chars ...})
  • [abc\p{Greek}] -> alt{script{Greek}cc{0x61-0x63}} (was previously cc{abc + range of Greek chars)

The machine now understands script (matches if the current rune has the given script) and not_script (matches if the current rune does not have the current script). These instructions do not respect the case insensitive flag. I will keep working on this if you feel the change has merit.

The script tables are removed from the generated Unicode tables.

I have not disabled the optimization you're talking about (i.e. [Aa][Bb][Cc] -> (?i)abc), this is another change I would make if you think this change has merit.

The exhaustive ExecTests pass, though they probably shouldn't. :-)

More downsides to this change:

  • requires Java 1.7 to build
  • doesn't work with GWT

@rsc
Copy link

rsc commented Sep 5, 2023

I am not the RE2J maintainer so I don't think I should be passing judgement on whether the change has merit.

@herbyderby
Copy link
Collaborator

Could we just generate full orbit tables, which would allow us to avoid using the JDK upper/lower methods altogether? I would be surprised if there was a significant performance impact.

As long as we ensure that the generated tables stay at or ahead of the JDK Unicode (which should be relatively easy), I doubt that it will cause confusion in practice, because most changes are for new code points. Case folding changes to existing code points are pretty uncommon.

Historically the JDK has lagged behind the latest Unicode version quite a bit, so always using the JDK data could be problematic. At Google, at least, Unicode version consistency between RE2/J and RE2 is important.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants