Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex-syntax error messages highlight incorrect characters/not handling graphemes correctly #1132

Open
V0ldek opened this issue Nov 30, 2023 · 4 comments

Comments

@V0ldek
Copy link

V0ldek commented Nov 30, 2023

What version of regex are you using?

1.10.2 (latest)

Describe the bug at a high level.

Parsing a regex string with combined characters/grapheme clusters and an error causes the error message to highlight the wrong character. This can make the error impossible to accurately track down based on the error information.

What are the steps to reproduce the behavior?

Consider this program creating a Regex from a string with a grapheme cluster and then an error (in this case, invalid hex character in the unicode escape).

use regex;

fn main() {
    let str = r"क्\u12G4";
    regex::Regex::new(str).unwrap();
}

The error is:

called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    क्\u12G4
          ^
error: invalid hexadecimal digit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)

(playground link)

as you can see the 4 is highlighted, while the actual error occurs one grapheme earlier, at the G. This is relatively benign, but note that grapheme clusters can cause the error arrow to move arbitrarily far in the stream, for example to an otherwise correct occurrence of a syntax element:

use regex;

fn main() {
    let str = r"षकषषक्षक्षक्षक्षक्षक्षक्षक्ष्\u12G4 some text \u1234";
    regex::Regex::new(str).unwrap();
}
called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    षकषषक्षक्षक्षक्षक्षक्षक्षक्ष्\u12G4 some text \u1234
                                     ^
error: invalid hexadecimal digit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)

(playground link)

What is the expected behavior?

An ideal solution would be to have the error arrow be aware of grapheme clusters and show up underneath the correct one for the user.

In absence of that, it would help to include the index of the character that caused the error in textual format. In the first example it'd say "character 7" or at least "byte index 10".

@BurntSushi
Copy link
Member

BurntSushi commented Nov 30, 2023

Yeah indeed, this is an unfortunate artifact of the current implementation. I realized this when I wrote it and made exactly this trade off. Basically, I perceive three ways to fix this issue:

  1. Get rid of nice error messages and just emit what happened without actually pointing at the syntax via carets.
  2. Use a library like unicode-width that attempts to determine the displayed width of &str and char. Although it's not clear to me that it correctly supports graphemes. Maybe it does.
  3. Roll our own version of unicode-width with out own Unicode tables.

My philosophy on whether to bring in dependencies for this crate effectively rules out (2). I try to keep the dependency tree very small.

We could go with (1), which means this sort of confusing case won't happen. But it makes error messages substantially worse (IMO) in the common case.

Which I think leaves option (3). I'm open to going that route, but it will need to satisfy the following things:

  1. It will need to be behind a feature that can be disabled. e.g., unicode-error or something. It can be enabled by default.
  2. The implementation needs to be manageable. That is, I'd expect something even simpler than unicode-width. Just a simple table in a similar format as the other Unicode tables, generated by ucd-generate.
  3. The actual code to integrate it into error message formatting is reasonable.
  4. There should not need to be an implementation of parsing grapheme clusters. I don't mean to say that I know this to be true, but rather, if getting this right means segmenting grahpemes, then I think it's going to eclipse my complexity threshold tolerance for fixing an issue like this.

I don't have any plans to work on this myself any time soon, so PRs are welcome. I would suggest posting a straw man implementation path before a PR so that we can have a meeting of the minds on implementation strategy. One can submit a PR first, but posting a more detailed proposal might help avoid wasting time.

@V0ldek
Copy link
Author

V0ldek commented Nov 30, 2023

What about my last suggestion?

In absence of that, it would help to include the index of the character that caused the error in textual format. In the first example it'd say "character 7" or at least "byte index 10".

It would at least allow the user to diagnose the syntax error manually. This doesn't rule out working on a more robust solution like what you describe, but would be a big help with (I think) minimal effort.

@BurntSushi
Copy link
Member

BurntSushi commented Nov 30, 2023

I would be open to a PR adding the "byte offset" phrasing. That's a decent idea because it's at least more precise.

The "character" phrasing would, I believe, require grapheme segmentation to get correct.

@pascalkuthe
Copy link
Contributor

One thing I would like to add as somebody who spend way too much time on geapheme width: There is no such thing as a universally agreed upon width for graphene clusters. Unicode essentially shrugs and tells you to ask the font (really the notion of monospace width just doesn't make sense for some more exotic scripts where graphemes are important).

There is a definition which is commonly used for terminals which the unicode-width crate implements. The problem is this is a per codepoint width definition not per-graphem.

Terminal emulators disagree on the exact width (some do grapheme segmentation and break backwards compatability others don't, different emulators support different unicode versions, ...).

unicode-width is essentially a pretty well optimized lookup table. I implemented something like this myself in the past. The only thing you can do to simplify it is strip out the legacy cjk handling. Their lookup table is actually pretty well optimized (they compress their LUT, there is a lot of redundancy in the table otherwise which blows up binary size).

I am not sure what you policy is with dependencies exactly but it's worth noting that most downstream users that do care about rendering diagnostics probably pull in unicode-width for their own rendering anyway. To avoid having two different lut in the binary using the dependency could be preferable (behind an optional off by default feature flag since I imagine most users do not care).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants