Word splitting sometimes fails with accents #851

Open
LilithHafner opened this issue Oct 14, 2023 · 2 comments
Labels
bug Not as expected

Comments

@LilithHafner

I got this error:

error: `noe` should be `not`, `no`, `node`, `know`, `now`, `note`
  --> ./stdlib/Unicode/src/Unicode.jl:239:2
    |
239 | "noël"
    |  ^^^
    |

I assume this comes from mis-splitting `noël` into `noe` and something else.

@epage
Collaborator

epage commented Oct 16, 2023

For reference, here is our word splitting algorithm, which I believe I forked out of another crate (this seems to be a fairly common problem). The key part is the `classify` function. It has a wider problem: it only considers ASCII lower case and upper case as characters. I'm assuming we'll need to add a "continuation" mode, as these characters are neither lower case nor upper case. I'm unsure which characters should be part of this continuation class. We'd likely want anything in XID to be considered a character. As for things like accents, I'm still not sure.
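To make the "continuation" idea concrete, here is a minimal sketch, not the project's actual code. All names (`CharClass`, `classify_ascii_only`, `classify_with_continuation`, `is_continuation`, `split_words`) are hypothetical, and the splitter is deliberately simplified. It also assumes, without confirmation from the report, that the reported `noe` could arise when the source stores `ë` as `e` followed by the combining diaeresis U+0308. The continuation set used here (non-ASCII alphanumerics plus the U+0300..=U+036F combining block) is only a stand-in for a real XID or mark lookup.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CharClass {
    Lower,
    Upper,
    Digit,
    Continuation,
    Boundary,
}

/// ASCII-only classification: every other character ends the current word.
fn classify_ascii_only(c: char) -> CharClass {
    if c.is_ascii_lowercase() {
        CharClass::Lower
    } else if c.is_ascii_uppercase() {
        CharClass::Upper
    } else if c.is_ascii_digit() {
        CharClass::Digit
    } else {
        CharClass::Boundary
    }
}

/// Assumption: a rough stand-in for "anything in XID, plus accents".
fn is_continuation(c: char) -> bool {
    (!c.is_ascii() && c.is_alphanumeric()) || ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Same as the ASCII classifier, but non-ASCII letters and combining
/// marks extend the current word instead of ending it.
fn classify_with_continuation(c: char) -> CharClass {
    match classify_ascii_only(c) {
        CharClass::Boundary if is_continuation(c) => CharClass::Continuation,
        other => other,
    }
}

/// Simplified splitter: breaks on boundaries and on lower-to-upper transitions.
fn split_words(ident: &str, classify: fn(char) -> CharClass) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start: Option<usize> = None;
    let mut prev = CharClass::Boundary;

    for (idx, c) in ident.char_indices() {
        let class = classify(c);
        match (start, class) {
            // A boundary closes the word in progress.
            (Some(s), CharClass::Boundary) => {
                words.push(&ident[s..idx]);
                start = None;
            }
            (None, CharClass::Boundary) => {}
            // Any non-boundary character opens a new word.
            (None, _) => start = Some(idx),
            // camelCase: an upper-case letter after a lower-case one starts a new word.
            (Some(s), CharClass::Upper) if prev == CharClass::Lower => {
                words.push(&ident[s..idx]);
                start = Some(idx);
            }
            (Some(_), _) => {}
        }
        prev = class;
    }
    if let Some(s) = start {
        words.push(&ident[s..]);
    }
    words
}
```

With the decomposed input `"noe\u{0308}l"`, `split_words(..., classify_ascii_only)` yields `noe` and `l`, which matches the reported error, while `classify_with_continuation` keeps the identifier as one word.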

@epage epage added the bug Not as expected label Oct 16, 2023
@epage
Collaborator

epage commented Oct 16, 2023

I overlooked something, but it turned out not to be a problem. We correctly identify that `noël` is one identifier, so the bug is only in our word splitting.

That does offer a short-term workaround: we could say that any identifier with non-ASCII characters doesn't get split and instead always gets accepted.
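A minimal sketch of that workaround, assuming a hypothetical helper `words_to_check` that returns the words to spell-check for one identifier. The splitting shown for the ASCII case is only a placeholder, not the project's real splitter.

```rust
/// Hypothetical helper: returns the words of `ident` that should be spell-checked.
fn words_to_check(ident: &str) -> Vec<&str> {
    if !ident.is_ascii() {
        // Workaround: identifiers with non-ASCII characters are accepted
        // as-is, so there is nothing to split or check.
        return Vec::new();
    }
    // Placeholder splitter for the ASCII case: break on anything
    // that is not an ASCII letter or digit.
    ident
        .split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|word| !word.is_empty())
        .collect()
}
```

The trade-off is that a genuine typo inside an identifier containing any non-ASCII character would go unreported until the splitter itself handles such characters.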
