Word splitting sometimes fails with accents #851

LilithHafner · 2023-10-14T17:02:37Z

I got this error:

error: `noe` should be `not`, `no`, `node`, `know`, `now`, `note`
  --> ./stdlib/Unicode/src/Unicode.jl:239:2
    |
239 | "noël"
    |  ^^^
    |

Which I assume comes from mis-splitting noël into noe and something else.


epage · 2023-10-16T16:42:02Z

For reference, here is our word splitting algorithm, which I believe I forked out of another crate (this seems to be a fairly common problem). The key part is the classify function. It has a wider problem: it only considers ASCII lower case and upper case letters to be characters. I'm assuming we'll need to add a "continuation" mode, since accented characters are classified as neither lower case nor upper case. I'm unsure which characters should be part of this continuation class. We'd likely want anything in XID to be considered a character. As for things like accents, I'm still not sure.
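To make the "continuation" idea concrete, here is a minimal sketch, not the crate's actual code. The names `CharClass`, `classify`, and `split_words` are hypothetical. It assumes that any non-ASCII alphabetic character (approximated with `char::is_alphabetic`, not a true XID check) and anything in the Combining Diacritical Marks block (U+0300 to U+036F) simply extends the current word. Boundary detection is deliberately simplified to separators and lower-to-upper transitions.

```rust
/// Hypothetical character classes for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CharClass {
    Lower,
    Upper,
    Digit,
    /// Extends the current word and never triggers a boundary.
    Continuation,
    Separator,
}

/// Classify one character. ASCII is handled first; everything else that
/// looks like a letter or a combining accent becomes `Continuation`.
/// Assumption: `is_alphabetic` stands in for a real XID_Continue check.
fn classify(c: char) -> CharClass {
    if c.is_ascii_lowercase() {
        CharClass::Lower
    } else if c.is_ascii_uppercase() {
        CharClass::Upper
    } else if c.is_ascii_digit() {
        CharClass::Digit
    } else if c.is_alphabetic() || ('\u{0300}'..='\u{036F}').contains(&c) {
        CharClass::Continuation
    } else {
        CharClass::Separator
    }
}

/// Split an identifier into words on separators and on lower-to-upper
/// transitions. Continuation characters stay attached to the open word.
fn split_words(ident: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start: Option<usize> = None; // byte offset of the open word
    let mut prev = CharClass::Separator; // last non-continuation class seen

    for (i, c) in ident.char_indices() {
        let class = classify(c);
        match class {
            CharClass::Separator => {
                // Close the open word, if any, and drop the separator.
                if let Some(s) = start.take() {
                    words.push(&ident[s..i]);
                }
                prev = CharClass::Separator;
            }
            CharClass::Continuation => {
                // Extend the open word, or open one; `prev` is unchanged.
                if start.is_none() {
                    start = Some(i);
                }
            }
            _ => {
                // camelCase boundary: a lower case run followed by upper case.
                if prev == CharClass::Lower && class == CharClass::Upper {
                    if let Some(s) = start.take() {
                        words.push(&ident[s..i]);
                    }
                }
                if start.is_none() {
                    start = Some(i);
                }
                prev = class;
            }
        }
    }
    if let Some(s) = start {
        words.push(&ident[s..]);
    }
    words
}
```

With this sketch, `split_words("noël")` yields the single word `noël` whether the ë is precomposed or written as `e` followed by U+0308, while `fooBar` and `snake_case` still split into two words each.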

epage · 2023-10-16T16:44:27Z

I overlooked something, but it turned out not to be a problem: we correctly identify noël as a single identifier, so the bug is only in our word splitting.

That does offer a short-term workaround: we could just say that any identifier with non-ASCII characters doesn't get split and is always accepted.
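The short-term workaround could look roughly like the following sketch. Both `words_to_check` and `split_ascii_identifier` are hypothetical names, and the ASCII splitter is a simple stand-in (it splits on non-alphanumeric characters only), not the real algorithm. The point is just the early return: an identifier containing any non-ASCII character produces no words to spell-check, so it is always accepted.

```rust
/// Stand-in for the existing ASCII-only splitter (hypothetical):
/// splits on anything that is not an ASCII letter or digit.
fn split_ascii_identifier(ident: &str) -> Vec<&str> {
    ident
        .split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|w| !w.is_empty())
        .collect()
}

/// Return the words that should be spell-checked for an identifier.
/// Identifiers with non-ASCII characters are never split; returning an
/// empty list means nothing is checked, so they are always accepted.
fn words_to_check(ident: &str) -> Vec<&str> {
    if !ident.is_ascii() {
        return Vec::new();
    }
    split_ascii_identifier(ident)
}
```

The trade-off is that real typos inside identifiers like `noël` would go unreported until the splitter itself handles non-ASCII characters.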

LilithHafner mentioned this issue Oct 14, 2023

Word splitting sometimes fails with accents codespell-project/codespell#3145

Closed

epage added the bug Not as expected label Oct 16, 2023
