Dealing with unicode character counts #1187

Answered by BurntSushi
spyoungtech asked this question in Q&A
Grapheme clusters are a total red herring here. They aren't relevant. Python doesn't use grapheme cluster offsets. It uses codepoint offsets. I also don't really understand the \r\n diversion. They are two distinct ASCII characters. Python strings will treat them as two distinct characters, just like Rust strings. The offsets will even be the same. Python might normalize line endings when doing I/O, but that shouldn't be a concern for interfacing the regex crate with Python strings.
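To make the distinction concrete, here is a minimal Python sketch. It assumes the offsets coming from the Rust side are UTF-8 byte offsets (Rust's `str` is UTF-8, and its indices are byte positions), and the helper `byte_to_codepoint_offset` is a hypothetical name used for illustration, not part of any library. It shows that Python counts codepoints rather than grapheme clusters, that `\r\n` is two characters with identical offsets in both representations, and that offsets only diverge once a non-ASCII character appears earlier in the string.

```python
def byte_to_codepoint_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Python (codepoint) offset.

    Hypothetical helper for illustration. Assumes byte_offset falls on a
    codepoint boundary; otherwise decode() raises UnicodeDecodeError.
    """
    encoded = text.encode("utf-8")
    return len(encoded[:byte_offset].decode("utf-8"))


# "\r\n" is two ASCII characters: the codepoint and byte lengths agree.
ascii_text = "a\r\nb"
assert len(ascii_text) == 4
assert len(ascii_text.encode("utf-8")) == 4

# Python counts codepoints, not grapheme clusters: "e" followed by
# U+0301 (combining acute accent) renders as one glyph but has length 2.
combining = "e\u0301"
assert len(combining) == 2

# Offsets diverge only after a multi-byte character: U+00E9 takes
# 2 bytes in UTF-8, so everything after it shifts by one byte.
text = "\u00e9\r\nb"
codepoint_index = text.index("b")                # Python str offset
byte_index = text.encode("utf-8").index(b"b")    # UTF-8 byte offset
assert codepoint_index == 3
assert byte_index == 4
assert byte_to_codepoint_offset(text, byte_index) == codepoint_index
```

Re-encoding the whole string on every call is linear in its length, so for many offsets into the same string it would likely be better to build the mapping once; this sketch favours clarity over speed.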

To respond more holistically, it's important to get terminology correct here. The issue is not that this crate "counts Unicode characters differently" from Python. That would be very bad. The issue is that the o…
