Unable to construct match-all regex while opted out of unicode #1168

Answered by BurntSushi
Fogapod asked this question in Q&A
(?-u:.) can match invalid UTF-8. It can absolutely match mid-codepoint because it matches an arbitrary byte (any byte other than \n, unless the s flag is also set). You either need to use something that won't match invalid UTF-8 or disable the requirement that regexes only match UTF-8.
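
To make the mid-codepoint point concrete, here is a small standard-library-only sketch. It does not use the regex crate at all; it just lists the offsets at which a one-byte span, the kind of span a match-any-byte pattern could report, would not be valid UTF-8 on its own. The function name `invalid_one_byte_matches` is made up for illustration.

```rust
/// Returns every offset `i` such that the one-byte span `i..i + 1` of
/// `haystack` is not valid UTF-8 by itself. A pattern that matches an
/// arbitrary byte could report exactly these spans as matches.
/// (Hypothetical helper, for illustration only.)
fn invalid_one_byte_matches(haystack: &str) -> Vec<usize> {
    let bytes = haystack.as_bytes();
    (0..bytes.len())
        // A lone lead byte or a lone continuation byte fails validation.
        .filter(|&i| std::str::from_utf8(&bytes[i..i + 1]).is_err())
        .collect()
}
```

For "aé" (bytes 0x61 0xC3 0xA9) this reports offsets 1 and 2, since either half of the two-byte encoding of "é" is invalid alone. For an all-ASCII haystack such as "abc" it reports nothing.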

"can match invalid UTF-8" is interpreted in a maximalist sense. It may be the case that the regex cannot match invalid UTF-8 for your particular haystack. But regex construction doesn't know that. In that case, it may indeed be appropriate to disable the requirement that the regex must always match valid UTF-8.

Answer selected by Fogapod