Describe the bug
Compiling the Pomsky expression [word] targeting the Python flavor produces \w. But in Python, \w matches only a subset of what it should match according to the Unicode spec. One bug report also says that it matches all of \p{N} instead of only \p{Nd}.
To Reproduce
Run pomsky -f python '[word]+'
Run regex-test -f python '\w+' -t "\u0939\u093f\u0928\u094d\u0926\u0940"
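The same behavior can be checked directly in Python without the CLI tools. This is a minimal sketch using only the standard `re` module; the variable names are illustrative and not part of Pomsky. It applies `\w+` to the test string from the step above and also probes a character from `\p{No}`.

```python
import re

# The Hindi word from the reproduction step, written with escapes.
text = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Python's re counts letters, numeric characters and "_" as word characters.
# Combining vowel signs and the virama are not included, so a single word
# is split wherever one of them occurs.
fragments = re.findall(r"\w+", text)
print(fragments)

# U+00BD (VULGAR FRACTION ONE HALF) is in \p{No}, not \p{Nd}.
matches_fraction = re.fullmatch(r"\w", "\u00bd") is not None
print(matches_fraction)
```

If `\w` followed the Unicode definition, the whole word would come back as one match and the fraction would be rejected.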
Expected behavior
Note that Python's re module does not support Unicode properties, so it's impossible to polyfill proper Unicode support.
Therefore, [word] should be forbidden in the Python regex flavor, unless Unicode is disabled; then it should produce [a-zA-Z0-9_].
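Both claims can be illustrated on the Python side. The sketch below is only an assumption-labeled demonstration, not Pomsky's implementation: the helper `re_supports_unicode_properties` is a hypothetical name, and `re.ASCII` is used here as the Python-side counterpart of disabling Unicode.

```python
import re


def re_supports_unicode_properties() -> bool:
    # Hypothetical helper: probes whether re accepts a \p{...} escape.
    try:
        re.compile(r"\p{L}")
    except re.error:
        return False
    return True


# With re.ASCII, \w is restricted to ASCII word characters, which is the
# same set as the explicit class proposed above.
sample = "na\u00efve_user42"
ascii_matches = re.findall(r"\w+", sample, flags=re.ASCII)
explicit_matches = re.findall(r"[a-zA-Z0-9_]+", sample)

print(re_supports_unicode_properties())
print(ascii_matches == explicit_matches)
```

Emitting the explicit class `[a-zA-Z0-9_]` rather than `\w` has the advantage that the result does not depend on which flags the caller passes to `re`.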
This is not a satisfactory solution, however, since this makes it impossible to match non-ASCII word characters. Some people may find \w useful even though it is incorrect and only matches a subset of word characters. That is why another Python flavor should be added, targeting the regex module, which has much better Unicode support.
Alternatives
Add a nonstandard_unicode mode, so \w can be used in flavors where \w matches some non-ASCII word characters, but not all (i.e. Python and .NET)
Related
python/cpython#44795