`\w` in Python does not conform to Unicode #86

Aloso · 2023-03-28T11:54:08Z

Describe the bug

Compiling the Pomsky expression `[word]` targeting the Python flavor produces `\w`. However, Python's `\w` matches only a subset of what it should match according to the Unicode spec. One bug report also states that it matches all of `\p{N}` instead of only `\p{Nd}`.

To Reproduce

Run `pomsky -f python '[word]+'`

Run `regex-test -f python '\w+' -t "\u0939\u093f\u0928\u094d\u0926\u0940"`
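The same behavior can be checked directly in Python, without Pomsky. This is a minimal sketch using only the standard `re` module; the variable names are illustrative and not part of the original report.

```python
import re

# The Hindi word from the reproduction step above.
text = "\u0939\u093f\u0928\u094d\u0926\u0940"

# The letters (general category Lo) match \w, but the vowel signs and the
# virama (categories Mc and Mn) do not, so the word is split into fragments.
fragments = re.findall(r"\w+", text)
print(fragments)

# \w never matches the word as a whole.
whole_word = re.fullmatch(r"\w+", text)
print(whole_word)

# \w also accepts numeric characters outside \p{Nd},
# such as U+00BD VULGAR FRACTION ONE HALF (category No).
matches_fraction = re.fullmatch(r"\w", "\u00bd") is not None
print(matches_fraction)
```

A conforming `\w` would match the whole word in one piece, since Unicode's definition of word characters includes combining marks.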

Expected behavior

Note that Python's `re` module does not support Unicode properties, so it's impossible to polyfill proper Unicode support.

Therefore, `[word]` should be forbidden in the Python regex flavor, unless Unicode is disabled; then it should produce `[a-zA-Z0-9_]`.
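As a sanity check on the ASCII fallback, the sketch below compares the proposed character class against `\w` compiled with `re.ASCII`, which is how Python itself restricts `\w` to ASCII. The comparison range (the Basic Multilingual Plane) and the variable names are choices made for this illustration only.

```python
import re

# The character class proposed for [word] when Unicode is disabled.
proposed = re.compile(r"[a-zA-Z0-9_]")

# Python's own ASCII-only interpretation of \w.
ascii_word = re.compile(r"\w", re.ASCII)

# Collect every code point in the BMP where the two patterns disagree.
mismatches = [
    cp
    for cp in range(0x10000)
    if bool(proposed.fullmatch(chr(cp))) != bool(ascii_word.fullmatch(chr(cp)))
]
print(len(mismatches))
```

If the list is empty, emitting `[a-zA-Z0-9_]` gives the same result as `\w` under `re.ASCII`, without requiring the caller to pass a flag.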

This is not a satisfactory solution, however, because it makes it impossible to match non-ASCII word characters. Some people may find `\w` useful even though it is incorrect and matches only a subset of word characters. That is why another Python flavor should be added, targeting the `regex` module, which has much better Unicode support.

Alternatives

Add a `nonstandard_unicode` mode, so `\w` can be used in flavors where `\w` matches some non-ASCII word characters, but not all (i.e. Python and .NET).
