Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please define "portable" #55

Open
RunDevelopment opened this issue Dec 23, 2022 · 9 comments
Open

Please define "portable" #55

RunDevelopment opened this issue Dec 23, 2022 · 9 comments
Labels
bug Something isn't working C-compat Compatibility between regex flavors

Comments

@RunDevelopment
Copy link

RunDevelopment commented Dec 23, 2022

Hello!

I am someone interested in regexes, so pomsky caught my eye. This project has a lot of good ideas (I especially like ranges), but there is one thing that really bugs me: How portable is pomsky?

Pomsky advertises itself as "a portable, modern regular expression language", but I don't see it as very portable.

Examples:

  1. Predefined character classes are often defined very differently among languages. E.g. \s is defined as [ \t\n\x0B\f\r] or [\p{IsWhite_Space}] = [\x09-\x0d \x85\xa0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000] in Java and as [\f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff] in JavaScript. However, pomsky doesn't address those differences. [s] compiles to \s for both JavaScript and Java.

  2. % compiles to \b in JavaScript, which is a huge issue because it is not Unicode-aware. Documenting this doesn't make it more portable :(

  3. Forward references behave differently in different languages. Here is how the same regex behaves in Python and JavaScript:

    >>> r = re.compile(r"^(?:(a)|\1\1){2}$")
    >>> r.fullmatch("a") is not None
    False
    >>> r.fullmatch("aa") is not None
    True
    >>> r.fullmatch("aaa") is not None
    True
    > r = /^(?:(a)|\1\1){2}$/
    > r.test("a")
    true
    > r.test("aa")
    true
    > r.test("aaa")
    false

    This is because Python and JavaScript have different semantics for how capturing groups capture text. In JavaScript, entering a group (capturing or not) resets the captured text of all capturing groups in it. So when (?:(a)|\1\1) is entered in the second iteration, the captured text of (a) is reset, which is why it does not accept "aaa". Python, on the other hand, only resets the captured text of a capturing group after the capturing group has captured a different part of the string.

    Python and JavaScript also differ in the semantics of backreferences in one edge case. If a backreference references a group that has no captured text (either because it has never captured text or because its captured text was reset), then the backreference will always reject (being equivalent to (?!)) in Python, and it will always accept (being equivalent to (?:)) in JavaScript.

Since, pomsky doesn't seem to guarantee that regexes behave the same in different languages, what does portability mean for pomsky?


I am concerned with portability, because semantic differences can lead to security vulnerabilities.

E.g. ( [s]{2} | [ U+feff ] )+ $ will be vulnerable to exponential backtracking in some target language but not others. So if a developer uses a static analyzer or fuzzer to verify absence of exponential backtracking in one target language, they might assume that other languages are safe too.

Of course, semantic differences can cause problems in other ways too. E.g. if pomsky-generated regexes are used for input validation, then the JavaScript frontend might correctly filter out bad inputs but the Java backend does not.


Assuming that portability means "the generated regexes behave the same in all target languages", the above examples could be addressed as follows:

  1. Languages where predefined character classes don't match the semantics of pomsky need to be "polyfilled". E.g. \s can be emulated with the right character class across languages. If people actually want the \s of their respective target language, then they can use regex \s.
  2. % should just behave the same in all target languages. It's unfortunate that JS regexes will be quite long, but long regexes are better than wrong regexes.
  3. This is a really difficult problem. Making capturing groups and backreference behave consistently across language would likely require adding restrictions to how they can be used. This can be done via static analysis (example), but the additional restriction might annoy some users.
@Aloso
Copy link
Member

Aloso commented Dec 28, 2022

Regarding the space: The only differences seems to be that

  • Java doesn't include U+FEFF (Byte order mark, also used as zero-width space, deprecated in Unicode 3.2 for the latter purpose)
  • JavaScript doesn't include U+0085 (Next line, exists for compatibility with EBCDIC, not used anymore as far as I know)

This can be polyfilled with [\s\x85] and [\s\uFEFF], though this would make it impossible to negate it within a character class ([w !s '-'] -> [\w\S\-]). We can either forbid this, similar to how [!w] is forbidden in JS, or lower !s to the appropriate ranges.

@Aloso
Copy link
Member

Aloso commented Dec 28, 2022

% should just behave the same in all target languages. It's unfortunate that JS regexes will be quite long, but long regexes are better than wrong regexes.

A problem with this is that lookbehind is not supported in all JS engines (notably, Safari/WebKit doesn't support it), which I want to write a compiler warning for. I don't know any way to emulate \b/\B correctly without using lookbehind.

@Aloso Aloso added bug Something isn't working C-interop Interoperability between Pomsky and programming languages C-compat Compatibility between regex flavors and removed C-interop Interoperability between Pomsky and programming languages labels Dec 28, 2022
@RunDevelopment
Copy link
Author

I don't know any way to emulate \b/\B correctly without using lookbehind.

AFAIK, there is no general way to emulate it without lookbehinds. Lookaround optimizations can eliminate many lookarounds within patterns, but some simple cases (e.g. % "foo") are impossible to do without lookbehinds.

@RunDevelopment
Copy link
Author

That being said, the examples I gave weren't meant to be bug reports. I just wanted to illustrate that pomsky doesn't seem to be portable, with the definition of "portable" being "compiled regexes behave the same in all target languages".

So could you please clarify what portable means for pomsky? What are pomsky's goals when it comes to supporting compile targets?

@Aloso
Copy link
Member

Aloso commented Dec 29, 2022

"Portable" means that a Pomsky expression that works for one regex flavor should also work in other flavors, unless it uses an unsupported feature (like backreferences in Rust). One example is that named capturing groups use different syntax in JS than in PCRE, but in Pomsky the same syntax can be used to target any flavor. I acknowledge that Pomsky isn't portable in all situations; I initially hoped that I could get rid of all the differences with polyfills, but it has become clear that this is not always possible, so Pomsky's portability is a best-effort kinda thing.

@RunDevelopment
Copy link
Author

That pomsky accounts for the syntax of different flavors in a given IMO. After all, I don't know any regex flavors with >> lookaheads :)

I am more interested in what portability means for semantics. Right now, pomsky specifies the semantics of its elements (e.g. which sets of characters [s] and [w] match) on its website. However, it seems like flavors are not required to adhere to those semantics and are allowed to approximate them instead. This is a huge problem because it means that users have no guarantee that the regexes pomsky produces are actually correct (= behave as specified by pomsky). A best-effort approach is simply not suitable when it comes to correctness.

That being said, I do see where you're coming from, but I don't think that the current approach to portability is good. The fact, that users don't know which approximations flavors might or might not use, makes pomsky unreliable. I think letting users choose whether they want approximations would be best. This would neatly separate pomsky's semantics and flavor-specific limitations and approximations.

Pomsky could define a list of all approximations that flavors might perform, and then allow users to enable/disable approximations per flavor. Of course, shorthands for sets of options would be nice (e.g. strict to disable all optimizations, and loose to enable safe-ish optimizations). This would also allow pomsky to provide optimizations that might not be appliciable in all cases (e.g. JS regexes to emulate atomic groups using a lookahead+capturing group, but the added capturing group will show up in the list matched capturing groups, which might break user code).

(JS and lookbehinds are a compelling case for this fine-grained approach. While regexes used on the web shouldn't use lookbehinds because of Safari, regexes that used in e.g. electron apps can use lookbehinds. So why should all JS regexes use an approximation that is only necessary for one browser?)

@Aloso
Copy link
Member

Aloso commented Dec 31, 2022

First of all, I consider all 3 examples you pointed out to be bugs. It would be best to create separate issues for them and discuss only the bigger picture here, without looking at each portability issue in detail.

There are multiple possible solutions to portability issues:

  1. Use a polyfill if possible (for example \w is polyfilled in JS)

  2. Make it illegal in the affected regex flavor (for example, % could be forbidden in JS)

  3. Emit a warning pointing out the portability concern

  4. Accept that something isn't perfectly portable and document the differences.

  5. Use a different syntax when the behavior differs (for example, use the $ symbol instead of % for producing \b in JS)

5. seems like a good solution because it maintains correctness without sacrificing functionality. However, it is the least portable and goes against my goal to make Pomsky feel like a single, consistent language.

What I think you're suggesting is a configuration option to choose between 2. (strict) and 4. (loose). My current plan is to use whatever solution feels most appropriate (use polyfills whenever possible, reject code that is most likely incorrect, accept non-portable code if it is useful, emit a warning when a false positive is unlikely).

I intend to communicate this more clearly on the website. I think your idea of a "strict" mode is cool, but this sounds like a bigger project, so I probably won't have time to implement this in the near future. However, if you (or someone else) want to contribute this feature, that would be appreciated!

@RunDevelopment
Copy link
Author

I consider all 3 examples you pointed out to be bugs.

👍

However, if you (or someone else) want to contribute this feature, that would be appreciated!

This issue was basically me feeling out the goals and ambitions of this project. I wanted to contribute to a Rust project for a while now, so this will be a good first. Expect PRs in the coming year :)

@Aloso
Copy link
Member

Aloso commented Jan 18, 2023

@RunDevelopment I created a proposal to add an ASCII-only mode, addressing the problem with \b in JavaScript. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working C-compat Compatibility between regex flavors
Projects
None yet
Development

No branches or pull requests

2 participants