Path to Stage 4! #58

ljharb · 2023-09-26T10:46:16Z

oliverfoster · 2023-09-26T10:48:59Z

Congratulations on getting to Stage 2. 👍

benjamingr · 2023-09-26T11:08:58Z

Hey, need help on any of these? (I'm happy to try and contribute the test262 tests or amend the spec according to the consensus)

ljharb · 2023-09-26T23:41:12Z

I'm comfortable handling the spec, but the test262 tests would be most helpful. Be warned, though, they'll need to be extremely rigorous.

jridgewell · 2023-09-27T22:07:40Z

My review is at #60

michaelficarra · 2023-11-02T19:16:51Z

My feedback:

I don't like how this is introducing identity escapes in Unicode mode for these new punctuators. This wasn't a goal of the RegExp escaping proposal. You can accomplish RegExp escaping with hex escapes and without introducing new features to the RegExp grammar.
There's no reason to define the phrase "the ASCII punctuators that need escaping" if it's just used once, in the algorithm that follows. Just inline the string. Also I don't like how the string is denoted since it contains a ". Can you figure out some other way to denote it?

Maybe a table (code point, hex escape string) would solve both of my issues.

ljharb · 2023-11-02T21:08:59Z

For the second, I have no strong opinions about how to denote the characters; I can certainly inline the string.

For the first, isn't that required, otherwise this proposal won't be able to produce output that works in both u and v mode?

michaelficarra · 2023-11-02T21:11:45Z

No, I believe hex escapes are sufficient for this purpose. Do you have a specific counterexample?

ljharb · 2023-11-02T21:12:28Z

cc @bakkot since this was a result of their research

bakkot · 2023-11-02T21:24:12Z

Hex encoding works, it just makes the output completely unreadable. I can't imagine we're going to make \& mean anything other than & in u-mode regexps, so I don't think there's much cost in making those legal, and I think we ought to pay that cost for the benefit of more readable output.

michaelficarra · 2023-11-03T00:19:25Z

I do not value the readability of the output. This function is meant only for composing and then compiling RegExps.

bakkot · 2023-11-03T00:29:06Z

Sometimes you have to debug compiled regexps.

michaelficarra · 2023-11-03T00:32:54Z

You can decompile them and represent them however you like. Paste them into a visualiser.

bakkot · 2023-11-03T00:44:40Z

That's a bunch more work than doing console.log(regex). Why should anyone have to do that work? That's a very concrete cost to your suggestion and I see approximately no offsetting benefit. What benefit do you see to your suggestion?

ljharb · 2023-11-16T01:56:00Z

@michaelficarra as far as the spec review goes, I've addressed your editorial comment; can I check you off?

The normative one we can certainly discuss further if needed.

michaelficarra · 2023-11-17T00:12:52Z

The spec text is fine for what you intended. I still have issue with the escaping design though.

ljharb · 2023-11-17T04:02:45Z

Thanks! I'll check you off, but can you file a new issue to further discuss the escaping design? That seems like the only obstacle to seeking stage 3.

sophiebits · 2023-11-20T20:52:44Z

Escaping using hex escapes also has the advantage that it's straightforwardly polyfillable in practice, whereas changing the UnicodeMode grammar would seem to not really be.

gibson042 · 2023-11-27T06:16:08Z

My review:

In |IdentityEscape|, / should only appear once in the new expansion to non-|SyntaxCharacter| punctuators and ` should be included (but I am in favor of introducing the new expansions that improve parity with non-Unicode literals as anticipated by How is escaping of plain text affected by this proposal? proposal-regexp-v-flag#71 (comment) — it cuts off any future special use of those sequences, but that doesn't seem like a problem to me).
In the RegExp.escape introduction, the phrase "control characters" should be replaced with something less confusable with the similarly-named ASCII concept (e.g., "This method takes a string and returns a similar string in which each character that is potentially special in a regular expression |Pattern| has been replaced by an escape sequence representing that character.").

[Which leading characters should be escaped? #66] In the RegExp.escape algorithm, I think it is necessary to escape not just a leading decimal digit but also a leading ASCII letter (consider e.g. new RegExp("\\c" + RegExp.escape("J")) in a web browser implementing |ExtendedAtom|, where the result should match "\\cJ" and [unlike /\cJ/] fail to match "\n"), and possibly just to always escape the first character. I also think comprehensibility would be improved by adding a note explaining that logic branch and by making it more complete (rather than separating \x3 from the second hexadecimal digit by an intervening Else step). For example, assuming it's acceptable to always emit a \u… rather than something more dynamic like QuoteJSONString:

1. For each code point _c_ in _cpList_, do
  1. If _escapedList_ is empty [and _c_ needs escaping], then
    1. NOTE: Escaping is required to ensure that the output may be suffixed onto an arbitrary prefix without being misinterpreted as part of an escape sequence that starts within that prefix.
    1. Let _codeUnits_ be UTF16EncodeCodePoint(_c_).
    1. Let _len_ be the length of _codeUnits_.
    1. Assert: _len_ = 1 or _len_ = 2.
    1. Let _codeUnit_ be the code unit at index 0 within _codeUnits_.
    1. Let _escapedCodeUnit_ be UnicodeEscape(_codeUnit_).
    1. Append to _escapedList_ the elements of StringToCodePoints(_escapedCodeUnit_).
    1. If len = 2, then
      1. Set _codeUnit_ to the code unit at index 1 within _codeUnits_.
      1. Set _escapedCodeUnit_ to UnicodeEscape(_codeUnit_).
      1. Append to _escapedList_ the elements of StringToCodePoints(_escapedCodeUnit_).
  1. Else,
    1. If _toEscape_ contains _c_ or _c_ is matched by |WhiteSpace|, then
      1. Append code unit U+005C (REVERSE SOLIDUS) to _escapedList_.
    1. Append _c_ to _escapedList_.
1. Return CodePointsToString(_escapedList_).

The first issue is trivial to fix and the second is purely editorial, leaving only the third as something to actually resolve (potentially even by accepting the risk and making no normative change). In my opinion, this is ready for stage 3.

See #58 (comment)

ljharb · 2023-11-27T16:54:08Z

First two are fixed; we can discuss the third in #66 but iirc the intention was simply to not support using the output of RegExp.escape in contexts like that.

syg · 2024-01-26T20:48:25Z

Spec draft is fairly short and I didn't see any glaring editorial issues.

I find the alias definition for "the ASCII punctuators that need escaping" as a separate paragraph the precede the algorithmic steps strange. Why not have a step that says "Let the ASCII punctuators [...] be the String [...]"?

ljharb · 2024-01-26T20:50:46Z

Sure, I could do that - happy to go with whatever yall want there.

See #58 (comment)

ljharb · 2024-03-25T21:21:26Z

ok, this has been reworked with #67 - @jridgewell @michaelficarra @gibson042 @syg @bakkot, can you confirm that you're signed off, assuming you are?

jridgewell · 2024-03-25T21:27:38Z

LGTM

bakkot · 2024-03-25T21:44:07Z

LGTM

michaelficarra · 2024-03-25T22:24:52Z

@bakkot Does the string coercion in step 1 align with our new coercion strategy?

bakkot · 2024-03-25T22:33:19Z

Oh, good point. This should throw a TypeError on any non-string inputs.

michaelficarra · 2024-03-27T21:15:41Z

Let punctuators be the string-concatenation of "(){}[]|,.?*+-^$=<>/#&!%:;@~'`", the code unit 0x0022 (QUOTATION MARK), and the code unit 0x005C (REVERSE SOLIDUS).

ljharb · 2024-03-27T21:17:32Z

ah ok, gotcha, thanks.

ljharb · 2024-03-27T21:19:13Z

k, that's landed in c95db79.

ljharb · 2024-03-27T23:26:54Z

Filxed #71 to handle the surrogate pair stuff.

ljharb · 2024-03-28T20:59:48Z

@gibson042 @syg latest changes are landed; can you confirm if/that you've signed off?

gibson042 · 2024-03-28T21:17:25Z

Yep, that resolves the normative issues so works for me. I'll open PRs to demonstrate my editorial preferences.

ljharb · 2024-03-29T19:46:15Z

@syg given your stamp on #73, does that mean you've signed off?

syg · 2024-04-08T15:41:24Z

@syg given your stamp on #73, does that mean you've signed off?

Yep, editorially lgtm.

ljharb · 2024-04-08T17:59:31Z

No advancement this meeting:

Which leading characters should be escaped? #66 has been reopened
unpaired surrogates need to be escaped
things like newlines should be escaped as \n, so it's more readable, as long as it's valid syntax in unicode mode. @waldemarhorwat, it would be super helpful if you could file an issue listing the items you had in mind.

Once these are resolved, I'll return and re-ask for 2.7.

michaelficarra · 2024-04-08T18:12:30Z

@ljharb I imagine it means anything in ControlEscape, which doesn't have any alternative conditional on the UnicodeMode flag.

bakkot · 2024-04-08T18:15:39Z

Presumably also SyntaxCharacter and /, i.e. those characters which can be used in IdentityEscape even in u-mode RegExps.

bakkot · 2024-04-08T18:19:55Z

The other thing in CharacterEscape is \0. \0 is an interesting case, since you can't use it if the next character is an ASCII digit. I would favor not doing \0; \x00 is one of the few hex escapes you don't need to memorize.

waldemarhorwat · 2024-04-09T07:04:40Z

You can't do \0 for the NUL character because bad things could happen if the escaped string ended with a NUL and got concatenated with something that started with a digit. \x00 is fine for NUL.

mevanlc · 2024-04-23T03:19:42Z

I've been watching the progress on this repo for a long time now. I just want to say I appreciate the hard work and careful thought everyone has put into this (including the upcoming work as well.) It's been fascinating observing the concerns and challenges that the web environment raises for ECMAScript that were not much of a concern for other languages that have long had the same feature. I am wondering if the repo owner(s) would consider opening the GitHub Discussions tab in order to allow more back-and-forth discussion with fewer concerns about constantly pinging the people watching the Issues? I am sure if anything interesting distills out of Discussions, people will be more than happy to filter the info back into Issues (I don't think there is an expectation that the owners will monitor / reply to discussions, besides the unlikely chance that moderation is needed.) Just a thought.

ljharb · 2024-04-23T04:40:09Z

I'm not sure there's any way to disable issue notifications while enabling discussion notifications, and I think the same expectations exist for both about owner responsiveness.

ljharb · 2024-05-13T05:05:34Z

@jridgewell @michaelficarra @gibson042 @syg the proposal has been updated, and a fresh review would be appreciated :-)

jridgewell · 2024-05-13T18:36:45Z

LGTM

gibson042 · 2024-05-16T17:33:39Z

Comments:

I don't think 𝔽(_c_) for code point c is valid, because 𝔽 is defined as "conversion from a mathematical value or extended mathematical value", while a code point is neither of those (a code point has a numeric value which is an integer, but is not equal to it, cf. Source Text and Identity [among other sections]). Either the "Let hex be Number::toString(𝔽(c), 16)" step should be split to address this, or 𝔽 should be updated to explicitly accommodate it.
As noted at Address feedback from plenary #77 (comment) , I consider it unwise to use character escapes rather than hexadecimal escapes for |SyntaxCharacter| because it's inviting future issues. I can live with this behavior, but would still like to discuss it in plenary.
The population of |WhiteSpace| can change over time, therefore the behavior of RegExp.escape is not guaranteed to be stable (e.g., the addition of a code point to general category Space_Separator in some version of the Unicode Standard means that implementations incorporating that version will escape it while implementations incorporating an earlier version will not). This seems reasonable, but probably merits an editorial note.

michaelficarra · 2024-05-16T19:05:01Z

All the same points as @gibson042, plus there's some wording things that I would change, but they wouldn't hold up stages 2.7/3. LGTM otherwise.

ljharb · 2024-05-17T13:42:21Z

Thanks; that’s a spec bug that we’ll fix but doesnt affect semantics obv.

Sure, we can discuss it in plenary - hopefully either outcome can still allow for advancement in that meeting, given that the effective behavior is the same.

The changing nature of whitespace holds for regexes themselves as well as the trim methods, none of which have a note - what do you suggest?

gibson042 · 2024-05-17T20:09:30Z

Sure, we can discuss it in plenary - hopefully either outcome can still allow for advancement in that meeting, given that the effective behavior is the same.

👍

The changing nature of whitespace holds for regexes themselves as well as the trim methods, none of which have a note - what do you suggest?

I think it's more obvious for those cases, especially when it comes to \s and \S in patterns... that the behavior of something like RegExp.escape may not be stable over time is surprising. But I agree with you that it is also true of other operations, and have opened tc39/ecma262#3331 accordingly. This proposal can either track that or try to anticipate it; either approach seems fine to me.

ljharb · 2024-05-17T21:12:50Z

It will be quite stable in terms of "how it behaves in regexes" - nobody should be noticing or caring about the stability of the string it produces otherwise :-)

michaelficarra · 2024-05-17T21:20:54Z

@ljharb Wouldn't that be nice? Unfortunately, we both know how this job goes. If it can be observed, someone will observe it, and it must eventually become stable and compatible across engines.

ljharb · 2024-05-17T21:23:58Z

@michaelficarra that is true, and yet we've changed the whitespace value in JS multiple times and the only place it seemed to break people is in my projects' test cases :-p

@gibson042 filed #78 to address the first bullet point.

ljharb · 2024-06-11T13:44:51Z

Consensus today on character escapes and stage 2.7.

Help writing test262 tests is much appreciated :-)

michaelficarra mentioned this issue Nov 21, 2023

change escaping to hex escape sequences #65

Closed

gibson042 mentioned this issue Nov 27, 2023

Which leading characters should be escaped? #66

Closed

ljharb added a commit that referenced this issue Nov 27, 2023

[spec] review feedback

588218d

See #58 (comment)

ljharb added a commit that referenced this issue Mar 20, 2024

[spec] remove dfn

44ab978

See #58 (comment)

ljharb added a commit that referenced this issue Mar 27, 2024

[spec] address #58 (comment)

c95db79

This was referenced Mar 28, 2024

Editorial: Simplify algorithms by using strings rather than Lists #73

Merged

Editorial: Reword the description #74

Merged

gibson042 mentioned this issue Apr 1, 2024

Editorial: Add note about leading-digit escaping #75

Merged

ljharb mentioned this issue May 17, 2024

[spec] properly get the numeric value of a code point #78

Merged

Path to Stage 4! #58

Path to Stage 4! #58

Comments

ljharb commented Sep 26, 2023 • edited

Stage 4

Stage 3

Stage 2.7

Stage 2

Stage 1

oliverfoster commented Sep 26, 2023

benjamingr commented Sep 26, 2023

ljharb commented Sep 26, 2023

jridgewell commented Sep 27, 2023

michaelficarra commented Nov 2, 2023

ljharb commented Nov 2, 2023

michaelficarra commented Nov 2, 2023

ljharb commented Nov 2, 2023

bakkot commented Nov 2, 2023

michaelficarra commented Nov 3, 2023

bakkot commented Nov 3, 2023

michaelficarra commented Nov 3, 2023

bakkot commented Nov 3, 2023

ljharb commented Nov 16, 2023

michaelficarra commented Nov 17, 2023

ljharb commented Nov 17, 2023

sophiebits commented Nov 20, 2023

gibson042 commented Nov 27, 2023 • edited by ljharb

ljharb commented Nov 27, 2023 • edited

syg commented Jan 26, 2024

ljharb commented Jan 26, 2024

ljharb commented Mar 25, 2024

jridgewell commented Mar 25, 2024

bakkot commented Mar 25, 2024

michaelficarra commented Mar 25, 2024

bakkot commented Mar 25, 2024

michaelficarra commented Mar 27, 2024

ljharb commented Mar 27, 2024

ljharb commented Mar 27, 2024

ljharb commented Mar 27, 2024

ljharb commented Mar 28, 2024

gibson042 commented Mar 28, 2024

ljharb commented Mar 29, 2024

syg commented Apr 8, 2024

ljharb commented Apr 8, 2024 • edited

michaelficarra commented Apr 8, 2024

bakkot commented Apr 8, 2024

bakkot commented Apr 8, 2024

waldemarhorwat commented Apr 9, 2024

mevanlc commented Apr 23, 2024 • edited

ljharb commented Apr 23, 2024

ljharb commented May 13, 2024

jridgewell commented May 13, 2024

gibson042 commented May 16, 2024

michaelficarra commented May 16, 2024

ljharb commented May 17, 2024

gibson042 commented May 17, 2024

ljharb commented May 17, 2024

michaelficarra commented May 17, 2024

ljharb commented May 17, 2024

ljharb commented Jun 11, 2024

ljharb commented Sep 26, 2023 •

edited

gibson042 commented Nov 27, 2023 •

edited by ljharb

ljharb commented Nov 27, 2023 •

edited

ljharb commented Apr 8, 2024 •

edited

mevanlc commented Apr 23, 2024 •

edited