ascii_only changes semantics of redundantly escaped regexes #1506

Closed
DavidBuchanan314 opened this issue Mar 16, 2024 · 1 comment

DavidBuchanan314 commented Mar 16, 2024

$ npx terser -V
terser 5.29.2

Steps to repro:

$ cat testcase.js
console.log("hello\\world❗".split(/[\❗]/));
$ node testcase.js 
[ 'hello\\world', '' ]
$ npx terser testcase.js -f ascii_only=true | tee min.js
console.log("hello\\world\u2757".split(/[\\u2757]/));
$ node min.js
[ 'hello', 'world❗' ]

When it occurs inside a regex literal, \❗ is equivalent to just ❗; the redundant escape is ignored by JS runtimes (as far as I can tell).

But once terser has replaced the emoji with a unicode escape sequence, the \\ substring parses as an escaped backslash, changing the meaning of the regex: the character class now matches a literal backslash or any of the characters u, 2, 7, 5.
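The difference can be checked directly in Node. This is a minimal sketch that uses the string and the two regexes from the transcript above; the third regex and all variable names are illustrative additions.

```javascript
// The input from the report: "hello", one backslash, "world", then U+2757.
const input = "hello\\world\u2757";

// Original regex. Without the u flag, a backslash before a character that has
// no special meaning is an identity escape, so the class contains only U+2757.
const original = /[\❗]/;

// What ascii_only=true produced. The class now contains a literal backslash
// followed by the ordinary characters u, 2, 7, 5, 7.
const minified = /[\\u2757]/;

// What would preserve the meaning: the redundant backslash is gone.
const expected = /[\u2757]/;

console.log(input.split(original)); // [ 'hello\\world', '' ]
console.log(input.split(minified)); // [ 'hello', 'world❗' ]
console.log(input.split(expected)); // [ 'hello\\world', '' ]
```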

Maybe there needs to be an initial pass to strip redundant backslashes?
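One way such a pass could look is sketched below. `asciiOnlyRegexSource` is a hypothetical helper and not part of terser. It assumes a regex without the u or v flag, where a backslash before a non-ASCII code unit is always a redundant identity escape, and it does not try to handle non-ASCII named capture groups.

```javascript
// Hypothetical helper, not terser's implementation: convert the source text of
// a non-unicode-mode regex to ASCII without changing what the regex matches.
function asciiOnlyRegexSource(source) {
  // Escape one UTF-16 code unit as \uXXXX.
  const escapeUnit = (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0");
  const isAscii = (unit) => unit.charCodeAt(0) < 0x80;

  let out = "";
  for (let i = 0; i < source.length; i++) {
    const unit = source[i];
    if (unit === "\\" && i + 1 < source.length) {
      const next = source[i + 1];
      // A backslash before a non-ASCII code unit is redundant: drop it so it
      // cannot merge with the \uXXXX that replaces `next`.
      // A backslash before ASCII (\d, \\, \u0041, ...) is copied unchanged.
      out += isAscii(next) ? unit + next : escapeUnit(next);
      i++; // the escaped code unit has been consumed as well
    } else {
      out += isAscii(unit) ? unit : escapeUnit(unit);
    }
  }
  return out;
}

console.log(asciiOnlyRegexSource(String.raw`[\❗]`));   // [\u2757]
console.log(asciiOnlyRegexSource(String.raw`[\😁]`));  // [\ud83d\ude01]
console.log(asciiOnlyRegexSource(String.raw`a\\❗\d`)); // a\\\u2757\d
```

Handling the backslash and the following code unit as a pair also keeps an already escaped backslash (`\\`) from being mistaken for the start of a redundant escape.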

(By the way, when \❗ occurs inside a string literal, it appears to be handled correctly)

fabiosantoscode (Collaborator) commented Mar 20, 2024

Cool bug!

The behavior with emojis that need two \uXXXX escapes (a surrogate pair) is also fun. The backslash of the first escape gets escaped, not the second.

"".split(/[\😁]/);

becomes

"".split(/\\ud83d\ude01/);
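The surrogate-pair case can be reproduced with a small model of the escaping step. `naiveAsciiEscape` is an assumption made for illustration: it replaces each non-ASCII UTF-16 code unit independently, which is consistent with the outputs shown in this thread but is not taken from terser's source.

```javascript
// Naive model of the observed output (an assumption, not terser's code):
// replace every non-ASCII UTF-16 code unit with \uXXXX, leave the rest alone.
function naiveAsciiEscape(source) {
  return source.replace(/[\u0080-\uffff]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0"));
}

const emoji = "😁";
console.log(emoji.length); // 2: a high surrogate and a low surrogate

// The pre-existing backslash ends up directly in front of the first escape
// only, so \ud83d turns into "\\" + "ud83d" while \ude01 stays a real escape.
const escaped = naiveAsciiEscape(String.raw`[\😁]`);
console.log(escaped); // [\\ud83d\ude01]

const original = /[\😁]/;
const broken = new RegExp(escaped);
// The broken class now matches ordinary letters such as "d".
console.log(original.test("d"), broken.test("d")); // false true
```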
