revert 2841 #2879

UziTech · 2023-07-17T06:09:53Z

Marked version: 5.1.1

Description

Revert #2841

Fixes Improper emoji rendering with v5.1.0 #2865

Contributor

Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
no tests required for this PR.
If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

CI is green (no forced merge required).
Squash and Merge PR following conventional commit guidelines.

calculuschild · 2023-07-17T06:22:39Z

Let me play with it first. I suspect it's not \p{P} but my changes to the Lexer masking step.

calculuschild · 2023-07-20T02:36:06Z

Mmmm.... I have found the issue. Making it work is a bit obnoxious, but I managed to do it. I can share an alternative PR tomorrow with my changes for you to compare, although the workaround looks a bit odd.

The issue is that we need to use RegEx in unicode mode (the u flag) to process \p{P}. In unicode mode, the RegEx also treats an emoji stored as a surrogate pair as a single character, so "💁" for example is counted as one character instead of a string of 2 UTF-16 code units. This throws off string lengths for slicing at multiple points (particularly when using match.index, which is off by 1 in the new test cases).
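The counting mismatch can be reproduced outside of marked. The following is a minimal sketch for illustration only; it is not taken from marked's source, and the variable name is an assumption.

```javascript
// Illustrative only; not code from marked.
const emoji = '💁'; // U+1F481, stored as a surrogate pair in a JS string

// String length counts UTF-16 code units, so the emoji counts as 2.
console.log(emoji.length); // 2

// Spreading iterates by code point, so the emoji counts as 1.
console.log([...emoji].length); // 1

// \p{P} (Unicode punctuation) is only recognized with the u flag.
console.log(/\p{P}/u.test('!')); // true
console.log(/\p{P}/.test('!')); // false

// With the u flag, `.` consumes the whole emoji;
// without it, `.` consumes a single code unit.
console.log(/^.$/u.test(emoji)); // true
console.log(/^.$/.test(emoji)); // false
```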

Essentially, the solution is to spread the string into an array of characters and slice on the array instead, since this preserves whole unicode characters, which will match the counts from the RegEx:

const raw = src.slice(0, lLength + match.index + rLength + 1);

     becomes

const raw = [...src].slice(0, lLength + match.index + rLength + 1).join('');
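To see why the two forms differ, here is a small sketch with a made-up input. The helper name `sliceByCodePoints` and the sample string are hypothetical; the change quoted above inlines the expression rather than using a helper.

```javascript
// Hypothetical helper for illustration; it wraps the same
// spread / slice / join expression shown above.
function sliceByCodePoints(str, start, end) {
  return [...str].slice(start, end).join('');
}

const src = '💁*bold* text'; // made-up sample input

// Indexing by code point: 💁 * b o l d * is 7 characters.
console.log(sliceByCodePoints(src, 0, 7)); // '💁*bold*'

// Indexing by UTF-16 code unit: the emoji occupies two units,
// so the same end index stops one character short.
console.log(src.slice(0, 7)); // '💁*bold'
```

Note that spreading builds an array holding every character of the string, so on very long inputs it does more work than a plain `slice`; whether that matters in practice would need measuring.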

Speed-wise, it seems the same to me, so the tradeoff looks like:

Pros:

more complete coverage of all punctuation chars
simpler regex rule

Cons:

Odd handling of string slicing in the emStrong tokenizer

I would also note that the PR being reverted (#2841) has other tweaks included that are not related to this issue. If we decide to revert, I would request we only revert the parts related to the u regex mode (keep the other logic tweaks and variable renames).

UziTech · 2023-07-20T02:52:05Z

Nice catch! I would rather fix the problem than revert. I'll wait for your fix.

calculuschild · 2023-08-14T18:36:03Z

Apologies. I kind of forgot about this. New PR here: #2942

UziTech added 2 commits July 17, 2023 00:08

add emoji test

c4f65a4

revert 2841

3b4b78d

vercel bot deployed to Preview July 17, 2023 06:10 View deployment

UziTech requested a review from calculuschild July 17, 2023 06:10

calculuschild mentioned this pull request Aug 14, 2023: Fix unicode Regex miscounting emoji length #2942 (Merged)

UziTech closed this Aug 15, 2023

UziTech deleted the revert-2841 branch August 26, 2023 03:23
