Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revert 2841 #2879

Closed
wants to merge 2 commits into from
Closed

revert 2841 #2879

wants to merge 2 commits into from

Conversation

UziTech
Copy link
Member

@UziTech UziTech commented Jul 17, 2023

Marked version: 5.1.1

Description

Revert #2841

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

@vercel
Copy link

vercel bot commented Jul 17, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
marked-website ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 17, 2023 6:10am

@calculuschild
Copy link
Contributor

Let me play with it first. I suspect it's not \p{P} but my changes to the Lexer masking step.

@calculuschild
Copy link
Contributor

calculuschild commented Jul 20, 2023

Mmmm.... I have found the issue. Making it work is a bit obnoxious, but I managed to do it. I can share an alternative PR tomorrow with my changes for you to compare, although the workaround looks a bit odd

The issue is that we need to use RegEx in unicode mode (u), to process \p{P}. In unicode mode, the RegEx also treats multi-char emojis as single unicode characters, so "💁" for example is counted as a single character instead of string of 2 unicode chars. This throws off string lengths for slicing at multiple points (particularly using match.index which is off by 1 in the new test cases).

Essentially, the solution is to destructure strings into an array and slice on the array instead, since this preserves unicode chars which will match the counts from the RegEx:

const raw = src.slice(0, lLength + match.index + rLength + 1);

     becomes

const raw = [...src].slice(0, lLength + match.index + rLength + 1).join('');

Speed-wise, it seems the same to me, so the tradeoff looks like:

Pros:
  • more complete coverage of all punctuation chars
  • simpler regex rule
Cons:
  • Odd handling of string slicing in the emStrong tokenizer

I would also note that this PR has other tweaks included that are not related to this issue. If we decide to revert, I would request we only revert any parts related to the u regex mode (keep the other logic tweaks and variable renames).

@UziTech
Copy link
Member Author

UziTech commented Jul 20, 2023

Nice catch! I would rather fix the problem than revert. I'll wait for your fix.

@calculuschild
Copy link
Contributor

Apologies. I kind of forgot about this. New PR here: #2942

@UziTech UziTech closed this Aug 15, 2023
@UziTech UziTech deleted the revert-2841 branch August 26, 2023 03:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improper emoji rendering with v5.1.0
2 participants