Define line break styles for east asian characters as options #411

henry0312 · 2023-08-27T06:18:37Z

In this pull request, we've addressed the rendering of line breaks, especially concerning the interaction between Western and East Asian wide characters. The aim is to ensure that the output is more intuitive and naturally rendered.

Changes:

Refactoring in renderText Method:
- Simplified the condition that determines whether a new line should be added between characters. We've used a more straightforward and readable condition using De Morgan's law.
- This change primarily impacts the way line breaks are handled between East Asian wide characters and Western characters.
Updated Test Cases:
- We've modified the expected results for the test cases in cjk_test.go to reflect the desired behavior.
- Specifically, we've adjusted tests that involve soft line breaks between a Western character and an East Asian wide character. This ensures that our tests are in line with the new rendering logic.

Through these changes, we hope to enhance the readability of rendered content, especially when dealing with mixed character types. We kindly ask for your review and feedback on these modifications.

Example 1:

Input Markdown:

私はプログラマーです。
GoでWebアプリケーションを開発しています。

Current Output:

<p>私はプログラマーです。\nGoでWebアプリケーションを開発しています。</p>

Expected Output:

<p>私はプログラマーです。GoでWebアプリケーションを開発しています。</p>

Example 2:

Input Markdown:

I am a programmer. <!-- notice: there is a white space after the last period -->
私はプログラマーです。

Current Output:

<p>I am a programmer. \n私はプログラマーです。</p>

Expected Output:

<p>I am a programmer. 私はプログラマーです。</p>

This commit aims to produce more natural line breaks in the rendered output.
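
The difference between the existing condition and the one proposed here can be sketched as two small predicates. This is only an illustration: `isWide` is a rough, hypothetical stand-in for goldmark's `util.IsEastAsianWideRune` built from the standard `unicode` tables, and the helper names are assumptions, not code from this PR.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isWide is a rough approximation of an East Asian wide check (assumption:
// Han, Hiragana, Katakana, CJK punctuation and fullwidth forms only).
func isWide(r rune) bool {
	return unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana) ||
		(r >= 0x3000 && r <= 0x303F) || // CJK symbols and punctuation
		(r >= 0xFF01 && r <= 0xFF60) // fullwidth forms
}

// keepBreakBothSides mirrors the original condition: the soft line break is
// dropped only when BOTH neighbouring runes are wide.
func keepBreakBothSides(last, first rune) bool {
	return !(isWide(last) && isWide(first))
}

// keepBreakEitherSide mirrors the proposed condition: the soft line break is
// dropped when EITHER neighbouring rune is wide.
func keepBreakEitherSide(last, first rune) bool {
	return !(isWide(last) || isWide(first))
}

// edgeRunes returns the last rune of a and the first rune of b.
func edgeRunes(a, b string) (rune, rune) {
	last, _ := utf8.DecodeLastRuneInString(a)
	first, _ := utf8.DecodeRuneInString(b)
	return last, first
}

func main() {
	pairs := [][2]string{
		{"私はプログラマーです。", "GoでWebアプリケーションを開発しています。"},
		{"I am a programmer. ", "私はプログラマーです。"},
	}
	for _, p := range pairs {
		last, first := edgeRunes(p[0], p[1])
		fmt.Printf("%q | %q: both-sides keeps break=%v, either-side keeps break=%v\n",
			last, first, keepBreakBothSides(last, first), keepBreakEitherSide(last, first))
	}
}
```

For both examples above, the both-sides rule keeps the newline and the either-side rule removes it.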

yuin · 2023-08-28T12:36:50Z

In this case, what counts as "natural" varies from person to person.
pandoc, which is already widely used, has an east_asian_line_breaks extension that handles this case the same way as current goldmark.

I would prefer this to be an option for users.

henry0312 · 2023-08-29T12:23:53Z

Thank you for your feedback.

I completely understand your viewpoint. The perception of what appears "natural" can indeed vary among different users.

Regarding an appropriate option name:
- How about naming it CustomEastAsianLineBreaks? This name signifies that it's tailored for specific East Asian line break behaviors and distinguishes it from the standard east_asian_line_breaks option provided by tools like pandoc. I'm open to other suggestions as well.
Regarding the changes in this pull request:
- I believe the modifications are sound and align with the intent of providing more intuitive rendering, especially for mixed character types. However, making it an optional behavior ensures that we are not forcing a particular style on all users.

With that in mind, I fully support making this feature an optional behavior for users. I can proceed with implementing this as an option if everyone agrees.

Looking forward to further feedback and suggestions.

yuin · 2023-09-07T12:50:21Z

It might be better for the options for EastAsianLineBreaks to be functional options passed to WithEastAsianLineBreaks.

For example:

extension.NewCJK(
  extension.WithEastAsianLineBreaks(
    extension.WithXXX(),
  ),
)
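
A minimal sketch of how such nested functional options could be wired up. All types here are hypothetical and self-contained; `WithOneSide` stands in for the `WithXXX` placeholder above, and none of this is goldmark's actual implementation.

```go
package main

import "fmt"

// lineBreakConfig holds settings for the line-break feature (hypothetical).
type lineBreakConfig struct {
	enabled bool
	oneSide bool
}

// LineBreakOption is a sub-option accepted by WithEastAsianLineBreaks.
type LineBreakOption func(*lineBreakConfig)

// WithOneSide is a hypothetical sub-option standing in for WithXXX.
func WithOneSide() LineBreakOption {
	return func(c *lineBreakConfig) { c.oneSide = true }
}

// cjkConfig is the top-level extension configuration (hypothetical).
type cjkConfig struct {
	lineBreaks lineBreakConfig
}

// CJKOption is a top-level option accepted by NewCJK.
type CJKOption func(*cjkConfig)

// WithEastAsianLineBreaks enables the feature and applies any sub-options.
func WithEastAsianLineBreaks(opts ...LineBreakOption) CJKOption {
	return func(c *cjkConfig) {
		c.lineBreaks.enabled = true
		for _, opt := range opts {
			opt(&c.lineBreaks)
		}
	}
}

// NewCJK builds a configuration from the given options.
func NewCJK(opts ...CJKOption) *cjkConfig {
	c := &cjkConfig{}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	c := NewCJK(WithEastAsianLineBreaks(WithOneSide()))
	fmt.Printf("%+v\n", c.lineBreaks)
}
```

Because the sub-options are variadic, existing callers that pass no arguments to WithEastAsianLineBreaks keep compiling.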

henry0312 · 2023-09-10T06:28:25Z

I've followed your suggestion and added a WorksEvenWithOneSide sub-option to the EastAsianLineBreak option.
I would greatly appreciate it if you could review the changes in this PR.
Thank you in advance for your time and feedback.

yuin · 2023-09-10T12:51:15Z

Thanks for updating the PR.
I feel your implementation focuses too much on one specific issue.

Segmentation break problems have been discussed for a long time, especially in the CSS specifications. A 'natural' segmentation break is really hard to define and depends on the language.

In the past, a CSS specification draft defined segmentation breaks as follows.

If the character immediately before or immediately after the segment break is the zero-width space character (U+200B), then the break is removed, leaving behind the zero-width space.

Otherwise, if the East Asian Width property [UAX11] of both the character before and after the segment break is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.

Otherwise, the segment break is converted to a space (U+0020).
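
The three rules quoted above could be sketched roughly as follows. This is an approximation under stated assumptions: the standard library has no East Asian Width table, so `isFWH` is a hypothetical helper that guesses F/W/H membership from script tables and a few code point ranges. A real implementation would consult UAX #11 data.

```go
package main

import (
	"fmt"
	"unicode"
)

// isFWH approximates "East Asian Width is F, W, or H" (assumption: CJK
// scripts, Hangul, CJK punctuation and fullwidth/halfwidth forms).
func isFWH(r rune) bool {
	return unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul) ||
		(r >= 0x3000 && r <= 0x303F) ||
		(r >= 0xFF01 && r <= 0xFFEE)
}

// isHangul reports whether r belongs to the Hangul script.
func isHangul(r rune) bool {
	return unicode.Is(unicode.Hangul, r)
}

// transformSegmentBreak returns what replaces a segment break between the
// runes before and after it: "" when the break is removed, " " otherwise.
func transformSegmentBreak(before, after rune) string {
	const zwsp = '\u200B'
	// Rule 1: a neighbouring zero-width space removes the break.
	if before == zwsp || after == zwsp {
		return ""
	}
	// Rule 2: both sides F/W/H and neither side Hangul removes the break.
	if isFWH(before) && isFWH(after) && !isHangul(before) && !isHangul(after) {
		return ""
	}
	// Rule 3: otherwise the break becomes a space.
	return " "
}

func main() {
	fmt.Printf("%q\n", transformSegmentBreak('。', '私'))
	fmt.Printf("%q\n", transformSegmentBreak('。', 'G'))
	fmt.Printf("%q\n", transformSegmentBreak('한', '글'))
}
```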

I think option names modeled on CSS word-break property values are more flexible than concrete option names like the WithWorksEvenWithOneSide you implemented. If a Korean developer asks to change the CJK extension's behavior, they will make a PR following your implementation. That could end up giving the CJK extension many detailed and confusing options.

A 'style', as in the following, aggregates detailed options into one option value:

type EastAsianLineBreakStyle int

const (
  EastAsianLineBreakStyleSimple EastAsianLineBreakStyle = iota // Pandoc style line break
  EastAsianLineBreakCSS3Draft                                  // CSS3 draft style line break
  // etc...
)

And WithEastAsianLineBreaks takes this option:

WithEastAsianLineBreaks(EastAsianLineBreakCSS3Draft)

Note that for backward compatibility, the signature should be WithEastAsianLineBreaks(...EastAsianLineBreakStyle), not WithEastAsianLineBreaks(EastAsianLineBreakStyle).
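
A small sketch of why the variadic form preserves backward compatibility: existing calls with no argument still compile and fall back to the simple style. The `config`, `Option`, and `newConfig` names are hypothetical scaffolding for the illustration, not goldmark's real types.

```go
package main

import "fmt"

// EastAsianLineBreakStyle selects a line-break behavior.
type EastAsianLineBreakStyle int

const (
	EastAsianLineBreakStyleSimple EastAsianLineBreakStyle = iota
	EastAsianLineBreakCSS3Draft
)

// config is a hypothetical holder for the extension settings.
type config struct {
	enabled bool
	style   EastAsianLineBreakStyle
}

// Option mutates a config.
type Option func(*config)

// WithEastAsianLineBreaks is variadic, so both WithEastAsianLineBreaks() and
// WithEastAsianLineBreaks(style) are valid calls. Only the first style is used.
func WithEastAsianLineBreaks(style ...EastAsianLineBreakStyle) Option {
	return func(c *config) {
		c.enabled = true
		c.style = EastAsianLineBreakStyleSimple // default for old call sites
		if len(style) > 0 {
			c.style = style[0]
		}
	}
}

// newConfig applies the options in order.
func newConfig(opts ...Option) config {
	var c config
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newConfig(WithEastAsianLineBreaks()))
	fmt.Printf("%+v\n", newConfig(WithEastAsianLineBreaks(EastAsianLineBreakCSS3Draft)))
}
```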

henry0312 · 2023-09-24T06:18:15Z

Thank you for your detailed feedback.
First and foremost, I apologize for the delay in responding; reviewing and researching the CSS draft took a considerable amount of time.

I've closely aligned the line break specifications with your intentions, which is why we opted for the current implementation. We wanted a solution that was both intuitive for users and consistent with the desired functionality.

Your reference to the ongoing discussions surrounding the CSS specifications has been invaluable. I have delved into the extensive debates and found them to be quite intricate. Honestly, fully comprehending the nuances of the CSS line break specifications has been a challenge. Given its complexity, achieving a perfect implementation of the CSS line break specifications would indeed be tough.

However, in keeping with your intentions and after considering your feedback, I've aimed for an approach that enhances code readability and extensibility. This ensures that future modifications or additions can be seamlessly integrated. Your suggestion of using EastAsianLineBreakStyle definitely aligns with this goal, offering a more streamlined and organized solution.

Regarding backward compatibility, I've ensured that WithEastAsianLineBreaks remains variadic with ...EastAsianLineBreakStyle to not disrupt the experience for current users.

I genuinely value your insights and constructive feedback. I'll continue refining the implementation to make it as adaptable and efficient as possible. Please let me know if there are other concerns or recommendations you'd like to share.

henry0312 · 2023-09-24T06:19:09Z

extension/cjk.go

+	EastAsianLineBreaksStyleSimple EastAsianLineBreaksStyle = iota
+	// EastAsianLineBreaksCSS3Draft is a style where soft line breaks are ignored
+	// even if only one side of the break is an east asian wide character.
+	EastAsianLineBreaksCSS3Draft


Do you have a better name than EastAsianLineBreaksCSS3Draft?

yuin · 2023-10-14

It seems your implementation does not satisfy the CSS3 draft rules. We have two choices:

1. Implement the CSS3 draft rules and leave this name as it is.
2. Change this name to represent your implementation.

I prefer 1:

- Implement the CSS Text Level 3 draft rules.
- Implement an additional enhancement (i.e. resolve [css-text-3] Segment Break Transformation Rules around CJK Punctuation w3c/csswg-drafts#5086).
- Write a README note like 'This option implements CSS Text Level 3 Segment Break Transformation Rules with some enhancements'.

henry0312 · 2023-10-15

Thank you for your thorough feedback.

I’m on board with option 1 and will give the CSS Text Level 3 rules and additional enhancements a shot. Admittedly, I'm not a pro with this CSS issue, so while I'll do my best, I might miss some nuances. Any extra guidance or pointers while I work through this would be awesome!

Will keep you posted on the progress. Talk to you soon!

henry0312 · 2023-09-24T06:25:13Z

renderer/html/html.go

 				sibling := node.NextSibling()
 				if sibling != nil && sibling.Kind() == ast.KindText {
 					if siblingText := sibling.(*ast.Text).Text(source); len(siblingText) != 0 {
 						thisLastRune := util.ToRune(value, len(value)-1)
 						siblingFirstRune, _ := utf8.DecodeRune(siblingText)
-						if !(util.IsEastAsianWideRune(thisLastRune) &&
-							util.IsEastAsianWideRune(siblingFirstRune)) {
+						if r.EastAsianLineBreaks.EastAsianLineBreaksFunction.SoftLineBreak(thisLastRune, siblingFirstRune) {


I am convinced that this is one of the things we should implement.

yuin · 2023-10-14T08:33:18Z

renderer/html/html.go

+	EastAsianLineBreaksCSS3Draft
+)
+
+type eastAsianLineBreaksFunction interface {


This name does not follow Go conventions.
softLineBreaker is a better name for this kind of interface.

For an interface like this that has only one method, we often define a function type that implements it (e.g. http.Handler and http.HandlerFunc).
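
The http.Handler / http.HandlerFunc pattern mentioned here might look like this for a single-method line-break interface. `softLineBreakerFunc` and the two sample breakers are hypothetical names chosen for the sketch.

```go
package main

import "fmt"

// softLineBreaker decides whether a soft line break between two runes
// should be rendered.
type softLineBreaker interface {
	SoftLineBreak(thisLastRune, siblingFirstRune rune) bool
}

// softLineBreakerFunc adapts a plain function to softLineBreaker, in the
// same way http.HandlerFunc adapts a function to http.Handler.
type softLineBreakerFunc func(thisLastRune, siblingFirstRune rune) bool

// SoftLineBreak calls f itself.
func (f softLineBreakerFunc) SoftLineBreak(thisLastRune, siblingFirstRune rune) bool {
	return f(thisLastRune, siblingFirstRune)
}

// alwaysBreak is a toy breaker that always keeps the line break.
var alwaysBreak softLineBreaker = softLineBreakerFunc(func(_, _ rune) bool {
	return true
})

// asciiOnlyBreak is a toy breaker that keeps the break only between two
// ASCII runes.
var asciiOnlyBreak softLineBreaker = softLineBreakerFunc(func(a, b rune) bool {
	return a < 0x80 && b < 0x80
})

func main() {
	fmt.Println(alwaysBreak.SoftLineBreak('。', '私'))
	fmt.Println(asciiOnlyBreak.SoftLineBreak('。', '私'))
	fmt.Println(asciiOnlyBreak.SoftLineBreak('a', 'b'))
}
```

With the adapter, each style can be a small function value instead of a separate empty struct type.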

yuin · 2023-10-14T08:40:02Z

renderer/html/html.go

+type eastAsianLineBreaksCSS3Draft struct{}
+
+func (e *eastAsianLineBreaksCSS3Draft) SoftLineBreak(thisLastRune rune, siblingFirstRune rune) bool {
+	return !(util.IsEastAsianWideRune(thisLastRune) || util.IsEastAsianWideRune(siblingFirstRune))


It seems this does not satisfy CSS3 Draft rules.

If the character immediately before or immediately after the segment break is the zero-width space character (U+200B), then the break is removed, leaving behind the zero-width space.

Otherwise, if the East Asian Width property [UAX11] of both the character before and after the segment break is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.

Otherwise, the segment break is converted to a space (U+0020).

So CSS3Draft is not suitable for this implementation.

henry0312 · 2023-10-15

I'll try implementing based on the provided direction, though I'm not fully confident I've grasped all the details (still wrapping my head around Unicode handling). Might need your patience and continued reviews – really appreciate it!

henry0312 · 2023-10-15

After reviewing the CSS Text Module Level 3, it appears the current usage is as shown in the following screenshot. May I proceed to implement the specifications (+ enhancements) you pointed out as CSS3Draft?

CSS3Draft:

If the character immediately before or immediately after the segment break is the zero-width space character (U+200B), then the break is removed, leaving behind the zero-width space.

Otherwise, if the East Asian Width property [UAX11] of both the character before and after the segment break is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.

Otherwise, if either the character before or after the segment break belongs to the space-discarding character set and is a Unicode Punctuation (P*) or U+3000, then the segment break is removed (cf. [css-text-3] Segment Break Transformation Rules around CJK Punctuation w3c/csswg-drafts#5086)

Otherwise, the segment break is converted to a space (U+0020).

yuin · 2023-10-15

Yes. Some implementations of this algorithm may be helpful for you:

https://github.com/markdown-it/markdown-it-cjk-breaks

This plugin finds and removes newlines that cannot be converted to space, algorithm matches CSS Text Module Level 3:

henry0312 · 2023-10-16

Thank you for the information. I will begin implementing CSS3Draft using the rules discussed in this thread!
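
The four CSS3Draft rules agreed on in this thread, including the punctuation enhancement, could be sketched as below. This is not the merged goldmark code. `isWideApprox` is a hypothetical approximation used both for the F/W/H check and for the "space-discarding character set", whose exact membership is defined by the CSS drafts and is not reproduced here.

```go
package main

import (
	"fmt"
	"unicode"
)

// isWideApprox approximates East Asian Width F/W/H and is also used as a
// stand-in for the space-discarding set (assumption, not the spec's table).
func isWideApprox(r rune) bool {
	return unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul) ||
		(r >= 0x3000 && r <= 0x303F) ||
		(r >= 0xFF01 && r <= 0xFFEE)
}

// isDiscardingPunct reports whether r is U+3000 or a wide punctuation rune.
func isDiscardingPunct(r rune) bool {
	return r == 0x3000 || (isWideApprox(r) && unicode.IsPunct(r))
}

// removeBreak reports whether the segment break between before and after
// is removed (true) or converted to a space (false).
func removeBreak(before, after rune) bool {
	const zwsp = '\u200B'
	hangul := unicode.Is(unicode.Hangul, before) || unicode.Is(unicode.Hangul, after)
	switch {
	case before == zwsp || after == zwsp:
		return true // rule 1: zero-width space on either side
	case isWideApprox(before) && isWideApprox(after) && !hangul:
		return true // rule 2: both sides wide, neither Hangul
	case isDiscardingPunct(before) || isDiscardingPunct(after):
		return true // rule 3: CJK punctuation or U+3000 on either side
	default:
		return false // rule 4: break becomes a space
	}
}

func main() {
	fmt.Println(removeBreak('。', 'G'))
	fmt.Println(removeBreak('私', 'G'))
	fmt.Println(removeBreak('한', '글'))
}
```

Under this sketch, a break after "。" is removed even before a Western letter, while a break between a Han character and a Western letter still becomes a space.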

yuin · 2023-10-14T08:51:18Z

extension/cjk.go

+	EastAsianLineBreaksStyleSimple EastAsianLineBreaksStyle = iota
+	// EastAsianLineBreaksCSS3Draft is a style where soft line breaks are ignored
+	// even if only one side of the break is an east asian wide character.
+	EastAsianLineBreaksCSS3Draft


It seems your implementation does not satisfy the CSS3 draft rules. We have two choices:

1. Implement the CSS3 draft rules and leave this name as it is.
2. Change this name to represent your implementation.

I prefer 1:

- Implement the CSS Text Level 3 draft rules.
- Implement an additional enhancement (i.e. resolve [css-text-3] Segment Break Transformation Rules around CJK Punctuation w3c/csswg-drafts#5086).
- Write a README note like 'This option implements CSS Text Level 3 Segment Break Transformation Rules with some enhancements'.

henry0312 · 2023-10-24T13:10:56Z

I've finally completed the implementation of CSS3Draft!
Please review it.

P.S.
Several lint errors are occurring in Github Actions.
These errors seem unrelated to the PR.
https://github.com/yuin/goldmark/pull/411/files#diff-2fc06eaa27f973f12f6d355127a8e1e8d0c49e88dca4c6e9fc843aebea12d0a5

yuin · 2023-10-28T08:59:33Z

I've merged the PR and made some code improvements in 9c90033.

Thanks for your contribution.

henry0312 · 2023-10-29T09:27:24Z

Thank you for the merge. I also appreciate the extensive reviews and explanations of the direction.

Talhasaleem110 · 2023-11-08T03:40:53Z

In this pull request, we've addressed the rendering of line breaks, especially concerning the interaction between Western and East Asian wide characters. The aim is to ensure that the output is more intuitive and naturally rendered.

Changes:

Refactoring in renderText Method:

Simplified the condition that determines whether a new line should be added between characters. We've used a more straightforward and readable condition using De Morgan's law.

This change primarily impacts the way line breaks are handled between East Asian wide characters and Western characters.

Updated Test Cases:

We've modified the expected results for the test cases in cjk_test.go to reflect the desired behavior.

Specifically, we've adjusted tests that involve soft line breaks between a Western character and an East Asian wide character. This ensures that our tests are in line with the new rendering logic.

Through these changes, we hope to enhance the readability of rendered content, especially when dealing with mixed character types. We kindly ask for your review and feedback on these modifications.

Example 1:

Input Markdown:

私はプログラマーです。
GoでWebアプリケーションを開発しています。

Current Output:

私はプログラマーです。\nGoでWebアプリケーションを開発しています。

Expected Output:

私はプログラマーです。GoでWebアプリケーションを開発しています。

Example 2:

Input Markdown:

I am a programmer. 
私はプログラマーです。

Current Output:

I am a programmer. \n私はプログラマーです。

Expected Output:

I am a programmer. 私はプログラマーです。

Improve line breaking behavior for east asian characters

6ef9b10

This commit aims to produce more natural line breaks in the rendered output.

henry0312 added 2 commits September 10, 2023 15:08

Add a WorksEvenWithOneSide option to EastAsianLineBreak

6cbcfeb

add comments

2367b9f

fix tests

dc2230c

henry0312 added 3 commits September 24, 2023 14:25

Define EastAsianLineBreaksStyle to specify behavior of line breaking

9d0b1b6

Updat README.md

792af68

fix errors of lints

8c6830d

henry0312 changed the title ~~Improve line breaking behavior for east asian characters~~ Define line break styles for east asian characters as options Sep 24, 2023

Implements CSS3Draft

6b3067e

yuin merged commit a89ad04 into yuin:master Oct 28, 2023
4 of 6 checks passed

henry0312 deleted the update_cond_east_asian_line_breaks branch October 29, 2023 05:52

henry0312 added a commit to henry0312/hugo that referenced this pull request Oct 29, 2023

markup/goldmark: update the CJK extension to allow specifying line br…

5a6a385

…eak styles This commit follows yuin/goldmark#411

henry0312 mentioned this pull request Oct 29, 2023

markup/goldmark: update the CJK extension to allow specifying line break styles gohugoio/hugo#11622

Merged

bep pushed a commit to gohugoio/hugo that referenced this pull request Oct 29, 2023

markup/goldmark: Update the CJK extension to allow specifying line br…

db14238

…eak styles This commit follows yuin/goldmark#411
