
Line length should consider Unicode character width #1197

Open
uranusjr opened this issue Dec 9, 2019 · 12 comments
Labels
T: enhancement New feature or request

Comments

@uranusjr
Contributor

uranusjr commented Dec 9, 2019

Is your feature request related to a problem? Please describe.

In some cases one would want to put non-ASCII characters in source. This is most common
for strings, but in Python 3 you can use non-ASCII identifiers as well.

But consider the following example:

parser.add_argument(
    "poem", help="Poem to process",
    default="ちはやふる神代もきかず竜田川からくれなゐに水くくるとは",
)

The default line is visually long, and I’d like to put it on its own line. But Black insists on formatting it into

parser.add_argument(
    "poem", help="Poem to process", default="ちはやふる神代もきかず竜田川からくれなゐに水くくるとは"
)

since it thinks there is enough space (indeed, the formatted line is only 73 characters long).

Describe the solution you'd like

When calculating line lengths, Black should consult the East Asian Width (EAW) property of each character, and count characters with EAW value F (Fullwidth) or W (Wide) as two columns.
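For illustration, here is a minimal sketch of that rule using Python’s standard `unicodedata` module (the `display_width` function is a hypothetical name for this issue, not Black’s actual API):

```python
import unicodedata

def display_width(line: str) -> int:
    """Count EAW Fullwidth (F) and Wide (W) characters as two columns."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in line
    )

# A line of kana occupies twice as many columns as its character count suggests.
poem = "ちはやふる神代もきかず"
print(len(poem), display_width(poem))  # 11 characters, 22 columns
```

With this measure, the `default=` line in the example above would be counted as well over 88 columns and kept on its own line.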

Describe alternatives you've considered

Having strictly single-width characters in source, and pulling in double-width strings at runtime (via i18n or \u escape sequences). The translation system is usually overkill in some cases (e.g. developing an in-house CLI where all users speak a common East Asian language), and replacing all strings with \u sequences makes the source illegible to humans.

Using fmt: on/off quickly becomes tedious and hard on the eyes when you have a lot of strings scattered around (e.g. for logging purposes).

Additional context

[Screenshot 2019-12-09 at 15 00 35]

@uranusjr uranusjr added the T: enhancement New feature or request label Dec 9, 2019
@mk-pmb

mk-pmb commented Dec 9, 2019

Black should consult the width attribute of each character, and count double width for characters with EAW F or W.

👍 and I'd be willing to try and implement it if maintainers agree they want the feature.

@mk-pmb

mk-pmb commented Dec 9, 2019

Related @ maintainers: Would you also want zero-width characters to not add to line length?

@JelleZijlstra
Collaborator

This makes sense to me in general, but I'm not familiar with the Unicode subtleties involved here.

I think the relevant question for computing line length is how good, Unicode-aware code editors display these characters. So if zero-width characters are actually normally displayed as zero-width, we should count them as zero-width.

I'm also concerned about whether this change could affect performance for the common case of having only normal-width characters.

@mk-pmb

mk-pmb commented Dec 10, 2019

for the common case of having only normal-width characters.

This is partially a chicken-and-egg problem: in a culture where it would make sense to use lots of fullwidth characters, a lack of tooling support might nudge people toward making their code base more "common" in the sense of "more American".

@mk-pmb

mk-pmb commented Dec 10, 2019

Still, it doesn't change the fact that even in an Asian Python code base, there will be lots of lines with only ASCII characters. We could, as a first mitigation, scan each line for whether it contains any multibyte chars at all. Maybe I can even optimize it so only the multibyte chars are checked; I'm not sure whether that will really be faster. Another optimization could be to check larger chunks rather than each individual char, so we could use a regexp range. I'll go research Unicode to be able to give better feedback on this.
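As a sketch of that mitigation (an assumed implementation, not anything in Black): a single regex search can dismiss the common all-narrow case before falling back to a per-character check. No codepoint below U+1100 has East Asian Width F or W, so the fast path is safe:

```python
import re
import unicodedata

# No codepoint below U+1100 is Fullwidth or Wide, so a line matching
# nothing here can be measured with plain len().
_MAYBE_WIDE = re.compile(r"[\u1100-\uffff\U00010000-\U0010ffff]")

def line_width(line: str) -> int:
    if _MAYBE_WIDE.search(line) is None:
        return len(line)  # fast path for the common all-narrow case
    # Slow path: count Fullwidth/Wide characters as two columns.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in line
    )
```

Pure-ASCII lines pay only the cost of one regex scan, which the `re` engine handles in C.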

@mk-pmb

mk-pmb commented Dec 10, 2019

(Sorry to anyone annoyed by my multi-replies; I do it so people can vote on them separately.)

how good, Unicode-aware code editors display these characters

That differs widely. As I understand the Black philosophy, I assume the Black answer would be "Anyone who uses such characters should just get an editor that can deal with them properly."

However, I personally tend to have a more lenient "let them configure stuff" attitude. Sometimes all the editors that manage to get the one feature right are lacking lots of other features. And support for Asian-language-specific issues is (in my view) very lacking in lots of free software projects, so I think it would not be helpful to present users with yet another case of "your problem almost could have been solved, but…".

@uranusjr
Contributor Author

Thanks for the discussion. Some of my own observations:

I think the relevant question for computing line length is how good, Unicode-aware code editors display these characters. So if zero-width characters are actually normally displayed as zero-width, we should count them as zero-width.

The support for EAW (i.e. single vs double width) is generally quite good. All GUI (including web) ones are very good at this. CLI editors generally rely on the terminal’s support, which is also quite universal these days, since it affects emojis as well.

Support varies for characters with ambiguous or unspecified width, with most engines displaying them as single-width (even zero-width characters). But IMO this is not a problem from a practical standpoint.

People tend to care less (at least I do) when the line wraps too early (i.e. breaks at 86 instead of 88 characters), and it’s unlikely for source code to have a lot of zero-width characters in its strings[*]. EAW characters, however, make the line wrap too late, producing extremely long lines that cause line overflow or soft wrapping, both very noticeable and with significant usability issues.

[*]: Non-printing characters are not a good idea anyway since they are inherently hard to spot, causing difficulties in maintenance. It is generally better to use escape codes instead. In an ideal world, combining characters would be counted as zero-width when they combine, but they are more difficult since the effective width depends on context. And again, this is less of a problem in practice since it only causes early wraps, not late ones.
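If combining characters were handled too, the per-character rule could be extended with `unicodedata.combining`, which returns a nonzero combining class for marks that render over the preceding character. A sketch under that assumption (not Black’s implementation):

```python
import unicodedata

def char_width(ch: str) -> int:
    if unicodedata.combining(ch):
        return 0  # combining marks overlay the previous character
    if unicodedata.east_asian_width(ch) in ("F", "W"):
        return 2  # fullwidth and wide characters take two columns
    return 1

def display_width(s: str) -> int:
    return sum(char_width(c) for c in s)

# "é" built from "e" + combining acute (U+0301) occupies a single column.
print(display_width("e\u0301"))  # 1
```

Note this is still a heuristic: enclosing marks and other context-dependent sequences are not covered, matching the "less of a problem in practice" point above.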

I'm also concerned about whether this change could affect performance for the common case of having only normal-width characters.

I understand what you’re trying to say, but this expression is slightly, um, annoying (avoiding the r word here). Hopefully you don’t really mean our native languages are not as “normal” as English? 🙂

@mk-pmb

mk-pmb commented Dec 10, 2019

not a problem from a practical standpoint.

If we simplify the problem to the subset relevant for real-life use cases, and…

People tend to care less (at least I do) when the line wraps too early

… simplify further to accept some amount of too-early line break, I guess we'll end up with a relatively small set of character range expressions and an almost trivial implementation as regexp or filtered list comprehension, whichever will turn out more performant.

Next we'd need to decide which real-life codebases we want to use for the performance testing.

@rbanffy

rbanffy commented Oct 10, 2021

I was just bitten by this issue with combining characters. I assume this happens with languages that have accents without corresponding precomposed accented characters. In this case, while looking to implement missing symbols from the IBM 3270 terminal, I implemented numbers with dotted overlines that are sometimes displayed in the terminal's status line with the "combining four dots above" symbol (U+20DC). In the example below, the line is considered too long because every combined character is counted as two.

composed_glyphs = "Composed glyphs: ÀÉI͂o̓N̈́AͅB̊͆Ȍ͇U͈D̈ẢB̊A̋ĎA̍J̎Ȁ1⃜2⃜3⃜4⃜5⃜6⃜7⃜8⃜9⃜0⃜\n"

I solved it by breaking the line, but I wouldn't say it's pretty.

https://github.com/rbanffy/3270font/blob/710719fb18705137a6440b8cc0c4d0368f690846/test_font_rendering.py#L16-L19

This change to my code is in commit rbanffy/3270font@710719f

I'd love to help fix this. Is anyone working on it?

@mk-pmb

mk-pmb commented Oct 13, 2021

In case anyone was hoping I'd do it because I was very interested back then: nope, I have since moved away from Black.

@hirmiura

hirmiura commented Nov 5, 2022

bump.
I would like this feature too.
black日本語

@dahlia
Contributor

dahlia commented Dec 17, 2022

I tried to address this (partly): #3445. However, I am not sure whether my approach is appropriate and whether I implemented things properly. I would appreciate it if someone could take a look at my patch. Thanks!

ichard26 pushed a commit that referenced this issue Mar 19, 2023
This patch changes the preview style so that string splitters respect
the Unicode East Asian Width[^1] property.  If you are not familiar
with CJK languages, this may not be immediately clear, so let me
elaborate with some examples.

Traditionally, East Asian characters (including punctuation) have
taken up twice the space of European letters and stops when rendered
in a monospace typeface.  Compare the following characters:

```
abcdefg.
글、字。
```

The characters on the first line are half-width, and those on the
second line are full-width.  (Also note that the last character with a
small circle, the East Asian period, is also full-width.)  Therefore,
if we want to prevent those full-width characters from exceeding the
maximum columns per line, we need to count their *width* rather than
the number of characters.  Take the following characters again:

```
글、字。
```

These are just 4 characters, but their total width is 8.

Suppose we want to maintain up to 4 columns per line with the following
text:

```
abcdefg.
글、字。
```

How should it be then?  We want it to look like:

```
abcd
efg.
글、
字。
```

However, Black currently turns it into like this:

```
abcd
efg.
글、字。
```

It's because Black currently counts the number of characters in the line
instead of measuring their width. So, how could we measure the width?
How can we tell if a character is full- or half-width? What if half-width
characters and full-width ones are mixed in a line? That's why Unicode
defined an attribute named `East_Asian_Width`. Unicode grouped every
single character according to their width in fixed-width typeset.

This partially addresses #1197, but only for string splitters. The other
parts need to be fixed as well in future patches.

This was implemented by copying rich's own approach to handling wide
characters: generate a table using wcwidth, check it into source
control, and use it to drive helper functions in Black's logic. This
gets us the best of both worlds: accuracy and performance (and lets us
update as per our stability policy too!).

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
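The table-driven lookup that commit describes can be sketched roughly as below. The ranges here are a tiny illustrative subset chosen for this example; the real table is generated from wcwidth and covers all assigned codepoints:

```python
from bisect import bisect_right

# Illustrative subset of (start, end, width) ranges, sorted by start.
WIDTH_TABLE = [
    (0x0300, 0x036F, 0),   # combining diacritical marks
    (0x1100, 0x115F, 2),   # Hangul Jamo
    (0x3000, 0x303E, 2),   # CJK symbols and punctuation
    (0x3041, 0x33FF, 2),   # Hiragana, Katakana, and neighbors
    (0x4E00, 0x9FFF, 2),   # CJK Unified Ideographs
    (0xAC00, 0xD7A3, 2),   # Hangul syllables
    (0xFF01, 0xFF60, 2),   # fullwidth forms
]
_STARTS = [start for start, _end, _width in WIDTH_TABLE]

def char_width(ch: str) -> int:
    codepoint = ord(ch)
    # Binary-search for the last range starting at or before the codepoint.
    idx = bisect_right(_STARTS, codepoint) - 1
    if idx >= 0:
        start, end, width = WIDTH_TABLE[idx]
        if codepoint <= end:
            return width
    return 1  # narrow by default

def str_width(line: str) -> int:
    return sum(char_width(ch) for ch in line)

print(str_width("글、字。"))  # 8
```

A checked-in table keeps results stable across Python versions, since `unicodedata` tracks whichever Unicode version the interpreter ships with.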