
Line length should consider Unicode character width #1197

Open
uranusjr opened this issue Dec 9, 2019 · 12 comments
Labels
T: enhancement New feature or request

Comments

@uranusjr
Contributor

uranusjr commented Dec 9, 2019

Is your feature request related to a problem? Please describe.

In some cases one would want to put non-ASCII characters in source. This is most common
for strings, but in Python 3 you can use non-ASCII identifiers as well.

But consider the following example:

parser.add_argument(
    "poem", help="Poem to process",
    default="ちはやふる神代もきかず竜田川からくれなゐに水くくるとは",
)

The default line is visually long, and I’d like to put it on its own line. But Black insists on formatting it into

parser.add_argument(
    "poem", help="Poem to process", default="ちはやふる神代もきかず竜田川からくれなゐに水くくるとは"
)

since it thinks there is enough space (indeed, the formatted line is only 73 characters long).

Describe the solution you'd like

When calculating line lengths, Black should consult the East Asian Width (EAW) property of each character, and count characters with EAW value F (Fullwidth) or W (Wide) as two columns.
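For illustration, here is a minimal sketch of that rule using Python’s standard `unicodedata` module (the `display_width` function is a hypothetical name for this issue, not Black’s actual API):

```python
import unicodedata

def display_width(line: str) -> int:
    """Count EAW Fullwidth (F) and Wide (W) characters as two columns."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in line
    )

# A line of kana occupies twice as many columns as its character count suggests.
poem = "ちはやふる神代もきかず"
print(len(poem), display_width(poem))  # 11 characters, 22 columns
```

With this measure, the `default=` line in the example above would be counted as well over 88 columns and kept on its own line.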

Describe alternatives you've considered

Having strictly single-width characters in source, and pulling in double-width strings at runtime (via i18n or \u escape sequences). The translation system is usually overkill in some cases (e.g. developing an in-house CLI where all users speak a common East Asian language), and replacing all strings with \u sequences makes the source illegible to humans.

Using fmt: on/off quickly becomes tedious and hard on the eyes when you have a lot of strings scattered around (e.g. for logging purposes).

Additional context

[Screenshot 2019-12-09 at 15 00 35]

@uranusjr uranusjr added the T: enhancement New feature or request label Dec 9, 2019
@mk-pmb

mk-pmb commented Dec 9, 2019

Black should consult the width attribute of each character, and count double width for characters with EAW F or W.

👍 and I'd be willing to try and implement it if maintainers agree they want the feature.

@mk-pmb

mk-pmb commented Dec 9, 2019

Related @ maintainers: Would you also want zero-width characters to not add to line length?

@JelleZijlstra
Collaborator

This makes sense to me in general, but I'm not familiar with the Unicode subtleties involved here.

I think the relevant question for computing line length is how good, Unicode-aware code editors display these characters. So if zero-width characters are actually normally displayed as zero-width, we should count them as zero-width.

I'm also concerned about whether this change could affect performance for the common case of having only normal-width characters.

@mk-pmb

mk-pmb commented Dec 10, 2019

for the common case of having only normal-width characters.

This is partially a chicken-and-egg problem: in a culture where it would make sense to use lots of fullwidth characters, a lack of tooling support might nudge people toward making their code base more "common" in the sense of "more American".

@mk-pmb

mk-pmb commented Dec 10, 2019

Still, it doesn't change the fact that even in an Asian Python code base, there will be lots of lines with only ASCII characters. We could, as a first mitigation, scan each line for whether it contains any multibyte chars at all. Maybe I can even optimize it so only the multibyte chars are checked; I'm not sure whether that will really be faster. Another optimization could be to check larger chunks rather than each individual char, so we could use a regexp range. I'll go research Unicode to be able to give better feedback on this.
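As a sketch of that mitigation (an assumed implementation, not anything in Black): a single regex search can dismiss the common all-narrow case before falling back to a per-character check. No codepoint below U+1100 has East Asian Width F or W, so the fast path is safe:

```python
import re
import unicodedata

# No codepoint below U+1100 is Fullwidth or Wide, so a line matching
# nothing here can be measured with plain len().
_MAYBE_WIDE = re.compile(r"[\u1100-\uffff\U00010000-\U0010ffff]")

def line_width(line: str) -> int:
    if _MAYBE_WIDE.search(line) is None:
        return len(line)  # fast path for the common all-narrow case
    # Slow path: count Fullwidth/Wide characters as two columns.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in line
    )
```

Pure-ASCII lines pay only the cost of one regex scan, which the `re` engine handles in C.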

@mk-pmb

mk-pmb commented Dec 10, 2019

(Sorry to anyone annoyed by my multi-replies; I do it so people can vote on them separately.)

how good, Unicode-aware code editors display these characters

That differs widely. As I understand the Black philosophy, I assume the Black answer would be "Anyone who uses such characters should just get an editor that can deal with them properly."

However, I personally tend to have a more lenient "let them configure stuff" attitude. Sometimes all the editors that manage to get the one feature right are lacking lots of other features. And support for Asian-language-specific issues is (in my view) very lacking in lots of free software projects, so I think it would not be helpful to present users with yet another case of "your problem almost could have been solved, but…".

@uranusjr
Contributor Author

Thanks for the discussion. Some of my own observations:

I think the relevant question for computing line length is how good, Unicode-aware code editors display these characters. So if zero-width characters are actually normally displayed as zero-width, we should count them as zero-width.

The support for EAW (i.e. single vs double width) is generally quite good. All GUI (including web) ones are very good at this. CLI editors generally rely on the terminal’s support, which is also quite universal these days, since it affects emojis as well.

Support varies for characters with ambiguous or unspecified width, with most engines displaying them as single-width (even zero-width characters). But IMO this is not a problem from a practical standpoint.

People tend to care less (at least I do) when the line wraps too early (i.e. breaks at 86 instead of 88 characters), and it’s unlikely for source code to have a lot of zero-width characters in its strings[*]. EAW characters, however, make the line wrap too late, producing extremely long lines that cause line overflow or soft wrapping, both very noticeable and with significant usability issues.

[*]: Non-printing characters are not a good idea anyway since they are inherently hard to spot, causing difficulties in maintenance. It is generally better to use escape codes instead. In an ideal world, combining characters would be counted as zero-width when they combine, but they are more difficult since the effective width depends on context. And again, this is less of a problem in practice since it only causes early wraps, not late ones.
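If combining characters were handled too, the per-character rule could be extended with `unicodedata.combining`, which returns a nonzero combining class for marks that render over the preceding character. A sketch under that assumption (not Black’s implementation):

```python
import unicodedata

def char_width(ch: str) -> int:
    if unicodedata.combining(ch):
        return 0  # combining marks overlay the previous character
    if unicodedata.east_asian_width(ch) in ("F", "W"):
        return 2  # fullwidth and wide characters take two columns
    return 1

def display_width(s: str) -> int:
    return sum(char_width(c) for c in s)

# "é" built from "e" + combining acute (U+0301) occupies a single column.
print(display_width("e\u0301"))  # 1
```

Note this is still a heuristic: enclosing marks and other context-dependent sequences are not covered, matching the "less of a problem in practice" point above.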

I'm also concerned about whether this change could affect performance for the common case of having only normal-width characters.

I understand what you’re trying to say, but this expression is slightly, um, annoying (avoiding the r word here). Hopefully you don’t really mean our native languages are not as “normal” as English? 🙂

@mk-pmb

mk-pmb commented Dec 10, 2019

not a problem from a practical standpoint.

If we simplify the problem to the subset relevant for real-life use cases, and…

People tend to care less (at least I do) when the line wraps too early

… simplify further to accept some amount of too-early line break, I guess we'll end up with a relatively small set of character range expressions and an almost trivial implementation as regexp or filtered list comprehension, whichever will turn out more performant.

Next we'd need to decide which real-life codebases we want to use for the performance testing.

@rbanffy

rbanffy commented Oct 10, 2021

I was just bitten by this issue with combining characters. I assume this happens with languages that have accents without corresponding precomposed accented characters. In this case, while looking to implement missing symbols from the IBM 3270 terminal, I implemented numbers with dotted overlines that are sometimes displayed in the terminal's status line with the "combining four dots above" symbol (U+20DC). In the example below, the line is considered too long because every combined character is counted as two.

composed_glyphs = "Composed glyphs: ÀÉI͂o̓N̈́AͅB̊͆Ȍ͇U͈D̈ẢB̊A̋ĎA̍J̎Ȁ1⃜2⃜3⃜4⃜5⃜6⃜7⃜8⃜9⃜0⃜\n"

I solved it by breaking the line, but I wouldn't say it's pretty.

https://github.com/rbanffy/3270font/blob/710719fb18705137a6440b8cc0c4d0368f690846/test_font_rendering.py#L16-L19

This change to my code is in commit rbanffy/3270font@710719f

I'd love to help fix this. Is anyone working on it?

@mk-pmb

mk-pmb commented Oct 13, 2021

In case anyone was hoping I'd do it because I was very interested back then: nope, I have since moved away from Black.

@hirmiura

hirmiura commented Nov 5, 2022

bump.
I would like this feature too.
black日本語

@dahlia
Contributor

dahlia commented Dec 17, 2022

I tried to address this (partly): #3445. However, I am not sure whether my approach is appropriate and whether I implemented things properly. I would appreciate it if someone could take a look at my patch. Thanks!

ichard26 pushed a commit that referenced this issue Mar 19, 2023
This patch changes the preview style so that string splitters respect
the Unicode East Asian Width[^1] property.  If you are not familiar
with CJK languages, this may not be immediately clear, so let me
elaborate with some examples.

Traditionally, East Asian characters (including punctuation) have
taken up twice the space of European letters and stops when rendered
in a monospace typeface.  Compare the following characters:

```
abcdefg.
글、字。
```

The characters on the first line are half-width, and those on the
second line are full-width.  (Also note that the last character with a
small circle, the East Asian period, is also full-width.)  Therefore,
if we want to prevent those full-width characters from exceeding the
maximum columns per line, we need to count their *width* rather than
the number of characters.  Take the following characters again:

```
글、字。
```

These are just 4 characters, but their total width is 8.

Suppose we want to maintain up to 4 columns per line with the following
text:

```
abcdefg.
글、字。
```

How should it be then?  We want it to look like:

```
abcd
efg.
글、
字。
```

However, Black currently turns it into like this:

```
abcd
efg.
글、字。
```

It's because Black currently counts the number of characters in the line
instead of measuring their width. So, how could we measure the width?
How can we tell if a character is full- or half-width? What if half-width
characters and full-width ones are mixed in a line? That's why Unicode
defined an attribute named `East_Asian_Width`. Unicode grouped every
single character according to their width in fixed-width typeset.

This partially addresses #1197, but only for string splitters. The other
parts need to be fixed as well in future patches.

This was implemented by copying rich's own approach to handling wide
characters: generate a table using wcwidth, check it into source
control, and use it to drive helper functions in Black's logic. This
gets us the best of both worlds: accuracy and performance (and lets us
update as per our stability policy too!).

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
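The table-driven lookup that commit describes can be sketched roughly as below. The ranges here are a tiny illustrative subset chosen for this example; the real table is generated from wcwidth and covers all assigned codepoints:

```python
from bisect import bisect_right

# Illustrative subset of (start, end, width) ranges, sorted by start.
WIDTH_TABLE = [
    (0x0300, 0x036F, 0),   # combining diacritical marks
    (0x1100, 0x115F, 2),   # Hangul Jamo
    (0x3000, 0x303E, 2),   # CJK symbols and punctuation
    (0x3041, 0x33FF, 2),   # Hiragana, Katakana, and neighbors
    (0x4E00, 0x9FFF, 2),   # CJK Unified Ideographs
    (0xAC00, 0xD7A3, 2),   # Hangul syllables
    (0xFF01, 0xFF60, 2),   # fullwidth forms
]
_STARTS = [start for start, _end, _width in WIDTH_TABLE]

def char_width(ch: str) -> int:
    codepoint = ord(ch)
    # Binary-search for the last range starting at or before the codepoint.
    idx = bisect_right(_STARTS, codepoint) - 1
    if idx >= 0:
        start, end, width = WIDTH_TABLE[idx]
        if codepoint <= end:
            return width
    return 1  # narrow by default

def str_width(line: str) -> int:
    return sum(char_width(ch) for ch in line)

print(str_width("글、字。"))  # 8
```

A checked-in table keeps results stable across Python versions, since `unicodedata` tracks whichever Unicode version the interpreter ships with.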