Incorrect character counting for non-ASCII characters #11396

kotet · 2024-05-13T02:47:29Z

I am a Japanese speaker and am working on a project where Japanese is used as part of the code.
I have gotten different output from black and ruff for the line length.

black:

s = "テスト用コードです。改行を除いて87文字です" + " (This is a test code. 87 characters w/o line breaks)"

t = "this is a test code. 87 characters" + " w/o line breaks and non-ascii characters"

ruff:

$ ruff format test.py

s = (
    "テスト用コードです。改行を除いて87文字です"
    + " (This is a test code. 87 characters w/o line breaks)"
)

t = "this is a test code. 87 characters" + " w/o line breaks and non-ascii characters"

ruff version: ruff 0.4.4

zanieb · 2024-05-13T02:50:12Z

Hi! I believe this is a known deviation from Black.

Ruff uses the Unicode width of a line to determine if a line fits. Black uses Unicode width for strings, and character width for all other tokens. Ruff also uses Unicode width for identifiers and comments.

Related #3714 and psf/black#3445
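To make the difference between counting characters and measuring display width concrete, here is a minimal sketch. It approximates width with Python's standard `unicodedata.east_asian_width`, counting Wide and Fullwidth characters as two columns and everything else as one. This is an assumption for illustration only: the helper name `display_width` is hypothetical, Ruff's actual width calculation is not shown in this thread, and the sketch ignores zero-width characters.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the column width of text (hypothetical helper)."""
    # Assumption: Wide ("W") and Fullwidth ("F") characters take two columns,
    # every other character takes one. Zero-width characters are not handled.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in text
    )


# The line from the original report, rebuilt from two adjacent literals.
line = (
    's = "テスト用コードです。改行を除いて87文字です"'
    ' + " (This is a test code. 87 characters w/o line breaks)"'
)

print(len(line))            # number of characters
print(display_width(line))  # approximate display width in columns
```

Under these assumptions the line has 86 characters but is roughly 106 columns wide, which would put it over the 88-column limit reported in this thread.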

kotet · 2024-05-13T03:00:47Z

If all tokens are counted in Unicode width, line splitting should not occur, since s = "テスト用コードです。改行を除いて87文字です" + " (This is a test code. 87 characters w/o line breaks)" is 86 characters long.

kotet · 2024-05-13T03:22:55Z

These two lines have the same number of characters (88 + newline):

u = "this is test"  # comment 🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

u = "this is test" + "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"

However, the formatting results are different:

u = "this is test"  # comment 🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

u = (
    "this is test"
    + "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
)

kotet · 2024-05-13T04:11:03Z

ruff check result:

$ ruff check test.py 
test.py:1:60: E501 Line too long (146 > 88)
test.py:3:56: E501 Line too long (153 > 88)
Found 2 errors.

It seems that the character count will not work correctly in some situations.

153957 · 2024-05-13T05:27:20Z

I believe ruff does not format (wrap) long trailing comments if associated with code on the same line. Probably because in this case it does not know if the comment is about the variable name or the value. Using ASCII characters in your example also does not cause ruff formatting to rewrap:

u = 'this is test'  # comment comment commentcommentcomment commentcommentcomment commentcommentcommentcommentcommentcommentcommentcommentcommentcommentcommentcommentcomment

kotet · 2024-05-13T06:52:52Z

Sorry, my examples were misguided! I have created a more appropriate reproduction.
All four lines below have the same number of characters (48).

"🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍! + !🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
"🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍" + "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa! + !aaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" + "aaaaaaaaaaa"

pyproject.toml:

[tool.ruff]
[tool.ruff.lint]
select = [
    "E",
    "F",
    "W",
]

lint result:

$ ruff check test.py 
test.py:1:48: E501 Line too long (89 > 88)
test.py:2:48: E501 Line too long (89 > 88)
Found 2 errors.

format result:

"🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍! + !🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"

(
    "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
    + "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
)
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa! + !aaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" + "aaaaaaaaaaa"

charliermarsh · 2024-05-13T17:04:34Z

To be clear, we do not use character count. We use character width.

kotet · 2024-05-14T02:42:08Z

Oh, I see! I misunderstood "character width" to mean the number of bytes. I understand this works as intended.
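The three measures that were confused in this thread (character count, byte count, and display width) can be compared side by side. The sketch below is illustrative only: `measures` is a hypothetical helper, and the width estimate assumes that Wide and Fullwidth characters (per `unicodedata.east_asian_width`) take two columns and all others take one, which may differ from Ruff's real calculation in edge cases.

```python
import unicodedata


def measures(text: str) -> tuple[int, int, int]:
    """Return (characters, UTF-8 bytes, approximate display width)."""
    # Assumption: "W" and "F" characters are two columns wide, others one.
    width = sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in text
    )
    return len(text), len(text.encode("utf-8")), width


# First and third lines of the reproduction above, built programmatically.
snakes = '"' + "🐍" * 30 + "! + !" + "🐍" * 11 + '"'
ascii_only = '"' + "a" * 30 + "! + !" + "a" * 11 + '"'

print(measures(snakes))
print(measures(ascii_only))
```

For the emoji line this gives 48 characters, 171 UTF-8 bytes, and an estimated width of 89, the same 89 that E501 reports against the limit of 88. The ASCII line measures 48 in all three.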

kotet closed this as completed May 14, 2024

dhruvmanila added the "question" label (Asking for support or clarification) on May 14, 2024
