Incorrect character counting for non-ASCII characters #11396

Closed
kotet opened this issue May 13, 2024 · 8 comments
Labels
question Asking for support or clarification

Comments

@kotet

kotet commented May 13, 2024

I am a Japanese speaker and am working on a project where Japanese is used as part of the code.
I have gotten different output from black and ruff for the line length.

black:

s = "テスト用コードです。改行を除いて87文字です" + " (This is a test code. 87 characters w/o line breaks)"

t = "this is a test code. 87 characters" + " w/o line breaks and non-ascii characters"

ruff:

$ ruff format test.py
s = (
    "テスト用コードです。改行を除いて87文字です"
    + " (This is a test code. 87 characters w/o line breaks)"
)

t = "this is a test code. 87 characters" + " w/o line breaks and non-ascii characters"

ruff version: ruff 0.4.4

@zanieb
Member

zanieb commented May 13, 2024

Hi! I believe this is a known deviation from Black.

Ruff uses the Unicode width of a line to determine if a line fits. Black uses Unicode width for strings, and character width for all other tokens. Ruff also uses Unicode width for identifiers and comments.

Related #3714 and psf/black#3445
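
To make the distinction concrete, here is a minimal Python sketch of character count versus display width, applied to the `s = ...` line from the original report. It is only an approximation: Ruff is written in Rust and its actual width computation is not shown in this thread. The helper name `display_width` is hypothetical, and it assumes that East Asian Wide and Fullwidth characters occupy two columns, as classified by the standard library's `unicodedata.east_asian_width`.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate column width: Wide/Fullwidth characters count as 2.

    Hypothetical helper for illustration only. It ignores combining marks,
    zero-width characters, and tabs, so it is not Ruff's exact algorithm.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


# The unformatted line from the report, held as a plain string.
line = 's = "テスト用コードです。改行を除いて87文字です" + " (This is a test code. 87 characters w/o line breaks)"'

char_count = len(line)       # number of Unicode code points
width = display_width(line)  # approximate number of columns

print(char_count, width)  # prints: 86 106
```

Under these assumptions the line has 86 characters, but its 20 Japanese characters each take two columns, so its width is 106, which exceeds the 88-column limit.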

@kotet
Author

kotet commented May 13, 2024

If all tokens are counted in Unicode width, line splitting should not occur, since s = "テスト用コードです。改行を除いて87文字です" + " (This is a test code. 87 characters w/o line breaks)" is 86 characters long.

@kotet
Author

kotet commented May 13, 2024

These two lines have the same number of characters (88 + newline):

u = "this is test"  # comment 🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

u = "this is test" + "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"

However, the formatting results are different:

u = "this is test"  # comment 🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍

u = (
    "this is test"
    + "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
)

@kotet
Author

kotet commented May 13, 2024

ruff check result:

$ ruff check test.py 
test.py:1:60: E501 Line too long (146 > 88)
test.py:3:56: E501 Line too long (153 > 88)
Found 2 errors.

It seems that the character count will not work correctly in some situations.

@153957
Contributor

153957 commented May 13, 2024

I believe ruff does not reformat (wrap) long trailing comments that sit on the same line as code, probably because it cannot tell whether the comment refers to the variable name or the value. Using only ASCII characters in your example does not cause ruff to rewrap the line either:

u = 'this is test'  # comment comment commentcommentcomment commentcommentcomment commentcommentcommentcommentcommentcommentcommentcommentcommentcommentcommentcommentcomment

@kotet
Author

kotet commented May 13, 2024

Sorry, my examples were misguided! I have created a more appropriate reproduction.
All four lines below have the same number of characters (48).

"🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍! + !🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
"🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍" + "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa! + !aaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" + "aaaaaaaaaaa"

pyproject.toml:

[tool.ruff]
[tool.ruff.lint]
select = [
    "E",
    "F",
    "W",
]

lint result:

$ ruff check test.py 
test.py:1:48: E501 Line too long (89 > 88)
test.py:2:48: E501 Line too long (89 > 88)
Found 2 errors.
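
The numbers in the E501 messages above can be reproduced with a short Python sketch. It assumes that each emoji occupies two columns, approximated here with `unicodedata.east_asian_width`. The helper `display_width` and the constant `LINE_LENGTH` are hypothetical names for illustration, not Ruff's implementation.

```python
import unicodedata

LINE_LENGTH = 88  # the limit shown in the E501 messages


def display_width(text: str) -> int:
    # Approximation: East Asian Wide/Fullwidth characters take two columns.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


snakes_30, snakes_11 = "🐍" * 30, "🐍" * 11
lines = [
    f'"{snakes_30}! + !{snakes_11}"',            # one string literal
    f'"{snakes_30}" + "{snakes_11}"',            # two literals joined by +
    '"' + "a" * 30 + "! + !" + "a" * 11 + '"',   # ASCII, one literal
    '"' + "a" * 30 + '" + "' + "a" * 11 + '"',   # ASCII, two literals
]

# (character count, approximate width, exceeds limit?)
report = [
    (len(line), display_width(line), display_width(line) > LINE_LENGTH)
    for line in lines
]
for row in report:
    print(row)
# (48, 89, True)
# (48, 89, True)
# (48, 48, False)
# (48, 48, False)
```

All four lines have 48 characters, but the 41 emoji in each of the first two lines add one extra column apiece, giving a width of 89. Only those two lines exceed the limit, which matches the two errors reported.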

format result:

"🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍! + !🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"

(
    "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
    + "🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍🐍"
)
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa! + !aaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" + "aaaaaaaaaaa"

@charliermarsh
Member

To be clear, we do not use character count. We use character width.

@kotet
Copy link
Author

kotet commented May 14, 2024

Oh, I see! I misunderstood "character width" to mean the number of bytes. I understand this works as intended.
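
For anyone who runs into the same confusion, the three measures (character count, UTF-8 byte count, and display width) can be compared side by side. This is a minimal sketch. The function `measure` is a hypothetical helper, and the column figure is an approximation based on `unicodedata.east_asian_width`, not Ruff's own computation.

```python
import unicodedata


def measure(ch: str) -> dict:
    """Return three different 'lengths' of a single character."""
    return {
        "code_points": len(ch),                 # what len() reports
        "utf8_bytes": len(ch.encode("utf-8")),  # size when encoded as UTF-8
        # Approximate columns: Wide/Fullwidth characters take two.
        "columns": 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1,
    }


results = {ch: measure(ch) for ch in ("a", "テ", "🐍")}
for ch, info in results.items():
    print(ch, info)
# a {'code_points': 1, 'utf8_bytes': 1, 'columns': 1}
# テ {'code_points': 1, 'utf8_bytes': 3, 'columns': 2}
# 🐍 {'code_points': 1, 'utf8_bytes': 4, 'columns': 2}
```

The width of a wide character is two columns regardless of whether it takes three or four bytes in UTF-8.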

@kotet kotet closed this as completed May 14, 2024
@dhruvmanila dhruvmanila added the question Asking for support or clarification label May 14, 2024