
tokenize.generate_tokens() performance regression in 3.12 #119118

Closed
devdanzin opened this issue May 17, 2024 · 12 comments
Labels: performance Performance or resource usage · type-bug An unexpected behavior, bug, or error

devdanzin commented May 17, 2024

Bug report

Bug description:

There seems to be a significant performance regression in tokenize.generate_tokens() between 3.11 and 3.12 when tokenizing a (very) large dict on a single line. I searched the existing issues but couldn't find anything about this.

To reproduce, rename the file largedict.py.txt to largedict.py in the same directory as the script below, then run the script. That file comes from nedbat/coveragepy#1785.

import io, time, sys, tokenize
import largedict

text = largedict.d
readline = io.StringIO(text).readline
glob_start = start = time.time()
print(f"{sys.implementation.name} {sys.platform} {sys.version}")

for i, (ttype, ttext, (sline, scol), (_, ecol), _) in enumerate(tokenize.generate_tokens(readline)):
    if i % 500 == 0:
        print(i, ttype, ttext, sline, scol, time.time() - start)
        start = time.time()
    if i % 5000 == 0:
        print(time.time() - glob_start)

print(f"Time taken: {time.time() - glob_start}")

For Python 3.12, this results in:

cpython linux 3.12.3 (main, May 17 2024, 07:19:22) [GCC 11.4.0]
0 1 a_large_dict_literal 1 0 0.04641866683959961
0.046633005142211914
500 3 ':tombol_a_(golongan_darah):' 1 2675 9.689745903015137
1000 3 ':flagge_anguilla:' 1 5261 9.767053604125977
1500 3 ':флаг_Армения:' 1 7879 9.258271932601929
[...]

For Python 3.11, this results in:

cpython linux 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
0 1 a_large_dict_literal 1 0 0.013637304306030273
0.013663768768310547
500 3 ':tombol_a_(golongan_darah):' 1 2675 0.002939462661743164
1000 3 ':flagge_anguilla:' 1 5261 0.0028715133666992188
1500 3 ':флаг_Армения:' 1 7879 0.002806425094604492
[...]
352500 3 'pt' 1 2589077 0.003370046615600586
Time taken: 2.1244866847991943

That is, each batch of 500 tokens takes over 9 seconds to process in Python 3.12, while Python 3.11 processes all 352,500 tokens in a bit over 2 seconds.

I can reproduce this on Linux (WSL) and Windows. It also seems to affect 3.13.

CPython versions tested on:

3.9, 3.10, 3.11, 3.12

Operating systems tested on:

Linux, Windows

Linked PRs

#119615
#119682
#119683

@devdanzin devdanzin added the type-bug An unexpected behavior, bug, or error label May 17, 2024
@AlexWaygood AlexWaygood added the performance Performance or resource usage label May 17, 2024
erlend-aasland (Contributor) commented

I suspect it could be related to the work done on PEP 701, which landed in 3.12.

pablogsal (Member) commented

Yeah, we are now using a completely different backend (the actual parser), so we need to investigate a bit what's going on here.

devdanzin (Author) commented

A data point that might help: when calling list(tokenize.generate_tokens(readline)) (or tuple()), memory usage blows up very fast with a large single-line dict, while the same dict broken into multiple lines doesn't cause memory issues.
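
A minimal sketch of one way to observe this, assuming the same largedict.py fixture from the report (the tracemalloc harness is illustrative, not part of the original script):

import io, tokenize, tracemalloc
import largedict

readline = io.StringIO(largedict.d).readline

tracemalloc.start()
# Each TokenInfo carries a .line attribute; if every token holds a distinct
# copy of the (multi-megabyte) line instead of sharing one string object,
# materializing the full token list multiplies memory usage accordingly.
tokens = list(tokenize.generate_tokens(readline))
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"{len(tokens)} tokens, peak traced memory: {peak / 2**20:.1f} MiB")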

pablogsal (Member) commented

Oh wow, that's very intriguing. I may be able to investigate this next week after PyCon, but thanks for the hint!


pablogsal commented May 17, 2024

OK, I think I have a guess (I need time to test my theory), but I am 99% sure I know the problem. The problem is this line:

line = PyUnicode_DecodeUTF8(line_start, size, "replace");

We are creating a unique string per token containing the entire line. If the dict spans the entire line, this is a very expensive operation that we repeat over and over with the same result. The fix is to keep reusing the same result until there is a new line.
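
A minimal Python sketch of that caching idea (the names here are illustrative; the actual fix lives in the C tokenizer):

class LineCache:
    """Decode each raw source line once and reuse the result for every
    token that falls on the same line."""

    def __init__(self):
        self._raw = None      # raw bytes of the last line we decoded
        self._decoded = None  # cached decoded string for that line

    def decode(self, raw_line: bytes) -> str:
        # Re-decode only when the tokenizer has advanced to a new line;
        # for a dict that spans one huge line, this runs exactly once.
        if raw_line is not self._raw:
            self._raw = raw_line
            self._decoded = raw_line.decode("utf-8", errors="replace")
        return self._decoded

Besides the CPU cost, sharing one decoded string per line also means every token on the line references the same object instead of holding a fresh copy.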


erlend-aasland commented May 17, 2024

Confirming that, Pablo: if I cache the line¹ on the tokenizer iter object instance, I no longer see the performance penalty. Forget that; it's not that simple.

Footnotes

  1. the PyObject *line object


lysnikolaou commented May 17, 2024

Well, yeah, it's not that simple. Doing some scrappy caching seems to have brought down the performance penalty, but it's still much worse than 3.11.

cpython on  main +16 -1 [!] via C v15.0.0-clang via 🐍 v3.11.3 
❯ ./python.exe tmp/t.py
cpython darwin 3.14.0a0 (heads/main-dirty:65de194dd80, May 17 2024, 15:17:08) [Clang 15.0.0 (clang-1500.3.9.4)]
0 1 a_large_dict_literal 1 0 0.07759904861450195
0.07767128944396973
500 3 ':tombol_a_(golongan_darah):' 1 2675 0.08086895942687988
1000 3 ':flagge_anguilla:' 1 5261 0.10818815231323242
1500 3 ':флаг_Армения:' 1 7879 0.1338660717010498
2000 3 ':grupo_sanguíneo_b:' 1 10470 0.1593310832977295
2500 3 ':bayrak_belarus:' 1 13074 0.18597793579101562
3000 3 ':旗_ボスニア・ヘルツェゴビナ:' 1 15514 0.2100069522857666
3500 3 ':Bulgaria:' 1 18735 0.23893284797668457
4000 3 ':bandiera_camerun:' 1 21240 0.2700800895690918
4500 2 2 1 23919 0.2944760322570801
5000 3 ':bendera_pulau_natal:' 1 26767 0.32537102699279785
2.0853629112243652
5500 3 'es' 1 29643 0.3519558906555176
6000 3 'ar' 1 32240 0.38205909729003906
6500 3 'ko' 1 34917 0.411639928817749
7000 2 2 1 37580 0.45705175399780273
7500 3 'fa' 1 40214 0.4636220932006836
8000 55 , 1 43013 0.49083518981933594
8500 3 ':加纳:' 1 45615 0.5203256607055664
9000 3 ':bandera_guatemala:' 1 48073 0.5623579025268555
...

So we get about 0.1–0.5 seconds per 500 tokens instead of the original ~9 seconds, but it's still very far from 3.11's performance.


lysnikolaou commented May 17, 2024

Oh, found it. It's not just that line, but also the calls to _PyPegen_byte_offset_to_character_offset made a bit further down, which eventually call PyUnicode_* APIs as well. After additionally caching those, I get:

cpython darwin 3.14.0a0 (heads/main-dirty:65de194dd80, May 17 2024, 16:37:58) [Clang 15.0.0 (clang-1500.3.9.4)]
0 1 a_large_dict_literal 1 0 0.08269691467285156
0.08281183242797852
500 3 ':tombol_a_(golongan_darah):' 1 0 0.004801034927368164
1000 3 ':flagge_anguilla:' 1 0 0.003465890884399414
1500 3 ':флаг_Армения:' 1 0 0.004238128662109375
2000 3 ':grupo_sanguíneo_b:' 1 0 0.0038290023803710938
2500 3 ':bayrak_belarus:' 1 0 0.005483865737915039
3000 3 ':旗_ボスニア・ヘルツェゴビナ:' 1 0 0.005335807800292969
3500 3 ':Bulgaria:' 1 0 0.0034456253051757812
4000 3 ':bandiera_camerun:' 1 0 0.0051038265228271484
4500 2 2 1 0 0.003175020217895508
5000 3 ':bendera_pulau_natal:' 1 0 0.0032072067260742188
0.1256239414215088
5500 3 'es' 1 0 0.003175973892211914
[...]
351500 3 'es' 1 0 0.0032269954681396484
352000 3 'ko' 1 0 0.003362894058227539
352500 3 'pt' 1 0 0.003036022186279297
Time taken: 2.6852049827575684

I'll push a PR shortly.
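
For context on the offset conversion above: turning a byte offset into a character (column) offset requires decoding a prefix of the line, which is quadratic overall if every token decodes from the start of the line. A minimal Python sketch of the incremental approach, decoding only the smallest buffer needed to measure the difference; the class and method names are illustrative, and the UTF-8/"replace" handling mirrors the C code:

class ColumnConverter:
    """Convert byte offsets on one line to character (column) offsets,
    decoding only the bytes between consecutive offsets."""

    def __init__(self, raw_line: bytes):
        self._raw = raw_line
        self._last_byte = 0  # byte offset of the previous query
        self._last_char = 0  # character offset of the previous query

    def char_offset(self, byte_offset: int) -> int:
        # The tokenizer requests offsets in increasing order, so each
        # byte of the line is decoded at most once in total.
        delta = self._raw[self._last_byte:byte_offset]
        self._last_char += len(delta.decode("utf-8", errors="replace"))
        self._last_byte = byte_offset
        return self._last_char

On a single multi-megabyte line this turns the total decoding work from quadratic in the line length into linear.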


nedbat commented May 25, 2024

@lysnikolaou any news on this? We have other people reporting a problem like this in 3.12.

lysnikolaou (Contributor) commented

Thanks for the reminder @nedbat! Turns out this was even trickier than I thought, and the fix I showed in #119118 (comment) was buggy. My patch is almost there, but I haven't had the time to fix some final details. I'll have a PR up by early next week.

pablogsal (Member) commented

Tell me if you want me to help, or I can pick it up if you don't have time, @lysnikolaou.

lysnikolaou added a commit to lysnikolaou/cpython that referenced this issue May 27, 2024
- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.
pablogsal added a commit that referenced this issue May 28, 2024
gh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
miss-islington and lysnikolaou added backport commits that referenced this issue May 28, 2024 (#119682, #119683)

pythongh-119118: Fix performance regression in tokenize module (GH-119615)

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
lysnikolaou (Contributor) commented

Fixed in #119615.
