Match lexers and formatters by extension separately #2328

Merged: 1 commit merged into pygments:master on Jan 31, 2023

Conversation

dirkmueller
Contributor

d0487e3 "Remove filename pattern caches" (#2153) introduced a huge performance regression. While it is true that fnmatch already uses functools.lru_cache, that cache is limited to 256 entries on Python 3.10 and older, and we have over 1000 matching patterns, which means the cache is evicted entirely on every iteration.

We can be more clever without reverting that patch by avoiding fnmatch calls altogether, which provides an even better speedup. The bulk (> 99%) of all filename matches are on the filename extension, so in gen_mapfiles.py we split these into a separate tuple that we match without calling fnmatch().

This restores the previous performance without the overhead of an extra regexp cache.
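
For illustration, here is a minimal sketch of that split (the names patterns, extensions, globs and matches_filename are hypothetical and do not reflect the actual gen_mapfiles.py output): patterns of the form *.ext with no further wildcard are reduced to plain extension strings checked via str.endswith(), and only the remaining patterns fall back to fnmatch().

import fnmatch
import os

# Hypothetical pattern set; in pygments these come from the lexers'
# filename globs. Only the splitting technique itself reflects the patch.
patterns = ["*.py", "*.pyw", "Makefile", "*.[ch]"]

def is_plain_extension(pattern):
    # "*.py" qualifies; "Makefile" (no wildcard) and "*.[ch]" (real glob) do not.
    return pattern.startswith("*.") and not any(c in pattern[2:] for c in "*?[")

extensions = tuple(p[1:] for p in patterns if is_plain_extension(p))
globs = tuple(p for p in patterns if not is_plain_extension(p))

def matches_filename(filename):
    # Fast path: str.endswith() accepts a tuple of suffixes, no fnmatch needed.
    if filename.endswith(extensions):
        return True
    # Slow path: the rare remaining patterns still go through fnmatch().
    return any(fnmatch.fnmatch(os.path.basename(filename), p) for p in globs)

assert matches_filename("foo.py")
assert matches_filename("Makefile")
assert not matches_filename("foo.unknown")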

@jeanas
Contributor

jeanas commented Jan 31, 2023

Indeed, I checked the cache size in the CPython code, but it was actually changed recently (python/cpython#86965), so this will differ between Python 3.11 and old versions. My bad.

That said, could you please share the benchmarks that led you to conclude that this approach is faster than simply reverting the offending commit? I'm not keen on introducing complexity without good justification. In a quick and simplistic test I did on Python 3.10, this was actually a bit slower than the revert. Also, while current Pygments takes ~50ms on Python 3.10 to do all the matches, both this PR and the revert take ~1ms. I'm not sure there are many use cases where better performance than this for filename matching is important.
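
For what it's worth, the eviction behavior is observable directly: fnmatch compiles patterns through the private, version-dependent fnmatch._compile_pattern helper, which is wrapped in functools.lru_cache, so its cache_info() reports the maxsize of the running interpreter. A small sketch relying on that private detail:

import fnmatch

# _compile_pattern is a private CPython implementation detail; its lru_cache
# maxsize is 256 on Python 3.10 and older and was raised substantially in 3.11.
for pattern in ("*.py", "*.rs", "*.go"):
    fnmatch.fnmatch("example" + pattern[1:], pattern)
print(fnmatch._compile_pattern.cache_info())  # CacheInfo(hits=..., misses=..., maxsize=..., currsize=...)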

This introduced a performance regression. While it is true that
fnmatch already uses functools.lru_cache, that cache is limited to 256
on Python 3.10 and older and we have over 1000 matching patterns, which
means the cache is evicted entirely on every iteration.

This reverts commit 951c894.
@dirkmueller
Contributor Author

I haven't done a full microbenchmark on this, only some simple timeit runs:

import timeit

print(timeit.timeit('fnmatch.fnmatch("foo.py", "*.py")', setup='import fnmatch'))
print(timeit.timeit('"foo.py".endswith(".py")'))
print(timeit.timeit('"py" in ("lala", "loop", "py")'))

which prints here:

0.5052752160008822
0.08419276200220338
0.028339595002762508

So an "is this string in a tuple of strings" check is roughly a factor of 20 faster than fnmatch(), even with a cached regexp, i.e. even outside the pathological case where the regexp cache overflows as in my original report. Also, many of the patterns need no regular expression at all: more than 90% are wildcards that match only on the extension, and another small percentage matches the full filename literally (no wildcard at all). A meaningful regexp is the exception.

That's what led me to write the full patch, which I applied in my (much larger) program; the profiled bottleneck went away completely. There's certainly some cleanup that could be done on my patch to make it less convoluted. An API break would make this simpler (for example, if plugins specified extension matches separately).

The program that originally led me to profile this essentially runs lexers.guess_lexer_for_filename() several tens of thousands of times.
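
As a rough sketch of that kind of workload (the corpus and loop are made up for illustration; only guess_lexer_for_filename() and ClassNotFound are the real pygments APIs):

import timeit

from pygments.lexers import guess_lexer_for_filename
from pygments.util import ClassNotFound

# Stand-in corpus: classify a large file listing by filename alone.
filenames = ["setup.py", "main.c", "notes.unknown"] * 10_000

def classify_all():
    for fn in filenames:
        try:
            guess_lexer_for_filename(fn, "")  # empty content, match by name only
        except ClassNotFound:
            pass  # no lexer matched this filename

print(timeit.timeit(classify_all, number=1))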

@dirkmueller
Contributor Author

I have pushed a plain revert (with some import cleanups to pass the tests) instead. Let me know how you want to proceed.

@jeanas
Contributor

jeanas commented Jan 31, 2023

For meaningful benchmarking, you have to compare apples with apples. For example, your test does not take into account the overhead of splitting the file name.

I used this:

from pygments.lexers import find_lexer_class_for_filename
from timeit import timeit

# Choosing a file name where no lexer matches, to make sure all fnmatch patterns
# are run on it
fn = "file.unknown"
# confirming
assert find_lexer_class_for_filename(fn) is None

print(timeit("find_lexer_class_for_filename(fn)", number=1000, globals=globals()))

Results:

$ # master
$ python3.11 benchmark.py 
12.148645579000004
$ python3.10 benchmark.py 
48.05816511699959
$ 
$ # simple revert
$ python3.11 benchmark.py 
12.021800829000313
$ python3.10 benchmark.py 
1.452163715000097
$ 
$ # your original approach
$ python3.11 benchmark.py 
12.00767945699954
$ python3.10 benchmark.py 
1.446750384000552

As you can see, the difference is barely noticeable. So I prefer the simple revert.

Python 3.11 is still slower than before; that would be interesting to investigate.

@jeanas jeanas merged commit c467d61 into pygments:master Jan 31, 2023
@dirkmueller
Contributor Author

@jean-abou-samra I don't see such a slowdown with Python 3.11. With the simple revert patch applied, your benchmark.py gives:

python3.11:
0.7443678290001117

python3.10:
0.7769828950004012

which is in line with the general speedup expected of Python 3.11. So this looks good to me. Thank you for the quick merge!

@Anteru added this to the 2.15.0 milestone Mar 25, 2023