Match lexers and formatters by extension separately #2328

Merged: 1 commit merged into pygments:master on Jan 31, 2023

Conversation

dirkmueller
Contributor

d0487e3 "Remove filename pattern caches" (#2153) introduced a huge performance regression. While it is true that fnmatch already uses functools.lru_cache, that cache is limited to 256 entries on Python 3.10 and older, and we have over 1000 matching patterns, which means the cache is evicted entirely on every iteration.

We can be more clever without reverting that patch by avoiding fnmatch calls altogether, which provides an even better speedup. The bulk (> 99%) of all filename matches are on the filename extension, so in gen_mapfiles.py we split these into a separate tuple that we match without calling fnmatch().

This restores the previous performance without the overhead of an extra regexp cache.
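
For illustration, here is a minimal sketch of that split (the names patterns, extensions, globs and matches_filename are hypothetical and do not reflect the actual gen_mapfiles.py output): patterns of the form *.ext with no further wildcard are reduced to plain extension strings checked via str.endswith(), and only the remaining patterns fall back to fnmatch().

import fnmatch
import os

# Hypothetical pattern set; in pygments these come from the lexers'
# filename globs. Only the splitting technique itself reflects the patch.
patterns = ["*.py", "*.pyw", "Makefile", "*.[ch]"]

def is_plain_extension(pattern):
    # "*.py" qualifies; "Makefile" (no wildcard) and "*.[ch]" (real glob) do not.
    return pattern.startswith("*.") and not any(c in pattern[2:] for c in "*?[")

extensions = tuple(p[1:] for p in patterns if is_plain_extension(p))
globs = tuple(p for p in patterns if not is_plain_extension(p))

def matches_filename(filename):
    # Fast path: str.endswith() accepts a tuple of suffixes, no fnmatch needed.
    if filename.endswith(extensions):
        return True
    # Slow path: the rare remaining patterns still go through fnmatch().
    return any(fnmatch.fnmatch(os.path.basename(filename), p) for p in globs)

assert matches_filename("foo.py")
assert matches_filename("Makefile")
assert not matches_filename("foo.unknown")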

@jeanas
Contributor

jeanas commented Jan 31, 2023

Indeed, I checked the cache size in the CPython code, but it was actually changed recently (python/cpython#86965), so this will differ between Python 3.11 and old versions. My bad.

That said, could you please share the benchmarks that led you to conclude that this approach is faster than simply reverting the offending commit? I'm not keen on introducing complexity without good justification. In a quick and simplistic test I did on Python 3.10, this was actually a bit slower than the revert. Also, while current Pygments takes ~50ms on Python 3.10 to do all the matches, both this PR and the revert take ~1ms. I'm not sure there are many use cases where better performance than this for filename matching is important.
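
For what it's worth, the eviction behavior is observable directly: fnmatch compiles patterns through the private, version-dependent fnmatch._compile_pattern helper, which is wrapped in functools.lru_cache, so its cache_info() reports the maxsize of the running interpreter. A small sketch relying on that private detail:

import fnmatch

# _compile_pattern is a private CPython implementation detail; its lru_cache
# maxsize is 256 on Python 3.10 and older and was raised substantially in 3.11.
for pattern in ("*.py", "*.rs", "*.go"):
    fnmatch.fnmatch("example" + pattern[1:], pattern)
print(fnmatch._compile_pattern.cache_info())  # CacheInfo(hits=..., misses=..., maxsize=..., currsize=...)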

This introduced a performance regression. While it is true that
fnmatch already uses functools.lru_cache, that cache is limited to 256
on Python 3.10 and older and we have over 1000 matching patterns, which
means the cache is evicted entirely on every iteration.

This reverts commit 951c894.
@dirkmueller
Contributor Author

I haven't done a full microbenchmark on this, only some simple timeit runs:

import timeit

print(timeit.timeit('fnmatch.fnmatch("foo.py", "*.py")', setup='import fnmatch'))
print(timeit.timeit('"foo.py".endswith(".py")'))
print(timeit.timeit('"py" in ("lala", "loop", "py")'))

which prints here:

0.5052752160008822
0.08419276200220338
0.028339595002762508

So an "is this string in a tuple of strings" check is roughly a factor of 20 faster than fnmatch(), even with a cached regexp, i.e. even outside the pathological case where the regexp cache overflows as in my original report. Also, many of the patterns need no regular expression at all: more than 90% are wildcards that match only on the extension, and another small percentage matches the full filename literally (no wildcard at all). A meaningful regexp is the exception.

That's what led me to write the full patch, which I applied in my (much larger) program; the profiled bottleneck went away completely. There's certainly some cleanup that could be done on my patch to make it less convoluted. An API break would make this simpler (for example, if plugins specified extension matches separately).

The program that originally led me to profile this essentially runs lexers.guess_lexer_for_filename() several tens of thousands of times.
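
As a rough sketch of that kind of workload (the corpus and loop are made up for illustration; only guess_lexer_for_filename() and ClassNotFound are the real pygments APIs):

import timeit

from pygments.lexers import guess_lexer_for_filename
from pygments.util import ClassNotFound

# Stand-in corpus: classify a large file listing by filename alone.
filenames = ["setup.py", "main.c", "notes.unknown"] * 10_000

def classify_all():
    for fn in filenames:
        try:
            guess_lexer_for_filename(fn, "")  # empty content, match by name only
        except ClassNotFound:
            pass  # no lexer matched this filename

print(timeit.timeit(classify_all, number=1))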

@dirkmueller
Contributor Author

I have pushed a plain revert (with some import cleanups to pass the tests) instead. Let me know how you want to proceed.

@jeanas
Contributor

jeanas commented Jan 31, 2023

For meaningful benchmarking, you have to compare apples with apples. For example, your test does not take into account the overhead of splitting the file name.

I used this:

from pygments.lexers import find_lexer_class_for_filename
from timeit import timeit

# Choosing a file name where no lexer matches, to make sure all fnmatch patterns
# are run on it
fn = "file.unknown"
# confirming
assert find_lexer_class_for_filename(fn) is None

print(timeit("find_lexer_class_for_filename(fn)", number=1000, globals=globals()))

Results:

$ # master
$ python3.11 benchmark.py 
12.148645579000004
$ python3.10 benchmark.py 
48.05816511699959
$ 
$ # simple revert
$ python3.11 benchmark.py 
12.021800829000313
$ python3.10 benchmark.py 
1.452163715000097
$ 
$ # your original approach
$ python3.11 benchmark.py 
12.00767945699954
$ python3.10 benchmark.py 
1.446750384000552

As you can see, the difference is barely noticeable. So I prefer the simple revert.

Python 3.11 is still slower than before; that would be interesting to investigate.

@jeanas jeanas merged commit c467d61 into pygments:master Jan 31, 2023
@dirkmueller
Contributor Author

@jean-abou-samra I don't see such a slowdown with Python 3.11. With the simple revert patch applied, your benchmark.py gives:

python3.11:
0.7443678290001117

python3.10:
0.7769828950004012

which is in line with the general speedup expected of Python 3.11. So this looks good to me. Thank you for the quick merge!

@Anteru added this to the 2.15.0 milestone Mar 25, 2023