Match lexers and formatters by extension separately #2328
Conversation
Indeed, I checked the cache size in the CPython code, but it was actually changed recently (python/cpython#86965), so it differs between Python 3.11 and older versions. My bad. That said, can you please provide the benchmarks that led you to conclude this approach is faster than simply reverting the offending commit? I'm not keen on introducing complexity without good justification. In a quick and simplistic test I did on Python 3.10, this was actually a bit slower than the revert. Also, while current Pygments takes ~50 ms on Python 3.10 to do all the matches, both this PR and the revert take ~1 ms. I'm not sure there are many use cases where better filename-matching performance than that is important.
Force-pushed from 8872d85 to ea91f3e
This introduced a performance regression. While it is true that fnmatch already uses functools.lru_cache, that cache is limited to 256 entries on Python 3.10 and older, and we have over 1000 matching patterns, which means the cache is evicted entirely on every iteration. This reverts commit 951c894.
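The eviction behaviour described here can be illustrated with a small, hypothetical stand-in for fnmatch's internal pattern compiler (the cache size and the fake `compile_pattern` function are assumptions for the demo, not actual Pygments or CPython code):

```python
from functools import lru_cache

# Stand-in for fnmatch's cached pattern compilation; the real cache
# wraps regex translation, this one just records lookups.
@lru_cache(maxsize=256)
def compile_pattern(pat):
    return pat.upper()

# More distinct keys than the cache can hold.
patterns = ["*.%03d" % i for i in range(1000)]

# Three full matching passes over all patterns, as repeated
# filename-matching scans would do.
for _ in range(3):
    for p in patterns:
        compile_pattern(p)

# A sequential scan over 1000 keys through a 256-entry LRU cache never
# hits: each entry is evicted before the scan wraps back around to it.
info = compile_pattern.cache_info()
print(info.hits, info.misses)  # → 0 3000
```

This is the pathological case: the cache does work, but only for workloads with fewer distinct patterns than its maxsize.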
Force-pushed from ea91f3e to 0a9a56b
I haven't done a full microbenchmark on this, only simple timeit runs:

```python
import timeit

print(timeit.timeit('fnmatch.fnmatch("foo.py", "*.py")', setup='import fnmatch'))
print(timeit.timeit('"foo.py".endswith(".py")'))
print(timeit.timeit('"py" in ("lala", "loop", "py")'))
```

which prints here:
So an "if str in tuple of strings" test is roughly a factor of 20 faster than fnmatch(), even with a cached regexp, i.e. even outside the pathological case where the regexp cache overflows as in my original report. Also, many of the patterns contain no meaningful wildcard at all: more than 90% have wildcards that match only on the extension, and another small percentage matches on the full filename (so no regular expression is needed at all). A meaningful regexp is the exception. That's what led me to write the full patch, which I applied in my (much larger) program, and my profiled bottleneck went away completely. There's certainly some cleanup that can be done on my patch to make it less convoluted; an API break would make this simpler (for example, if plugins specified extension matches separately). My original program, which led me to profile this, essentially runs lexers.guess_lexer_for_filename() several tens of thousands of times.
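The split described above can be sketched roughly like this (`split_patterns` and `matches` are hypothetical names for illustration; the real implementation lives in `gen_mapfiles.py` and the generated lexer tables):

```python
import fnmatch

def split_patterns(patterns):
    """Separate plain "*.ext" patterns from genuine glob patterns."""
    extensions, globs = [], []
    for pat in patterns:
        # "*.py" has no wildcard beyond the leading "*", so it can be
        # matched with str.endswith instead of a compiled regexp.
        if pat.startswith("*.") and not any(c in pat[2:] for c in "*?["):
            extensions.append(pat[1:])  # keep the ".py" suffix
        else:
            globs.append(pat)
    return tuple(extensions), tuple(globs)

def matches(filename, extensions, globs):
    # Fast path first; fnmatch is only needed for the few real globs.
    return filename.endswith(extensions) or any(
        fnmatch.fnmatch(filename, g) for g in globs)
```

For a pattern list like `["*.py", "*.pyw", "Makefile", "*.[ch]"]`, only `"Makefile"` and `"*.[ch]"` stay on the slow fnmatch path; the extension suffixes are checked in a single `str.endswith` call.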
I have pushed a plain revert (with some import cleanups to pass tests) instead. Let me know how you want to proceed.
For meaningful benchmarking, you have to compare apples with apples. For example, your test does not take into account the overhead of splitting the file name. I used this:

```python
from pygments.lexers import find_lexer_class_for_filename
from timeit import timeit

# Choosing a file name where no lexer matches, to make sure all fnmatch
# patterns are run on it
fn = "file.unknown"
# confirming
assert find_lexer_class_for_filename(fn) is None
print(timeit("find_lexer_class_for_filename(fn)", number=1000, globals=globals()))
```

Results:

```
$ # master
$ python3.11 benchmark.py
12.148645579000004
$ python3.10 benchmark.py
48.05816511699959
$
$ # simple revert
$ python3.11 benchmark.py
12.021800829000313
$ python3.10 benchmark.py
1.452163715000097
$
$ # your original approach
$ python3.11 benchmark.py
12.00767945699954
$ python3.10 benchmark.py
1.446750384000552
```

As you can see, the difference is barely noticeable, so I prefer the simple revert. Python 3.11 is still slower than before; that would be interesting to investigate.
@jean-abou-samra I don't see such a slowdown with Python 3.11. With the simple revert patch applied, your benchmark.py gives results in line with what is expected from the general speedup of Python 3.11. So it looks good to me. Thank you for the quick merge!
d0487e3 "Remove filename pattern caches" (#2153) introduced a huge performance regression. While it is true that fnmatch already uses functools.lru_cache, that cache is limited to 127 entries and we have over 1000 matching patterns, which means the cache is evicted entirely on every iteration.
We can be more clever, without reverting that patch, by avoiding fnmatch calls altogether, which provides an even better speedup. The bulk (> 99%) of all filename matches are on the filename extension, so in gen_mapfiles.py we split those into a separate tuple that we match without calling fnmatch().
This restores the previous performance without the overhead of an extra regexp cache.
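The extension fast path relies on the fact that str.endswith accepts a tuple of suffixes, so all plain-extension patterns can be checked in a single call with no regexp work at all (the tuple below is an illustrative subset, not the generated mapping):

```python
# Illustrative subset of extension suffixes; in Pygments the actual
# tuple would be generated into the lexer mapping by gen_mapfiles.py.
extensions = (".py", ".pyw", ".sage", ".sc")

print("script.py".endswith(extensions))   # → True
print("notes.txt".endswith(extensions))   # → False
```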