Add hash comparison for pyc cache files #11418
base: main
Conversation
2387efe to 452c2b9
Thanks @cdce8p, I like this change. I left some comments.
src/_pytest/assertion/rewrite.py
Outdated
@@ -302,21 +302,25 @@ def _write_pyc_fp(
    # as of now, bytecode header expects 32-bit numbers for size and mtime (#4903)
    mtime = int(source_stat.st_mtime) & 0xFFFFFFFF
    size = source_stat.st_size & 0xFFFFFFFF
    # 64-bit source file hash
My understanding from PEP 552 is that if we want to use checked-hash pyc files, we need to set the lowest bit of the flags to indicate it's hash-based, set the second-lowest bit to indicate that it's checked, and get rid of the mtime and size bytes.
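For reference, the PEP 552 layout replaces the mtime and size words with a flags word and an 8-byte source hash. A minimal sketch of building such a header (the helper name is hypothetical, not from the PR):

```python
import struct
from importlib.util import MAGIC_NUMBER, source_hash

def hash_pyc_header(source: bytes, checked: bool = True) -> bytes:
    """Build the 16-byte PEP 552 header: magic, flags word, 8-byte source hash."""
    # Flags word: bit 0 set = hash-based pyc, bit 1 set = "checked" variant.
    flags = 0b01 | (0b10 if checked else 0b00)
    return MAGIC_NUMBER + struct.pack("<I", flags) + source_hash(source)
```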
Hmm, I specifically want to use all three (`size`, `mtime`, and `hash`) to get the best performance. (There is really no need to calculate the hash if `size` and `mtime` match. Similarly, if the size is different we can abort early.) That's how other applications do it as well (e.g. black, mypy). It will thus deviate from CPython, but I guess the comment here is relevant
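The three-step check described here (size first, then mtime, hashing only as a last resort) could look roughly like this. This is a sketch, not the PR's actual code, and the function and parameter names are hypothetical:

```python
from importlib.util import source_hash

def pyc_is_fresh(source_stat, cached_size, cached_mtime, cached_hash, read_source):
    """Size first, then mtime; only hash when the two cheap checks disagree."""
    if source_stat.st_size & 0xFFFFFFFF != cached_size:
        # Different size: definitely stale, no need to read or hash the source.
        return False
    if int(source_stat.st_mtime) & 0xFFFFFFFF == cached_mtime:
        # Size and mtime both match: trust the cache without hashing.
        return True
    # Size matches but mtime differs (e.g. a restored CI cache):
    # fall back to comparing source hashes.
    return source_hash(read_source()) == cached_hash
```

Only the last branch ever reads the source file, which is the whole point of keeping size and mtime around.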
pytest/src/_pytest/assertion/rewrite.py, lines 303 to 305 in 71f265f:

    # Technically, we don't have to have the same pyc format as
    # (C)Python, since these "pycs" should never be seen by builtin
    # import. However, there's little reason to deviate.
You are correct, but I wonder if we can reduce complexity by just always doing the hashing.
Were you able to get a measurable benefit in home-assistant repo by using this branch?
Are you using the `--check-hash-based-pycs` Python flag in home-assistant? If so, with which value?
> Were you able to get a measurable benefit in home-assistant repo by using this branch?

Not yet. It's on my todo list. Although it's sadly not as easy as adding caching to the CI job. We use `pytest-xdist` with 10 different jobs, so I might have to combine the different cache files later based on mtimes. That's something I'll explore soon.
> You are correct, but I wonder if we can reduce complexity by just always doing the hashing.

That's certainly an option, however I don't think it's really that complicated. The combined logic is quite simple. Even if calculating the hash is fast, we still need to read the source file first to calculate it. That's not necessary if we are sure that it isn't a match (different `size`), or that it is a match (the `mtime`s are the same).
> Are you using the `--check-hash-based-pycs` Python flag in home-assistant? If so, with which value?

I'll probably need to use `always` for Home Assistant, just because the mtimes are reset when the cache is restored, but that's only for the CI environment. I won't use that locally! (The biggest speedup from not always using the hash will also be locally.)
I like that the pytest pyc files are actually pyc files. It allows inspecting them using standard tools and such. We can definitely diverge, but then we should change the magic number and file extension to make that clear, and I'd rather not if possible.
I wish Python just added the hash on top of the mtime/size instead of replacing them, but what can you do. So I think it will be best to take on the little bit of extra complexity and implement the check-hash-based-pycs logic. The value is available in `_imp.check_hash_based_pycs` (since `_imp` is private, we should fall back to `default` if the import fails). The CPython logic is in here.

Assuming that caching CIs that want the hashing behavior would want to set `check-hash-based-pycs` anyway, while local users don't, that would do the right thing.
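A sketch of that fallback, assuming `_imp.check_hash_based_pycs` keeps its current shape (a string, one of `"default"`, `"always"`, `"never"`, mirroring the CLI flag):

```python
# Read the interpreter's --check-hash-based-pycs setting; since _imp is a
# private module, fall back to "default" if the import fails.
try:
    from _imp import check_hash_based_pycs
except ImportError:
    check_hash_based_pycs = "default"

print(check_hash_based_pycs)
```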
Sounds reasonable, especially the point that following the pyc file layout allows inspecting them with standard tools. Let me see what I can do.
I've been looking at the source code and reading PEP 552. Unless I'm missing something, `--check-hash-based-pycs` doesn't actually affect how pyc files are written. Even with it set to `always`, it won't overwrite timestamp-based pyc files. The only way to generate hash-based ones seems to be using `python -m py_compile` or `python -m compileall` with either `--invalidation-mode checked-hash` or `SOURCE_DATE_EPOCH` set.
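For illustration, the programmatic equivalent via the stdlib `py_compile` API (the throwaway module below is made up for the example):

```python
import pathlib
import py_compile
import struct
import tempfile

# Compile a throwaway module as a PEP 552 checked-hash pyc; this mirrors
# what the compileall CLI does with --invalidation-mode checked-hash.
src = pathlib.Path(tempfile.mkdtemp()) / "demo.py"
src.write_text("x = 1\n")
pyc = py_compile.compile(
    str(src),
    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
)

# Inspect the flags word (bytes 4-8 of the header).
flags, = struct.unpack("<I", pathlib.Path(pyc).read_bytes()[4:8])
print(flags & 0b01, flags & 0b10)  # both non-zero: hash-based and checked
```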
This certainly complicates things a bit for caching the pycs files in CI jobs as I'll likely need to add a precompile job.
However, the relevant question now is: how do we want to handle it in pytest? Would it make sense to add an `--invalidation-mode` flag with `timestamp | checked-hash` values? `--check-hash-based-pycs` probably won't be that useful, as it seems `default` will always be enough.
Hmm, you are correct: I assumed that `--check-hash-based-pycs` also affects the writing, but it doesn't!

The `--invalidation-mode` flag is starting to get a little complicated... Before we proceed, can you please try your branch with `--invalidation-mode checked-hash` and check that it makes a measurable impact?
If we decide that it's actually worth it, I hope we can just use the checked-hash mode always. I wrote a script to check the hashing overhead:
```python
import time, pathlib
from importlib.util import source_hash

hashing_time, count, size = 0, 0, 0
for entry in pathlib.Path('/usr/lib/python3.11').glob("**/*.py"):
    source = entry.read_bytes()
    count += 1
    size += len(source)
    start = time.perf_counter()
    source_hash(source)
    hashing_time += time.perf_counter() - start
print(f'{count} files, {size/1024/1024} MiB, {hashing_time} seconds')
```
On my lousy 2015 laptop, it gives
17699 files, 273.7363386154175 MiB, 0.10418461405697599 seconds
This seems fast enough to me to not worry about. (BTW, the current PR code does the hashing always even in timestamp mode, so the overhead is present anyway, but that is fixable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> The `--invalidation-mode` flag is starting to get a little complicated... Before we proceed, can you please try your branch with `--invalidation-mode checked-hash` and check that it makes a measurable impact?

Took some time, but I got it working now. The impact is smaller than I thought but still measurable. Including the time it takes to pre-populate the cache (with `pytest --invalidation-mode checked-hash --collect-only`), it's about a 6% improvement.
Edit: 10% for the second run.
The 6% number is for caching only the pytest pyc files. Haven't gotten to it yet, but I'd also like to check whether caching the regular Python pyc files helps as well. I'd need to pre-compile them manually though, as Python would always write timestamp-based pyc files.
Another idea, although specific to the Home Assistant setup, would be to explore how that can be better used with `pytest-test-groups`. If we already run the `--collect-only` job, maybe collection can be skipped entirely for the actual pytest runs?
10% is a good improvement. Do you happen to have a measurement of the initial (uncached) hash-computing run vs. a timestamp one? This is to measure the hashing overhead, if any.
As I said in my previous comment, I think we can just use hash mode always and it will be fine.
> Another idea, although specific to the Home Assistant setup, would be to explore how that can be better used with pytest-test-groups. If we already run the --collect-only job, maybe collection can be skipped entirely for the actual pytest runs?
Caching collection is a thorny issue, I'd avoid attempting it unless you are willing to put in a lot of effort.
Hi folks, late to the discussion, but my gut feeling is that we should avoid introducing a new option and always use the optimal method behind the scenes. I like the approach of using size + mtime as a shortcut, falling back to hashing to check whether the source still matches. I understand this is not standard according to PEP 552 (I haven't read it, just gleaning it from the discussion here), but will deviating from it somehow make the .pyc files we generate unreadable by external tools? What are the consequences of that?

We should also benchmark this carefully... generating hashes is fast, but I suspect doing it for every file in a large repository will be noticeably more costly than what we do currently.
The pyc header doesn't allow both hash and timestamp, so a pyc file needs to be one or the other.
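The flags word makes the two layouts easy to tell apart from outside. A small hypothetical inspector, assuming the PEP 552 header layout (magic number, then a 4-byte flags word):

```python
import struct

def pyc_kind(path):
    """Classify a pyc file by its PEP 552 flags word (bytes 4-8 of the header)."""
    with open(path, "rb") as f:
        f.read(4)  # skip the magic number
        flags, = struct.unpack("<I", f.read(4))
    if flags & 0b01:
        return "checked-hash" if flags & 0b10 else "unchecked-hash"
    return "timestamp"
```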
I ran a benchmark here, my conclusion is that for the initial run, hashing is unlikely to be noticeable even for large repositories, compared to actually reading and parsing the files. The overhead probably comes in on a cache hit -- with a timestamp you only need to do a `stat`, while with a hash you also need to read the entire source file.
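A rough way to measure that cache-hit difference yourself; the file contents and iteration counts below are arbitrary:

```python
import os
import tempfile
import time
from importlib.util import source_hash

# Compare the cache-hit cost of a stat-only check vs. reading + hashing.
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as f:
    f.write(b"x = 1\n" * 1000)

start = time.perf_counter()
for _ in range(1000):
    os.stat(f.name)
stat_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(1000):
    with open(f.name, "rb") as fp:
        source_hash(fp.read())
hash_time = time.perf_counter() - start

print(f"1000 stats: {stat_time:.5f}s, 1000 read+hash: {hash_time:.5f}s")
os.unlink(f.name)
```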
Hi guys, I'd like to continue with this one. Since the time I left it, the testing structure in Home Assistant has changed, which allows for a better performance comparison now. Just a little preview, using this PR with … For Home Assistant we now run a …
Home Assistant currently has around 35k tests in 3.3k different files.
Tbh I don't really have a preference here, as long as there is a way to tell pytest to also use the hash. If you'd like to keep the …
Probably. On any modern machine this shouldn't really be an issue, although I'm not sure about edge cases. The size + mtime comparison works well (just not in CI environments), so I don't really see a need to change that. If anything, I would probably just go with the custom file format and then use the hash as a fallback, as originally suggested. This would at least simplify the logic a bit.

One followup would be an independent way to control whether pyc files should be written. I'd like to use …

One might think that caching the regular bytecode files would be an improvement as well. However, the benefit was only minor (+2%) while it required manually compiling all files with …
Compare pyc cache files by source hash.

Especially in CI environments, it's common for mtimes to be reset when restoring a cache. To work around that, fall back to a source hash comparison if size and mtime didn't both match.
Related: