Add hash comparison for pyc cache files #11418

Open · wants to merge 2 commits into base: main
Conversation

@cdce8p (Contributor) commented Sep 8, 2023

Compare pyc cache files by source hash.

Especially in CI environments, it's common for mtimes to be reset when restoring a cache. To work around that, fall back to a source hash comparison if the size and mtime check fails.
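A minimal sketch of that check order (hypothetical helper, not the PR's actual code; importlib.util.source_hash computes the PEP 552 source hash):

import os
from importlib.util import source_hash

def pyc_is_fresh(source_path, cached_size, cached_mtime, cached_hash):
    """Hypothetical staleness check: cheap stat fields first, hash last."""
    stat = os.stat(source_path)
    if (stat.st_size & 0xFFFFFFFF) != cached_size:
        return False  # different size: definitely stale, no read needed
    if (int(stat.st_mtime) & 0xFFFFFFFF) == cached_mtime:
        return True   # size and mtime match: assume fresh
    # mtime mismatch (e.g. after a CI cache restore): hash the source
    with open(source_path, "rb") as f:
        return source_hash(f.read()) == cached_hash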

Related:

@cdce8p force-pushed the cache branch 4 times, most recently from 2387efe to 452c2b9 on September 9, 2023
@bluetech (Member) left a comment

Thanks @cdce8p, I like this change. I left some comments.

Review threads (outdated, resolved):
changelog/11418.feature.rst
src/_pytest/assertion/rewrite.py
testing/test_assertrewrite.py
@@ -302,21 +302,25 @@ def _write_pyc_fp(
# as of now, bytecode header expects 32-bit numbers for size and mtime (#4903)
mtime = int(source_stat.st_mtime) & 0xFFFFFFFF
size = source_stat.st_size & 0xFFFFFFFF
# 64-bit source file hash
@bluetech (Member)

My understanding from PEP 552 is that if we want to use checked-hash pyc files, we need to set the lowest bit of the flags field to indicate it's hash-based, set the second-lowest bit to indicate that it's checked, and get rid of the mtime and size bytes.
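For reference, the two PEP 552 header layouts can be sketched like this (illustrative only; MAGIC_NUMBER is the real importlib constant, the helper names are made up):

import struct
from importlib.util import MAGIC_NUMBER  # 4 bytes, ends with b"\r\n"

def timestamp_header(mtime, size):
    # flags == 0: timestamp-based pyc, followed by 32-bit mtime and size
    return MAGIC_NUMBER + struct.pack("<III", 0, mtime & 0xFFFFFFFF, size & 0xFFFFFFFF)

def checked_hash_header(src_hash):
    # bit 0 set: hash-based; bit 1 set: checked against the source at
    # import time. The 8 hash bytes replace the mtime/size fields.
    assert len(src_hash) == 8
    return MAGIC_NUMBER + struct.pack("<I", 0b11) + src_hash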

@cdce8p (Contributor, Author) commented Sep 9, 2023

Hmm, I specifically want to use all three (size, mtime, and hash) to get the best performance. (There is really no need to calculate the hash if size and mtime match. Similarly, if the size is different we can abort early.) That's how other applications do it as well (e.g. black, mypy). It will thus deviate from CPython, but I guess the comment here is relevant:

# Technically, we don't have to have the same pyc format as
# (C)Python, since these "pycs" should never be seen by builtin
# import. However, there's little reason to deviate.

@bluetech (Member)

You are correct, but I wonder if we can reduce complexity by just always doing the hashing.

Were you able to get a measurable benefit in the home-assistant repo by using this branch?

Are you using the --check-hash-based-pycs Python flag in home-assistant? If so, with which value?

@cdce8p (Contributor, Author)

> Were you able to get a measurable benefit in the home-assistant repo by using this branch?

Not yet. It's on my todo list, although it's sadly not as easy as just adding caching to the CI job. We use pytest-xdist with 10 different jobs, so I might have to combine the different cache files later based on mtimes. That's something I'll explore soon.

> You are correct, but I wonder if we can reduce complexity by just always doing the hashing.

That's certainly an option, however I don't think it's really that complicated; the combined logic is quite simple. Even if calculating the hash is fast, we still need to read the source file first to calculate it. That's not necessary if we're already sure it isn't a match (different size) or that it is (same mtime).

> Are you using the --check-hash-based-pycs Python flag in home-assistant? If so, with which value?

I'll probably need to use always for Home Assistant, just because the mtimes are reset when the cache is restored, but that's only for the CI environment. I won't use that locally! (The biggest speedup from not always using the hash will also be local.)

@bluetech (Member)

I like that the pytest pyc files are actual pyc files. It allows inspecting them with standard tools and such. We can definitely diverge, but then we should change the magic and file extension to make that clear; I'd rather not if possible.

I wish Python had just added the hash on top of the mtime/size instead of replacing them, but what can you do. So I think it's best to take on the little bit of extra complexity and implement the check-hash-based-pycs logic. The value is available in _imp.check_hash_based_pycs (since _imp is private, we should fall back to the default if the import fails). The CPython logic is in here.
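The fallback could be as simple as (a sketch):

try:
    # Private CPython attribute mirroring the --check-hash-based-pycs value
    # ("default", "always", or "never").
    from _imp import check_hash_based_pycs
except ImportError:
    # Non-CPython or a future removal: behave as if "default" were passed.
    check_hash_based_pycs = "default"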

Assuming that caching CIs that want the hashing behavior would want to set check-hash-based-pycs anyway, while local users don't, that would do the right thing.

@cdce8p (Contributor, Author)

Sounds reasonable, especially the point that following the pyc file layout allows inspecting them with standard tools. Let me see what I can do.

@cdce8p (Contributor, Author) commented Sep 10, 2023

I've been looking at the source code and reading PEP 552. Unless I'm missing something, --check-hash-based-pycs [1] doesn't actually affect how pyc files are written. Even with it set to always, it won't overwrite timestamp-based pyc files. The only way to generate hash-based ones seems to be py_compile or python -m compileall [2] with either --invalidation-mode checked-hash or SOURCE_DATE_EPOCH set (see the sketch after the footnotes).

This complicates things a bit for caching the pyc files in CI jobs, as I'll likely need to add a precompile step.

However, the relevant question now is: how do we want to handle this in pytest? Would it make sense to add an --invalidation-mode flag with timestamp | checked-hash values? --check-hash-based-pycs probably won't be that useful, as it seems default will always be enough.

Footnotes

  1. https://docs.python.org/3/using/cmdline.html#cmdoption-check-hash-based-pycs

  2. https://docs.python.org/3/library/compileall.html#cmdoption-compileall-invalidation-mode
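The sketch mentioned above: producing checked-hash pycs from Python, assuming the standard py_compile/compileall APIs (the paths are made up):

import compileall
import py_compile

# Compile one file with an embedded source hash instead of mtime/size.
py_compile.compile(
    "tests/test_example.py",
    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
)

# Compile a whole tree, equivalent to
# `python -m compileall --invalidation-mode checked-hash tests/`.
compileall.compile_dir(
    "tests",
    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
    quiet=1,
)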

@bluetech (Member)

Hmm, you are correct: I assumed that --check-hash-based-pycs also affects the writing, but it doesn't!

The --invalidation-mode flag is starting to get a little complicated... Before we proceed, can you please try your branch with --invalidation-mode checked-hash and check that it makes a measurable impact?

If we decide that it's actually worth it, I hope we can just use the checked-hash mode always. I wrote a script to check the hashing overhead:

import pathlib
import time
from importlib.util import source_hash

hashing_time, count, size = 0, 0, 0
for entry in pathlib.Path("/usr/lib/python3.11").glob("**/*.py"):
    source = entry.read_bytes()
    count += 1
    size += len(source)
    # Time only the hash computation, not the file read.
    start = time.perf_counter()
    source_hash(source)
    hashing_time += time.perf_counter() - start
print(f"{count} files, {size/1024/1024} MiB, {hashing_time} seconds")

On my lousy 2015 laptop, it gives

17699 files, 273.7363386154175 MiB, 0.10418461405697599 seconds

This seems fast enough to me not to worry about. (BTW, the current PR code does the hashing always, even in timestamp mode, so the overhead is present anyway, but that is fixable.)

@cdce8p (Contributor, Author) commented Sep 11, 2023

> The --invalidation-mode flag is starting to get a little complicated... Before we proceed, can you please try your branch with --invalidation-mode checked-hash and check that it makes a measurable impact?

Took some time, but I got it working now. The impact is smaller than I thought, but still measurable. Including the time it takes to pre-populate the cache with pytest --invalidation-mode checked-hash --collect-only, it's about a 6% improvement.

Edit: 10% for the second run.

The 6% number is for caching only the pytest pyc files. I haven't gotten to it yet, but I'd also like to check whether caching the regular Python pyc files helps as well. They need to be pre-compiled manually though, as Python would otherwise always write timestamp-based pyc files.

Another idea, although specific to the Home Assistant setup, would be to explore how this could be combined with pytest-test-groups. If we already run the --collect-only job, maybe collection can be skipped entirely for the actual pytest runs?

@bluetech (Member)

10% is a good improvement. Do you happen to have a measurement of the initial (uncached) hash-computing run vs. a timestamp one? This is to measure the hashing overhead, if any.

As I said in my previous comment, I think we can just use hash mode always and it will be fine.

> Another idea, although specific to the Home Assistant setup, would be to explore how this could be combined with pytest-test-groups. If we already run the --collect-only job, maybe collection can be skipped entirely for the actual pytest runs?

Caching collection is a thorny issue; I'd avoid attempting it unless you are willing to put in a lot of effort.

@nicoddemus (Member) commented Sep 13, 2023

Hi folks,

Late to the discussion, but my gut feeling is that we should avoid introducing a new option and always use the optimal method behind the scenes. I like the approach of using size+mtime as a shortcut and falling back to hashing if they don't match.

I understand this is not standard according to PEP 552 (I haven't read it, just gleaning it from the discussion here), but will deviating from it somehow make the .pyc files we generate unreadable by external tools? What would be the consequences of that?

We should also benchmark this carefully... generating hashes is fast, but I suspect doing it for every file in a large repository will be noticeably more costly than what we do currently.

@bluetech (Member)

> I understand this is not standard according to PEP 552 (I haven't read it, just gleaning it from the discussion here), but will deviating from it somehow make the .pyc files we generate unreadable by external tools? What would be the consequences of that?

The pyc header doesn't allow both hash and timestamp, so a pyc file needs to be one or the other.

> We should also benchmark this carefully... generating hashes is fast, but I suspect doing it for every file in a large repository will be noticeably more costly than what we do currently.

I ran a benchmark here; my conclusion is that for the initial run, hashing is unlikely to be noticeable even for large repositories, compared to actually reading and parsing the files. The overhead probably comes in on a cache hit: with a timestamp you only need to stat the py file, with hashing you need to read it again. With huge test files on a slow disk, it might be noticeable.
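To make that cache-hit cost concrete, validating a checked-hash pyc requires re-reading the whole source file; roughly (a sketch assuming the PEP 552 layout; the function name and paths are hypothetical):

import struct
from importlib.util import MAGIC_NUMBER, source_hash

def hash_pyc_is_valid(pyc_path, source_path):
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    if header[:4] != MAGIC_NUMBER:
        return False
    flags = struct.unpack("<I", header[4:8])[0]
    if not flags & 0b1:
        return False  # timestamp-based pyc, not hash-based
    # The expensive part on every cache hit: a full read of the source.
    with open(source_path, "rb") as f:
        return header[8:16] == source_hash(f.read())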

@cdce8p (Contributor, Author) commented Apr 27, 2024

Hi guys, I'd like to continue with this one. Since I left it, the testing structure in Home Assistant has changed, which allows for a better performance comparison now. Just a little preview: using this PR with --invalidation-mode=checked-hash yields close to a 50% improvement in our test collection times with GitHub Actions. Of course that's highly project-dependent, but the improvement is noticeable.

For Home Assistant we now run a --collect-only job before the actual tests. With that we create custom test buckets for the test matrix jobs. Here are some numbers:

|            | Before    | After     | Change |
| ---------- | --------- | --------- | ------ |
| Collection | 05:38 min | 03:48 min | +48%   |
| Tests      | 45:05 min | 41:27 min | +8.7%  |
| Total      | 50:43 min | 45:15 min | +12%   |

Home Assistant currently has around 35k tests in 3.3k different files.

> I like the approach of using size+mtime as a shortcut and falling back to hashing if they don't match.

Tbh I don't really have a preference here, as long as there is a way to tell pytest to also use the hash. If you'd like to keep the pyc file format, then the current state with --invalidation-mode is probably the way to go. Maybe we could also add a fallback to SOURCE_DATE_EPOCH like CPython has. Otherwise we would essentially create a new format; not sure that's desirable.
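That fallback could mirror what py_compile does when no explicit mode is given (a sketch; the function name and CLI plumbing are hypothetical):

import os
import py_compile

def pick_invalidation_mode(cli_value=None):
    # An explicit --invalidation-mode flag wins; otherwise SOURCE_DATE_EPOCH
    # (set for reproducible builds) implies checked-hash, like py_compile.
    if cli_value == "checked-hash":
        return py_compile.PycInvalidationMode.CHECKED_HASH
    if cli_value == "timestamp":
        return py_compile.PycInvalidationMode.TIMESTAMP
    if os.environ.get("SOURCE_DATE_EPOCH"):
        return py_compile.PycInvalidationMode.CHECKED_HASH
    return py_compile.PycInvalidationMode.TIMESTAMP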

> I think we can just use hash mode always and it will be fine.

Probably. On any modern machine this shouldn't really be an issue, although I'm not sure about edge cases. The size+mtime comparison works well (just not in CI environments), so I don't really see a need to change that. If anything, I would probably just go with the custom file format and then use the hash as a fallback, as originally suggested. That would at least simplify the logic a bit.

--

One follow-up would be an independent way to control whether the pyc files are written. I'd like to set PYTHONDONTWRITEBYTECODE=1 so that only the pytest pyc files end up in the cache.

One might think that caching the regular bytecode files would be an improvement as well. However, the benefit was only minor (+2%), and it required manually compiling all files with py_compile to be able to use the hash comparison for them too. It's likely that the C loop that creates the bytecode files is simply much faster than the pytest code for assertion rewriting.
