Add hash comparison for pyc cache files #11418

Open · wants to merge 2 commits into base: main
Conversation

@cdce8p (Contributor) commented Sep 8, 2023

Compare pyc cache files by source hash.

Especially in CI environments, it's common for mtimes to be reset when restoring a cache. To work around that, fall back to a source hash comparison if the size and mtime check fails.
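A minimal sketch of that check order (hypothetical helper, not the PR's actual code; importlib.util.source_hash computes the PEP 552 source hash):

import os
from importlib.util import source_hash

def pyc_is_fresh(source_path, cached_size, cached_mtime, cached_hash):
    """Hypothetical staleness check: cheap stat fields first, hash last."""
    stat = os.stat(source_path)
    if (stat.st_size & 0xFFFFFFFF) != cached_size:
        return False  # different size: definitely stale, no read needed
    if (int(stat.st_mtime) & 0xFFFFFFFF) == cached_mtime:
        return True   # size and mtime match: assume fresh
    # mtime mismatch (e.g. after a CI cache restore): hash the source
    with open(source_path, "rb") as f:
        return source_hash(f.read()) == cached_hash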

Related:

@cdce8p force-pushed the cache branch 4 times, most recently from 2387efe to 452c2b9 on September 9, 2023
@bluetech (Member) left a comment

Thanks @cdce8p, I like this change. I left some comments.

Review threads (outdated, resolved):
changelog/11418.feature.rst
src/_pytest/assertion/rewrite.py
testing/test_assertrewrite.py
@@ -302,21 +302,25 @@ def _write_pyc_fp(
# as of now, bytecode header expects 32-bit numbers for size and mtime (#4903)
mtime = int(source_stat.st_mtime) & 0xFFFFFFFF
size = source_stat.st_size & 0xFFFFFFFF
# 64-bit source file hash
@bluetech (Member)

My understanding from PEP 552 is that if we want to use checked-hash pyc files, we need to set the lowest bit of the flags field to indicate it's hash-based, set the second-lowest bit to indicate that it's checked, and get rid of the mtime and size bytes.
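For reference, the two PEP 552 header layouts can be sketched like this (illustrative only; MAGIC_NUMBER is the real importlib constant, the helper names are made up):

import struct
from importlib.util import MAGIC_NUMBER  # 4 bytes, ends with b"\r\n"

def timestamp_header(mtime, size):
    # flags == 0: timestamp-based pyc, followed by 32-bit mtime and size
    return MAGIC_NUMBER + struct.pack("<III", 0, mtime & 0xFFFFFFFF, size & 0xFFFFFFFF)

def checked_hash_header(src_hash):
    # bit 0 set: hash-based; bit 1 set: checked against the source at
    # import time. The 8 hash bytes replace the mtime/size fields.
    assert len(src_hash) == 8
    return MAGIC_NUMBER + struct.pack("<I", 0b11) + src_hash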

@cdce8p (Contributor, Author) commented Sep 9, 2023

Hmm, I specifically want to use all three (size, mtime, and hash) to get the best performance. (There is really no need to calculate the hash if size and mtime match. Similarly, if the size is different we can abort early.) That's how other applications do it as well (e.g. black, mypy). It will thus deviate from CPython, but I guess the comment here is relevant:

# Technically, we don't have to have the same pyc format as
# (C)Python, since these "pycs" should never be seen by builtin
# import. However, there's little reason to deviate.

@bluetech (Member)

You are correct, but I wonder if we can reduce complexity by just always doing the hashing.

Were you able to get a measurable benefit in the home-assistant repo by using this branch?

Are you using the --check-hash-based-pycs Python flag in home-assistant? If so, with which value?

@cdce8p (Contributor, Author)

> Were you able to get a measurable benefit in the home-assistant repo by using this branch?

Not yet. It's on my todo list, although it's sadly not as easy as just adding caching to the CI job. We use pytest-xdist with 10 different jobs, so I might have to combine the different cache files later based on mtimes. That's something I'll explore soon.

> You are correct, but I wonder if we can reduce complexity by just always doing the hashing.

That's certainly an option, however I don't think it's really that complicated; the combined logic is quite simple. Even if calculating the hash is fast, we still need to read the source file first to calculate it. That's not necessary if we're already sure it isn't a match (different size) or that it is (same mtime).

> Are you using the --check-hash-based-pycs Python flag in home-assistant? If so, with which value?

I'll probably need to use always for Home Assistant, just because the mtimes are reset when the cache is restored, but that's only for the CI environment. I won't use that locally! (The biggest speedup from not always using the hash will also be local.)

@bluetech (Member)

I like that the pytest pyc files are actual pyc files. It allows inspecting them with standard tools and such. We can definitely diverge, but then we should change the magic and file extension to make that clear; I'd rather not if possible.

I wish Python had just added the hash on top of the mtime/size instead of replacing them, but what can you do. So I think it's best to take on the little bit of extra complexity and implement the check-hash-based-pycs logic. The value is available in _imp.check_hash_based_pycs (since _imp is private, we should fall back to the default if the import fails). The CPython logic is in here.
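The fallback could be as simple as (a sketch):

try:
    # Private CPython attribute mirroring the --check-hash-based-pycs value
    # ("default", "always", or "never").
    from _imp import check_hash_based_pycs
except ImportError:
    # Non-CPython or a future removal: behave as if "default" were passed.
    check_hash_based_pycs = "default"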

Assuming that caching CIs that want the hashing behavior would want to set check-hash-based-pycs anyway, while local users don't, that would do the right thing.

@cdce8p (Contributor, Author)

Sounds reasonable, especially the point that following the pyc file layout allows inspecting them with standard tools. Let me see what I can do.

@cdce8p (Contributor, Author) commented Sep 10, 2023

I've been looking at the source code and reading PEP 552. Unless I'm missing something, --check-hash-based-pycs [1] doesn't actually affect how pyc files are written. Even with it set to always, it won't overwrite timestamp-based pyc files. The only way to generate hash-based ones seems to be py_compile or python -m compileall [2] with either --invalidation-mode checked-hash or SOURCE_DATE_EPOCH set (see the sketch after the footnotes).

This complicates things a bit for caching the pyc files in CI jobs, as I'll likely need to add a precompile step.

However, the relevant question now is: how do we want to handle this in pytest? Would it make sense to add an --invalidation-mode flag with timestamp | checked-hash values? --check-hash-based-pycs probably won't be that useful, as it seems default will always be enough.

Footnotes

  1. https://docs.python.org/3/using/cmdline.html#cmdoption-check-hash-based-pycs

  2. https://docs.python.org/3/library/compileall.html#cmdoption-compileall-invalidation-mode
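The sketch mentioned above: producing checked-hash pycs from Python, assuming the standard py_compile/compileall APIs (the paths are made up):

import compileall
import py_compile

# Compile one file with an embedded source hash instead of mtime/size.
py_compile.compile(
    "tests/test_example.py",
    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
)

# Compile a whole tree, equivalent to
# `python -m compileall --invalidation-mode checked-hash tests/`.
compileall.compile_dir(
    "tests",
    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
    quiet=1,
)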

@bluetech (Member)

Hmm, you are correct: I assumed that --check-hash-based-pycs also affects the writing, but it doesn't!

The --invalidation-mode flag is starting to get a little complicated... Before we proceed, can you please try your branch with --invalidation-mode checked-hash and check that it makes a measurable impact?

If we decide that it's actually worth it, I hope we can just use the checked-hash mode always. I wrote a script to check the hashing overhead:

import pathlib
import time
from importlib.util import source_hash

hashing_time, count, size = 0, 0, 0
for entry in pathlib.Path("/usr/lib/python3.11").glob("**/*.py"):
    source = entry.read_bytes()
    count += 1
    size += len(source)
    # Time only the hash computation, not the file read.
    start = time.perf_counter()
    source_hash(source)
    hashing_time += time.perf_counter() - start
print(f"{count} files, {size/1024/1024} MiB, {hashing_time} seconds")

On my lousy 2015 laptop, it gives

17699 files, 273.7363386154175 MiB, 0.10418461405697599 seconds

This seems fast enough to me not to worry about. (BTW, the current PR code does the hashing always, even in timestamp mode, so the overhead is present anyway, but that is fixable.)

@cdce8p (Contributor, Author) commented Sep 11, 2023

> The --invalidation-mode flag is starting to get a little complicated... Before we proceed, can you please try your branch with --invalidation-mode checked-hash and check that it makes a measurable impact?

Took some time, but I got it working now. The impact is smaller than I thought, but still measurable. Including the time it takes to pre-populate the cache with pytest --invalidation-mode checked-hash --collect-only, it's about a 6% improvement.

Edit: 10% for the second run.

The 6% number is for caching only the pytest pyc files. I haven't gotten to it yet, but I'd also like to check whether caching the regular Python pyc files helps as well. They need to be pre-compiled manually though, as Python would otherwise always write timestamp-based pyc files.

Another idea, although specific to the Home Assistant setup, would be to explore how this could be combined with pytest-test-groups. If we already run the --collect-only job, maybe collection can be skipped entirely for the actual pytest runs?

@bluetech (Member)

10% is a good improvement. Do you happen to have a measurement of the initial (uncached) hash-computing run vs. a timestamp one? This is to measure the hashing overhead, if any.

As I said in my previous comment, I think we can just use hash mode always and it will be fine.

> Another idea, although specific to the Home Assistant setup, would be to explore how this could be combined with pytest-test-groups. If we already run the --collect-only job, maybe collection can be skipped entirely for the actual pytest runs?

Caching collection is a thorny issue; I'd avoid attempting it unless you are willing to put in a lot of effort.

@nicoddemus (Member) commented Sep 13, 2023

Hi folks,

Late to the discussion, but my gut feeling is that we should avoid introducing a new option and always use the optimal method behind the scenes. I like the approach of using size+mtime as a shortcut and falling back to hashing if they don't match.

I understand this is not standard according to PEP 552 (I haven't read it, just gleaning it from the discussion here), but will deviating from it somehow make the .pyc files we generate unreadable by external tools? What would be the consequences of that?

We should also benchmark this carefully... generating hashes is fast, but I suspect doing it for every file in a large repository will be noticeably more costly than what we do currently.

@bluetech (Member)

> I understand this is not standard according to PEP 552 (I haven't read it, just gleaning it from the discussion here), but will deviating from it somehow make the .pyc files we generate unreadable by external tools? What would be the consequences of that?

The pyc header doesn't allow both hash and timestamp, so a pyc file needs to be one or the other.

> We should also benchmark this carefully... generating hashes is fast, but I suspect doing it for every file in a large repository will be noticeably more costly than what we do currently.

I ran a benchmark here; my conclusion is that for the initial run, hashing is unlikely to be noticeable even for large repositories, compared to actually reading and parsing the files. The overhead probably comes in on a cache hit: with a timestamp you only need to stat the py file, with hashing you need to read it again. With huge test files on a slow disk, it might be noticeable.
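To make that cache-hit cost concrete, validating a checked-hash pyc requires re-reading the whole source file; roughly (a sketch assuming the PEP 552 layout; the function name and paths are hypothetical):

import struct
from importlib.util import MAGIC_NUMBER, source_hash

def hash_pyc_is_valid(pyc_path, source_path):
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    if header[:4] != MAGIC_NUMBER:
        return False
    flags = struct.unpack("<I", header[4:8])[0]
    if not flags & 0b1:
        return False  # timestamp-based pyc, not hash-based
    # The expensive part on every cache hit: a full read of the source.
    with open(source_path, "rb") as f:
        return header[8:16] == source_hash(f.read())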

@cdce8p (Contributor, Author) commented Apr 27, 2024

Hi guys, I'd like to continue with this one. Since I left it, the testing structure in Home Assistant has changed, which allows for a better performance comparison now. Just a little preview: using this PR with --invalidation-mode=checked-hash yields close to a 50% improvement in our test collection times with GitHub Actions. Of course that's highly project-dependent, but the improvement is noticeable.

For Home Assistant we now run a --collect-only job before the actual tests. With that we create custom test buckets for the test matrix jobs. Here are some numbers:

|            | Before    | After     | Change |
| ---------- | --------- | --------- | ------ |
| Collection | 05:38 min | 03:48 min | +48%   |
| Tests      | 45:05 min | 41:27 min | +8.7%  |
| Total      | 50:43 min | 45:15 min | +12%   |

Home Assistant currently has around 35k tests in 3.3k different files.

> I like the approach of using size+mtime as a shortcut and falling back to hashing if they don't match.

Tbh I don't really have a preference here, as long as there is a way to tell pytest to also use the hash. If you'd like to keep the pyc file format, then the current state with --invalidation-mode is probably the way to go. Maybe we could also add a fallback to SOURCE_DATE_EPOCH like CPython has. Otherwise we would essentially create a new format; not sure that's desirable.
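That fallback could mirror what py_compile does when no explicit mode is given (a sketch; the function name and CLI plumbing are hypothetical):

import os
import py_compile

def pick_invalidation_mode(cli_value=None):
    # An explicit --invalidation-mode flag wins; otherwise SOURCE_DATE_EPOCH
    # (set for reproducible builds) implies checked-hash, like py_compile.
    if cli_value == "checked-hash":
        return py_compile.PycInvalidationMode.CHECKED_HASH
    if cli_value == "timestamp":
        return py_compile.PycInvalidationMode.TIMESTAMP
    if os.environ.get("SOURCE_DATE_EPOCH"):
        return py_compile.PycInvalidationMode.CHECKED_HASH
    return py_compile.PycInvalidationMode.TIMESTAMP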

> I think we can just use hash mode always and it will be fine.

Probably. On any modern machine this shouldn't really be an issue, although I'm not sure about edge cases. The size+mtime comparison works well (just not in CI environments), so I don't really see a need to change that. If anything, I would probably just go with the custom file format and then use the hash as a fallback, as originally suggested. That would at least simplify the logic a bit.

--

One follow-up would be an independent way to control whether the pyc files are written. I'd like to set PYTHONDONTWRITEBYTECODE=1 so that only the pytest pyc files end up in the cache.

One might think that caching the regular bytecode files would be an improvement as well. However, the benefit was only minor (+2%), and it required manually compiling all files with py_compile to be able to use the hash comparison for them too. It's likely that the C loop that creates the bytecode files is simply much faster than the pytest code for assertion rewriting.
