Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache unit test results [NET-4965] #18333

Closed
wants to merge 20 commits into from

Conversation

nfi-hashicorp
Copy link
Collaborator

@nfi-hashicorp nfi-hashicorp commented Jul 31, 2023

I was curious and started looking into setup-go's new on-by-default cache flag, and it turns out it probably doesn't work very well unfortunately. TLDR: it not only caches GOMODCACHE, but also GOCACHE; it isn't specific enough about input hashing and ends up caching a bunch of junk that actually slows the build down.

As part of repro, I experimented with this repo and ended up getting about halfway to a working implementation of GOCACHE including test caching, so I did a bunch more experimenting, tidied it up, and here we are.

There are some kind of big changes here to the CI strategy, so I'll justify them with data from my experimental PR.

Test result and build object caching

The main course. In Go, test results and build objects are stored in the same cache.

I made some attempts to keep them separate (e.g. go test -c to pre-compile tests), but to no avail. It would have been nice, since the build cache is 100s of MBs, and the test cache is 10s.

Pros

A run from an empty cache takes ~ 9 minutes, which is already better than main, which takes ~ 11 min. A second run takes ~ xxx min.

Cons

Cache blobs are scoped to branches (GHA policy). The way I've designed this, we will have a cache blob for each (Go version, GOARCH, module) tuple (at least), which at current count is 8. These caches are 100s of MBs, so I can imagine them getting out of hand. There is a 10GB repo-level cache size limit, but it apparently isn't enforced since we are currently sitting at about 50GB.

Cache blobs mostly only grow, until they are reaped by GHA's recency-based garbage collector. In this implementation, I do no GC inside the cache blob. Go itself does trim cache objects, but it is based on mtime and not disk usage. There's no way to set a maximum disk usage limit.

Cache blobs are relatively fast to restore/save. IME it's on the order of 15s for a typical 500 MB blob, either way.

For some reason, cache blobs are not overwriteable. So I worked around this by deleting and then pushing the amended blob. If you're doing something non-linear like rerunning an old commit while the current commit is running, last write wins, and there's no locking AFAICT, so you might end up writing a stale cache.

No shared dev build

I deleted the dev-build step. Now the tests just each build it. dev-build builds a consul binary and shares it as an artifact to all the tests. It's a well-intentioned idea, but unfortunately the binary is quite big 100s of MB) and GHA artifact upload/download is for some mysterious reason about 10x slower than cache save/restore.

Also, after push, the tests all need to wait for dev-build and can do no useful work until then.

I'd be willing to bet the consul binary shares some cache objects with the unit tests. So on subsequent builds, it should be 0s to build it.

No test splitting

Mostly I removed this because caching would be hard. We need per-runner caches because there's no way to do concurrent writes to the same cache (safely). So with test splitting by package into N buckets, we'd have to do some sort of consistent hashing/sharding/etc that is just not the sort of thing I want to do in Bash.

But also, I found that by fixing flakiness and adding test/build caching, module-level splitting (like we have in this PR) is plenty fast.

Explicitly pass the Go version

I declare the Go versions we use (N and N-1) at the top and explicitly feed them to every unit test. Caches are sensitive to any change in Go version, so we must keep this controlled. Previously, we would just specify 1.20.x to setup-go and it may or may not download the latest.

No gotestsum retries

A test results cache is basically like retries. IMO gotestsum retries are a hack. IMO if we consistently see a test flaking, we should disable it and fix it later. (I made a PR here. Probably bears further discussion.)

Additionally, I've discovered that the go test -run flag is not specially handled; go test -run '.*' . is a different cache key than go test ., even though they should be equivalent. gotestsum's retry mechanism uses -run, so we can't use them together effectively. When the first run (without -run) fails, none of its PASSes are cached. (This is maybe a bug with Go, maybe with gotestsum, IDK.)

Sidequest: freeport retries

I added a new function to freeport: RetryMustGetN, which is like GetN but retries. This was a big source of flakiness.

Sidequest: fix test results artifacts

Our invocations of reusable-unit{,split}.yml actually saved their test results over top of one another. The artifact namespace is shared for all jobs in the workflow (even called workflows).

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern
  • changelog: I only touched CI and test code; nothing user-facing

@github-actions github-actions bot added the theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies label Jul 31, 2023
@nfi-hashicorp nfi-hashicorp changed the title add freeport.RetryMustGetN and use it cache unit test results [NET-4965] Jul 31, 2023
@nfi-hashicorp nfi-hashicorp added backport/1.15 Deprecated: use backport/ent/1.15 backport/1.16 Deprecated: use backport/ent/1.16 pr/no-changelog PR does not need a corresponding .changelog entry labels Jul 31, 2023
@huikang
Copy link
Collaborator

huikang commented Aug 2, 2023

Also, after push, the tests all need to wait for dev-build and can do no useful work until then.

Could you point to one or two examples in some tests to support this statement?

@nfi-hashicorp
Copy link
Collaborator Author

Also, after push, the tests all need to wait for dev-build and can do no useful work until then.

Could you point to one or two examples in some tests to support this statement?

All of the tests depend on the dev-build job in the workflow, so they can't start until it's finished. For example, latest run on main: https://github.com/hashicorp/consul/actions/runs/5739962589, you can see that it blocks:

image

In dev-build, it spends 25s uploading the artifact
image

Then later in each test job, a few seconds downloading it. Compare this to the build time, which is 33s. Not much savings to be had.

@nfi-hashicorp
Copy link
Collaborator Author

nfi-hashicorp commented Aug 2, 2023

Noticing some cache misses due to testcache: github.com/hashicorp/consul/tlsutil: input file /home/runner/work/consul/consul/tlsutil/cert.pem: file used as input is too new; looking into it.

  • /tlsutil/cert.pem cache miss

@nfi-hashicorp nfi-hashicorp force-pushed the nfi-cache-unit-test-results branch 7 times, most recently from 3dcb67b to 3285fcc Compare August 2, 2023 21:01
@nfi-hashicorp
Copy link
Collaborator Author

nfi-hashicorp commented Aug 2, 2023

There is something uncacheable happening with oss_root_amd64

image

Compare to any_root_386

image
  • investigate oss_root_amd64 caching

any_root_386 uses -short, so I guess it must be a long test

@nfi-hashicorp
Copy link
Collaborator Author

nfi-hashicorp commented Aug 2, 2023

It may also be worth saving the GOMODCACHE. Even with cached binaries, I suspect Go still needs to download all the source. We should have gotestsum and our build of consul in cache, so all this time (almost a minute) must be downloading source?

image

Update: it saved about 20s between this and this re-run. The cost is a 612MB.

@nfi-hashicorp
Copy link
Collaborator Author

nfi-hashicorp commented Aug 3, 2023

Hmm, I've discovered a surprising behavior. go test doesn't really cache at the test function level, it tests at the invocation level. -run is keyed literally, so regexes that select the same tests actually cache separately! Even equivalent regexes, like ^ and .*.

For example
cachetest % cat cache_test.go 
package test

import "testing"

func TestA(t *testing.T) {

}

func TestB(t *testing.T) {

}

func TestC(t *testing.T) {

}
cachetest % GODEBUG=gocachetest=1 GOCACHE=/tmp/foo go test -v .
testcache: github.com/hashicorp/consul/cachetest: test ID 4d57304767637a616a435375486550573334757a => fc9086925f29b4fb6f5b160239f09241e426a5c299b35e9698a96b74d6241439
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/fc/fc9086925f29b4fb6f5b160239f09241e426a5c299b35e9698a96b74d6241439-a: no such file or directory
testcache: github.com/hashicorp/consul/cachetest: test ID 3859515357797133445f754c39746b5844336c41 => cdf05723beb1e5f9d1474de78ff6355918f05f603dfef5814d7111caacac25d2
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/cd/cdf05723beb1e5f9d1474de78ff6355918f05f603dfef5814d7111caacac25d2-a: no such file or directory
=== RUN   TestA
--- PASS: TestA (0.00s)
=== RUN   TestB
--- PASS: TestB (0.00s)
=== RUN   TestC
--- PASS: TestC (0.00s)
PASS
ok      github.com/hashicorp/consul/cachetest   0.460s
testcache: github.com/hashicorp/consul/cachetest: save test ID fc9086925f29b4fb6f5b160239f09241e426a5c299b35e9698a96b74d6241439 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => 647d17e204bbd809d03a83e20e15b017fb6bf1babbb5148a035a23d1d6ce3452
testcache: github.com/hashicorp/consul/cachetest: save test ID cdf05723beb1e5f9d1474de78ff6355918f05f603dfef5814d7111caacac25d2 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => fd57189da362f7c3e41d812d175b51b81169f6c342e8dd33bfe43cfd74103757
cachetest % GODEBUG=gocachetest=1 GOCACHE=/tmp/foo go test -run '.*' -v .  
testcache: github.com/hashicorp/consul/cachetest: test ID 4d57304767637a616a435375486550573334757a => 40584c921d8b9af89bbed4c85a7b4c548252a4d7e6c17556b9f3ad7eca3b7442
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/40/40584c921d8b9af89bbed4c85a7b4c548252a4d7e6c17556b9f3ad7eca3b7442-a: no such file or directory
testcache: github.com/hashicorp/consul/cachetest: test ID 3859515357797133445f754c39746b5844336c41 => b6743758461c64234ec892693221c32ad05fc7f67192fc9dc3c04bb5c7f79e11
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/b6/b6743758461c64234ec892693221c32ad05fc7f67192fc9dc3c04bb5c7f79e11-a: no such file or directory
=== RUN   TestA
--- PASS: TestA (0.00s)
=== RUN   TestB
--- PASS: TestB (0.00s)
=== RUN   TestC
--- PASS: TestC (0.00s)
PASS
ok      github.com/hashicorp/consul/cachetest   0.155s
testcache: github.com/hashicorp/consul/cachetest: save test ID 40584c921d8b9af89bbed4c85a7b4c548252a4d7e6c17556b9f3ad7eca3b7442 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => 6bf89d65b0147026d24412adeb0b8f58b9e5f9b087ea2e559879c37a30b2c3c7
testcache: github.com/hashicorp/consul/cachetest: save test ID b6743758461c64234ec892693221c32ad05fc7f67192fc9dc3c04bb5c7f79e11 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => bcd5bc9bb4c714711a24e3e2df77c0d4185e17400abea766b5ecbf0ade44307e
cachetest % GODEBUG=gocachetest=1 GOCACHE=/tmp/foo go test -run 'Test.*' -v .
testcache: github.com/hashicorp/consul/cachetest: test ID 4d57304767637a616a435375486550573334757a => fbfd9fa1d1152f5c478e6bb0274ba944e1fa4c354452e78191e1ba2bf999b435
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/fb/fbfd9fa1d1152f5c478e6bb0274ba944e1fa4c354452e78191e1ba2bf999b435-a: no such file or directory
testcache: github.com/hashicorp/consul/cachetest: test ID 3859515357797133445f754c39746b5844336c41 => 8dd5185abd7e636035d8cb504056a731097128ddf3156efa7a98e6e6f0cfc7fe
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/8d/8dd5185abd7e636035d8cb504056a731097128ddf3156efa7a98e6e6f0cfc7fe-a: no such file or directory
=== RUN   TestA
--- PASS: TestA (0.00s)
=== RUN   TestB
--- PASS: TestB (0.00s)
=== RUN   TestC
--- PASS: TestC (0.00s)
PASS
ok      github.com/hashicorp/consul/cachetest   0.119s
testcache: github.com/hashicorp/consul/cachetest: save test ID fbfd9fa1d1152f5c478e6bb0274ba944e1fa4c354452e78191e1ba2bf999b435 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => 17e04a10db95742bbfaa9cfd7d34c412b0fd4ec7dde26260e9ca0bf82da42578
testcache: github.com/hashicorp/consul/cachetest: save test ID 8dd5185abd7e636035d8cb504056a731097128ddf3156efa7a98e6e6f0cfc7fe => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => cff587dad94536fcdc754b5b942dce86881323c92f004275515c5c83a16eeebd

To be fair, I think it says this on the tin, just surprising

Update: this totally precludes us from combining the Go test cache with gotestsum reruns, which use -run to rerun specific test funcs. Will update the caveats section.

@nfi-hashicorp nfi-hashicorp changed the base branch from main to disable-flaky August 4, 2023 03:12
@nfi-hashicorp
Copy link
Collaborator Author

I rebased onto this branch which moves retries inside the function in question. However, I think it was probably too blunt, because those flaky tests still flaked. We may need to capture FailNow and friends somehow (recover()?).

@nfi-hashicorp
Copy link
Collaborator Author

In light of #18365 (comment), I've gone back to just skipping the flaky tests and basing on main. Given that I don't have a great solution to deal with the flaky tests (short of y'know, actually fixing them), I'm going to put this back in draft

@nfi-hashicorp nfi-hashicorp changed the base branch from disable-flaky to main August 8, 2023 19:34
@github-actions
Copy link

github-actions bot commented Oct 8, 2023

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

@github-actions github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label Oct 8, 2023
@github-actions github-actions bot removed the meta/stale Automatically flagged for inactivity by stalebot label Nov 18, 2023
Copy link

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

@github-actions github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label Jan 18, 2024
Copy link

Closing due to inactivity. If you feel this was a mistake or you wish to re-open at any time in the future, please leave a comment and it will be re-surfaced for the maintainers to review.

@github-actions github-actions bot closed this Feb 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.15 Deprecated: use backport/ent/1.15 backport/1.16 Deprecated: use backport/ent/1.16 meta/stale Automatically flagged for inactivity by stalebot pr/no-changelog PR does not need a corresponding .changelog entry theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants