Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ROX-23714: Lazy load registry repo list #11163

Closed
wants to merge 17 commits into from
Closed

Conversation

dcaravel
Copy link
Contributor

@dcaravel dcaravel commented May 17, 2024

Description

Currently when Central starts all integrations are loaded into memory sequentially. For each integration a list of associated repositories is created. Populating this list takes time and delays Central startup (ref).

The repo list is used when determining if a registry integration is a match for an image.

Instead of loading the initial repo list when creating a new image integration, load it lazily upon first call to Match (and then as usual on subsequent calls).

This will result in faster Central startup and reduced initial memory consumption.

In the future repo list should be removed entirely, ACS may be making a wrong assumption that if a repo is not in the list then it is not in the registry, which is contrary to this doc:

... a missing entry does not mean that the registry does not have the repository. More succinctly, the presence of a repository only guarantees that it is there but not that it is not there.

Checklist

  • Investigated and inspected CI test results
  • Unit test and regression tests added
  • Evaluated and added CHANGELOG entry if required
  • Determined and documented upgrade steps
  • Documented user facing changes (create PR based on openshift/openshift-docs and merge into rhacs-docs)

If any of these don't apply, please comment below.

Testing Performed

Unit tests as well as an infra OCP cluster with a mock registry deployed (broken-image-reg) configured to add a 10 second delay to the /v2/_catalog repo list calls - and 20 pull secrets with unique routes to the broken registry created in the OCP cluster.

Without the fix, Central startup is delayed just under two minutes:

2024/05/21 00:52:14.621790 orchestrator.go:231: Info: Successfully fetched 0 Istio CVEs
2024/05/21 00:53:56.624237 instance_config.go:113: Info: Phonehome telemetry collection disabled.

A followup roxctl image scan returns instantly (error expected):

$ time rctl image scan --image=broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/broken/error:500 --retries 0
ERROR:	scaning image failed after 0 retries: could not scan image: "broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/broken/error:500": rpc error: code = Internal desc = image enrichment error: error getting metadata for image: broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/broken/error:500 error: getting metadata from registry: "Autogenerated https://broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems for cluster remote": failed to get the manifest digest: Head "https://broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/v2/broken/error/manifests/500": http: non-successful response (status=500 body="")

real	0m0.432s
user	0m0.109s
sys	0m0.046s

With the fix, Central starts up faster:

2024/05/21 00:56:22.685922 orchestrator.go:231: Info: Successfully fetched 0 Istio CVEs
2024/05/21 00:56:23.409195 singleton.go:43: Info: Injecting 86 policies into detectors.

The initial repo list load is instead made during the first call to match, which is demonstrated via a roxctl image scan. Notice how it returns after about 10 seconds (the timeout set on the /v2/_catalog endpoint):

$ time rctl image scan --image=broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/broken/error:500 --retries 0
ERROR:	scaning image failed after 0 retries: could not scan image: "broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/broken/error:500": rpc error: code = Internal desc = image enrichment error: error getting metadata for image: broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/broken/error:500 error: getting metadata from registry: "Autogenerated https://broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems for cluster remote": failed to get the manifest digest: Head "https://broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/v2/broken/error/manifests/500": http: non-successful response (status=500 body="")

real	0m10.391s
user	0m0.114s
sys	0m0.047s

A second call to roxctl returns immediately indicating the repo list call is/was not made again:

$ time rctl image scan --image=broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/broken/error:500 --retries 0
ERROR:	scaning image failed after 0 retries: could not scan image: "broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/broken/error:500": rpc error: code = Internal desc = image enrichment error: error getting metadata for image: broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/broken/error:500 error: getting metadata from registry: "Autogenerated https://broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems for cluster remote": failed to get the manifest digest: Head "https://broken1-broken-image-reg.apps.dc-05-20-1.ocp.infra.rox.systems/v2/broken/error/manifests/500": http: non-successful response (status=500 body="")

real	0m0.405s
user	0m0.119s
sys	0m0.047s

Copy link

openshift-ci bot commented May 17, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@rhacs-bot
Copy link
Contributor

rhacs-bot commented May 17, 2024

Images are ready for the commit at 10aec9c.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.4.x-726-g10aec9ce03.

@dcaravel dcaravel closed this May 21, 2024
@dcaravel
Copy link
Contributor Author

Rebases gone bad... closing

@dcaravel
Copy link
Contributor Author

#11186 <- new PR

@dcaravel dcaravel deleted the dc/lazy-repo-list branch May 21, 2024 01:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants