Add an --enable-cached-compositions feature flag #5644
Conversation
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
- Parse CLI flags to feature flags and log them early.
- Use feature flags to gate any further logic (e.g. setting up functions)

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Mehmet Enes <menes.onus@gmail.com>
// If the unstructured cache does not have the claim, hit the API server.
if err := r.apiReader.Get(ctx, req.NamespacedName, cm); err != nil {
	// There's no need to requeue if we no longer exist. Otherwise we'll be
	// requeued implicitly because we return an error.
	log.Debug(errGetClaim, "error", err)
	return reconcile.Result{}, errors.Wrap(resource.IgnoreNotFound(err), errGetClaim)
}
Hrm, is it possible for the reconciler's workqueue to get ahead of the cache backing the client? I thought the workqueue used the same informer as the client cache.
Honestly, the cause of the unstructured cache's unsynced behavior is still unclear to me. I suspect there's a synchronization problem, but I don't know why it happens.
@enesonus I think I know what's going on.
In controller-runtime, reconciles work as follows:
- An informer watches a type of resource. An event handler adds reconciles to the controller's work queue whenever a resource is created, updated, or deleted.
- At the beginning of each reconcile, the controller uses a client to get the latest version of the resource that triggered the reconcile.
Usually the controller's client (used in step 2) would use a cache backed by the same informer that is watching for changes and triggering reconciles (step 1).
For Crossplane XR controllers, that's not the case.
A controller runtime cache is mostly just a bunch of informers. An informer works by listing and then watching resources of a particular type. When a client backed by a cache tries to get or list a new type, the cache starts an informer for that type.
When an XRD is deleted, we need to stop the informer for the XR type, or it will log errors indefinitely due to being unable to list a type that no longer exists. However, when we built the XR controllers there wasn't any way to stop an informer.
We worked around this by creating an entire cache for each XR controller. When an XRD is deleted, instead of stopping an informer within the cache, we stop the whole cache:
https://github.com/crossplane/crossplane-runtime/blob/v1.15.1/pkg/controller/engine.go#L248
However, the XR reconciler uses the controller manager's client, which is backed by a separate cache. The controller manager's client doesn't cache `*Unstructured` by default, so what happens is:
- The XR controller's special cache notices a change, and triggers a reconcile
- The XR reconciler gets the changed resource directly from the API server
With this PR, which enables caching of `*Unstructured` in the controller manager's client, what happens is:
- The XR controller's special cache notices a change, and triggers a reconcile
- The XR reconciler tries to get the changed resource from the controller manager's default, global cache
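For reference, opting `*Unstructured` into the manager client's cache looks roughly like this in recent controller-runtime versions (a minimal sketch, not the exact diff in this PR):

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// A minimal sketch: configure the manager so its client reads
	// *Unstructured objects through the cache instead of going
	// straight to the API server (by default unstructured reads
	// bypass the cache).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Unstructured: true opts *Unstructured reads into the cache.
				Unstructured: true,
			},
		},
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // start controllers, call mgr.Start(ctx), etc.
}
```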
So we're probably seeing one cache trailing behind the other, pretty much as you suspected.
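The race in miniature (plain Go, no controller-runtime; all names here are made up for the example): the cache that triggers the reconcile has already seen the object, but the cache the reconciler reads through hasn't synced yet.

```go
package main

import "fmt"

// store is a toy cache: a snapshot of the objects on the API server.
type store map[string]string

// reconcileAll simulates triggering a reconcile for every object the
// watch cache knows about, but reading each object back through a
// separate client cache. It returns the keys that were "not found".
func reconcileAll(watchCache, clientCache store) []string {
	var misses []string
	for key := range watchCache {
		if _, ok := clientCache[key]; !ok {
			misses = append(misses, key)
		}
	}
	return misses
}

func main() {
	// The watch cache saw the create event; the client cache is trailing.
	watchCache := store{"my-xr": "v1"}
	clientCache := store{}

	for _, key := range reconcileAll(watchCache, clientCache) {
		fmt.Printf("reconcile %q: not found in trailing client cache\n", key)
	}
	// prints: reconcile "my-xr": not found in trailing client cache
}
```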
I think the right solution here is to use only one cache. This is for two reasons:
- We won't have issues with two caches being out of sync
- If we read an XR using the controller-manager's cache we'll need to stop the informer for that XR type when its XRD is deleted, or the informer will break and spew error messages
I noticed that late last year kubernetes-sigs/controller-runtime#2285 added support for stopping informers in controller-runtime, so I'm taking a look at that as a possible solution.
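If that pans out, the shape of the fix might look something like this sketch (hypothetical usage of the API added by kubernetes-sigs/controller-runtime#2285; the actual crossplane-runtime change may differ):

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// stopXRInformer is a hypothetical helper: when an XRD is deleted, remove
// just the informer for its XR type from the shared cache, instead of
// tearing down a whole per-controller cache.
func stopXRInformer(ctx context.Context, c cache.Cache, gvk schema.GroupVersionKind) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	// RemoveInformer was added to the cache interface by
	// kubernetes-sigs/controller-runtime#2285.
	return c.RemoveInformer(ctx, u)
}
```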
Wow, that's great news! I was also looking at crossplane-runtime to understand the issue, but I wasn't very successful since the codebase is new to me. Would you be able to create the relevant PR? I can also make the changes, but I'll probably need some help figuring out what to change 😅
Currently this PR might not pass all E2E tests, but it passes more of them than the original PR it's built on. Test behavior is a bit inconsistent at the moment (it sometimes passes all tests). I'm still investigating to understand the underlying issue.
Description of your changes
Fixes #5338
This PR is built on top of @negz's #5339. It enables caching for `*Unstructured` objects by fixing the early cache read problem in #5339 that caused E2E tests to fail.
I have:
- Run `make reviewable` to ensure this PR is ready for review.
- Added `backport release-x.y` labels to auto-backport this PR.

Need help with this checklist? See the cheat sheet.