Add an --enable-cached-compositions feature flag #5644
Conversation
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
- Parse CLI flags to feature flags and log them early.
- Use feature flags to gate any further logic (e.g. setting up functions)

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Mehmet Enes <menes.onus@gmail.com>
// If the unstructured cache does not have the claim, hit the API server.
if err := r.apiReader.Get(ctx, req.NamespacedName, cm); err != nil {
	// There's no need to requeue if we no longer exist. Otherwise we'll be
	// requeued implicitly because we return an error.
	log.Debug(errGetClaim, "error", err)
	return reconcile.Result{}, errors.Wrap(resource.IgnoreNotFound(err), errGetClaim)
}
Hrm, is it possible for the reconciler's workqueue to get ahead of the cache backing the client? I thought the workqueue used the same informer as the client cache.
Honestly, the cause of the unstructured cache's unsynced behavior is still unclear to me. I suspect there's a synchronization problem, but I don't know why it happens.
@enesonus I think I know what's going on.
In controller-runtime, reconciles work as follows:
- An informer watches a type of resource. An event handler adds reconciles to the controller's work queue whenever a resource is created, updated, or deleted.
- At the beginning of each reconcile, the controller uses a client to get the latest version of the resource that triggered the reconcile.
Usually the controller's client (used in step 2) would use a cache backed by the same informer that is watching for changes and triggering reconciles (step 1).
For Crossplane XR controllers, that's not the case.
A controller runtime cache is mostly just a bunch of informers. An informer works by listing and then watching resources of a particular type. When a client backed by a cache tries to get or list a new type, the cache starts an informer for that type.
When an XRD is deleted, we need to stop the informer for the XR type, or it will log errors indefinitely due to being unable to list a type that no longer exists. However, when we built the XR controllers there wasn't any way to stop an informer.
We worked around this by creating an entire cache for each XR controller. When an XRD is deleted, instead of stopping an informer within the cache, we stop the whole cache:
https://github.com/crossplane/crossplane-runtime/blob/v1.15.1/pkg/controller/engine.go#L248
However, the XR reconciler uses the controller manager's client, which is backed by a separate cache. The controller manager's client doesn't cache `*Unstructured` by default, so what happens is:
- The XR controller's special cache notices a change, and triggers a reconcile
- The XR reconciler gets the changed resource directly from the API server
With this PR, which enables caching of `*Unstructured` in the controller manager's client, what happens is:
- The XR controller's special cache notices a change, and triggers a reconcile
- The XR reconciler tries to get the changed resource from the controller manager's default, global cache
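For reference, opting `*Unstructured` into the manager client's cache looks roughly like this in recent controller-runtime versions (a minimal sketch, not the exact diff in this PR):

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// A minimal sketch: configure the manager so its client reads
	// *Unstructured objects through the cache instead of going
	// straight to the API server (by default unstructured reads
	// bypass the cache).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Unstructured: true opts *Unstructured reads into the cache.
				Unstructured: true,
			},
		},
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // start controllers, call mgr.Start(ctx), etc.
}
```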
So we're probably seeing one cache trailing behind the other, pretty much as you suspected.
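The race in miniature (plain Go, no controller-runtime; all names here are made up for the example): the cache that triggers the reconcile has already seen the object, but the cache the reconciler reads through hasn't synced yet.

```go
package main

import "fmt"

// store is a toy cache: a snapshot of the objects on the API server.
type store map[string]string

// reconcileAll simulates triggering a reconcile for every object the
// watch cache knows about, but reading each object back through a
// separate client cache. It returns the keys that were "not found".
func reconcileAll(watchCache, clientCache store) []string {
	var misses []string
	for key := range watchCache {
		if _, ok := clientCache[key]; !ok {
			misses = append(misses, key)
		}
	}
	return misses
}

func main() {
	// The watch cache saw the create event; the client cache is trailing.
	watchCache := store{"my-xr": "v1"}
	clientCache := store{}

	for _, key := range reconcileAll(watchCache, clientCache) {
		fmt.Printf("reconcile %q: not found in trailing client cache\n", key)
	}
	// prints: reconcile "my-xr": not found in trailing client cache
}
```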
I think the right solution here is to use only one cache. This is for two reasons:
- We won't have issues with two caches being out of sync
- If we read an XR using the controller-manager's cache we'll need to stop the informer for that XR type when its XRD is deleted, or the informer will break and spew error messages
I noticed that late last year kubernetes-sigs/controller-runtime#2285 added support for stopping informers in controller-runtime, so I'm taking a look at that as a possible solution.
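If that pans out, the shape of the fix might look something like this sketch (hypothetical usage of the API added by kubernetes-sigs/controller-runtime#2285; the actual crossplane-runtime change may differ):

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// stopXRInformer is a hypothetical helper: when an XRD is deleted, remove
// just the informer for its XR type from the shared cache, instead of
// tearing down a whole per-controller cache.
func stopXRInformer(ctx context.Context, c cache.Cache, gvk schema.GroupVersionKind) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	// RemoveInformer was added to the cache interface by
	// kubernetes-sigs/controller-runtime#2285.
	return c.RemoveInformer(ctx, u)
}
```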
Wow, that's great news! I was also looking at crossplane-runtime to understand the issue, but I wasn't very successful since the codebase is new to me. Would you be able to create the relevant PR? I can also make the changes, but I'll probably need some help figuring out what to change 😅
Currently this PR might not pass all E2E tests, but it passes more of them than the original PR it's built on. Test behavior is a bit inconsistent at the moment (it sometimes passes all tests). I'm still investigating to understand the underlying issue.
Description of your changes
Fixes #5338
This PR is built on top of @negz's #5339. It enables caching for `*Unstructured` objects by fixing the early cache read problem in #5339 that caused E2E tests to fail.
I have:
- Run `make reviewable` to ensure this PR is ready for review.
- Added `backport release-x.y` labels to auto-backport this PR.

Need help with this checklist? See the cheat sheet.