
🌱 Proposal for dynamic informer cache #2285

Merged: 2 commits into kubernetes-sigs:main from dynamic-informer-cache, Oct 23, 2023

Conversation

maxsmythe (Contributor)

This PR shows how Gatekeeper has forked controller runtime to support the dynamic addition/removal of informers.

Happy to flesh this out if people are interested. Not sure what the correct licensing actions are for moving code across CNCF projects.

This is related to #2159 and #1884

Basically, something like this PR will be necessary to clean up informers once no more controllers are using them.

In the interim, having this code would be helpful to Gatekeeper by eliminating our need to maintain a fork, which has been fairly labor-intensive. It may also give other members of the community a way to meet their needs for dynamic watches while waiting for the dynamic controller/reference counting approach to be implemented.
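
For illustration, here is a minimal sketch of the kind of usage dynamic informer addition/removal enables, assuming the cache ends up exposing a removal method along the lines this PR proposes (the exact method name and signature may differ from what is merged):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	c, err := cache.New(config.GetConfigOrDie(), cache.Options{})
	if err != nil {
		panic(err)
	}

	ctx := ctrl.SetupSignalHandler()
	go func() {
		// Start blocks until ctx is cancelled; informers are created lazily.
		_ = c.Start(ctx)
	}()

	// Asking for an informer registers and starts it dynamically.
	if _, err := c.GetInformer(ctx, &corev1.ConfigMap{}); err != nil {
		panic(err)
	}

	// Once no controller needs ConfigMaps anymore, drop the informer so its
	// watch and cached objects are released (method name assumed from this PR).
	if err := c.RemoveInformer(ctx, &corev1.ConfigMap{}); err != nil {
		panic(err)
	}
}
```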

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 22, 2023
@k8s-ci-robot (Contributor)

Welcome @maxsmythe!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @maxsmythe. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 22, 2023
@maxsmythe maxsmythe changed the title Proposal for dynamic informer cache 🌱 Proposal for dynamic informer cache Apr 22, 2023
@ritazh (Member) commented May 3, 2023

@vincepri can you please help take a look at this?

@braghettos

We also have a use case where a feature like this one could be really helpful: we generate CRDs at runtime, and we would like to watch all the CRs based on those CRDs.

@vincepri (Member) commented May 4, 2023

In the queue, should be done reviewing by EOW tomorrow

@vincepri (Member) commented May 4, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 4, 2023
@vincepri (Member) commented May 4, 2023

@maxsmythe We should probably add some tests before merging as well

Review threads (outdated, resolved): pkg/cache/informer_cache.go, pkg/cache/internal/informers.go (2 threads)
@maxsmythe (Contributor, Author)

Thanks for the review!

Regarding tests, I was holding off until I got signal that there was interest in the idea. It sounds like that hurdle is cleared?

@maxsmythe (Contributor, Author)

@vincepri Any feedback on the correct thing to do vis-a-vis licensing, since this code is copied from the Gatekeeper project (which is Apache licensed and CNCF-owned)?

Not sure if this requires a NOTICE file, or a citation in the comment or similar.

@vincepri (Member) commented May 8, 2023

Where is the code copied from? In general, I have no idea how that works, but I'd expect code for this specific feature to be net-new in controller runtime, especially with the changes proposed above.

@maxsmythe (Contributor, Author)

Code currently lives here:

https://github.com/open-policy-agent/gatekeeper/tree/release-3.12/third_party/sigs.k8s.io/controller-runtime/pkg/dynamiccache

That directory is essentially a fork of controller runtime, where the code changes look pretty much like what I've done in this PR.

@braghettos

@maxsmythe @vincepri any updates on the proposed PR? This looks so useful for many use cases!

@maxsmythe maxsmythe force-pushed the dynamic-informer-cache branch 3 times, most recently from d9ecde9 to 55353dc on May 13, 2023 01:15
@maxsmythe (Contributor, Author)

@braghettos Sorry, I got distracted. Just pushed a first pass responding to feedback now.

@vincepri I took a stab at addressing your comments. LMK what you think. Also let me know what, if any, additional tests you'd like to see.

Tests pass locally; not sure why there is a failure currently.

@maxsmythe (Contributor, Author)

/retest

@maxsmythe (Contributor, Author)

Unit tests passed on retest. Still haven't been able to get them to fail locally. Rare flake?

@braghettos

Any updates on this PR?

@stevekuznetsov (Contributor)

Kubernetes upstream added support for dynamic informer lifecycle handling for the shared informer factory in 1.26. Any chance we could lean on that?
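
For reference, a rough sketch of the client-go (Kubernetes 1.26+) primitives that comment refers to: SharedInformerFactory.Shutdown and per-handler removal via the registration handle returned by AddEventHandler. Whether these can replace a cache-level removal in controller-runtime is the open question; the kubeconfig wiring below is just for illustration.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	cmInformer := factory.Core().V1().ConfigMaps().Informer()

	// Since client-go v0.26, AddEventHandler returns a registration handle...
	reg, err := cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			fmt.Println("configmap added:", obj.(*corev1.ConfigMap).Name)
		},
	})
	if err != nil {
		panic(err)
	}

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// ...which lets a caller detach just this handler later on.
	if err := cmInformer.RemoveEventHandler(reg); err != nil {
		panic(err)
	}

	// Shutdown stops every informer the factory started and waits for their
	// goroutines to exit. Note it stops the whole factory, not a single GVK.
	close(stop)
	factory.Shutdown()
}
```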

@maxsmythe (Contributor, Author)

No worries!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: maxsmythe, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2023
@k8s-ci-robot k8s-ci-robot merged commit cc2a25b into kubernetes-sigs:main Oct 23, 2023
8 of 9 checks passed
@vincepri (Member)

@maxsmythe Chatting with @alvaroaleman, it'd be good going forward to have a small design doc that talks about how we can get to the larger goal of dynamic watchers.

@alvaroaleman (Member)

@maxsmythe given that we do not have anything to dynamically manage watches on controllers, would you be OK if we changed this to StopInformer but did not remove it from the map? It seems that would still solve your problem without resulting in confusing behavior if the informer is later re-created (existing event sources would not get events).

@maxsmythe (Contributor, Author)

Thanks for merging!!

@vincepri @alvaroaleman Definitely interested in making this more user-friendly. Lemme know how I can help.

@alvaroaleman The one concern I would have is memory leaks. The main reason to remove the informer is to also remove the cache data, cleaning stale state.

Is there some way to remove the cache but keep a record of interested sources (this could still be a memory leak, but a smaller one)? Is there a way to have two methods (stop informer and remove informer)? Maybe figuring out the edge cases is part of the larger "dynamic watcher" story, and this is more of a stopgap on the way there?

@alvaroaleman (Member)

@maxsmythe memory leak because there is still data in the informer (the state of the world before it was stopped) or because of the informer itself?

@alvaroaleman (Member)

Definitely interested in making this more user-friendly. Lemme know how I can help.

It's less about this particular change and more about having an overall "dynamic management of event sources" story, which would include things like adding/removing them on a controller post-start. Right now it seems people are submitting code for their particular itch, but I'd really like someone to step up and describe the overall user stories and public interface changes in a design doc. That wouldn't mean the same person has to implement any of it, but it would ensure that what we merge is aligned with the overall north star.

@maxsmythe (Contributor, Author)

Mostly concerned about the state of the world, as that's the larger dataset (there can be a LOT of very large config maps in a cluster, for instance).

Having a persistent record of an informer could break down if there is a lot of creation/destruction of CRDs with novel kinds. This seems a harder edge case to get into, but IMO always worth architecting defensively, or at least giving users the ability to mitigate the issue if they're subject to it (though it may require extra effort on their part).

WRT "dynamic management of event sources" happy to take a stab at that. AFAIK the first attempt is #2159, which appears to be focused on starting/stopping controllers ad-hoc. Is there any other state I should know about? I can sketch some rough thoughts on a Google doc to start with, unless you have a more preferred way of handling these things?

@alvaroaleman (Member)

Mostly concerned about the state of the world, as that's the larger dataset (there can be a LOT of very large config maps in a cluster, for instance).

Yeah, maybe there is a way to clean up the store in the informer?

AFAIK the first attempt is #2159, which appears to be focused on starting/stopping controllers ad-hoc. Is there any other state I should know about? I can sketch some rough thoughts on a Google doc to start with, unless you have a more preferred way of handling these things?

Only the linked issues, to my knowledge. And starting to sketch something would be really helpful :) The PRs make it too easy to miss the forest for the trees because these changes are big.

@maxsmythe (Contributor, Author)

Still working on the stories. @alvaroaleman Is there a timeline yet for when a release may be cut that has this?

negz added a commit to negz/crossplane that referenced this pull request May 7, 2024
Crossplane uses a controller engine to dynamically start claim and XR
controllers when a new XRD is installed.

Before this commit, each controller gets at least one cache. This is
because when I built this functionality, you couldn't stop a single
informer within a cache (a cache is basically a map of informers by
GVK).

When realtime composition is enabled, there are even more caches. One
per composed resource GVK. A GVK routed cache routes cache lookups to
these various delegate caches.

Meanwhile, controller-runtime recently made it possible to stop an
informer within a cache. It's also been possible to remove an event
handler from an informer for some time (since Kubernetes 1.26).

kubernetes-sigs/controller-runtime#2285
kubernetes-sigs/controller-runtime#2046

This commit uses a single client, backed by a single cache, across all
dynamic controllers (specifically the definition, offered, claim, and
XR controllers).

Compared to the current implementation, this commit:

* Ensures all dynamic controllers use clients backed by the same cache
  used to power watches (i.e. trigger reconciles).
* Takes fewer global locks when realtime compositions are enabled.
  Locking is now mostly at the controller scope.
* Works with the breaking changes to source.Source introduced in
  controller-runtime v0.18. :)

Notably when realtime compositions are enabled, XR controllers will get
XRs and composed resources from cache. Before this commit, their client
wasn't backed by a cache. They'd get resources directly from the API
server. Similarly, the claim controller will read claims from cache.

Finally, I think this makes the realtime composition code a little easier
to follow by consolidating it into the ControllerEngine, but that's
pretty subjective.

Signed-off-by: Nic Cope <nicc@rk0n.org>
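
As a rough illustration of the pattern described in that commit message (several dynamically started controllers sharing one cache, with per-GVK watches torn down once no controller needs them), here is a hedged reference-counting sketch. The engine type and its methods are hypothetical, not Crossplane's actual code, and it assumes a cache-level RemoveInformer method like the one added by this PR:

```go
package engine

import (
	"context"
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Engine tracks how many running controllers still need each watched type.
// The GVK key is passed in by the caller to keep the sketch short.
type Engine struct {
	mu       sync.Mutex
	cache    cache.Cache
	refCount map[string]int
}

func New(c cache.Cache) *Engine {
	return &Engine{cache: c, refCount: map[string]int{}}
}

// Watch registers interest in obj's type; the underlying informer is shared
// by every controller that asks for it.
func (e *Engine) Watch(ctx context.Context, obj client.Object, gvk string) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if _, err := e.cache.GetInformer(ctx, obj); err != nil {
		return err
	}
	e.refCount[gvk]++
	return nil
}

// Unwatch drops interest; once the count reaches zero the informer (and its
// store) is removed from the shared cache (assumed RemoveInformer method).
func (e *Engine) Unwatch(ctx context.Context, obj client.Object, gvk string) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.refCount[gvk] > 1 {
		e.refCount[gvk]--
		return nil
	}
	delete(e.refCount, gvk)
	return e.cache.RemoveInformer(ctx, obj)
}
```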