
🌱 Proposal for dynamic informer cache #2285

Merged: 2 commits into kubernetes-sigs:main from dynamic-informer-cache, Oct 23, 2023

Conversation

maxsmythe (Contributor)

This PR shows how Gatekeeper has forked controller runtime to support the dynamic addition/removal of informers.

Happy to flesh this out if people are interested. Not sure what the correct licensing actions are for moving code across CNCF projects.

This is related to #2159 and #1884

Basically, something like this PR will be necessary to clean up informers once no more controllers are using them.

In the interim, having this code would be helpful to Gatekeeper by eliminating our need to maintain a fork, which has been fairly labor-intensive. It may also give other members of the community a way to meet their needs for dynamic watches while waiting for the dynamic controller/reference counting approach to be implemented.
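
For illustration, here is a minimal sketch of the kind of usage dynamic informer addition/removal enables, assuming the cache ends up exposing a removal method along the lines this PR proposes (the exact method name and signature may differ from what is merged):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	c, err := cache.New(config.GetConfigOrDie(), cache.Options{})
	if err != nil {
		panic(err)
	}

	ctx := ctrl.SetupSignalHandler()
	go func() {
		// Start blocks until ctx is cancelled; informers are created lazily.
		_ = c.Start(ctx)
	}()

	// Asking for an informer registers and starts it dynamically.
	if _, err := c.GetInformer(ctx, &corev1.ConfigMap{}); err != nil {
		panic(err)
	}

	// Once no controller needs ConfigMaps anymore, drop the informer so its
	// watch and cached objects are released (method name assumed from this PR).
	if err := c.RemoveInformer(ctx, &corev1.ConfigMap{}); err != nil {
		panic(err)
	}
}
```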

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 22, 2023
@k8s-ci-robot (Contributor)

Welcome @maxsmythe!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @maxsmythe. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 22, 2023
@maxsmythe maxsmythe changed the title Proposal for dynamic informer cache 🌱 Proposal for dynamic informer cache Apr 22, 2023
@ritazh (Member) commented May 3, 2023

@vincepri can you please help take a look at this?

@braghettos

We also have a use case where a feature like this one could be really helpful: we generate CRDs at runtime, and we would like to watch all the CRs based on those CRDs.

@vincepri (Member) commented May 4, 2023

In the queue, should be done reviewing by EOW tomorrow

@vincepri (Member) commented May 4, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 4, 2023
@vincepri (Member) commented May 4, 2023

@maxsmythe We should probably add some tests before merging as well

Review threads (outdated, resolved): pkg/cache/informer_cache.go, pkg/cache/internal/informers.go (2 threads)
@maxsmythe (Contributor, Author)

Thanks for the review!

Regarding tests, I was holding off until I got signal that there was interest in the idea. It sounds like that hurdle is cleared?

@maxsmythe (Contributor, Author)

@vincepri Any feedback on the correct thing to do vis-a-vis licensing, since this code is copied from the Gatekeeper project (which is Apache licensed and CNCF-owned)?

Not sure if this requires a NOTICE file, or a citation in the comment or similar.

@vincepri (Member) commented May 8, 2023

Where is the code copied from? In general, I have no idea how that works, but I'd expect code for this specific feature to be net-new in controller runtime, especially with the changes proposed above.

@maxsmythe (Contributor, Author)

Code currently lives here:

https://github.com/open-policy-agent/gatekeeper/tree/release-3.12/third_party/sigs.k8s.io/controller-runtime/pkg/dynamiccache

That directory is essentially a fork of controller runtime, where the code changes look pretty much like what I've done in this PR.

@braghettos

@maxsmythe @vincepri any updates on the proposed PR? This looks so useful for many use cases!

@maxsmythe maxsmythe force-pushed the dynamic-informer-cache branch 3 times, most recently from d9ecde9 to 55353dc on May 13, 2023 01:15
@maxsmythe (Contributor, Author)

@braghettos Sorry, I got distracted. Just pushed a first pass responding to feedback now.

@vincepri I took a stab at addressing your comments. LMK what you think. Also let me know what, if any, additional tests you'd like to see.

Tests pass locally; not sure why there is a failure currently.

@maxsmythe (Contributor, Author)

/retest

@maxsmythe (Contributor, Author)

Unit tests passed on retest. Still haven't been able to get them to fail locally. Rare flake?

@braghettos

Any updates on this PR?

@stevekuznetsov (Contributor)

Kubernetes upstream added support for dynamic informer lifecycle handling for the shared informer factory in 1.26. Any chance we could lean on that?
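
For reference, a rough sketch of the client-go (Kubernetes 1.26+) primitives that comment refers to: SharedInformerFactory.Shutdown and per-handler removal via the registration handle returned by AddEventHandler. Whether these can replace a cache-level removal in controller-runtime is the open question; the kubeconfig wiring below is just for illustration.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	cmInformer := factory.Core().V1().ConfigMaps().Informer()

	// Since client-go v0.26, AddEventHandler returns a registration handle...
	reg, err := cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			fmt.Println("configmap added:", obj.(*corev1.ConfigMap).Name)
		},
	})
	if err != nil {
		panic(err)
	}

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// ...which lets a caller detach just this handler later on.
	if err := cmInformer.RemoveEventHandler(reg); err != nil {
		panic(err)
	}

	// Shutdown stops every informer the factory started and waits for their
	// goroutines to exit. Note it stops the whole factory, not a single GVK.
	close(stop)
	factory.Shutdown()
}
```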

@maxsmythe (Contributor, Author)

No worries!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: maxsmythe, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2023
@k8s-ci-robot k8s-ci-robot merged commit cc2a25b into kubernetes-sigs:main Oct 23, 2023
8 of 9 checks passed
@vincepri (Member)

@maxsmythe Chatting with @alvaroaleman, it'd be good going forward to have a small design doc that talks about how we can get to the larger goal of dynamic watchers.

@alvaroaleman (Member)

@maxsmythe given that we do not have anything to dynamically manage watches on controllers, would you be OK if we changed this to StopInformer but did not remove it from the map? It seems that would still solve your problem without resulting in confusing behavior if the informer is later re-created (existing event sources would not get events).

@maxsmythe (Contributor, Author)

Thanks for merging!!

@vincepri @alvaroaleman Definitely interested in making this more user-friendly. Lemme know how I can help.

@alvaroaleman The one concern I would have is memory leaks. The main reason to remove the informer is to also remove the cache data, cleaning stale state.

Is there some way to remove the cache but keep a record of interested sources (this could still be a memory leak, but a smaller one)? Is there a way to have two methods (stop informer and remove informer)? Maybe figuring out the edge cases is part of the larger "dynamic watcher" story, and this is more of a stopgap on the way there?

@alvaroaleman (Member)

@maxsmythe memory leak because there is still data in the informer (the state of the world before it was stopped) or because of the informer itself?

@alvaroaleman (Member)

Definitely interested in making this more user-friendly. Lemme know how I can help.

It's less about this particular change and more about having an overall "dynamic management of event sources" story, which would include things like adding/removing them on a controller post-start. Right now it seems people are submitting code for their particular itch, but I'd really like someone to step up and describe the overall user stories and public interface changes in a design doc. That wouldn't mean the same person has to implement any of it, but it would ensure that what we merge is aligned with the overall north star.

@maxsmythe (Contributor, Author)

Mostly concerned about the state of the world, as that's the larger dataset (there can be a LOT of very large config maps in a cluster, for instance).

Having a persistent record of an informer could break down if there is a lot of creation/destruction of CRDs with novel kinds. This seems a harder edge case to get into, but IMO always worth architecting defensively, or at least giving users the ability to mitigate the issue if they're subject to it (though it may require extra effort on their part).

WRT "dynamic management of event sources" happy to take a stab at that. AFAIK the first attempt is #2159, which appears to be focused on starting/stopping controllers ad-hoc. Is there any other state I should know about? I can sketch some rough thoughts on a Google doc to start with, unless you have a more preferred way of handling these things?

@alvaroaleman (Member)

Mostly concerned about the state of the world, as that's the larger dataset (there can be a LOT of very large config maps in a cluster, for instance).

Yeah, maybe there is a way to clean up the store in the informer?

AFAIK the first attempt is #2159, which appears to be focused on starting/stopping controllers ad-hoc. Is there any other state I should know about? I can sketch some rough thoughts on a Google doc to start with, unless you have a more preferred way of handling these things?

Only the linked issues, to my knowledge. And starting to sketch something would be really helpful :) The PRs make it too easy to miss the forest for the trees because these changes are big.

@maxsmythe (Contributor, Author)

Still working on the stories. @alvaroaleman Is there a timeline yet for when a release may be cut that has this?

negz added a commit to negz/crossplane that referenced this pull request May 7, 2024
Crossplane uses a controller engine to dynamically start claim and XR
controllers when a new XRD is installed.

Before this commit, each controller gets at least one cache. This is
because when I built this functionality, you couldn't stop a single
informer within a cache (a cache is basically a map of informers by
GVK).

When realtime composition is enabled, there are even more caches. One
per composed resource GVK. A GVK routed cache routes cache lookups to
these various delegate caches.

Meanwhile, controller-runtime recently made it possible to stop an
informer within a cache. It's also been possible to remove an event
handler from an informer for some time (since Kubernetes 1.26).

kubernetes-sigs/controller-runtime#2285
kubernetes-sigs/controller-runtime#2046

This commit uses a single client, backed by a single cache, across all
dynamic controllers (specifically the definition, offered, claim, and
XR controllers).

Compared to the current implementation, this commit:

* Ensures all dynamic controllers use clients backed by the same cache
  used to power watches (i.e. trigger reconciles).
* Takes fewer global locks when realtime compositions are enabled.
  Locking is now mostly at the controller scope.
* Works with the breaking changes to source.Source introduced in
  controller-runtime v0.18. :)

Notably when realtime compositions are enabled, XR controllers will get
XRs and composed resources from cache. Before this commit, their client
wasn't backed by a cache. They'd get resources directly from the API
server. Similarly, the claim controller will read claims from cache.

Finally, I think this makes the realtime composition code a little easier
to follow by consolidating it into the ControllerEngine, but that's
pretty subjective.

Signed-off-by: Nic Cope <nicc@rk0n.org>
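
As a rough illustration of the pattern described in that commit message (several dynamically started controllers sharing one cache, with per-GVK watches torn down once no controller needs them), here is a hedged reference-counting sketch. The engine type and its methods are hypothetical, not Crossplane's actual code, and it assumes a cache-level RemoveInformer method like the one added by this PR:

```go
package engine

import (
	"context"
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Engine tracks how many running controllers still need each watched type.
// The GVK key is passed in by the caller to keep the sketch short.
type Engine struct {
	mu       sync.Mutex
	cache    cache.Cache
	refCount map[string]int
}

func New(c cache.Cache) *Engine {
	return &Engine{cache: c, refCount: map[string]int{}}
}

// Watch registers interest in obj's type; the underlying informer is shared
// by every controller that asks for it.
func (e *Engine) Watch(ctx context.Context, obj client.Object, gvk string) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if _, err := e.cache.GetInformer(ctx, obj); err != nil {
		return err
	}
	e.refCount[gvk]++
	return nil
}

// Unwatch drops interest; once the count reaches zero the informer (and its
// store) is removed from the shared cache (assumed RemoveInformer method).
func (e *Engine) Unwatch(ctx context.Context, obj client.Object, gvk string) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.refCount[gvk] > 1 {
		e.refCount[gvk]--
		return nil
	}
	delete(e.refCount, gvk)
	return e.cache.RemoveInformer(ctx, obj)
}
```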