Dynamic informers do not stop when custom resource definition is removed #79610
/cc
@liggitt I would like to work on it! This is my first issue in the api-machinery SIG.
ref #77480 |
@yliaog Could you please give me some clues, as well as your thoughts on this issue?
PR #77480 still has more work to do.
I benefited a lot from @yliaog's and @yue9944882's comments. After going through the code, I'd like to share my findings, and I look forward to some advice.

**Root Cause**

Take the quota controller as an example; the relevant code is kubernetes/pkg/controller/resourcequota/resource_quota_monitor.go, lines 174 to 179 at 99ff994.

After a CRD is deleted, the controllers neither remove their event handlers from the informer nor try to release the informer, even though they no longer need it. They only clean up their own state (kubernetes/pkg/controller/resourcequota/resource_quota_monitor.go, lines 237 to 241 at 99ff994).

So the informer keeps running and fires the error log constantly.

**Possible Solutions**

This is a little complicated, as there is currently no interface to remove an event handler from an informer, and no interface to release an informer from its factory either.

1. Informers should provide an interface for removing event handlers. Clients should clear their event handlers when they no longer need them; otherwise the handlers leak.
2. The informer factory should clear informers that have no event handlers left. After removing an event handler from a specific informer, the factory needs to check how many handlers remain in that informer; if none are left, the informer should be stopped, otherwise it leaks and keeps running.
3. The informer factory should be able to stop specific informers. Currently, all informers in a factory share the same stop channel, so the factory cannot stop a single informer separately. Perhaps we could create one stop channel per informer when the factory starts and save those channels (see the sketch below).
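A minimal sketch of proposal 3, assuming a factory that keeps its own map of informers keyed by GroupVersionResource; the `stoppableFactory` type and its methods are illustrative names, not an existing client-go API:

```go
package informerstop

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/tools/cache"
)

// stoppableFactory is a hypothetical factory that gives each informer its own
// stop channel, so a single informer can be stopped after its CRD is deleted.
type stoppableFactory struct {
	mu        sync.Mutex
	informers map[schema.GroupVersionResource]cache.SharedIndexInformer
	stopChs   map[schema.GroupVersionResource]chan struct{}
}

// Start runs each informer on its own stop channel instead of one shared
// channel for the whole factory.
func (f *stoppableFactory) Start() {
	f.mu.Lock()
	defer f.mu.Unlock()
	for gvr, informer := range f.informers {
		if _, running := f.stopChs[gvr]; running {
			continue
		}
		stopCh := make(chan struct{})
		f.stopChs[gvr] = stopCh
		go informer.Run(stopCh)
	}
}

// StopInformer stops and forgets a single informer, e.g. when the factory
// notices its last event handler has been removed.
func (f *stoppableFactory) StopInformer(gvr schema.GroupVersionResource) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if stopCh, ok := f.stopChs[gvr]; ok {
		close(stopCh)
		delete(f.stopChs, gvr)
		delete(f.informers, gvr)
	}
}
```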
Kindly pinging @sttts @jpbetz @deads2k for comments. Thanks in advance.
API clients interact with informers in multiple ways:
I'm not sure we'll be able to detect all the clients of everything downstream of the informer using the current API. cc @deads2k, as the originator of the shared informer approach, for thoughts.
@liggitt It's not obvious to me what behavior is desired. Consider a case like:
Wouldn't we want that choice to be left external to the shared code? In the case you're probably seeing (quota and GC?), you just want an "unget my informer" that can shut down a particular reflector if nobody wants it anymore? It seems like that decision is going to be driven by factors outside of what the lister/informer/reflector/listwatcher can see.
I was thinking about adding two funcs to the SharedInformer interface (a hypothetical shape is sketched below):
Given the above, sharedInformerFactory would be able to sideline informers whose count of event handlers reaches zero.
This way, we don't need to introduce a stop channel for each informer.
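Something like the following, where both method names and signatures are purely illustrative, sketching the idea above rather than the actual client-go interface:

```go
package informerstop

import "k8s.io/client-go/tools/cache"

// countedSharedInformer sketches two hypothetical additions to the
// SharedInformer of that era: one to detach a handler, and one to let the
// factory see when an informer has no handlers left.
type countedSharedInformer interface {
	// RemoveEventHandler detaches a previously registered handler.
	RemoveEventHandler(handler cache.ResourceEventHandler) error

	// EventHandlerCount reports how many handlers remain attached; the
	// factory can sideline the informer once this reaches zero.
	EventHandlerCount() int
}
```

For context, later client-go releases did eventually add a RemoveEventHandler method to SharedInformer, keyed by a registration handle returned from AddEventHandler, which is close in spirit to this sketch.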
@liggitt |
I'm seeing this too.
We had to do this for all kube-controller-managers.
I can confirm this too; the kube-controller-manager seems to log this every second:
Is there any fix that can be applied? I'd also like to know whether this is a serious problem or not, please.
Thanks a lot for this workaround. Any news on fixing the root cause?
@liggitt @kevindelgado |
I ran into this recently and was thinking about ways to fix it. I agree with @deads2k that "unwatching" (or whether to unwatch) needs to be an explicit decision by the user of the client. Given that, I see two potential approaches:
I haven't contributed much to our client code - is this the sort of decision that needs a KEP? |
@tallclair this is sounding a bit like Gatekeeper's watch manager:
Happy to share experiences with it if desired. Biggest challenge would be around reconciling events for a watch that has already been torn down and avoiding race conditions if two reconcilers are acting on the same resource but only one uses a registrar.
👀 |
Any update on this topic? I'm facing the same issue on ~1.24.
Is there a temporary workaround to start the kubelet?
Same issue for me. |
I faced the same issue on a lab v1.28.0 cluster provisioned by kind.
Any update on this issue? We really need this too. |
Same issue using the Kyverno reporter.
For those using the controller-runtime libraries, it's now possible to manually remove an informer via RemoveInformer() (or it will be once the change rolls out): kubernetes-sigs/controller-runtime#2285. Unfortunately, this means you'll need to handle the record-keeping yourself as to when informers should be cleaned up (at least until a more elegant UX is defined).
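A hedged usage sketch; it assumes a controller-runtime release where the cache exposes RemoveInformer (per the PR above), an already-constructed Manager, and that the caller has decided the informer is no longer needed:

```go
package informerstop

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
)

// dropInformer stops the cached informer for one kind, e.g. right after its
// CRD has been deleted from the cluster.
func dropInformer(ctx context.Context, mgr ctrl.Manager, gvk schema.GroupVersionKind) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	// The caller is responsible for the record-keeping: nothing here checks
	// whether other reconcilers still rely on this informer.
	return mgr.GetCache().RemoveInformer(ctx, u)
}
```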
@maxsmythe The topic discussed in this issue is the garbage collector and quota controllers in kube-controller-manager, I think.
Not sure what the current state of client-go is. It looks like controller-runtime uses the dynamic client under the hood, but I don't think it uses SharedInformer. I definitely don't think this fixes the issue in core, but I wanted to leave a breadcrumb for any extension authors who hit this issue.
What happened:
Once started, dynamic informers for custom resources are not stopped.
What you expected to happen:
After the dynamic informers resynced, they would stop informers belonging to no-longer-existing resources.
How to reproduce it (as minimally and precisely as possible):
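In kube-controller-manager, the reproduction is to create a CRD, wait for the garbage collector and quota controllers to sync it, then delete the CRD and watch the logs. The same behavior can be seen directly with a dynamic informer, as in this sketch (the kubeconfig location, group `example.com`, and resource `widgets` are all placeholder assumptions):

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Placeholder GVR for a custom resource that exists when the program starts.
	gvr := schema.GroupVersionResource{Group: "example.com", Version: "v1", Resource: "widgets"}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
	factory.ForResource(gvr) // register a dynamic informer for the CR
	factory.Start(make(chan struct{}))

	// Now run `kubectl delete crd widgets.example.com` in another terminal.
	// The informer is never stopped, so its reflector keeps relisting the
	// vanished resource and logging errors.
	select {}
}
```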
After the switch by the garbage collector and quota controllers to use the generic metadata client, the controller-manager logs around this are even harder to understand:
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- OS (e.g. `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):

/sig api-machinery
/priority important-soon
/cc @sttts @jpbetz @deads2k