
Dynamic informers do not stop when custom resource definition is removed #79610

Open
liggitt opened this issue Jul 1, 2019 · 38 comments · May be fixed by kevindelgado/kubernetes#1
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@liggitt
Member

liggitt commented Jul 1, 2019

What happened:

Once started, dynamic informers for custom resources are not stopped.

What you expected to happen:

After the dynamic informers resynced, they would stop informers belonging to no-longer-existing resources.

How to reproduce it (as minimally and precisely as possible):

  1. Start a cluster with garbage collection and quota enabled with high enough verbosity to see API requests logged
  2. Create a custom resource definition
  3. Observe list/watch requests from the dynamic informers:
I0701 14:31:54.604559   77309 wrap.go:47] GET /apis/mygroup.example.com/v1/myresources?limit=500&resourceVersion=0: (382.601µs) 200 [hyperkube/v1.16.0 (darwin/amd64) kubernetes/40038d5/dynamic-informers 127.0.0.1:62616]
  4. Delete the custom resource definition
  5. Observe continuous list requests from the dynamic informers receiving a 404:
I0701 14:33:05.857207   77309 wrap.go:47] GET /apis/mygroup.example.com/v1/myresources?limit=500&resourceVersion=0: (346.987µs) 404 [hyperkube/v1.16.0 (darwin/amd64) kubernetes/40038d5/dynamic-informers 127.0.0.1:62616]
I0701 14:33:06.861218   77309 wrap.go:47] GET /apis/mygroup.example.com/v1/myresources?limit=500&resourceVersion=0: (349.903µs) 404 [hyperkube/v1.16.0 (darwin/amd64) kubernetes/40038d5/dynamic-informers 127.0.0.1:62616]
...

After the switch by the garbage collector and quota controllers to the generic metadata client, the controller-manager logs around this are even harder to understand:

2019-11-16T01:18:20.15247609Z stderr F E1116 01:18:20.152307       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.450468558Z stderr F E1116 01:18:20.450095       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.632847838Z stderr F E1116 01:18:20.625706       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.693940889Z stderr F E1116 01:18:20.693774       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.740153328Z stderr F E1116 01:18:20.739963       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.740400594Z stderr F E1116 01:18:20.740276       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.820833937Z stderr F E1116 01:18:20.820695       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.848726246Z stderr F E1116 01:18:20.848605       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.856013314Z stderr F E1116 01:18:20.855816       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.950015373Z stderr F E1116 01:18:20.949837       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.998885723Z stderr F E1116 01:18:20.998732       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:21.001894579Z stderr F E1116 01:18:21.001726       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
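
For illustration only, a minimal client-go sketch of the dynamic informer pattern in play (not the controller-manager's actual wiring): the shared factory starts an informer per resource, but exposes only a single stop channel for all of them, so the reflector for a deleted CRD keeps relisting and logging errors.

package main

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
	gvr := schema.GroupVersionResource{Group: "mygroup.example.com", Version: "v1", Resource: "myresources"}

	// Requesting an informer for the custom resource registers it with the factory.
	_ = factory.ForResource(gvr).Informer()

	stopCh := make(chan struct{})
	factory.Start(stopCh)

	// The only stop signal is stopCh, which stops every informer in the factory.
	// If the CRD backing gvr is deleted, its reflector keeps relisting (and getting
	// 404s) until the whole factory is shut down.
	select {}
}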

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/sig api-machinery
/priority important-soon
/cc @sttts @jpbetz @deads2k

@liggitt liggitt added the kind/bug Categorizes issue or PR as related to a bug. label Jul 1, 2019
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jul 1, 2019
@yliaog
Contributor

yliaog commented Jul 1, 2019

/cc

@RainbowMango
Member

@liggitt I would like to work on it! This is my first issue in SIG API Machinery.
/assign

@yue9944882
Member

ref #77480

@liggitt liggitt added this to Required for GA, not started in Custom Resource Definitions Jul 2, 2019
@RainbowMango
Member

@yliaog
Your PR #77480 seems to try to solve the same problem as this issue. I'm confused as to why you closed it.
#77480 (comment)

Could you please give me some pointers, as well as your thoughts on this issue?

@liggitt liggitt moved this from Required for GA, not started to Not required for GA in Custom Resource Definitions Jul 3, 2019
@yliaog
Contributor

yliaog commented Jul 3, 2019

PR #77480 has more work to do:

  1. The interface is not structured the way we'd like it to be; @caesarxuchao raised this in the comments. Since the dynamic informer can be used outside the resource quota or garbage collector controllers, it should have a generic interface.

  2. Since the dynamic informer is shared, it can only be stopped when it is 'safe' to do so, i.e. all of its users are 'aware' that it is stopped and no longer use it afterwards.

@RainbowMango
Member

I benefited a lot from @yliaog's and @yue9944882's comments.

After going through the code, I'd like to share my findings; I'm looking forward to some advice.

Root Cause

The garbagecollector and resourcequota controllers each request a dynamic informer when a CRD is created, and each adds an event handler to the informer.

Taking quota as an example, the code is as follows:

shared, err := qm.informerFactory.ForResource(resource)
if err == nil {
	klog.V(4).Infof("QuotaMonitor using a shared informer for resource %q", resource.String())
	shared.Informer().AddEventHandlerWithResyncPeriod(handlers, qm.resyncPeriod())
	return shared.Informer().GetController(), nil
}

After the CRD is deleted, the controllers neither remove their event handlers from the informer nor try to release the informer, even though they no longer need it. They only clean up their own state:

for _, monitor := range toRemove {
	if monitor.stopCh != nil {
		close(monitor.stopCh)
	}
}

So the informer keeps running and fires the error log constantly.

Possible Solutions

The solution is a little more complicated because there is currently no interface to remove an event handler from an informer, and no interface to release an informer from the factory either.

The informer should provide an interface for removing an event handler.

Clients should remove their event handlers when they no longer need them; otherwise the handlers will leak.

The informer factory should clear informers that have no event handlers left.

After removing an event handler from a specific informer, the factory needs to check how many event handlers are left on that informer. If none remain, the informer should be stopped; otherwise it will leak and keep running.

The informer factory should be able to stop specific informers.

For now, all informers in a factory share the same stop channel, so the factory cannot stop a single informer separately.

Maybe we can create one stop channel for each informer when the factory starts, and save those channels.
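
A rough, illustrative sketch of that idea (hypothetical types; not the real client-go factory, and not the debug changes mentioned below):

package informerutil

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/tools/cache"
)

// stoppableFactory is a hypothetical factory that keeps one stop channel per
// informer so that a single informer can be shut down when its CRD is deleted.
type stoppableFactory struct {
	lock      sync.Mutex
	informers map[schema.GroupVersionResource]cache.SharedIndexInformer
	stopChans map[schema.GroupVersionResource]chan struct{}
}

// Start runs every informer that has not been started yet, each with its own channel.
func (f *stoppableFactory) Start() {
	f.lock.Lock()
	defer f.lock.Unlock()
	for gvr, informer := range f.informers {
		if _, started := f.stopChans[gvr]; started {
			continue
		}
		stopCh := make(chan struct{})
		f.stopChans[gvr] = stopCh
		go informer.Run(stopCh)
	}
}

// Stop shuts down only the informer for the given resource and forgets it.
func (f *stoppableFactory) Stop(gvr schema.GroupVersionResource) {
	f.lock.Lock()
	defer f.lock.Unlock()
	if stopCh, ok := f.stopChans[gvr]; ok {
		close(stopCh)
		delete(f.stopChans, gvr)
		delete(f.informers, gvr)
	}
}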

I have made some simple changes for debugging, and it works. I'd like more comments before I open a pull request.

Kindly pinging @sttts @jpbetz @deads2k for comments. Thanks in advance.

@RainbowMango
Member

I reproduced this issue after #78742 was merged.
The problem still exists after the GC and quota controllers switched to the metadata client.

@liggitt Can you share your comments?

@liggitt
Member Author

liggitt commented Jul 16, 2019

API clients interact with informers in multiple ways:

  • registering event handlers to them
  • taking handles to listers for stores populated by the informers

I'm not sure we'll be able to detect all the clients of everything downstream of the informer using the current API.

cc @deads2k as the originator of the shared informer approach for thoughts

@deads2k
Contributor

deads2k commented Jul 29, 2019

@liggitt It's not obvious to me what behavior is desired. Consider a case like:

  1. CR is already present and available.
  2. informer is started for lister and event handling
  3. CRD is deleted
  4. all CR instances are removed and events sent
  5. the backing reflector starts 404ing
  6. I think this is where the desired behavior may change. Some watchers may wish to disconnect their informer/lister entirely. Others may want to keep it active, waiting for values to come back.

Wouldn't we want that choice to be left external to the shared code? In the case you're probably seeing (quota and GC?), you just want an "unget my informer" call that can shut down a particular reflector if nobody wants it anymore. That decision seems like it will be driven by factors outside the lister/informer/reflector/listwatcher.

@liggitt liggitt moved this from Proposed features to Bugs/Tests/Cleanup in Custom Resource Definitions Aug 31, 2019
@liggitt liggitt added this to the next-candidate milestone Sep 3, 2019
@liggitt liggitt added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Sep 3, 2019
@liggitt liggitt removed this from the next-candidate milestone Oct 6, 2019
@tedyu
Contributor

tedyu commented Oct 29, 2019

I was thinking about adding two func's to SharedInformer interface:

diff --git a/staging/src/k8s.io/client-go/tools/cache/shared_informer.go b/staging/src/k8s.io/client-go/tools/cache/shared_informer.go
index f59a0852fe..8025408a40 100644
--- a/staging/src/k8s.io/client-go/tools/cache/shared_informer.go
+++ b/staging/src/k8s.io/client-go/tools/cache/shared_informer.go
@@ -148,6 +148,11 @@ type SharedInformer interface {
        // store. The value returned is not synchronized with access to the underlying store and is not
        // thread-safe.
        LastSyncResourceVersion() string
+       // RemoveEventHandler removes an event handler from the shared informer
+       // It returns true if removal is successful
+       RemoveEventHandler(handler ResourceEventHandler) bool
+       // EventHandlerCount returns the number of event handlers currently registered with the shared informer
+       EventHandlerCount() int
 }

 // SharedIndexInformer provides add and get Indexers ability based on SharedInformer.

Given the above, sharedInformerFactory would be able to sideline informers whose count of event handlers reaches zero.
In the following snippet, such sidelining is done in WaitForCacheSync (it can also be done in a new factory func):

diff --git a/staging/src/k8s.io/client-go/informers/factory.go b/staging/src/k8s.io/client-go/informers/factory.go
index c5230a491d..fcca7087d6 100644
--- a/staging/src/k8s.io/client-go/informers/factory.go
+++ b/staging/src/k8s.io/client-go/informers/factory.go
@@ -61,6 +61,8 @@ type sharedInformerFactory struct {
        customResync     map[reflect.Type]time.Duration

        informers map[reflect.Type]cache.SharedIndexInformer
+       // sidelined informers
+       sidelinedInformers  map[reflect.Type]cache.SharedIndexInformer
        // startedInformers is used for tracking which informers have been started.
        // This allows Start() to be called multiple times safely.
        startedInformers map[reflect.Type]bool
@@ -153,6 +155,16 @@ func (f *sharedInformerFactory) WaitForCacheSync(stopCh <-chan struct{}) map[ref
        }()

        res := map[reflect.Type]bool{}
+       toSideline := map[reflect.Type]cache.SharedIndexInformer{}
+       for informType, informer := range informers {
+               if informer.EventHandlerCount() == 0 {
+                       toSideline[informType] = informer
+               }
+       }
+       for informType, informer := range toSideline {
+               delete(f.informers, informType)
+               f.sidelinedInformers[informType] = informer
+       }
        for informType, informer := range informers {
                res[informType] = cache.WaitForCacheSync(stopCh, informer.HasSynced)
        }

This way, we don't need to introduce stop channel for each informer.
We would also need to add this method:

// removeListener returns true if the listener is removed. false otherwise
func (p *sharedProcessor) removeListener(listener *processorListener) bool {
	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()
	index := -1
	for i := range p.listeners {
		if p.listeners[i] == listener {
			index = i
			break
		}
	}
	if index >= 0 {
		p.listeners = append(p.listeners[:index], p.listeners[index+1:]...)
		p.syncingListeners = append(p.syncingListeners[:index], p.syncingListeners[index+1:]...)
		return true
	}
	return false
}

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 16, 2019
@liggitt liggitt added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 16, 2019
@Shahard2

Shahard2 commented Oct 4, 2020

Why?

@liggitt
Joining the question...
It's flooding the kube-controller-manager logs.

@jayunit100
Member

I'm seeing this too.

@estahn

estahn commented Oct 28, 2020

I think we might have found a workaround for this issue. We were seeing the same errors in our k8s clusters after we deleted some old CRDs to install new ones. It turns out that cycling the kube-controller-manager leader (docker kill on the master) fixed the issue for us.

We had to do this for all kube-controller-managers.

@sumitKash

I can confirm this too; the kube-controller-manager seems to log this every second:
E1110 16:03:22.545119 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
E1110 16:03:22.969943 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
E1110 16:03:23.960951 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource

@sumitKash

Is there any fix that could be applied? I'd also like to know whether this is a serious problem or not, please.

@HaveFun83

I think we might have found a workaround for this issue. We were seeing the same errors in our k8s clusters after we deleted some old CRDs to install new ones. It turns out that cycling the kube-controller-manager leader (docker kill on the master) fixed the issue for us.

Thanks a lot for this workaround

Any news on fixing the root cause?

@Pingan2017
Member

@liggitt @kevindelgado
Any news on a fix?
I left a comment on #97214 (comment); it will cause a new bug when merged.

@tallclair
Member

I ran into this recently and was thinking about ways to fix it. I agree with @deads2k that "unwatching" (or whether to unwatch) needs to be an explicit decision by the user of the client. Given that, I see 2 potential approaches:

  1. A per-informer Stop() function, with reference counting on the informer cache. Each call to get a new informer would need to return a wrapper around the underlying shared informer, with its own Stop() implementation. Each instantiation is reference counted, and only when all are stopped is the informer actually terminated (a rough sketch follows this list).
  2. Shared informer cache-level "stop watching when the resource goes away" setting. This is less powerful than (1) but would be sufficient to cover the kube-controller-manager usecase. This seems like it might be simpler to implement, but we would need to be careful to avoid race conditions around rapidly deleting & recreating CRDs.
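
A rough sketch of approach (1), reference counting with per-caller handles (hypothetical types; not an existing client-go API). The stopCh here is assumed to be the channel passed to the shared informer's Run.

package informerutil

import (
	"sync"

	"k8s.io/client-go/tools/cache"
)

// refCountedInformer wraps a shared informer; it is only torn down when the
// last handle calls Stop.
type refCountedInformer struct {
	lock     sync.Mutex
	refs     int
	stopCh   chan struct{}
	informer cache.SharedIndexInformer
}

// Handle is what a caller would receive instead of the raw shared informer.
type Handle struct {
	owner   *refCountedInformer
	stopped bool
}

// Acquire hands out a new handle and bumps the reference count.
func (r *refCountedInformer) Acquire() *Handle {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.refs++
	return &Handle{owner: r}
}

// Informer exposes the underlying shared informer to the handle's owner.
func (h *Handle) Informer() cache.SharedIndexInformer {
	return h.owner.informer
}

// Stop releases this handle; the shared informer stops only when no handles remain.
func (h *Handle) Stop() {
	h.owner.lock.Lock()
	defer h.owner.lock.Unlock()
	if h.stopped {
		return
	}
	h.stopped = true
	h.owner.refs--
	if h.owner.refs == 0 {
		close(h.owner.stopCh) // ends the shared informer's Run loop
	}
}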

I haven't contributed much to our client code - is this the sort of decision that needs a KEP?

@maxsmythe
Contributor

maxsmythe commented Apr 11, 2022

@tallclair this is sounding a bit like Gatekeeper's watch manager:

  • A central watch manager handles the set-up/tear down of dynamic watches (which is just an informer cache with the ability to remove GVKs)

  • Individual controllers can sign up for "registrars", which is a controller-specific object that allows them to update the set of GVKs they are watching via set operations (add/remove/replace one or many)

  • Watch events are distributed to controllers by the watch manager by posting to individual channels

  • The central watch manager handles updates to registrars, setting up and/or tearing down watches based on the composite results

Happy to share experiences with it if desired. Biggest challenge would be around reconciling events for a watch that has already been torn down and avoiding race conditions if two reconcilers are acting on the same resource but only one uses a registrar.

@cndoit18
Member

cndoit18 commented Jun 2, 2022

👀

@JJotah

JJotah commented Aug 29, 2022

Any update on this topic? I'm facing the same issue on ~1.24.

@cloud-bsf

Is there a temporary workaround, such as restarting the kubelet?

@Lxrdknows77

Same issue for me.

@atoato88
Contributor

atoato88 commented Sep 27, 2023

I faced the same issue on a lab v1.28.0 cluster provisioned by kind.
Any progress on this issue?

@muralov

muralov commented Sep 29, 2023

Any update on this issue? We really need this too.

@didlawowo

Same issue using the Kyverno reporter.

@maxsmythe
Contributor

For those using controller-runtime libraries, it's now possible to manually remove an informer via RemoveInformer() (or will be once the change rolls out):

kubernetes-sigs/controller-runtime#2285

Unfortunately, this means you'll need to handle the record keeping yourself as to when informers should be cleaned up (at least until a more elegant UX is defined).
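
A minimal usage sketch, assuming the RemoveInformer method from that PR is available on the cache interface in your controller-runtime version (check before relying on it):

package cleanup

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// removeStaleInformer drops the cached informer for a kind whose CRD has been
// deleted. Assumes cache.Cache exposes RemoveInformer as added in
// kubernetes-sigs/controller-runtime#2285.
func removeStaleInformer(ctx context.Context, c cache.Cache, gvk schema.GroupVersionKind) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	return c.RemoveInformer(ctx, u)
}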

@atoato88
Contributor

@maxsmythe
Thank you for sharing that work on controller-runtime.
It looks like great work to me.

The topic of this issue, though, is the garbage collector and quota controllers in kube-controller-manager, I think.
Those controllers don't use controller-runtime, but client-go.
So does client-go also need to be updated?
WDYT?

@maxsmythe
Contributor

Not sure what the current state of client-go is. It looks like controller-runtime uses the dynamic client under the hood, but I don't think it uses SharedInformer.

Definitely don't think this fixes the issue in core, but wanted to leave a breadcrumb for any extension authors who have this issue.
