
Dynamic informers do not stop when custom resource definition is removed #79610

Open
liggitt opened this issue Jul 1, 2019 · 38 comments · May be fixed by kevindelgado/kubernetes#1
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@liggitt
Member

liggitt commented Jul 1, 2019

What happened:

Once started, dynamic informers for custom resources are not stopped.

What you expected to happen:

After the dynamic informers resynced, they would stop informers belonging to no-longer-existing resources.

How to reproduce it (as minimally and precisely as possible):

  1. Start a cluster with garbage collection and quota enabled with high enough verbosity to see API requests logged
  2. Create a custom resource definition
  3. Observe list/watch requests from the dynamic informers:
I0701 14:31:54.604559   77309 wrap.go:47] GET /apis/mygroup.example.com/v1/myresources?limit=500&resourceVersion=0: (382.601µs) 200 [hyperkube/v1.16.0 (darwin/amd64) kubernetes/40038d5/dynamic-informers 127.0.0.1:62616]
  4. Delete the custom resource definition
  5. Observe continuous list requests from the dynamic informers receiving a 404:
I0701 14:33:05.857207   77309 wrap.go:47] GET /apis/mygroup.example.com/v1/myresources?limit=500&resourceVersion=0: (346.987µs) 404 [hyperkube/v1.16.0 (darwin/amd64) kubernetes/40038d5/dynamic-informers 127.0.0.1:62616]
I0701 14:33:06.861218   77309 wrap.go:47] GET /apis/mygroup.example.com/v1/myresources?limit=500&resourceVersion=0: (349.903µs) 404 [hyperkube/v1.16.0 (darwin/amd64) kubernetes/40038d5/dynamic-informers 127.0.0.1:62616]
...

After the switch by the garbage collector and quota controllers to the generic metadata client, the controller-manager logs around this are even harder to understand:

2019-11-16T01:18:20.15247609Z stderr F E1116 01:18:20.152307       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.450468558Z stderr F E1116 01:18:20.450095       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.632847838Z stderr F E1116 01:18:20.625706       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.693940889Z stderr F E1116 01:18:20.693774       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.740153328Z stderr F E1116 01:18:20.739963       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.740400594Z stderr F E1116 01:18:20.740276       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.820833937Z stderr F E1116 01:18:20.820695       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.848726246Z stderr F E1116 01:18:20.848605       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.856013314Z stderr F E1116 01:18:20.855816       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.950015373Z stderr F E1116 01:18:20.949837       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.998885723Z stderr F E1116 01:18:20.998732       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:21.001894579Z stderr F E1116 01:18:21.001726       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
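
For illustration only, a minimal client-go sketch of the dynamic informer pattern in play (not the controller-manager's actual wiring): the shared factory starts an informer per resource, but exposes only a single stop channel for all of them, so the reflector for a deleted CRD keeps relisting and logging errors.

package main

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
	gvr := schema.GroupVersionResource{Group: "mygroup.example.com", Version: "v1", Resource: "myresources"}

	// Requesting an informer for the custom resource registers it with the factory.
	_ = factory.ForResource(gvr).Informer()

	stopCh := make(chan struct{})
	factory.Start(stopCh)

	// The only stop signal is stopCh, which stops every informer in the factory.
	// If the CRD backing gvr is deleted, its reflector keeps relisting (and getting
	// 404s) until the whole factory is shut down.
	select {}
}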

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/sig api-machinery
/priority important-soon
/cc @sttts @jpbetz @deads2k

@liggitt liggitt added the kind/bug Categorizes issue or PR as related to a bug. label Jul 1, 2019
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jul 1, 2019
@yliaog
Contributor

yliaog commented Jul 1, 2019

/cc

@RainbowMango
Member

@liggitt I would like to work on it! This is my first issue in SIG API Machinery.
/assign

@yue9944882
Member

ref #77480

@liggitt liggitt added this to Required for GA, not started in Custom Resource Definitions Jul 2, 2019
@RainbowMango
Member

@yliaog
Your PR #77480 seems to try to solve the same problem as this issue. I'm confused as to why you closed it.
#77480 (comment)

Could you please give me some pointers, as well as your thoughts on this issue?

@liggitt liggitt moved this from Required for GA, not started to Not required for GA in Custom Resource Definitions Jul 3, 2019
@yliaog
Contributor

yliaog commented Jul 3, 2019

PR #77480 has more work to do:

  1. The interface is not structured the way we'd like it to be; @caesarxuchao raised this in the comments. Since the dynamic informer can be used outside the resource quota or garbage collector controllers, it should have a generic interface.

  2. Since the dynamic informer is shared, it can only be stopped when it is 'safe' to do so, i.e. all of its users are 'aware' that it is stopped and no longer use it afterwards.

@RainbowMango
Member

I benefited a lot from @yliaog's and @yue9944882's comments.

After going through the code, I'd like to share my findings; I'm looking forward to some advice.

Root Cause

The garbagecollector and resourcequota controllers each request a dynamic informer when a CRD is created, and each adds an event handler to the informer.

Taking quota as an example, the code is as follows:

shared, err := qm.informerFactory.ForResource(resource)
if err == nil {
	klog.V(4).Infof("QuotaMonitor using a shared informer for resource %q", resource.String())
	shared.Informer().AddEventHandlerWithResyncPeriod(handlers, qm.resyncPeriod())
	return shared.Informer().GetController(), nil
}

After the CRD is deleted, the controllers neither remove their event handlers from the informer nor try to release the informer, even though they no longer need it. They only clean up their own state:

for _, monitor := range toRemove {
	if monitor.stopCh != nil {
		close(monitor.stopCh)
	}
}

So the informer keeps running and fires the error log constantly.

Possible Solutions

The solution is a little more complicated because there is currently no interface to remove an event handler from an informer, and no interface to release an informer from the factory either.

The informer should provide an interface for removing an event handler.

Clients should remove their event handlers when they no longer need them; otherwise the handlers will leak.

The informer factory should clear informers that have no event handlers left.

After removing an event handler from a specific informer, the factory needs to check how many event handlers are left on that informer. If none remain, the informer should be stopped; otherwise it will leak and keep running.

The informer factory should be able to stop specific informers.

For now, all informers in a factory share the same stop channel, so the factory cannot stop a single informer separately.

Maybe we can create one stop channel for each informer when the factory starts, and save those channels.
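
A rough, illustrative sketch of that idea (hypothetical types; not the real client-go factory, and not the debug changes mentioned below):

package informerutil

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/tools/cache"
)

// stoppableFactory is a hypothetical factory that keeps one stop channel per
// informer so that a single informer can be shut down when its CRD is deleted.
type stoppableFactory struct {
	lock      sync.Mutex
	informers map[schema.GroupVersionResource]cache.SharedIndexInformer
	stopChans map[schema.GroupVersionResource]chan struct{}
}

// Start runs every informer that has not been started yet, each with its own channel.
func (f *stoppableFactory) Start() {
	f.lock.Lock()
	defer f.lock.Unlock()
	for gvr, informer := range f.informers {
		if _, started := f.stopChans[gvr]; started {
			continue
		}
		stopCh := make(chan struct{})
		f.stopChans[gvr] = stopCh
		go informer.Run(stopCh)
	}
}

// Stop shuts down only the informer for the given resource and forgets it.
func (f *stoppableFactory) Stop(gvr schema.GroupVersionResource) {
	f.lock.Lock()
	defer f.lock.Unlock()
	if stopCh, ok := f.stopChans[gvr]; ok {
		close(stopCh)
		delete(f.stopChans, gvr)
		delete(f.informers, gvr)
	}
}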

I have made some simple changes for debugging, and it works. I'd like more comments before I open a pull request.

Kindly pinging @sttts @jpbetz @deads2k for comments. Thanks in advance.

@RainbowMango
Member

I reproduced this issue after #78742 was merged.
The problem still exists after the GC and quota controllers switched to the metadata client.

@liggitt Can you share your comments?

@liggitt
Member Author

liggitt commented Jul 16, 2019

API clients interact with informers in multiple ways:

  • registering event handlers to them
  • taking handles to listers for stores populated by the informers

I'm not sure we'll be able to detect all the clients of everything downstream of the informer using the current API.

cc @deads2k as the originator of the shared informer approach for thoughts

@deads2k
Contributor

deads2k commented Jul 29, 2019

@liggitt It's not obvious to me what behavior is desired. Consider a case like:

  1. CR is already present and available.
  2. informer is started for lister and event handling
  3. CRD is deleted
  4. all CR instances are removed and events sent
  5. the backing reflector starts 404ing
  6. I think this is where the desired behavior may change. Some watchers may wish to disconnect their informer/lister entirely. Others may want to keep it active, waiting for values to come back.

Wouldn't we want that choice to be left external to the shared code? In the case you're probably seeing (quota and GC?), you just want an "unget my informer" call that can shut down a particular reflector if nobody wants it anymore. That decision seems like it will be driven by factors outside the lister/informer/reflector/listwatcher.

@liggitt liggitt moved this from Proposed features to Bugs/Tests/Cleanup in Custom Resource Definitions Aug 31, 2019
@liggitt liggitt added this to the next-candidate milestone Sep 3, 2019
@liggitt liggitt added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Sep 3, 2019
@liggitt liggitt removed this from the next-candidate milestone Oct 6, 2019
@tedyu
Contributor

tedyu commented Oct 29, 2019

I was thinking about adding two func's to SharedInformer interface:

diff --git a/staging/src/k8s.io/client-go/tools/cache/shared_informer.go b/staging/src/k8s.io/client-go/tools/cache/shared_informer.go
index f59a0852fe..8025408a40 100644
--- a/staging/src/k8s.io/client-go/tools/cache/shared_informer.go
+++ b/staging/src/k8s.io/client-go/tools/cache/shared_informer.go
@@ -148,6 +148,11 @@ type SharedInformer interface {
        // store. The value returned is not synchronized with access to the underlying store and is not
        // thread-safe.
        LastSyncResourceVersion() string
+       // RemoveEventHandler removes an event handler from the shared informer
+       // It returns true if removal is successful
+       RemoveEventHandler(handler ResourceEventHandler) bool
+       // EventHandlerCount returns the number of event handlers currently registered with the shared informer
+       EventHandlerCount() int
 }

 // SharedIndexInformer provides add and get Indexers ability based on SharedInformer.

Given the above, sharedInformerFactory would be able to sideline informers whose count of event handlers reaches zero.
In the following snippet, such sidelining is done in WaitForCacheSync (it can also be done in a new factory func):

diff --git a/staging/src/k8s.io/client-go/informers/factory.go b/staging/src/k8s.io/client-go/informers/factory.go
index c5230a491d..fcca7087d6 100644
--- a/staging/src/k8s.io/client-go/informers/factory.go
+++ b/staging/src/k8s.io/client-go/informers/factory.go
@@ -61,6 +61,8 @@ type sharedInformerFactory struct {
        customResync     map[reflect.Type]time.Duration

        informers map[reflect.Type]cache.SharedIndexInformer
+       // sidelined informers
+       sidelinedInformers  map[reflect.Type]cache.SharedIndexInformer
        // startedInformers is used for tracking which informers have been started.
        // This allows Start() to be called multiple times safely.
        startedInformers map[reflect.Type]bool
@@ -153,6 +155,16 @@ func (f *sharedInformerFactory) WaitForCacheSync(stopCh <-chan struct{}) map[ref
        }()

        res := map[reflect.Type]bool{}
+       toSideline := map[reflect.Type]cache.SharedIndexInformer{}
+       for informType, informer := range informers {
+               if informer.EventHandlerCount() == 0 {
+                       toSideline[informType] = informer
+               }
+       }
+       for informType, informer := range toSideline {
+               delete(f.informers, informType)
+               f.sidelinedInformers[informType] = informer
+       }
        for informType, informer := range informers {
                res[informType] = cache.WaitForCacheSync(stopCh, informer.HasSynced)
        }

This way, we don't need to introduce stop channel for each informer.
We would also need to add this method:

// removeListener returns true if the listener is removed. false otherwise
func (p *sharedProcessor) removeListener(listener *processorListener) bool {
	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()
	index := -1
	for i := range p.listeners {
		if p.listeners[i] == listener {
			index = i
			break
		}
	}
	if index >= 0 {
		p.listeners = append(p.listeners[:index], p.listeners[index+1:]...)
		p.syncingListeners = append(p.syncingListeners[:index], p.syncingListeners[index+1:]...)
		return true
	}
	return false
}

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 16, 2019
@liggitt liggitt added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 16, 2019
@Shahard2

Shahard2 commented Oct 4, 2020

Why?

@liggitt
Joining the question...
It's flooding the kube-controller-manager logs.

@jayunit100
Member

I'm seeing this too.

@estahn

estahn commented Oct 28, 2020

I think we might have found a workaround for this issue. We were seeing the same errors in our k8s clusters after we deleted some old CRDs to install new ones. It turns out that cycling the kube-controller-manager leader (docker kill on the master) fixed the issue for us.

We had to do this for all kube-controller-managers.

@sumitKash

I can confirm this too; the kube-controller-manager seems to log this every second:
E1110 16:03:22.545119 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
E1110 16:03:22.969943 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
E1110 16:03:23.960951 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource

@sumitKash

Is there any fix that could be applied? I'd also like to know whether this is a serious problem or not, please.

@HaveFun83

I think we might have found a workaround for this issue. We were seeing the same errors in our k8s clusters after we deleted some old CRDs to install new ones. It turns out that cycling the kube-controller-manager leader (docker kill on the master) fixed the issue for us.

Thanks a lot for this workaround

Any news on fixing the root cause?

@Pingan2017
Member

@liggitt @kevindelgado
Any news on a fix?
I left a comment on #97214 (comment); it will cause a new bug when merged.

@tallclair
Member

I ran into this recently and was thinking about ways to fix it. I agree with @deads2k that "unwatching" (or whether to unwatch) needs to be an explicit decision by the user of the client. Given that, I see 2 potential approaches:

  1. A per-informer Stop() function, with reference counting on the informer cache. Each call to get a new informer would need to return a wrapper around the underlying shared informer, with its own Stop() implementation. Each instantiation is reference counted, and only when all are stopped is the informer actually terminated (a rough sketch follows this list).
  2. Shared informer cache-level "stop watching when the resource goes away" setting. This is less powerful than (1) but would be sufficient to cover the kube-controller-manager usecase. This seems like it might be simpler to implement, but we would need to be careful to avoid race conditions around rapidly deleting & recreating CRDs.
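
A rough sketch of approach (1), reference counting with per-caller handles (hypothetical types; not an existing client-go API). The stopCh here is assumed to be the channel passed to the shared informer's Run.

package informerutil

import (
	"sync"

	"k8s.io/client-go/tools/cache"
)

// refCountedInformer wraps a shared informer; it is only torn down when the
// last handle calls Stop.
type refCountedInformer struct {
	lock     sync.Mutex
	refs     int
	stopCh   chan struct{}
	informer cache.SharedIndexInformer
}

// Handle is what a caller would receive instead of the raw shared informer.
type Handle struct {
	owner   *refCountedInformer
	stopped bool
}

// Acquire hands out a new handle and bumps the reference count.
func (r *refCountedInformer) Acquire() *Handle {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.refs++
	return &Handle{owner: r}
}

// Informer exposes the underlying shared informer to the handle's owner.
func (h *Handle) Informer() cache.SharedIndexInformer {
	return h.owner.informer
}

// Stop releases this handle; the shared informer stops only when no handles remain.
func (h *Handle) Stop() {
	h.owner.lock.Lock()
	defer h.owner.lock.Unlock()
	if h.stopped {
		return
	}
	h.stopped = true
	h.owner.refs--
	if h.owner.refs == 0 {
		close(h.owner.stopCh) // ends the shared informer's Run loop
	}
}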

I haven't contributed much to our client code - is this the sort of decision that needs a KEP?

@maxsmythe
Contributor

maxsmythe commented Apr 11, 2022

@tallclair this is sounding a bit like Gatekeeper's watch manager:

  • A central watch manager handles the set-up/tear down of dynamic watches (which is just an informer cache with the ability to remove GVKs)

  • Individual controllers can sign up for "registrars", which is a controller-specific object that allows them to update the set of GVKs they are watching via set operations (add/remove/replace one or many)

  • Watch events are distributed to controllers by the watch manager by posting to individual channels

  • The central watch manager handles updates to registrars, setting up and/or tearing down watches based on the composite results

Happy to share experiences with it if desired. Biggest challenge would be around reconciling events for a watch that has already been torn down and avoiding race conditions if two reconcilers are acting on the same resource but only one uses a registrar.

@cndoit18
Member

cndoit18 commented Jun 2, 2022

👀

@JJotah

JJotah commented Aug 29, 2022

Any update on this topic? I'm facing the same issue on ~1.24.

@cloud-bsf

Is there a temporary workaround, such as restarting the kubelet?

@Lxrdknows77

Same issue for me.

@atoato88
Contributor

atoato88 commented Sep 27, 2023

I faced the same issue on a lab v1.28.0 cluster provisioned by kind.
Any progress on this issue?

@muralov

muralov commented Sep 29, 2023

Any update on this issue? We really need this too.

@didlawowo

Same issue using the Kyverno reporter.

@maxsmythe
Contributor

For those using controller-runtime libraries, it's now possible to manually remove an informer via RemoveInformer() (or will be once the change rolls out):

kubernetes-sigs/controller-runtime#2285

Unfortunately, this means you'll need to handle the record keeping yourself as to when informers should be cleaned up (at least until a more elegant UX is defined).
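
A minimal usage sketch, assuming the RemoveInformer method from that PR is available on the cache interface in your controller-runtime version (check before relying on it):

package cleanup

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// removeStaleInformer drops the cached informer for a kind whose CRD has been
// deleted. Assumes cache.Cache exposes RemoveInformer as added in
// kubernetes-sigs/controller-runtime#2285.
func removeStaleInformer(ctx context.Context, c cache.Cache, gvk schema.GroupVersionKind) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	return c.RemoveInformer(ctx, u)
}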

@atoato88
Contributor

@maxsmythe
Thank you for sharing that work on controller-runtime.
It looks like great work to me.

The topic of this issue, though, is the garbage collector and quota controllers in kube-controller-manager, I think.
Those controllers don't use controller-runtime, but client-go.
So does client-go also need to be updated?
WDYT?

@maxsmythe
Contributor

Not sure what the current state of client-go is. It looks like controller-runtime uses the dynamic client under the hood, but I don't think it uses SharedInformer.

Definitely don't think this fixes the issue in core, but wanted to leave a breadcrumb for any extension authors who have this issue.
