
WIP: Controller Lifecycle Management #1

Open · wants to merge 776 commits into base: master

Conversation

@kevindelgado kevindelgado commented Nov 2, 2020

What type of PR is this?

/kind feature

What this PR does / why we need it:
Implementation of the Informer Lifecycle Management design doc.
Discussed at the Nov 4 sig-api-machinery meeting (video).

It does a few things:

  1. Adds the ability to define stop conditions when running individual informers via the new RunWithStopOptions() method on the SharedInformer interface, and exposes running informers with stop options on the various informer factories (currently only global stop options that apply to all informers of a factory are supported).
  2. Modifies controller-manager to run the GC and RQ controllers with RunWithStopOptions (meaning it always stops an informer upon reflector ListAndWatch error).
  3. Modifies the garbage collector to force a resync of resources that get stopped between syncs regardless of whether they are re-installed prior to the following sync (in order to ensure the informer is appropriately restarted).
  4. Enables tracking the number of event handlers registered to a shared informer and enables the removal of an individual event handler.

This was tested manually by modifying controller-runtime to use the new interfaces, and by running the resource quota and GC controllers with the new interfaces to verify that the informer shuts down as expected.

Unit tests have been added to the relevant pieces of tools/cache and to metadatainformer/dynamicinformer. Integration tests were modified to use the new interface methods where appropriate to confirm no regressions.

Which issue(s) this PR fixes:

Fixes kubernetes#79610

@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 3 times, most recently from d6560f3 to eea92eb Compare November 2, 2020 22:55
@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 3 times, most recently from a2a97f5 to d0440b5 Compare November 4, 2020 17:46
@kevindelgado
Owner Author

RunWithStopOptions added to the metadata informer in order to test the resource quota controller with the new changes.

@@ -524,6 +656,21 @@ func (s *sharedIndexInformer) AddEventHandlerWithResyncPeriod(handler ResourceEv
}
}

func (s *sharedIndexInformer) RemoveEventHandler(handler ResourceEventHandler) bool {
for i, listener := range s.processor.listeners {
I'd consider moving the implementation into sharedProcessor. I think we're missing the listenersLock here.

Owner Author:
Makes sense, done

}

func (s *sharedIndexInformer) EventHandlerCount() int {
return len(s.processor.listeners)

Similar to above. Need a lock and maybe we shouldn't reach into the internals of the processor directly.

Owner Author:

Done
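The resolved suggestion — holding listenersLock inside the processor rather than having the informer read s.processor.listeners directly — can be sketched as follows. Names and shapes here are illustrative, not the final client-go code:

```go
package main

import (
	"fmt"
	"sync"
)

type listener struct{ id int }

// processor keeps its listeners slice behind listenersLock, so counting and
// removing handlers is race-free even while listeners are being added.
type processor struct {
	listenersLock sync.RWMutex
	listeners     []*listener
}

func (p *processor) add(l *listener) {
	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()
	p.listeners = append(p.listeners, l)
}

// remove matches by pointer identity under the write lock.
func (p *processor) remove(l *listener) bool {
	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()
	for i, existing := range p.listeners {
		if existing == l {
			p.listeners = append(p.listeners[:i], p.listeners[i+1:]...)
			return true
		}
	}
	return false
}

// count only needs the read lock.
func (p *processor) count() int {
	p.listenersLock.RLock()
	defer p.listenersLock.RUnlock()
	return len(p.listeners)
}

func main() {
	p := &processor{}
	a, b := &listener{1}, &listener{2}
	p.add(a)
	p.add(b)
	fmt.Println(p.count(), p.remove(a), p.count()) // 2 true 1
}
```

With this shape the informer's RemoveEventHandler and EventHandlerCount simply delegate to the processor instead of reaching into its internals.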

@@ -182,6 +186,7 @@ func NewNamedReflector(name string, lw ListerWatcher, expectedType interface{},
resyncPeriod: resyncPeriod,
clock: realClock,
watchErrorHandler: WatchErrorHandler(DefaultWatchErrorHandler),
stopHandle: stopHandle,

What happens if stopHandle is nil?

Owner Author:

This is a good point. Take a look at the most recent commit. The least intrusive solution I could think of was just to add a couple more constructors and create a new stopHandle if it's nil.

This way, the stopHandle should never be nil if the reflector is created via a constructor. A couple of concerns here are:

  1. It feels a little chaotic to have 4 different constructors for the reflector. Open to cleaner ways of doing this. We could do a builder pattern here or use something like an Options struct, but that seems like a big change and I'm not sure it's worth doing here.
  2. Similar to the comment in controller.go, if you're not using a constructor and just using the struct, there's still a chance you'll have a nil stopHandle. We could nil-check and default it in the RunWithStopOptions() function like I've commented, but it feels a little messy.
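The constructor-defaulting approach described above might look roughly like this. The types and constructor names are illustrative stand-ins, not the PR's actual reflector code:

```go
package main

import "fmt"

// stopHandle is a toy stand-in for the PR's stop handle: something whose
// Done() channel signals that the reflector should stop.
type stopHandle chan struct{}

func (s stopHandle) Done() <-chan struct{} { return s }

type reflector struct {
	name       string
	stopHandle stopHandle
}

// newReflector is the existing-style constructor: it delegates with a nil
// handle, which the fuller constructor then defaults.
func newReflector(name string) *reflector {
	return newReflectorWithStopHandle(name, nil)
}

// newReflectorWithStopHandle defaults a nil handle, so code downstream of any
// constructor never has to nil-check stopHandle.
func newReflectorWithStopHandle(name string, h stopHandle) *reflector {
	if h == nil {
		h = make(stopHandle)
	}
	return &reflector{name: name, stopHandle: h}
}

func main() {
	r := newReflector("demo")
	fmt.Println(r.stopHandle != nil) // true: constructors never leave it nil
}
```

This only protects callers that go through a constructor; building the struct literally can still leave the field nil, which is the second concern above.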

defer utilruntime.HandleCrash()
go func() {
<-stopCh
<-c.config.StopHandle.Done()

What happens if c.config or c.config.StopHandle are nil?

Owner Author:

As mentioned in reflector.go I added new reflector constructors to hopefully alleviate some of the issues.

While it doesn't solve the problem entirely, my understanding now is that at least we're not making it any worse (i.e. we would still run into the same panics regardless of this change if c.config is nil, and there's nothing inherently more concerning about c.config.StopHandle being nil than about c.config.Queue being nil).

Let me know what you think.

@kevindelgado
Owner Author

Thanks for the feedback @shomron! I've taken a pass at addressing it, as well as adding some tests that focus on the changes in tools/cache, in the latest commit.

I'm still working on testing with the factories and the couple builtin controllers we want to use this with (GC and resource quota) and will wait until I've made some progress there to make a PR in k/k. Feel free to take a look now and provide more feedback or wait for a ping once I've got something up in k/k.

@kevindelgado
Owner Author

kevindelgado commented Nov 17, 2020

Hey, @DirectXMan12, I’ve been chugging along here and wanted to get some feedback (it’s big, but I think you’ve seen all except the most recent 2 commits)

Wanted to get some clarity on a couple of things

  1. How should we enable using RunWithStopOptions on specific controllers (if we even want to)?
    Right now I’ve added a boolean flag to the generic controller-manager flags that will stop the informer in the case of a reflector ListAndWatch error. This applies to all controllers, though, and only gives you the option to return true or false from the OnListError function rather than defining the function yourself.

  2. Similarly, the garbage collector runs a controller independent of any informer or informer factory, so we need a different way to pass it the stop options. My current solution is to modify the GC constructor to accept a stopOnListError bool and then generate new stopOptions from this bool in the monitor’s Run function. I’ve left a comment on why this concerns me.

  3. With respect to testing:

  • The factories we modify (informerfactory/informer_factory.go and client-go/informers/factory.go) don’t have any testing associated with them. Are there integration tests somewhere that capture this that I am missing? Is it necessary for me to add testing to these for the changes?

  • Is the unit testing I have added in tools/cache sufficient? Is there anything else you think should be tested at this level? I didn’t add a test to controller_test.go because it doesn’t seem like that test actually exercises Controller#Run(); that’s mostly captured by the tests for the shared informer and reflector.

  • What’s the best way to mimic the manual testing I’m doing for the GC and RQ controllers in an integration or e2e test? (Where I run controller-manager with just these controllers, create a CRD, and visually inspect that the informer shuts down and the controller doesn’t continually 404 when the CRD is deleted.)

  • Are there any other higher-level (e2e or integration) tests you think we need in general?

@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 3 times, most recently from 9f2f852 to 537742d Compare November 20, 2020 17:26
@kevindelgado kevindelgado changed the base branch from master to connection-refused-backoff November 20, 2020 17:28
@kevindelgado kevindelgado changed the base branch from connection-refused-backoff to master November 20, 2020 17:28
@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 3 times, most recently from f55ac2e to 8b6bd3a Compare November 20, 2020 20:51
@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 2 times, most recently from b98ea64 to 72231b9 Compare December 1, 2020 23:50
@kevindelgado kevindelgado left a comment

This is now in a working and tested state. I've got some todos and comments inline, but I'm sure there will be a lot more to discuss. Was hoping to get one more pass from @DirectXMan12 before opening a PR in k/k.

// the old stop options (only stopping via closure of stopCh).
// It makes sure to remove an informer from the list of informers and started informers when the informer is stopped
// to prevent a race where InformerFor gives the user back a stopped informer that will never be started again.
func (f *sharedInformerFactory) StartWithStopOptions(stopCh <-chan struct{}) {
Owner Author:

There are a bunch of factories like this one in various repos in staging (kube-aggregator, sample-controller, sample-apiserver, code-generator). Is this code autogenerated somewhere or do I need to copy/paste in all the various */externalversions/factory.go and factory interfaces?
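For reference, the bookkeeping that the StartWithStopOptions comment describes — forgetting a stopped informer so InformerFor re-creates it rather than returning a dead one — can be sketched like this. The shapes are illustrative, not the generated factory code:

```go
package main

import (
	"fmt"
	"sync"
)

// factory is a toy version of sharedInformerFactory's bookkeeping: a map of
// created informers and a set of started ones, both guarded by one lock.
type factory struct {
	lock      sync.Mutex
	informers map[string]struct{} // keyed by type; real values elided here
	started   map[string]bool
}

func newFactory() *factory {
	return &factory{informers: map[string]struct{}{}, started: map[string]bool{}}
}

func (f *factory) start(key string) {
	f.lock.Lock()
	defer f.lock.Unlock()
	f.informers[key] = struct{}{}
	f.started[key] = true
}

// onStopped is what StartWithStopOptions would run when an informer's stop
// condition fires: forget the informer entirely, in both maps, so there is no
// window where a stopped-but-cached informer can be handed out.
func (f *factory) onStopped(key string) {
	f.lock.Lock()
	defer f.lock.Unlock()
	delete(f.informers, key)
	delete(f.started, key)
}

// informerFor reports whether a fresh informer must be created for key.
func (f *factory) informerFor(key string) (needsNew bool) {
	f.lock.Lock()
	defer f.lock.Unlock()
	_, ok := f.informers[key]
	return !ok
}

func main() {
	f := newFactory()
	f.start("pods")
	fmt.Println(f.informerFor("pods")) // false: cached and running
	f.onStopped("pods")
	fmt.Println(f.informerFor("pods")) // true: must be re-created
}
```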

@kevindelgado
Owner Author

Here you go @caesarxuchao this is the log of the failing integration test using the new informer shutdown mechanism
test-crd-deletion-failing.log
as well as the log of the passing test using the old method
test-crd-deletion-passing.log

@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 2 times, most recently from 0f9fa39 to f8b370f Compare December 9, 2020 20:58
k8s-ci-robot and others added 17 commits April 27, 2021 11:21
When the API server encounters an error during admission webhook
handling, lower-level errors are bubbled up without any additional
context added. This leads to fairly opaque and unintelligible errors. It
is not clear to users if the API server itself is having an error (for
instance, fetching the REST client) or if the request to the webhook
failed in some way.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
…nilcheck

cleanup: omit redundant nil check around loop in apiserver
Structured Logging migration: modify server_windows part logs of kube-proxy.
kubeadm:Use kubeadmapiv1.SchemeGroupVersion.String() instead of kubeadm.k8s.i…
Fix fluent-bit configuration for GCE Windows.
add --all-namespaces to kubectl annotate,label
…p-admission-error-reasons

apiserver: wrap errors in admission with context
split CRD schema test between migrated data and current
The method is used only for testing purposes. Given the Resource data type
exposes all its fields, any invoker of ResourceList still using the
method outside of kubernetes/kubernetes can either copy-paste the original
implementation or implement a custom method that converts resources into
the proper Quantity data type.

Given the hugepage resource is a scalar resource, it's sufficient for
the underlying code under fit_test.go to take into account any
extended resources. For predicate_test.go, the hugepage
resource does not play any role, as the General predicates test cases
do not set any scalar resource at all.

Additionally, by removing ResourceList method, pkg/scheduler/framework
can get rid of dependency on k8s.io/kubernetes/pkg/apis/core/v1/helper.
Added integration test for pod affinity namespace selector
…unt-path-removal

Deprecate removal of CSI nodepublish path by kubelet (kubernetes#101332)
…Resource-ResourceList-method

pkg/scheduler: drop Resource.ResourceList() method
k8s-ci-robot and others added 2 commits April 28, 2021 18:12
…eplace-IsScalerResourceName-with-nodeinfo-allocatable-scalar-resource-presence

noderesource: node info already knows which resources are scalar
To implement controllers that dynamically decide which resources to
watch, it must be possible to get rid of dedicated watches and event
handlers again. This requires the ability to remove event handlers
from SharedIndexInformers. Stopping an informer is not sufficient,
because there may be multiple controllers in a controller manager
that independently decide which resources to watch.

Unfortunately the ResourceEventHandler interface encourages the use of
value objects for handlers (like the ResourceEventHandlerFuncs
struct, which uses value receivers to implement the interface).
Go does not support comparison of function values, and therefore
such structs are not comparable either. Such handlers cannot be
matched again after they have been added, so it is only possible
to remove handlers that are comparable.
Fortunately struct-based handlers can also be passed by reference.
Users of the interface can therefore still use those handlers and
remove them again by switching to a pointer argument.

The remove method checks whether a handler can be compared and skips
uncomparable handlers in the removal process. Attempting to remove an
uncomparable handler results in an error return.

Remark: If, as the result of a handler removal, a complete informer
should be disabled, it is highly recommended to incorporate
pull request kubernetes#98653, which fixes a race condition when stopping
watches for an informer using the stop channel.
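The comparability issue this commit message describes can be demonstrated in isolation. The sketch below is not the client-go implementation; Handler and FuncsHandler merely mirror the shape of ResourceEventHandler and ResourceEventHandlerFuncs:

```go
package main

import (
	"fmt"
	"reflect"
)

// Handler loosely mirrors client-go's ResourceEventHandler.
type Handler interface{ OnAdd(obj interface{}) }

// FuncsHandler is a value-type handler with a func field, like
// ResourceEventHandlerFuncs: struct values containing funcs are not
// comparable in Go, so they can never be matched for removal.
type FuncsHandler struct{ AddFunc func(obj interface{}) }

func (f FuncsHandler) OnAdd(obj interface{}) {
	if f.AddFunc != nil {
		f.AddFunc(obj)
	}
}

// remove deletes h from handlers by interface comparison; it reports false
// when h's dynamic type is not comparable (the "error return" case).
func remove(handlers []Handler, h Handler) ([]Handler, bool) {
	if !reflect.TypeOf(h).Comparable() {
		return handlers, false // uncomparable: refuse, can never match
	}
	for i, existing := range handlers {
		// Safe: interface comparison only panics when both dynamic types
		// are equal and uncomparable, and h is known to be comparable.
		if existing == h {
			return append(handlers[:i], handlers[i+1:]...), true
		}
	}
	return handlers, false
}

func main() {
	byValue := FuncsHandler{AddFunc: func(interface{}) {}}
	byPointer := &FuncsHandler{AddFunc: func(interface{}) {}}
	handlers := []Handler{byValue, byPointer}

	_, ok := remove(handlers, byValue) // value handler: uncomparable struct
	fmt.Println("removed value handler:", ok) // false

	handlers, ok = remove(handlers, byPointer) // pointer: compared by address
	fmt.Println("removed pointer handler:", ok, "remaining:", len(handlers))
}
```

Passing `&FuncsHandler{...}` instead of a value makes the handler comparable by pointer identity, which is exactly the switch to a pointer argument the message recommends.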