
WIP: Controller Lifecycle Management #1

Open · wants to merge 776 commits into base: master

Conversation

@kevindelgado kevindelgado commented Nov 2, 2020

What type of PR is this?

/kind feature

What this PR does / why we need it:
Implementation of the Informer Lifecycle Management design doc.
Discussed at the Nov 4 sig-api-machinery meeting (video).

It does a few things:

  1. Adds the ability to define stop conditions when running individual informers via the new RunWithStopOptions() method on the SharedInformer interface, and exposes running informers with stop options on the various informer factories (currently only global stop options that apply to all informers of a factory are supported).
  2. Modifies controller-manager to run the GC and RQ controllers with RunWithStopOptions (meaning it always stops an informer upon reflector ListAndWatch error).
  3. Modifies the garbage collector to force a resync of resources that get stopped between syncs regardless of whether they are re-installed prior to the following sync (in order to ensure the informer is appropriately restarted).
  4. Enables tracking the number of event handlers registered to a shared informer and enables the removal of an individual event handler.

This was tested manually by modifying controller-runtime to use the new interfaces, and by running the resource quota and GC controllers with the new interfaces to verify that the informer shuts down as expected.

Unit tests have been added to the relevant pieces of tools/cache and to metadatainformer/dynamicinformer. Integration tests were modified to use the new interface methods where appropriate to confirm no regressions.

Which issue(s) this PR fixes:

Fixes kubernetes#79610

@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 3 times, most recently from d6560f3 to eea92eb Compare November 2, 2020 22:55
@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 3 times, most recently from a2a97f5 to d0440b5 Compare November 4, 2020 17:46
@kevindelgado
Owner Author

RunWithStopOptions added to the metadata informer in order to test the resource quota controller with the new changes.

@@ -524,6 +656,21 @@ func (s *sharedIndexInformer) AddEventHandlerWithResyncPeriod(handler ResourceEv
}
}

func (s *sharedIndexInformer) RemoveEventHandler(handler ResourceEventHandler) bool {
for i, listener := range s.processor.listeners {
I'd consider moving the implementation into sharedProcessor. I think we're missing the listenersLock here.

Owner Author:
Makes sense, done

}

func (s *sharedIndexInformer) EventHandlerCount() int {
return len(s.processor.listeners)

Similar to above. Need a lock and maybe we shouldn't reach into the internals of the processor directly.

Owner Author:

Done
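The resolved suggestion — holding listenersLock inside the processor rather than having the informer read s.processor.listeners directly — can be sketched as follows. Names and shapes here are illustrative, not the final client-go code:

```go
package main

import (
	"fmt"
	"sync"
)

type listener struct{ id int }

// processor keeps its listeners slice behind listenersLock, so counting and
// removing handlers is race-free even while listeners are being added.
type processor struct {
	listenersLock sync.RWMutex
	listeners     []*listener
}

func (p *processor) add(l *listener) {
	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()
	p.listeners = append(p.listeners, l)
}

// remove matches by pointer identity under the write lock.
func (p *processor) remove(l *listener) bool {
	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()
	for i, existing := range p.listeners {
		if existing == l {
			p.listeners = append(p.listeners[:i], p.listeners[i+1:]...)
			return true
		}
	}
	return false
}

// count only needs the read lock.
func (p *processor) count() int {
	p.listenersLock.RLock()
	defer p.listenersLock.RUnlock()
	return len(p.listeners)
}

func main() {
	p := &processor{}
	a, b := &listener{1}, &listener{2}
	p.add(a)
	p.add(b)
	fmt.Println(p.count(), p.remove(a), p.count()) // 2 true 1
}
```

With this shape the informer's RemoveEventHandler and EventHandlerCount simply delegate to the processor instead of reaching into its internals.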

@@ -182,6 +186,7 @@ func NewNamedReflector(name string, lw ListerWatcher, expectedType interface{},
resyncPeriod: resyncPeriod,
clock: realClock,
watchErrorHandler: WatchErrorHandler(DefaultWatchErrorHandler),
stopHandle: stopHandle,

What happens if stopHandle is nil?

Owner Author:

This is a good point. Take a look at the most recent commit. The least intrusive solution I could think of was just to add a couple more constructors and create a new stopHandle if it's nil.

This way, the stopHandle should never be nil if the reflector is created via a constructor. A couple of concerns here are:

  1. It feels a little chaotic to have 4 different constructors for the reflector. Open to cleaner ways of doing this. We could do a builder pattern here or use something like an Options struct, but that seems like a big change and I'm not sure it's worth doing here.
  2. Similar to the comment in controller.go, if you're not using a constructor and just using the struct, there's still a chance you'll have a nil stopHandle. We could nil-check and default it in the RunWithStopOptions() function like I've commented, but it feels a little messy.
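The constructor-defaulting approach described above might look roughly like this. The types and constructor names are illustrative stand-ins, not the PR's actual reflector code:

```go
package main

import "fmt"

// stopHandle is a toy stand-in for the PR's stop handle: something whose
// Done() channel signals that the reflector should stop.
type stopHandle chan struct{}

func (s stopHandle) Done() <-chan struct{} { return s }

type reflector struct {
	name       string
	stopHandle stopHandle
}

// newReflector is the existing-style constructor: it delegates with a nil
// handle, which the fuller constructor then defaults.
func newReflector(name string) *reflector {
	return newReflectorWithStopHandle(name, nil)
}

// newReflectorWithStopHandle defaults a nil handle, so code downstream of any
// constructor never has to nil-check stopHandle.
func newReflectorWithStopHandle(name string, h stopHandle) *reflector {
	if h == nil {
		h = make(stopHandle)
	}
	return &reflector{name: name, stopHandle: h}
}

func main() {
	r := newReflector("demo")
	fmt.Println(r.stopHandle != nil) // true: constructors never leave it nil
}
```

This only protects callers that go through a constructor; building the struct literally can still leave the field nil, which is the second concern above.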

defer utilruntime.HandleCrash()
go func() {
<-stopCh
<-c.config.StopHandle.Done()

What happens if c.config or c.config.StopHandle are nil?

Owner Author:

As mentioned in reflector.go I added new reflector constructors to hopefully alleviate some of the issues.

While it doesn't solve the problem entirely, my understanding now is that at least we're not making it any worse (i.e. we would still run into the same panics regardless of this change if c.config is nil, and there's nothing inherently more concerning about c.config.StopHandle being nil than about c.config.Queue being nil).

Let me know what you think.

@kevindelgado
Owner Author

Thanks for the feedback @shomron! I've taken a pass at addressing it, as well as adding some tests that focus on the changes in tools/cache, in the latest commit.

I'm still working on testing with the factories and the couple builtin controllers we want to use this with (GC and resource quota) and will wait until I've made some progress there to make a PR in k/k. Feel free to take a look now and provide more feedback or wait for a ping once I've got something up in k/k.

@kevindelgado
Owner Author

kevindelgado commented Nov 17, 2020

Hey, @DirectXMan12, I’ve been chugging along here and wanted to get some feedback (it’s big, but I think you’ve seen all except the most recent 2 commits)

Wanted to get some clarity on a couple of things

  1. How should we enable using RunWithStopOptions on specific controllers (if we even want to)?
    Right now I’ve added a boolean flag to the generic controller-manager flags that will stop the informer in the case of a reflector ListAndWatch error. This applies to all controllers, though, and only gives you the option to return true or false from the OnListError function rather than defining the function yourself.

  2. Similarly, the garbage collector runs a controller independent of any informer or informer factory, so we need a different way to pass it the stop options. My current solution is to modify the GC constructor to accept a stopOnListError bool and then generate new stopOptions from this bool in the monitor’s Run function. I’ve left a comment on why this concerns me.

  3. With respect to testing:

  • The factories we modify (informerfactory/informer_factory.go and client-go/informers/factory.go) don’t have any testing associated with them. Are there integration tests somewhere that capture this that I am missing? Is it necessary for me to add testing to these for the changes?

  • Is the unit testing I have added in tools/cache sufficient? Is there anything else you think should be tested at this level? I didn’t add a test to controller_test.go because it doesn’t seem like that test actually exercises Controller#Run(); that’s mostly captured by the tests for the shared informer and reflector.

  • What’s the best way to mimic the manual testing I’m doing for the GC and RQ controllers in an integration or e2e test? (Where I run controller-manager with just these controllers, create a CRD, and visually inspect that the informer shuts down and the controller doesn’t continually 404 when the CRD is deleted.)

  • Are there any other higher-level (e2e or integration) tests you think we need in general?

@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 3 times, most recently from 9f2f852 to 537742d Compare November 20, 2020 17:26
@kevindelgado kevindelgado changed the base branch from master to connection-refused-backoff November 20, 2020 17:28
@kevindelgado kevindelgado changed the base branch from connection-refused-backoff to master November 20, 2020 17:28
@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 3 times, most recently from f55ac2e to 8b6bd3a Compare November 20, 2020 20:51
@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 2 times, most recently from b98ea64 to 72231b9 Compare December 1, 2020 23:50
@kevindelgado kevindelgado left a comment

This is now in a working and tested state. I've got some todos and comments inline, but I'm sure there will be a lot more to discuss. Was hoping to get one more pass from @DirectXMan12 before opening a PR in k/k.

// the old stop options (only stopping via closure of stopCh).
// It makes sure to remove an informer from the list of informers and started informers when the informer is stopped
// to prevent a race where InformerFor gives the user back a stopped informer that will never be started again.
func (f *sharedInformerFactory) StartWithStopOptions(stopCh <-chan struct{}) {
Owner Author:

There are a bunch of factories like this one in various repos in staging (kube-aggregator, sample-controller, sample-apiserver, code-generator). Is this code autogenerated somewhere or do I need to copy/paste in all the various */externalversions/factory.go and factory interfaces?
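For reference, the bookkeeping that the StartWithStopOptions comment describes — forgetting a stopped informer so InformerFor re-creates it rather than returning a dead one — can be sketched like this. The shapes are illustrative, not the generated factory code:

```go
package main

import (
	"fmt"
	"sync"
)

// factory is a toy version of sharedInformerFactory's bookkeeping: a map of
// created informers and a set of started ones, both guarded by one lock.
type factory struct {
	lock      sync.Mutex
	informers map[string]struct{} // keyed by type; real values elided here
	started   map[string]bool
}

func newFactory() *factory {
	return &factory{informers: map[string]struct{}{}, started: map[string]bool{}}
}

func (f *factory) start(key string) {
	f.lock.Lock()
	defer f.lock.Unlock()
	f.informers[key] = struct{}{}
	f.started[key] = true
}

// onStopped is what StartWithStopOptions would run when an informer's stop
// condition fires: forget the informer entirely, in both maps, so there is no
// window where a stopped-but-cached informer can be handed out.
func (f *factory) onStopped(key string) {
	f.lock.Lock()
	defer f.lock.Unlock()
	delete(f.informers, key)
	delete(f.started, key)
}

// informerFor reports whether a fresh informer must be created for key.
func (f *factory) informerFor(key string) (needsNew bool) {
	f.lock.Lock()
	defer f.lock.Unlock()
	_, ok := f.informers[key]
	return !ok
}

func main() {
	f := newFactory()
	f.start("pods")
	fmt.Println(f.informerFor("pods")) // false: cached and running
	f.onStopped("pods")
	fmt.Println(f.informerFor("pods")) // true: must be re-created
}
```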

@kevindelgado
Owner Author

Here you go @caesarxuchao this is the log of the failing integration test using the new informer shutdown mechanism
test-crd-deletion-failing.log
as well as the log of the passing test using the old method
test-crd-deletion-passing.log

@kevindelgado kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch 2 times, most recently from 0f9fa39 to f8b370f Compare December 9, 2020 20:58
k8s-ci-robot and others added 17 commits April 27, 2021 11:21
When the API server encounters an error during admission webhook
handling, lower-level errors are bubbled up without any additional
context added. This leads to fairly opaque and unintelligible errors. It
is not clear to users if the API server itself is having an error (for
instance, fetching the REST client) or if the request to the webhook
failed in some way.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
…nilcheck

cleanup: omit redundant nil check around loop in apiserver
Structured Logging migration: modify server_windows part logs of kube-proxy.
kubeadm:Use kubeadmapiv1.SchemeGroupVersion.String() instead of kubeadm.k8s.i…
Fix fluent-bit configuration for GCE Windows.
add --all-namespaces to kubectl annotate,label
…p-admission-error-reasons

apiserver: wrap errors in admission with context
split CRD schema test between migrated data and current
The method is used only for testing purposes. Given the Resource data type
exposes all its fields, any invoker of ResourceList still using the
method outside of kubernetes/kubernetes can either copy-paste the original
implementation or implement a custom method that converts resources into
the proper Quantity data type.

Given the hugepage resource is a scalar resource, it's sufficient for
the underlying code under fit_test.go to take into account any
extended resources. For predicate_test.go, the hugepage
resource does not play any role, as the General predicates test cases
do not set any scalar resource at all.

Additionally, by removing ResourceList method, pkg/scheduler/framework
can get rid of dependency on k8s.io/kubernetes/pkg/apis/core/v1/helper.
Added integration test for pod affinity namespace selector
…unt-path-removal

Deprecate removal of CSI nodepublish path by kubelet (kubernetes#101332)
…Resource-ResourceList-method

pkg/scheduler: drop Resource.ResourceList() method
k8s-ci-robot and others added 2 commits April 28, 2021 18:12
…eplace-IsScalerResourceName-with-nodeinfo-allocatable-scalar-resource-presence

noderesource: node info already knows which resources are scalar
To implement controllers that dynamically decide which resources to
watch, it must be possible to get rid of dedicated watches and event
handlers again. This requires the ability to remove event handlers
from SharedIndexInformers. Stopping an informer is not sufficient,
because there may be multiple controllers in a controller manager
that independently decide which resources to watch.

Unfortunately the ResourceEventHandler interface encourages the use of
value objects for handlers (like the ResourceEventHandlerFuncs
struct, which uses value receivers to implement the interface).
Go does not support comparison of function values, and therefore
such structs are not comparable either. Such handlers cannot be
matched again after they have been added, so it is only possible
to remove handlers that are comparable.
Fortunately struct-based handlers can also be passed by reference.
Users of the interface can therefore still use those handlers and
remove them again by switching to a pointer argument.

The remove method checks whether a handler can be compared and skips
uncomparable handlers in the removal process. Attempting to remove an
uncomparable handler results in an error return.

Remark: If, as the result of a handler removal, a complete informer
should be disabled, it is highly recommended to incorporate
pull request kubernetes#98653, which fixes a race condition when stopping
watches for an informer using the stop channel.
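The comparability issue this commit message describes can be demonstrated in isolation. The sketch below is not the client-go implementation; Handler and FuncsHandler merely mirror the shape of ResourceEventHandler and ResourceEventHandlerFuncs:

```go
package main

import (
	"fmt"
	"reflect"
)

// Handler loosely mirrors client-go's ResourceEventHandler.
type Handler interface{ OnAdd(obj interface{}) }

// FuncsHandler is a value-type handler with a func field, like
// ResourceEventHandlerFuncs: struct values containing funcs are not
// comparable in Go, so they can never be matched for removal.
type FuncsHandler struct{ AddFunc func(obj interface{}) }

func (f FuncsHandler) OnAdd(obj interface{}) {
	if f.AddFunc != nil {
		f.AddFunc(obj)
	}
}

// remove deletes h from handlers by interface comparison; it reports false
// when h's dynamic type is not comparable (the "error return" case).
func remove(handlers []Handler, h Handler) ([]Handler, bool) {
	if !reflect.TypeOf(h).Comparable() {
		return handlers, false // uncomparable: refuse, can never match
	}
	for i, existing := range handlers {
		// Safe: interface comparison only panics when both dynamic types
		// are equal and uncomparable, and h is known to be comparable.
		if existing == h {
			return append(handlers[:i], handlers[i+1:]...), true
		}
	}
	return handlers, false
}

func main() {
	byValue := FuncsHandler{AddFunc: func(interface{}) {}}
	byPointer := &FuncsHandler{AddFunc: func(interface{}) {}}
	handlers := []Handler{byValue, byPointer}

	_, ok := remove(handlers, byValue) // value handler: uncomparable struct
	fmt.Println("removed value handler:", ok) // false

	handlers, ok = remove(handlers, byPointer) // pointer: compared by address
	fmt.Println("removed pointer handler:", ok, "remaining:", len(handlers))
}
```

Passing `&FuncsHandler{...}` instead of a value makes the handler comparable by pointer identity, which is exactly the switch to a pointer argument the message recommends.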