Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

馃悰 Add node watcher to MachinePool controller #8443

Merged

Conversation

CecileRobertMichon
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon commented Mar 31, 2023

What this PR does / why we need it: Since CAPI v1.4.0, the MachinePool controller drops the node.cluster.x-k8s.io/uninitialized taint from nodes as soon as the providerID gets added to the node by the cloud-controller-manager and CAPI is able to match the node with a MachinePool. However, it is currently not watching Nodes (unlike the Machine controller) which can cause delays in the taint being dropped on MachinePool nodes, which in turn causes the Node to not become schedulable until 10-15 minutes after the node is Ready.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #8442

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 31, 2023
if err := r.Client.List(
context.TODO(),
machinePoolList,
append(filters, client.MatchingFields{index.MachinePoolNodeNameField: node.Name})...); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure how to get indexing to work for MachinePools given there are multiple nodes per MachinePool... The extract func here returns a list of all node names associated with the MachinePool. What should this filter be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the documentation looks like this way using the filter should work as MatchingFields also supports cache indices.

And, indexing with multiple "keys" is apparently supported however it would lack compatibility with Kubernetes API server (not sure what the means or what the impact would be).

@CecileRobertMichon CecileRobertMichon changed the title [WIP] Add node watcher to MachinePool controller [WIP] 馃悰 Add node watcher to MachinePool controller Mar 31, 2023
@CecileRobertMichon CecileRobertMichon force-pushed the watch-mp-nodes branch 2 times, most recently from abf0b02 to 571d0d3 Compare March 31, 2023 23:54
@CecileRobertMichon CecileRobertMichon changed the title [WIP] 馃悰 Add node watcher to MachinePool controller 馃悰 Add node watcher to MachinePool controller Mar 31, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 31, 2023
@CecileRobertMichon
Copy link
Contributor Author

/hold

testing in kubernetes-sigs/cluster-api-provider-azure#3378

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2023
exp/controllers/alias.go Show resolved Hide resolved
@ykakarap
Copy link
Contributor

ykakarap commented Apr 1, 2023

/cherry-pick release-1.4

@k8s-infra-cherrypick-robot

@ykakarap: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ykakarap
Copy link
Contributor

ykakarap commented Apr 2, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 2, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 1b6f745788bf342e68b9fc5228b83282e81ce7a3

Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

I think this solution makes sense for now, and we can eventually follow up with some improvements like an in-memory map if we figure out that this does not scale well with a growing number of nodes.

Also, with machine pool machines we can probably greatly simplify this and just rely on machine pools watching machines

@CecileRobertMichon
Copy link
Contributor Author

CecileRobertMichon commented Apr 3, 2023

@ykakarap I got a repro of the issue while testing with this branch in the latest run on kubernetes-sigs/cluster-api-provider-azure#3378 so I'm trying to figure out what happened.

edit: I'm adding logs to my branch and going to re-run the tests with it to see what went wrong

@CecileRobertMichon
Copy link
Contributor Author

/hold

this is an optimization but doesn't actually fix the issue. See #8462 for the real fix.

@CecileRobertMichon
Copy link
Contributor Author

/hold cancel

let's get this in along with the other fix as it does fix some potential delays (although in practice it doesn't actually fix the observed flake as we're constantly requeuing when Node provider IDs aren't set: https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/internal/controllers/machinepool_controller_noderef.go#L83)

@k8s-ci-robot k8s-ci-robot removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Apr 3, 2023
@@ -80,7 +86,8 @@ func (r *MachinePoolReconciler) reconcileNodeRefs(ctx context.Context, cluster *
if err != nil {
if err == errNoAvailableNodes {
log.Info("Cannot assign NodeRefs to MachinePool, no matching Nodes")
return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
// No need to requeue here. Nodes emit an event that triggers reconciliation.
return ctrl.Result{}, nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 4, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 8c9fa0343abfab7f982fab2ecb81f9db1b5be951

@CecileRobertMichon
Copy link
Contributor Author

/hold

we're not going to include this in v1.4.1 as tests are passing without it, will take more time to validate this one

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 4, 2023
@CecileRobertMichon
Copy link
Contributor Author

/hold cancel

Got 4 passing tests in a row using kubernetes-sigs/cluster-api-provider-azure#3378

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 4, 2023
@CecileRobertMichon
Copy link
Contributor Author

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit 94f5468 into kubernetes-sigs:main Apr 4, 2023
10 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.5 milestone Apr 4, 2023
@k8s-infra-cherrypick-robot

@ykakarap: new pull request created: #8474

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@johannesfrey
Copy link
Contributor

/area machinepool

@k8s-ci-robot k8s-ci-robot added the area/machinepool Issues or PRs related to machinepools label Jun 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/machinepool Issues or PRs related to machinepools cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MachinePool controller does not watch nodes, causing delay in removing uninitialized node taint
6 participants