
Stacktrace when replacing imported controlplane #8480

Closed
huxcrux opened this issue Apr 5, 2023 · 6 comments · Fixed by #8481
Labels
area/control-plane: Issues or PRs related to control-plane lifecycle management
kind/bug: Categorizes issue or PR as related to a bug.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments


huxcrux commented Apr 5, 2023

What steps did you take and what happened?

I am currently working on a solution for migrating an existing cluster to Cluster API.

This is done by importing all nodes as Machines (and, since I use the OpenStack infrastructure provider, also creating an OpenStackMachine for each Machine).
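Roughly, the import step looks like this (a minimal sketch using controller-runtime and unstructured objects; the names, API versions and the dataSecretName are placeholders and most spec fields are omitted; the real manifests are in the gist linked further down):

```go
// Minimal sketch of the import step. Names, namespaces, API versions and the
// bootstrap data secret name are placeholders; most spec fields are omitted.
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// One OpenStackMachine per existing control-plane node (spec omitted here).
	osm := &unstructured.Unstructured{}
	osm.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "infrastructure.cluster.x-k8s.io",
		Version: "v1alpha6", // whichever version the OpenStack provider serves
		Kind:    "OpenStackMachine",
	})
	osm.SetNamespace("default")
	osm.SetName("lab1-k8s-master-1")

	// A Machine that adopts the node and points at the OpenStackMachine above.
	m := &unstructured.Unstructured{}
	m.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "cluster.x-k8s.io", Version: "v1beta1", Kind: "Machine",
	})
	m.SetNamespace("default")
	m.SetName("lab1-k8s-master-1")
	m.Object["spec"] = map[string]interface{}{
		"clusterName": "hux-lab1",
		"version":     "v1.25.6",
		"infrastructureRef": map[string]interface{}{
			"apiVersion": "infrastructure.cluster.x-k8s.io/v1alpha6",
			"kind":       "OpenStackMachine",
			"name":       "lab1-k8s-master-1",
			"namespace":  "default",
		},
		// For an adopted node the bootstrap data secret already exists
		// (the secret name here is a placeholder).
		"bootstrap": map[string]interface{}{
			"dataSecretName": "lab1-k8s-master-1",
		},
	}

	for _, obj := range []client.Object{osm, m} {
		if err := c.Create(ctx, obj); err != nil {
			panic(err)
		}
	}
}
```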

The import process works fine. However, when I update my KubeadmControlPlane with a new Kubernetes version (I also update the machineTemplate with the new image) in order to replace all control-plane nodes, one new control-plane node is created and joins the cluster, but as soon as the first old node/machine is removed, the kubeadm control-plane provider pod (the capi-kubeadm-control-plane-controller-manager pod in the capi-kubeadm-control-plane-system namespace) crashes and prints a stack trace.

Every time the kubeadm control-plane provider pod restarts, it quickly crashes again, and I have not managed to find a solution.

If I downgrade the kubeadm control-plane controller to version 1.3.6 (or any 1.3 release), the two non-updated control-plane nodes are replaced without any problems. (I did this by simply changing the image tag to :1.3.6 and removing ,LazyRestmapper=false from the feature-gates flag.)

After all control-plane nodes are replaced, I can upgrade to 1.4.1 again and everything works as expected. The upgrade is done by running clusterctl upgrade apply --control-plane capi-kubeadm-control-plane-system/kubeadm:v1.4.1

What did you expect to happen?

I expected the control-plane machines to be replaced one by one without any problems.

Cluster API version

I currently run 1.4.1, but had the same problem on 1.4.0.

Please note that the only thing I changed in order to get this to work was the kubeadm control-plane provider.

❯ clusterctl version

clusterctl version: &version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.1", GitCommit:"39d87e91080088327c738c43f39e46a7f557d03b", GitTreeState:"clean", BuildDate:"2023-04-04T17:31:43Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

❯ clusterctl upgrade plan

Checking cert-manager version...
Cert-Manager is already up to date

Checking new release availability...

Latest release available for the v1beta1 API Version of Cluster API (contract):

NAME                    NAMESPACE                           TYPE                   CURRENT VERSION   NEXT VERSION
bootstrap-kubeadm       capi-kubeadm-bootstrap-system       BootstrapProvider      v1.4.1            Already up to date
control-plane-kubeadm   capi-kubeadm-control-plane-system   ControlPlaneProvider   v1.4.1            Already up to date
cluster-api             capi-system                         CoreProvider           v1.4.1            Already up to date

Kubernetes version

Management cluster: 1.21.1
Workload cluster: Nodes are imported on 1.25.6; to force recreation, I update the control plane to 1.25.7 (I have also tested 1.25.8).

Anything else you would like to add?

Log output from kubeadm control-plane pod (from namespace capi-kubeadm-control-plane-system):

E0405 11:01:21.246623       1 controller.go:329] "Reconciler error" err="failed to get patch helper for machine lab1-k8s-master-3" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hux-lab1-control-plane" namespace="default" name="hux-lab1-control-plane" reconcileID=a5233509-06ab-4e9a-a47e-d2f7b0ca308c
I0405 11:01:23.706055       1 controller.go:276] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hux-lab1-control-plane" namespace="default" name="hux-lab1-control-plane" reconcileID=7fab91cb-28bb-4195-a666-34ffea587a38 Cluster="default/hux-lab1"
I0405 11:01:23.784371       1 controller.go:118] "Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hux-lab1-control-plane" namespace="default" name="hux-lab1-control-plane" reconcileID=7fab91cb-28bb-4195-a666-34ffea587a38
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x183740e]

goroutine 470 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:119 +0x1fa
panic({0x19f8600, 0x2dd21f0})
	runtime/panic.go:884 +0x212
sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).syncMachines(0xc0008ab400, {0x1f4b738, 0xc000c8bd70}, 0xc00061f4a0)
	sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:603 +0x52e
sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).reconcile(0xc0008ab400, {0x1f4b738, 0xc000c8bd70}, 0xc001153380, 0xc00057f880)
	sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:347 +0x732
sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).Reconcile(0xc0008ab400, {0x1f4b738, 0xc000c8bb60}, {{{0xc000ba47b0?, 0x10?}, {0xc001c15920?, 0x40dc07?}}})
	sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:233 +0x7f0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1f4b738?, {0x1f4b738?, 0xc000c8bb60?}, {{{0xc000ba47b0?, 0x1961de0?}, {0xc001c15920?, 0x0?}}})
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0008ab4a0, {0x1f4b690, 0xc0007eb100}, {0x1a71d20?, 0xc000ea56e0?})
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323 +0x38f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0008ab4a0, {0x1f4b690, 0xc0007eb100})
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:231 +0x333

The complete log output and relevant manifests can be found here: https://gist.github.com/bl0m1/f446be110ba6037c893b61cb53cbf93e
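
The panic points at syncMachines (controller.go:603 in the trace above). For illustration only, here is a hypothetical sketch of the kind of nil guard that would avoid this class of crash; the type and function names are placeholders, and this is not a claim about the actual cluster-api code or the eventual fix:

```go
package controllers

// Hypothetical sketch of a defensive nil check in a machine sync loop.
// reconcilerSketch and syncImportedMachines are placeholder names, not
// the real cluster-api types or functions.

import (
	"context"

	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/patch"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type reconcilerSketch struct {
	Client client.Client
}

func (r *reconcilerSketch) syncImportedMachines(ctx context.Context, machines map[string]*clusterv1.Machine, infraMachines map[string]*unstructured.Unstructured) error {
	for name, m := range machines {
		infraMachine := infraMachines[name]
		if infraMachine == nil {
			// Imported/adopted Machines may reference objects the controller
			// has not cached (or does not own) yet; skip instead of
			// dereferencing a nil pointer.
			continue
		}
		helper, err := patch.NewHelper(infraMachine, r.Client)
		if err != nil {
			return errors.Wrapf(err, "failed to get patch helper for machine %s", m.Name)
		}
		// ... update in-place mutable fields here ...
		if err := helper.Patch(ctx, infraMachine); err != nil {
			return errors.Wrapf(err, "failed to patch machine %s", m.Name)
		}
	}
	return nil
}
```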

Label(s) to be applied

/kind bug
/area control-plane

@k8s-ci-robot added the kind/bug, area/control-plane, and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Apr 5, 2023
@killianmuldoon
Contributor

Thanks for reporting this @bl0m1 - this code was heavily refactored during the 1.4 cycle so I think we probably just missed a nil check somewhere. Have you got a KCP yaml or Cluster yaml that minimally recreates the crash so I can be certain?

/assign
/triage accepted.

@k8s-ci-robot
Contributor

@killianmuldoon: The label(s) triage/accepted. cannot be applied, because the repository doesn't have them.

In response to this:

Thanks for reporting this @bl0m1 - this code was heavily refactored during the 1.4 cycle so I think we probably just missed a nil check somewhere. Have you got a KCP yaml or Cluster yaml that minimally recreates the crash so I can be certain?

/assign
/triage accepted.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon
Contributor

/triage accepted

@killianmuldoon
Contributor

@bl0m1 I think this - #8481 - probably fixes the issue - but it would be great to get a repro-case if you can supply one to ensure this actually works for your specific problem.


huxcrux commented Apr 6, 2023

@bl0m1 I think this - #8481 - probably fixes the issue - but it would be great to get a repro-case if you can supply one to ensure this actually works for your specific problem.

I have verified your patch and can confirm it fixes the problem.

@fabriziopandini
Member

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Apr 6, 2023