
Stacktrace when replacing imported controlplane #8480

Closed
huxcrux opened this issue Apr 5, 2023 · 6 comments · Fixed by #8481
Labels
area/control-plane: Issues or PRs related to control-plane lifecycle management
kind/bug: Categorizes issue or PR as related to a bug.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments


huxcrux commented Apr 5, 2023

What steps did you take and what happened?

I am currently working on a solution for migrating an existing cluster to Cluster API.

This is done by importing all nodes as Machines (and, since I use the OpenStack infrastructure provider, also creating an OpenStackMachine for each Machine).
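Roughly, the import step looks like this (a minimal sketch using controller-runtime and unstructured objects; the names, API versions and the dataSecretName are placeholders and most spec fields are omitted; the real manifests are in the gist linked further down):

```go
// Minimal sketch of the import step. Names, namespaces, API versions and the
// bootstrap data secret name are placeholders; most spec fields are omitted.
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// One OpenStackMachine per existing control-plane node (spec omitted here).
	osm := &unstructured.Unstructured{}
	osm.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "infrastructure.cluster.x-k8s.io",
		Version: "v1alpha6", // whichever version the OpenStack provider serves
		Kind:    "OpenStackMachine",
	})
	osm.SetNamespace("default")
	osm.SetName("lab1-k8s-master-1")

	// A Machine that adopts the node and points at the OpenStackMachine above.
	m := &unstructured.Unstructured{}
	m.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "cluster.x-k8s.io", Version: "v1beta1", Kind: "Machine",
	})
	m.SetNamespace("default")
	m.SetName("lab1-k8s-master-1")
	m.Object["spec"] = map[string]interface{}{
		"clusterName": "hux-lab1",
		"version":     "v1.25.6",
		"infrastructureRef": map[string]interface{}{
			"apiVersion": "infrastructure.cluster.x-k8s.io/v1alpha6",
			"kind":       "OpenStackMachine",
			"name":       "lab1-k8s-master-1",
			"namespace":  "default",
		},
		// For an adopted node the bootstrap data secret already exists
		// (the secret name here is a placeholder).
		"bootstrap": map[string]interface{}{
			"dataSecretName": "lab1-k8s-master-1",
		},
	}

	for _, obj := range []client.Object{osm, m} {
		if err := c.Create(ctx, obj); err != nil {
			panic(err)
		}
	}
}
```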

The import process works fine. However, when I update my KubeadmControlPlane with a new Kubernetes version (I also update the machineTemplate with the new image) in order to replace all control-plane nodes, one new control-plane node is created and joins the cluster, but as soon as the first old node/machine is removed, the kubeadm control-plane provider pod (the capi-kubeadm-control-plane-controller-manager pod in the capi-kubeadm-control-plane-system namespace) crashes and prints a stack trace.

Every time the kubeadm control-plane provider pod restarts, it quickly crashes again, and I have not managed to find a solution.

If I downgrade the kubeadm control-plane controller to version 1.3.6 (or any 1.3 release), the two non-updated control-plane nodes are replaced without any problems. (I did this by simply changing the image tag to :1.3.6 and removing ,LazyRestmapper=false from the feature-gates flag.)

After all control-plane nodes are replaced, I can upgrade to 1.4.1 again and everything works as expected. The upgrade is done by running clusterctl upgrade apply --control-plane capi-kubeadm-control-plane-system/kubeadm:v1.4.1

What did you expect to happen?

I expected the control-plane machines to be replaced one by one without any problems.

Cluster API version

I currently run 1.4.1, but had the same problem on 1.4.0.

Please note that the only thing I changed in order to get this to work was the kubeadm control-plane provider.

❯ clusterctl version

clusterctl version: &version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.1", GitCommit:"39d87e91080088327c738c43f39e46a7f557d03b", GitTreeState:"clean", BuildDate:"2023-04-04T17:31:43Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

❯ clusterctl upgrade plan

Checking cert-manager version...
Cert-Manager is already up to date

Checking new release availability...

Latest release available for the v1beta1 API Version of Cluster API (contract):

NAME                    NAMESPACE                           TYPE                   CURRENT VERSION   NEXT VERSION
bootstrap-kubeadm       capi-kubeadm-bootstrap-system       BootstrapProvider      v1.4.1            Already up to date
control-plane-kubeadm   capi-kubeadm-control-plane-system   ControlPlaneProvider   v1.4.1            Already up to date
cluster-api             capi-system                         CoreProvider           v1.4.1            Already up to date

Kubernetes version

Management cluster: 1.21.1
Workload cluster: Nodes are imported on 1.25.6; to force recreation, I update the control plane to 1.25.7 (I have also tested 1.25.8).

Anything else you would like to add?

Log output from kubeadm control-plane pod (from namespace capi-kubeadm-control-plane-system):

E0405 11:01:21.246623       1 controller.go:329] "Reconciler error" err="failed to get patch helper for machine lab1-k8s-master-3" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hux-lab1-control-plane" namespace="default" name="hux-lab1-control-plane" reconcileID=a5233509-06ab-4e9a-a47e-d2f7b0ca308c
I0405 11:01:23.706055       1 controller.go:276] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hux-lab1-control-plane" namespace="default" name="hux-lab1-control-plane" reconcileID=7fab91cb-28bb-4195-a666-34ffea587a38 Cluster="default/hux-lab1"
I0405 11:01:23.784371       1 controller.go:118] "Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hux-lab1-control-plane" namespace="default" name="hux-lab1-control-plane" reconcileID=7fab91cb-28bb-4195-a666-34ffea587a38
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x183740e]

goroutine 470 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:119 +0x1fa
panic({0x19f8600, 0x2dd21f0})
	runtime/panic.go:884 +0x212
sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).syncMachines(0xc0008ab400, {0x1f4b738, 0xc000c8bd70}, 0xc00061f4a0)
	sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:603 +0x52e
sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).reconcile(0xc0008ab400, {0x1f4b738, 0xc000c8bd70}, 0xc001153380, 0xc00057f880)
	sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:347 +0x732
sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).Reconcile(0xc0008ab400, {0x1f4b738, 0xc000c8bb60}, {{{0xc000ba47b0?, 0x10?}, {0xc001c15920?, 0x40dc07?}}})
	sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:233 +0x7f0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1f4b738?, {0x1f4b738?, 0xc000c8bb60?}, {{{0xc000ba47b0?, 0x1961de0?}, {0xc001c15920?, 0x0?}}})
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0008ab4a0, {0x1f4b690, 0xc0007eb100}, {0x1a71d20?, 0xc000ea56e0?})
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323 +0x38f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0008ab4a0, {0x1f4b690, 0xc0007eb100})
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:231 +0x333

The complete log output and relevant manifests can be found here: https://gist.github.com/bl0m1/f446be110ba6037c893b61cb53cbf93e
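
The panic points at syncMachines (controller.go:603 in the trace above). For illustration only, here is a hypothetical sketch of the kind of nil guard that would avoid this class of crash; the type and function names are placeholders, and this is not a claim about the actual cluster-api code or the eventual fix:

```go
package controllers

// Hypothetical sketch of a defensive nil check in a machine sync loop.
// reconcilerSketch and syncImportedMachines are placeholder names, not
// the real cluster-api types or functions.

import (
	"context"

	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/patch"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type reconcilerSketch struct {
	Client client.Client
}

func (r *reconcilerSketch) syncImportedMachines(ctx context.Context, machines map[string]*clusterv1.Machine, infraMachines map[string]*unstructured.Unstructured) error {
	for name, m := range machines {
		infraMachine := infraMachines[name]
		if infraMachine == nil {
			// Imported/adopted Machines may reference objects the controller
			// has not cached (or does not own) yet; skip instead of
			// dereferencing a nil pointer.
			continue
		}
		helper, err := patch.NewHelper(infraMachine, r.Client)
		if err != nil {
			return errors.Wrapf(err, "failed to get patch helper for machine %s", m.Name)
		}
		// ... update in-place mutable fields here ...
		if err := helper.Patch(ctx, infraMachine); err != nil {
			return errors.Wrapf(err, "failed to patch machine %s", m.Name)
		}
	}
	return nil
}
```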

Label(s) to be applied

/kind bug
/area control-plane

@k8s-ci-robot added the kind/bug, area/control-plane, and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Apr 5, 2023
@killianmuldoon
Contributor

Thanks for reporting this @bl0m1 - this code was heavily refactored during the 1.4 cycle so I think we probably just missed a nil check somewhere. Have you got a KCP yaml or Cluster yaml that minimally recreates the crash so I can be certain?

/assign
/triage accepted.

@k8s-ci-robot
Contributor

@killianmuldoon: The label(s) triage/accepted. cannot be applied, because the repository doesn't have them.

In response to this:

Thanks for reporting this @bl0m1 - this code was heavily refactored during the 1.4 cycle so I think we probably just missed a nil check somewhere. Have you got a KCP yaml or Cluster yaml that minimally recreates the crash so I can be certain?

/assign
/triage accepted.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon
Contributor

/triage accepted

@killianmuldoon
Contributor

@bl0m1 I think this - #8481 - probably fixes the issue - but it would be great to get a repro-case if you can supply one to ensure this actually works for your specific problem.


huxcrux commented Apr 6, 2023

@bl0m1 I think this - #8481 - probably fixes the issue - but it would be great to get a repro-case if you can supply one to ensure this actually works for your specific problem.

I have verified your patch and can confirm it fixes the problem.

@fabriziopandini
Member

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Apr 6, 2023