
SSA failure after removal of old API versions #10051

Open
MaxRink opened this issue Jan 24, 2024 · 19 comments · Fixed by #10147
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.


MaxRink commented Jan 24, 2024

What steps did you take and what happened?

We have some CAPI clusters that have been around for quite a while. After upgrading to CAPI v1.6.x (i.e. after the v1alpha3 removal and v1alpha4 deprecation) we saw server-side apply issues pop up; more specifically, Flux was no longer able to apply our resources.
We have seen this on MachineDeployments, MachineHealthChecks and also VSphereClusters (those belong to CAPV, which also removed API versions in its latest release). I'm adding the MD here as an example.
The same behaviour occurs when manually using kubectl apply --server-side; it does not happen when SSA is not used.
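
For reference, a manual reproduction looks roughly like this (a sketch; machinedeployment.yaml stands in for the manifest from git shown further below):

  # client-side apply works fine
  kubectl apply -f machinedeployment.yaml
  # server-side apply (also as a server-side dry run) fails with the conversion error shown at the end
  kubectl apply --server-side --dry-run=server -f machinedeployment.yaml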

MD in API-Server:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  annotations:
    machinedeployment.clusters.x-k8s.io/revision: "38"
  creationTimestamp: "2021-09-14T14:58:17Z"
  generation: 5447
  labels:
    cluster.x-k8s.io/cluster-name: cocomat-1
    kustomize.toolkit.fluxcd.io/name: tenant-clusters-vsphere-tstsa1-bn
    kustomize.toolkit.fluxcd.io/namespace: schiff-system
    nodepool: cocomat-1-md-0
  managedFields:
  - apiVersion: cluster.x-k8s.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:cluster.x-k8s.io/cluster-name: {}
          f:kustomize.toolkit.fluxcd.io/name: {}
          f:kustomize.toolkit.fluxcd.io/namespace: {}
          f:nodepool: {}
      f:spec:
        f:clusterName: {}
        f:minReadySeconds: {}
        f:progressDeadlineSeconds: {}
        f:replicas: {}
        f:revisionHistoryLimit: {}
        f:selector: {}
        f:strategy:
          f:rollingUpdate:
            f:maxSurge: {}
            f:maxUnavailable: {}
          f:type: {}
        f:template:
          f:metadata:
            f:annotations:
              f:schiff.telekom.de/hotfix: {}
            f:labels:
              f:cluster.x-k8s.io/cluster-name: {}
              f:cluster.x-k8s.io/deployment-name: {}
              f:node-role.kubernetes.io/worker: {}
              f:nodepool: {}
          f:spec:
            f:bootstrap:
              f:configRef: {}
            f:clusterName: {}
            f:infrastructureRef: {}
            f:nodeDeletionTimeout: {}
            f:nodeDrainTimeout: {}
            f:nodeVolumeDetachTimeout: {}
            f:version: {}
    manager: kustomize-controller
    operation: Apply
    time: "2023-12-15T14:02:22Z"
  - apiVersion: cluster.x-k8s.io/v1alpha3
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences: {}
      f:status:
        .: {}
        f:selector: {}
    manager: manager
    operation: Update
    time: "2021-09-24T14:24:20Z"
  - apiVersion: cluster.x-k8s.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:machinedeployment.clusters.x-k8s.io/revision: {}
        f:ownerReferences:
          k:{"uid":"c5e2f139-11d5-4293-9a11-ae97795e305d"}: {}
    manager: manager
    operation: Update
    time: "2023-12-15T14:02:26Z"
  - apiVersion: cluster.x-k8s.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:availableReplicas: {}
        f:conditions: {}
        f:observedGeneration: {}
        f:phase: {}
        f:readyReplicas: {}
        f:replicas: {}
        f:unavailableReplicas: {}
        f:updatedReplicas: {}
    manager: manager
    operation: Update
    subresource: status
    time: "2023-12-15T15:55:08Z"
  name: cocomat-1-md-0
  namespace: vsphere-tstsa1-bn
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: cocomat-1
    uid: c5e2f139-11d5-4293-9a11-ae97795e305d
  resourceVersion: "3025961533"
  uid: 3225b9ca-7d89-476b-8d0e-aed1a3a1c1d7
spec:
  clusterName: cocomat-1
  minReadySeconds: 20
  progressDeadlineSeconds: 600
  replicas: 6
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: cocomat-1
      cluster.x-k8s.io/deployment-name: cocomat-1-md-0
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        schiff.telekom.de/hotfix: "true"
      labels:
        cluster.x-k8s.io/cluster-name: cocomat-1
        cluster.x-k8s.io/deployment-name: cocomat-1-md-0
        node-role.kubernetes.io/worker: ""
        nodepool: cocomat-1-md-0
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: cocomat-1-md-11
      clusterName: cocomat-1
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: cocomat-1-md-35
      nodeDeletionTimeout: 0s
      nodeDrainTimeout: 15m0s
      nodeVolumeDetachTimeout: 30m0s
      version: v1.27.8
status:
  availableReplicas: 6
  conditions:
  - lastTransitionTime: "2023-10-17T19:53:46Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-10-17T19:53:46Z"
    status: "True"
    type: Available
  observedGeneration: 5447
  phase: Running
  readyReplicas: 6
  replicas: 6
  selector: cluster.x-k8s.io/cluster-name=cocomat-1,cluster.x-k8s.io/deployment-name=cocomat-1-md-0
  unavailableReplicas: 0
  updatedReplicas: 6

MD in git (i.e. the one Flux tries to apply via SSA):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: cocomat-1
    nodepool: cocomat-1-md-0
  name: cocomat-1-md-0
  namespace: vsphere-tstsa1-bn
spec:
  clusterName: cocomat-1
  minReadySeconds: 20
  progressDeadlineSeconds: 600
  replicas: 6
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: cocomat-1
      cluster.x-k8s.io/deployment-name: cocomat-1-md-0
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        schiff.telekom.de/hotfix: 'true'
      labels:
        cluster.x-k8s.io/cluster-name: cocomat-1
        cluster.x-k8s.io/deployment-name: cocomat-1-md-0
        node-role.kubernetes.io/worker: ''
        nodepool: cocomat-1-md-0
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: cocomat-1-md-12
      clusterName: cocomat-1
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: cocomat-1-md-36
      nodeDeletionTimeout: 0s
      nodeDrainTimeout: 15m0s
      nodeVolumeDetachTimeout: 1800s
      version: v1.27.9

error message:

  Warning  ReconciliationFailed  3s (x5453 over 19d)  kustomize-controller  MachineDeployment/vsphere-tstsa1-bn/cocomat-1-md-0 dry-run failed: request to convert CR to an invalid group/version: cluster.x-k8s.io/v1alpha3

What did you expect to happen?

SSA continues to work and doesn't reference removed API versions.

Cluster API version

v1.6.1

Kubernetes version

No response

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2024

killianmuldoon commented Jan 24, 2024

/triage accepted

This looks bad - thanks for reporting it. Need to find time to reproduce.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2024
@fabriziopandini

Q: do we know which Kubernetes version was in use when the object was created/updated with v1alpha3 (around "2021-09-24T14:24:20Z"), and which Kubernetes version is in use now, when the object was last updated (around "2023-12-15T14:02:26Z")? This could help with reproducing.

General comment (without having done any investigation):
It looks like SSA support on CRDs has some issues with API version removal. Working around this on the CAPI side feels like acting in the wrong layer, but let's avoid jumping to solutions until we have a better understanding of what's going on.

Another data point: in this example CAPI is doing a plain patch (not an SSA patch), yet the version-removal issue leads to errors when something else does SSA.

Last note: it might be worth asking API Machinery for guidance, but at this stage I even struggle to explain the problem properly...

@sbueringer

/assign

@sbueringer

I could reproduce it locally and have some ideas around mitigating it. Need some more time to provide more details.

@sbueringer sbueringer added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 13, 2024
sbueringer commented Feb 13, 2024

Some more data.

I could see old apiVersions in managedFields in these jobs:

It seems to be a pretty regular occurrence that old apiVersions remain in managedFields. The issue then surfaces as soon as the apiVersion (e.g. v1alpha3 / v1alpha4) is entirely dropped from the CRD, because then the stale managedFields entries can no longer be converted and (at least) kubectl apply --server-side fails. The same goes for every other client that uses SSA, like Flux or potentially other controllers.
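
To check whether an object still carries managedFields entries for an old apiVersion, something like this works (a sketch, using the MachineDeployment from the repro below):

  kubectl get machinedeployment capi-quickstart-md-0 --show-managed-fields \
    -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.apiVersion}{"\n"}{end}'
  # any entry still listing cluster.x-k8s.io/v1alpha3 or v1alpha4 will break SSA once that version is dropped from the CRD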

Here is a full example to reproduce the issue from scratch:

  • Setup env: (tested with Kubernetes v1.27.3 & kind v0.20.0)
    • git checkout release-1.6
    • PATH=/Users/buringerst/go_sdks/go1.20.7/bin/:$PATH make tilt-up (Tilt with core CAPI + CAPD)
  • Deploy MD CRD with v1alpha4 storage version
    • k apply -f ./testing/ssa/crd-v1alpha4-storage-v1beta1.yaml
  • Deploy Cluster with v1alpha4 MD
    • k apply -f ./testing/ssa/capi-quickstart.yaml
  • Verify v1alpha4 in managed fields
    • k get md capi-quickstart-md-0 -o yaml --show-managed-fields=true
  • Change storage version to v1beta1
    • k apply -f ./testing/ssa/crd-v1alpha4-v1beta1-storage.yaml
  • Ensure MD was written with v1beta1 storage version
    • k label md capi-quickstart-md-0 abc=def
  • Drop v1alpha4 from .status.storedVersions
    • k edit crd machinedeployments.cluster.x-k8s.io --subresource=status
  • Drop v1alpha4 from CRD
    • k apply -f ./testing/ssa/crd-v1beta1.yaml
  • Fail => Error from server: request to convert CR to an invalid group/version: cluster.x-k8s.io/v1alpha4
    • k apply -f ./testing/ssa/capi-quickstart-md.yaml --server-side=true

The situation can be recovered via:

  • Add version v1alpha4 again (also works again if v1alpha4 has served: false)
    • k apply -f ./testing/ssa/crd-v1alpha4-v1beta1-storage.yaml
  • Verify that it works again
    • k apply -f ./testing/ssa/capi-quickstart-md.yaml --server-side=true
      • Note: This calls the conversion webhook

Files can be downloaded here: https://gist.github.com/sbueringer/4301e6160e3cfc35b57ff517599efe10

@sbueringer

I'm now working on:

  1. Reproducing the error with an extended clusterctl upgrade test
  2. Working on a fix to mitigate the issue short-term. The current idea is to re-introduce v1alpha3/v1alpha4 in a way that conversion works again (but these versions should otherwise not be exposed). Then we can think about mid-term / long-term solutions.

@sbueringer

cc @killianmuldoon @fabriziopandini @chrischdi (just fyi)

@adilGhaffarDev

@sbueringer This also seems to be the cause of the flake we see in clusterctl upgrade tests from 0.4->1.4 or 0.4->1.5. ref: #9688

sbueringer commented Feb 13, 2024

No, I think there is a difference between "conversion webhook for infrastructure.cluster.x-k8s.io/v1alpha4, Kind=DockerCluster failed" and "request to convert CR to an invalid group/version".

The former means the kube-apiserver tries to communicate with the conversion webhook but the webhook is not reachable. The latter means the kube-apiserver doesn't even find the apiVersion in the CRD.

@sbueringer

I wrote a hacky version of an upgrade test to reproduce the issue: #10146 (run as part of e2e-blocking).

Now implementing a short-term workaround.

@vincepri

/reopen

@k8s-ci-robot

@vincepri: Reopened this issue.


@k8s-ci-robot k8s-ci-robot reopened this Feb 14, 2024
@vincepri

Maybe @liggitt or @deads2k could help here?

Is this expected behavior with CRD versioning and SSA? I'm wondering whether this is something we'd expect the api-server to handle for end users (Cluster API in this case), or whether we should handle version removal and managed-field cleanup ourselves.

liggitt commented Feb 14, 2024

after v1alpha3 removal

does this mean existing persisted objects were touched to update the stored version to a later version, and v1alpha3 was dropped from spec.versions and status.storedVersions in the CRD?

cc @jpbetz for an aspect that seems missing from https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#upgrade-existing-objects-to-a-new-stored-version ... I'm not sure what I would tell someone to do to update existing managedFields tied to no-longer-existing versions, or how to update those as part of storage migration

this looks like kubernetes/kubernetes#111937, which would possibly be resolved by kubernetes/kubernetes#115650 (but that PR needs close review to make sure there are no other undesirable side effects)
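
For completeness: the Kubernetes server-side apply documentation describes clearing managedFields through any non-apply write that overwrites metadata.managedFields with a single empty entry. A sketch against the MachineDeployment from this issue (this discards field ownership for all managers and has not been validated against this exact failure mode, so it is illustrative only):

  # non-apply merge patch that resets managedFields on the affected object
  kubectl patch machinedeployment cocomat-1-md-0 -n vsphere-tstsa1-bn \
    --type=merge -p '{"metadata":{"managedFields":[{}]}}'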

@sbueringer

Sharing more information that Fabrizio shared in the office hours just now:

A nasty issue requires an urgent fix for CAPI v1.6.x and main; this fix will be included in the next patch release, which we deferred to next week to give some time to get the fix merged.

  • The issue involves API version removal, SSA and managed fields.
    • If you have clusters created with v1alpha3, there is a good chance that you have a v1alpha3 entry in the managed fields of some of your objects.
    • Despite the fact that we went through a proper API removal process, some of those v1alpha3 entries in managed fields stick around - and they never get automatically converted to newer API versions.
    • After upgrading to CAPI v1.6, which drops the v1alpha3 API, if controllers/users perform an SSA patch on an object in the state above, they get the following error message:
      • request to convert CR to an invalid group/version: cluster.x-k8s.io/v1alpha3
  • Root cause
    • The root cause of why some of those v1alpha3 entries in managed fields stick around is not entirely clear.
    • However, we already know that the issue could be triggered by any client using the v1alpha3 API (including CAPI, but also kubectl, GitOps controllers, literally any other client working on our CRs…)
  • The fix [short-term mitigation]
    • Given the impact (all the CRs) and the almost unlimited set of use cases that can lead to this condition (any client interacting with CAPI CRs), we decided to restore only the API types and conversions for v1alpha3 and v1alpha4 (more details below on how we minimized the impact of this operation).
  • Why was this not detected by our CI?
    • We were testing CAPI upgrades from vA->vB
    • This issue happens only if you upgrade CAPI from vA->vB->vC, where an API version is deprecated in vB and the same API version is removed in vC (more details below on how we fixed this gap).
  • What’s next?
    • For users running CAPI v1.6, we highly recommend upgrading to the next patch release (as soon as it is out).
    • For providers using CAPI v1.6 in their own CI, we highly recommend bumping to the next patch release ASAP.
    • WRT CAPI itself, the plan is still to remove the v1alpha3 and v1alpha4 API versions as soon as possible, but before doing so we need to figure out a way to do so without incurring the same problem again.

@sbueringer
sbueringer commented Feb 14, 2024

after v1alpha3 removal

does this mean existing persisted objects were touched to update the stored version to a later version, and v1alpha3 was dropped from spec.versions and status.storedVersions in the CRD?

Yes. We had a multi-step migration process:

  1. v1beta1 is storage version
  2. List and update all CRs of the CRD (storage migration), then remove v1alpha3 from .status.storedVersions
  3. set served: false on v1alpha3
  4. remove v1alpha3 from the CRD

After step 4 we started getting "convert CR to an invalid group/version" errors. I assume the apiserver tries to find the version in the CRD, doesn't find it, and therefore returns this error.
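
For illustration, step 2 of such a migration can be approximated with kubectl like this (a sketch, not the exact tooling CAPI uses; it assumes v1beta1 is already the storage version, is the only version that should remain stored, and kubectl v1.24+ for --subresource):

  # rewrite all MachineDeployments so they get re-persisted in the current (v1beta1) storage version
  kubectl get machinedeployments.cluster.x-k8s.io -A -o json | kubectl replace -f -
  # then drop v1alpha3 from the CRD's storedVersions via the status subresource
  kubectl patch crd machinedeployments.cluster.x-k8s.io --subresource=status --type=merge \
    -p '{"status":{"storedVersions":["v1beta1"]}}'
  # steps 3 and 4 (served: false for v1alpha3, then removing it) are done by applying updated CRD manifests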

this looks like kubernetes/kubernetes#111937 that would possibly be resolved by kubernetes/kubernetes#115650 (but kubernetes/kubernetes#115650 needs close review to make sure there are no other undesirable side effects)

Agree. This looks exactly like #111937

(cc @MadhavJivrajani, just fyi :))

@k8s-triage-robot

This issue is labeled with priority/critical-urgent but has not been updated in over 30 days, and should be re-triaged.
Critical-urgent issues must be actively worked on as someone's top priority right now.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority {important-soon, important-longterm, backlog}
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Mar 15, 2024
@fabriziopandini

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 18, 2024
@sbueringer sbueringer added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 25, 2024