Issue with controller manager when deploying Velero on a fresh 1.26/1.27 EKS cluster #6350

Closed

ArchiFleKs opened this issue Jun 3, 2023 · 6 comments

Labels: 1.13-candidate (issue/pr that should be considered to target v1.13 minor release), Area/Cloud/AWS, Needs investigation

ArchiFleKs commented Jun 3, 2023

What steps did you take and what happened:

I deployed an EKS cluster with Terraform. Then I installed a lot of cluster add-ons, in particular:

  • Velero
  • Calico

When Velero runs the job to upgrade the CRDs, the cluster gets into the state described in this issue.

In the EKS controller manager logs I get:

2023-06-03T10:54:25.000+02:00  E0603 08:54:25.411942      10 shared_informer.go:314] unable to sync caches for garbage collector
2023-06-03T10:54:25.000+02:00  E0603 08:54:25.411954      10 garbagecollector.go:261] timed out waiting for dependency graph builder sync during GC sync (attempt 35)
This renders the cluster completely unusable, as I can't restart the controller manager on EKS.

I managed to remove Velero and Calico, as well as all of the Velero and Calico CRDs, and then the controller manager GC started working again.
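For illustration, a minimal sketch of that kind of cleanup (not the exact commands used here; it assumes the Velero CRDs live in the velero.io API group and the Calico CRDs in crd.projectcalico.org / projectcalico.org):

# list the CRDs installed by Velero and Calico
kubectl get crd -o name | grep -E 'velero\.io|projectcalico\.org'

# after removing the Velero and Calico workloads themselves, delete the leftover CRDs
kubectl get crd -o name | grep -E 'velero\.io|projectcalico\.org' | xargs kubectl delete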

This leads me to believe that the issue lies in Velero's CRD upgrade process, but I'm not 100% sure.

Use the "reaction smiley face" up to the right of this comment to vote.

Edit:

I still get perpetual errors in controller manager now:

E0603 09:15:00.570214      10 reflector.go:148] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:106: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource | E0603 09:15:00.570214 10 reflector.go:148] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:106: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
-- | --
  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
qiuming-best (Contributor) commented:

@ArchiFleKs This may not be easy to pinpoint. Have you tried installing only Calico as a test?
And what is the original version of Velero and the version you upgraded to?

ArchiFleKs commented Jun 7, 2023 via email

BGOtura commented Jun 8, 2023

I have the same issue in a cluster deployed with just Calico as well. Is it possible that the GC is prevented from running properly due to a problem with the Calico installation?

W0608 15:27:27.043907       1 reflector.go:533] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:106: failed to list *v1.PartialObjectMetadata: connection is unauthorized: bgpfilters.crd.projectcalico.org is forbidden: User "system:serviceaccount:calico-apiserver:calico-apiserver" cannot list resource "bgpfilters" in API group "crd.projectcalico.org" at the cluster scope
E0608 15:27:27.043983       1 reflector.go:148] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:106: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: connection is unauthorized: bgpfilters.crd.projectcalico.org is forbidden: User "system:serviceaccount:calico-apiserver:calico-apiserver" cannot list resource "bgpfilters" in API group "crd.projectcalico.org" at the cluster scope
E0608 15:27:36.855518       1 shared_informer.go:314] unable to sync caches for garbage collector
E0608 15:27:36.855602       1 garbagecollector.go:261] timed out waiting for dependency graph builder sync during GC sync (attempt 17)
I0608 15:27:36.955759       1 shared_informer.go:311] Waiting for caches to sync for garbage collector
E0608 15:28:06.957220       1 shared_informer.go:314] unable to sync caches for garbage collector
E0608 15:28:06.957424       1 garbagecollector.go:261] timed out waiting for dependency graph builder sync during GC sync (attempt 18)
I0608 15:28:07.052826       1 shared_informer.go:311] Waiting for caches to sync for garbage collector

qiuming-best self-assigned this Jun 9, 2023
ArchiFleKs (Author) commented:

Maybe we should continue this issue in the Calico repository. Regarding the bgpfilters, it is an issue with the operator's RBAC not including bgpfilters in its roles. But editing the RBAC manually does not fix the issue.

na4ma4 commented Jun 15, 2023

But it does not fix the issue when editing RBAC manually.

I found this to work (except that the Calico API server will clobber the changes and break it again).

kubectl get clusterroles/calico-crds -o json | jq '.rules[] |= ( if ( (.apiGroups | index("crd.projectcalico.org")) and (.resources | index("bgpfilters") | not) ) then .resources += [ "bgpfilters" ] else . end )' | kubectl apply -f -
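A quick way to verify whether the added rule actually takes effect (a sketch only, reusing the service account from the error messages above):

kubectl auth can-i list bgpfilters.crd.projectcalico.org --as=system:serviceaccount:calico-apiserver:calico-apiserver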

ArchiFleKs changed the title from "Issue with crontroller manager when deploying Velero on a fresh 1.26/1.27 EKS cluster" to "Issue with controller manager when deploying Velero on a fresh 1.26/1.27 EKS cluster" on Jul 4, 2023
pradeepkchaturvedi added the 1.13-candidate label (issue/pr that should be considered to target v1.13 minor release) on Aug 4, 2023
reasonerjt (Contributor) commented:

Per the latest comment, the root cause is the roles installed by Calico; closing this issue.
