lrp: update LRP services with stale backends on agent restart #36036
Merged
+74 −0
Conversation
Force-pushed 11d1a6c to e1b7a47
/test
Force-pushed ea7c7ff to 1b16dab
/test
borkmann approved these changes on Nov 20, 2024
tklauser approved these changes on Nov 20, 2024
aspsk approved these changes on Nov 20, 2024
This commit fixes the issue where stale backends are associated with the LRP service after the agent restarts.

Cilium restores the service and backend cache from the BPF map and synchronizes it with the Kubernetes API server, assuming that UpsertService is called for each active service. During the sync period, Cilium keeps a list of restored backends that haven't been observed for each service, to prevent temporary connectivity loss during agent restarts. (See commit 920976a.)

After synchronization, an update is triggered for each service still associated with stale backends, allowing them to be removed. However, LRP services are not updated and remain associated with stale backends, because the ServiceCache cannot update LRP services; the LRP manager is responsible for updating them instead.

This issue arises if a CLRP is created during an agent restart. For example, consider a scenario where the following nodelocaldns CLRP is applied during agent startup:

    apiVersion: "cilium.io/v2"
    kind: CiliumLocalRedirectPolicy
    metadata:
      name: "nodelocaldns"
      namespace: kube-system
    spec:
      redirectFrontend:
        serviceMatcher:
          serviceName: kube-dns
          namespace: kube-system
      redirectBackend:
        localEndpointSelector:
          matchLabels:
            k8s-app: node-local-dns
        toPorts:
          - port: "53"
            name: dns
            protocol: UDP
          - port: "53"
            name: dns-tcp
            protocol: TCP

1) Cilium restores the kube-dns ClusterIP service and its backends (coredns) from the BPF map and synchronizes them with Kubernetes.
2) If the LRP manager calls UpsertService first, it retains coredns, adds node-local-dns as a backend, and updates the kube-dns service to an LRP-type service.
3) After synchronization, updates are triggered for all services. However, the LRP service is not updated, leaving stale backends associated with it.

To address this issue, this commit ensures that the LRP manager calls EnsureService to remove stale backends.

Signed-off-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
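For illustration, here is a minimal Go sketch of the idea behind the fix. Only UpsertService and EnsureService are names taken from the PR itself; Manager, ServiceID, isLRPService, and onSyncDone are hypothetical stand-ins for the actual Cilium internals, not the real implementation.

    package lrp

    // Hypothetical sketch of the fix, not the actual Cilium code.

    type ServiceID struct {
        Namespace string
        Name      string
    }

    type Manager struct {
        // Services taken over by a CiliumLocalRedirectPolicy; the
        // ServiceCache is not allowed to update these.
        lrpServices map[ServiceID]bool
    }

    func (m *Manager) isLRPService(id ServiceID) bool {
        return m.lrpServices[id]
    }

    // EnsureService recomputes the backend set for an LRP service and
    // upserts it, dropping restored backends that were never observed
    // again after the restart.
    func (m *Manager) EnsureService(id ServiceID) {
        // ... recompute backends from the policy, then UpsertService ...
    }

    // onSyncDone runs after the restored BPF state has been reconciled
    // with the Kubernetes API server. Regular services are refreshed via
    // the ServiceCache; LRP services must be re-ensured by the LRP
    // manager, which is what this PR adds.
    func (m *Manager) onSyncDone(staleServices []ServiceID) {
        for _, id := range staleServices {
            if m.isLRPService(id) {
                m.EnsureService(id)
            }
        }
    }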
Force-pushed 1b16dab to 3f24b8e
/test
joamaki approved these changes on Nov 21, 2024
Labels
affects/v1.14: This issue affects the v1.14 branch.
affects/v1.15: This issue affects the v1.15 branch.
area/lrp: Impacts Local Redirect Policy.
backport/author: The backport will be carried out by the author of the PR.
backport-done/1.16: The backport for Cilium 1.16.x for this PR is done.
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/bug: This PR fixes an issue in a previous release of Cilium.
How can we reproduce the issue?
Note that adding a time.Sleep before the UpsertService call in cilium/pkg/k8s/watchers/service.go (line 575 at commit 0648731) makes the issue easier to reproduce, since it lets the LRP manager call UpsertService first.
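As a hedged illustration of that reproduction hack: the type and function names below are assumptions, not the real code at that line; only the idea of sleeping before the watcher's UpsertService call comes from the note above.

    package watchers

    import "time"

    // K8sWatcher is a stand-in for the real watcher type; this sketches
    // the reproduction hack, not the actual service.go code.
    type K8sWatcher struct{}

    func (k *K8sWatcher) upsertK8sService( /* service, endpoints */ ) {
        // Artificial delay so the LRP manager wins the race and calls
        // UpsertService first, converting kube-dns into an LRP service
        // that the later sync-driven refresh then skips.
        time.Sleep(5 * time.Second)

        // k.svcManager.UpsertService(...) // the real upsert goes here
    }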