clusterresolver: push empty config to child policy upon removal of cluster resource #6125
Commits compared: b286c3c to e869163
start := time.Now()
end := start.Add(time.Second)
NIT: maybe merge these two lines into one:
end := time.Now().Add(time.Second)
Done.
}

// Ensure that the EDS watch is not canceled.
sCtx, sCancel := context.WithTimeout(ctx, defaultTestShortTimeout)
Is defaultTestShortTimeout (10ms) a long enough wait here? How about moving this check into the RPC for loop below, and erroring if we receive anything on the edsResourceCanceledCh channel within that 1s?
That way we may be able to remove the sCtx created here.
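A minimal sketch of the suggestion above, assuming the test keeps issuing RPCs for a fixed window and should fail as soon as anything arrives on the cancellation channel. The helper name checkNotCanceled, its parameters, and the doRPC callback are hypothetical and are not taken from the PR; only the idea of polling the channel inside the RPC loop comes from the comment.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// checkNotCanceled (hypothetical helper) calls doRPC repeatedly for duration d.
// It returns an error as soon as anything is received on canceledCh, or as
// soon as an RPC fails. It returns nil if the whole window passes cleanly.
func checkNotCanceled(canceledCh <-chan struct{}, d time.Duration, doRPC func() error) error {
	end := time.Now().Add(d)
	for time.Now().Before(end) {
		// Non-blocking check: a receive here means the watch was canceled.
		select {
		case <-canceledCh:
			return errors.New("watch was canceled unexpectedly")
		default:
		}
		if err := doRPC(); err != nil {
			return fmt.Errorf("RPC failed: %w", err)
		}
		time.Sleep(time.Millisecond)
	}
	return nil
}
```

Because the channel is polled on every iteration, the check covers the full window rather than a single short timeout, so a separate short-lived context may no longer be needed.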
Done.
select {
case <-rr.updateChannel:
default:
}
This makes me wonder whether it's possible for this to race with another source? If not, maybe a comment about this?
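To illustrate the concern, here is a sketch of the drain-then-send pattern that the quoted select resembles. The type of rr.updateChannel is not shown in this thread, so the sketch assumes a channel with capacity 1 carrying int values, and the function name pushLatest is hypothetical.

```go
package main

// pushLatest (hypothetical) stores v in ch, discarding any update that has not
// been read yet. It assumes ch has capacity 1 and a single sending goroutine.
// With several concurrent senders, another goroutine could refill the buffer
// between the drain and the send, and the final send could then block.
func pushLatest(ch chan int, v int) {
	select {
	case <-ch: // drop the stale, unread update
	default: // nothing buffered, nothing to drop
	}
	ch <- v
}
```

With a single sender the final send cannot block, since a concurrent reader can only free up space. That is the invariant a code comment would need to state if no other source writes to the channel.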
Thanks @dfawley. Looks like there is a small race possible in the EDS resource resolver. I will fix that and ping the PR again.
Done. PTAL. Thanks.
LGTM
Cool, glad my comment caught a potential problem!
When the cluster resource associated with a CDS LB policy is removed, the CDS LB policy propagates this error to its child policy (clusterresolver LB), which stops the associated EDS watch. However, the clusterresolver LB policy was not sending a config update with no endpoints to its own child (priority LB). This meant that there was no picker update from the leaf of the LB policy tree, so the subConns associated with backends in the cluster remained active, and RPCs to the deleted cluster continued to succeed.
This PR fixes this by ensuring that the clusterresolver LB policy sends an empty config update to its child when the cluster resource is removed. This ensures that the child policies are cleaned up and the subConns are removed, so RPCs to the removed cluster start to fail.
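The effect of the fix can be sketched with a toy model. Every type and method below is a hypothetical stand-in and not gRPC's real balancer API. The sketch only shows why pushing an empty config down the tree makes picks fail, whereas sending nothing leaves stale endpoints in use.

```go
package main

import "errors"

// childPolicy (hypothetical) models a leaf policy that picks among the
// endpoints it was last configured with.
type childPolicy struct {
	endpoints []string
}

// UpdateConfig replaces the endpoint list. An empty list leaves nothing to pick.
func (c *childPolicy) UpdateConfig(endpoints []string) {
	c.endpoints = endpoints
}

// Pick returns an endpoint for an RPC, or an error when none are available.
func (c *childPolicy) Pick() (string, error) {
	if len(c.endpoints) == 0 {
		return "", errors.New("no endpoints available")
	}
	return c.endpoints[0], nil
}

// parentPolicy (hypothetical) models the policy that watches the resource and
// configures its child.
type parentPolicy struct {
	child *childPolicy
}

// OnResourceRemoved pushes an empty config so the child stops using the
// endpoints of the removed resource.
func (p *parentPolicy) OnResourceRemoved() {
	p.child.UpdateConfig(nil)
}
```

In this model, skipping the UpdateConfig(nil) call in OnResourceRemoved reproduces the bug: Pick keeps returning an endpoint of the removed resource.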
Also fixes #6083, since this PR replaces the flaky test with an e2e-style test.
RELEASE NOTES: