Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clusterresolver: comply with A37 for handling errors from discovery mechanisms #6461

Merged
merged 5 commits into from Jul 24, 2023

Conversation

zasweq
Copy link
Contributor

@zasweq zasweq commented Jul 20, 2023

Fixes #6462. This PR fixes cluster_resolver error handing. The desired behavior as per gRFC A37 is to create a priority and send down an empty address list if an EDS Discovery Mechanism or DNS Discovery Mechanism errors before a first good update, triggering Transient Failure for that priority and causing fallback if another lower level priority. In master, we simply don't create the priority child in the EDS case and send down a resolver error in the DNS case.

RELEASE NOTES: N/A

@zasweq zasweq requested a review from easwars July 20, 2023 16:33
@zasweq zasweq added this to the 1.57 Release milestone Jul 20, 2023
@@ -190,7 +190,19 @@ func buildClusterImplConfigForEDS(g *nameGenerator, edsResp xdsresource.Endpoint
})
}

priorities := groupLocalitiesByPriority(edsResp.Localities)
var priorities [][]xdsresource.Locality
// Triggered by an EDS error before update, or empty localities list in a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or resource-not-found error?

Copy link
Contributor Author

@zasweq zasweq Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, EDS error meant resource not found || NACK. I'll change language to be explicit about these two error cases though.

// data, and it has not previously reported any results, it should
// report a result that is a single priority with no endpoints." - A37
priorities = make([][]xdsresource.Locality, 1)
priorities[0] = make([]xdsresource.Locality, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

priorities := [][]xdsresource.Locality{{}}

Since the priorities var is defined as a slice of slice of xdsresource.Locality, the compiler can figure out the type of the elements in the slice (when defined statically) without them having to be mentioned explicitly. So, in the above statement {} indicates an zero-value xdsresource.Locality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the {} be a zero value of []xdsresource.Locality? Thus setting to nil rather than an empty slice? I don't think it matters though, I forget why I didn't want this to be nil I'll try switching it over.

dr.topLevelResolver.onError(err)
dr.mu.Lock()
// Don't report any errors after first error or good address update.
if dr.addrs != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be if !dr.updateReceived { ... } instead?

Copy link
Contributor Author

@zasweq zasweq Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. It can be. I was debating whether using the addrs or the bool. I guess if you don't like the addrs := make(resolver.Address, 0) call I can use the bool.

@@ -142,7 +142,17 @@ func (dr *dnsDiscoveryMechanism) ReportError(err error) {
dr.logger.Infof("DNS discovery mechanism for resource %q reported error: %v", dr.target, err)
}

dr.topLevelResolver.onError(err)
dr.mu.Lock()
// Don't report any errors after first error or good address update.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it important to suppress subsequent errors? It probably doesn't have any downstream effects today, but are there any other good reasons?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm btw I talked to Doug yesterday about this in our 1:1. The channel eats errors from the resolver if it's already received a good address update, however he mentioned it's also fine to add an extra eat for that scenario at this level. My logic was to suppress since no downstream effects, and also to keep it similar to how you handled subsequent errors in eds resolver (write an empty update, next error returns early).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can be a little more verbose in this comment. Maybe this is too verbose, but something to indicate why suppressing the errors is beneficial here:

// If a previous good update was received, suppress the error and continue using the
// previous update. If RPCs were succeeding prior to this, they will continue to do so.
// Also suppress errors if we previously received an error, since there will be no
// downstream effects of propagating this error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion. Used your comment.

dr.mu.Unlock()
return
}
dr.addrs = make([]string, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dr.addrs = nil is more readable, and avoids an unnecessary call to make.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per comment above, I did this to make the conditional at beginning gating subsequent calls on addrs based off this addrs struct. However, I don't actually think you need to set it to nil here. The only way it hits this codeblock is if it's an error before update, in which case the addrs will be nil already. In the case that the addrs get set to non nil, it won't hit this codeblock. I'll set it anyway though. Also, I was scared that the getAddrs() switch won't see the type as []string if it's nil, but thinking about it I think it will.

dr.updateReceived = true
dr.mu.Unlock()

dr.topLevelResolver.onUpdate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are calling onUpdate for both good updates and errors, we no longer need the onError method on the topLevelResolver. EDS discovery mechanism does not use it either. So, you can delete the implementation of that method in resourceResolver and remove that method from the topLevelResolver interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm ok. Will delete.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually being called in DNS Discovery Mechanism Construction errors, but as per the back and forth in chat I need to switch this to do same thing DNS errors before update do - create the priority and send TF down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this in Discovery Mechanism Constructor same way as OnError - set updateReceived to true and then call onUpdate on the top level resolver. Let me know if you want to change the bool (I need to set it for lastUpdate() to return true).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you want to change the bool

What do you mean by this? I'm good with how it is looking now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the construction to also set this bool and trigger TF. Thus, I don't know if the naming is appropriate. Has it received an update, or is it just triggering an error? I'm fine leaving as is also though.

@easwars easwars assigned zasweq and unassigned easwars Jul 21, 2023
@zasweq
Copy link
Contributor Author

zasweq commented Jul 21, 2023

Putting this here so I don't forget; I also need to add a test/codepath for errors in construction of DNS Discovery Mechanism before update triggering TF.

@@ -75,13 +75,15 @@ func newDNSResolver(target string, topLevelResolver topLevelResolver, logger *gr
}
u, err := url.Parse("dns:///" + target)
if err != nil {
topLevelResolver.onError(fmt.Errorf("failed to parse dns hostname %q in clusterresolver LB policy", target))
ret.updateReceived = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably throw a warning log here and in the next error case. I would even be OK with an error log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep it consistent with the error logs already in codebase (from ReportError), switched to Info log.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Log lines should be capitalized, unlike error strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting, noted. Switched.

@@ -142,7 +142,17 @@ func (dr *dnsDiscoveryMechanism) ReportError(err error) {
dr.logger.Infof("DNS discovery mechanism for resource %q reported error: %v", dr.target, err)
}

dr.topLevelResolver.onError(err)
dr.mu.Lock()
// Don't report any errors after first error or good address update.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can be a little more verbose in this comment. Maybe this is too verbose, but something to indicate why suppressing the errors is beneficial here:

// If a previous good update was received, suppress the error and continue using the
// previous update. If RPCs were succeeding prior to this, they will continue to do so.
// Also suppress errors if we previously received an error, since there will be no
// downstream effects of propagating this error.

dr.updateReceived = true
dr.mu.Unlock()

dr.topLevelResolver.onUpdate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you want to change the bool

What do you mean by this? I'm good with how it is looking now.

@easwars
Copy link
Contributor

easwars commented Jul 21, 2023

Putting this here so I don't forget; I also need to add a test/codepath for errors in construction of DNS Discovery Mechanism before update triggering TF.

Haven't reviewed the tests yet. Will wait for this.
The code LGTM, btw.

@zasweq
Copy link
Contributor Author

zasweq commented Jul 21, 2023

Oh I finished that, forgot to note (the error in construction). I didn't write a test for it since there was no tests for master and it seems like as per xDS chat we'll be adding validations in xDS Client unmarshaling that makes this impossible to hit (Mark has a pending PR). Right now, switched to what c-core does which is to trigger this TF on errors in DNS Discovery Mechanism construction.

@zasweq zasweq assigned easwars and unassigned zasweq Jul 21, 2023
@easwars easwars changed the title xds/internal/balancer/clusterresolver: Fix cluster_resolver error handling clusterresolver: comply with A37 for handling errors from discovery mechanisms Jul 21, 2023
Comment on lines 193 to 204
var priorities [][]xdsresource.Locality
// Triggered by an NACK or resource-not-found error before update, or a
// empty localities list in a update. In either case want to create a
// priority, and send down empty address list, causing TF for that priority.
if len(edsResp.Localities) == 0 {
// "If any discovery mechanism instance experiences an error retrieving
// data, and it has not previously reported any results, it should
// report a result that is a single priority with no endpoints." - A37
priorities = [][]xdsresource.Locality{{}}
} else {
priorities = groupLocalitiesByPriority(edsResp.Localities)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit. You can shorten this even further:

priorities := [][]xdsresource.Locality{{}}
if len(edsResp.Localities) != 0 {
	priorities = groupLocalitiesByPriority(edsResp.Localities)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion. Switched.

// cluster. Also configure an empty endpoints resource for the EDS cluster
// that contains no endpoints.
// cluster. Also configure an endpoints resource for the EDS cluster which
// triggers an NACK.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry: s/an NACK/a NACK/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hahaha, switched.

Comment on lines 776 to 788
for ; ctx.Err() == nil; <-time.After(defaultTestShortTimeout) {
peer := &peer.Peer{}
if _, err := client.EmptyCall(ctx, &testpb.Empty{}, grpc.Peer(peer)); err != nil {
t.Logf("EmptyCall() failed: %v", err)
continue
}
if peer.Addr.String() == addrs[0].Addr {
break
}
}
if ctx.Err() != nil {
t.Fatalf("Timeout when waiting for RPCs to be routed to backend %q in the DNS cluster", addrs[0].Addr)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that worked :). Note that the other tests in this file don't use this helper (since you always rewrite these haha :D).

@@ -75,13 +75,15 @@ func newDNSResolver(target string, topLevelResolver topLevelResolver, logger *gr
}
u, err := url.Parse("dns:///" + target)
if err != nil {
topLevelResolver.onError(fmt.Errorf("failed to parse dns hostname %q in clusterresolver LB policy", target))
ret.updateReceived = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Log lines should be capitalized, unlike error strings.

@zasweq zasweq merged commit 2aa2615 into grpc:master Jul 24, 2023
11 checks passed
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix cluster resolver error handling
2 participants