Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dial DNS lookup error not retried #2295

Closed
a-canya opened this issue Sep 29, 2023 · 8 comments
Closed

Dial DNS lookup error not retried #2295

a-canya opened this issue Sep 29, 2023 · 8 comments
Assignees
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@a-canya
Copy link
Contributor

a-canya commented Sep 29, 2023

Describe the bug

Using the standard retryer (retry.NewStandard()), I have observed that the following error is retried:

https response error StatusCode: 0, RequestID: , HostID: , request send failed, Put "https://MY_BUCKET.s3.us-east-1.amazonaws.com/MY_FILE?x-id=PutObject": dial tcp 1.2.3.4:443: i/o timeout

whereas this error is not retried:

https response error StatusCode: 0, RequestID: , HostID: , request send failed, Put "https://MY_BUCKET.s3.us-east-1.amazonaws.com/MY_FILE?x-id=PutObject": dial tcp: lookup MY_BUCKET.s3.us-east-1.amazonaws.com: i/o timeout

Expected Behavior

Both errors are retried.

Current Behavior

Error with DNS lookup is not retried.

Reproduction Steps

I don't know how to reproduce the error. Happened in production a few times.

Possible Solution

In https://github.com/aws/aws-sdk-go-v2/blob/main/aws/retry/retryable_error.go#L100, there is a cast to *net.DNSError which was introduced in #2083.

	case errors.As(err, &dnsError):
		// NXDOMAIN errors should not be retried
		retryable = !dnsError.IsNotFound && dnsError.IsTemporary

This has several issues. IsTemporary field is only guaranteed to be valid if true. There is also a method (Temporary()) which also considers that the error might be a time out (which I think might be my case). For reference see net.DNSError. In any case, if false the code should fall through to the other checks.

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

v1.21.0

Compiler and Version used

go version go1.20.2 linux/amd64

Operating System and version

alpine:latest

@a-canya a-canya added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 29, 2023
@a-canya
Copy link
Contributor Author

a-canya commented Sep 29, 2023

I'm working on a fix. I will create a PR when I can validate in prod that this fixes our issue.

@RanVaknin
Copy link
Contributor

Hi @a-canya ,

Thanks for reaching out.
While the two errors are in fact DNS related errors, they occur within different areas of the networking stack.

https response error StatusCode: 0, RequestID: , HostID: , request send failed, Put "https://MY_BUCKET.s3.us-east-1.amazonaws.com/MY_FILE?x-id=PutObject": dial tcp 1.2.3.4:443: i/o timeout

This error indicates that DNS resolution succeeds but not able to establish a connection. This is often for transient reasons. From an SDK design perspective we want to retry connection timeouts since they are often solved with a retry.

https response error StatusCode: 0, RequestID: , HostID: , request send failed, Put "https://MY_BUCKET.s3.us-east-1.amazonaws.com/MY_FILE?x-id=PutObject": dial tcp: lookup MY_BUCKET.s3.us-east-1.amazonaws.com: i/o timeout

This one is a DNS lookup failure, which often indicates more persistent issues than mere connection timeouts. Hence its not retryable by default.

The SDK tries to accommodate transient error by default, and not general networking errors. It aims to be an isolated layer on top of the standard go library for networking purposes and let customers configure their own networking stack errors (perhaps with a custom retry strategy implemented by the user)

Im inclined to close this issue but if you have any clarification / case for why this should be changed, I'd be happy to discuss it with the team.

Thanks,
Ran~

@RanVaknin RanVaknin self-assigned this Oct 2, 2023
@RanVaknin RanVaknin added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Oct 2, 2023
@lucix-aws
Copy link
Contributor

FYI @a-canya I've taken a look at your patch. I'm fine with taking the change to look at Temporary() instead of IsTemporary but I'm not sure I understand why the specific "is dns error" path needed to be broken out of the main switch expression - they seem behaviorally equivalent to me.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Oct 3, 2023
@lucix-aws lucix-aws added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Oct 3, 2023
@a-canya
Copy link
Contributor Author

a-canya commented Oct 3, 2023

Hi, thanks for you quick answers!

@RanVaknin

https response error StatusCode: 0, RequestID: , HostID: , request send failed, Put "https://MY_BUCKET.s3.us-east-1.amazonaws.com/MY_FILE?x-id=PutObject": dial tcp: lookup MY_BUCKET.s3.us-east-1.amazonaws.com: i/o timeout

This one is a DNS lookup failure, which often indicates more persistent issues than mere connection timeouts. Hence its not retryable by default.

I believe the eror is a connection timeout while doing a DNS lookup, which should be retryable.


@lucix-aws

I moved the check outside the switch because just looking at net.DNSError we might not be able to determine if the error is retryable, and in this case it's better to keep trying other options than to just assume that it is not retryable.

The log I got shows an error that wraps a long chain of errors of different types (apart from a net.DNSError, there is also a net.OpError).

If dnsError.Temporary() is false, error may or may not be temporary:

Temporary reports whether the DNS error is known to be temporary. This is not always known; a DNS lookup may fail due to a temporary error and return a DNSError for which Temporary returns false. (source)


My teammate is trying to recreate the same situation where this error occurred with increased logs.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Oct 4, 2023
@lucix-aws
Copy link
Contributor

OK I see that now. There is definitely a subtle difference between terminating at retryable = dnsError.Temporary() and returning on it only if Temporary() is true since it being false isn't conclusive. Thanks for clarifying.

Your patch looks solid then if you'd like to PR it - I imagine we'd take it regardless of whether it actually fixes your issue, since you've made a real behavioral improvement regardless.

@a-canya
Copy link
Contributor Author

a-canya commented Oct 5, 2023

Hi, thanks @lucix-aws ! I will open the PR.

FYI the error happened again in our system and I managed to get more logs:

&net.OpError{Op:"dial", Net:"tcp", Source:net.Addr(nil), Addr:net.Addr(nil), Err:(*net.DNSError)(0xc025604840)}
&net.DNSError{Err:"i/o timeout", Name:"MY_BUCKET.s3.us-east-1.amazonaws.com", Server:"", IsTimeout:true, IsTemporary:false, IsNotFound:false}

As suspected, IsTemporary is false but IsTimeout is true.

@lucix-aws
Copy link
Contributor

Resolved by #2300

@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

3 participants