Excessive DNS caching for DNS verification #3743

Closed
timothyclarke opened this issue Mar 5, 2021 · 21 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@timothyclarke

Describe the bug:
cert-manager caches DNS entries for longer than the DNS TTL.

Expected behaviour:
cert-manager should not cache DNS responses.

Steps to reproduce the bug:

This was encountered when the parent domain was hosted in Cloudflare and subdomains were delegated (via NS records) to other nameservers. That delegation was later removed and replaced with A records.

  1. Create a domain and delegate subdomains. Keep the TTL low
    e.g. www.subdomain.example.com
  2. Attempt to validate subdomains via the DNS method (This should fail as the subdomain is elsewhere)
  3. Remove delegation (NS records) for subdomain.example.com from the parent domain
  4. Create A records for www.subdomain.example.com and subdomain.example.com in the parent domain
  5. Remove the secret / certificate request to re-attempt certificate validation by the DNS01 method

Anything else we need to know?:
I suspect the DNS caching would also manifest if you attempted to validate certificates before the domain had been transferred to its correct location, e.g. a domain purchased at GoDaddy and later transferred to Cloudflare, where DNS01 validation against Cloudflare is attempted between the registration and the transfer.

We were transferring sites in from a 3rd-party hosting provider. The 3rd party used DNS delegation (NS entries in the parent domain) to manage the A records at their end. When the initial cert validation went in, the NS records were in place (with 300s TTLs in the parent domain). Those NS records were then replaced with A records in the parent domain.
48 hours later, all DNS requests (dig) both inside and outside the cluster were returning correct results; however, cert-manager would not validate the cert, failing with the following error (real domain name swapped to example.com):

cert-manager/controller/challenges "msg"="re-queuing item  due to error processing" "error"="Zone jp.example.com. not found in CloudFlare for domain _acme-challenge.jp.example.com." "key"="redirects/example.com-tls-3737300376-3017021645-504325998"

All other domains (e.g. example.net, example.org) were working correctly. When I deleted the cert-manager pod and allowed Kubernetes to spawn a new one, the cert was issued within a few minutes.

Environment details:

  • Kubernetes version: 1.19.7
  • Cloud-provider/provisioner: AWS / KOPS (1.19.0)
  • cert-manager version: v0.14.0
  • Install method: e.g. helm

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 5, 2021
@osddeitf

I'm facing this issue too; see #3776.

@osddeitf

It would make sense for cert-manager not to cache DNS resolutions, and I don't think that would affect performance. If I'm wrong, perhaps there should be an option such as a configurable TTL for cached DNS resolutions. Sometimes we cannot afford to wait that long for Certificates to become ready.

@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 16, 2021
@timothyclarke
Author

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 16, 2021
@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2021
@timothyclarke
Author

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2021
@maelvls
Member

maelvls commented Dec 15, 2021

In Go, the net.Dial function delegates the DNS lookups to the OS. In Kubernetes, containers are usually configured with CoreDNS as their resolver in /etc/resolv.conf.

In a fresh Kind cluster, the /etc/resolv.conf file in the cert-manager pod looks like this (10.96.0.10 is the service IP for CoreDNS):

search default.svc.cluster.local svc.cluster.local cluster.local tailnet-bb86.ts.net cert-manager.org.github.beta.tailscale.net
nameserver 10.96.0.10
options ndots:5

The DNS queries are not cached locally in the cert-manager container.

The unexpectedly high TTL probably comes from some caching in CoreDNS.
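
For anyone who wants to check this from Go, here is a minimal sketch (not cert-manager code; the 10.96.0.10 address comes from the resolv.conf above and example.com is a placeholder) comparing what the OS-configured resolver returns with a resolver pinned to a specific nameserver:

// Minimal sketch: compare the OS-configured resolver (CoreDNS via
// /etc/resolv.conf in a pod) with a resolver pinned to a specific nameserver.
// The 10.96.0.10 service IP and example.com are placeholders.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Default resolver: Go delegates the lookup to the OS configuration,
	// i.e. whatever nameserver /etc/resolv.conf points at.
	addrs, err := net.DefaultResolver.LookupHost(context.Background(), "example.com")
	fmt.Println("default resolver:", addrs, err)

	// Pinned resolver: bypasses /etc/resolv.conf and asks 10.96.0.10 directly,
	// useful to see whether an unexpected answer comes from that hop.
	pinned := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, "10.96.0.10:53")
		},
	}
	addrs, err = pinned.LookupHost(context.Background(), "example.com")
	fmt.Println("pinned resolver:", addrs, err)
}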

Which resolver are you using?

@timothyclarke
Author

timothyclarke commented Dec 15, 2021

I cannot remember whether the cluster was using CoreDNS when this issue was raised. However, I can guarantee that it wasn't caching in CoreDNS, because I ran queries against the cluster's resolvers using dig, waited over a day, and tried again. Both times they returned the results I expected, but cert-manager kept giving the error above.

I removed the pod, it respawned onto the same Kubernetes node, and the cert was issued, so something specific to the cert-manager pod was holding onto a stale value.

What you describe puts cert-manager in a potentially bad position. cert-manager relies on DNS records being set and correctly propagated in order to function, and it implicitly relies on DNS misses having short TTLs. If there are unknown layers caching those responses, particularly negative responses, then cert-manager querying too early is going to cause additional load elsewhere.

@maelvls
Member

maelvls commented Dec 15, 2021

You are correct; I just realized that we have our own DNS client when it comes to DNS-01 challenges. Our ad-hoc client is in DNSQuery.

Then I realized that we do cache SOA queries:

https://github.com/jetstack/cert-manager/blob/dffbf391dbb0fc6c1cfea62e561a9c6f54362ab0/pkg/issuer/acme/dns/util/wait.go#L326-L331

This is definitely the cause of your issue. I do not know the rationale for caching these calls. The caching mechanism seems to have been introduced back in 2017 in https://github.com/jetstack/cert-manager/pull/11/files.

Possible remediations:

  1. Remove caching.
  2. Add a TTL mechanism to the cache (a rough sketch follows below).

@munnerz Do you think disabling caching would affect performance?
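
To illustrate option 2, here is a rough sketch of a TTL-bound SOA cache. This is not the existing cert-manager implementation; the names (soaCache, soaCacheEntry) and the idea of sourcing the TTL from configuration are assumptions for illustration only.

// Rough sketch of a per-entry TTL cache for SOA lookups (illustrative only).
package main

import (
	"fmt"
	"sync"
	"time"
)

type soaCacheEntry struct {
	zone      string    // zone apex reported by the SOA record
	expiresAt time.Time // when this cached answer must be re-resolved
}

type soaCache struct {
	mu      sync.Mutex
	ttl     time.Duration // e.g. the record's own TTL, or a value from a flag
	entries map[string]soaCacheEntry
}

// get returns the cached zone for fqdn, dropping entries whose TTL has expired
// so a fresh SOA query is forced instead of serving a stale delegation.
func (c *soaCache) get(fqdn string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[fqdn]
	if !ok || time.Now().After(e.expiresAt) {
		delete(c.entries, fqdn)
		return "", false
	}
	return e.zone, true
}

// put stores the zone for fqdn with an expiry stamped from the configured TTL.
func (c *soaCache) put(fqdn, zone string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[fqdn] = soaCacheEntry{zone: zone, expiresAt: time.Now().Add(c.ttl)}
}

func main() {
	c := &soaCache{ttl: 5 * time.Minute, entries: map[string]soaCacheEntry{}}
	c.put("_acme-challenge.jp.example.com.", "example.com.")
	zone, ok := c.get("_acme-challenge.jp.example.com.")
	fmt.Println(zone, ok) // served from cache until the 5-minute TTL expires
}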

@timothyclarke
Author

Possible remediations (additional):
3. Provide a mechanism for the caching to be controlled by args, e.g. --max-cache-ttl 60m

@michaeljguarino

I've been hitting this issue as well. It seems like a DNS-01 solver should bias towards correctness rather than performance, since it is only checking that a temporary token is present in a TXT record, on an essentially one-off basis.

@norman-zon

We have issues with this behaviour too.
I second the idea that correctness beats performance in this scenario.

@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2022
@dedene

dedene commented May 2, 2022

I'm also observing this behaviour and am not sure why. DNS-01 validation takes over 30 minutes with the message "propagation check failed", even though I can confirm the TXT record is immediately available on Cloudflare.

@maelvls
Member

maelvls commented May 2, 2022

Side note: we are considering revamping our ad-hoc DNS client used for self-checks. It emulates what a DNS resolver would do, which seems overkill for cert-manager's purposes. I have started describing this in "Why I think dns-over-https doesn't need finding authoritative nameservers, nor following CNAME records" (part of #5003).

I think we should drastically simplify the self-check DNS client, but that will take a while since we can’t break anyone.

@munnerz
Member

munnerz commented May 3, 2022

@maelvls the one thing to call out here, and the reason we query authoritative nameservers instead of recursive ones, is so that we are not at the mercy of a recursive nameserver's caching TTL.

That's the entire purpose of the SOA record traversal. With respect to DNS over HTTPS, that may be something we should only do via a recursive NS; however, we will then see delays due to caching at the recursive nameserver we're relying on.
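
For readers unfamiliar with that traversal, here is a simplified sketch of the idea using github.com/miekg/dns. It is not the actual code in wait.go; the resolver address 1.1.1.1:53 and the domain are placeholders. It walks up the name label by label until a SOA answer identifies the zone apex, whose primary nameserver can then be queried directly, bypassing recursive-resolver caching.

// Simplified sketch of SOA-record traversal (illustrative, not cert-manager's code).
package main

import (
	"fmt"
	"strings"

	"github.com/miekg/dns"
)

// findZoneApex strips labels off fqdn until a SOA answer is returned,
// yielding the zone apex and its primary nameserver (the SOA MNAME).
func findZoneApex(fqdn, resolver string) (zone, primaryNS string, err error) {
	c := new(dns.Client)
	labels := dns.SplitDomainName(fqdn)
	for i := range labels {
		candidate := dns.Fqdn(strings.Join(labels[i:], "."))
		m := new(dns.Msg)
		m.SetQuestion(candidate, dns.TypeSOA)
		in, _, err := c.Exchange(m, resolver)
		if err != nil {
			return "", "", err
		}
		for _, rr := range in.Answer {
			if soa, ok := rr.(*dns.SOA); ok {
				return soa.Hdr.Name, soa.Ns, nil
			}
		}
	}
	return "", "", fmt.Errorf("no SOA record found for %s", fqdn)
}

func main() {
	// Once the apex is known, the TXT self-check can be sent to primaryNS:53
	// instead of a recursive resolver, avoiding its cached (possibly stale) view.
	zone, ns, err := findZoneApex("_acme-challenge.jp.example.com.", "1.1.1.1:53")
	fmt.Println(zone, ns, err)
}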

@maelvls
Member

maelvls commented May 3, 2022

Thanks for the clarification!

Isn't the TTL caching issue a non-issue, knowing that the self-check will be performed again if it receives an NXDOMAIN response? I think @wallrj was making this point the other day, and I agreed with him.

@munnerz
Member

munnerz commented May 3, 2022

There's no guarantee an NXDOMAIN response is returned (there could be an existing record already in place for that name), and additionally, not all resolvers handle NXDOMAIN responses correctly.

This behaviour is also possible today using the --dns01-recursive-nameservers-only flag, for what it's worth 😊 but it will almost certainly result in validation taking longer to complete.

@jetstack-bot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 2, 2022
@jetstack-bot
Contributor

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

@jetstack-bot
Contributor

@jetstack-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
