Excessive DNS caching for DNS verification #3743

Closed
timothyclarke opened this issue Mar 5, 2021 · 21 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@timothyclarke

Describe the bug:
cert-manager caches DNS entries for longer than the DNS TTL.

Expected behaviour:
cert-manager should not cache DNS responses.

Steps to reproduce the bug:

This was encountered when the parent domain was hosted in Cloudflare and subdomains were delegated (via NS records) to other nameservers. That delegation was later removed and replaced with A records.

  1. Create a domain and delegate subdomains. Keep the TTL low
    e.g. www.subdomain.example.com
  2. Attempt to validate subdomains via the DNS method (This should fail as the subdomain is elsewhere)
  3. Remove delegation (NS records) for subdomain.example.com from the parent domain
  4. Create A records for www.subdomain.example.com and subdomain.example.com in the parent domain
  5. Remove the secret / certificate request to re-attempt certificate validation by the DNS01 method

Anything else we need to know?:
I suspect the DNS caching would also manifest if you attempted to validate certificates before the domain had been transferred to its correct location, e.g. a domain purchased at GoDaddy and later transferred to Cloudflare, where DNS01 validation against Cloudflare is attempted between the registration and the transfer.

We were transferring sites in from a 3rd-party hosting provider. The 3rd party used DNS delegation (NS entries in the parent domain) to manage the A records at their end. When the initial cert validation went in, the NS records were in place (with 300s TTLs in the parent domain). Those NS records were then replaced with A records in the parent domain.
48 hours later, all DNS requests (dig) both inside and outside the cluster were returning correct results; however, cert-manager would not validate the cert, failing with the following error (real domain name swapped to example.com):

cert-manager/controller/challenges "msg"="re-queuing item  due to error processing" "error"="Zone jp.example.com. not found in CloudFlare for domain _acme-challenge.jp.example.com." "key"="redirects/example.com-tls-3737300376-3017021645-504325998"

All other domains (e.g. example.net, example.org) were working correctly. When I deleted the cert-manager pod and allowed Kubernetes to spawn a new one, the cert was issued within a few minutes.

Environment details:

  • Kubernetes version: 1.19.7
  • Cloud-provider/provisioner: AWS / KOPS (1.19.0)
  • cert-manager version: v0.14.0
  • Install method: e.g. helm

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 5, 2021
@osddeitf

I'm facing this issue too; see #3776.

@osddeitf

It would make sense for cert-manager not to cache DNS resolutions, and I don't think that would affect performance. If I'm wrong, perhaps there should be an option such as a configurable TTL for cached DNS resolutions. Sometimes we cannot afford to wait that long for Certificates to become ready.

@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 16, 2021
@timothyclarke
Author

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 16, 2021
@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2021
@timothyclarke
Author

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2021
@maelvls
Member

maelvls commented Dec 15, 2021

In Go, the net.Dial function delegates the DNS lookups to the OS. In Kubernetes, containers are usually configured with CoreDNS as their resolver in /etc/resolv.conf.

In a fresh Kind cluster, the /etc/resolv.conf file in the cert-manager pod looks like this (10.96.0.10 is the service IP for CoreDNS):

search default.svc.cluster.local svc.cluster.local cluster.local tailnet-bb86.ts.net cert-manager.org.github.beta.tailscale.net
nameserver 10.96.0.10
options ndots:5

The DNS queries are not cached locally in the cert-manager container.

The unexpectedly high TTL probably comes from some caching in CoreDNS.
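
For anyone who wants to check this from Go, here is a minimal sketch (not cert-manager code; the 10.96.0.10 address comes from the resolv.conf above and example.com is a placeholder) comparing what the OS-configured resolver returns with a resolver pinned to a specific nameserver:

// Minimal sketch: compare the OS-configured resolver (CoreDNS via
// /etc/resolv.conf in a pod) with a resolver pinned to a specific nameserver.
// The 10.96.0.10 service IP and example.com are placeholders.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Default resolver: Go delegates the lookup to the OS configuration,
	// i.e. whatever nameserver /etc/resolv.conf points at.
	addrs, err := net.DefaultResolver.LookupHost(context.Background(), "example.com")
	fmt.Println("default resolver:", addrs, err)

	// Pinned resolver: bypasses /etc/resolv.conf and asks 10.96.0.10 directly,
	// useful to see whether an unexpected answer comes from that hop.
	pinned := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, "10.96.0.10:53")
		},
	}
	addrs, err = pinned.LookupHost(context.Background(), "example.com")
	fmt.Println("pinned resolver:", addrs, err)
}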

Which resolver are you using?

@timothyclarke
Author

timothyclarke commented Dec 15, 2021

I cannot remember whether the cluster was using CoreDNS when this issue was raised. However, I can guarantee that it wasn't caching in CoreDNS, because I ran queries against the cluster's resolvers using dig, waited over a day, and tried again. Both times they returned the results I expected, but cert-manager kept giving the error above.

I removed the pod, it respawned onto the same Kubernetes node, and the cert was issued, so something specific to the cert-manager pod was holding onto a stale value.

What you describe puts cert-manager in a potentially bad position. cert-manager relies on DNS records being set and correctly propagated in order to function, and it implicitly relies on DNS misses having short TTLs. If there are unknown layers caching those responses, particularly negative responses, then cert-manager querying too early is going to cause additional load elsewhere.

@maelvls
Member

maelvls commented Dec 15, 2021

You are correct; I just realized that we have our own DNS client when it comes to DNS-01 challenges. Our ad-hoc client is in DNSQuery.

Then I realized that we do cache SOA queries:

https://github.com/jetstack/cert-manager/blob/dffbf391dbb0fc6c1cfea62e561a9c6f54362ab0/pkg/issuer/acme/dns/util/wait.go#L326-L331

This is definitely the cause of your issue. I do not know the rationale for caching these calls. The caching mechanism seems to have been introduced back in 2017 in https://github.com/jetstack/cert-manager/pull/11/files.

Possible remediations:

  1. Remove caching.
  2. Add a TTL mechanism to the cache (a rough sketch follows below).

@munnerz Do you think disabling caching would affect performance?
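
To illustrate option 2, here is a rough sketch of a TTL-bound SOA cache. This is not the existing cert-manager implementation; the names (soaCache, soaCacheEntry) and the idea of sourcing the TTL from configuration are assumptions for illustration only.

// Rough sketch of a per-entry TTL cache for SOA lookups (illustrative only).
package main

import (
	"fmt"
	"sync"
	"time"
)

type soaCacheEntry struct {
	zone      string    // zone apex reported by the SOA record
	expiresAt time.Time // when this cached answer must be re-resolved
}

type soaCache struct {
	mu      sync.Mutex
	ttl     time.Duration // e.g. the record's own TTL, or a value from a flag
	entries map[string]soaCacheEntry
}

// get returns the cached zone for fqdn, dropping entries whose TTL has expired
// so a fresh SOA query is forced instead of serving a stale delegation.
func (c *soaCache) get(fqdn string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[fqdn]
	if !ok || time.Now().After(e.expiresAt) {
		delete(c.entries, fqdn)
		return "", false
	}
	return e.zone, true
}

// put stores the zone for fqdn with an expiry stamped from the configured TTL.
func (c *soaCache) put(fqdn, zone string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[fqdn] = soaCacheEntry{zone: zone, expiresAt: time.Now().Add(c.ttl)}
}

func main() {
	c := &soaCache{ttl: 5 * time.Minute, entries: map[string]soaCacheEntry{}}
	c.put("_acme-challenge.jp.example.com.", "example.com.")
	zone, ok := c.get("_acme-challenge.jp.example.com.")
	fmt.Println(zone, ok) // served from cache until the 5-minute TTL expires
}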

@timothyclarke
Author

Possible remediations (additional):
3. Provide a mechanism for the caching to be controlled by args, e.g. --max-cache-ttl 60m

@michaeljguarino

I've been hitting this issue as well. It seems like a DNS-01 solver should bias towards correctness rather than performance, since it is only checking that a temporary token is present in a TXT record, on an essentially one-off basis.

@norman-zon

We have issues with this behaviour too.
I second the idea that correctness beats performance in this scenario.

@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2022
@dedene

dedene commented May 2, 2022

I'm also observing this behaviour and am not sure why. DNS-01 validation takes over 30 minutes with the message "propagation check failed", even though I can confirm the TXT record is immediately available on Cloudflare.

@maelvls
Member

maelvls commented May 2, 2022

Side note: we are considering revamping our ad-hoc DNS client used for self-checks. It emulates what a DNS resolver would do, which seems overkill for cert-manager's purposes. I have started describing this in "Why I think dns-over-https doesn't need finding authoritative nameservers, nor following CNAME records" (part of #5003).

I think we should drastically simplify the self-check DNS client, but that will take a while since we can’t break anyone.

@munnerz
Member

munnerz commented May 3, 2022

@maelvls the one thing to call out here, and the reason we query authoritative nameservers instead of recursive ones, is so that we are not at the mercy of a recursive nameserver's caching TTL.

That's the entire purpose of the SOA record traversal. With respect to DNS over HTTPS, that may be something we should only do via a recursive NS; however, we will then see delays due to caching at the recursive nameserver we're relying on.
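
For readers unfamiliar with that traversal, here is a simplified sketch of the idea using github.com/miekg/dns. It is not the actual code in wait.go; the resolver address 1.1.1.1:53 and the domain are placeholders. It walks up the name label by label until a SOA answer identifies the zone apex, whose primary nameserver can then be queried directly, bypassing recursive-resolver caching.

// Simplified sketch of SOA-record traversal (illustrative, not cert-manager's code).
package main

import (
	"fmt"
	"strings"

	"github.com/miekg/dns"
)

// findZoneApex strips labels off fqdn until a SOA answer is returned,
// yielding the zone apex and its primary nameserver (the SOA MNAME).
func findZoneApex(fqdn, resolver string) (zone, primaryNS string, err error) {
	c := new(dns.Client)
	labels := dns.SplitDomainName(fqdn)
	for i := range labels {
		candidate := dns.Fqdn(strings.Join(labels[i:], "."))
		m := new(dns.Msg)
		m.SetQuestion(candidate, dns.TypeSOA)
		in, _, err := c.Exchange(m, resolver)
		if err != nil {
			return "", "", err
		}
		for _, rr := range in.Answer {
			if soa, ok := rr.(*dns.SOA); ok {
				return soa.Hdr.Name, soa.Ns, nil
			}
		}
	}
	return "", "", fmt.Errorf("no SOA record found for %s", fqdn)
}

func main() {
	// Once the apex is known, the TXT self-check can be sent to primaryNS:53
	// instead of a recursive resolver, avoiding its cached (possibly stale) view.
	zone, ns, err := findZoneApex("_acme-challenge.jp.example.com.", "1.1.1.1:53")
	fmt.Println(zone, ns, err)
}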

@maelvls
Member

maelvls commented May 3, 2022

Thanks for the clarification!

Isn't the TTL caching issue a non-issue, knowing that the self-check will be performed again if it receives an NXDOMAIN response? I think @wallrj was making this point the other day, and I agreed with him.

@munnerz
Member

munnerz commented May 3, 2022

There's no guarantee an NXDOMAIN response is returned (there could be an existing record already in place for that name), and additionally, not all resolvers handle NXDOMAIN responses correctly.

This behaviour is also possible today using the --dns01-recursive-nameservers-only flag, for what it's worth 😊 but it will almost certainly result in validation taking longer to complete.

@jetstack-bot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 2, 2022
@jetstack-bot
Contributor

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

@jetstack-bot
Contributor

@jetstack-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
