Excessive DNS caching for DNS verification #3743
Comments
I'm facing this issue too, as of #3776.
It'll make sense when
Issues go stale after 90d of inactivity.
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
/remove-lifecycle stale
In Go, the net.Dial function delegates DNS lookups to the OS. In Kubernetes, containers are usually configured with CoreDNS as their resolver via /etc/resolv.conf.
The DNS queries are not cached locally in the cert-manager container. The unexpectedly high TTL probably comes from some caching in CoreDNS. Which resolver are you using?
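To make the "CoreDNS as the pod's resolver" point concrete: a container discovers its resolver from /etc/resolv.conf, where kubelet writes the cluster DNS service IP. A minimal sketch of extracting those nameserver entries (illustrative only; the ClusterIP 10.96.0.10 below is an assumed example value):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// nameservers extracts the resolver addresses from resolv.conf-style
// content, which is how a pod discovers its CoreDNS service IP.
func nameservers(resolvConf string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(resolvConf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "nameserver" {
			out = append(out, fields[1])
		}
	}
	return out
}

func main() {
	// Example content as kubelet might render it (values are illustrative).
	conf := "search svc.cluster.local cluster.local\nnameserver 10.96.0.10\noptions ndots:5\n"
	fmt.Println(nameservers(conf)) // prints [10.96.0.10]
}
```

Any caching CoreDNS does at that address is invisible to a client that simply trusts the answer it gets back.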
When this issue was raised, I cannot remember if this was CoreDNS or not; however, I can guarantee that it wasn't caching there. I removed the pod, it respawned onto the same Kubernetes node, and the cert was issued, so something specific to the cert-manager pod had kept a stale value. What you describe puts cert-manager in a potentially bad position. Cert-manager relies on DNS records being set and correctly propagating in order to function. It implicitly relies on DNS misses having short TTLs. If there are unknown layers caching those responses, specifically negative responses, that is going to cause additional load elsewhere when cert-manager queries too early.
You are correct; I just realized that we have our own DNS client when it comes to DNS-01 challenges. Our ad-hoc client is in DNSQuery. Then I realized that we do cache these lookups, which is definitely the cause of your issue. I do not know the rationale for caching these calls. The caching mechanism seems to have been introduced back in 2017 in https://github.com/jetstack/cert-manager/pull/11/files. Possible remediations:
@munnerz Do you think disabling caching would affect performance?
Possible remediations (additional):
I've been hitting this issue as well. It seems like a DNS-01 solver should probably bias towards correctness rather than performance, since it's really only looking for a temporary secret being present in a TXT record on a basically one-off basis.
We have issues with this behaviour too.
Issues go stale after 90d of inactivity.
I'm also observing this behaviour and am not sure why. DNS-01 is taking over 30 minutes with the message "propagation check failed", although I can confirm the TXT record is immediately available on Cloudflare.
Side note: we are considering revamping our ad-hoc DNS client used for self-checks. It emulates what a DNS resolver would do, which seems overkill for cert-manager's purposes. I have started describing this in "Why I think dns-over-https doesn't need finding authoritative nameservers, nor following CNAME records" (part of #5003). I think we should drastically simplify the self-check DNS client, but that will take a while since we can't break anyone.
@maelvls the one thing to call out here, and the reason we query authoritative nameservers instead of recursive ones, is so that we are not at the mercy of a recursive nameserver's caching TTL. That's the entire purpose of the SOA record traversal. With respect to DNS over HTTPS, that may be something we can only do via a recursive NS; however, we will then see delays due to caching at the recursive nameserver we're relying on.
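The SOA traversal described above walks up the challenge name one label at a time, probing each candidate for an SOA record until the zone apex is found. The label-stripping part of that walk can be sketched as follows (illustrative only, not cert-manager's actual code; the real traversal would issue an SOA query at each candidate):

```go
package main

import (
	"fmt"
	"strings"
)

// candidateZones returns the names to probe for an SOA record, from the
// challenge FQDN up to the root, dropping one leading label each step.
func candidateZones(fqdn string) []string {
	fqdn = strings.TrimSuffix(fqdn, ".")
	labels := strings.Split(fqdn, ".")
	var zones []string
	for i := 0; i < len(labels); i++ {
		zones = append(zones, strings.Join(labels[i:], ".")+".")
	}
	zones = append(zones, ".") // finish at the root zone
	return zones
}

func main() {
	fmt.Println(candidateZones("_acme-challenge.sub.example.com."))
	// prints [_acme-challenge.sub.example.com. sub.example.com. example.com. com. .]
}
```

Once the apex is known, the solver can query that zone's own NS records directly and bypass any recursive resolver's cache.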
Thanks for the clarification! Isn't the TTL caching issue a non-issue, given that the self-check will be performed again if it receives an NXDOMAIN response? I think @wallrj was making this point the other day, and I agreed with him.
There's no guarantee an NXDOMAIN response is returned (there could be an existing record already in place for that name), and additionally, not all resolvers actually handle NXDOMAIN responses correctly. This behaviour is also possible today using the '--dns01-recursive-nameservers-only' flag, for what it's worth 😊 but it almost certainly will result in a slower time to completing validation.
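The point that NXDOMAIN cannot be relied on can be made concrete: a self-check has to key off the expected TXT value actually being present, because a miss may surface either as NXDOMAIN or as NOERROR with unrelated (e.g. stale) records at the name. A hypothetical predicate (illustrative sketch, not cert-manager's code):

```go
package main

import "fmt"

const rcodeNXDomain = 3 // RFC 1035 "Name Error" response code

// selfCheckPassed succeeds only when the expected challenge value is among
// the returned TXT records, regardless of whether a miss surfaces as
// NXDOMAIN or as NOERROR with other records present at the name.
func selfCheckPassed(rcode int, txtValues []string, want string) bool {
	if rcode == rcodeNXDomain {
		return false
	}
	for _, v := range txtValues {
		if v == want {
			return true
		}
	}
	return false
}

func main() {
	// An old record exists at the name, so no NXDOMAIN, yet the check fails.
	fmt.Println(selfCheckPassed(0, []string{"stale-token"}, "fresh-token")) // false
	fmt.Println(selfCheckPassed(0, []string{"fresh-token"}, "fresh-token")) // true
}
```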
Stale issues rot after 30d of inactivity.
Rotten issues close after 30d of inactivity.
@jetstack-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Describe the bug:
cert-manager caches DNS entries for longer than their DNS TTLs
Expected behaviour:
Cert-Manager should not cache DNS responses
Steps to reproduce the bug:
This was encountered when the parent domain was in Cloudflare and subdomains were delegated to other nameservers via NS records. That delegation was removed and replaced with A records.
e.g. NS records delegating www.subdomain.example.com and subdomain.example.com from the parent domain were replaced with A records for www.subdomain.example.com and subdomain.example.com in the parent domain.
Anything else we need to know?:
I suspect that the DNS caching would also manifest if you attempted to validate certificates before the domain had been transferred to the correct location, e.g. purchased at GoDaddy and later transferred to Cloudflare, with DNS-01 validation against Cloudflare attempted between the registration and the transfer.
We were transferring sites in from a 3rd-party hosting provider. The 3rd party used DNS delegation (NS entries in the parent domain) to manage the A records at their end. When the initial cert validation went in, the NS records were in place (with 300s TTLs in the parent domain). These NS records were later replaced with A records in the parent domain.
48 hours later, all DNS requests (via dig), both inside and outside the cluster, were returning correct results; however, cert-manager would not validate the cert, failing with the following error (real domain name swapped to example.com)
All other domains (e.g. example.net, example.org) were working correctly. When I deleted the cert-manager pod and allowed Kubernetes to spawn a new one, the cert was issued within a few minutes.
Environment details:
/kind bug