optimize the function load_balance_for_hostname to avoid starting multiple background on-demand DNS tasks #672
Conversation
😊 Welcome @zhlsunshine! This is either your first contribution to the Istio ztunnel repo, or it's been a while. You can learn more about the Istio working groups, code of conduct, and contributing guidelines. Thanks for contributing! Courtesy of your friendly welcome wagon.
Hi @howardjohn & @hzxuzhonghu, are there any comments on this PR? Thanks!
Will take a look.
I do not fully understand the idea. I think we just need a CachedResolvedDNSMap, which stores a hash map from hostname to SingleFlight. The SingleFlight only needs to store the Notify.
Hi @hzxuzhonghu, I can explain these two AtomicBools in more detail here:
Hi @howardjohn & @hzxuzhonghu & @nmittler, is there anything I can do to get this approved? Thanks!
In addition, we could have a test for it.
Hi @hzxuzhonghu, sure. I added a concurrent-request test named
Hi @howardjohn, @hzxuzhonghu and @nmittler, this PR has been updated based on @nmittler's suggestion, creating a new struct named
src/state.rs
Outdated
```diff
@@ -333,20 +570,17 @@ impl DemandProxyState {
     let workload_uid = workload.uid.to_owned();
     let hostname = workload.hostname.to_owned();
     metrics.as_ref().on_demand_dns.get_or_create(&labels).inc();
-    let rdns = match state.get_ips_for_hostname(&workload.hostname) {
+    let rdns = match state.dns_resolver.get_ips_for_hostname(&hostname) {
```
Can this entire method (load_balance_for_hostname) just be moved to DnsResolver? All of this should be internal details of it, I think.
Hi @nmittler, as you know, DnsResolver is part of DemandProxyState, and it needs to mutate its two HashMaps, so it requires a mutable reference. Moving load_balance_for_hostname into DnsResolver would mean the whole DemandProxyState has to be mutably borrowed as well. So I hope it can stay as it is now. Does that make sense?
I was actually concerned when I saw that DemandProxyState.resolve_on_demand_dns and DemandProxyState.load_balance_for_hostname both work around &self to get &mut self. Not sure when that was introduced, but it seems a bit hacky and should probably be fixed if possible. Perhaps the issue is that we have some methods that needlessly require &mut self (e.g. if you lock an object before writing to it, you don't need mut)?
@howardjohn I may be off here, since my rust is a little rusty at this point :) ... thoughts?
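The parenthetical above can be sketched in a few lines. This is a minimal, self-contained illustration (the `HostCache` type and its method names are hypothetical, not ztunnel's actual API): because the map is wrapped in a lock, writes go through `&self`, so callers never need `&mut self`.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical cache illustrating the point: the RwLock provides
// interior mutability, so inserting only needs `&self`.
struct HostCache {
    resolved: RwLock<HashMap<String, Vec<String>>>,
}

impl HostCache {
    fn new() -> Self {
        HostCache { resolved: RwLock::new(HashMap::new()) }
    }

    // A write goes through the lock, so the receiver stays `&self`.
    fn store(&self, host: &str, ips: Vec<String>) {
        self.resolved.write().unwrap().insert(host.to_string(), ips);
    }

    fn lookup(&self, host: &str) -> Option<Vec<String>> {
        self.resolved.read().unwrap().get(host).cloned()
    }
}

fn main() {
    let cache = HostCache::new(); // note: not declared `mut`
    cache.store("example.com", vec!["93.184.216.34".to_string()]);
    println!("{:?}", cache.lookup("example.com")); // prints Some(["93.184.216.34"])
}
```

With this shape, a method like load_balance_for_hostname could take `&self` and rely on short-lived lock guards rather than threading `&mut self` through the call chain.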
Hi @nmittler, as we discussed, it has become much simpler.
src/state.rs
Outdated
```rust
// by_hostname is a map from hostname to resolved IP addresses for now.
//
// in a future with support for per-pod DNS resolv.conf settings we may need
// to change this to a map from source workload uid to resolved IP addresses.
#[serde(skip_serializing)]
by_hostname: HashMap<String, ResolvedDns>,
```
Maybe just call this field resolved?
I can change it; how about resolved_hostname?
src/state.rs
Outdated
```rust
) -> Option<ResolvedDns> {
    // optimize so that if multiple requests to the same hostname come in at the same time,
    // we don't start more than one background on-demand DNS task
    let is_the_first_req = self.get_singleflight_by_hostname(hostname.to_owned());
```
I don't think this is quite right. The code should be something like this:

```rust
async fn resolve_host(
    &self,
    hostname: &String,
    workload: &Workload,
) -> Option<ResolvedDns> {
    // Look up the host in the local cache. If it's there,
    // just return it now.
    if let Some(result) = self.get_ips_for_hostname(hostname) {
        return Some(result);
    }

    // Perform the resolution for the host and block until it
    // completes. Take care to only allow a single request to be
    // in-flight at a time. Other threads will just wait for the first
    // request to complete.
    match self.get_or_create_notify(hostname) {
        (notify, true) => {
            // First request: perform the resolution and notify
            // all other threads when resolution is complete.
            self.resolve_on_demand_dns(workload).await;
            notify.notify_waiters();
            // Now that we've notified all waiters, we can safely
            // remove the notify for this host. Any future requests
            // should just get it from the local cache.
            self.remove_notify(hostname);
        }
        (notify, false) => {
            // Already in progress, just wait for the result.
            notify.notified().await;
        }
    }

    // At this point, the local cache has been populated.
    // Just return the cached contents.
    self.get_ips_for_hostname(hostname)
}
```
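The single-flight bookkeeping behind get_or_create_notify can be sketched with the standard library alone. Note this is an illustrative synchronous analogue, not ztunnel's code: the real implementation is async and uses tokio::sync::Notify, whereas here a (Mutex&lt;bool&gt;, Condvar) pair stands in for the notify, and all names (SingleFlight, get_or_create, finish, wait) are made up for the sketch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

// Synchronous sketch of the single-flight idea: the first caller for a
// hostname performs the resolution; concurrent callers wait on the same
// entry instead of starting their own DNS task.
struct SingleFlight {
    in_flight: Mutex<HashMap<String, Arc<(Mutex<bool>, Condvar)>>>,
}

impl SingleFlight {
    fn new() -> Self {
        SingleFlight { in_flight: Mutex::new(HashMap::new()) }
    }

    // Returns (entry, is_first): only the first caller for a hostname
    // creates the entry and is expected to perform the resolution.
    fn get_or_create(&self, host: &str) -> (Arc<(Mutex<bool>, Condvar)>, bool) {
        let mut map = self.in_flight.lock().unwrap();
        match map.get(host) {
            Some(entry) => (entry.clone(), false),
            None => {
                let entry = Arc::new((Mutex::new(false), Condvar::new()));
                map.insert(host.to_string(), entry.clone());
                (entry, true)
            }
        }
    }

    // First caller: mark done, wake all waiters, then drop the entry so
    // later requests go straight to the resolved cache.
    fn finish(&self, host: &str, entry: &Arc<(Mutex<bool>, Condvar)>) {
        *entry.0.lock().unwrap() = true;
        entry.1.notify_all();
        self.in_flight.lock().unwrap().remove(host);
    }

    // Other callers: block until the first request completes.
    fn wait(entry: &Arc<(Mutex<bool>, Condvar)>) {
        let mut done = entry.0.lock().unwrap();
        while !*done {
            done = entry.1.wait(done).unwrap();
        }
    }
}

fn main() {
    let sf = SingleFlight::new();
    let (e1, first1) = sf.get_or_create("example.com");
    let (_e2, first2) = sf.get_or_create("example.com");
    println!("{} {}", first1, first2); // prints "true false"
    sf.finish("example.com", &e1);
    SingleFlight::wait(&e1); // returns immediately: already done
}
```

As in the snippet above, waiters re-read the resolved cache after waking, so removing the in-flight entry after notification is safe.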
Hi @nmittler, there are two mutable borrows here, one in self.get_or_create_notify (the first mutable borrow) and one in self.resolve_on_demand_dns (the second), so I have to use let element = self.get_cached_resolve_dns_for_hostname(hostname); to avoid the conflict by fetching the element from the HashMap via a read lock. However, I can rename get_singleflight_by_hostname to get_or_create_notify. Does that make sense?
This modifies the approach described in istio#672. Attempting to clarify some things we had discussed offline.
@zhlsunshine I created #722, which re-works this a bit based on some things we discussed. PTAL.
@zhlsunshine could we step back for a moment and discuss the motivation for this PR? Are you seeing a real issue with DNS performance, specifically due to concurrent requests for the same host? I just want to make sure we're solving a real problem here.
Co-authored-by: nmittler <nmittler@gmail.com>
Hi @nmittler, frankly, I'm currently focused on validating ambient mesh in a CSP environment. What I am sure about is that the original code has a problem: there can be multiple DNS resolutions for the same hostname under concurrent requests.
@zhlsunshine just to be clear, you haven't actually seen performance problems in a deployed system? My concern is fundamentally about the complexity of the solution. I'd like to see that the problem is real before we optimize for it. @howardjohn @hzxuzhonghu any thoughts?
I agree with Nate
Hi @nmittler, yes, I haven't. But when multiple concurrent requests arrive for the same hostname and its resolved IP addresses are not yet in the DNS cache, each request triggers its own DNS resolution for that hostname. That is really wasteful, so this optimization should be valid even without performance testing.
I understand what you're trying to optimize for. What I'm suggesting is that it may not be a huge issue in practice. I'd rather not add this type of complexity until we actually have a customer that is being impacted. |
One thing to note: the current behavior is equivalent to running without the DNS proxy. So this isn't fixing a regression, it's purely a performance benefit, and the intent of the proxy is not performance.
Hi @howardjohn, besides the performance change, we also refactored the DNS resolver code to make it clearer. Does that make sense?
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.