
optimize the function load_balance_for_hostname to avoid starting multiple DNS tasks #672

Closed
wants to merge 40 commits

Conversation

zhlsunshine
Contributor

optimize the function load_balance_for_hostname to avoid starting multiple background on-demand DNS tasks.

… more than one background on-demand DNS task
@zhlsunshine zhlsunshine requested a review from a team as a code owner August 30, 2023 11:38
@istio-policy-bot

😊 Welcome @zhlsunshine! This is either your first contribution to the Istio ztunnel repo, or it's been a while since you've been here.

You can learn more about the Istio working groups, code of conduct, and contributing guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 30, 2023
@zhlsunshine zhlsunshine changed the title optimize the function load_balance_for_hostname to avoid starting multiple DNS task optimize the function load_balance_for_hostname to avoid starting multiple DNS tasks Aug 31, 2023
@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 1, 2023
@zhlsunshine
Contributor Author

Hi @howardjohn & @hzxuzhonghu, is there any comment for this PR? thanks!

@hzxuzhonghu
Member

Will take a look

Member

@hzxuzhonghu hzxuzhonghu left a comment

I do not fully understand the idea.

I think we just need a CachedResolvedDNSMap that stores a hash map from hostname to SingleFlight.

The single flight only needs to store the Notify.
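That single-flight idea can be sketched as follows. This is a hypothetical, std-only illustration (Mutex plus Condvar standing in for tokio's Notify); the type and method names are invented for the sketch and are not ztunnel's actual API:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Condvar, Mutex};

struct SingleFlightCache {
    inner: Mutex<Inner>,
    cv: Condvar,
}

#[derive(Default)]
struct Inner {
    resolved: HashMap<String, Vec<String>>, // hostname -> resolved IPs
    in_flight: HashSet<String>,             // hostnames currently being resolved
}

impl SingleFlightCache {
    fn new() -> Self {
        SingleFlightCache {
            inner: Mutex::new(Inner::default()),
            cv: Condvar::new(),
        }
    }

    // Only the first caller for a hostname runs `resolve`; concurrent
    // callers for the same hostname block until the result is cached.
    fn get_or_resolve(&self, host: &str, resolve: impl Fn() -> Vec<String>) -> Vec<String> {
        let mut g = self.inner.lock().unwrap();
        loop {
            if let Some(ips) = g.resolved.get(host) {
                return ips.clone();
            }
            if g.in_flight.insert(host.to_string()) {
                // First request: drop the lock, resolve, publish, wake waiters.
                drop(g);
                let ips = resolve();
                let mut g2 = self.inner.lock().unwrap();
                g2.resolved.insert(host.to_string(), ips.clone());
                g2.in_flight.remove(host);
                self.cv.notify_all();
                return ips;
            }
            // Another request is already resolving this host; wait for it.
            g = self.cv.wait(g).unwrap();
        }
    }
}

fn main() {
    let c = SingleFlightCache::new();
    let ips = c.get_or_resolve("example.com", || vec!["203.0.113.7".to_string()]);
    assert_eq!(ips, vec!["203.0.113.7".to_string()]);
    // Second call hits the cache; the resolve closure is not invoked again.
    let ips2 = c.get_or_resolve("example.com", || panic!("should not resolve twice"));
    assert_eq!(ips2, ips);
    println!("ok");
}
```

The Condvar loop plays the role that Notify plays in the async version: waiters sleep until the first request publishes the result and wakes everyone.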

@zhlsunshine
Contributor Author

I do not fully understand the idea.

I think we just need a CachedResolvedDNSMap that stores a hash map from hostname to SingleFlight.

The single flight only needs to store the Notify.

Hi @hzxuzhonghu, let me explain these two AtomicBools a bit more:
Although state is an Arc<RwLock>, each read and write on ProxyState is atomic only within its own code block (for example, reading CachedResolvedDNSMap via get_cached_resolve_dns_for_hostname and writing it via set_cached_resolve_dns_for_hostname), so we need these two AtomicBools to ensure that only the first arriving request writes the hash map while the others wait for it to complete.
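The "only the first request writes" guarantee rests on an atomic test-and-set. A minimal standalone illustration (not the PR's actual code) of how one concurrent caller wins the race:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

fn main() {
    let is_resolving = AtomicBool::new(false);

    // compare_exchange flips false -> true atomically, so exactly one
    // concurrent caller observes Ok(..) and becomes "the first request";
    // every later caller gets Err(..) and should wait instead of resolving.
    let first = is_resolving
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_ok();
    let second = is_resolving
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_ok();

    assert!(first);
    assert!(!second);
    println!("first wins: {first}, later caller: {second}");
}
```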

@zhlsunshine
Contributor Author

Hi @howardjohn & @hzxuzhonghu & @nmittler, is there anything that I can do to make it approved? Thanks!

Member

@hzxuzhonghu hzxuzhonghu left a comment

In addition, we can have a test for it

@zhlsunshine
Contributor Author

In addition, we can have a test for it

Hi @hzxuzhonghu, sure, I added a concurrency requests test named build_concurrency_requests_unknown_dest for this change!

@zhlsunshine
Contributor Author

Hi @howardjohn, @hzxuzhonghu and @nmittler, there is a change in this PR based on @nmittler's suggestion: a new struct named DnsResolver now holds all the DNS-resolution state, which simplifies ProxyState; DnsResolver is part of DemandProxyState. Could you please help review it again? Thanks in advance!

src/state.rs Outdated
@@ -333,20 +570,17 @@ impl DemandProxyState {
let workload_uid = workload.uid.to_owned();
let hostname = workload.hostname.to_owned();
metrics.as_ref().on_demand_dns.get_or_create(&labels).inc();
-        let rdns = match state.get_ips_for_hostname(&workload.hostname) {
+        let rdns = match state.dns_resolver.get_ips_for_hostname(&hostname) {
Contributor

Can this entire method (load_balance_for_hostname) just be moved to DnsResolver? All of this should be internal details of it, I think.

Contributor Author

Hi @nmittler, as you know, DnsResolver is part of DemandProxyState, and it mutates its two HashMaps, so it is used through a mutable reference. If load_balance_for_hostname moved into DnsResolver, the whole DemandProxyState would need to be a mutable reference as well. So I hope it can stay as it is now. Does that make sense?

Contributor

@nmittler nmittler Nov 3, 2023

I was actually concerned when I saw that DemandProxyState.resolve_on_demand_dns and DemandProxyState.load_balance_for_hostname both work around &self to get &mut self. Not sure when that was introduced, but it seems a bit hacky and should probably be fixed if possible. Perhaps the issue is that we have some methods that needlessly require &mut self (e.g. if you lock an object before writing to it, you don't need mut)?

@howardjohn I may be off here, since my rust is a little rusty at this point :) ... thoughts?
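The "if you lock it, you don't need mut" point can be sketched with interior mutability. The types and fields here are hypothetical stand-ins, not ztunnel's actual state types:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Because all mutation goes through the RwLock, both methods take &self;
// no caller ever needs a &mut reference to the whole struct.
struct DnsCache {
    by_hostname: RwLock<HashMap<String, Vec<String>>>,
}

impl DnsCache {
    fn new() -> Self {
        DnsCache {
            by_hostname: RwLock::new(HashMap::new()),
        }
    }

    fn insert(&self, host: &str, ips: Vec<String>) {
        self.by_hostname.write().unwrap().insert(host.to_string(), ips);
    }

    fn get(&self, host: &str) -> Option<Vec<String>> {
        self.by_hostname.read().unwrap().get(host).cloned()
    }
}

fn main() {
    let cache = DnsCache::new(); // note: the binding is not `mut`
    cache.insert("example.com", vec!["203.0.113.7".to_string()]);
    assert_eq!(
        cache.get("example.com"),
        Some(vec!["203.0.113.7".to_string()])
    );
    println!("ok");
}
```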

@zhlsunshine
Contributor Author

Hi @nmittler, as we discussed, it has become much simpler.

src/state.rs Outdated
// by_hostname is a map from hostname to resolved IP addresses for now.
//
// in a future with support for per-pod DNS resolv.conf settings we may need
// to change this to a map from source workload uid to resolved IP addresses.
#[serde(skip_serializing)]
by_hostname: HashMap<String, ResolvedDns>,
Contributor

Maybe just call this field resolved?

Contributor Author

I can change it, how about resolved_hostname?

src/state.rs Outdated
) -> Option<ResolvedDns> {
// optimize so that if multiple requests to the same hostname come in at the same time,
// we don't start more than one background on-demand DNS task
let is_the_first_req = self.get_singleflight_by_hostname(hostname.to_owned());
Contributor

@nmittler nmittler Nov 3, 2023

I don't think this is quite right. This code should be something like this:

async fn resolve_host(
    &self,
    hostname: &String,
    workload: &Workload,
) -> Option<ResolvedDns> {
    // Look up the host in the local cache. If it's there,
    // just return it now.
    if let Some(result) = self.get_ips_for_hostname(hostname) {
        return Some(result);
    }

    // Perform the resolution for the host and block until it
    // completes. Take care to only allow a single request to be
    // in flight at a time. Other threads will just wait for the first
    // request to complete.
    match self.get_or_create_notify(hostname) {
        (notify, true) => {
            // First request: perform the resolution and notify
            // all other threads when resolution is complete.
            self.resolve_on_demand_dns(workload).await;
            notify.notify_waiters();

            // Now that we've notified all waiters, we can safely
            // remove the notify for this host. Any future requests
            // should just get the result from the local cache.
            self.remove_notify(hostname);
        }
        (notify, false) => {
            // Already in progress; just wait for the result.
            notify.notified().await;
        }
    }

    // At this point, the local cache has been populated.
    // Just return the cached contents.
    self.get_ips_for_hostname(hostname)
}

Contributor Author

Hi @nmittler, there are two mutable borrows here: self.get_or_create_notify (first mutable borrow) and self.resolve_on_demand_dns (second mutable borrow), so I have to use let element = self.get_cached_resolve_dns_for_hostname(hostname); to avoid the conflict by fetching the element from the HashMap under a read lock.
However, I can rename get_singleflight_by_hostname to get_or_create_notify. Does that make sense?
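The workaround described above, cloning the entry out under a read lock so no borrow of the map outlives the lookup, looks roughly like this. The names (Resolver, get_cached, set_cached) are hypothetical, chosen to mirror the discussion:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct Resolver {
    cache: RwLock<HashMap<String, Vec<String>>>,
}

impl Resolver {
    // The read guard lives only inside this one expression; the caller
    // receives an owned clone and holds no borrow of the map afterwards,
    // so a later call that mutates the cache cannot conflict with it.
    fn get_cached(&self, hostname: &str) -> Option<Vec<String>> {
        self.cache.read().unwrap().get(hostname).cloned()
    }

    fn set_cached(&self, hostname: &str, ips: Vec<String>) {
        self.cache.write().unwrap().insert(hostname.to_string(), ips);
    }
}

fn main() {
    let r = Resolver {
        cache: RwLock::new(HashMap::new()),
    };
    r.set_cached("example.com", vec!["198.51.100.1".to_string()]);
    let element = r.get_cached("example.com"); // owned value, no lock held
    r.set_cached("example.com", vec!["198.51.100.2".to_string()]); // no conflict
    assert_eq!(element, Some(vec!["198.51.100.1".to_string()]));
    println!("ok");
}
```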

nmittler added a commit to nmittler/ztunnel that referenced this pull request Nov 3, 2023
This modifies the approach described in istio#672. Attempting to clarify some things we had discussed offline.
@nmittler
Contributor

nmittler commented Nov 3, 2023

@zhlsunshine I created #722, which re-works this a bit based on some things we discussed. PTAL.

nmittler added a commit to nmittler/ztunnel that referenced this pull request Nov 4, 2023
This modifies the approach described in istio#672. Attempting to clarify some things we had discussed offline.
@nmittler
Contributor

nmittler commented Nov 6, 2023

@zhlsunshine could we step back for a moment and discuss the motivation for this PR? Are you seeing a real issue with DNS performance, specifically due to concurrent requests for the same host? I just want to make sure we're solving a real problem here.

Co-authored-by: nmittler <nmittler@gmail.com>
@zhlsunshine
Contributor Author

@zhlsunshine could we step back for a moment and discuss the motivation for this PR? Are you seeing a real issue with DNS performance, specifically due to concurrent requests for the same host? I just want to make sure we're solving a real problem here.

Hi @nmittler, frankly, I have been focusing recently on some validations of ambient mesh in a CSP environment. What I am sure of is that the original code has this problem: concurrent requests can trigger multiple DNS resolutions for the same hostname.
BTW, I have no further comments on your reworked PR #722. And I insist that you should be a co-author of this PR, thanks to your help and comments. ^_^

@nmittler
Contributor

nmittler commented Nov 9, 2023

@zhlsunshine just to be clear, you haven't actually seen performance problems in a deployed system? My concern is fundamentally about the complexity of the solution. I'd like to see that the problem is real before we optimize for it.

@howardjohn @hzxuzhonghu any thoughts?

@howardjohn
Member

howardjohn commented Nov 10, 2023 via email

@zhlsunshine
Contributor Author

@zhlsunshine just to be clear, you haven't actually seen performance problems in a deployed system? My concern is fundamentally about the complexity of the solution. I'd like to see that the problem is real before we optimize for it.

@howardjohn @hzxuzhonghu any thoughts?

Hi @nmittler, yes, I haven't. But when multiple concurrent requests come in for the same hostname and its resolved IP addresses are not yet in the DNS cache, each request triggers its own DNS resolution for that hostname. That is really wasteful, so this optimization should be worthwhile even without performance testing.

@nmittler
Contributor

@zhlsunshine

Hi @nmittler, yes, I haven't. But when multiple concurrent requests come in for the same hostname and its resolved IP addresses are not yet in the DNS cache, each request triggers its own DNS resolution for that hostname. That is really wasteful, so this optimization should be worthwhile even without performance testing.

I understand what you're trying to optimize for. What I'm suggesting is that it may not be a huge issue in practice. I'd rather not add this type of complexity until we actually have a customer that is being impacted.

@howardjohn
Member

howardjohn commented Nov 10, 2023 via email

@zhlsunshine
Contributor Author

Hi @howardjohn, besides the performance change, we also did some refactoring of the DNS resolver code to make it clearer. Does that make sense?

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Dec 19, 2023
@istio-testing
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@howardjohn
Member

#1098

@howardjohn howardjohn closed this May 28, 2024