
reduce the lock contention in task spawn. #6001

Merged
merged 104 commits into tokio-rs:master from reduce-lock-contention on Dec 7, 2023

Conversation

wathenjiang
Contributor

@wathenjiang wathenjiang commented Sep 12, 2023

We have to use an OwnedTasks structure to track the pointers of all tasks, and this is a multi-threaded scenario. When tokio::spawn is called concurrently from multiple threads, the contention on its lock becomes very heavy.

I propose a solution that improves multi-threaded spawn performance by about 15% ~ 20% without losing any single-threaded spawn performance.

The idea is to put tasks with different task IDs into different linked lists. Each linked list uses its own Mutex to guarantee multi-threaded concurrency safety, instead of a single runtime-wide Mutex.
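
For illustration, here is a minimal sketch of the sharding idea, with assumed names and a std LinkedList (the actual implementation uses Tokio's intrusive linked list and differs in detail):

use std::collections::LinkedList;
use std::sync::Mutex;

struct ShardedTaskList<T> {
    shards: Vec<Mutex<LinkedList<T>>>,
}

impl<T> ShardedTaskList<T> {
    fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards.max(1))
                .map(|_| Mutex::new(LinkedList::new()))
                .collect(),
        }
    }

    // Pick the shard for a task id, so different ids spread across different locks.
    fn shard(&self, task_id: u64) -> &Mutex<LinkedList<T>> {
        &self.shards[task_id as usize % self.shards.len()]
    }

    fn insert(&self, task_id: u64, task: T) {
        // Only this shard's lock is held, so spawns with other ids do not contend.
        self.shard(task_id).lock().unwrap().push_back(task);
    }
}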

The original version:

spawn_tasks_current_thread
                        time:   [654.87 ns 664.26 ns 672.13 ns]
                        change: [-4.5814% -1.2511% +2.1604%] (p = 0.48 > 0.05)
                        No change in performance detected.

spawn_tasks_current_thread_parallel
                        time:   [537.55 ns 540.00 ns 542.20 ns]
                        change: [-1.0603% -0.2591% +0.6283%] (p = 0.58 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

spawn_tasks             time:   [1.0447 µs 1.0533 µs 1.0610 µs]
                        change: [+0.9379% +2.2768% +3.7191%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild

spawn_tasks_parallel    time:   [992.38 ns 1.0002 µs 1.0073 µs]
                        change: [+0.1973% +1.4194% +2.5819%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

This version:

spawn_tasks_current_thread
                        time:   [705.92 ns 724.31 ns 739.78 ns]

spawn_tasks_current_thread_parallel
                        time:   [529.33 ns 531.00 ns 532.61 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

spawn_tasks             time:   [881.56 ns 892.21 ns 902.10 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

spawn_tasks_parallel    time:   [815.00 ns 819.87 ns 824.60 ns]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Sep 12, 2023
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime labels Sep 12, 2023
@wathenjiang wathenjiang force-pushed the reduce-lock-contention branch 5 times, most recently from 12c8a11 to 445d79f Compare September 12, 2023 14:02
@carllerche
Member

Thanks for doing this. I tried something similar but did not measure an improvement. What CPU are you running this on?

@hawkw hawkw self-requested a review September 12, 2023 16:30
@wathenjiang
Contributor Author

wathenjiang commented Sep 13, 2023

Thanks for doing this. I tried something similar but did not measure an improvement. What CPU are you running this on?

The above benchmark runs on Linux; the CPU is an Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz.

I also ran the same benchmark on other CPU architectures such as ARM (Mac M1) and found a similar improvement.

The lock contention problem gets heavier when more threads spawn tasks concurrently. Did you make sure that task::spawn is called concurrently from different worker threads?

The reason I proposed using the criterion crate in #5979 is that it lets us run concurrent jobs more conveniently. With criterion, we can do something like the following (from spawn_concurrent.rs):

async fn job(iters: usize, procs: usize) {
    for _ in 0..procs {
        let mut threads_handles = Vec::with_capacity(procs);
        threads_handles.push(tokio::spawn(async move {
            let mut thread_handles = Vec::with_capacity(iters / procs);
            for _ in 0..iters / procs {
                thread_handles.push(tokio::spawn(async {
                    let val = 1 + 1;
                    tokio::task::yield_now().await;
                    black_box(val)
                }));
            }
            for handle in thread_handles {
                handle.await.unwrap();
            }
        }));
        for handle in threads_handles {
            handle.await.unwrap();
        }
    }
}

We can access iters now, which is not possible with the bencher crate. This means we can keep the number of tokio::spawn calls the same when the way we call tokio::spawn changes (single-threaded or multi-threaded concurrency).

The flamegraphs show the same improvement.

The original version:

[flamegraph image: original version]

This version:

[flamegraph image: this PR]

The share of samples in tokio::runtime::task::list::OwnedTasks<S>::bind_inner drops from 36.66% to 32.03%, and tokio::runtime::task::list::OwnedTasks<S>::remove drops from 33.64% to 28.64%.

@wathenjiang
Contributor Author

I haven't been able to pay attention to this PR recently because I have been too busy; sorry for that. I'd like to know how this PR is progressing so far. Any feedback on the design would be greatly appreciated.

As far as the current code review is concerned, we may have differing views in the following areas:

  • Whether the number of mutexes in the sharded list used by OwnedTasks should be fixed
  • Whether spawn_concurrency_level should be exposed as a configuration option

Whether such a complex design is warranted depends on Tokio's use cases; I provide some factual reference here.

I used the following code for benchmark testing:

use std::time::Instant;

use criterion::{measurement::WallTime, *};

fn bench_parallel_spawn_multi_thread(c: &mut Criterion) {
    let mut group = c.benchmark_group("spawn_parallel_multi_thread2");
    let ws: [usize; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    let ss: [usize; 12] = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048];

    for w in ws {
        for s in ss {
            spawn_tasks_parallel_multi_thread(&mut group, w, s);
        }
    }
}

fn spawn_tasks_parallel_multi_thread(g: &mut BenchmarkGroup<WallTime>, w: usize, s: usize) {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(w)
        .spawn_concurrency_level(black_box(s))
        .build()
        .unwrap();

    g.bench_function(format!("{:03}-{:04}", w, s), |b| {
        b.iter_custom(|iters| {
            let start = Instant::now();
            runtime.block_on(async {
                black_box(spawn_job(iters as usize, w).await);
            });
            start.elapsed()
        })
    });
}

async fn spawn_job(iters: usize, procs: usize) {
    for _ in 0..procs {
        let mut threads_handles = Vec::with_capacity(procs);
        threads_handles.push(tokio::spawn(async move {
            let mut thread_handles = Vec::with_capacity(iters / procs);
            for _ in 0..iters / procs {
                thread_handles.push(tokio::spawn(async {
                    let val = 1 + 1;
                    tokio::task::yield_now().await;
                    black_box(val)
                }));
            }
            for handle in thread_handles {
                handle.await.unwrap();
            }
        }));
        for handle in threads_handles {
            handle.await.unwrap();
        }
    }
}

criterion_group!(benches, bench_parallel_spawn_multi_thread,);
criterion_main!(benches);

The following test results were obtained on a Linux machine with a 16-core x86 CPU:

| mutexes \ workers | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 803.57 | 687.35 | 986.31 | 1015.4 | 1030.8 | 1070.4 | 1005 | 1127.9 |
| 2 | 793.59 | 668.52 | 914.75 | 883.09 | 872.51 | 874.69 | 950.95 | 971.81 |
| 4 | 817.35 | 650.24 | 832.3 | 832.16 | 885.87 | 858.87 | 850.21 | 860.88 |
| 8 | 827.45 | 631.54 | 770.99 | 852.24 | 811.57 | 804.15 | 876.1 | 892.08 |
| 16 | 815.97 | 594.82 | 838.56 | 846.6 | 825.85 | 820.52 | 803.44 | 885.4 |
| 32 | 820.33 | 577.8 | 776.64 | 805.6 | 796.62 | 805.46 | 781.81 | 874.36 |
| 64 | 769.73 | 594.62 | 750.4 | 801.3 | 734.45 | 798.41 | 804.04 | 816.49 |
| 128 | 763.23 | 599.43 | 742.1 | 739.86 | 784 | 777.41 | 731.66 | 814.59 |
| 256 | 761.52 | 608.53 | 745.58 | 745.21 | 713.8 | 748.61 | 802.93 | 805.13 |
| 512 | 807.22 | 596.78 | 694.39 | 727.19 | 785.3 | 783.67 | 740.8 | 733.06 |
| 1024 | 773.73 | 621.63 | 749.25 | 781.1 | 723.71 | 775.19 | 797.04 | 811.37 |
| 2048 | 808.78 | 578.05 | 662.01 | 775.17 | 786.92 | 774.91 | 794.79 | 800.56 |

If the above test results are plotted, the result is shown in the figure below:

[graph: tokio::spawn time vs. number of mutexes, one line per worker-thread count]

We can see that:

  • No matter how many mutexes are configured, spawning concurrently from 2 worker threads is always the fastest.
  • For every worker-thread count, the time taken by tokio::spawn first decreases as the number of mutexes increases, but eventually increases again.
  • The blue line is the baseline: the trend of tokio::spawn performance for a single worker thread as the number of mutexes changes.
  • Starting from a single Mutex, increasing the number of mutexes always reduces the time taken by tokio::spawn, regardless of the number of worker threads.

Finally, it is difficult to estimate how many mutexes are optimal for tokio::spawn across different worker-thread counts. Perhaps a good solution is to provide a fixed default number of mutexes such as 64 (currently it is at least 4 times the number of worker threads, rounded up to the smallest power of two, which is relatively more complicated), and to offer a configurable mutex-count option for extremely spawn-heavy workloads.
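
For illustration only, the default described above could be computed by a small hypothetical helper like the one below (the 64 cap is just the example value mentioned, not a decided constant):

fn default_shard_count(worker_threads: usize) -> usize {
    // Example cap taken from the discussion above.
    const MAX_SHARDS: usize = 64;
    // At least 4x the worker threads, rounded up to a power of two, then capped.
    (worker_threads * 4).next_power_of_two().min(MAX_SHARDS)
}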

@Darksonn
Contributor

Sorry for the delay in review. I was at a conference last week and will be at another next week, so I'm not watching the repository as actively as I usually am.

I'd like to know how this PR is progressing so far. Any feedback on the design would be greatly appreciated.

Overall, I'm pretty happy with the approach. As you mention, whether we should have a config option is my only real question. I would lean towards no because it's simpler, but if it's unstable, I do not mind having it.

Other than that, I think there were only various minor things to fix last I looked.

@wathenjiang
Contributor Author

wathenjiang commented Nov 21, 2023

I have considered this for some time and believe your idea is the better choice. We will not allow users to configure the spawn_concurrency_level option directly, even in unstable mode.

Contributor

@Darksonn Darksonn left a comment

With the changes below, I'll be happy to merge this.

Comment on lines 1333 to 1340
fn get_spawn_concurrency_level(core_threads: usize) -> usize {
    const MAX_SPAWN_CONCURRENCY_LEVEL: usize = 1 << 16;
    let mut size = 1;
    while size / 4 < core_threads && size < MAX_SPAWN_CONCURRENCY_LEVEL {
        size <<= 1;
    }
    size.min(MAX_SPAWN_CONCURRENCY_LEVEL)
}
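
(For example, with core_threads = 8 this loop stops at size = 32, i.e. the smallest power of two whose quarter is at least the worker-thread count.)
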
Contributor

If we are removing it as an option from the builder, then all changes related to spawn_concurrency_level in this file and other files should be removed. Instead, I would just move this into runtime/task/list.rs as a const.

Comment on lines 355 to 299
- pub(crate) fn for_each<F>(&mut self, mut f: F)
+ pub(crate) fn for_each<F>(&mut self, f: &mut F)
Contributor

Please undo this change and continue to use mut f: F. The code in sharded_list.rs will still work because mutable references are also FnMut.
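
For context, a small self-contained example (not Tokio code) of why this works: a mutable reference to a closure also implements FnMut, so a function that takes mut f: F with F: FnMut accepts both the closure by value and &mut closure.

fn call_twice<F: FnMut(i32)>(mut f: F) {
    f(1);
    f(2);
}

fn main() {
    let mut sum = 0;
    let mut add = |x| sum += x;
    call_twice(&mut add); // &mut F is itself FnMut, so this compiles
    call_twice(add);      // passing the closure by value also works
    assert_eq!(sum, 6);
}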

/// Due to the above reasons, we set a maximum value for the shared list size,
/// denoted as `MAX_SHARED_LIST_SIZE`.
fn gen_shared_list_size(num_cores: usize) -> usize {
    const MAX_SHARED_LIST_SIZE: usize = 1 << 16;
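
For context, a sketch of how such a helper might look in full, derived from the get_spawn_concurrency_level function quoted above (an assumption, not necessarily the exact merged code):

fn gen_shared_list_size(num_cores: usize) -> usize {
    const MAX_SHARED_LIST_SIZE: usize = 1 << 16;
    // Roughly 4x the core count, rounded up to a power of two, capped at the maximum.
    usize::min(MAX_SHARED_LIST_SIZE, num_cores.next_power_of_two() * 4)
}
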
Contributor

This maximum seems really large to me. Thoughts?

Contributor Author

I just reused the settings of the previous spawn_concurrency_level function. That function supported user configuration, so the value was set high.

If you want a smaller value, I'm OK with that.

As far as I know, some current server CPUs come close to 200 CPU cores (or hardware threads); for example, the AMD EPYC 9654 has 192 threads. Maybe we can set the value to 256 (4 times that is 1 << 10)?

Contributor Author

I'd like to retract my previous point. I believe this is a memory-bound scenario, so we only need to ensure a certain number of mutexes. In fact, the number of mutexes should not depend on the number of worker threads, but rather on the possible level of concurrency.

Below, I describe the concurrency level that may actually occur, rather than what occurs in the benchmark.

In real applications that use Tokio, multiple threads may call tokio::spawn concurrently. However, having very many threads calling tokio::spawn at the same time can be considered a poor design choice in the higher-level application architecture. It is more likely that multiple threads will remove tasks concurrently; nonetheless, removing a task from the list takes up only a minimal amount of runtime during the entire task life cycle, so the concurrency level of task removal is usually not a significant concern.

Therefore, I agree with your idea, and it is reasonable to set the upper limit to a small value. Setting this to 64 or 128 might make sense.

Contributor

Alright. What do you think about multiplying by 4?

Contributor Author

The total number of random memory operations is kept constant, and multiple locks are provided. Each thread obtains one of the locks via round robin and may only perform its random memory accesses while holding the lock. I got the following test results:

| mutexes \ threads | 1 | 2 | 4 | 8 | 12 | 16 | 24 | 32 | 48 | 64 | 128 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5.980172444 | 2.899437975 | 1.447906311 | 0.828731566 | 0.689618066 | 0.612355429 | 0.589401394 | 0.587380871 | 0.525477567 | 0.578456362 | 0.552132325 |
| 1 | 7.970250774 | 17.29034894 | 22.60164692 | 25.97284605 | 28.12352579 | 33.31359697 | 31.18786342 | 31.61139126 | 29.23225856 | 30.94094675 | 31.59191497 |
| 2 | 7.883931727 | 15.97845738 | 16.11107368 | 18.73377898 | 20.34614133 | 23.02624802 | 22.69439808 | 23.15802647 | 21.80570219 | 22.48815498 | 22.98585238 |
| 4 | 7.975676415 | 10.25364766 | 11.88074538 | 15.40198137 | 15.51024255 | 16.35328034 | 15.46874828 | 15.7982897 | 15.48703267 | 15.67227903 | 15.35829948 |
| 8 | 8.058803258 | 8.138193999 | 7.619081588 | 7.936418179 | 7.654288652 | 7.901945312 | 7.642439744 | 7.861542054 | 7.730389506 | 7.821229611 | 7.748344488 |
| 16 | 9.797308994 | 6.213334839 | 4.455407945 | 4.496371955 | 4.291254249 | 4.130849346 | 4.347601475 | 4.294096757 | 3.990391527 | 4.028562691 | 4.059085994 |
| 32 | 8.742854719 | 4.847656612 | 3.301780829 | 2.578327826 | 2.480488617 | 2.331294827 | 2.388718271 | 2.306257478 | 2.421350161 | 2.278177495 | 2.26569423 |
| 64 | 8.042672888 | 4.963568223 | 3.012473492 | 2.08243512 | 1.828237002 | 1.653421053 | 1.550811454 | 1.536452054 | 1.519761769 | 1.618966043 | 1.48010674 |
| 128 | 8.62801309 | 4.978525185 | 2.637936755 | 1.777546296 | 1.549096849 | 1.359814529 | 1.43875245 | 1.385468038 | 1.238832309 | 1.249940559 | 1.248131329 |
| 256 | 8.584906215 | 4.591742459 | 2.441556366 | 1.504790937 | 1.335449235 | 1.169191715 | 1.115906268 | 1.230570609 | 1.075581823 | 1.048285585 | 1.02977064 |
| 512 | 8.171549127 | 4.182283461 | 2.37535305 | 1.54202412 | 1.1690348 | 1.054650104 | 1.015366906 | 1.153238581 | 0.993319168 | 0.998864737 | 0.981392837 |
| 1024 | 8.533398132 | 4.175120792 | 2.209645233 | 1.412410651 | 1.055442085 | 0.938202817 | 1.122801927 | 0.940661156 | 0.888767412 | 0.914867532 | 0.92237305 |

This is mainly because, when the number of worker threads is relatively small, setting the number of locks to 4 times the thread count gives a significant performance improvement over 2 times.

[graph: time vs. number of mutexes, one line per thread count]

Contributor Author

For the complete test, please refer to https://gist.github.com/wathenjiang/30b689a7ef20b4ea667a2e8f358c321d

I think this performance test can serve as a reference for how many mutexes OwnedTasks needs.
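
For reference, here is a minimal sketch of this kind of microbenchmark (assumed structure, not the linked gist): a fixed number of random memory writes per thread, where each thread picks one of the mutexes round robin and only touches memory while holding that lock.

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

fn bench(num_threads: usize, num_mutexes: usize, ops_per_thread: usize) -> f64 {
    // One independently lockable buffer per mutex.
    let shards: Arc<Vec<Mutex<Vec<u64>>>> = Arc::new(
        (0..num_mutexes.max(1))
            .map(|_| Mutex::new(vec![0u64; 1 << 16]))
            .collect(),
    );
    let start = Instant::now();
    let handles: Vec<_> = (0..num_threads)
        .map(|t| {
            let shards = Arc::clone(&shards);
            thread::spawn(move || {
                let mut x = t as u64 + 1; // xorshift state for pseudo-random indices
                for i in 0..ops_per_thread {
                    let shard = &shards[i % shards.len()]; // round robin over the locks
                    x ^= x << 13;
                    x ^= x >> 7;
                    x ^= x << 17;
                    let idx = x as usize & ((1 << 16) - 1);
                    // The random memory access happens only while the lock is held.
                    let mut buf = shard.lock().unwrap();
                    buf[idx] = buf[idx].wrapping_add(1);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed().as_secs_f64()
}

fn main() {
    for &m in &[1usize, 4, 16, 64] {
        println!("mutexes = {m}: {:.3} s", bench(8, m, 1_000_000));
    }
}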

wathenjiang and others added 2 commits November 26, 2023 09:18
Co-authored-by: Alice Ryhl <aliceryhl@google.com>
Contributor

@Darksonn Darksonn left a comment

Thank you.

@Darksonn Darksonn enabled auto-merge (squash) December 7, 2023 10:32
@Darksonn Darksonn merged commit 3a4aef1 into tokio-rs:master Dec 7, 2023
78 checks passed
kodiakhq bot pushed a commit to pdylanross/fatigue that referenced this pull request Dec 11, 2023
Bumps tokio from 1.34.0 to 1.35.0.

Release notes
Sourced from tokio's releases.

Tokio v1.35.0
1.35.0 (December 8th, 2023)
Added

net: add Apple watchOS support (#6176)

Changed

io: drop the Sized requirements from AsyncReadExt.read_buf (#6169)
runtime: make Runtime unwind safe (#6189)
runtime: reduce the lock contention in task spawn (#6001)
tokio: update nix dependency to 0.27.1 (#6190)

Fixed

chore: make --cfg docsrs work without net feature (#6166)
chore: use relaxed load for unsync_load on miri (#6179)
runtime: handle missing context on wake (#6148)
taskdump: fix taskdump cargo config example (#6150)
taskdump: skip notified tasks during taskdumps (#6194)
tracing: avoid creating resource spans with current parent, use a None parent instead (#6107)
tracing: make task span explicit root (#6158)

Documented

io: flush in AsyncWriteExt examples (#6149)
runtime: document fairness guarantees and current behavior (#6145)
task: document cancel safety of LocalSet::run_until (#6147)

#6001: tokio-rs/tokio#6001
#6107: tokio-rs/tokio#6107
#6144: tokio-rs/tokio#6144
#6145: tokio-rs/tokio#6145
#6147: tokio-rs/tokio#6147
#6148: tokio-rs/tokio#6148
#6149: tokio-rs/tokio#6149
#6150: tokio-rs/tokio#6150
#6158: tokio-rs/tokio#6158
#6166: tokio-rs/tokio#6166
#6169: tokio-rs/tokio#6169
#6176: tokio-rs/tokio#6176
#6179: tokio-rs/tokio#6179
#6189: tokio-rs/tokio#6189
#6190: tokio-rs/tokio#6190
#6194: tokio-rs/tokio#6194



Commits

92a3455 chore: prepare Tokio v1.35.0 (#6197)
1968565 chore: use relaxed load for unsync_load (#6203)
c9273f1 sync: improve safety comments for WakeList (#6200)
e05d0f8 changelog: fix missing link for 1.8.2 (#6199)
debcb22 Revert "net: add SocketAddr::as_abstract_namespace (#6144)" (#6198)
83b7397 io: drop the Sized requirements from AsyncReadExt.read_buf (#6169)
3991f9f docs: fix typo in 'tokio/src/sync/broadcast.rs' (#6182)
48c0e62 chore: use relaxed load for unsync_load on miri (#6179)
d561b58 taskdump: skip notified tasks during taskdumps (#6194)
3a4aef1 runtime: reduce the lock contention in task spawn (#6001)
Additional commits viewable in compare view



