[Discussion] remove mutex in batch span processor. #533

Closed
wants to merge 9 commits into from

Conversation

TommyCpp
Contributor

@TommyCpp TommyCpp commented Apr 24, 2021

Instead of using a mutex, we just clone the sender.

Performance improves:

time:   [1.5888 ms 1.5997 ms 1.6113 ms]
change: [-44.416% -43.990% -43.497%] (p = 0.00 < 0.05)

related #520

@TommyCpp TommyCpp requested a review from a team as a code owner April 24, 2021 02:22
@TommyCpp TommyCpp changed the title feat: remove mutex in batch span processor. [Discussion] remove mutex in batch span processor. Apr 24, 2021
@codecov

codecov bot commented Apr 24, 2021

Codecov Report

Merging #533 (fd310c7) into main (88e779d) will increase coverage by 0.0%.
The diff coverage is 74.0%.


@@          Coverage Diff          @@
##            main    #533   +/-   ##
=====================================
  Coverage   52.2%   52.3%           
=====================================
  Files         96      96           
  Lines       8506    8514    +8     
=====================================
+ Hits        4447    4455    +8     
  Misses      4059    4059           
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| opentelemetry/src/sdk/trace/span_processor.rs | 80.3% <74.0%> | -1.2% ⬇️ |
| opentelemetry/src/global/trace.rs | 15.6% <0.0%> | +0.2% ⬆️ |
| opentelemetry/src/trace/mod.rs | 50.0% <0.0%> | +6.0% ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jtescher
Member

@TommyCpp you mentioned in #520 that cloning would increase capacity; are span limits still being enforced with this change?

@TommyCpp
Contributor Author

TommyCpp commented Apr 24, 2021

are span limits still being enforced with this change

I thought so. But upon closer examination, I found that the futures bounded channel guarantees that every sender can send at least once. Thus, try_send will always succeed for a newly created Sender, which means we will not get errors when the channel is full.

Also, based on the discussion in rust-lang/futures-rs#403 (comment), the capacity of the bounded channel is the buffer size plus the number of senders there have ever been, which could lead to OOM in our use case. I found this surprising.

Overall I don't think we can just clone the Sender here 😞
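
For reference, a minimal sketch (not part of this PR) of why cloning defeats the configured bound with futures::channel::mpsc: every clone gets its own guaranteed slot, so try_send keeps succeeding even when nothing drains the channel.

```rust
use futures::channel::mpsc;

fn main() {
    // Nominal buffer of 2, but the real capacity is buffer + number of senders.
    let (tx, _rx) = mpsc::channel::<u32>(2);

    for i in 0..100 {
        // A fresh clone has not used its guaranteed slot yet, so its
        // try_send succeeds even though nothing is being received.
        let mut tx = tx.clone();
        assert!(tx.try_send(i).is_ok());
    }
}
```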

@jtescher
Member

Hm, what are other good options here? A new thread like the current simple processor, or a crossbeam sync channel that loops sending to the futures mpsc channel and drops if that channel is full?

@TommyCpp
Contributor Author

TommyCpp commented Apr 25, 2021

new thread like the current simple processor / crossbeam sync channel that loops sending to the future mpsc channel or drops if that channel is full?

Can try that. The crossbeam channel's try_send gives us the desired behavior, and it takes &self. We could spawn a new thread to "relay" the messages from the main task/thread to the export task/thread. The max_queue_size can be enforced in the crossbeam channel. I'm not sure whether it will be a performance improvement, but I can benchmark it once I finish a POC.
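
As an illustration only (not the PR's actual code), a relay-thread sketch using a bounded crossbeam channel, whose capacity stands in for max_queue_size and whose try_send takes &self:

```rust
use crossbeam_channel::{bounded, TrySendError};
use std::thread;

struct SpanData; // stand-in for the exported span data

fn main() {
    // Bounded crossbeam channel: the capacity enforces the queue limit and
    // try_send only needs &self, so there is no mutex on the hot path.
    let (tx, rx) = bounded::<SpanData>(2048);

    // Relay thread: forwards spans from application threads toward the
    // async export task (elided here).
    let relay = thread::spawn(move || {
        for _span in rx {
            // push into the channel owned by the batching task
        }
    });

    // Application side: ending a span becomes a non-blocking try_send;
    // the span is dropped when the queue is full.
    if let Err(TrySendError::Full(_dropped)) = tx.try_send(SpanData) {
        // record a dropped span
    }

    drop(tx); // closing all senders ends the relay loop
    relay.join().unwrap();
}
```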


Did a POC and the results look good: around a 20%-45% performance improvement.

 
BatchSpanProcessor/with 1 concurrent task                        
                        time:   [570.84 us 583.43 us 596.90 us]
                        change: [-29.334% -24.649% -18.352%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 50 measurements (10.00%)
  2 (4.00%) high mild
  3 (6.00%) high severe

BatchSpanProcessor/with 2 concurrent task                        
                        time:   [693.16 us 705.30 us 719.07 us]
                        change: [-26.410% -22.148% -16.501%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 50 measurements (16.00%)
  4 (8.00%) high mild
  4 (8.00%) high severe

BatchSpanProcessor/with 4 concurrent task                        
                        time:   [942.06 us 955.24 us 968.36 us]
                        change: [-33.543% -31.358% -28.973%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 50 measurements (8.00%)
  2 (4.00%) high mild
  2 (4.00%) high severe

BatchSpanProcessor/with 8 concurrent task                        
                        time:   [1.4606 ms 1.4733 ms 1.4870 ms]
                        change: [-44.982% -43.717% -42.304%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 50 measurements (6.00%)
  1 (2.00%) low mild
  2 (4.00%) high severe

BatchSpanProcessor/with 16 concurrent task                        
                        time:   [2.4540 ms 2.4817 ms 2.5095 ms]
                        change: [-48.691% -46.453% -44.102%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 50 measurements (14.00%)
  1 (2.00%) low mild
  6 (12.00%) high severe

BatchSpanProcessor/with 32 concurrent task                        
                        time:   [4.6234 ms 4.6722 ms 4.7267 ms]
                        change: [-46.452% -45.520% -44.607%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 50 measurements (12.00%)
  1 (2.00%) low mild
  5 (10.00%) high mild

@TommyCpp TommyCpp marked this pull request as draft April 25, 2021 17:47
@TommyCpp
Contributor Author

I guess another choice we have is to use an unbounded channel and maintain internal state ourselves to count the elements in the channel. We can use lock-free operations on that state to avoid the mutex.

@jtescher
Member

unbound channel and maintain an internal state ourselves

What about an unbounded crossbeam channel, keeping the inner bounded channel for the limits?

@TommyCpp
Contributor Author

unbound channel and maintain an internal state ourselves

What about an unbounded crossbeam channel, keeping the inner bounded channel for the limits?

I don't think we need a dedicated thread or crossbeam channel in this case. Suppose we define the channel like:

struct Sender {
    unbounded_sender: UnboundedSender<Msg>,
    state: Arc<State>,
}

struct Receiver {
    unbounded_receiver: UnboundedReceiver<Msg>,
    state: Arc<State>,
}

struct State {
    elements_num: AtomicUsize,
    // ...
}

We can then define a try_send method that takes &self rather than &mut self, and enforce the span limit via State.
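
A rough sketch of what that wrapper could look like (illustrative only; the names and details here are assumptions, not code from this PR, and the shared state is folded into the sender for brevity):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use futures::channel::mpsc;

// A bounded facade over an unbounded futures channel: the shared atomic
// counter enforces max_queue_size, so try_send can take &self.
struct BoundedSender<T> {
    inner: mpsc::UnboundedSender<T>,
    len: Arc<AtomicUsize>,
    max_queue_size: usize,
}

impl<T> BoundedSender<T> {
    /// Returns the message to the caller if the queue is full or closed.
    fn try_send(&self, msg: T) -> Result<(), T> {
        // Reserve a slot first; undo the reservation if we went over the limit.
        if self.len.fetch_add(1, Ordering::AcqRel) >= self.max_queue_size {
            self.len.fetch_sub(1, Ordering::AcqRel);
            return Err(msg);
        }
        self.inner.unbounded_send(msg).map_err(|e| {
            self.len.fetch_sub(1, Ordering::AcqRel);
            e.into_inner()
        })
    }
}

// The receiving side would wrap mpsc::UnboundedReceiver and decrement `len`
// for every message it takes off the channel.
```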

@TommyCpp
Contributor Author

TommyCpp commented Apr 29, 2021

While researching channel implementations, I found that the async_channel crate seems to be a good choice. Its sender has a try_send method that takes &self. I did a POC with it and most results look good, except that it performed worse when there wasn't any concurrency.

I think this is an easy fix with some performance improvement, and it keeps the batch span processor simple. In the future, we may wrap our own channel to further improve performance.
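
A minimal sketch of the async_channel approach (assuming the async_channel crate; this is not the POC code itself):

```rust
use async_channel::{bounded, TrySendError};
use futures::executor::block_on;

fn main() {
    // async_channel is an async mpmc channel whose Sender::try_send takes
    // &self, so the processor handle can be cloned and shared without a Mutex.
    let (tx, rx) = bounded::<&'static str>(2);

    assert!(tx.try_send("span 1").is_ok());
    assert!(tx.try_send("span 2").is_ok());
    // Queue full: the span is rejected instead of being buffered without bound.
    assert!(matches!(tx.try_send("span 3"), Err(TrySendError::Full(_))));

    // The export task awaits the receiver; block_on stands in for the async
    // runtime driving the batching loop.
    block_on(async {
        assert_eq!(rx.recv().await.unwrap(), "span 1");
    });
}
```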

@jtescher let me know what you think 😬

BatchSpanProcessor/with 1 concurrent task                        
                        time:   [936.25 us 969.14 us 1.0046 ms]
                        change: [+9.6398% +21.109% +31.546%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 50 measurements (2.00%)
  1 (2.00%) high severe

BatchSpanProcessor/with 2 concurrent task                        
                        time:   [744.50 us 760.84 us 777.97 us]
                        change: [-39.810% -33.943% -28.101%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 50 measurements (4.00%)
  2 (4.00%) high severe

BatchSpanProcessor/with 4 concurrent task                        
                        time:   [854.81 us 863.26 us 872.00 us]
                        change: [-46.646% -44.508% -41.412%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 50 measurements (4.00%)
  1 (2.00%) high mild
  1 (2.00%) high severe

BatchSpanProcessor/with 8 concurrent task                        
                        time:   [1.4787 ms 1.4834 ms 1.4889 ms]
                        change: [-50.795% -48.749% -46.637%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 50 measurements (22.00%)
  1 (2.00%) low severe
  4 (8.00%) low mild
  6 (12.00%) high severe

BatchSpanProcessor/with 16 concurrent task                        
                        time:   [2.5739 ms 2.5812 ms 2.5890 ms]
                        change: [-51.467% -49.728% -48.071%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 50 measurements (12.00%)
  2 (4.00%) low mild
  1 (2.00%) high mild
  3 (6.00%) high severe

BatchSpanProcessor/with 32 concurrent task                        
                        time:   [4.4602 ms 4.5027 ms 4.5489 ms]
                        change: [-49.916% -48.948% -47.841%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 50 measurements (8.00%)
  2 (4.00%) high mild
  2 (4.00%) high severe

@jtescher
Member

@TommyCpp how do the existing span start/end overhead benchmarks compare? E.g. if you change the group function to

fn trace_benchmark_group<F: Fn(&sdktrace::Tracer)>(c: &mut Criterion, name: &str, f: F) {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        let mut group = c.benchmark_group(name);

        group.bench_function("always-sample", |b| {
            let provider = sdktrace::TracerProvider::builder()
                .with_config(sdktrace::config().with_sampler(sdktrace::Sampler::AlwaysOn))
                .with_default_batch_exporter(VoidExporter, opentelemetry::runtime::Tokio)
                .build();
            let always_sample = provider.get_tracer("always-sample", None);

            b.iter(|| f(&always_sample));
        });

        group.bench_function("never-sample", |b| {
            let provider = sdktrace::TracerProvider::builder()
                .with_config(sdktrace::config().with_sampler(sdktrace::Sampler::AlwaysOff))
                .with_default_batch_exporter(VoidExporter, opentelemetry::runtime::Tokio)
                .build();
            let never_sample = provider.get_tracer("never-sample", None);
            b.iter(|| f(&never_sample));
        });

        group.finish();
    });
}

With the mutex, the batch processor seems to have about twice the overhead on the application process; I'm wondering how that changes with the various approaches you've tried.

@TommyCpp
Contributor Author

TommyCpp commented Apr 30, 2021

I ran the test, and both approaches show a significant improvement compared with the Mutex version, but there isn't much difference between the two.

  1. Implementation with an extra thread to relay messages:
start-end-span/always-sample                                                                             
                        time:   [730.20 ns 812.46 ns 916.13 ns]
                        change: [+1.8579% +10.296% +19.213%] (p = 0.02 < 0.05)
                        Performance has regressed.
start-end-span/never-sample                                                                            
                        time:   [183.98 ns 189.81 ns 195.69 ns]
                        change: [-2.7054% -1.0013% +0.6458%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

start-end-span-4-attrs/always-sample                        
                        time:   [976.92 ns 985.06 ns 992.30 ns]
                        change: [-46.251% -40.904% -31.121%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high severe

start-end-span-4-attrs/never-sample                        
                        time:   [215.68 ns 216.29 ns 217.01 ns]
                        change: [+10.460% +10.864% +11.302%] (p = 0.00 < 0.05)
                        Performance has regressed.


start-end-span-8-attrs/always-sample                        
                        time:   [1.7696 us 1.7853 us 1.8011 us]
                        change: [-81.733% -81.255% -80.537%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high severe

start-end-span-8-attrs/never-sample                        
                        time:   [252.54 ns 253.12 ns 253.80 ns]
                        change: [-73.111% -73.035% -72.963%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild


start-end-span-all-attr-types/always-sample                        
                        time:   [1.3350 us 1.3539 us 1.3776 us]
                        change: [-84.195% -83.846% -83.464%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

start-end-span-all-attr-types/never-sample                        
                        time:   [229.42 ns 230.13 ns 231.05 ns]
                        change: [-73.581% -73.501% -73.418%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

start-end-span-all-attr-types-2x/always-sample                        
                        time:   [2.9964 us 3.0272 us 3.0641 us]
                        change: [-79.493% -79.075% -78.634%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low severe
  4 (4.00%) low mild
  2 (2.00%) high severe

start-end-span-all-attr-types-2x/never-sample                        
                        time:   [251.54 ns 252.57 ns 253.93 ns]
                        change: [-69.990% -67.314% -64.104%] (p = 0.00 < 0.05)
                        Performance has improved.

  2. Implementation with async_channel:
start-end-span/always-sample                                                                            
                        time:   [452.25 ns 454.40 ns 456.23 ns]
                        change: [-58.468% -57.921% -57.371%] (p = 0.00 < 0.05)
                        Performance has improved.
start-end-span/never-sample                                                                            
                        time:   [163.16 ns 163.39 ns 163.65 ns]
                        change: [-12.297% -12.021% -11.743%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  4 (4.00%) high mild


start-end-span-4-attrs/always-sample                        
                        time:   [1.0922 us 1.1052 us 1.1177 us]
                        change: [-40.728% -39.715% -38.670%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

start-end-span-4-attrs/never-sample                        
                        time:   [187.76 ns 188.26 ns 188.82 ns]
                        change: [-5.6437% -4.9963% -4.4181%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe


start-end-span-8-attrs/always-sample                        
                        time:   [1.8076 us 1.8232 us 1.8412 us]
                        change: [-80.949% -80.628% -80.343%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

start-end-span-8-attrs/never-sample                        
                        time:   [216.77 ns 217.01 ns 217.22 ns]
                        change: [-77.328% -77.249% -77.182%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  8 (8.00%) low mild


start-end-span-all-attr-types/always-sample                        
                        time:   [1.4915 us 1.5059 us 1.5194 us]
                        change: [-81.972% -81.674% -81.377%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low mild

start-end-span-all-attr-types/never-sample                        
                        time:   [236.72 ns 238.55 ns 240.54 ns]
                        change: [-73.182% -73.008% -72.834%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild


start-end-span-all-attr-types-2x/always-sample                        
                        time:   [2.8460 us 2.8737 us 2.9035 us]
                        change: [-80.096% -79.716% -79.327%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  8 (8.00%) low severe
  5 (5.00%) low mild
  6 (6.00%) high severe

start-end-span-all-attr-types-2x/never-sample                        
                        time:   [272.03 ns 272.42 ns 272.87 ns]
                        change: [-68.585% -65.774% -62.426%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low severe
  4 (4.00%) low mild
  2 (2.00%) high mild

@jtescher
Copy link
Member

jtescher commented May 2, 2021

Span building and reporting are likely the largest sources of performance overhead that otel will introduce when tracing, so finding the right solution here is fairly important. @djc / @frigus02 / @awiede any ideas on how to improve these numbers further? This is probably one of the last areas before tracing can stabilize at 1.0.

@djc
Contributor

djc commented May 2, 2021

Hmm, I think I'd need to have a bit more context/pointers into the source code to be able to make any suggestions.

Off the top of my head, maybe using something like smartstring for stack-allocated short strings could help with span building performance? I've gotten some decent wins out of it.

@TommyCpp
Contributor Author

TommyCpp commented May 2, 2021

I did two POCs with different implementations. The first one, using an extra thread to relay the messages, can be found in this PR (or here). The other one, using async_channel, can be found here.

@jtescher
Member

jtescher commented May 3, 2021

@djc fundamentally the question is the best way to send span data to the batch span processor queue when a span ends. Currently the batch processor handle wraps a futures::channel::mpsc::Sender, but sending requires &mut self, which is handled by a mutex and is sub-optimal. @TommyCpp has examples of switching to crossbeam_channel::Sender (which would require another thread, similar to the simple processor) and async_channel::Sender, both of which take &self instead and have different performance characteristics.

Along with the linked alternatives, we could consider options like using tokio/async-std channels directly, possibly by extending the otel runtimes for them to add a try_send method.
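
To make the runtime-channel option concrete, here is a small sketch of the try_send semantics a tokio-based channel provides (assuming the tokio crate; this is not the otel runtime API):

```rust
use tokio::sync::mpsc::{channel, error::TrySendError};

#[tokio::main]
async fn main() {
    // tokio's bounded mpsc Sender::try_send also takes &self, so using the
    // runtime's own channel is another mutex-free option, at the cost of
    // tying the processor to a specific runtime.
    let (tx, mut rx) = channel::<u32>(2);

    assert!(tx.try_send(1).is_ok());
    assert!(tx.try_send(2).is_ok());
    // Queue full: reject instead of growing past the configured capacity.
    assert!(matches!(tx.try_send(3), Err(TrySendError::Full(3))));

    assert_eq!(rx.recv().await, Some(1));
}
```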

@frigus02
Member

frigus02 commented May 3, 2021

This is a great investigation @TommyCpp. Sadly, I don't have any ideas on how to make span reporting even more performant. I haven't looked at span building yet; I'll try to find time in the next few days.

crossbeam_channel and async_channel both seem to be good choices here. I wonder which one has a smaller effect on application performance: batch export in a separate thread or in a task. Although I imagine this might actually depend on the application.

I assume either of those is good enough for a first version. We might want to document the performance overhead of the API/SDK somewhere. Not necessarily with CPU/memory numbers, but something like "the batch span processor creates one thread and keeps at most X spans in memory"?

I also just read through performance.md. It suggests letting the user choose between preventing information loss and preventing blocking. I think in theory we could let the user choose between send and try_send, although that sounds like something we could add later on in a non-breaking way.

@djc
Contributor

djc commented May 3, 2021

For an mpsc queue, wrapping a Mutex around a sender doesn't seem to make much sense; we should always clone the sender instead. Was there a downside to that in terms of managing the capacity of the channel? What is calling BatchSpanProcessor::on_end()? Who/what triggers flushing spans to the exporter?

@TommyCpp
Contributor Author

TommyCpp commented May 3, 2021

Was there a downside to that in terms of managing the capacity of the channel?

According to the futures-rs docs, each cloned sender gets a free slot to send a message. There has been some discussion around this, and the conclusion was to keep it as is. As a result, if we clone the sender for each function call, we won't be able to enforce the message limit using the channel.

What is calling BatchSpanProcessor::on_end()

When a Span drops, the TracerProvider will call this function and pass the span as a parameter. Note that the spec requires this function to be thread safe, as spans can end on multiple threads at the same time.

@djc
Contributor

djc commented May 3, 2021

According to the futures-rs docs, each cloned sender gets a free slot to send a message. There has been some discussion around this, and the conclusion was to keep it as is. As a result, if we clone the sender for each function call, we won't be able to enforce the message limit using the channel.

In practice, how much of an issue does this appear to be? How many slots do we expect to have in the channel, and what is the range in number of senders we expect to be active? It looks like the tokio mpsc implementation might not have this issue.

@TommyCpp
Contributor Author

TommyCpp commented May 3, 2021

According to the futures-rs docs, each cloned sender gets a free slot to send a message. There has been some discussion around this, and the conclusion was to keep it as is. As a result, if we clone the sender for each function call, we won't be able to enforce the message limit using the channel.

In practice, how much of an issue does this appear to be? How many slots do we expect to have in the channel, and what is the range in number of senders we expect to be active? It looks like the tokio mpsc implementation might not have this issue.

I believe that when a sender drops, its messages don't get cleaned up (see rust-lang/futures-rs#2381 (comment)). Thus, there is a high risk of OOM here.

@jtescher
Member

jtescher commented May 4, 2021

@djc the spec suggests fairly strongly not to have components with unbounded memory consumption, but more specifically here we need to be able to enforce the batch processor's max queue size configuration. @TommyCpp had explored using unbounded queues + an atomic to maintain the limit.

@TommyCpp
Contributor Author

TommyCpp commented May 7, 2021

Looks like there is a new version of Rust. Will fix those lint problems tonight.

@TommyCpp
Contributor Author

Should we make a decision on this one? @open-telemetry/rust-approvers

@jtescher
Member

Might be nice to experiment with a runtime-specific option to see how that performs in terms of overhead on the traced application. Other than that, whichever currently looks the most performant would get my vote.

@jtescher
Member

@TommyCpp also, moving the span processing to a new thread that owns the mpsc sender should be as performant as the simple processor is currently (it would be basically the same impl, just sending to the async runtime instead of using executor::block_on), so ideally we could hit ~375ns or less.

@TommyCpp
Contributor Author

also moving the span processing to new thread that owns the mpsc sender should be as performant

That's basically using a thread to relay messages, right? It has good performance, but it will be hard to drop the spans at an appropriate time.

@jtescher
Member

@TommyCpp was thinking two channels: first an unbounded crossbeam channel as a quick way to get span data to the other thread; then the other thread basically just receives from that channel, like the simple processor does, and pushes to the owned mpsc channel or drops if it is full. Basically, replace the mutex with a thread and a queue, and keep the rest the way it is.
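
A minimal sketch of that two-channel layout (assumed shapes, not the PR code):

```rust
use crossbeam_channel::unbounded;
use futures::channel::mpsc;
use std::thread;

struct SpanData; // stand-in for the real span export data

fn main() {
    // 1) Unbounded crossbeam channel: a cheap, lock-free handoff out of the
    //    application threads (send takes &self and never blocks).
    let (app_tx, app_rx) = unbounded::<SpanData>();

    // 2) Bounded futures channel owned by the batching task; its buffer is
    //    what enforces max_queue_size.
    let (mut batch_tx, _batch_rx) = mpsc::channel::<SpanData>(2048);

    // The relay thread is the sole owner of the &mut futures Sender, so the
    // mutex disappears; spans are dropped when the batch queue is full.
    let relay = thread::spawn(move || {
        for span in app_rx {
            if batch_tx.try_send(span).is_err() {
                // batch queue full (or shut down): drop the span
            }
        }
    });

    // Application side: ending a span is just an unbounded send.
    let _ = app_tx.send(SpanData);

    drop(app_tx); // disconnecting all senders lets the relay loop exit
    relay.join().unwrap();
}
```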

@TommyCpp
Contributor Author

@jtescher I believe this PR's implementation uses exactly that method once I replace the bounded crossbeam channel in the BSP with an unbounded crossbeam channel.

Member

@jtescher jtescher left a comment


LGTM

@TommyCpp
Contributor Author

TommyCpp commented May 18, 2021

Did another round of benchmarking

| Benchmark | Baseline (current implementation) | mpmc channel | New thread | Tokio channel |
| --- | --- | --- | --- | --- |
| BSP/1 task | 1.2ms | 1.4ms (+14%) | 664us (-47%) | 759us (-8%) |
| BSP/2 tasks | 1.3ms | 930us (-31%) | 865us (-38%) | 668.27us (-45%) |
| BSP/4 tasks | 2.0ms | 1.07ms (-45%) | 1.16ms (-41%) | 839.55us (-47%) |
| BSP/8 tasks | 3.5ms | 1.69ms (-52%) | 1.7ms (-51%) | 1.3048ms (-54%) |
| BSP/16 tasks | 6.3ms | 2.68ms (-57%) | 3.1ms (-50%) | 2.5242ms (-61%) |
| BSP/32 tasks | 10.6ms | 4.7ms (-55%) | 5.08ms (-49%) | 5.15ms (-52%) |
| start-end-span/always-sample | 941.14ns | 429ns (-54%) | 436ns (-50%) | 454.97ns (-51%) |
| start-end-span/never-sample | 170.07ns | 166.78ns (-3%) | 150ns (-11%) | 149.71ns (-11.987%) |
| start-end-span-4-attrs/always-sample | 1.8034us | 1.08us (-41%) | 1.23us (-33%) | 1.18us (-36%) |
| start-end-span-4-attrs/never-sample | 187.49ns | 179.95ns (-4%) | 186.63ns (+0.89%) | 169.64ns (-9%) |
| start-end-span-8-attrs/always-sample | 2.38us | 1.7092us (-28%) | 1.9882us (-16%) | 1.7026us (-28%) |
| start-end-span-8-attrs/never-sample | 214ns | 231.45ns (+5%) | 223.82ns (+3.9%) | 211.24ns (-1.9%) |
| start-end-span-all-attr-types/always-sample | 2.0105us | 1.63us (-19%) | 1.79us (-11%) | 1.45us (-28.6%) |
| start-end-span-all-attr-types/never-sample | 195.05us | 200.47ns (+1.1%) | 218ns (+11%) | 186.63ns (-4%) |
| start-end-span-all-attr-types-2x/always-sample | 3.8546us | 2.6451us (-30.9%) | 3.34us (-11%) | 2.81us (-27%) |
| start-end-span-all-attr-types-2x/never-sample | 259.07ns | 276.76ns (+3.92%) | 254.04ns (-2.7%) | 260.54ns (-7%) |

Using a new thread is probably a good start, but I am still worried about the possibility of OOM here. If we have multiple producers, it could in theory OOM because the relay thread cannot keep up with the incoming volume, right?

@djc
Contributor

djc commented May 18, 2021

Looks like the tokio channels are the best solution performance-wise (the percentages seem incorrect in a few places)?

@TommyCpp
Contributor Author

@djc Yeah, I think you are right. Some cells had the wrong numbers when I copied and pasted them; I think I fixed those.

The original output can be found here.

@TommyCpp TommyCpp marked this pull request as ready for review May 21, 2021 00:21
@joshtriplett

I'd love to see this as well.

FWIW, the "flume" crate provides a high-performance channel that works quite well, and it's runtime-independent; it works well under either tokio or async-std, and it also supports synchronous send/recv which would allow using a thread instead of a task.

@TommyCpp
Contributor Author

the "flume" crate provides a high-performance channel that works quite well

Thanks for the advice. I did a quick test and it seems to have performance similar to async_channel's mpmc channel, although it seems to be worse once the number of concurrent sending tasks increases to 16+.

@TommyCpp
Contributor Author

Updated the tokio channel implementation in #560. I will see if I can build our own bounded mpsc channel from an unbounded channel and an atomic integer. That's about the only implementation left to do.
