[Discussion] remove mutex in batch span processor. #533
Conversation
Codecov Report

```diff
@@           Coverage Diff           @@
##            main     #533    +/-  ##
=======================================
  Coverage    52.2%    52.3%
=======================================
  Files          96       96
  Lines        8506     8514      +8
=======================================
+ Hits         4447     4455      +8
  Misses       4059     4059
```

Continue to review full report at Codecov.
I thought so. But upon closer examination, I found that the futures bounded channel is guaranteed to allow every sender to send at least once (also based on the discussion in rust-lang/futures-rs#403 (comment)), so the capacity of the bounded channel is effectively the buffer size plus the number of senders. Overall I don't think we can just clone the sender per call and still enforce the limit.
Hm, what are other good options here? A new thread like the current simple processor, or a crossbeam sync channel that loops sending to the futures mpsc channel and drops if that channel is full?
Can try that. Did a POC with a crossbeam channel and the results look good.
I guess another choice we have is to use an unbounded channel and maintain internal state ourselves to count the elements within the channel. We can use lock-free operations on that state to avoid the mutex.
What about an unbounded crossbeam channel, keeping the inner bounded channel for the limits?
I don't think we need a dedicated thread or crossbeam channel in this case. Suppose we defined the channel like:

```rust
struct Sender {
    unbounded_sender: UnboundedSender<Msg>,
    state: Arc<State>,
}

struct Receiver {
    unbounded_receiver: UnboundedReceiver<Msg>,
    state: Arc<State>,
}

struct State {
    elements_num: AtomicUsize,
    // ...
}
```

We can then define a `send` method on the `Sender` that checks `elements_num` against the max queue size before pushing into the unbounded channel.
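A minimal sketch of that idea (my own, not code from this PR; `Msg`, `max_queue_size`, and the `bounded` constructor are illustrative assumptions), capping an unbounded futures channel with a lock-free counter:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

use futures::channel::mpsc::{unbounded, UnboundedReceiver, UnboundedSender};

struct Msg; // placeholder for the real message type

struct State {
    elements_num: AtomicUsize,
    max_queue_size: usize,
}

#[derive(Clone)]
struct Sender {
    unbounded_sender: UnboundedSender<Msg>,
    state: Arc<State>,
}

struct Receiver {
    unbounded_receiver: UnboundedReceiver<Msg>,
    state: Arc<State>,
}

fn bounded(max_queue_size: usize) -> (Sender, Receiver) {
    let (tx, rx) = unbounded();
    let state = Arc::new(State {
        elements_num: AtomicUsize::new(0),
        max_queue_size,
    });
    (
        Sender { unbounded_sender: tx, state: state.clone() },
        Receiver { unbounded_receiver: rx, state },
    )
}

impl Sender {
    /// Try to enqueue a message; hands it back if the logical queue is full.
    fn send(&self, msg: Msg) -> Result<(), Msg> {
        // Reserve a slot with a lock-free increment instead of taking a mutex.
        if self.state.elements_num.fetch_add(1, Ordering::AcqRel) >= self.state.max_queue_size {
            // Full: undo the reservation and drop (return) the message.
            self.state.elements_num.fetch_sub(1, Ordering::AcqRel);
            return Err(msg);
        }
        self.unbounded_sender.unbounded_send(msg).map_err(|e| {
            // Receiver is gone: release the reservation as well.
            self.state.elements_num.fetch_sub(1, Ordering::AcqRel);
            e.into_inner()
        })
    }
}

// The Receiver side would decrement `elements_num` each time it pulls a
// message out of `unbounded_receiver`.
```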
I found this approach works well. I think this is an easy fix with some performance improvement and it keeps the batch span processor simple. In the future, we may wrap our own channel to further improve performance. @jtescher let me know what you think 😬
@TommyCpp how do the existing span start/end overhead benchmarks compare? E.g. if you change the group function to:

```rust
fn trace_benchmark_group<F: Fn(&sdktrace::Tracer)>(c: &mut Criterion, name: &str, f: F) {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        let mut group = c.benchmark_group(name);

        group.bench_function("always-sample", |b| {
            let provider = sdktrace::TracerProvider::builder()
                .with_config(sdktrace::config().with_sampler(sdktrace::Sampler::AlwaysOn))
                .with_default_batch_exporter(VoidExporter, opentelemetry::runtime::Tokio)
                .build();
            let always_sample = provider.get_tracer("always-sample", None);
            b.iter(|| f(&always_sample));
        });

        group.bench_function("never-sample", |b| {
            let provider = sdktrace::TracerProvider::builder()
                .with_config(sdktrace::config().with_sampler(sdktrace::Sampler::AlwaysOff))
                .with_default_batch_exporter(VoidExporter, opentelemetry::runtime::Tokio)
                .build();
            let never_sample = provider.get_tracer("never-sample", None);
            b.iter(|| f(&never_sample));
        });

        group.finish();
    });
}
```

With the mutex, the batch processor seems to have about twice the overhead on the application process; wondering how that changes with the various approaches you've tried.
I ran the test and both approaches show a significant improvement compared with the mutex version, but there isn't much difference between the two.
Span building and reporting are likely the largest sources of performance overhead that otel will introduce when tracing, so finding the right solution here is fairly important. @djc / @frigus02 / @awiede any ideas on how to improve these numbers further? This is probably one of the last areas before tracing can stabilize at 1.0.
Hmm, I think I'd need to have a bit more context/pointers into the source code to be able to make any suggestions. Off the top of my head, maybe using something like smartstring for stack-allocated short strings could help with span building performance? I've gotten some decent wins out of it.
@djc fundamentally the question is the best way to send span data to the batch span processor queue when a span ends. Currently the batch processor handle wraps the channel sender in a mutex. Along with the linked alternatives, we could consider options like using tokio/async-std channels directly, possibly by extending the otel runtimes to expose a channel for this.
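A purely hypothetical sketch of what a runtime-provided channel could look like (the trait, method name, and types here are my assumptions, not the existing otel API):

```rust
use futures::Stream;

struct BatchMessage; // placeholder for whatever the batch processor consumes

/// Hypothetical runtime extension: each runtime hands the batch processor a
/// bounded channel built from its own primitives.
trait TraceRuntime {
    type Sender: Clone + Send + Sync;
    type Receiver: Stream<Item = BatchMessage> + Send;

    fn batch_message_channel(&self, capacity: usize) -> (Self::Sender, Self::Receiver);
}

/// Example implementation for a tokio-based runtime marker type.
struct Tokio;

impl TraceRuntime for Tokio {
    type Sender = tokio::sync::mpsc::Sender<BatchMessage>;
    type Receiver = tokio_stream::wrappers::ReceiverStream<BatchMessage>;

    fn batch_message_channel(&self, capacity: usize) -> (Self::Sender, Self::Receiver) {
        let (tx, rx) = tokio::sync::mpsc::channel(capacity);
        (tx, tokio_stream::wrappers::ReceiverStream::new(rx))
    }
}
```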
This is a great investigation @TommyCpp. Sadly, I don't have any ideas on how to make span reporting even more performant. I haven't looked at span building yet; I'll try to find time in the next few days.
I assume either of those is good enough for a first version. We might want to document the performance overhead of the API/SDK somewhere. Not necessarily with CPU/memory numbers but something like "batch span processor creates one thread and keeps at most X spans in memory"? I also just read through performance.md. It suggests letting the user choose between preventing information loss and preventing blocking. I think in theory we could let the user choose between those two behaviors here as well.
According to the futures-rs docs, each cloned sender gets a guaranteed slot to send one message. There has been some discussion around this and the conclusion was to keep the behavior as is. As a result, if we clone the sender for each function call, we won't be able to enforce the message limit using the channel.
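A small standalone check of that behavior (my own illustration, not from the thread): with a buffer of 1, every fresh clone of the sender still gets its message in.

```rust
use futures::channel::mpsc;

fn main() {
    // Bounded channel with a buffer of 1...
    let (tx, _rx) = mpsc::channel::<u32>(1);
    // ...but every cloned sender has one guaranteed slot of its own.
    let mut clones: Vec<_> = (0..4).map(|_| tx.clone()).collect();

    let mut accepted = 0;
    for tx in clones.iter_mut() {
        if tx.try_send(42).is_ok() {
            accepted += 1;
        }
    }

    // Nothing has been read from the channel, yet more than `buffer` messages
    // were accepted: effective capacity is buffer + num_senders.
    println!("accepted {accepted} messages with buffer size 1");
}
```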
In practice, how much of an issue does this appear to be? How many slots do we expect to have in the channel, and what is the range in number of senders we expect to be active? It looks like the tokio mpsc implementation might not have this issue.
I believe when a sender drops, its message doesn't get cleaned up (see rust-lang/futures-rs#2381 (comment)). Thus, there is a high risk of OOM here.
@djc the spec suggests fairly strongly not to have components with unbounded memory consumption, but more specifically here we need to be able to enforce the batch processor's max queue size configuration. @TommyCpp had explored using unbounded queues + an atomic to maintain the limit.
Looks like there is a new version of Rust. Will fix those lint problems tonight.
Should we make a decision on this one? @open-telemetry/rust-approvers
Might be nice to experiment with a runtime-specific option to see how that performs in terms of overhead on the traced application. Other than that, whichever currently looks the most performant would get my vote.
@TommyCpp also, moving the span processing to a new thread that owns the mpsc sender should be as performant as the simple processor is currently (it would be basically the same impl, just sending to the async runtime instead of executor::block_on), so ideally we could hit ~375ns or less.
That's basically using a thread to relay messages, right? It has good performance, but it will be hard to drop spans at the appropriate time.
@TommyCpp was thinking two channels: first an unbounded crossbeam channel for a quick way to get span data to the other thread, then the other thread is basically just receiving from that channel like the simple processor does, and pushes to the owned mpsc channel or drops if that is full. Basically replace the mutex with a thread and queue, and keep the rest the way it is.
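A rough sketch of that relay idea (my own, with `SpanData` as a stand-in type and crossbeam-channel/futures as assumed dependencies):

```rust
use std::thread;

use futures::channel::mpsc;

struct SpanData; // stand-in for the real exported span type

/// Spawn a relay thread: the hot path only does a cheap unbounded crossbeam
/// send, and the thread forwards into the bounded async channel, dropping
/// spans when that channel is full.
fn spawn_relay(
    max_queue_size: usize,
) -> (crossbeam_channel::Sender<SpanData>, mpsc::Receiver<SpanData>) {
    let (sync_tx, sync_rx) = crossbeam_channel::unbounded::<SpanData>();
    let (mut async_tx, async_rx) = mpsc::channel::<SpanData>(max_queue_size);

    thread::spawn(move || {
        // Blocks until the next span arrives; exits when all senders drop.
        for span in sync_rx {
            // If the bounded queue is at max_queue_size, the span is dropped.
            let _ = async_tx.try_send(span);
        }
    });

    (sync_tx, async_rx)
}
```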
@jtescher I believe this PR's implementation is exactly this method once I replace the channel in the BSP from a bounded crossbeam channel to an unbounded crossbeam channel.
LGTM
Did another round of benchmarking
Using a new thread is probably a good start, but I am still worried about the possibility of OOM here. If we have multiple producers it could in theory OOM because the relay thread cannot keep up with the incoming volume, right?
Looks like the tokio channels are the best solution performance-wise (the percentages seem incorrect in a few places)?
I'd love to see this as well. FWIW, the "flume" crate provides a high-performance channel that works quite well, and it's runtime-independent; it works well under either tokio or async-std, and it also supports synchronous send/recv, which would allow using a thread instead of a task.
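A quick illustration of that mix (assuming flume and a tokio runtime with the macros feature; not code from this PR): the same bounded flume channel used synchronously from a thread and asynchronously from a task.

```rust
use std::thread;

#[tokio::main]
async fn main() {
    let (tx, rx) = flume::bounded::<&'static str>(8);

    // Synchronous side, e.g. a span-reporting thread: no async runtime needed.
    let producer = thread::spawn(move || {
        for _ in 0..4 {
            // try_send drops the message instead of blocking when full.
            let _ = tx.try_send("span data");
        }
    });
    producer.join().unwrap();

    // Asynchronous side, e.g. the batch export task.
    while let Ok(msg) = rx.recv_async().await {
        println!("received: {msg}");
    }
}
```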
Thanks for the advice. I did a quick test and it seems to have similar performance to async_channel's mpmc channel, although its performance seems to get worse when the number of concurrent sending tasks increases to 16+.
Updated the tokio channel implementation in #560. Will try to see if I can build our own bounded mpsc channel from an unbounded channel and an atomic integer. That's about the only implementation left to do.
Instead of using mutexes, we just clone the sender.
Performance improves as a result.
Related #520