[Discussion] remove mutex in batch span processor. #533
Conversation
Codecov Report

```diff
@@           Coverage Diff           @@
##            main     #533    +/-  ##
=======================================
  Coverage    52.2%    52.3%
=======================================
  Files          96       96
  Lines        8506     8514      +8
=======================================
+ Hits         4447     4455      +8
  Misses       4059     4059
```

Continue to review full report at Codecov.
I thought so. But upon closer examination, I found that the futures bounded channel is guaranteed to allow every sender to send at least once (also based on the discussion in rust-lang/futures-rs#403 (comment)), so the capacity of the bounded channel is effectively the buffer size plus the number of senders. Overall I don't think we can just clone the sender per call and still enforce the limit.
Hm, what are other good options here? A new thread like the current simple processor, or a crossbeam sync channel that loops sending to the futures mpsc channel and drops if that channel is full?
Can try that. Did a POC with a crossbeam channel and the results look good.
I guess another choice we have is to use an unbounded channel and maintain internal state ourselves to count the elements within the channel. We can use lock-free operations on that state to avoid the mutex.
What about an unbounded crossbeam channel, keeping the inner bounded channel for the limits?
I don't think we need a dedicated thread or crossbeam channel in this case. Suppose we defined the channel like:

```rust
struct Sender {
    unbounded_sender: UnboundedSender<Msg>,
    state: Arc<State>,
}

struct Receiver {
    unbounded_receiver: UnboundedReceiver<Msg>,
    state: Arc<State>,
}

struct State {
    elements_num: AtomicUsize,
    // ...
}
```

We can then define a `send` method on the `Sender` that checks `elements_num` against the max queue size before pushing into the unbounded channel.
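A minimal sketch of that idea (my own, not code from this PR; `Msg`, `max_queue_size`, and the `bounded` constructor are illustrative assumptions), capping an unbounded futures channel with a lock-free counter:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

use futures::channel::mpsc::{unbounded, UnboundedReceiver, UnboundedSender};

struct Msg; // placeholder for the real message type

struct State {
    elements_num: AtomicUsize,
    max_queue_size: usize,
}

#[derive(Clone)]
struct Sender {
    unbounded_sender: UnboundedSender<Msg>,
    state: Arc<State>,
}

struct Receiver {
    unbounded_receiver: UnboundedReceiver<Msg>,
    state: Arc<State>,
}

fn bounded(max_queue_size: usize) -> (Sender, Receiver) {
    let (tx, rx) = unbounded();
    let state = Arc::new(State {
        elements_num: AtomicUsize::new(0),
        max_queue_size,
    });
    (
        Sender { unbounded_sender: tx, state: state.clone() },
        Receiver { unbounded_receiver: rx, state },
    )
}

impl Sender {
    /// Try to enqueue a message; hands it back if the logical queue is full.
    fn send(&self, msg: Msg) -> Result<(), Msg> {
        // Reserve a slot with a lock-free increment instead of taking a mutex.
        if self.state.elements_num.fetch_add(1, Ordering::AcqRel) >= self.state.max_queue_size {
            // Full: undo the reservation and drop (return) the message.
            self.state.elements_num.fetch_sub(1, Ordering::AcqRel);
            return Err(msg);
        }
        self.unbounded_sender.unbounded_send(msg).map_err(|e| {
            // Receiver is gone: release the reservation as well.
            self.state.elements_num.fetch_sub(1, Ordering::AcqRel);
            e.into_inner()
        })
    }
}

// The Receiver side would decrement `elements_num` each time it pulls a
// message out of `unbounded_receiver`.
```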
I found this approach works well. I think this is an easy fix with some performance improvement and it keeps the batch span processor simple. In the future, we may wrap our own channel to further improve performance. @jtescher let me know what you think 😬
@TommyCpp how do the existing span start/end overhead benchmarks compare? E.g. if you change the group function to:

```rust
fn trace_benchmark_group<F: Fn(&sdktrace::Tracer)>(c: &mut Criterion, name: &str, f: F) {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        let mut group = c.benchmark_group(name);

        group.bench_function("always-sample", |b| {
            let provider = sdktrace::TracerProvider::builder()
                .with_config(sdktrace::config().with_sampler(sdktrace::Sampler::AlwaysOn))
                .with_default_batch_exporter(VoidExporter, opentelemetry::runtime::Tokio)
                .build();
            let always_sample = provider.get_tracer("always-sample", None);
            b.iter(|| f(&always_sample));
        });

        group.bench_function("never-sample", |b| {
            let provider = sdktrace::TracerProvider::builder()
                .with_config(sdktrace::config().with_sampler(sdktrace::Sampler::AlwaysOff))
                .with_default_batch_exporter(VoidExporter, opentelemetry::runtime::Tokio)
                .build();
            let never_sample = provider.get_tracer("never-sample", None);
            b.iter(|| f(&never_sample));
        });

        group.finish();
    });
}
```

With the mutex, the batch processor seems to have about twice the overhead on the application process; wondering how that changes with the various approaches you've tried.
I ran the test and both approaches show a significant improvement compared with the mutex version, but there isn't much difference between the two.
Span building and reporting are likely the largest sources of performance overhead that otel will introduce when tracing, so finding the right solution here is fairly important. @djc / @frigus02 / @awiede any ideas on how to improve these numbers further? This is probably one of the last areas before tracing can stabilize at 1.0.
Hmm, I think I'd need to have a bit more context/pointers into the source code to be able to make any suggestions. Off the top of my head, maybe using something like smartstring for stack-allocated short strings could help with span building performance? I've gotten some decent wins out of it.
@djc fundamentally the question is the best way to send span data to the batch span processor queue when a span ends. Currently the batch processor handle wraps the channel sender in a mutex. Along with the linked alternatives, we could consider options like using tokio/async-std channels directly, possibly by extending the otel runtimes to expose a channel for this.
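A purely hypothetical sketch of what a runtime-provided channel could look like (the trait, method name, and types here are my assumptions, not the existing otel API):

```rust
use futures::Stream;

struct BatchMessage; // placeholder for whatever the batch processor consumes

/// Hypothetical runtime extension: each runtime hands the batch processor a
/// bounded channel built from its own primitives.
trait TraceRuntime {
    type Sender: Clone + Send + Sync;
    type Receiver: Stream<Item = BatchMessage> + Send;

    fn batch_message_channel(&self, capacity: usize) -> (Self::Sender, Self::Receiver);
}

/// Example implementation for a tokio-based runtime marker type.
struct Tokio;

impl TraceRuntime for Tokio {
    type Sender = tokio::sync::mpsc::Sender<BatchMessage>;
    type Receiver = tokio_stream::wrappers::ReceiverStream<BatchMessage>;

    fn batch_message_channel(&self, capacity: usize) -> (Self::Sender, Self::Receiver) {
        let (tx, rx) = tokio::sync::mpsc::channel(capacity);
        (tx, tokio_stream::wrappers::ReceiverStream::new(rx))
    }
}
```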
This is a great investigation @TommyCpp. Sadly, I don't have any ideas on how to make span reporting even more performant. I haven't looked at span building yet; I'll try to find time in the next few days.
I assume either of those is good enough for a first version. We might want to document the performance overhead of the API/SDK somewhere. Not necessarily with CPU/memory numbers but something like "batch span processor creates one thread and keeps at most X spans in memory"? I also just read through performance.md. It suggests letting the user choose between preventing information loss and preventing blocking. I think in theory we could let the user choose between those two behaviors here as well.
According to the futures-rs docs, each cloned sender gets a guaranteed slot to send one message. There has been some discussion around this and the conclusion was to keep the behavior as is. As a result, if we clone the sender for each function call, we won't be able to enforce the message limit using the channel.
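A small standalone check of that behavior (my own illustration, not from the thread): with a buffer of 1, every fresh clone of the sender still gets its message in.

```rust
use futures::channel::mpsc;

fn main() {
    // Bounded channel with a buffer of 1...
    let (tx, _rx) = mpsc::channel::<u32>(1);
    // ...but every cloned sender has one guaranteed slot of its own.
    let mut clones: Vec<_> = (0..4).map(|_| tx.clone()).collect();

    let mut accepted = 0;
    for tx in clones.iter_mut() {
        if tx.try_send(42).is_ok() {
            accepted += 1;
        }
    }

    // Nothing has been read from the channel, yet more than `buffer` messages
    // were accepted: effective capacity is buffer + num_senders.
    println!("accepted {accepted} messages with buffer size 1");
}
```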
In practice, how much of an issue does this appear to be? How many slots do we expect to have in the channel, and what is the range in number of senders we expect to be active? It looks like the tokio mpsc implementation might not have this issue.
I believe when a sender drops, its message doesn't get cleaned up (see rust-lang/futures-rs#2381 (comment)). Thus, there is a high risk of OOM here.
@djc the spec suggests fairly strongly not to have components with unbounded memory consumption, but more specifically here we need to be able to enforce the batch processor's max queue size configuration. @TommyCpp had explored using unbounded queues + an atomic to maintain the limit.
Looks like there is a new version of Rust. Will fix those lint problems tonight.
Should we make a decision on this one? @open-telemetry/rust-approvers
Might be nice to experiment with a runtime-specific option to see how that performs in terms of overhead on the traced application. Other than that, whichever currently looks the most performant would get my vote.
@TommyCpp also, moving the span processing to a new thread that owns the mpsc sender should be as performant as the simple processor is currently (it would be basically the same impl, just sending to the async runtime instead of executor::block_on), so ideally we could hit ~375ns or less.
That's basically using a thread to relay messages, right? It has good performance, but it will be hard to drop spans at the appropriate time.
@TommyCpp was thinking two channels: first an unbounded crossbeam channel for a quick way to get span data to the other thread, then the other thread is basically just receiving from that channel like the simple processor does, and pushes to the owned mpsc channel or drops if that is full. Basically replace the mutex with a thread and queue, and keep the rest the way it is.
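A rough sketch of that relay idea (my own, with `SpanData` as a stand-in type and crossbeam-channel/futures as assumed dependencies):

```rust
use std::thread;

use futures::channel::mpsc;

struct SpanData; // stand-in for the real exported span type

/// Spawn a relay thread: the hot path only does a cheap unbounded crossbeam
/// send, and the thread forwards into the bounded async channel, dropping
/// spans when that channel is full.
fn spawn_relay(
    max_queue_size: usize,
) -> (crossbeam_channel::Sender<SpanData>, mpsc::Receiver<SpanData>) {
    let (sync_tx, sync_rx) = crossbeam_channel::unbounded::<SpanData>();
    let (mut async_tx, async_rx) = mpsc::channel::<SpanData>(max_queue_size);

    thread::spawn(move || {
        // Blocks until the next span arrives; exits when all senders drop.
        for span in sync_rx {
            // If the bounded queue is at max_queue_size, the span is dropped.
            let _ = async_tx.try_send(span);
        }
    });

    (sync_tx, async_rx)
}
```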
@jtescher I believe this PR's implementation is exactly this method once I replace the channel in the BSP from a bounded crossbeam channel to an unbounded crossbeam channel.
LGTM
Did another round of benchmarking
Using a new thread is probably a good start, but I am still worried about the possibility of OOM here. If we have multiple producers it could in theory OOM because the relay thread cannot keep up with the incoming volume, right?
Looks like the tokio channels are the best solution performance-wise (the percentages seem incorrect in a few places)?
I'd love to see this as well. FWIW, the "flume" crate provides a high-performance channel that works quite well, and it's runtime-independent; it works well under either tokio or async-std, and it also supports synchronous send/recv, which would allow using a thread instead of a task.
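A quick illustration of that mix (assuming flume and a tokio runtime with the macros feature; not code from this PR): the same bounded flume channel used synchronously from a thread and asynchronously from a task.

```rust
use std::thread;

#[tokio::main]
async fn main() {
    let (tx, rx) = flume::bounded::<&'static str>(8);

    // Synchronous side, e.g. a span-reporting thread: no async runtime needed.
    let producer = thread::spawn(move || {
        for _ in 0..4 {
            // try_send drops the message instead of blocking when full.
            let _ = tx.try_send("span data");
        }
    });
    producer.join().unwrap();

    // Asynchronous side, e.g. the batch export task.
    while let Ok(msg) = rx.recv_async().await {
        println!("received: {msg}");
    }
}
```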
Thanks for the advice. I did a quick test and it seems to have similar performance to async_channel's mpmc channel, although its performance seems to get worse when the number of concurrent sending tasks increases to 16+.
Updated the tokio channel implementation in #560. Will try to see if I can build our own bounded mpsc channel from an unbounded channel and an atomic integer. That's about the only implementation left to do.
Instead of using mutexes, we just clone the sender.
Performance improves as a result.
Related #520