
opentelemetry::global::shutdown_tracer_provider() Hangs indefinitely, and other troubles #536

Closed
Sushisource opened this issue Apr 30, 2021 · 15 comments · Fixed by #552

@Sushisource

Hello,

Been trying to get tracing exporting going to either Jaeger or Zipkin, and with both I'm running into issues actually getting much exported.

Following the README for either, they say to call global::shut_down_provider(); // sending remaining spans on exit (so the READMEs seem to be outdated). Calling what looks like the current equivalent (the function in the title) just locks up entirely. I'm using opentelemetry with the rt-tokio feature, and both the Jaeger and Zipkin crates, with or without their tokio features, exhibit this behavior.

When I don't call this and instead do something silly like just sleep for a while at the end of a test, I'll get output like:

OpenTelemetry trace error occurred ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "jaeger exporter payload size of 248781 bytes over max UDP packet size of 65000 bytes. Try setting a smaller batch size." }))
OpenTelemetry trace error occurred ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "jaeger exporter payload size of 112387 bytes over max UDP packet size of 65000 bytes. Try setting a smaller batch size." }))
OpenTelemetry trace error occurred Other(TrySendError { kind: Disconnected })
OpenTelemetry trace error occurred Other(TrySendError { kind: Disconnected })

Somehow I had Jaeger exporting working OK before (never with opentelemetry::global::set_text_map_propagator(opentelemetry_jaeger::Propagator::new()); or the shutdown call), but when trying to run a much larger scale load test I was never able to collect anything because of the packet size limits. The error helpfully suggests I try setting a smaller batch size, but I was totally unable to determine how to do that after reviewing the docs for a while.

Thanks for the hard work on this crate -- my honest feedback is that the API is pretty challenging to figure out. Taking something like the above batch size problem, I can find https://docs.rs/opentelemetry/0.13.0/opentelemetry/sdk/trace/struct.BatchSpanProcessorBuilder.html#method.with_max_export_batch_size which looks like it might be what I want, but I can't actually find any way to use that in combination with the Jaeger exporter.

Thanks again! Let me know if I can provide more info or feedback.

@TommyCpp
Contributor

TommyCpp commented May 2, 2021

Thanks for the feedback. We try to use the builder pattern for every configuration we expose, but there are many configurations, so it can be confusing.

To your question: I think what you are looking for is with_max_packet_size. In the future we could include the component that throws the error in error messages; that may help identify where to look.
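
Roughly something like this (an untested sketch; the exact builder methods may vary by opentelemetry-jaeger version, and the service name is just a placeholder):

// Untested sketch: cap the UDP payload sent to the Jaeger agent.
let tracer = opentelemetry_jaeger::new_pipeline()
    .with_service_name("my-service")
    .with_max_packet_size(9_216)
    .install_batch(opentelemetry::runtime::Tokio)
    .expect("failed to install Jaeger pipeline");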

As for shutdown_tracer_provider hanging forever: could you provide more information on what your runtime looks like? Is it the tokio current-thread or multi-thread scheduler?


One thing that may be worth trying is disabling the default features in opentelemetry-zipkin and using reqwest-client instead. The default HTTP client in opentelemetry-zipkin is currently reqwest-blocking-client, and it doesn't really work well with an async runtime.

Conversely, the functionality in reqwest::blocking must not be executed within an async runtime, or it will panic when attempting to block
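
In Cargo.toml that looks something like this (version numbers are just an example):

opentelemetry = { version = "0.13", features = ["trace", "rt-tokio"] }
opentelemetry-zipkin = { version = "0.11", default-features = false, features = ["reqwest-client"] }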

@Sushisource
Author

Thanks @TommyCpp -- Setting with_max_packet_size just lets me hit that error with smaller values; it doesn't actually let me tell the batch exporter to send smaller batches.

For the hang, I'm using tokio with the multiple thread scheduler. I tried zipkin with and without reqwest-client and both hung, and jaeger hung with tokio set or not.


@simon-an

simon-an commented May 7, 2021

Same issue here.

In my case it is a combination of

#[actix_web::main] 
.with_client(reqwest::Client::new())
.install_batch(opentelemetry::runtime::Tokio);

@TommyCpp
Contributor

TommyCpp commented May 7, 2021

For the hang, I'm using tokio with the multiple thread scheduler. I tried zipkin with and without reqwest-client and both hung, and jaeger hung with tokio set or not.

Could you provide an example if possible? I ran some examples and they seem to be working, so I wonder if I missed something here. Thanks in advance.

@Sushisource
Author

@TommyCpp This repros it consistently for me: https://gist.github.com/Sushisource/0c04fcc0fecef386a46cd6dd25c7a296

@TommyCpp
Contributor

TommyCpp commented May 11, 2021

@TommyCpp This repros it consistently for me: https://gist.github.com/Sushisource/0c04fcc0fecef386a46cd6dd25c7a296

@Sushisource Thanks for putting together a demo. The tokio test macro runs on a single-threaded executor by default. You can either enable the rt-tokio-current-thread feature and use runtime::TokioCurrentThread with install_batch, or tell tokio to use a multi-thread executor with #[tokio::test(flavor = "multi_thread")].

The gist works for me when I tried the above approaches. Let me know if those do the trick for you 😬
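
For reference, the two options look roughly like this (an untested sketch, not taken from the gist):

// Option 1: run the test on a multi-thread executor.
#[tokio::test(flavor = "multi_thread")]
async fn traces_export_multi_thread() {
    let _tracer = opentelemetry_jaeger::new_pipeline()
        .install_batch(opentelemetry::runtime::Tokio)
        .expect("failed to install pipeline");
    // ... create some spans ...
    opentelemetry::global::shutdown_tracer_provider();
}

// Option 2: keep the default single-thread test executor, but use the matching
// batch runtime (requires the rt-tokio-current-thread feature).
#[tokio::test]
async fn traces_export_current_thread() {
    let _tracer = opentelemetry_jaeger::new_pipeline()
        .install_batch(opentelemetry::runtime::TokioCurrentThread)
        .expect("failed to install pipeline");
    // ... create some spans ...
    opentelemetry::global::shutdown_tracer_provider();
}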

@TommyCpp
Contributor

Same issue here.

In my case it is a combination of

#[actix_web::main] 
.with_client(reqwest::Client::new())
.install_batch(opentelemetry::runtime::Tokio);

@simon-an Maybe try runtime::TokioCurrentThread? I think actix_web uses the tokio current-thread runtime by default.
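
That is, keeping the rest of your snippet and only swapping the runtime (sketch):

.with_client(reqwest::Client::new())
.install_batch(opentelemetry::runtime::TokioCurrentThread);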

@Sushisource
Author

@TommyCpp Ah, sure, this makes sense and fixes it. It would be really great if the examples (READMEs like the one here) mentioned that using a multithreaded runtime, or setting the current-thread batch executor, is necessary to avoid blocking. I think a lot of people are going to try out the library via tests like this when they first pick it up, and (obviously, lol) this is a confusing thing to run into.

Thanks for the help! Feel free to close this issue, really appreciate it.

@simon-an

simon-an commented May 17, 2021

@TommyCpp Don't I have the multithreaded runtime when I have a Cargo.toml like this?

actix-http = { version = "3.0.0-beta.6", default-features = false, features = ["compress"] }
actix-web = { version = "4.0.0-beta.6", default-features = false, features = ["compress"] }
actix-web-opentelemetry = { version ="0.11.0-beta.3", features = ["metrics"] } 
tokio = { version = "1.5", features = ["rt-multi-thread", "macros", "time"] }

opentelemetry = { version = "0.13", features = ["trace", "metrics", "rt-tokio"] } 
opentelemetry-zipkin = { version = "0.11", features = ["reqwest-client"], default-features = false }
opentelemetry-application-insights = {version = "0.14", features = ["reqwest-client"], default-features = false }

You are right. It works when I use

opentelemetry::runtime::TokioCurrentThread

instead of

opentelemetry::runtime::Tokio

example project

@TommyCpp
Contributor

@TommyCpp Don't I have the multithreaded runtime when I have a Cargo.toml like this?

I believe actix_web is actually running on a tokio current-thread scheduler by default.

@simon-an

simon-an commented May 18, 2021

@TommyCpp I found a solution and updated the example project.

    // Run the blocking shutdown off the async worker thread so it cannot
    // stall the runtime (Handle is tokio::runtime::Handle).
    Handle::current().spawn_blocking(move || {
        opentelemetry::global::shutdown_tracer_provider();
    }).await.unwrap();

Thanks for pointing out that there might be a current-thread scheduler inside a multithreaded runtime. That was not clear to me, but it makes sense in the context of a webserver, where thread::spawn should never be executed on another worker's thread.

@Geal

Geal commented Jan 21, 2022

I see a similar bug in the Apollo router, running on the tokio multi-threaded runtime, in the main code, not in tests. Reproduction info is here.

The exporter is created here.

The fix proposed by @simon-an works, but I took some time to dig a bit more into it.

It happens when we try to reload the configuration and create a new subscriber. We need to replace the tracer provider because we might switch to a different exporter (I feel there's another way to do that without replacing the tracer provider, but I couldn't find it).
It hangs when the try_initialize_subscriber function drops the result of set_tracer_provider, and I tracked it down to BatchSpanProcessor::shutdown:

fn shutdown(&mut self) -> TraceResult<()> {
    let (res_sender, res_receiver) = oneshot::channel();
    self.message_sender
        .try_send(BatchMessage::Shutdown(res_sender))?;
    futures_executor::block_on(res_receiver)
        .map_err(|err| TraceError::Other(err.into()))
        .and_then(|identity| identity)
}

That function is called because the TracerProvider shuts down all of its span processors on drop.

Through generous application of println, I see the following timeline:

  • set_tracer_provider is called
  • TracerProvider is dropped
  • BatchSpanProcessor::shutdown is called. At this point the batch task is still running fine, regularly receiving flush messages from the ticker
  • the shutdown function sends the shutdown message
  • the shutdown function blocks on the receiver. It hangs at this point. The batch task does not receive any new message, which is very surprising (the ticker should continue to send)
  • then the new batch task from the new tracer provider spawns and starts receiving flush messages from the ticker

I suspect a bad interaction between the tokio runtime and the futures executor. I tried replacing it with a call to tokio's block_on, but then I get a panic because I am trying to start a new tokio runtime from inside an existing runtime.

This is not a blocking bug for us; as I said, the proposed fix works. But it would be nice to get a proper fix inside the opentelemetry crate, and I hope I gave enough context to help with fixing it 🙂
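
For completeness, the workaround amounts to running the blocking shutdown where blocking is allowed, along the lines of what @simon-an showed above (sketch):

// Sketch: shutdown_tracer_provider blocks internally, so run it on a blocking
// thread instead of an async worker thread.
tokio::task::spawn_blocking(|| {
    opentelemetry::global::shutdown_tracer_provider();
})
.await
.expect("opentelemetry shutdown task panicked");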

garypen added a commit to apollographql/router that referenced this issue Apr 27, 2022
The test wasn't using the multi_thread executor, so eventually it would
trip over the bug in opentelemetry which we've hit in the past where the
runtime would hang for not fully understood reasons.

see: open-telemetry/opentelemetry-rust#536

Using the multi_thread executor is the recommended work-around.

I've also removed the redundant thread spawn from the telemetry Drop
implementation, since I don't think it's required in addition to the
spawn_blocking() call.

I've run this test in a tight loop for a couple of hours (on my M1
laptop) with no issues.

fixes: #914
@Oliboy50

Oliboy50 commented Jun 1, 2023

Any idea when someone will take the time to fix this issue, helped by what @Geal said?

Just so we know when we will be able to remove our workaround 🙏

@Oliboy50

Oliboy50 commented Feb 16, 2024

🆙 🙏 Like Geal said:

it would be nice to get a good fix inside the opentelemetry crate

Because it's currently painful to implement a wrapper around the opentelemetry crate, especially since we don't have AsyncDrop yet (having to execute an async workaround in an impl Drop is not that easy).

I hope #1514 will help solve this issue once and for all 🙏.
