
opentelemetry::global::shutdown_tracer_provider() Hangs indefinitely, and other troubles #536

Closed
Sushisource opened this issue Apr 30, 2021 · 15 comments · Fixed by #552

@Sushisource

Hello,

Been trying to get tracing exporting going to either Jaeger or Zipkin, and with both I'm running into issues actually getting much exported.

Following the README for either, they say to call global::shut_down_provider(); // sending remaining spans on exit (so the READMEs seem to be outdated). Calling what looks like the current equivalent (the function in the title) just locks up entirely. I'm using opentelemetry with the rt-tokio feature, and both the Jaeger and Zipkin crates, with or without their tokio features, exhibit this behavior.

When I don't call this and instead do something silly like just sleep for a while at the end of a test, I'll get output like:

OpenTelemetry trace error occurred ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "jaeger exporter payload size of 248781 bytes over max UDP packet size of 65000 bytes. Try setting a smaller batch size." }))
OpenTelemetry trace error occurred ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "jaeger exporter payload size of 112387 bytes over max UDP packet size of 65000 bytes. Try setting a smaller batch size." }))
OpenTelemetry trace error occurred Other(TrySendError { kind: Disconnected })
OpenTelemetry trace error occurred Other(TrySendError { kind: Disconnected })

Somehow I had Jaeger exporting working OK before (never with opentelemetry::global::set_text_map_propagator(opentelemetry_jaeger::Propagator::new()); or the shutdown call), but when trying to run a much larger scale load test I was never able to collect anything because of the packet size limits. The error helpfully suggests I try setting a smaller batch size, but I was totally unable to determine how to do that after reviewing the docs for a while.

Thanks for the hard work on this crate -- my honest feedback is that the API is pretty challenging to figure out. Taking something like the above batch size problem, I can find https://docs.rs/opentelemetry/0.13.0/opentelemetry/sdk/trace/struct.BatchSpanProcessorBuilder.html#method.with_max_export_batch_size which looks like it might be what I want, but I can't actually find any way to use that in combination with the Jaeger exporter.

Thanks again! Let me know if I can provide more info or feedback.

@TommyCpp
Contributor

TommyCpp commented May 2, 2021

Thanks for the feedback. We try to use the builder pattern for every configuration we expose, but there are many configurations, so it can be confusing.

To your question: I think what you are looking for is with_max_packet_size. In the future we could include the component that throws the error in error messages; that may help identify where to look.
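
Roughly something like this (an untested sketch; the exact builder methods may vary by opentelemetry-jaeger version, and the service name is just a placeholder):

// Untested sketch: cap the UDP payload sent to the Jaeger agent.
let tracer = opentelemetry_jaeger::new_pipeline()
    .with_service_name("my-service")
    .with_max_packet_size(9_216)
    .install_batch(opentelemetry::runtime::Tokio)
    .expect("failed to install Jaeger pipeline");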

As for shutdown_tracer_provider hanging forever: could you provide more information on what your runtime looks like? Is it the tokio current-thread or multi-thread scheduler?


One thing that may be worth trying is disabling the default features in opentelemetry-zipkin and using reqwest-client instead. The default HTTP client in opentelemetry-zipkin is currently reqwest-blocking-client, and it doesn't really work well with an async runtime.

Conversely, the functionality in reqwest::blocking must not be executed within an async runtime, or it will panic when attempting to block
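
In Cargo.toml that looks something like this (version numbers are just an example):

opentelemetry = { version = "0.13", features = ["trace", "rt-tokio"] }
opentelemetry-zipkin = { version = "0.11", default-features = false, features = ["reqwest-client"] }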

@Sushisource
Author

Thanks @TommyCpp -- Setting with_max_packet_size just lets me hit that error with smaller values; it doesn't actually let me tell the batch exporter to send smaller batches.

For the hang, I'm using tokio with the multiple thread scheduler. I tried zipkin with and without reqwest-client and both hung, and jaeger hung with tokio set or not.


@simon-an

simon-an commented May 7, 2021

Same issue here.

In my case it is a combination of

#[actix_web::main] 
.with_client(reqwest::Client::new())
.install_batch(opentelemetry::runtime::Tokio);

@TommyCpp
Contributor

TommyCpp commented May 7, 2021

For the hang, I'm using tokio with the multiple thread scheduler. I tried zipkin with and without reqwest-client and both hung, and jaeger hung with tokio set or not.

Could you provide an example if possible? I ran some examples and they seem to be working, so I wonder if I missed something here. Thanks in advance.

@Sushisource
Author

@TommyCpp This repros it consistently for me: https://gist.github.com/Sushisource/0c04fcc0fecef386a46cd6dd25c7a296

@TommyCpp
Contributor

TommyCpp commented May 11, 2021

@TommyCpp This repros it consistently for me: https://gist.github.com/Sushisource/0c04fcc0fecef386a46cd6dd25c7a296

@Sushisource Thanks for putting together a demo. The tokio test macro runs on a single-threaded executor by default. You can either enable the rt-tokio-current-thread feature and use runtime::TokioCurrentThread with install_batch, or tell tokio to use a multi-thread executor with #[tokio::test(flavor = "multi_thread")].

The gist works for me when I tried the above approaches. Let me know if those do the trick for you 😬
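
For reference, the two options look roughly like this (an untested sketch, not taken from the gist):

// Option 1: run the test on a multi-thread executor.
#[tokio::test(flavor = "multi_thread")]
async fn traces_export_multi_thread() {
    let _tracer = opentelemetry_jaeger::new_pipeline()
        .install_batch(opentelemetry::runtime::Tokio)
        .expect("failed to install pipeline");
    // ... create some spans ...
    opentelemetry::global::shutdown_tracer_provider();
}

// Option 2: keep the default single-thread test executor, but use the matching
// batch runtime (requires the rt-tokio-current-thread feature).
#[tokio::test]
async fn traces_export_current_thread() {
    let _tracer = opentelemetry_jaeger::new_pipeline()
        .install_batch(opentelemetry::runtime::TokioCurrentThread)
        .expect("failed to install pipeline");
    // ... create some spans ...
    opentelemetry::global::shutdown_tracer_provider();
}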

@TommyCpp
Contributor

Same issue here.

In my case it is a combination of

#[actix_web::main] 
.with_client(reqwest::Client::new())
.install_batch(opentelemetry::runtime::Tokio);

@simon-an Maybe try runtime::TokioCurrentThread? I think actix_web uses the tokio current-thread runtime by default.
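
That is, keeping the rest of your snippet and only swapping the runtime (sketch):

.with_client(reqwest::Client::new())
.install_batch(opentelemetry::runtime::TokioCurrentThread);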

@Sushisource
Author

@TommyCpp Ah, sure, this makes sense and fixes it. It would be really great if the examples (READMEs like the one here) mentioned that using a multithreaded runtime, or setting the current-thread batch executor, is necessary to avoid blocking. I think a lot of people are going to try out the library via tests like this when they first pick it up, and (obviously, lol) this is a confusing thing to run into.

Thanks for the help! Feel free to close this issue, really appreciate it.

@simon-an

simon-an commented May 17, 2021

@TommyCpp Don't I have the multithreaded runtime when I have a Cargo.toml like this?

actix-http = { version = "3.0.0-beta.6", default-features = false, features = ["compress"] }
actix-web = { version = "4.0.0-beta.6", default-features = false, features = ["compress"] }
actix-web-opentelemetry = { version ="0.11.0-beta.3", features = ["metrics"] } 
tokio = { version = "1.5", features = ["rt-multi-thread", "macros", "time"] }

opentelemetry = { version = "0.13", features = ["trace", "metrics", "rt-tokio"] } 
opentelemetry-zipkin = { version = "0.11", features = ["reqwest-client"], default-features = false }
opentelemetry-application-insights = {version = "0.14", features = ["reqwest-client"], default-features = false }

You are right. It works when I use

opentelemetry::runtime::TokioCurrentThread

instead of

opentelemetry::runtime::Tokio

example project

@TommyCpp
Contributor

@TommyCpp Don't I have the multithreaded runtime when I have a Cargo.toml like this?

I believe actix_web is actually running on a tokio current-thread scheduler by default.

@simon-an

simon-an commented May 18, 2021

@TommyCpp I found a solution and updated the example project.

    // Run the blocking shutdown off the async worker thread so it cannot
    // stall the runtime (Handle is tokio::runtime::Handle).
    Handle::current().spawn_blocking(move || {
        opentelemetry::global::shutdown_tracer_provider();
    }).await.unwrap();

Thanks for pointing out that there might be a current-thread scheduler inside a multithreaded runtime. That was not clear to me, but it makes sense in the context of a webserver, where thread::spawn should never be executed on another worker's thread.

@Geal

Geal commented Jan 21, 2022

I see a similar bug in the Apollo router, running on the tokio multi-threaded runtime, in the main code, not in tests. Reproduction info is here.

The exporter is created here.

The fix proposed by @simon-an works, but I took some time to dig a bit more into it.

It happens when we try to reload the configuration and create a new subscriber. We need to replace the tracer provider because we might switch to a different exporter (I feel there's another way to do that without replacing the tracer provider, but I couldn't find it).
It hangs when the try_initialize_subscriber function drops the result of set_tracer_provider, and I tracked it down to BatchSpanProcessor::shutdown:

fn shutdown(&mut self) -> TraceResult<()> {
    let (res_sender, res_receiver) = oneshot::channel();
    self.message_sender
        .try_send(BatchMessage::Shutdown(res_sender))?;
    futures_executor::block_on(res_receiver)
        .map_err(|err| TraceError::Other(err.into()))
        .and_then(|identity| identity)
}

That function is called because the TracerProvider shuts down all of its span processors on drop.

Through generous application of println, I see the following timeline:

  • set_tracer_provider is called
  • TracerProvider is dropped
  • BatchSpanProcessor::shutdown is called. At this point the batch task is still running fine, regularly receiving flush messages from the ticker
  • the shutdown function sends the shutdown message
  • the shutdown function blocks on the receiver. It hangs at this point. The batch task does not receive any new message, which is very surprising (the ticker should continue to send)
  • then the new batch task from the new tracer provider spawns and starts receiving flush messages from the ticker

I suspect a bad interaction between the tokio runtime and the futures executor. I tried replacing it with a call to tokio's block_on, but then I get a panic because I am trying to start a new tokio runtime from inside an existing runtime.

This is not a blocking bug for us; as I said, the proposed fix works. But it would be nice to get a proper fix inside the opentelemetry crate, and I hope I gave enough context to help with fixing it 🙂
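
For completeness, the workaround amounts to running the blocking shutdown where blocking is allowed, along the lines of what @simon-an showed above (sketch):

// Sketch: shutdown_tracer_provider blocks internally, so run it on a blocking
// thread instead of an async worker thread.
tokio::task::spawn_blocking(|| {
    opentelemetry::global::shutdown_tracer_provider();
})
.await
.expect("opentelemetry shutdown task panicked");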

garypen added a commit to apollographql/router that referenced this issue Apr 27, 2022
The test wasn't using the multi_thread executor, so eventually it would
trip over the bug in opentelemetry which we've hit in the past where the
runtime would hang for not fully understood reasons.

see: open-telemetry/opentelemetry-rust#536

Using the multi_thread executor is the recommended work-around.

I've also removed the redundant thread spawn from the telemetry Drop
implementation, since I don't think it's required in addition to the
spawn_blocking() call.

I've run this test in a tight loop for a couple of hours (on my M1
laptop) with no issues.

fixes: #914
@Oliboy50

Oliboy50 commented Jun 1, 2023

Any idea when someone will take the time to fix this issue, helped by what @Geal said?

Just so we know when we will be able to remove our workaround 🙏

@Oliboy50

Oliboy50 commented Feb 16, 2024

🆙 🙏 Like Geal said:

it would be nice to get a good fix inside the opentelemetry crate

Because it's currently painful to implement a wrapper around the opentelemetry crate, especially since we don't have AsyncDrop yet (having to execute an async workaround in an impl Drop is not that easy).

I hope #1514 will help solve this issue once and for all 🙏.
