Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[@opentelmetry/instrumentation-fetch] Internal span setTimeout race condition prevents some metrics from being traced #4683

Open
cmackenthun opened this issue May 6, 2024 · 2 comments
Labels
bug Something isn't working needs:reproducer This bug/feature is in need of a minimal reproducer priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect

Comments

@cmackenthun
Copy link

cmackenthun commented May 6, 2024

What happened?

Steps to Reproduce

  1. Setup application with fetch instrumentation
  2. Add a POST call (type of call shouldn't matter, but that's what ours was)
  3. Add code to navigate to a new site outside of your application after that call is successful
  4. Run code in a variety of browsers and internet speeds

Expected Result

All network calls for that endpoint are traced

Actual Result

A percentage of these network calls are not traced

Additional Details

We have seen that one of our metrics is not being reported 100% of the time and have narrowed it down to the usage of setTimeout in opentelemetry-instrumentation-fetch as the reason why. Sometimes the routing happens quick enough that the timeout function hasn't run and the span is never ended and therefore can never be flushed or collected on our side.

There is a comment in the fetch instrumentation code about the reasoning for the timeout being waiting for "observer to collect information about resources", but I think overall we would prefer to get metrics that are lacking these resources than none at all. If the setTimeout has to stick around, it would be nice if the fetch instrumentation could know to end these spans when the page is being navigated away from so we can get a higher count of spans

Some docs about setTimeout not be an overly consistent method to use for exact timing: https://developer.mozilla.org/en-US/docs/Web/API/setTimeout#reasons_for_delays_longer_than_specified

@cmackenthun cmackenthun added bug Something isn't working triage labels May 6, 2024
@dyladan dyladan added priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect and removed triage labels May 15, 2024
@dyladan
Copy link
Member

dyladan commented May 15, 2024

@cmackenthun can you please provide a minimal reproduction so we can work on this?

@dyladan dyladan added the needs:reproducer This bug/feature is in need of a minimal reproducer label May 15, 2024
@cmackenthun
Copy link
Author

Sure thing @dyladan

Example repo: https://github.com/cmackenthun/otel--fetch-instrumentation-example

Basically I just copied the example directory from the repo and made some changes to show my usecase. As you can hopefully see from testing (or my gifs below). If I wait for the fetch call to be done and then flush and navigate away, I don't get my fetch span. But if I add an arbitrary wait between the fetch and flush, then I do get my fetch span exported. To me this is because of the setTimeout that is used to end the span inside the fetch instrumentation flow.

fetch() ➡️ flush() ➡️ redirect()

2024-05-17 11 25 28

fetch() ➡️ wait 1s ➡️ flush() ➡️ redirect()

2024-05-17 11 26 04

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs:reproducer This bug/feature is in need of a minimal reproducer priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect
Projects
None yet
Development

No branches or pull requests

2 participants