Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(express): use the same clock for span start and end #1210

Merged
merged 3 commits into from
Sep 28, 2022

Conversation

dyladan
Copy link
Member

@dyladan dyladan commented Sep 28, 2022

Fixes #1193
Fixes #1209

Express instrumentation does not pass a time to span start but does pass a time to span.end, which results in undefined behavior. Previous to open-telemetry/opentelemetry-js#3134 this was not a problem as the span always used the same hrTimer that we expose in core. Now that we anchor clocks, it is possible for these clocks to be out of sync.

@dyladan dyladan requested a review from a team as a code owner September 28, 2022 12:25
@dyladan dyladan added bug Something isn't working priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect labels Sep 28, 2022
@gregolsen
Copy link

Hey @dyladan - first off thank you so much for your work on the anchored clock to mitigate the time drift!
We have recently tried to rollout 1.7.0/0.33.0 OTEL instrumentation and faced a similar issue as the ones this PR fixes – negative span durations caused by start time being higher than the end time (negative span duration probably results in out of bound error and is shown as a very long duration in Honeycomb instead):
image
I understand that this PR address the issue with node.js instrumentation only.
Do you think there could be a similar issue but in the web instrumentation?

@dyladan
Copy link
Member Author

dyladan commented Sep 28, 2022

Hey @dyladan - first off thank you so much for your work on the anchored clock to mitigate the time drift! We have recently tried to rollout 1.7.0/0.33.0 OTEL instrumentation and faced a similar issue as the ones this PR fixes – negative span durations caused by start time being higher than the end time (negative span duration probably results in out of bound error and is shown as a very long duration in Honeycomb instead): image I understand that this PR address the issue with node.js instrumentation only. Do you think there could be a similar issue but in the web instrumentation?

There definitely could be a similar issue but I'm surprised to see it both in XHR and Fetch. Is that happening on all spans or just some?

@codecov
Copy link

codecov bot commented Sep 28, 2022

Codecov Report

Merging #1210 (41a6c12) into main (1c450f5) will increase coverage by 0.56%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1210      +/-   ##
==========================================
+ Coverage   96.08%   96.64%   +0.56%     
==========================================
  Files          14       20       +6     
  Lines         893     1132     +239     
  Branches      191      245      +54     
==========================================
+ Hits          858     1094     +236     
- Misses         35       38       +3     
Impacted Files Coverage Δ
...try-instrumentation-express/src/instrumentation.ts 99.28% <100.00%> (ø)
...opentelemetry-instrumentation-express/src/types.ts 100.00% <0.00%> (ø)
...tapackages/auto-instrumentations-node/src/utils.ts 98.07% <0.00%> (ø)
...trumentation-express/src/enums/ExpressLayerType.ts 100.00% <0.00%> (ø)
...nstrumentation-express/src/enums/AttributeNames.ts 100.00% <0.00%> (ø)
...opentelemetry-instrumentation-express/src/utils.ts 97.36% <0.00%> (ø)

@gregolsen
Copy link

Good question @dyladan , looks like in our case it happened on a significant portion of the spans but not all of them:
image

@dyladan
Copy link
Member Author

dyladan commented Sep 28, 2022

All of the bad spans are coming from the same instrumentations? Do you have any other instrumentations? Are they reporting good or bad spans?

@gregolsen
Copy link

gregolsen commented Sep 28, 2022

All of the bad spans are coming from the same instrumentations? Do you have any other instrumentations? Are they reporting good or bad spans?

Majority of the "negative" spans were coming from XHR and Fetch instrumentations. Notable, document load wasn't affected at all as it probably just uses the timings provided by the browser resource timing.

image

@gregolsen
Copy link

Relevant code for the XHR instrumentation:
https://github.com/open-telemetry/opentelemetry-js/blob/610808d3b64b9f660f4dd640ad961b8c9f67be66/experimental/packages/opentelemetry-instrumentation-xml-http-request/src/xhr.ts#L434-L441

I think both Fetch and XHR have to store the timestamp and then schedule a timeout to wait for the network events to be ready. There's a comment explaining that:

 // the timeout is needed as observer doesn't have yet information
 // when event "load" is called. Also the time may differ depends on
 // browser and speed of computer

@dyladan dyladan merged commit cbeef6e into open-telemetry:main Sep 28, 2022
@dyladan dyladan deleted the express-clocks branch September 28, 2022 17:00
@dyladan dyladan mentioned this pull request Sep 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect
Projects
None yet
6 participants