
Add context to metrics reporting of buffer-full events #1566

Merged
merged 12 commits into open-telemetry:main on Jan 22, 2024

Conversation

plantfansam (Contributor)

We report buffer-full dropped spans from two call sites: on_finish and force_flush. Since force_flush is used in specific contexts, I thought it would be useful to supply a label so that users can see which call site dropped the spans.

We could alternatively use the reason to disambiguate the two scenarios, rather than introducing a new tag.
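As an illustrative sketch, this is the kind of emission the PR enables, using the context label name from this description (the review below swaps it for the semconv code.function key; add_to_counter is the existing MetricsReporter call shown in the diffs further down):

# Sketch only: how a buffer-full drop would be counted with the extra label.
# `n` is the number of spans shifted off the buffer at the call site.
@metrics_reporter.add_to_counter(
  'otel.bsp.dropped_spans',
  increment: n,
  labels: { 'reason' => 'buffer-full', 'context' => 'force_flush' }
)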

@fbogsany (Contributor) left a comment

Small suggestion. Otherwise LGTM.

@@ -82,7 +82,7 @@ def on_finish(span)
       n = spans.size + 1 - max_queue_size
       if n.positive?
         spans.shift(n)
-        report_dropped_spans(n, reason: 'buffer-full')
+        report_dropped_spans(n, reason: 'buffer-full', context: 'on_finish')
Contributor:
Have you considered using semconv code.* for this?
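For reference, the semantic conventions gem exposes these keys as string constants; a quick check (assuming the opentelemetry-semantic_conventions gem and its default require path):

require 'opentelemetry/semantic_conventions'

OpenTelemetry::SemanticConventions::Trace::CODE_FUNCTION # => 'code.function'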

Contributor Author:

Stupendous idea! 03fb66c

@@ -82,7 +82,7 @@ def on_finish(span)
       n = spans.size + 1 - max_queue_size
       if n.positive?
         spans.shift(n)
-        report_dropped_spans(n, reason: 'buffer-full')
+        report_dropped_spans(n, 'reason' => 'buffer-full', OpenTelemetry::SemanticConventions::Trace::CODE_FUNCTION => 'on_finish')
Contributor Author:

Need to use hash-rocket syntax to use the constant as a key (AFAIK).
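A quick illustration of why: Ruby's shorthand key: syntax always produces a literal Symbol key rather than evaluating the constant:

CODE_FUNCTION = 'code.function'

{ CODE_FUNCTION => 'on_finish' } # => { 'code.function' => 'on_finish' }
{ CODE_FUNCTION: 'on_finish' }   # => { CODE_FUNCTION: 'on_finish' }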

@@ -204,8 +204,8 @@ def report_result(result_code, batch)
        end
      end

-     def report_dropped_spans(count, reason:)
-       @metrics_reporter.add_to_counter('otel.bsp.dropped_spans', increment: count, labels: { 'reason' => reason })
+     def report_dropped_spans(count, labels = {})
Contributor Author:

Although I changed the method signature, we do not need to change the other call sites of report_dropped_spans, since they were passing the kwarg reason: 'foo', which Ruby also accepts as a positional hash.
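A small demonstration of that behavior: keyword-style arguments passed to a method that declares no keyword parameters are collected into the trailing positional hash:

def report_dropped_spans(count, labels = {})
  [count, labels] # stand-in body, just to show what arrives
end

report_dropped_spans(3, reason: 'buffer-full')
# => [3, { reason: 'buffer-full' }]

Note the keys arrive as Symbols here, which is what the review note below about using String keys consistently is getting at.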

Contributor:

🤔 I'd rather we used named arguments consistently, and Strings for keys and values in the labels. I.e. I'd prefer:

def report_dropped_spans(count, labels: nil) ... end

report_dropped_spans(n, labels: { 'reason' => 'buffer-full', 'code.function' => 'force_flush' })

or:

def report_dropped_spans(count, reason:, function: nil)
  @metrics_reporter.add_to_counter('otel.bsp.dropped_spans', increment: count, labels: { 'reason' => reason, 'code.function' => function }.compact)
end

report_dropped_spans(n, reason: 'buffer-full', function: 'force_flush')

I prefer the latter, since it is cleaner for callers.
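The trailing .compact in that version is what keeps callers clean: call sites with no function to report (or that pass function: nil) don't emit an empty label:

{ 'reason' => 'buffer-full', 'code.function' => nil }.compact
# => { 'reason' => 'buffer-full' }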

Contributor Author:

Yes, let's do the latter


@ericmustin (Contributor):

Seems completely arbitrary and unrelated to the specification. Also lgtm

@plantfansam (Contributor Author):

Woo hoo, I just need an adult to merge it 😄

plantfansam mentioned this pull request on Jan 19, 2024
@fbogsany (Contributor):

> Seems completely arbitrary and unrelated to the specification.

Unrelated, yes. Not completely arbitrary. For transparency, the goal is to also enable logging of the trace_ids of the spans that are dropped. It isn't completely clear how to do that in a way that'll make everyone happy, e.g. with levelled logging, so we're going to experiment with this in Shopify by monkey-patching report_dropped_spans. The existing code only passes the count of dropped spans to this method, but we actually need the dropped spans themselves. Hence this change.
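To make that concrete, here is a purely hypothetical sketch of such an experiment (not part of this PR; it assumes a future signature in which report_dropped_spans receives the dropped spans themselves, and the module name and log format are illustrative):

# Hypothetical: prepend so `super` still records the metric afterwards.
module DroppedSpanTraceLogging
  def report_dropped_spans(spans, reason:, function: nil)
    trace_ids = spans.map { |span| span.context.hex_trace_id }
    OpenTelemetry.logger.warn("dropped spans (#{reason}): #{trace_ids.join(', ')}")
    super
  end
end

OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.prepend(DroppedSpanTraceLogging)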

@fbogsany (Contributor):

Also, to be clear, we are dropping spans in production due to buffer-full, and we don't know which of the call sites is responsible, hence adding a label to disambiguate. Changing the meaning of the existing label might break some consumers of the metric.

@@ -222,7 +222,7 @@ def to_span_data
      _(test_exporter.failed_batches.size).must_equal(0)
      _(test_exporter.batches.size).must_equal(0)

-     _(bsp.instance_variable_get(:@spans).size).must_equal(1)
+     _(bsp.instance_variable_get(:@spans).size).must_equal(0)
Contributor Author:

Since we are now properly dropping spans during shutdown, we updated the test expectation.

robertlaurin merged commit 9da08e4 into open-telemetry:main on Jan 22, 2024. 55 checks passed.