
[consumer] Allow annotating consumer errors with metadata #9041

Open
wants to merge 30 commits into
base: main
from


evan-bradley
Contributor

Description:

Revival of #7439

This explores one possible way to allow adding metadata to errors returned from consumers. The goal is to allow transmitting more data back up the pipeline when an error occurs at some stage, so that an upstream component can use it, e.g. a component that retries data, or a receiver that propagates an error code back to the sender.

The current design eliminates the permanent/retryable error types in favor of a single error type that supports adding signal data to be retried. If no data is added to be retried, the error is considered permanent. For simplicity, no distinction is currently made between signals; the caller should know which signal it is handling when retrieving the retryable items from the error. Any options for retrying the data (e.g. a delay) are passed as options when adding the data to retry.

The error type currently supports a few general metadata fields that are copied when a downstream error is wrapped:

  • Partial successes can be expressed by setting the number of rejected items.
  • gRPC and HTTP status codes can be set, and translated between the two transports if necessary.

Link to tracking Issue:

Resolves #7047

cc @dmitryax

consumer/consumererror/consumererror.go Outdated
consumer/consumererror/consumererror.go Outdated
consumer/consumererror/consumererror.go Outdated
@jmacd
Contributor

jmacd commented Jan 10, 2024

Happy to see this added. As discussed in #9260, there is a potential to propagate backwards the information contained in PartialSuccess responses from OTLP exports.

I worry about the code complexity introduced by "success error" responses, meaning error != nil but interpreted as success. However, this is what it will take to back-propagate partial successes: we want callers to see success, with metadata about the number of rejected points if possible. Great to see this, thanks @evan-bradley.

@jmacd
Contributor

jmacd commented Jan 10, 2024

As discussed in open-telemetry/oteps#238, it would be useful for callers to have access to the underlying gRPC and/or HTTP error code in order to set the correct otel.outcome label. Thanks!

consumer/consumererror/consumererror.go Outdated
consumer/consumererror/consumererror.go Outdated
consumer/consumererror/partial.go
@mx-psi mx-psi added needed-for-1.0 release:required-for-ga Must be resolved before GA release and removed needed-for-1.0 labels Feb 7, 2024
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 22, 2024
@evan-bradley evan-bradley removed the Stale label Mar 6, 2024
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

codeboten added a commit that referenced this pull request May 1, 2024
…adic arguments (#10041)

Call out that unnamed types, e.g. the function signature of an exported
function, should not be relied upon by API consumers. In particular,
updating a function to be variadic will break users who were depending
on that function's signature.

#### Link to tracking issue
Helps
#9041

Co-authored-by: Evan Bradley <evan-bradley@users.noreply.github.com>
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>

// NewHTTPError wraps an error with a given HTTP status code.
func NewHTTPError(err error, code int) error {
Member

Should all of the new New* functions for the new error types take options so that we don't have to break the function signature in the future?

Contributor Author

I'm split on this. On one hand, that seems like a good idea. On the other, without a clear use-case in mind right now, I'm worried it will end up being dead code. Can you think of any potential options we could add to these errors?

Member

maybe a WithLogger option for network errors if we ever wanted to add debug statements or something. I don't really have a good idea in mind.

With our updated policy on variadic param additions are we covered if we don't do it now?

Contributor Author

> maybe a WithLogger option for network errors if we ever wanted to add debug statements or something. I don't really have a good idea in mind.

For most cases I could see a logger possibly being helpful, but I think we should avoid putting a logger inside our error structs. Thanks for brainstorming on it.

> With our updated policy on variadic param additions are we covered if we don't do it now?

The policy only allows us to avoid the deprecation process, but either way we're covered until 1.0.

Member

Based on @dmitryax's source idea, I'm back to thinking Options would be a good idea.

consumer/consumererror/README.md Outdated
consumer/consumererror/networkerrors_test.go Outdated
consumer/consumererror/networkerrors_test.go Outdated
consumer/consumererror/networkerrors_test.go Outdated
consumer/consumererror/networkerrors_test.go Outdated
consumer/consumererror/networkerrors_test.go Outdated
consumer/consumererror/networkerrors_test.go Outdated
consumer/consumererror/networkerrors_test.go Outdated
consumer/consumererror/networkerrors_test.go Outdated
consumer/consumererror/networkerrors_test.go Outdated
consumer/consumererror/networkerrors.go Outdated
consumer/consumererror/networkerrors.go Outdated
consumer/consumererror/networkerrors.go Outdated

linux-foundation-easycla bot commented May 17, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@evan-bradley
Contributor Author

@codeboten Any idea what's up with the CLA? I used the "batch suggestions" feature to group the suggestions together. I can rebase the PR and redo those changes if I messed something up.

consumer/consumererror/networkerrors.go Outdated
evan-bradley and others added 2 commits May 17, 2024 16:06
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
// between the status codes supported by each transport if necessary.
//
// It should be created with NewHTTPStatus or NewGRPCStatus.
type NetworkError struct {
Member

Why have one struct for both instead of two? If you want two structs, you can use generics to avoid duplicate code.

Contributor Author

I put one struct so users can do errors.As(err, &NetworkError{}) and get back an error. We want only a single type because receivers want status codes for their particular transport regardless of what transport the exporter uses. I think putting methods on an error type will be better for usability than having functions in the consumererror package since we can do a single check to determine whether we have a NetworkError type rather than having to check each time we call a consumererror function.

Member

I agree with sticking with a single error type to represent these errors, as it works best with the errors package and simplifies what components consuming the errors need to do.

If we had separate error types for gRPC and HTTP, then any component would have to check both types to see what underlying transport error is being conveyed. Encapsulating the underlying source in a single struct simplifies that.

Multiple structs that implement an interface also wouldn't work, because the interface wouldn't work with errors.As.

consumer/consumererror/partial.go Outdated
consumer/consumererror/networkerrors.go Outdated
consumer/consumererror/networkerrors.go Outdated
// e.g. the duration before sending should be retried.
type RetryOption func(err *retryableCommon)

func WithRetryDelay(delay time.Duration) RetryOption {
Member

Should we instead have a standalone RetryDelay error?

Contributor Author

We should probably keep a single error for a few reasons:

  1. We agreed that upstream components should use the copy of data returned by the downstream component, so delays without data would be inconsistent with that.
  2. Upstream components can get both the data and retry information out of the error a bit more easily.
  3. These two concepts are pretty closely related; if you want to retry data, you likely also care about how long to wait before retrying it.

andrzej-stencel pushed a commit to andrzej-stencel/opentelemetry-collector that referenced this pull request May 27, 2024
…adic arguments (open-telemetry#10041)

@mx-psi mx-psi requested a review from bogdandrutu June 5, 2024 08:25
@mx-psi
Member

mx-psi commented Jun 5, 2024

I will merge this by end of week unless there are further comments (edit: no, see #9041 (comment))

@evan-bradley
Contributor Author

> I will merge this by end of week unless there are further comments

@mx-psi Thanks for keeping an eye on this. I need to make one more change to this to add "error source" metadata to the transport errors, and I also told @dmitryax I would wait for his review. I'd like to hold off until next week if that's okay.

@mx-psi
Member

mx-psi commented Jun 7, 2024

Sure!

Labels
release:required-for-ga Must be resolved before GA release
Projects
Status: Waiting for reviews
Development

Successfully merging this pull request may close these issues.

Investigate how to expose exporterhelper.NewThrottleRetry in the consumererror
8 participants