
Use (Async) ExecChainHandler to measure IOExceptions (#3800) #3801

Closed
wants to merge 20 commits into from

Conversation

cachescrubber
Contributor

@cachescrubber cachescrubber commented May 2, 2023

Use HttpRequestRetryStrategy to measure IOExceptions.
Fixes #3800, #3797, #3706

@cachescrubber
Contributor Author

The only solution I could find to measure errors in HttpAsyncClient is to use a delegating HttpRequestRetryStrategy which measures after the very last retry. WDYT? Are there other possibilities? The code would need some polishing; I just want to verify the suggested solution with you.
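The delegating idea above can be sketched with a simplified, hypothetical stand-in for hc5's HttpRequestRetryStrategy (the real interface also receives the request/response and the HttpContext); RetryStrategy and MeasuringRetryStrategy here are illustrations, not Micrometer API:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for hc5's HttpRequestRetryStrategy: true = retry again.
interface RetryStrategy {
    boolean retryRequest(Exception cause, int execCount);
}

// Delegating strategy that records the failure only when the delegate
// declines a further retry, i.e. after the very last attempt.
class MeasuringRetryStrategy implements RetryStrategy {
    private final RetryStrategy delegate;
    final AtomicInteger recordedErrors = new AtomicInteger();

    MeasuringRetryStrategy(RetryStrategy delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean retryRequest(Exception cause, int execCount) {
        boolean retry = delegate.retryRequest(cause, execCount);
        if (!retry) {
            recordedErrors.incrementAndGet(); // meter the IOException here
        }
        return retry;
    }
}

public class Demo {
    public static void main(String[] args) {
        // Delegate mimicking the default behavior: allow a single retry.
        RetryStrategy oneRetry = (cause, execCount) -> execCount < 2;
        MeasuringRetryStrategy strategy = new MeasuringRetryStrategy(oneRetry);

        boolean first = strategy.retryRequest(new java.io.IOException("boom"), 1);
        boolean second = strategy.retryRequest(new java.io.IOException("boom"), 2);
        System.out.println(first + " " + second + " " + strategy.recordedErrors.get());
        // prints: true false 1 -- the error is recorded exactly once, after the last retry
    }
}
```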

@cachescrubber cachescrubber changed the title Test case to reproduce gh-3800 Use HttpRequestRetryStrategy to measure IOExceptions (#3800) May 5, 2023
@cachescrubber
Contributor Author

Unfortunately, there are cases where the HttpContext has not been initialized with the Timer.Resource sample.

When an error happens before the actual HTTP request is sent, the HttpRequestInterceptor that initializes the Timer.Resource sample is never called. This happens, for example, with IO errors such as ConnectionRefused.

Possibly the cause of #3797?

@cachescrubber
Contributor Author

@cachescrubber cachescrubber changed the title Use HttpRequestRetryStrategy to measure IOExceptions (#3800) Use (Async) ExecChainHandler to measure IOExceptions (#3800) May 13, 2023
@cachescrubber
Contributor Author

This is becoming something bigger than expected.

The current implementation for both async (MicrometerHttpClientInterceptor) and classic (MicrometerHttpRequestExecutor) is written in a way that

  • metrics do not include the time needed to establish a connection;
  • metrics do not count failed connection attempts;
  • async metrics do not even include IO errors on established connections (the original topic here).

Some experimentation using a HttpRequestRetryStrategy to meter IO errors felt awkward, and I was not able to capture errors during connection establishment.

So I asked the HttpClient team for advice on how to instrument HttpClient (async and classic). They recommended using an (Async) ExecChainHandler to capture execution time metrics.

I updated this PR (currently hc5 only) and implemented the suggested instrumentation.
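The recommended approach can be sketched with simplified stand-ins for hc5's ExecChain/ExecChainHandler (the real classic signature is `execute(ClassicHttpRequest, ExecChain.Scope, ExecChain)`). Because the handler wraps the rest of the chain, the recorded time covers connection setup performed further down, and the sample is still taken when an IOException escapes; all names here are hypothetical:

```java
import java.io.IOException;

// Simplified stand-ins for hc5's ExecChain and ExecChainHandler.
interface Chain { String proceed(String request) throws IOException; }
interface Handler { String execute(String request, Chain chain) throws IOException; }

// Times the whole exchange, including connection setup performed further
// down the chain, and records IO errors as well as successes.
class TimingHandler implements Handler {
    long lastNanos = -1;
    String lastOutcome;

    @Override
    public String execute(String request, Chain chain) throws IOException {
        long start = System.nanoTime();
        try {
            String response = chain.proceed(request);
            lastOutcome = "SUCCESS";
            return response;
        } catch (IOException ex) {
            lastOutcome = ex.getClass().getSimpleName();
            throw ex;
        } finally {
            lastNanos = System.nanoTime() - start; // record the timer sample here
        }
    }
}

public class Demo {
    public static void main(String[] args) {
        TimingHandler handler = new TimingHandler();
        try {
            // Downstream chain fails before any response, e.g. connection refused.
            handler.execute("GET /", req -> { throw new IOException("connection refused"); });
        } catch (IOException expected) { }
        System.out.println(handler.lastOutcome + " " + (handler.lastNanos >= 0));
        // prints: IOException true -- the error outcome and a duration were both captured
    }
}
```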

@cachescrubber
Contributor Author

HttpClient has a feature to perform automatic retries in case of specific status codes (503) or retriable IO exceptions. The default setup (DefaultHttpRequestRetryStrategy) is to retry once after a 1-second delay.

It is possible to meter the HTTP transaction combined or individually.

combined means

  • measure the total time over all attempts, including the delays between retries;
  • record the outcome of the last attempt.

individual means

  • measure the time for each individual request, excluding the delays between retries;
  • record the outcome of each attempt.
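A hypothetical sketch of why handler placement produces these two modes: a metering step that wraps the retry step is invoked once per transaction ("combined"), while a metering step below the retry step is invoked once per attempt ("individual"). Chain, metering, and retrying are stand-ins, not hc5 or Micrometer API:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for an exec chain step.
interface Chain { String proceed(String req) throws IOException; }

public class Demo {
    // Counts invocations, standing in for one timer sample per call.
    static Chain metering(AtomicInteger samples, Chain next) {
        return req -> { samples.incrementAndGet(); return next.proceed(req); };
    }

    // Retries once on IOException, standing in for the retry exec step.
    static Chain retrying(Chain next) {
        return req -> {
            try { return next.proceed(req); }
            catch (IOException firstFailure) { return next.proceed(req); }
        };
    }

    public static void main(String[] args) throws IOException {
        AtomicInteger calls = new AtomicInteger();
        // Transport that fails on the first call and succeeds on the second.
        Chain transport = req -> {
            if (calls.incrementAndGet() == 1) throw new IOException("reset");
            return "200 OK";
        };

        // combined: metering wraps the retry step -> one sample for all attempts
        AtomicInteger combined = new AtomicInteger();
        calls.set(0);
        metering(combined, retrying(transport)).proceed("GET /");

        // individual: metering sits below the retry step -> one sample per attempt
        AtomicInteger individual = new AtomicInteger();
        calls.set(0);
        retrying(metering(individual, transport)).proceed("GET /");

        System.out.println(combined.get() + " " + individual.get());
        // prints: 1 2
    }
}
```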

    - Support ObservationRegistry in AsyncExecChainHandler
    - Configurable meterName
    - Introduce ApacheHttpClientMetricsBinder
@sonatype-lift
Contributor

sonatype-lift bot commented May 16, 2023

🛠 Lift Auto-fix

Some of the Lift findings in this PR can be automatically fixed. You can download and apply these changes in your local project directory of your branch to review the suggestions before committing. [1]

# Download the patch
curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3801.diff -o lift-autofixes.diff

# Apply the patch with git
git apply lift-autofixes.diff

# Review the changes
git diff

Want it all in a single command? Open a terminal in your project's directory and copy and paste the following command:

curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3801.diff | git apply

Once you're satisfied, commit and push your changes in your project.

Footnotes

  [1] You can preview the patch by opening the patch URL in the browser.

# Conflicts:
#	micrometer-core/src/test/java/io/micrometer/core/instrument/binder/httpcomponents/hc5/MicrometerHttpClientInterceptorTest.java
@bclozel
Contributor

bclozel commented Jun 26, 2023

Superseded by #3800 and merged with cb86bc2. Thanks so much for your contribution @cachescrubber !

@bclozel bclozel closed this Jun 26, 2023
@bclozel bclozel added the superseded An issue that has been superseded by another label Jun 26, 2023
@cachescrubber
Contributor Author

Hi @bclozel I just took the time to review your changes. Nice to see the ObservationRegistry in action without the intermediate ObservationOrTimerCompatibleInstrumentation. Looks much cleaner!

I think you missed one important aspect of my original PR. I introduced support for meterRetries (see #3801 (comment)). Basically, the (Async)HttpClient has a built-in retry capability which needs special treatment in the instrumentation; otherwise the observations are not correct.

@bclozel
Contributor

bclozel commented Jun 27, 2023

Hey @cachescrubber - I left this out on purpose, because the behavior is inconsistent between the "classic" and "async" flavours and I'm afraid there's nothing we can do about it.

The AsyncExecChain.Scope#execCount (the number of executions; > 1 means it's a retry) is only available to the async variant. There is no such thing in ExecChain.Scope, and the execCount information in HttpRequestRetryExec is not made available to the rest of the chain. I initially thought about adding "retry.count"="1" as a key value, but this won't work for the "classic" version.

Also, setting up the micrometer ChainHandler before the retry one does not yield the same behavior in "classic" and "async":

  • in "classic", we only see a single execution from the ObservationExecChainHandler and record only one observation
  • in "async", the ObservationExecChainHandler is called for each retry, creating a new observation each time

Because of this, I don't think we can consistently produce this information as a KeyValue or directly in Observation.Context. If you have suggestions, let me know.

@cachescrubber
Contributor Author

cachescrubber commented Jun 27, 2023

Hi @bclozel - you are right - instrumentation is a bit different between classic and async. This is why I introduced

  • ApacheHttpClientMetricsBinder#instrument(org.apache.hc.client5.http.impl.async.HttpAsyncClientBuilder)
  • ApacheHttpClientMetricsBinder#instrument(org.apache.hc.client5.http.impl.classic.HttpClientBuilder)

This way, the actual placement of the handler inside the handler chain becomes an implementation detail and no longer the responsibility of the user. Also, the (Async)ExecChainHandler implementations were package-private, and not extensible.

I'm pretty sure I covered the retry scenarios with integration tests showing consistent behavior for both async and classic. Please see (in my last version)

  • hc5.ApacheHttpClientMetricsBinderTest#retriesAreMetered_overall
  • hc5.ApacheHttpClientMetricsBinderTest#retriesAreMetered_individually
  • hc5.ApacheHttpClientMetricsBinderTest#testPositiveOutcomeAfterRetry_overall
  • hc5.ApacheHttpClientMetricsBinderTest#testPositiveOutcomeAfterRetry_individual

I actually think the tests would show that especially the async meters are currently not correct.

Besides the placement within the handler chain, the (scope.execCount.get() == 1) check is the crucial part of the MeteringAsyncExecChainHandler in order to get correct results.
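The role of that check can be illustrated with a stand-in for AsyncExecChain.Scope, whose public execCount field is an AtomicInteger in hc5: since the async chain re-enters the handler on every retry, gating on execCount == 1 limits the metering to one sample per transaction. Apart from that field, all names here are hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in mimicking AsyncExecChain.Scope#execCount from hc5.
class ScopeStandIn {
    final AtomicInteger execCount = new AtomicInteger(1);
}

public class Demo {
    static int observationsStarted = 0;

    // In the async chain, this handler is re-entered for every retry.
    static void execute(ScopeStandIn scope) {
        if (scope.execCount.get() == 1) {
            observationsStarted++; // start an observation only for the first attempt
        }
        // ... hand off to the next async handler
    }

    public static void main(String[] args) {
        ScopeStandIn scope = new ScopeStandIn();
        execute(scope);                    // first attempt
        scope.execCount.incrementAndGet();
        execute(scope);                    // retry re-enters the handler
        System.out.println(observationsStarted);
        // prints: 1 -- a single observation despite two executions
    }
}
```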

I think I could submit a new PR starting with a new integration test demonstrating the issue. Did you remove the Wiremock-based tests on purpose? I'm not sure how else to mock the required behaviour.

@bclozel
Contributor

bclozel commented Jun 27, 2023

instrumentation is a bit different between classic and async. This is why I introduced

ApacheHttpClientMetricsBinder#instrument(org.apache.hc.client5.http.impl.async.HttpAsyncClientBuilder)
ApacheHttpClientMetricsBinder#instrument(org.apache.hc.client5.http.impl.classic.HttpClientBuilder)

I'd rather not tie our instrumentation to implementation details of another ExecChain implementation. Given the inconsistencies and the lack of support for cancellation overall, I don't think this would be wise.

This way, the actual placement of the handler inside the handler chain becomes an implementation detail and no longer the responsibility of the user. Also, the (Async)ExecChainHandler implementations were package-private, and not extensible.

I don't think your initial PR supported cancellations, and the implementation was still leaking observations. With this in place, I think the micrometer ExecChain must be configured before the retry one because of internal lifecycle issues with the connection resources. Maybe that's something we should document on the ObservationExecChainHandler (although we're already documenting that it should be configured first).

I actually think the tests would show that especially the async meters are currently not correct.

I think they would be consistent with the behavior of the library: in the async case retries trigger the entire chain, whereas in the classic case this happens internally.

Besides the placement within the handler chain, the (scope.execCount.get() == 1) check is the crucial part of the MeteringAsyncExecChainHandler in order to get correct results.

I see where you're coming from, but this merely hides the fact that several separate requests were made. If I understand correctly, if retries are not metered, there is a single observation that does not cover all retries in the async mode. Depending on how the setup is made, this could be the opposite for the classic case.

With all that in mind, I don't think we should try to compensate for the inconsistencies in the library itself.

@cachescrubber
Contributor Author

I don't think your initial PR supported cancellations, and the implementation was still leaking observations. With this in place, I think the micrometer ExecChain must be configured before the retry one because of internal lifecycle issues with the connection resources.

Yes, my initial PR did not support cancellations. I created another PR reinstating my meterRetries support on top of your much improved HandlerChain implementation. It would be great if you could have a look at #3941. Is it still leaking observations, or does your async cancellation support also kick in here?

Labels
superseded An issue that has been superseded by another
Development

Successfully merging this pull request may close these issues.

Instrumentations for Apache HttpComponents do not meter errors and leak memory