
Add support to automatically retry http requests if they fail #182

Merged
Merged 3 commits into strimzi:main on Mar 3, 2023

Conversation

mstruk
Contributor

@mstruk mstruk commented Feb 6, 2023

Two additional config options are added: `oauth.http.retries` and `oauth.http.retry.pause.millis`. The retry mechanism is supported in the clients when fetching the token, in the introspection validator when it connects to the introspection or userinfo endpoint, and in OAuth over PLAIN when the server obtains the token in the client's name.

Signed-off-by: Marko Strukelj <marko.strukelj@gmail.com>
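For illustration only (this snippet is not part of the PR): assuming the new options are passed the same way as the other oauth.* options, i.e. as attributes of the client's sasl.jaas.config, enabling retries might look roughly like the sketch below. The client id, secret and token endpoint values are placeholders.

import java.util.Properties;

public class OAuthRetryConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("security.protocol", "SASL_SSL");
        props.setProperty("sasl.mechanism", "OAUTHBEARER");
        // The oauth.* attributes go inside the JAAS config string, like the existing oauth.* options
        props.setProperty("sasl.jaas.config",
            "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required"
            + " oauth.client.id=\"my-client\""
            + " oauth.client.secret=\"my-secret\""
            + " oauth.token.endpoint.uri=\"https://sso.example.com/token\""
            + " oauth.http.retries=\"2\""                 // retry a failed request up to 2 more times
            + " oauth.http.retry.pause.millis=\"500\";"); // pause 500 ms between retry attempts
    }
}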
@mstruk mstruk added this to the 0.12.0 milestone Feb 6, 2023
@mstruk mstruk marked this pull request as ready for review February 9, 2023 11:08
@mstruk mstruk requested a review from scholzj February 9, 2023 11:08
Member

@scholzj scholzj left a comment


Two nits, but LGTM otherwise.

@@ -107,7 +107,7 @@
</module>
<module name="CyclomaticComplexity">
<!-- default is 10-->
<property name="max" value="19"/>
<property name="max" value="20"/>
Member


Would an annotation on the class/method infringing this be better than increasing the limit for every class? We normally use annotations in other projects.

Contributor Author


OK, I reverted this and moved the warning suppression into the code.
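For context, in-code suppression with Checkstyle typically looks like the sketch below. It assumes the SuppressWarningsHolder and SuppressWarningsFilter modules are enabled in checkstyle.xml; the class and method here are illustrative, not taken from this PR.

class ComplexityExample {

    // Suppress the CyclomaticComplexity check for this one method instead of
    // raising the global limit in checkstyle.xml
    @SuppressWarnings("checkstyle:CyclomaticComplexity")
    void complexMethod() {
        // ... branching logic that exceeds the default complexity limit ...
    }
}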

@@ -46,7 +46,7 @@
<strimzi-oauth.version>1.0.0-SNAPSHOT</strimzi-oauth.version>
<checkstyle.dir>..</checkstyle.dir>

<kafka.docker.image>quay.io/strimzi/kafka:0.31.1-kafka-3.2.3</kafka.docker.image>
<kafka.docker.image>quay.io/strimzi/kafka:0.33.0-kafka-3.3.1</kafka.docker.image>
Member


Any reason for using 3.3.1 and not 3.3.2 (both here and below)? It probably doesn't matter much, just wondering.

Contributor Author


No reason. I updated to the latest.

…fka image for 3.3.x

Signed-off-by: Marko Strukelj <marko.strukelj@gmail.com>

import java.io.IOException;

public interface IOTask<T> {
Member


Can we have a sentence of Javadoc to explain what this interface abstracts and exactly what the contract is? For example, I assume there's an assumption somewhere that run() should be idempotent. Otherwise, if we ended up re-attempting a request which had actually already succeeded in a prior attempt, we'd incur the same non-idempotent side effects again.

Also, it seems a bit odd to have IOTask here while the retry method lives in HttpUtil. The retry is actually independent of HTTP AFAICS, so you could make it a static method of this class. Or perhaps just rename it HttpTask.

Contributor Author


You are right, this interface would be better named HttpTask, since it exists to perform HTTP requests to the authorization server. It primarily exists because its run() method can throw an IOException, which makes the calling code more elegant: no extra layer of exception handling, no wrapping and unwrapping of RuntimeException, no difficulty telling a wrapping RuntimeException from a non-wrapped one, and no subtle differences in behaviour between invoking HttpUtil methods directly and via HttpUtil.doWithRetries().

Contributor Author


I renamed IOTask to HttpTask.
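For reference, a minimal sketch of how the renamed interface could carry the requested Javadoc; the wording (including the idempotency note) is illustrative and may differ from the merged code.

import java.io.IOException;

/**
 * A single HTTP request against the authorization server that can be retried.
 * Implementations should be idempotent: a retry may re-execute a request whose
 * previous attempt actually succeeded on the server before the response was lost.
 */
public interface HttpTask<T> {

    /** Performs the request and returns the result, or throws IOException so the caller can retry. */
    T run() throws IOException;
}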

@@ -45,6 +46,44 @@ public class HttpUtil {
static final int DEFAULT_CONNECT_TIMEOUT = getConnectTimeout(new Config());
static final int DEFAULT_READ_TIMEOUT = getReadTimeout(new Config());

public static <T> T doWithRetries(int retries, long retryPauseMillis, IOTask<T> task) throws ExecutionException {
Member


All the callers of this method do essentially the same thing, which is to wrap a call to a request method (get or post) in a lambda that also maintains some metrics. I wonder if it would make sense to make the metrics part of this method.

Contributor Author


I added a MetricsHandler to the doWithRetries method. It made the code cleaner at the call sites. Great suggestion, thanks.
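A rough sketch of the resulting shape, based on the loop excerpted below; the MetricsHandler callback names (onSuccess, onError) and the exact parameter order are placeholders, not necessarily what was merged.

import java.io.IOException;
import java.util.concurrent.ExecutionException;

class RetrySketch {

    // Hypothetical metrics callback; method names are placeholders
    interface MetricsHandler {
        void onSuccess(long millis);
        void onError(Throwable cause, long millis);
    }

    interface HttpTask<T> {
        T run() throws IOException;
    }

    static <T> T doWithRetries(int retries, long retryPauseMillis, MetricsHandler metrics, HttpTask<T> task) throws ExecutionException {
        for (int i = 1; i <= retries + 1; i++) {
            // Pause only before a retry, never before the first attempt
            if (i > 1 && retryPauseMillis > 0) {
                try {
                    Thread.sleep(retryPauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new ExecutionException("Interrupted while pausing before a retry", e);
                }
            }
            long start = System.currentTimeMillis();
            try {
                T result = task.run();
                metrics.onSuccess(System.currentTimeMillis() - start);
                return result;
            } catch (IOException e) {
                metrics.onError(e, System.currentTimeMillis() - start);
                if (i > retries) {
                    throw new ExecutionException("Request failed after " + i + " attempt(s)", e);
                }
            }
        }
        throw new IllegalStateException("Unreachable");
    }
}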

if (i > 1 && retryPauseMillis > 0) {
    log.debug("Pausing before retrying failed action (for {}ms)", retryPauseMillis);
    try {
        Thread.sleep(retryPauseMillis);
Member


I'm not sure about the fixed pause time. If something goes wrong with an endpoint, the longer the endpoint is down the more clients are likely to end up in the retry loop. A recovering OAuth server is likely to be hit with significant load.

Because of the blocking, thread-consuming nature of what we're doing here, I suspect users will be inclined to stick to small pause times and smallish numbers of retries. Suppose they configure 5 retries with a 200 ms pause. If the server normally sees 50 requests/s and is down for more than 1 s, then when it comes back it will see those 50 requests in the first 200 ms, i.e. 5 times the normal rate. If it were a JVM-based server this would happen before the JIT had kicked in, making the problem worse because response times would be slower, so it would end up seeing even higher load.

So I wonder if we should apply a backoff so that the peak server load isn't as bad.

Contributor Author


Yeah, what I'm trying to avoid is any request idling, since the calling logic runs inside a Kafka worker thread, and especially on the broker side, idling might quickly exhaust the worker pool. I don't know how worker threads are distributed among listeners, but if the worker pool is shared then even non-OAuth listeners might end up without available worker threads. I foresee a retry count of 2 or 3 max with a low retry pause, maybe 1 s max. But yes, the dam-burst effect when the authorization server is unavailable to a great number of requests can become a problem.

OTOH one should always size the authorization server with a 10x margin to survive such spikes.

I'd say my priority here is: don't break the broker. Let the authorization server protect itself, maybe by refusing or dropping connections if it's under too much stress.

Another factor to be aware of is the connect timeout and read timeout, which default to 60 seconds. Without explicitly configuring different values, that's another point where a request can block for a long time.

If this were used in a standalone client app, exponential backoff would be less of a problem from a thread-exhaustion point of view.

Maybe we don't add exponential backoff for now. If we decide we need it in the future, we can add a boolean flag that enables it and multiply the configured retryPauseMillis by 2 for each subsequent retry (see the sketch below).
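A minimal sketch of that doubling idea, assuming a hypothetical boolean flag; none of this is in the merged PR.

class BackoffSketch {

    // Pause to apply before the given attempt (attempt 1 is the initial request, so no pause).
    // With exponentialBackoff enabled the pause doubles for each further retry:
    // pause, 2 * pause, 4 * pause, ...
    static long pauseForAttempt(long retryPauseMillis, boolean exponentialBackoff, int attempt) {
        if (attempt <= 1) {
            return 0;
        }
        if (!exponentialBackoff || attempt == 2) {
            return retryPauseMillis;
        }
        return retryPauseMillis << (attempt - 2);
    }
}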

…til.doWithRetries() contract

Signed-off-by: Marko Strukelj <marko.strukelj@gmail.com>
Member

@tombentley tombentley left a comment


Thanks @mstruk.

@mstruk mstruk merged commit e643303 into strimzi:main Mar 3, 2023
@mstruk
Contributor Author

mstruk commented Mar 3, 2023

Thanks everyone.
