
Add support to automatically retry http requests if they fail #182

Merged
Merged 3 commits into strimzi:main on Mar 3, 2023

Conversation

mstruk
Contributor

@mstruk mstruk commented Feb 6, 2023

Two additional config options are added: `oauth.http.retries` and `oauth.http.retry.pause.millis`. The retry mechanism is supported in the clients when fetching the token, in the introspection validator when it connects to the introspection or userinfo endpoint, and in OAuth over PLAIN when the server obtains the token in the client's name.

Signed-off-by: Marko Strukelj <marko.strukelj@gmail.com>
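For illustration only (this snippet is not part of the PR): assuming the new options are passed the same way as the other oauth.* options, i.e. as attributes of the client's sasl.jaas.config, enabling retries might look roughly like the sketch below. The client id, secret and token endpoint values are placeholders.

import java.util.Properties;

public class OAuthRetryConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("security.protocol", "SASL_SSL");
        props.setProperty("sasl.mechanism", "OAUTHBEARER");
        // The oauth.* attributes go inside the JAAS config string, like the existing oauth.* options
        props.setProperty("sasl.jaas.config",
            "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required"
            + " oauth.client.id=\"my-client\""
            + " oauth.client.secret=\"my-secret\""
            + " oauth.token.endpoint.uri=\"https://sso.example.com/token\""
            + " oauth.http.retries=\"2\""                 // retry a failed request up to 2 more times
            + " oauth.http.retry.pause.millis=\"500\";"); // pause 500 ms between retry attempts
    }
}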
@mstruk mstruk added this to the 0.12.0 milestone Feb 6, 2023
@mstruk mstruk marked this pull request as ready for review February 9, 2023 11:08
@mstruk mstruk requested a review from scholzj February 9, 2023 11:08
Member

@scholzj scholzj left a comment


Two nits, but LGTM otherwise.

@@ -107,7 +107,7 @@
</module>
<module name="CyclomaticComplexity">
<!-- default is 10-->
<property name="max" value="19"/>
<property name="max" value="20"/>
Member


Would an annotation on the class/method infringing this be better than increasing the limit for every class? We normally use annotations in other projects.

Contributor Author


OK, I reverted this and moved the warning suppression into the code.
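For context, in-code suppression with Checkstyle typically looks like the sketch below. It assumes the SuppressWarningsHolder and SuppressWarningsFilter modules are enabled in checkstyle.xml; the class and method here are illustrative, not taken from this PR.

class ComplexityExample {

    // Suppress the CyclomaticComplexity check for this one method instead of
    // raising the global limit in checkstyle.xml
    @SuppressWarnings("checkstyle:CyclomaticComplexity")
    void complexMethod() {
        // ... branching logic that exceeds the default complexity limit ...
    }
}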

@@ -46,7 +46,7 @@
<strimzi-oauth.version>1.0.0-SNAPSHOT</strimzi-oauth.version>
<checkstyle.dir>..</checkstyle.dir>

<kafka.docker.image>quay.io/strimzi/kafka:0.31.1-kafka-3.2.3</kafka.docker.image>
<kafka.docker.image>quay.io/strimzi/kafka:0.33.0-kafka-3.3.1</kafka.docker.image>
Member


Any reason for using 3.3.1 and not 3.3.2 (both here and below)? It probably doesn't matter much, just wondering.

Contributor Author


No reason. I updated to the latest.

…fka image for 3.3.x

Signed-off-by: Marko Strukelj <marko.strukelj@gmail.com>

import java.io.IOException;

public interface IOTask<T> {
Member


Can we have a sentence of Javadoc to explain what this interface abstracts and exactly what the contract is? For example, I assume there's an assumption somewhere that run() should be idempotent. Otherwise, if we ended up re-attempting a request which had actually already succeeded in a prior attempt, we'd incur the same non-idempotent side effects again.

Also, it seems a bit odd to have IOTask here while the retry method lives in HttpUtil. The retry is actually independent of HTTP AFAICS, so you could make it a static method of this class. Or perhaps just rename it HttpTask.

Contributor Author


You are right, this interface would be better named HttpTask, since it exists to perform HTTP requests to the authorization server. It primarily exists because its run() method can throw an IOException, which makes the calling code more elegant: no extra layer of exception handling, no wrapping and unwrapping of RuntimeException, no difficulty telling a wrapping RuntimeException from a non-wrapped one, and no subtle differences in behaviour between invoking HttpUtil methods directly and via HttpUtil.doWithRetries().

Contributor Author


I renamed IOTask to HttpTask.
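For reference, a minimal sketch of how the renamed interface could carry the requested Javadoc; the wording (including the idempotency note) is illustrative and may differ from the merged code.

import java.io.IOException;

/**
 * A single HTTP request against the authorization server that can be retried.
 * Implementations should be idempotent: a retry may re-execute a request whose
 * previous attempt actually succeeded on the server before the response was lost.
 */
public interface HttpTask<T> {

    /** Performs the request and returns the result, or throws IOException so the caller can retry. */
    T run() throws IOException;
}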

@@ -45,6 +46,44 @@ public class HttpUtil {
static final int DEFAULT_CONNECT_TIMEOUT = getConnectTimeout(new Config());
static final int DEFAULT_READ_TIMEOUT = getReadTimeout(new Config());

public static <T> T doWithRetries(int retries, long retryPauseMillis, IOTask<T> task) throws ExecutionException {
Member


All the callers of this method do essentially the same thing, which is to wrap a call to a request method (get or post) in a lambda that also maintains some metrics. I wonder if it would make sense to make the metrics part of this method.

Contributor Author


I added a MetricsHandler to the doWithRetries method. It made the code cleaner at the call sites. Great suggestion, thanks.
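A rough sketch of the resulting shape, based on the loop excerpted below; the MetricsHandler callback names (onSuccess, onError) and the exact parameter order are placeholders, not necessarily what was merged.

import java.io.IOException;
import java.util.concurrent.ExecutionException;

class RetrySketch {

    // Hypothetical metrics callback; method names are placeholders
    interface MetricsHandler {
        void onSuccess(long millis);
        void onError(Throwable cause, long millis);
    }

    interface HttpTask<T> {
        T run() throws IOException;
    }

    static <T> T doWithRetries(int retries, long retryPauseMillis, MetricsHandler metrics, HttpTask<T> task) throws ExecutionException {
        for (int i = 1; i <= retries + 1; i++) {
            // Pause only before a retry, never before the first attempt
            if (i > 1 && retryPauseMillis > 0) {
                try {
                    Thread.sleep(retryPauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new ExecutionException("Interrupted while pausing before a retry", e);
                }
            }
            long start = System.currentTimeMillis();
            try {
                T result = task.run();
                metrics.onSuccess(System.currentTimeMillis() - start);
                return result;
            } catch (IOException e) {
                metrics.onError(e, System.currentTimeMillis() - start);
                if (i > retries) {
                    throw new ExecutionException("Request failed after " + i + " attempt(s)", e);
                }
            }
        }
        throw new IllegalStateException("Unreachable");
    }
}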

if (i > 1 && retryPauseMillis > 0) {
    log.debug("Pausing before retrying failed action (for {}ms)", retryPauseMillis);
    try {
        Thread.sleep(retryPauseMillis);
Member


I'm not sure about the fixed pause time. If something goes wrong with an endpoint, the longer the endpoint is down the more clients are likely to end up in the retry loop. A recovering OAuth server is likely to be hit with significant load.

Because of the blocking, thread-consuming nature of what we're doing here, I suspect users will be inclined to stick to small pause times and smallish numbers of retries. Suppose they configure 5 retries with a 200 ms pause. If the server normally sees 50 requests/s and is down for more than 1 s, then when it comes back it will see those 50 requests in the first 200 ms, i.e. 5 times the normal rate. If it were a JVM-based server this would happen before the JIT had kicked in, making the problem worse because response times would be slower, so it would end up seeing even higher load.

So I wonder if we should apply a backoff so that the peak server load isn't as bad.

Contributor Author


Yeah, what I'm trying to avoid is any request idling, since the calling logic runs inside a Kafka worker thread, and especially on the broker side, idling might quickly exhaust the worker pool. I don't know how worker threads are distributed among listeners, but if the worker pool is shared then even non-OAuth listeners might end up without available worker threads. I foresee a retry count of 2 or 3 max with a low retry pause, maybe 1 s max. But yes, the dam-burst effect when the authorization server is unavailable to a great number of requests can become a problem.

OTOH one should always size the authorization server with a 10x margin to survive such spikes.

I'd say my priority here is: don't break the broker. Let the authorization server protect itself, maybe by refusing or dropping connections if it's under too much stress.

Another factor to be aware of is the connect timeout and read timeout, which default to 60 seconds. Without explicitly configuring different values, that's another point where a request can block for a long time.

If this were used in a standalone client app, exponential backoff would be less of a problem from a thread-exhaustion point of view.

Maybe we don't add exponential backoff for now. If we decide we need it in the future, we can add a boolean flag that enables it and multiply the configured retryPauseMillis by 2 for each subsequent retry (see the sketch below).
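A minimal sketch of that doubling idea, assuming a hypothetical boolean flag; none of this is in the merged PR.

class BackoffSketch {

    // Pause to apply before the given attempt (attempt 1 is the initial request, so no pause).
    // With exponentialBackoff enabled the pause doubles for each further retry:
    // pause, 2 * pause, 4 * pause, ...
    static long pauseForAttempt(long retryPauseMillis, boolean exponentialBackoff, int attempt) {
        if (attempt <= 1) {
            return 0;
        }
        if (!exponentialBackoff || attempt == 2) {
            return retryPauseMillis;
        }
        return retryPauseMillis << (attempt - 2);
    }
}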

…til.doWithRetries() contract

Signed-off-by: Marko Strukelj <marko.strukelj@gmail.com>
Member

@tombentley tombentley left a comment


Thanks @mstruk.

@mstruk mstruk merged commit e643303 into strimzi:main Mar 3, 2023
@mstruk
Contributor Author

mstruk commented Mar 3, 2023

Thanks everyone.
