Add support to automatically retry http requests if they fail #182
Conversation
Additional config options `oauth.http.retries` and `oauth.http.retry.pause.millis` are added. The retry mechanism is supported in the clients when fetching the token, in the introspection validator when it connects to the introspection or userinfo endpoint, and in OAuth over PLAIN when the server obtains the token on the client's behalf. Signed-off-by: Marko Strukelj <marko.strukelj@gmail.com>
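For illustration, a minimal sketch of setting these options in a client configuration. The option names come from this PR; how they are actually passed (JAAS config, client properties, env vars) depends on the deployment, so treat the wiring here as an assumption:

```java
import java.util.Properties;

public class RetryConfigSketch {
    public static void main(String[] args) {
        // Hypothetical sketch: the two new options, set alongside the usual
        // OAuth client configuration. The exact delivery mechanism (JAAS
        // config string, properties file, env vars) is deployment-specific.
        Properties props = new Properties();
        props.setProperty("oauth.http.retries", "2");              // retry a failed request up to 2 times
        props.setProperty("oauth.http.retry.pause.millis", "500"); // pause 500 ms between attempts
        System.out.println(props.getProperty("oauth.http.retries"));
    }
}
```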
Two nits, but LGTM otherwise.
.checkstyle/checkstyle.xml (Outdated)

```diff
@@ -107,7 +107,7 @@
 </module>
 <module name="CyclomaticComplexity">
     <!-- default is 10-->
-    <property name="max" value="19"/>
+    <property name="max" value="20"/>
```
Would an annotation on the class/method infringing this be better than increasing the limit for every class? We normally use annotations in other projects.
Ok, I reverted and moved warning suppression into the code.
testsuite/pom.xml (Outdated)

```diff
@@ -46,7 +46,7 @@
 <strimzi-oauth.version>1.0.0-SNAPSHOT</strimzi-oauth.version>
 <checkstyle.dir>..</checkstyle.dir>

-<kafka.docker.image>quay.io/strimzi/kafka:0.31.1-kafka-3.2.3</kafka.docker.image>
+<kafka.docker.image>quay.io/strimzi/kafka:0.33.0-kafka-3.3.1</kafka.docker.image>
```
Any reason for using 3.3.1 and not 3.3.2? (both here and below). It probably doesn't matter much, just wondering
No reason. I updated to the latest.
…fka image for 3.3.x Signed-off-by: Marko Strukelj <marko.strukelj@gmail.com>
```java
import java.io.IOException;

public interface IOTask<T> {
```
Can we have a sentence of Javadoc to explain what this interface abstracts, and exactly what the contract is? For example, I assume there's an assumption somewhere that `run()` should be idempotent. Otherwise, if we ended up re-attempting a request which had actually already succeeded in a prior attempt, we'd incur the same non-idempotent side-effects.
Also it seems a bit odd to have `IOTask` but the retry method be in `HttpUtil`. The retry is actually independent of HTTP AFAICS, so you could make it a static method of this class. Or perhaps just rename it `HttpTask`.
You are right, this interface would be better named `HttpTask`, as it exists to perform HTTP requests to the authorization server. It primarily exists because its `run()` method can throw an `IOException`, which makes the code using it more elegant: no needless extra layer of exception handling, no unwrapping from a `RuntimeException`, no difficulty differentiating between wrapped and non-wrapped `RuntimeException`s, and no subtle differences in behaviour between invoking `HttpUtil` methods directly or via `HttpUtil.doWithRetries()`.
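A minimal sketch of the shape being discussed. The Javadoc wording and the `demo()` method are illustrative assumptions, not the PR's actual code; the point is that a `run()` declaring `IOException` lets an HTTP call be passed as a lambda without checked-exception wrapping:

```java
import java.io.IOException;

public class HttpTaskSketch {

    /**
     * A unit of work that performs an HTTP request against the authorization
     * server. Implementations should be idempotent: retry logic may invoke
     * run() again even if a prior attempt already reached the server.
     */
    @FunctionalInterface
    interface HttpTask<T> {
        /** Performs the request; a thrown IOException signals a retriable failure. */
        T run() throws IOException;
    }

    // Because run() declares IOException, no RuntimeException wrapping is needed.
    static String demo() throws IOException {
        HttpTask<String> task = () -> "token-response"; // stands in for a real HTTP call
        return task.run();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```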
I renamed `IOTask` to `HttpTask`.
```diff
@@ -45,6 +46,44 @@ public class HttpUtil {
 static final int DEFAULT_CONNECT_TIMEOUT = getConnectTimeout(new Config());
 static final int DEFAULT_READ_TIMEOUT = getReadTimeout(new Config());

+public static <T> T doWithRetries(int retries, long retryPauseMillis, IOTask<T> task) throws ExecutionException {
```
All the callers of this method do essentially the same thing, which is wrap a call to a request method (`get` or `post`) in a lambda which also maintains some metrics. I wonder if it would make sense to make the metrics part of this method.
I added a `MetricsHandler` to the `doWithRetries` method. It made the code cleaner at the places of use.
Great suggestion, thanks.
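The resulting shape might look roughly like this sketch. The `MetricsHandler` method names and the exact retry accounting are assumptions for illustration, not the PR's actual code:

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class RetrySketch {

    /** Hypothetical callback names; the PR's MetricsHandler may differ. */
    interface MetricsHandler {
        void addSuccessRequestTime(long millis);
        void addErrorRequestTime(Throwable cause, long millis);
    }

    @FunctionalInterface
    interface HttpTask<T> {
        T run() throws IOException;
    }

    /**
     * Runs the task up to (retries + 1) times, pausing between attempts and
     * reporting per-attempt timing to the metrics handler (if any).
     */
    static <T> T doWithRetries(int retries, long retryPauseMillis,
                               MetricsHandler metrics, HttpTask<T> task) throws ExecutionException {
        Throwable last = null;
        for (int i = 0; i <= retries; i++) {
            if (i > 0 && retryPauseMillis > 0) {
                try {
                    Thread.sleep(retryPauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new ExecutionException("Interrupted while pausing before retry", e);
                }
            }
            long start = System.currentTimeMillis();
            try {
                T result = task.run();
                if (metrics != null) metrics.addSuccessRequestTime(System.currentTimeMillis() - start);
                return result;
            } catch (IOException e) {
                if (metrics != null) metrics.addErrorRequestTime(e, System.currentTimeMillis() - start);
                last = e;
            }
        }
        throw new ExecutionException("Failed after " + (retries + 1) + " attempts", last);
    }

    public static void main(String[] args) throws Exception {
        // A task that succeeds immediately needs no retries.
        System.out.println(doWithRetries(2, 10, null, () -> "token"));
    }
}
```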
```java
if (i > 1 && retryPauseMillis > 0) {
    log.debug("Pausing before retrying failed action (for {}ms)", retryPauseMillis);
    try {
        Thread.sleep(retryPauseMillis);
```
I'm not sure about the fixed pause time. If something goes wrong with an endpoint, the longer the endpoint is down, the more clients are likely to end up in the retry loop. A recovering OAuth server is likely to be hit with significant load.
Because of the blocking/thread-using nature of what we're doing here, I suspect users will be inclined to stick to small pause times and smallish numbers of retries. Suppose they configure 5 retries with a 200ms pause. If the server normally sees 50 requests/s, then if it's down for > 1s when it comes back it will see those 50 requests in the first 200ms, i.e. 5 times higher than normal. If it were a JVM-based server this would happen before JIT had kicked in making the problem worse because the response time would be slower, so it would end up seeing even higher load.
So I wonder if we should apply a backoff so that the peak server load isn't as bad.
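The arithmetic in the comment above can be checked with a small sketch. The one-second accumulation window is the scenario's assumption (clients stalled during roughly a second of downtime all retrying within one pause window):

```java
public class PeakLoadSketch {

    /**
     * Peak-to-normal load ratio when one second's worth of stalled callers
     * (normalRps requests) all retry within a single pause window.
     */
    static double peakMultiplier(double normalRps, double pauseSeconds) {
        double peakRps = normalRps / pauseSeconds; // e.g. 50 requests hitting within 0.2 s
        return peakRps / normalRps;                // note: independent of normalRps
    }

    public static void main(String[] args) {
        // The comment's example: 50 req/s normal, 200 ms pause.
        System.out.println(peakMultiplier(50.0, 0.2) + "x normal load");
    }
}
```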
Yeah, what I'm trying to avoid is having any request idle, since the calling logic runs inside a Kafka worker thread, and especially on the broker side idling might quickly exhaust the worker pool. I don't know how worker threads are distributed among listeners, but if the worker pool is shared then even non-OAuth listeners might end up without available worker threads. I foresee a retry count of 2 or 3 max with a low retry pause, maybe 1s max. But the dam effect, when the authorization server is unavailable to a great number of requests, can become a problem, yes.
OTOH one should always size the authz server 10x to survive such spikes.
I'd say my priority here is: don't break the broker. Let the authorization server protect itself, maybe by refusing / dropping connections if it's under too much stress.
Another factor to be aware of is the connect timeout and read timeout, which default to 60 seconds. Without explicitly configuring different values, that's another point where a request can block for a long time.
If this is used in the standalone client app then an exponential backoff would be less of a problem from thread exhaustion POV.
Maybe we don't add exponential backoff for now. If we decide to need it in the future we can add a boolean flag that enables it, and multiply the configured retryPauseMillis x2 for each next retry.
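If such a flag were added later, the doubling described above could be as simple as this sketch (the method name is hypothetical):

```java
public class BackoffSketch {

    /**
     * Pause before retry attempt n (1-based): the configured base pause,
     * doubled for each subsequent retry, i.e. base * 2^(attempt - 1).
     */
    static long pauseForAttempt(long retryPauseMillis, int attempt) {
        return retryPauseMillis << (attempt - 1);
    }

    public static void main(String[] args) {
        // With a 200 ms base pause: 200, 400, 800, 1600 ms.
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("retry " + attempt + ": " + pauseForAttempt(200, attempt) + " ms");
        }
    }
}
```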
…til.doWithRetries() contract Signed-off-by: Marko Strukelj <marko.strukelj@gmail.com>
Thanks @mstruk
Thanks everyone.