
add idleTimeoutDecay param to QTP #9354

Closed
wants to merge 14 commits

Conversation

magibney

See discussion on #9237

@magibney magibney left a comment

porting over comments from magibney@dfab46b

Comment on lines 1102 to 1105
long itNanos = TimeUnit.MILLISECONDS.toNanos(idleTimeout);
if (NanoTime.elapsed(idleBaseline, now) > itNanos &&
NanoTime.elapsed(last = _lastShrink.get(), now) > (siNanos = getShrinkInterval()) &&
_lastShrink.compareAndSet(last, Math.max(last, now - itNanos) + siNanos))
Contributor

I'm not a big fan of the inline assignment as it is hard to read.
Also I'm not sure we actually need to check the idle timeout here, nor that we need an idleBaseline.
Couldn't this just be:

Suggested change
long itNanos = TimeUnit.MILLISECONDS.toNanos(idleTimeout);
if (NanoTime.elapsed(idleBaseline, now) > itNanos &&
NanoTime.elapsed(last = _lastShrink.get(), now) > (siNanos = getShrinkInterval()) &&
_lastShrink.compareAndSet(last, Math.max(last, now - itNanos) + siNanos))
if (NanoTime.elapsed(last, now) > getShrinkInterval() && _lastShrink.compareAndSet(last, now))

Author

b082e96 removes the inline variable assignment (and adds some comments for clarity).

IIUC though, we definitely do still want to check the idle timeout here (and we do need idleBaseline). Otherwise idleTimeout is effectively a ceiling on how long a thread might wait before considering itself potentially eligible to be removed from the pool, whereas I had thought we'd want idleTimeout to be a floor on how long a thread must be idle before it becomes eligible for removal.

If we don't check idleTimeout here, then the only place we're really using idleTimeout is in idleJobPoll() -- so a thread will always try to remove itself (wrt only the shrinkInterval), unless a job is immediately available. Respecting idleTimeout as a floor for determining when a thread is eligible to die should help the pool grow more consistently (to accommodate high load), while allowing (via idleTimeoutDecay) the pool to shrink more efficiently once we're past the idleTimeout window of a transient spike.
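
For concreteness, a minimal standalone model of the check being discussed, using illustrative timeout values and plain nanoTime arithmetic in place of Jetty's NanoTime helper (not the PR's actual code):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Standalone model of the eligibility check discussed above (names and values are illustrative).
class ShrinkCheckSketch
{
    private final AtomicLong lastShrink = new AtomicLong(System.nanoTime());
    private final long idleTimeoutNanos = TimeUnit.MILLISECONDS.toNanos(60_000);
    private final long shrinkIntervalNanos = TimeUnit.MILLISECONDS.toNanos(12_000);

    // idleBaseline: nanoTime at which this thread last became idle.
    boolean eligibleToExit(long idleBaseline, long now)
    {
        // idleTimeout acts as a floor: this thread must itself have been idle at least that long.
        if (now - idleBaseline <= idleTimeoutNanos)
            return false;

        // At most one exit per shrinkInterval, coordinated through the pool-wide lastShrink.
        long last = lastShrink.get();
        if (now - last <= shrinkIntervalNanos)
            return false;

        // "Backdate" the new lastShrink so the shrink window advances steadily
        // instead of restarting from 'now' on every exit.
        return lastShrink.compareAndSet(last, Math.max(last, now - idleTimeoutNanos) + shrinkIntervalNanos);
    }
}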

@gregw
Contributor

gregw commented Feb 20, 2023

@magibney Probably time to mark this as non-draft.

@gregw gregw requested review from lorban and sbordet February 20, 2023 21:53
@magibney magibney marked this pull request as ready for review February 20, 2023 22:28
@magibney
Author

Thanks again for the review/consideration, and please let me know if there's anything more I can do to assist!

Not sure how much more work is anticipated on this PR from your end, but if there's any chance this might be able to make it into 10/11.0.14, that would be awesome.

For context: Solr would immediately benefit from this feature (any memory not being used for idle threadpool capacity will passively be used by OS page cache). Of course, no worries if this doesn't make it into 10.0.14, but I figured it couldn't hurt to ask explicitly!

@sbordet
Contributor

sbordet commented Feb 22, 2023

@magibney it likely will not make Jetty 1x.0.14: this mechanism is delicate, we likely have people relying on the current behavior, and the new behavior does not have a test case, so we would like to give it more consideration.

@sbordet
Contributor

sbordet commented Feb 22, 2023

What currently happens is that a thread idle times out, and looks at another "time" to decide whether to exit or go back polling from the task queue.

I'd call this extra time exitPeriod or similar, but I find its javadoc hard to write; in other words, it does not really represent a property of the thread pool that can be easily explained (such as "the pool shrinks 10 threads/min if it's completely idle").

Let's consider the case of 10 threads idling out at the same moment, which may be a common case during load spikes; let's assume one of them exits, so it sets the exitTime (or lastShrinkTime), and the others will go back polling the task queue for another idleTimeout, so this exitPeriod or shrink rate is not really respected, when idleTimeout > exitPeriod.
The next thread wakes up after another idleTimeout, so the appearance is that setting this new exitPeriod parameter has no effect.
We would need to recompute the idleTimeout for an already-expired but non-exited thread, so that it only polls for exitPeriod, not the whole idleTimeout.

What is the right name for this property?
Perhaps shrinkTimeout? Its inverse is easily the "shrink rate", and when shrinkTimeout==idleTimeout we have the current behavior.
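
A tiny illustration of that naming, with made-up values:

long shrinkTimeoutMs = 6_000;                              // illustrative value only
double shrinkRatePerMinute = 60_000.0 / shrinkTimeoutMs;   // its inverse: 10 threads/min when fully idle
// shrinkTimeoutMs == idleTimeoutMs  ->  at most one exit per idleTimeout, i.e. today's behavior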

@magibney sorry if we're being picky on this one, but it's a key Jetty component that we don't want to modify lightly.

Let us know if you would be down to implement the above, along with a test case, otherwise we'll keep your contribution and do it ourselves.
Thanks for persisting on this!

@magibney
Author

> sorry if we're being picky on this one, but it's a key Jetty component that we don't want to modify lightly.

Totally understandable!

> Let's consider the case of 10 threads idling out at the same moment, which may be a common case during load spikes; let's assume one of them exits, so it sets the exitTime (or lastShrinkTime), and the others will go back polling the task queue for another idleTimeout, so this exitPeriod or shrink rate is not really respected, when idleTimeout > exitPeriod

Agreed that this is a useful case to consider, and likely to be common in practice; but the behavior you describe is not exactly what should happen with the PR as currently implemented, and in fact I think that the current PR should exhibit the desired behavior. When a thread expires, it doesn't set lastShrinkTime to now, but rather "backdates" lastShrinkTime to the greater of lastShrinkTime + shrinkInterval or now - idleTimeout + shrinkInterval. This allows the window to move in a consistent way, and should allow all 10 threads in this example to expire according to the configured idleTimeout and idleTimeoutDecay.

> does not really represent a property of the thread pool that can be easily explained (such as "the pool shrinks 10 threads/min if it's completely idle").

The proposed idleTimeoutDecay could be described as "the maximum number of threads that, having been idle for idleTimeout millis, are allowed to die per idleTimeout interval". It would be quite straightforward to modify the behavior to be "per minute" instead of "per idleTimeout interval" -- that would be a bit simpler to document, but I suspect that defining the behavior wrt idleTimeout may be more useful (if slightly more convoluted to describe).

So in the example case, if idleTimeout is set to 1 minute, and all become eligible for removal at the same 60s mark, an idleTimeoutDecay >= 10 would cause all 10 threads to die immediately at the 1 minute mark; idleTimeoutDecay == 5 would cause 5 threads to die at the 1 minute mark, and 5 to die at the 2 minute mark.
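
A hypothetical restatement of those numbers (illustrative names, not the PR's API, and semantics as described in this comment):

long idleTimeoutMs = 60_000;   // 1 minute
int idleTimeoutDecay = 5;      // at most 5 threads may die per idleTimeout interval

// 10 threads all become eligible for removal at the 60s mark:
//   idleTimeoutDecay >= 10  ->  all 10 die at ~60s
//   idleTimeoutDecay == 5   ->  5 die at ~60s, the remaining 5 at ~120s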

I have some test cases (preparing a plugin variant to tide us over until a release), but wasn't sure where/how you'd want to add test cases here. If you can provide rough guidance on where/how you'd like to see tests added, I'd be happy to take a crack at porting my existing tests over.

@magibney
Author

The behavior should be nearly identical for the default case. I say "nearly" because there is a subtle difference:

Previously, the determination of whether a thread was eligible for removal was coordinated via the pool-scoped _lastShrink timestamp. This was arguably sub-optimal (albeit in a way that didn't matter much): a pool with consistent load would still always remove the first thread per idleTimeout interval that didn't immediately receive a job off the queue. This didn't matter much because at worst, you'd end up removing (and being forced to recreate) one thread per idleTimeout interval.

Ideally (even without this PR) one would arguably want pool shrinkage to be considered based on how long some thread has been idle (indicating sustained idle capacity in the pool). This distinction matters more when decoupling idleTimeout from the pool shrink rate. The approach taken in this PR replaces the theoretically ideal "global sustained idle capacity" indicator with an approximation based on tracking a per-thread idleBaseline, where the exact behavior depends on the order in which jobs are handed out to threads calling take() and poll():

  1. Best-case scenario, jobs are handed out in a LIFO manner, so that idleBaseline would be a perfect indication of sustained idle capacity (and eligibility for thread death).
  2. Worst-case scenario, jobs are handed out in a FIFO manner, and per-thread idleBaseline artificially underestimates the sustained idle capacity of the pool overall.

The worst-case limit of the worst-case scenario would be: many threads, serving very quick jobs, spending the vast majority of their time idle, but each receiving a job just often enough to keep them all alive. Perhaps it is indeed worth explicitly guarding against this somehow ...
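
To put made-up numbers on that worst case (purely illustrative; not measured behavior):

long idleTimeoutMs = 60_000;                 // a thread must be idle this long to be eligible to die
long jobDurationMs = 1;                      // each job is trivially short
long jobGapPerThreadMs = idleTimeoutMs - 1;  // but FIFO hand-out delivers one to each thread just in time
// No per-thread idleBaseline ever exceeds idleTimeout, so no thread ever dies,
// even though each thread is busy only ~0.002% of the time.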

@gregw
Contributor

gregw commented Feb 22, 2023

I'm just stepping back again on this one. What is the issue with just setting a very short idleTimeout? Threads do nothing on idle but check whether they need to shrink, so why have a shrinkTime that is different from the idleTime?

Also, I'm not sure the 10-thread example works the way you think. Even with idleTimeout=60 and idleTimeoutDecay=5, 60s after a spike 10 threads idle out. One wins the CAS and advances the lastShrink timeout. Another of the 9 could shrink 5s later, but it now waits 60s for another job; so at that time 9 threads wake up, 1 shrinks, 8 could shrink in 5s, but they all sleep for 60s.
So firstly, a unit test is needed to verify that shrinkage happens the way we expect.

Secondly, why not just have an idleTimeout of 5s?

Also, perhaps we need some randomness in the time that threads sleep, so they don't all wake up at the same time? Perhaps an idle timeout of 10s means 0-10s? This way peaks will be spread out, and it might make sense to have a shrink time different from the idle time?
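
A minimal sketch of what that randomization might look like (an assumption based on this comment, not Jetty code; the queue parameter and method name are placeholders):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Each idle poll waits a random slice of idleTimeout, so threads that went idle
// together don't all wake up at the same instant.
static Runnable randomizedIdlePoll(BlockingQueue<Runnable> queue, long idleTimeoutMs) throws InterruptedException
{
    long pollMs = ThreadLocalRandom.current().nextLong(1, idleTimeoutMs + 1); // anywhere in 1ms..idleTimeout
    return queue.poll(pollMs, TimeUnit.MILLISECONDS);
}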

So more work is definitely needed, if only to convince ourselves this is the right approach.

@sbordet
Contributor

sbordet commented Feb 22, 2023

> I think that the current PR should exhibit the desired behavior.

@magibney no, the current PR does not work as you think it does 😃

I think I'm leaning with @gregw on this: a short idle timeout is all that is needed to recover memory quickly, without the need to introduce another property that we struggle to define.

@magibney
Author

> no, the current PR does not work as you think it does

possible ... though I do have tests that verify the behavior as far as I've described it (perhaps could stand to be made more robust?). In any event, agreed to put this on hold while I gather the tests together, which should add clarity/confidence. I'll be doing this either way for a pluggable implementation, and hope to pick this back up before long.

The main downside of a short idleTimeout is thrashing -- threads dying and having to be recreated immediately -- which I suppose could be as big or small a deal as the reason for pooling threads in the first place! (It becomes a bigger deal the shorter the idleTimeout interval, with the current approach.) The goal of separating idleTimeout from the shrink rate was to "have our cake and eat it too": no thrashing, and quick decay once past the idleTimeout window.

I'll workshop this a bit though. Thanks again to you all for your consideration and guidance!

@gregw
Contributor

gregw commented Feb 23, 2023

@magibney I don't think thrashing is a problem. Either the server is busy, in which case threads wake up with a job; or it's not, and then a timeout every few seconds is no big deal while the pool shrinks to a reasonable size.

@magibney
Author

magibney commented Mar 8, 2023

I've addressed the race condition inherent in tracking "idle baseline" per-thread, now precisely tracking idle capacity globally for the pool. The new tests exercise and demonstrate the expected behavior (I think the http3 test failure on the CI build is unrelated?).

Thrashing is not a problem when you use idleTimeout according to the semantics "I want capacity idle for more than idleTimeout millis to be eligible to shrink", because in a "shrink-1-thread-per-idleTimeout-interval" situation, any thread that checks for eligibility to expire is guaranteed to have been idle for idleTimeout millis, so the pool overall is guaranteed to have had at least 1 thread idle for that amount of time. So far, so good (and fwiw I think this is the more intuitive use of idleTimeout).

But with the proposed workaround, we'd be using idleTimeout according to the semantics "I want idle capacity to decay at this rate". If the desired interval of what qualifies as "idle" is longer than the desired decay rate, then thrashing is definitely a possibility.

It's obviously a bigger problem at the limit, e.g.: I want to accommodate spikes from 2k to 10k threads, but beyond a 2-minute window, I want the pool to shrink as quickly as possible (to make memory available to the OS page cache). We're forced to choose between "2-minute window" and "shrink as quickly as possible". If I prioritize the latter, e.g. by setting idleTimeout=500millis, then:

  1. for the "sustained burst" case, within our 2-minute desired "peak hold" window, we expire and are forced to recreate as many as 240 threads (thrashing -- not to mention that for extremely short idleTimeout, the job poll loop becomes quite busy since idleJobPoll() length is determined by idleTimeout)
  2. for the "isolated burst" case it still takes more than an hour to fall back to our 2k baseline, and at 30 minutes past the burst we still have 4k more threads than we want.

The example idleTimeout=500millis clearly prioritizes "shrink as quickly as possible"; but even at that, it's not particularly extreme (consider idleTimeout=100millis?), and still doesn't really get us what we want. Increasing/decreasing is a zero-sum game -- we're forced to compromise somewhere.
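
The arithmetic behind those numbers, assuming (as on main) at most one thread exits per idleTimeout interval; all values come from the example above:

long idleTimeoutMs = 500;
int spikeThreads = 10_000, baselineThreads = 2_000;

// Sustained burst: threads expired (and recreated) inside the 2-minute hold window.
long thrashed = 2 * 60_000 / idleTimeoutMs;                                     // 240

// Isolated burst: time to shed the 8,000 surplus threads, one per idleTimeout.
long drainMinutes = (spikeThreads - baselineThreads) * idleTimeoutMs / 60_000;  // ~66 min

// Surplus threads still alive 30 minutes after the burst.
long surplusAt30Min = (spikeThreads - baselineThreads) - 30 * 60_000 / idleTimeoutMs;  // 4,400 ≈ "4k"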

@sbordet
Contributor

sbordet commented Mar 13, 2023

@magibney this PR has now become quite complicated -- I have reservations.

Can you please detail what would be the problem with a short idle timeout that this PR would solve?
If the main concern is memory utilization, a short idle timeout will do the job efficiently and very simply.

From your last comment, the case of "I spiked to 10k threads, I would like to keep them around for a couple of minutes, but beyond that shrink as fast as possible" seems really specific, and I gather this PR still won't solve this particular case?

@magibney
Author

Thanks for taking a look.

> I gather this PR still won't solve this particular case?

This PR definitely solves that case -- if it didn't then there'd be no purpose to this PR at all!

The fundamental problem is that as long as threads are alive and not being used, they consume memory. For large numbers of threads, even a very short timeout adds up to idle threads hanging around far longer than they're useful (to decay all of 8k threads in 2 minutes would currently require an absurdly short idleTimeout=15millis, i.e. 120 seconds divided by 8,000 exits). The shorter you set the timeout, the more you let yourself in for thrashing.

Most of the complexity of this PR now is in the test cases, and to really adequately exercise the common and edge cases I don't see a way around the complexity. The main class changes are actually pretty straightforward. It may also be worth noting that it would be possible to make ShrinkManager a pluggable interface, and provide a default implementation that's literally identical to current behavior.

@sbordet
Contributor

sbordet commented Mar 13, 2023

> This PR definitely solves that case

Maybe.

I have to say that the code is difficult for me to follow.
There is also the problem of non-fenced array stores in timestamps and false sharing between head and tail.

I find the calls to ShrinkManager to be in non-intuitive places, and I still cannot wrap my head around the algorithm.
The algorithm is still unclear to me, because there is no relationship between the thread that stores a value in timestamps and the thread that reads from timestamps.

IIUC a random thread writes at a random index its "idle" nanoTime, and another thread reads from a different random index the "idle" nanoTime of another thread, to decide whether to exit or not; it's not intuitive (at least for me).

With the right interleaving, only timestamps[0] (or some other constant index) may ever be written (after the first writes) -- again not intuitive to me.

Sorry!

I think I have a simpler alternative.
I'll detail my proposal tomorrow and I would greatly appreciate your feedback.

@magibney
Author

magibney commented Mar 14, 2023

Sounds good; I'm all for a simpler alternative if that's achievable (and again, thoroughly appreciate the caution and energy with which you're approaching this issue)!

wrt this PR "solving that case": I put a considerable amount of effort into the test cases, and am quite confident in the behavior.

> there is no relationship between the thread that stores a value in timestamps and the thread that reads from timestamps

That's by design. Idle capacity at the pool level is tracked instead of idle capacity of individual threads, since the latter is dependent on random thread scheduling, and the former is more meaningful anyway.

> With the right interleaving, only timestamps[0] (or some other constant index) may ever be written (after the first writes) -- again not intuitive to me.

This also is by design, though I'd emphasize after the first writes. It would be timestamps[n - 1] (at the head of the deque, after n initial writes) that would be the only index written, as threads becoming active pop from the head of the deque and push a new timestamp back to the head when their job is complete. The benefit of this approach -- handling active/idle as a stack at the head while polling idle capacity from the tail -- is that it allows idle capacity to age out in a way that's practically independent of thread scheduling. But note that this situation would not persist indefinitely. This pattern would allow the tail entries to age out, at which point the pool would start shrinking up to the point where, if/when new threads were required, they'd increment head and then you'd have a new "constant" index being updated -- in a nutshell: the desired behavior.

> the problem of non-fenced array stores in timestamps and false sharing between head and tail

It's good to call this out; I did this knowingly and I think it should be ok. For practical purposes, we can lean on the fact that the stored values are effectively relevant only within the tolerance of idleTimeout. Writes only happen at the head, and we don't care about reading those values at all; the exact values that might overwrite each other are a benign race condition, because they'll all be well within the idleTimeout tolerance, so we actually don't care about the exact value that gets written. As far as reading values, we only care about the tail, and we only start to pay attention to those tail values once their idleTimeout has passed. The race condition at the tail does happen with a failed CAS, but it's benign here too, because we don't care who is checking a value, so long as someone is. The top priority at the tail is to ensure that the array entries remain in order (within the acceptable tolerance of race conditions); note that we don't actually ever write to the array at the tail, we simply update the tail pointer.
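
A rough standalone model of the structure as described above (hypothetical names, heavily simplified, omitting wrap-around and memory-fencing details; not the PR's actual code):

import java.util.concurrent.atomic.AtomicInteger;

// Pool-level idle tracking: idle timestamps are pushed/popped at the head like a
// stack, while a timed-out thread tries to claim the oldest entry at the tail.
class IdleCapacitySketch
{
    private final long[] timestamps;                          // idle-entry nanoTimes
    private final AtomicInteger head = new AtomicInteger();   // stack top: push on idle, pop on busy
    private final AtomicInteger tail = new AtomicInteger();   // oldest still-counted idle entry

    IdleCapacitySketch(int maxThreads)
    {
        timestamps = new long[maxThreads];
    }

    // A thread finishing a job pushes its idle timestamp at the head.
    // The store is racy but benign: any value written here is within idleTimeout tolerance.
    void becameIdle(long nowNanos)
    {
        timestamps[Math.floorMod(head.getAndIncrement(), timestamps.length)] = nowNanos;
    }

    // A thread picking up a job pops from the head (LIFO), so older tail entries keep aging.
    void becameBusy()
    {
        head.decrementAndGet();
    }

    // A timed-out thread asks whether pool-level idle capacity has been sustained for
    // idleTimeout; if so, it claims the tail entry by advancing the pointer (never writing).
    boolean tryClaimExpired(long nowNanos, long idleTimeoutNanos)
    {
        int t = tail.get();
        if (t >= head.get())
            return false;                                     // no net idle capacity
        if (nowNanos - timestamps[Math.floorMod(t, timestamps.length)] < idleTimeoutNanos)
            return false;                                     // oldest idle entry not old enough yet
        return tail.compareAndSet(t, t + 1);                  // only the CAS winner exits
    }
}

The property being claimed above is that a busy/idle transition only ever touches the head, so entries near the tail keep aging regardless of which particular threads happen to receive jobs.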

Again, the tests bear all this out, and are designed to stress the system pretty hard, asserting very tight tolerances wrt expected shrink behavior.

magibney added a commit to magibney/jetty.project that referenced this pull request Mar 16, 2023
includes some changes to make it easy to run against current
`main` branch (which doesn't support configurable shrink rate)
@magibney
Author

0bf3c16 illustrates an approach that extracts all the shrink logic into a single interface. This helps with clarity (at least as a point of reference for discussion), and also allows the default shrink rate of 1 to be functionally identical to the existing implementation on main -- i.e., the new DefaultShrinkManager class is basically just a straight refactoring of the existing shrink logic, managed via a single AtomicLong lastShrink field.
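
For reference, a rough sketch of the shape that refactoring implies (hypothetical method names; the actual interface in 0bf3c16 may differ):

import java.util.concurrent.atomic.AtomicLong;

interface ShrinkManager
{
    void init(long nowNanos);                                // reset the shrink baseline
    boolean evaluate(long nowNanos, long idleTimeoutNanos);  // may the calling thread exit now?
}

// Default shrink rate of 1: at most one thread exits per idleTimeout interval,
// coordinated through a single pool-wide lastShrink timestamp, as on main.
class DefaultShrinkManager implements ShrinkManager
{
    private final AtomicLong lastShrink = new AtomicLong();

    @Override
    public void init(long nowNanos)
    {
        lastShrink.set(nowNanos);
    }

    @Override
    public boolean evaluate(long nowNanos, long idleTimeoutNanos)
    {
        long last = lastShrink.get();
        return nowNanos - last > idleTimeoutNanos && lastShrink.compareAndSet(last, nowNanos);
    }
}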

@sbordet
Contributor

sbordet commented Apr 27, 2023

@magibney closing this as I think we have addressed the issue in #9498.

@sbordet sbordet closed this Apr 27, 2023