rt(threaded): cap LIFO slot polls
As an optimization to improve locality, the multi-threaded scheduler
maintains a single slot (LIFO slot). When a task is scheduled, it goes
into the LIFO slot. The scheduler will run tasks in the LIFO slot first,
before checking the local queue.
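The LIFO-first dequeue order can be sketched with a simplified standalone model (hypothetical `Core` and `Task` stand-ins, not the scheduler's actual types):

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for a scheduled task; Tokio's real type is
// task::Notified<Arc<Handle>>.
type Task = &'static str;

// Simplified model of a worker's local state.
struct Core {
    lifo_slot: Option<Task>,
    run_queue: VecDeque<Task>,
}

impl Core {
    // The LIFO slot is checked *before* the local run queue, so the
    // most recently scheduled task runs next.
    fn next_task(&mut self) -> Option<Task> {
        self.lifo_slot.take().or_else(|| self.run_queue.pop_front())
    }

    // Scheduling displaces any task already in the slot into the queue.
    fn schedule(&mut self, task: Task) {
        if let Some(prev) = self.lifo_slot.replace(task) {
            self.run_queue.push_back(prev);
        }
    }
}

fn main() {
    let mut core = Core { lifo_slot: None, run_queue: VecDeque::new() };
    core.schedule("a");
    core.schedule("b"); // "b" takes the slot; "a" moves to the queue
    assert_eq!(core.next_task(), Some("b")); // last scheduled runs first
    assert_eq!(core.next_task(), Some("a"));
    assert_eq!(core.next_task(), None);
    println!("lifo-first ok");
}
```

Running the most recently woken task first keeps its data hot in cache, which is where the locality benefit comes from.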

In ping-ping style workloads where task A notifies task B, which
notifies task A again, this can cause starvation as these two tasks will
repeatedly schedule the other in the LIFO slot. PR #5686, a first
attempt at solving this problem, consumes a unit of budget each time a
task is scheduled from the LIFO slot. However, at the time of this
commit, the scheduler allocates 128 units of budget for each chunk of
work. This is quite high in situations where tasks do not perform many
async operations yet have meaningful poll times: even 5-10 microsecond
polls can have an outsized impact on the scheduler, since a pair of
such tasks can hold a worker for roughly 0.6-1.3 ms (128 polls) before
exhausting the budget.

In an ideal world, the scheduler would adapt to the workload it is
executing. However, as a stopgap, this commit limits the number of times
the LIFO slot is prioritized per scheduler tick.
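The effect of the cap can be illustrated with a small standalone simulation (a sketch only, not the scheduler itself; the constant mirrors the `MAX_LIFO_POLLS_PER_TICK = 3` introduced by this commit, and the task names are hypothetical):

```rust
use std::collections::VecDeque;

const MAX_LIFO_POLLS_PER_TICK: usize = 3;

// Simplified model: "ping" and "pong" wake each other into the LIFO
// slot on every poll, while "other" waits in the local run queue.
fn run_tick() -> Vec<&'static str> {
    let mut run_queue: VecDeque<&'static str> = VecDeque::from(["other"]);
    let mut lifo_slot: Option<&'static str> = None;
    let mut lifo_polls = 0;
    let mut order = Vec::new();
    let mut task = "ping";

    loop {
        order.push(task);

        // Ping/pong each reschedule their partner; the wake lands in
        // the LIFO slot (the starvation pattern).
        match task {
            "ping" => lifo_slot = Some("pong"),
            "pong" => lifo_slot = Some("ping"),
            _ => {}
        }

        task = match lifo_slot.take() {
            Some(next) if lifo_polls >= MAX_LIFO_POLLS_PER_TICK => {
                // Cap reached: demote the woken task to the back of the
                // run queue so already-queued tasks get a turn.
                run_queue.push_back(next);
                match run_queue.pop_front() {
                    Some(t) => t,
                    None => break,
                }
            }
            Some(next) => {
                lifo_polls += 1;
                next
            }
            None => match run_queue.pop_front() {
                Some(t) => t,
                None => break,
            },
        };

        if order.len() > 10 {
            break; // safety bound for the sketch
        }
    }
    order
}

fn main() {
    let order = run_tick();
    // With the cap at 3, "other" runs after four ping/pong polls
    // instead of being starved indefinitely.
    assert_eq!(&order[..5], ["ping", "pong", "ping", "pong", "other"]);
    println!("{:?}", order);
}
```

Without the cap (or with budget alone set very high), the `lifo_slot.take()` branch would win on every iteration and "other" would never reach the front.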
carllerche committed May 23, 2023
1 parent 3a94eb0 commit 8b60f58
Showing 1 changed file with 27 additions and 2 deletions.
29 changes: 27 additions & 2 deletions tokio/src/runtime/scheduler/multi_thread/worker.rs
@@ -90,8 +90,8 @@ struct Core {
/// When a task is scheduled from a worker, it is stored in this slot. The
/// worker will check this slot for a task **before** checking the run
/// queue. This effectively results in the **last** scheduled task to be run
/// next (LIFO). This is an optimization for message passing patterns and
/// helps to reduce latency.
/// next (LIFO). This is an optimization for improving locality which
/// benefits message passing patterns and helps to reduce latency.
lifo_slot: Option<Notified>,

/// The worker-local run queue.
@@ -191,6 +191,8 @@ type Notified = task::Notified<Arc<Handle>>;
// Tracks thread-local state
scoped_thread_local!(static CURRENT: Context);

const MAX_LIFO_POLLS_PER_TICK: usize = 3;

pub(super) fn create(
size: usize,
park: Parker,
@@ -470,6 +472,7 @@ impl Context {
// Run the task
coop::budget(|| {
task.run();
let mut lifo_polls = 0;

// As long as there is budget remaining and a task exists in the
// `lifo_slot`, then keep running.
@@ -494,6 +497,28 @@
None => return Ok(core),
};

// In ping-ping style workloads where task A notifies task B,
// which notifies task A again, continuously prioritizing the
// LIFO slot can cause starvation as these two tasks will
// repeatedly schedule the other. Budget is consumed below,
// however, at the time of this comment, the scheduler allocates
// 128 units of budget for each chunk of work. This is quite
// high in situations where tasks do not perform many async
// operations, yet have meaningful poll times (even 5-10
// microsecond poll times can have outsized impact on the
// scheduler). To mitigate this, we limit the number of times
// the LIFO slot is prioritized.
if lifo_polls >= MAX_LIFO_POLLS_PER_TICK {
core.run_queue.push_back_or_overflow(
task,
self.worker.inject(),
&mut core.metrics,
);
return Ok(core);
}

lifo_polls += 1;

// Polling a task doesn't necessarily consume any budget, if it
// doesn't use any Tokio leaf futures. To prevent such tasks
// from using the lifo slot in an infinite loop, we consume an
