
[Core] Eliminate parallel worker per-step task scheduling overhead #4894

Merged: 7 commits into vllm-project:main from streamline-tp-new, May 22, 2024

Conversation

njhill (Collaborator) commented May 18, 2024

This PR replaces #3763.

Common logic is handled in the DistributedGPUExecutor superclass and used by both the Ray and multiprocessing executors.
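
For context, here is a minimal sketch of the pattern this PR introduces. The names DistributedGPUExecutor, parallel_worker_tasks and _wait_for_tasks_completion appear in the discussion below, but the method bodies and helpers such as _run_workers and _driver_execute_model are illustrative assumptions, not the exact vLLM implementation.

    # Illustrative sketch only: bodies and the _run_workers / _driver_execute_model
    # helpers are assumptions, not the actual vLLM code.
    from abc import ABC, abstractmethod
    from typing import Any, Optional

    class DistributedGPUExecutorSketch(ABC):
        """Driver-side logic shared by the Ray and multiprocessing executors."""

        def __init__(self) -> None:
            # Handle(s) for the long-running worker loops; None while idle.
            self.parallel_worker_tasks: Optional[Any] = None

        def execute_model(self, execute_model_req: Any) -> Any:
            if self.parallel_worker_tasks is None:
                # First step after being idle: start the non-driver workers'
                # execution loops once instead of issuing an RPC every step.
                self.parallel_worker_tasks = self._run_workers(
                    "start_worker_execution_loop",
                    async_run_remote_workers_only=True)
            # Only the driver does per-step work; workers pick up their inputs
            # via torch.distributed broadcast inside their loops.
            return self._driver_execute_model(execute_model_req)

        def stop_remote_worker_execution_loop(self) -> None:
            if self.parallel_worker_tasks is None:
                return
            # Broadcasting a sentinel (a None request) tells the workers to exit.
            self._driver_execute_model(None)
            parallel_worker_tasks = self.parallel_worker_tasks
            self.parallel_worker_tasks = None
            self._wait_for_tasks_completion(parallel_worker_tasks)

        @abstractmethod
        def _run_workers(self, method: str, **kwargs: Any) -> Any:
            ...

        @abstractmethod
        def _driver_execute_model(self, execute_model_req: Optional[Any]) -> Any:
            ...

        @abstractmethod
        def _wait_for_tasks_completion(self, parallel_worker_tasks: Any) -> None:
            ...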

njhill (Collaborator, author) commented May 18, 2024

This is failing the distributed spec decoding test. TP spec decoding was added after the original PR, so I need to look more closely at its flow and at how best to integrate these changes with it.

rkooo567 (Collaborator) commented:

We found that NCCL broadcasting overhead is now very large for high TP, and this PR is very important for reducing that gap. I will take a look at the PR tomorrow!

njhill (Collaborator, author) commented May 20, 2024

Thanks @rkooo567! This PR doesn't actually reduce any NCCL broadcast overhead; it just eliminates most of the non-torch.distributed RPCs, which turn out to be much more significant (see the measurements in the original PR).

We can and should additionally reduce the amount of NCCL broadcasting done; #4844 is a first step, and I'm working on the next steps now.

I'll try to address the above-mentioned spec decoding TP issue with this PR today.
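
To make the RPC-elimination point concrete, here is a hedged sketch of what the worker side of such a loop can look like. Only torch.distributed calls are used; the one-element control tensor and the execute_step callback are assumptions for illustration, not the actual vLLM protocol.

    # Illustrative worker-side loop: the worker blocks on a broadcast each step
    # instead of waiting for a per-step RPC from the driver. The control tensor
    # and its 0 == "stop" convention are assumptions for this sketch.
    import torch
    import torch.distributed as dist

    def start_worker_execution_loop(execute_step, device: torch.device) -> None:
        while True:
            control = torch.zeros(1, dtype=torch.int64, device=device)
            # The driver (rank 0) broadcasts a control value at the start of
            # every step; non-driver workers just receive it here.
            dist.broadcast(control, src=0)
            if control.item() == 0:
                # Driver signalled that the engine is idle: leave the loop so
                # the task completes and the driver can wait on it.
                break
            # Model inputs for the step arrive via further broadcasts inside
            # execute_step(); no control-plane RPC is involved.
            execute_step()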

rkooo567 (Collaborator) left a review comment:

Hmm, I wonder if there's a cleaner way to stop the execution loop... (it feels like too many implementation details are leaked). Unfortunately, I couldn't come up with a better idea, so for now it's probably best to document things more verbosely.

On this snippet from the diff:

            return await self._driver_execute_model_async(execute_model_req)

        async def stop_remote_worker_execution_loop_async(self) -> None:
            if self.parallel_worker_tasks is None:
Review comment (Collaborator):

Maybe assert instead? If this is None, doesn't that mean the state is kind of screwed up?
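
For concreteness, the suggested variant might look roughly like this (a sketch only; the class wrapper and the elided tear-down are assumptions based on the snippet above):

    # Sketch of the reviewer's suggestion: fail loudly if a stop is requested
    # while no worker execution loop is running, rather than returning silently.
    from typing import Any, Optional

    class DistributedGPUExecutorAsyncSketch:
        parallel_worker_tasks: Optional[Any] = None

        async def stop_remote_worker_execution_loop_async(self) -> None:
            assert self.parallel_worker_tasks is not None, (
                "stop requested but no worker execution loop is running")
            ...  # sentinel broadcast and task tear-down elided

The trade-off is that the early return treats a redundant stop as a harmless no-op, while the assert turns it into a state-consistency check.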

rkooo567 (Collaborator) commented May 21, 2024

Btw, I also talked with @cadedaniel. I think this optimization makes spec decoding pretty complicated (and I suspect PP as well). @njhill, I am curious whether there's a clean way to disable this in the short term when spec decoding is enabled?

andoorve (Contributor) commented:

> Btw, I also talked with @cadedaniel. I think this optimization makes spec decoding pretty complicated (and I suspect PP as well). @njhill, I am curious whether there's a clean way to disable this in the short term when spec decoding is enabled?

+1 for PP @rkooo567

njhill (Collaborator, author) commented May 21, 2024

Thanks for the great review comments @rkooo567!

> Btw, I also talked with @cadedaniel. I think this optimization makes spec decoding pretty complicated (and I suspect PP as well). @njhill, I am curious whether there's a clean way to disable this in the short term when spec decoding is enabled?

@rkooo567 @andoorve this would definitely be an option, but it would also come at the cost of at least some additional complexity. I haven't dug in enough yet, but I'm hopeful about adapting it to work with spec decoding too. That could be misplaced optimism, and I'm less sure about PP. I'll look at those first, and if the complexity is too great, changing it to work only in the non-spec-decoding / non-PP case could be the backup plan.

I was actually hoping this could be a stepping-stone to eliminating the "control plane" RPCs altogether. The workers could remain in a permanent torch.distributed broadcast loop, with keepalives when idle if timeouts are an issue.
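
A rough sketch of the keepalive idea (nothing here is implemented in this PR; the sentinel values and the 10-second interval are assumptions):

    # Illustrative only: while idle, the driver broadcasts a periodic no-op so
    # that workers sitting in a permanent broadcast loop never hit a collective
    # timeout. Sentinels and the interval are assumptions for this sketch.
    import time
    import torch
    import torch.distributed as dist

    KEEPALIVE, STEP, STOP = 0, 1, 2

    def driver_idle_keepalive(device: torch.device, has_work, shutting_down,
                              interval_s: float = 10.0) -> None:
        while not has_work() and not shutting_down():
            dist.broadcast(torch.tensor([KEEPALIVE], device=device), src=0)
            # Workers that receive KEEPALIVE simply loop again without running
            # a step, so the control plane stays entirely inside torch.distributed.
            time.sleep(interval_s)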

andoorve (Contributor) commented:

> I was actually hoping this could be a stepping-stone to eliminating the "control plane" RPCs altogether. The workers could remain in a permanent torch.distributed broadcast loop, with keepalives when idle if timeouts are an issue.

@njhill This is a great point. I also committed a change to PyTorch in anticipation of these kinds of optimizations (replacing control-plane RPC with torch.distributed) for PP in the future: pytorch/pytorch@b96b1e8

It's a real pain to get multiple sends/recvs working on a single rank though, so this is definitely a future item for after we get the basic functionality of PP working.

rkooo567 (Collaborator) commented:

@njhill btw, this is the major bottleneck for us right now, so let me know if there's any way I can help accelerate getting the PR merged!

njhill (Collaborator, author) commented May 21, 2024

@rkooo567 about to update now; will hopefully push within the next couple of hours!

rkooo567 (Collaborator) commented:

@njhill let me know when it is ready to review again!

njhill (Collaborator, author) commented May 22, 2024

Thanks @rkooo567, I made some updates but had to deal with other things for the remainder of the day. If I don't get a chance to finish debugging tonight, I'll do it first thing in the morning (Pacific time).

cadedaniel (Collaborator) commented:

Approving spec decode changes

rkooo567 (Collaborator) left a review comment:

LGTM if tests pass!

On this snippet from the diff:

            raise NotImplementedError

        @abstractmethod
        def _wait_for_tasks_completion(self, parallel_worker_tasks: Any) -> None:
Review comment (Collaborator):

QQ: this can hang forever if the loop doesn't finish properly. Should we allow a timeout here and kill the workers with an exception if it is reached? (Theoretically, I think something is wrong if it takes more than 30s.)

njhill (Collaborator, author) replied:

@rkooo567 I will check, but I'm not sure this is necessary. At least in the async case there's already an overall per-step timeout that I think would cover this.
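
For reference, a hedged sketch of what the suggested timeout could look like in a futures-based executor (the 30-second budget comes from the comment above; the concurrent.futures plumbing is an assumption, since the real executors hold Ray or multiprocessing handles):

    # Illustrative timeout wrapper around waiting for the worker loops to exit.
    # The futures-based signature is an assumption for this sketch.
    from concurrent.futures import Future, wait
    from typing import Sequence

    def wait_for_tasks_completion(parallel_worker_tasks: Sequence[Future],
                                  timeout_s: float = 30.0) -> None:
        done, not_done = wait(parallel_worker_tasks, timeout=timeout_s)
        if not_done:
            raise TimeoutError(
                f"{len(not_done)} worker execution loop(s) did not exit within "
                f"{timeout_s}s; the workers may need to be killed and restarted.")
        for task in done:
            # Surface any exception raised inside a worker loop.
            task.result()

In the async path, a similar effect could be had by awaiting the tasks under asyncio.wait_for, though as noted above an overall per-step timeout may already cover this case.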

andoorve (Contributor) commented:

Hey @njhill, did you get a chance to look at this? I feel like this would cause quite a few changes to PP in its current form.

> @rkooo567 @andoorve this would definitely be an option, but it would also come at the cost of at least some additional complexity. I haven't dug in enough yet, but I'm hopeful about adapting it to work with spec decoding too. That could be misplaced optimism, and I'm less sure about PP. I'll look at those first, and if the complexity is too great, changing it to work only in the non-spec-decoding / non-PP case could be the backup plan.

njhill (Collaborator, author) commented May 22, 2024

@rkooo567 OK, the spec decoding issue turned out to be a tiny fix. And I have added some more detail to the code comments per your latest review.

Thanks @cadedaniel for reviewing too, I was going to ping you to ask for review of that part once the tests were passing :)

@andoorve I can help look at how we can adapt it to work with PP, but it sounds like there is some urgency to get this PR merged first (for the upcoming release?).

andoorve (Contributor) commented:

Hi @njhill,

Got it. I don't want to delay this PR, but I just wanted to see if there is any way to loosen the assumption that only the rank 0 worker sends over RPC and receives results (for example, some mechanism or fallback to per-step task scheduling).

The main idea of having the other (non-driver) workers in a task loop is thankfully compatible with PP; it's just the assumption that there's only one parallel group that could cause quite a bit of friction. Happy to chat offline as well.

rkooo567 merged commit eb6d3c2 into vllm-project:main on May 22, 2024. 63 checks passed.
rkooo567 (Collaborator) commented:

@njhill thank you so much for the quick fix!

njhill deleted the streamline-tp-new branch on May 22, 2024 at 21:19.
njhill (Collaborator, author) commented May 22, 2024

@andoorve I will look more closely at the PP PR tomorrow (and we could chat too if you're free then). Based on what you said and a high-level understanding/assumption of how that works, I'm fairly confident we can solve it without too much effort.

tybalex pushed a commit to tybalex/vllm-function-call that referenced this pull request May 25, 2024