[V1] Refactor parallel sampling support #13774
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Nice work cleaning this up!
We should verify that this overhead is negligible with a quick benchmark.
Thanks @markmc, looks great and I agree doing the aggregation in the output processor is much nicer! Just some minor suggestions to simplify further.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 1aed2ed to 75e9d49 (Compare)
Force-pushed from 0153a5b to 60fa08d (Compare)
Initial benchmarking results with
Command to measure
Command to measure
Async serving benchmarking results with
Command to bring up vLLM v1 engine server:
Benchmark command:
Based on the benchmarking results above, there does not appear to be a perf regression.
Thanks for the PR @markmc ! I left some feedback.
request = self.processor.process_inputs(request_id, prompt, params,
                                         arrival_time, lora_request,
                                         trace_headers,
                                         prompt_adapter_request,
                                         priority)
Something I wondered - can we reuse the output of process_inputs
over multiple child requests? Since I believe process_inputs
operates solely on the prompt (or MM inputs or whatever.)
This could be tricky for multimodal so I don't know if it is in-scope for this PR.
Yeah, I did wonder that too, but didn't dig into it. Seems quite orthogonal to this PR though.
Yes agree. Actually we can move this outside of the loop I think, and just update the request_id and sampling_params of the request inside the loop.
Doesn't need to hold up this PR but would be quite a simple change I think.
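For illustration, here is a minimal sketch of that suggestion, using hypothetical stand-in types (ProcessedRequest, SamplingParams) rather than the actual vLLM classes: process the inputs once, then rewrite only the per-child fields inside the loop.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-ins only; the real vLLM types differ.
@dataclass
class SamplingParams:
    n: int = 1
    seed: Optional[int] = None

@dataclass
class ProcessedRequest:
    request_id: str
    prompt_token_ids: list
    sampling_params: SamplingParams = field(default_factory=SamplingParams)

def fan_out(processed: ProcessedRequest, params: SamplingParams) -> list:
    """Reuse a single process_inputs() result for all n children,
    rewriting only request_id and sampling_params per child."""
    children = []
    for idx in range(params.n):
        child = copy.copy(processed)      # shallow copy: prompt data is shared
        child.request_id = f"{processed.request_id}_{idx}"
        child.sampling_params = copy.copy(params)
        child.sampling_params.n = 1       # each child samples one sequence
        children.append(child)
    return children
```

As noted above, multimodal inputs would need more care here, since shared MM data may not be safe to alias across children.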
Force-pushed from 60fa08d to a65f0c9 (Compare)
Thanks @markmc looks great, just a couple more small things.
gens.append(gen)

# If not streaming, aggregate until all child requests complete
if (final_only and len(request_output.outputs) != self.n):
    self.output_aggregator = request_output
It would make more sense to me for this assignment to be in the other if above.
But you don't want to set it if final_only == False
...
Ah right, ok ignore my comment then! I'm not sure why I find it slightly confusing to overwrite this for the intermediate outputs.
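A rough sketch of the aggregation behaviour under discussion, using a simplified stand-in for RequestOutput (not the actual vLLM class): when the caller only wants the final output, intermediate results are stashed (and overwritten each step), and nothing is emitted until all n child outputs have arrived.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RequestOutput:              # simplified stand-in for vllm.RequestOutput
    request_id: str
    outputs: List[object] = field(default_factory=list)
    finished: bool = False

class ParallelSamplingAggregator:
    """Collects child completions for one parent request with n children."""

    def __init__(self, n: int):
        self.n = n
        self.output_aggregator: Optional[RequestOutput] = None

    def step(self, request_output: RequestOutput,
             final_only: bool) -> Optional[RequestOutput]:
        # If not streaming, aggregate until all child requests complete.
        if final_only and len(request_output.outputs) != self.n:
            # Overwrite the stashed partial result each step; only the
            # final, complete aggregate is ever returned to the caller.
            self.output_aggregator = request_output
            return None
        # Streaming mode, or all n children have finished: emit it.
        self.output_aggregator = None
        return request_output
```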
LGTM
The initial implementation went to great efforts to add parallel sampling as a wrapper at the highest layer of abstraction possible. This resulted in a lot of tricky code to post-process RequestOutputs to aggregate them where necessary.

Instead, it probably makes sense to implement parallel sampling at the layer that actually creates RequestOutput objects - i.e. in OutputProcessor. To do this, we simply need to allow for fanning out child requests in LLMEngine.add_request(), passing details of the fan-out to OutputProcessor.

This adds some overhead to the n=1 case in return for significantly less overhead and complication in the parallel sampling case.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Force-pushed from 044ad2f to 7736b2d (Compare)
Add default values for OutputProcessor.add_request() parent_req and index parameters to fix tests. These are the values used in the n==1 case. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
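In sketch form (a hypothetical, simplified signature, not the PR's actual code), such defaults let the plain n==1 path keep calling add_request() unchanged:

```python
from typing import Any, Optional

class OutputProcessor:            # simplified sketch, not the real class
    def add_request(
        self,
        request: Any,                          # engine request (details elided)
        parent_req: Optional[Any] = None,      # None in the ordinary n==1 case
        index: int = 0,                        # child slot within a parallel-sampling fan-out
    ) -> None:
        # With the defaults, an n==1 request is registered exactly as before;
        # parallel sampling instead passes the parent plus the child's index.
        ...
```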
Lint and deploy minio setup failing with:
As per the original V1 parallel sampling code. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
The initial implementation in #10980 went to great efforts to add parallel sampling as a wrapper at the highest layer of abstraction possible. This resulted in a lot of tricky code to post-process RequestOutputs to aggregate them where necessary.

Instead, it probably makes sense to implement parallel sampling at the layer that actually creates RequestOutput objects - i.e. in OutputProcessor.

To do this, we simply need to allow for fanning out child requests in LLMEngine.add_request(), passing details of the fan-out to OutputProcessor.

This adds some overhead to the n=1 case (see SingularSamplingRequest) in return for significantly less overhead and complication in the parallel sampling case.
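As a hedged sketch of that flow (simplified stand-in names such as ParentRequest; not the PR's actual classes), add_request() fans a parallel-sampling request out into n single-sample children and tells the output processor which parent and slot each child belongs to:

```python
import copy

class ParentRequest:
    """Minimal fan-out record: the original request id and its sampling params."""
    def __init__(self, request_id, params):
        self.request_id = request_id
        self.params = params

class LLMEngine:                  # heavily simplified sketch
    def __init__(self, processor, output_processor, engine_core):
        self.processor = processor
        self.output_processor = output_processor
        self.engine_core = engine_core

    def add_request(self, request_id, prompt, params, **kwargs):
        n = getattr(params, "n", 1)
        if n == 1:
            request = self.processor.process_inputs(request_id, prompt, params, **kwargs)
            self.output_processor.add_request(request)   # parent_req=None, index=0
            self.engine_core.add_request(request)
            return

        parent = ParentRequest(request_id, params)
        for idx in range(n):
            child_params = copy.copy(params)
            child_params.n = 1                            # each child samples once
            request = self.processor.process_inputs(
                f"{request_id}_{idx}", prompt, child_params, **kwargs)
            # The output processor creates RequestOutputs, so it is the one
            # that aggregates the n children back into a single parent output.
            self.output_processor.add_request(request, parent_req=parent, index=idx)
            self.engine_core.add_request(request)
```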