
[V1] [Spec Decode] Support random sampling for spec decode #13933

Merged: 44 commits into vllm-project:main on Mar 17, 2025

Conversation

LiuXiaoxuanPKU (Collaborator) commented Feb 26, 2025:

After syncing with @WoosukKwon, we have changed the scope of this PR:

  1. We will support random sampling for spec decode in this PR.
  2. Since only ngram is supported in vLLM V1, we only support ngram with random sampling for now. However, the random sampling logic should generalize to other drafting methods.
  3. The PR should support mixed batches, where some requests in the same batch perform spec decode and others do not.
  4. Spec decode is compatible with random sampling, but not with top_p/top_k sampling. We will disable spec decode if a request requires top_p or top_k sampling.
  5. We will give a clearer definition of recovered token ids and bonus token ids.
  6. We will create new test cases for the V1 rejection sampler instead of reusing the V0 tests, for cleaner separation.

This PR tries to:
1. Support random sampling in the rejection sampler. This should be general to different drafting methods, not limited to ngram spec decode. (A reference sketch of the standard acceptance rule is included at the end of this description.)
2. Clean up and reuse the rejection sampling tests from V0.

This PR does not:
1. Change the model runner to use the rejection sampler with random sampling. We need one more PR to support ngram with random sampling.
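
For reference, here is a minimal PyTorch sketch of the standard acceptance rule and the adjusted ("recovered") distribution from the speculative-decoding literature (Leviathan et al.). Function and variable names are illustrative only; this is not the code merged in this PR.

```python
import torch

def accept_draft_tokens(target_probs: torch.Tensor,
                        draft_probs: torch.Tensor,
                        draft_token_ids: torch.Tensor) -> torch.Tensor:
    """Standard acceptance rule: accept draft token x with probability
    min(1, p(x) / q(x)), where p is the target and q the draft distribution.

    target_probs, draft_probs: [batch_size, max_spec_len, vocab_size]
    draft_token_ids:           [batch_size, max_spec_len]
    Returns a boolean mask of accepted draft tokens.
    """
    idx = draft_token_ids.unsqueeze(-1)
    p = target_probs.gather(-1, idx).squeeze(-1)  # target prob of each draft token
    q = draft_probs.gather(-1, idx).squeeze(-1)   # draft prob of each draft token
    u = torch.rand_like(p)                        # uniform samples in [0, 1)
    return u <= p / q

def sample_recovered_tokens(target_probs: torch.Tensor,
                            draft_probs: torch.Tensor) -> torch.Tensor:
    """On rejection, sample the recovered token from the adjusted
    distribution norm(max(0, p - q)), which preserves the target
    distribution overall. Degenerate rows (p == q elementwise) are
    ignored in this sketch.
    """
    adjusted = torch.clamp(target_probs - draft_probs, min=0)
    adjusted = adjusted / adjusted.sum(dim=-1, keepdim=True)
    batch_size, spec_len, vocab_size = adjusted.shape
    sampled = torch.multinomial(adjusted.view(-1, vocab_size), num_samples=1)
    return sampled.view(batch_size, spec_len)
```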


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@LiuXiaoxuanPKU LiuXiaoxuanPKU marked this pull request as draft February 26, 2025 23:11
@mergify mergify bot added the v1 label Feb 26, 2025
@LiuXiaoxuanPKU LiuXiaoxuanPKU marked this pull request as ready for review February 27, 2025 07:45
@LiuXiaoxuanPKU LiuXiaoxuanPKU added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 27, 2025
benchislett (Contributor) commented:

> Spec decode is compatible with random sampling, but not with top_p/top_k sampling. We will disable spec decode if a request requires top_p or top_k sampling.

Could you explain this claim? Why is that the case? Is this a problem with our implementation or a fundamental limitation?

LiuXiaoxuanPKU (Collaborator, Author) commented Mar 12, 2025:

> Spec decode is compatible with random sampling, but not with top_p/top_k sampling. We will disable spec decode if a request requires top_p or top_k sampling.
>
> Could you explain this claim? Why is that the case? Is this a problem with our implementation or a fundamental limitation?

Algorithm-wise, it's unclear. For example, what is the acceptance criterion, and how do we sample from the adjusted distribution? We need some math here to prove the equivalence.

benchislett (Contributor) commented:

Pardon my ignorance if I am not fully informed about how we implement sampling for speculative decoding, but the Leviathan paper on speculative decoding describes "speculative sampling" and how sampling techniques (top-k, nucleus) can be emulated by sampling from a modified logits distribution. Is it possible to do something similar here?

Does vLLM v0 also ignore these sampling parameters?
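
To make the suggestion above concrete, here is a hedged, illustrative sketch (not from this PR, and not a claim about what vLLM does) of nucleus (top-p) filtering that could in principle be applied to both the target and draft distributions before the acceptance test. Whether the combined procedure still matches the target sampling distribution is exactly the equivalence question raised above.

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Set all tokens outside the smallest set whose cumulative probability
    exceeds top_p to -inf, then renormalize. Illustrative only."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens whose preceding cumulative mass already exceeds top_p
    # (the top-1 token is always kept).
    mask = (cum_probs - sorted_probs) > top_p
    sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
    filtered = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.softmax(filtered, dim=-1)

# One possible (unverified) emulation: filter both distributions the same way
# before running the acceptance test.
# target_probs = top_p_filter(target_logits, top_p=0.9)
# draft_probs  = top_p_filter(draft_logits, top_p=0.9)
```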

LiuXiaoxuanPKU (Collaborator, Author) commented:

TODO: check quality with humaneval


# 3. Accept or reject the samples.
# [batch_size, max_spec_len]
accepted = uniform_samples <= target_token_probs / draft_token_probs
Collaborator:

As we discussed offline, please consider division by zero.

Collaborator Author:

On second thought, why do we need to avoid division by zero? If draft_token_probs is 0, then target_token_probs / draft_token_probs is inf, so the token will always be accepted.

Collaborator:

(1) It may cause an error in L187 (the sum).
(2) It may cause NaN if target_token_probs is also zero.
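
For illustration only (variable names follow the snippet above; the merged fix may differ): one way to keep the ratio finite is to handle q == 0 explicitly rather than relying on the division producing inf, which also avoids NaN when the target probability is 0 and keeps downstream reductions (such as the sum mentioned above) finite.

```python
import torch

# Hypothetical guard, not the exact code in this PR.
ratio = torch.where(
    draft_token_probs > 0,
    target_token_probs / draft_token_probs.clamp_min(1e-10),  # finite in all branches
    torch.ones_like(target_token_probs),  # q == 0  ->  ratio 1  ->  always accepted
)
accepted = uniform_samples <= ratio
```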

LiuXiaoxuanPKU and others added 7 commits March 14, 2025 18:37
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
WoosukKwon (Collaborator) commented Mar 15, 2025:

@LiuXiaoxuanPKU As a sanity check, can you please run a simple perf benchmark? I'm just wondering if we missed anything critical.

JaheimLee commented:

Hi, I always get the following error after my server has been running for a long time (a whole night).

ERROR 03-16 09:11:23 [core.py:337] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 330, in run_engine_core
ERROR 03-16 09:11:23 [core.py:337]     engine_core.run_busy_loop()
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 364, in run_busy_loop
ERROR 03-16 09:11:23 [core.py:337]     outputs = step_fn()
ERROR 03-16 09:11:23 [core.py:337]               ^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 181, in step
ERROR 03-16 09:11:23 [core.py:337]     scheduler_output = self.scheduler.schedule()
ERROR 03-16 09:11:23 [core.py:337]                        ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/scheduler.py", line 172, in schedule
ERROR 03-16 09:11:23 [core.py:337]     new_blocks = self.kv_cache_manager.allocate_slots(
ERROR 03-16 09:11:23 [core.py:337]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_manager.py", line 243, in allocate_slots
ERROR 03-16 09:11:23 [core.py:337]     self.block_pool.cache_full_blocks(
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/block_pool.py", line 112, in cache_full_blocks
ERROR 03-16 09:11:23 [core.py:337]     assert blk.block_hash is None
ERROR 03-16 09:11:23 [core.py:337]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337] AssertionError

Memory is sufficient on my two 24GB RTX 3090s. My config is:

```python
AsyncEngineArgs(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.97,
    enforce_eager=True,
    max_model_len=7000,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    speculative_model="[ngram]",
    ngram_prompt_lookup_max=5,
    ngram_prompt_lookup_min=3,
    num_speculative_tokens=3,
    max_num_seqs=128,
    max_num_batched_tokens=2048,
    compilation_config=3,
)
```

LiuXiaoxuanPKU (Collaborator, Author) commented Mar 16, 2025:

I did a quick performance check.
Prompt: "Given the code below, could you add one line comment to the return line: {quick_sort_str}"
max_tokens = 1024, batch_size = 1, hardware: 1× 80GB H100
Model: meta-llama/Llama-3.1-8B-Instruct

Since the outputs might differ between runs, we use the throughput (tokens/s) metric below. T is the temperature.

[Screenshot: throughput (tokens/s) results]
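
For readers who want to reproduce a similar single-prompt throughput check offline, a rough sketch follows. The ngram settings mirror the engine arguments posted earlier in this thread; the exact flag names may differ across vLLM versions, and quick_sort_str is a placeholder for the quicksort snippet used in the prompt.

```python
import time
from vllm import LLM, SamplingParams

quick_sort_str = "..."  # placeholder for the quicksort snippet used in the benchmark

# Engine arguments mirror the ngram config shown earlier in this thread
# (flag names may differ across vLLM versions).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="[ngram]",
    ngram_prompt_lookup_max=5,
    ngram_prompt_lookup_min=3,
    num_speculative_tokens=3,
)

prompt = ("Given the code below, could you add one line comment to the "
          f"return line: {quick_sort_str}")
params = SamplingParams(temperature=1.0, max_tokens=1024)

start = time.perf_counter()
outputs = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated_tokens / elapsed:.1f} tokens/s")
```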

LiuXiaoxuanPKU (Collaborator, Author) commented Mar 16, 2025:

I evaluated the quality of meta-llama/Meta-Llama-3-8B-Instruct on GSM8K with the following:

lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096,speculative_model=[ngram],ngram_prompt_lookup_max=4,ngram_prompt_lookup_min=3,num_speculative_tokens=3" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
| Configuration | Temperature | Accuracy (flexible-extract / strict-match) |
| --- | --- | --- |
| w/o SD | 0 | 0.79 / 0.79 |
| with ngram SD | 0 | 0.77 / 0.77 |
| w/o SD | 1.0 | 0.63 / 0.65 |
| with ngram SD | 1.0 | 0.62 / 0.64 |

LiuXiaoxuanPKU (Collaborator, Author) commented:

More results on meta-llama/Llama-3.2-3B-Instruct
[Screenshot: results for meta-llama/Llama-3.2-3B-Instruct]

WoosukKwon (Collaborator) commented:

@LiuXiaoxuanPKU Is the PR ready for merge?

LiuXiaoxuanPKU (Collaborator, Author) commented:

> @LiuXiaoxuanPKU Is the PR ready for merge?

Yes. I checked the quality further: for greedy sampling it is stable; for random sampling it fluctuates (sometimes better, sometimes worse). Overall it looks correct to me.

WoosukKwon (Collaborator) left a review:

LGTM! Thanks for the great work! 👍

@LiuXiaoxuanPKU LiuXiaoxuanPKU merged commit 8d6cf89 into vllm-project:main Mar 17, 2025
30 checks passed
DefTruth pushed a commit to DefTruth/vllm that referenced this pull request Mar 17, 2025
…ect#13933)

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
Labels: ready, v1

7 participants