
[V1] [Spec Decode] Support random sampling for spec decode #13933

Merged: 44 commits into vllm-project:main on Mar 17, 2025

Conversation

LiuXiaoxuanPKU (Collaborator) commented Feb 26, 2025:

After syncing with @WoosukKwon, we have changed the scope of this PR:

  1. We will support random sampling for spec decode in this PR.
  2. Since only ngram is supported in vLLM V1, we only support ngram with random sampling for now. However, the random sampling logic should generalize to other drafting methods.
  3. The PR should support mixed batches, where some requests in the same batch perform spec decode and others do not.
  4. Spec decode is compatible with random sampling, but not with top_p/top_k sampling. We will disable spec decode if a request requires top_p or top_k sampling.
  5. We will give a clearer definition of recovered token ids and bonus token ids.
  6. We will create new test cases for the V1 rejection sampler instead of reusing the V0 tests, for cleaner separation.

This PR tries to:
1. Support random sampling in the rejection sampler. This should be general to different drafting methods, not limited to ngram spec decode. (A reference sketch of the standard acceptance rule is included at the end of this description.)
2. Clean up and reuse the rejection sampling tests from V0.

This PR does not:
1. Change the model runner to use the rejection sampler with random sampling. We need one more PR to support ngram with random sampling.
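
For reference, here is a minimal PyTorch sketch of the standard acceptance rule and the adjusted ("recovered") distribution from the speculative-decoding literature (Leviathan et al.). Function and variable names are illustrative only; this is not the code merged in this PR.

```python
import torch

def accept_draft_tokens(target_probs: torch.Tensor,
                        draft_probs: torch.Tensor,
                        draft_token_ids: torch.Tensor) -> torch.Tensor:
    """Standard acceptance rule: accept draft token x with probability
    min(1, p(x) / q(x)), where p is the target and q the draft distribution.

    target_probs, draft_probs: [batch_size, max_spec_len, vocab_size]
    draft_token_ids:           [batch_size, max_spec_len]
    Returns a boolean mask of accepted draft tokens.
    """
    idx = draft_token_ids.unsqueeze(-1)
    p = target_probs.gather(-1, idx).squeeze(-1)  # target prob of each draft token
    q = draft_probs.gather(-1, idx).squeeze(-1)   # draft prob of each draft token
    u = torch.rand_like(p)                        # uniform samples in [0, 1)
    return u <= p / q

def sample_recovered_tokens(target_probs: torch.Tensor,
                            draft_probs: torch.Tensor) -> torch.Tensor:
    """On rejection, sample the recovered token from the adjusted
    distribution norm(max(0, p - q)), which preserves the target
    distribution overall. Degenerate rows (p == q elementwise) are
    ignored in this sketch.
    """
    adjusted = torch.clamp(target_probs - draft_probs, min=0)
    adjusted = adjusted / adjusted.sum(dim=-1, keepdim=True)
    batch_size, spec_len, vocab_size = adjusted.shape
    sampled = torch.multinomial(adjusted.view(-1, vocab_size), num_samples=1)
    return sampled.view(batch_size, spec_len)
```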


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@LiuXiaoxuanPKU LiuXiaoxuanPKU marked this pull request as draft February 26, 2025 23:11
@mergify mergify bot added the v1 label Feb 26, 2025
@LiuXiaoxuanPKU LiuXiaoxuanPKU marked this pull request as ready for review February 27, 2025 07:45
@LiuXiaoxuanPKU LiuXiaoxuanPKU added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 27, 2025
benchislett (Contributor) commented:

> Spec decode is compatible with random sampling, but not with top_p/top_k sampling. We will disable spec decode if a request requires top_p or top_k sampling.

Could you explain this claim? Why is that the case? Is this a problem with our implementation or a fundamental limitation?

LiuXiaoxuanPKU (Collaborator, Author) commented Mar 12, 2025:

> Spec decode is compatible with random sampling, but not with top_p/top_k sampling. We will disable spec decode if a request requires top_p or top_k sampling.
>
> Could you explain this claim? Why is that the case? Is this a problem with our implementation or a fundamental limitation?

Algorithm-wise, it's unclear. For example, what is the acceptance criterion, and how do we sample from the adjusted distribution? We need some math here to prove the equivalence.

benchislett (Contributor) commented:

Pardon my ignorance if I am not fully informed about how we implement sampling for speculative decoding, but the Leviathan paper on speculative decoding describes "speculative sampling" and how sampling techniques (top-k, nucleus) can be emulated by sampling from a modified logits distribution. Is it possible to do something similar here?

Does vLLM v0 also ignore these sampling parameters?
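
To make the suggestion above concrete, here is a hedged, illustrative sketch (not from this PR, and not a claim about what vLLM does) of nucleus (top-p) filtering that could in principle be applied to both the target and draft distributions before the acceptance test. Whether the combined procedure still matches the target sampling distribution is exactly the equivalence question raised above.

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Set all tokens outside the smallest set whose cumulative probability
    exceeds top_p to -inf, then renormalize. Illustrative only."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens whose preceding cumulative mass already exceeds top_p
    # (the top-1 token is always kept).
    mask = (cum_probs - sorted_probs) > top_p
    sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
    filtered = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.softmax(filtered, dim=-1)

# One possible (unverified) emulation: filter both distributions the same way
# before running the acceptance test.
# target_probs = top_p_filter(target_logits, top_p=0.9)
# draft_probs  = top_p_filter(draft_logits, top_p=0.9)
```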

LiuXiaoxuanPKU (Collaborator, Author) commented:

TODO: check quality with humaneval


# 3. Accept or reject the samples.
# [batch_size, max_spec_len]
accepted = uniform_samples <= target_token_probs / draft_token_probs
Collaborator:

As we discussed offline, please consider division by zero.

Collaborator Author:

On second thought, why do we need to avoid division by zero? If draft_token_probs is 0, then target_token_probs / draft_token_probs is inf, so the token will always be accepted.

Collaborator:

(1) It may cause an error in L187 (the sum).
(2) It may cause NaN if target_token_probs is also zero.
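
For illustration only (variable names follow the snippet above; the merged fix may differ): one way to keep the ratio finite is to handle q == 0 explicitly rather than relying on the division producing inf, which also avoids NaN when the target probability is 0 and keeps downstream reductions (such as the sum mentioned above) finite.

```python
import torch

# Hypothetical guard, not the exact code in this PR.
ratio = torch.where(
    draft_token_probs > 0,
    target_token_probs / draft_token_probs.clamp_min(1e-10),  # finite in all branches
    torch.ones_like(target_token_probs),  # q == 0  ->  ratio 1  ->  always accepted
)
accepted = uniform_samples <= ratio
```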

LiuXiaoxuanPKU and others added 7 commits March 14, 2025 18:37
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
WoosukKwon (Collaborator) commented Mar 15, 2025:

@LiuXiaoxuanPKU As a sanity check, can you please run a simple perf benchmark? I'm just wondering if we missed anything critical.

JaheimLee commented:

Hi, I always get the following error after my server has been running for a long time (a whole night).

ERROR 03-16 09:11:23 [core.py:337] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 330, in run_engine_core
ERROR 03-16 09:11:23 [core.py:337]     engine_core.run_busy_loop()
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 364, in run_busy_loop
ERROR 03-16 09:11:23 [core.py:337]     outputs = step_fn()
ERROR 03-16 09:11:23 [core.py:337]               ^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 181, in step
ERROR 03-16 09:11:23 [core.py:337]     scheduler_output = self.scheduler.schedule()
ERROR 03-16 09:11:23 [core.py:337]                        ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/scheduler.py", line 172, in schedule
ERROR 03-16 09:11:23 [core.py:337]     new_blocks = self.kv_cache_manager.allocate_slots(
ERROR 03-16 09:11:23 [core.py:337]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_manager.py", line 243, in allocate_slots
ERROR 03-16 09:11:23 [core.py:337]     self.block_pool.cache_full_blocks(
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/block_pool.py", line 112, in cache_full_blocks
ERROR 03-16 09:11:23 [core.py:337]     assert blk.block_hash is None
ERROR 03-16 09:11:23 [core.py:337]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337] AssertionError

Memory is sufficient on my two 24GB RTX 3090s. My config is:

```python
AsyncEngineArgs(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.97,
    enforce_eager=True,
    max_model_len=7000,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    speculative_model="[ngram]",
    ngram_prompt_lookup_max=5,
    ngram_prompt_lookup_min=3,
    num_speculative_tokens=3,
    max_num_seqs=128,
    max_num_batched_tokens=2048,
    compilation_config=3,
)
```

LiuXiaoxuanPKU (Collaborator, Author) commented Mar 16, 2025:

I did a quick performance check.
Prompt: "Given the code below, could you add one line comment to the return line: {quick_sort_str}"
max_tokens = 1024, batch_size = 1, hardware: 1× 80GB H100
Model: meta-llama/Llama-3.1-8B-Instruct

Since the outputs might differ between runs, we use the throughput (tokens/s) metric below. T is the temperature.

[Screenshot: throughput (tokens/s) results]
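
For readers who want to reproduce a similar single-prompt throughput check offline, a rough sketch follows. The ngram settings mirror the engine arguments posted earlier in this thread; the exact flag names may differ across vLLM versions, and quick_sort_str is a placeholder for the quicksort snippet used in the prompt.

```python
import time
from vllm import LLM, SamplingParams

quick_sort_str = "..."  # placeholder for the quicksort snippet used in the benchmark

# Engine arguments mirror the ngram config shown earlier in this thread
# (flag names may differ across vLLM versions).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="[ngram]",
    ngram_prompt_lookup_max=5,
    ngram_prompt_lookup_min=3,
    num_speculative_tokens=3,
)

prompt = ("Given the code below, could you add one line comment to the "
          f"return line: {quick_sort_str}")
params = SamplingParams(temperature=1.0, max_tokens=1024)

start = time.perf_counter()
outputs = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated_tokens / elapsed:.1f} tokens/s")
```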

LiuXiaoxuanPKU (Collaborator, Author) commented Mar 16, 2025:

I evaluated the quality of meta-llama/Meta-Llama-3-8B-Instruct on GSM8K with the following:

lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096,speculative_model=[ngram],ngram_prompt_lookup_max=4,ngram_prompt_lookup_min=3,num_speculative_tokens=3" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
| Configuration | Temperature | Accuracy (flexible-extract / strict-match) |
| --- | --- | --- |
| w/o SD | 0 | 0.79 / 0.79 |
| with ngram SD | 0 | 0.77 / 0.77 |
| w/o SD | 1.0 | 0.63 / 0.65 |
| with ngram SD | 1.0 | 0.62 / 0.64 |

LiuXiaoxuanPKU (Collaborator, Author) commented:

More results on meta-llama/Llama-3.2-3B-Instruct
[Screenshot: results for meta-llama/Llama-3.2-3B-Instruct]

WoosukKwon (Collaborator) commented:

@LiuXiaoxuanPKU Is the PR ready for merge?

LiuXiaoxuanPKU (Collaborator, Author) commented:

> @LiuXiaoxuanPKU Is the PR ready for merge?

Yes. I checked the quality further: for greedy sampling it is stable; for random sampling it fluctuates (sometimes better, sometimes worse). Overall it looks correct to me.

WoosukKwon (Collaborator) left a review:

LGTM! Thanks for the great work! 👍

@LiuXiaoxuanPKU LiuXiaoxuanPKU merged commit 8d6cf89 into vllm-project:main Mar 17, 2025
30 checks passed
DefTruth pushed a commit to DefTruth/vllm that referenced this pull request Mar 17, 2025
…ect#13933)

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
Labels: ready, v1

7 participants