Fix for attention layers to remain unquantized during moe_wn16 quant #12570
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
- Add `ready` label to the PR
- Enable auto-merge.

🚀
Thank you, makes sense!
…method
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

…-project#12560)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

…m-project#12564)
Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

…project#12555)
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

…caling (vllm-project#11868)
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

…2571)
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

Co-authored-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
It's very annoying when I forget to add `-s` in `git commit` to sign off, because I then need to `git rebase HEAD~1 --signoff` and `git push -f` to fix the DCO. This PR adds a hook to sign off commits automatically when `-s` is missing to solve this problem. The only change on the user side is that users now have to install 2 hooks, so instead of just

```
pre-commit install
```

we now need

```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```

Note that even if users still only install the pre-commit hook, they won't get any error in `git commit`; the sign-off hook just won't run. cc @hmellor @youkaichao

---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
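For reference, a commit-msg hook of this kind can be a short script that appends the DCO trailer when it is missing. The sketch below is illustrative only (the file name and details are assumptions, not the actual hook added in this PR):

```python
# Illustrative commit-msg hook sketch, not the actual hook from this PR.
# pre-commit invokes commit-msg hooks with the path to the message file.
import subprocess
import sys


def main(commit_msg_file: str) -> None:
    name = subprocess.check_output(
        ["git", "config", "user.name"], text=True).strip()
    email = subprocess.check_output(
        ["git", "config", "user.email"], text=True).strip()
    trailer = f"Signed-off-by: {name} <{email}>"

    with open(commit_msg_file, "r+", encoding="utf-8") as f:
        msg = f.read()
        if trailer not in msg:
            # Append the trailer, mimicking what `git commit -s` would do.
            if not msg.endswith("\n"):
                f.write("\n")
            f.write(trailer + "\n")


if __name__ == "__main__":
    main(sys.argv[1])
```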
- Create v1 design document section in docs.
- Add prefix caching design doc.

@WoosukKwon @ywang96

---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…oject#12603)

This PR adds an extra key to the block hash, to generate different hash values for two blocks with the same tokens but different extra_keys in their parent blocks. For example, it can generate different hash values for the second block of the following two requests:

```python
request1 = make_request(
    request_id=0,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```

---------
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
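To illustrate the idea, here is a simplified sketch (not vLLM's actual hashing code) of folding extra keys into a chained block hash, so that blocks with identical tokens but different multimodal inputs hash differently:

```python
# Simplified sketch of extra-key-aware block hashing; hash_block and its
# parameters are illustrative, not vLLM's real implementation.
import hashlib
from typing import Optional


def hash_block(parent_hash: Optional[bytes],
               token_ids: tuple,
               extra_keys: tuple = ()) -> bytes:
    h = hashlib.sha256()
    if parent_hash is not None:
        h.update(parent_hash)            # chain to the parent block's hash
    h.update(repr(token_ids).encode())   # the block's own token ids
    h.update(repr(extra_keys).encode())  # e.g. mm hashes covering this block
    return h.digest()


# Same tokens, but a different mm hash in the covering span: the two
# blocks get different hashes, so they are not shared in the prefix cache.
b1 = hash_block(None, (0, 1, 2), ("hash1",))
b2 = hash_block(None, (0, 1, 2), ("hash3",))
assert b1 != b2
```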
Instead of having to create a new build with the release version passed in as an env var.

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
SUMMARY:
* previous PR for pulling in block configs also changed defaults (https://github.com/vllm-project/vllm/pull/11589/files) for FP8
* this broke L4 MoE since there was not enough SHM for the default configuration
* this reverts the non-block example to the default

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…DeepSeekV3 (vllm-project#12587)

Integrates the block-quantized kernels introduced in vllm-project#11868 for use in linear layers.

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…2563)

**[Guided decoding performance optimization]** Sending the guided decoding bitmask in xgrammar to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels. The CPU waits until decode is complete, then copies the bitmask over. This PR changes the operation to async by setting `non_blocking=True`.

(Current) The CPU is blocked on a `cudaStreamSynchronize` and only pre-empts the sampling kernels after bitmask application. Below is the Nsys profile for one decode phase from Llama 3.1 8B.

[Nsys profile screenshot: CPU blocked on the bitmask copy]

With the optimization, this is no longer the case:

[Nsys profile screenshot: async copy overlaps with sampler kernel launches]

---------
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
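For illustration, a minimal sketch of the blocking vs. non-blocking copy (the function name and shapes are illustrative, not vLLM's exact code); note that the host tensor should be in pinned memory for the copy to be truly asynchronous:

```python
# Illustrative sketch of the optimization, not vLLM's exact code.
import torch


def apply_bitmask(scores: torch.Tensor,
                  token_bitmask: torch.Tensor) -> torch.Tensor:
    # Blocking version: token_bitmask.to(scores.device) forces the CPU to
    # wait on the copy before it can launch further kernels.
    # Non-blocking version: the copy is merely enqueued on the current CUDA
    # stream, so the CPU can keep pre-launching the sampler kernels. For the
    # copy to actually overlap, token_bitmask should live in pinned host
    # memory (e.g. allocated with pin_memory=True).
    mask = token_bitmask.to(scores.device, non_blocking=True)
    # Disallowed tokens get -inf logits so they can never be sampled.
    return scores.masked_fill(~mask.bool(), float("-inf"))
```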
- Make device tab names more explicit
- Add comprehensive list of devices to https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have pre-built wheels/images

---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Based on a request by @mgoin, with @kylesayrs we have added an example doc for int4 w4a16 quantization, following the pre-existing int8 w8a8 quantization example and the example available in [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).

FIX #n/a (no issue created)

@kylesayrs and I have discussed a couple of additional improvements for the quantization docs. We will revisit at a later date, possibly including:
- A section for "choosing the correct quantization scheme/compression technique"
- Additional vision or audio calibration datasets

---------
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
SUMMARY:
* avoid crashing the engine when we get an input longer than max_model_len

FIX vllm-project#12567

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
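Conceptually (a hypothetical sketch, not the actual patch), the fix amounts to validating the prompt length up front and rejecting the single over-long request instead of letting the whole engine crash:

```python
# Hypothetical sketch of the guard described above; the function name and
# call site are illustrative, not vLLM's actual code.
def validate_prompt_length(prompt_token_ids: list[int],
                           max_model_len: int) -> None:
    if len(prompt_token_ids) > max_model_len:
        # Surface an error for this one request; the engine keeps serving.
        raise ValueError(
            f"Prompt has {len(prompt_token_ids)} tokens, which exceeds "
            f"the model's max_model_len of {max_model_len}.")
```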
This pull request has merge conflicts that must be resolved before it can be merged.
Anyone know why the Docker image build fails?
Not sure. It's also a problem on main so it's not related to this PR. We will force-merge if necessary. |
…llm-project#12570)

Fix to AWQ quant loading of the new R1 model. The new optimized MoE kernels for a large number of experts, `moe_wn16`, use AWQ quant, which requires the attention layers to be in 16-bit. The current merge has broken this, and `get_quant_method` must return None for it to work correctly again.

---------
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Beim <805908499@qq.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: fade_away <1028552010@qq.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Shawn Du <shawnd200@outlook.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Fix to AWQ quant loading of the new R1 model.

The new optimized MoE kernels for a large number of experts, `moe_wn16`, use AWQ quant, which requires the attention layers to be in 16-bit. The current merge has broken this, and `get_quant_method` must return None for it to work correctly again.
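A minimal, self-contained sketch of the behavior described above (the class names mirror vLLM's `FusedMoE`/`get_quant_method`, but the bodies are illustrative stand-ins, not the actual implementation): the quant config returns a quant method only for fused MoE layers and None for everything else, so attention stays unquantized in 16-bit.

```python
# Illustrative sketch only: FusedMoE/Attention stand in for vLLM's layer
# classes, and MoeWN16Config is not the real config class.
from typing import Optional


class FusedMoE:          # stand-in for vLLM's fused MoE layer
    pass


class Attention:         # stand-in for an attention layer
    pass


class MoeWN16Method:
    """Quantization method applied to the MoE expert weights."""

    def __init__(self, config: "MoeWN16Config") -> None:
        self.config = config


class MoeWN16Config:
    def get_quant_method(self, layer: object,
                         prefix: str) -> Optional[MoeWN16Method]:
        if isinstance(layer, FusedMoE):
            return MoeWN16Method(self)  # quantize only the MoE experts
        # Returning None leaves this layer (e.g. attention) unquantized,
        # i.e. its weights stay in 16-bit as AWQ loading requires.
        return None


config = MoeWN16Config()
assert config.get_quant_method(FusedMoE(), "model.layers.0.mlp") is not None
assert config.get_quant_method(Attention(), "model.layers.0.self_attn") is None
```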