
Fix for attention layers to remain unquantized during moe_wn16 quant #12570

Merged
42 commits merged into vllm-project:main on Feb 3, 2025

Conversation

@srikanthsrnvs (Contributor) commented on Jan 30, 2025

Fix to AWQ quant loading of the new R1 model

The new optimized MoE kernels for a large number of experts, `moe_wn16`, use AWQ quantization, which requires the attention layers to remain in 16-bit precision.

The recent merge broke this: `get_quant_method` must return `None` for the attention layers so that they are loaded unquantized again.
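A minimal sketch of what that looks like, assuming vLLM's `QuantizationConfig.get_quant_method(layer, prefix)` interface; the class names (`MoeWNA16Config`, `MoeWNA16Method`) and import paths below are illustrative and may differ across vLLM versions:

```python
# Illustrative sketch, not the exact PR diff: only FusedMoE layers get a
# quant method; everything else (attention, dense linears) returns None
# and therefore loads unquantized in 16-bit.
from typing import Optional

from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)


class MoeWNA16Config(QuantizationConfig):

    def get_quant_method(self, layer,
                         prefix: str) -> Optional[QuantizeMethodBase]:
        if isinstance(layer, FusedMoE):
            # Only the expert (MoE) weights are quantized by the moe_wn16
            # kernels. MoeWNA16Method is a placeholder name here.
            return MoeWNA16Method(self)
        # Returning None keeps attention and dense linear layers
        # unquantized, as the AWQ-quantized R1 checkpoint requires.
        return None
```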


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mgoin (Member) left a comment:


Thank you, makes sense!

@mgoin added the `quantization` and `ready` labels on Jan 31, 2025
srikanthsrnvs and others added 23 commits February 3, 2025 03:14
…method

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

…-project#12560)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

…m-project#12564)

Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…project#12555)

Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…caling (vllm-project#11868)

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…2571)

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
It's very annoying when I forget to add `-s` in `git commit` to sign off, because I then need `git rebase HEAD~1 --signoff` and `git push -f` to fix the DCO. This PR adds a hook that signs off commits automatically when `-s` is missing. The only change on the user side is that users now have to install two hooks, so instead of just

```
pre-commit install
```

Now we need to

```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```

Note that even if users still only install the pre-commit hook, they won't get any error in `git commit`; the sign-off hook simply won't run.
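For intuition, a hypothetical sketch of what such a commit-msg hook does (this is an illustration, not the actual hook script added by this PR): pre-commit invokes commit-msg hooks with the path to the commit message file, and the hook appends a `Signed-off-by:` trailer if one is missing.

```python
# Hypothetical commit-msg hook body (illustrative): append a
# Signed-off-by trailer to the commit message file when absent.
import subprocess
import sys


def ensure_signoff(commit_msg_path: str) -> None:
    name = subprocess.check_output(["git", "config", "user.name"],
                                   text=True).strip()
    email = subprocess.check_output(["git", "config", "user.email"],
                                    text=True).strip()
    trailer = f"Signed-off-by: {name} <{email}>"
    with open(commit_msg_path, "r+") as f:
        message = f.read()
        if trailer not in message:
            f.write(f"\n{trailer}\n")


if __name__ == "__main__":
    # pre-commit passes the commit message file path as the first argument.
    ensure_signoff(sys.argv[1])
```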

cc @hmellor @youkaichao

---------

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
- Create v1 design document section in docs.
- Add prefix caching design doc.

@WoosukKwon @ywang96

---------

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…oject#12603)

This PR adds an extra key to the block hash, so that two blocks with the same token string but different `extra_keys` in their parent blocks get different hash values. For example, it generates different hash values for the second block of the following two requests:
```python
request1 = make_request(
    request_id=0,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```
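Conceptually, the change amounts to folding the extra keys into the chained block hash; a minimal sketch of the idea (not vLLM's exact implementation):

```python
# Sketch only: a block's hash covers the parent block's hash, the block's
# token ids, and any extra keys such as multimodal hashes.
from typing import Optional, Tuple


def hash_block(parent_hash: Optional[int],
               token_ids: Tuple[int, ...],
               extra_keys: Tuple[str, ...] = ()) -> int:
    return hash((parent_hash, token_ids, extra_keys))


# The first blocks of request1 and request2 differ ("hash1" vs "hash3"),
# so their hashes differ; that difference propagates through parent_hash
# to the second blocks, even though those share token ids and "hash2".
h1 = hash_block(None, (0, 1, 2), ("hash1",))
h2 = hash_block(None, (0, 1, 2), ("hash3",))
assert hash_block(h1, (3, 4, 5), ("hash2",)) != \
       hash_block(h2, (3, 4, 5), ("hash2",))
```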

---------

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Instead of having to create a new build with the release version passed in as an env var.

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…DeepSeekV3 (vllm-project#12587)

Integrates the block-quantized kernels introduced in
vllm-project#11868 for use in linear
layers.

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…2563)

**[Guided decoding performance optimization]** Sending the guided decoding bitmask in xgrammar to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels: the CPU waits until decode is complete, then copies the bitmask over. This PR makes the copy asynchronous by setting `non_blocking=True`.

(Current) The CPU is blocked on a `cudaStreamSynchronize` and only launches the sampling kernels after the bitmask is applied. Below is the Nsys profile of one decode phase from Llama 3.1 8B.

![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

With the optimization, this is no longer the case:

![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)
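For illustration, a minimal sketch of the before/after copy (tensor shapes and names here are illustrative, not vLLM's exact code; pinned host memory is what makes `non_blocking=True` a true asynchronous copy):

```python
import torch

scores = torch.randn(4, 32000, device="cuda")
# Pinned (page-locked) host memory enables a true async host-to-device copy.
token_bitmask = torch.zeros(4, 32000 // 32, dtype=torch.int32,
                            pin_memory=True)

# Before: blocking copy; the CPU stalls until the transfer completes.
# bitmask_gpu = token_bitmask.to(scores.device)

# After (this PR): the copy is enqueued on the current CUDA stream and the
# CPU immediately continues, pre-launching the sampling kernels.
bitmask_gpu = token_bitmask.to(scores.device, non_blocking=True)
```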

---------

Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
- Make device tab names more explicit
- Add comprehensive list of devices to
https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have
pre-built wheels/images

---------

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Based on a request by @mgoin, and together with @kylesayrs, we have added an example doc for int4 w4a16 quantization, following the pre-existing int8 w8a8 quantization example and the example available in [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).

FIX #n/a (no issue created)

@kylesayrs and I have discussed a couple of additional improvements for the quantization docs. We will revisit them at a later date, possibly including:
- A section on "choosing the correct quantization scheme/compression technique"
- Additional vision or audio calibration datasets

---------

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
SUMMARY:
* avoid crashing the engine when we get an input longer than `max_model_len` (see the illustrative guard below)
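An illustrative guard along these lines (names are hypothetical, not vLLM's exact code): validate the request up front and reject it alone, rather than letting it take down the engine.

```python
from typing import Sequence


def validate_prompt_length(prompt_token_ids: Sequence[int],
                           max_model_len: int) -> None:
    # Reject only the offending request; the engine keeps serving others.
    if len(prompt_token_ids) > max_model_len:
        raise ValueError(
            f"Prompt has {len(prompt_token_ids)} tokens, which exceeds "
            f"max_model_len={max_model_len}.")
```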

FIX vllm-project#12567

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>

mergify bot commented Feb 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @srikanthsrnvs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot removed the needs-rebase label Feb 3, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 3, 2025 05:11
@srikanthsrnvs (Contributor, Author) commented:

Anyone know why the Docker image build fails?

@DarkLight1337 (Member) replied:

Not sure. It's also a problem on main, so it's not related to this PR. We will force-merge if necessary.

@youkaichao youkaichao disabled auto-merge February 3, 2025 05:46
@youkaichao youkaichao merged commit b998645 into vllm-project:main Feb 3, 2025
24 of 38 checks passed
sahelib25 pushed a commit to krai/vllm that referenced this pull request Feb 3, 2025
…llm-project#12570)

Fix to AWQ quant loading of the new R1 model

The new optimized MoE kernels for a large number of experts, `moe_wn16`, use AWQ quantization, which requires the attention layers to remain in 16-bit precision.

The recent merge broke this: `get_quant_method` must return `None` for the attention layers so that they are loaded unquantized again.

---------

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Beim <805908499@qq.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: fade_away <1028552010@qq.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Shawn Du <shawnd200@outlook.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Labels
ci/build · documentation · frontend · quantization · ready · speculative-decoding · structured-output · v1