[Attention] MLA with chunked prefill #12639

LucasWilkinson · 2025-02-01T04:43:16Z

More benchmarking is needed to see whether this makes sense to enable by default in V0, but it lays the groundwork for a V1 implementation. (#13111 may help performance.)

lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enable_chunked_prefill=False --task gsm8k --num_fewshot=5 --limit 100

vllm (pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enable_chunked_prefill=False), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.66|±  |0.0476|
|     |       |strict-match    |     5|exact_match|↑  | 0.66|±  |0.0476|


lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enable_chunked_prefill=True --task gsm8k --num_fewshot=5 --limit 100


vllm (pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enable_chunked_prefill=True), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.66|±  |0.0476|
|     |       |strict-match    |     5|exact_match|↑  | 0.66|±  |0.0476|
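
As a sanity check on the two tables above, the reported stderr is consistent with 66 correct answers out of 100. The sketch below assumes the stderr is the sample standard deviation (n - 1 denominator) divided by sqrt(n). That assumption is not stated in this thread, but it reproduces the reported 0.0476. The `outcomes` list and the `mean_stderr` helper are illustrative names, not part of lm_eval or vLLM.

```python
import math


def mean_stderr(outcomes):
    """Standard error of the mean, using the sample (n - 1) variance."""
    n = len(outcomes)
    mean = sum(outcomes) / n
    sample_var = sum((x - mean) ** 2 for x in outcomes) / (n - 1)
    return math.sqrt(sample_var / n)


# Assumption: exact_match = 0.66 with --limit 100 means 66 correct, 34 wrong.
outcomes = [1.0] * 66 + [0.0] * 34

score = sum(outcomes) / len(outcomes)
stderr = mean_stderr(outcomes)

print(f"exact_match = {score:.2f} ± {stderr:.4f}")
```

With an interval this wide on 100 samples, identical scores show that enabling chunked prefill does not grossly change accuracy. A small regression would not be detectable at this limit.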

Shout-out to @pathorn for helping harden this PR.

Future work:

Allocate the worst-case result of self.kv_b_proj(kv_c_normed) in the profile run (see [Attention] MLA with chunked prefill #12639 (comment))
Improve the algorithm for allocating workspace amongst batch elements
Improve how the workspace is allocated
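
To make the workspace item concrete, here is a minimal sketch of one simple policy: split a fixed token workspace evenly across the batch elements that have cached context, then walk each element's context in chunks of that size. This is an illustration only. The function name `split_workspace`, its parameters, and the even-split policy are assumptions and are not taken from the PR's code.

```python
def split_workspace(context_lens, workspace_tokens):
    """Return, per chunk iteration, a (start, end) token range for each request.

    Hypothetical even-split policy: every request with context gets the same
    share of the workspace, regardless of how long its context actually is.
    """
    with_context = [n for n in context_lens if n > 0]
    if not with_context:
        return []

    per_request = workspace_tokens // len(with_context)
    if per_request == 0:
        raise ValueError("workspace too small for this batch")

    # Ceiling division: iterations needed to cover the longest context.
    num_chunks = -(-max(with_context) // per_request)

    chunks = []
    for i in range(num_chunks):
        start = i * per_request
        # Clamp to each request's context length; finished requests get an
        # empty range (start == end).
        chunks.append(
            [(min(start, n), min(start + per_request, n)) for n in context_lens]
        )
    return chunks


# Three requests: long context, no context, short context.
for ranges in split_workspace([10, 0, 3], workspace_tokens=8):
    used = sum(end - start for start, end in ranges)
    print(ranges, "tokens used:", used)
```

In this example the short request finishes after the first iteration, so later iterations leave part of the workspace idle. That kind of waste is what a smarter allocation amongst batch elements could reduce.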

github-actions · 2025-02-01T04:43:26Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mergify · 2025-02-06T05:22:40Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-02-07T03:56:34Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tlrmchlsmth · 2025-02-19T20:46:17Z

vllm/engine/arg_utils.py

-            if model_config.is_multimodal_model and model_config.use_mla:
+            if model_config.is_multimodal_model or model_config.use_mla:


ok yeah that makes sense for some of the red tests
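
For readers skimming the suggestion above: with `and`, the branch only fires for a model that is both multimodal and MLA, so a text-only MLA model never reaches it. With `or`, either property is enough. The sketch below uses a hypothetical `ModelConfigStub` in place of the real config object, and it does not reproduce what the guarded branch actually does, since that body is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class ModelConfigStub:
    """Hypothetical stand-in exposing only the two flags from the diff."""
    is_multimodal_model: bool
    use_mla: bool


def guard_and(cfg):
    # Original condition: requires both flags.
    return cfg.is_multimodal_model and cfg.use_mla


def guard_or(cfg):
    # Suggested condition: either flag is sufficient.
    return cfg.is_multimodal_model or cfg.use_mla


# A text-only model that uses MLA (assumed to be the case this PR targets).
text_only_mla = ModelConfigStub(is_multimodal_model=False, use_mla=True)

print("and:", guard_and(text_only_mla))
print("or: ", guard_or(text_only_mla))
```

The two conditions only agree when both flags are true or both are false, so any test that exercises a text-only MLA model or a non-MLA multimodal model takes a different path after the change.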

mergify · 2025-02-21T01:57:06Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ZhongYingMatrix · 2025-02-23T10:09:19Z

Hi @LucasWilkinson, thanks for your wonderful work!
I am a little confused about the backend returned by get_attn_backend_cls.
Since we should set VLLM_USE_V1 to use chunked prefill, from here, wouldn't we get vllm.v1.attention.backends.flash_attn.FlashAttentionBackend instead of vllm.attention.backends.triton_mla.TritonMLABackend?

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Patrick Horn <patrick.horn@gmail.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

LucasWilkinson changed the title ~~[Attention] WIP MLA with chunked prefill~~ [WIP][Attention] WIP MLA with chunked prefill Feb 1, 2025

LucasWilkinson force-pushed the lwilkinson/chunked-mla branch from f939824 to 77be9af Compare February 4, 2025 21:15

pathorn mentioned this pull request Feb 6, 2025

Implement chunked prefill for Triton MLA attention backend #12800

Closed

LucasWilkinson force-pushed the lwilkinson/chunked-mla branch from 77be9af to bf6a400 Compare February 6, 2025 02:27

mergify bot added the needs-rebase label Feb 6, 2025

LucasWilkinson force-pushed the lwilkinson/chunked-mla branch 2 times, most recently from 463e453 to c542cc4 Compare February 6, 2025 05:24

mergify bot added v1 and removed needs-rebase labels Feb 6, 2025

LucasWilkinson changed the title ~~[WIP][Attention] WIP MLA with chunked prefill~~ [Attention] WIP MLA with chunked prefill Feb 6, 2025

LucasWilkinson marked this pull request as ready for review February 6, 2025 05:49

LucasWilkinson requested review from tlrmchlsmth, WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners February 6, 2025 05:49

mergify bot added the needs-rebase label Feb 7, 2025

LucasWilkinson force-pushed the lwilkinson/chunked-mla branch from 727b265 to c2d5468 Compare February 7, 2025 16:44

mergify bot removed the needs-rebase label Feb 7, 2025

tlrmchlsmth reviewed Feb 7, 2025

vllm/engine/arg_utils.py (outdated, resolved)

tlrmchlsmth reviewed Feb 7, 2025

vllm/attention/backends/utils.py (outdated, resolved)

tlrmchlsmth reviewed Feb 7, 2025

csrc/cuda_utils.h (resolved)

LucasWilkinson force-pushed the lwilkinson/chunked-mla branch from 7bffc5c to de3474d Compare February 12, 2025 01:04

LucasWilkinson added 2 commits February 13, 2025 21:47

chunked mla

4267344

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

add gather cache kernel

2821aed

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

LucasWilkinson added 2 commits February 18, 2025 04:19

format

3a0ae51

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

mypy pass

28464b5

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

tlrmchlsmth added the ready label Feb 18, 2025

tlrmchlsmth enabled auto-merge (squash) February 18, 2025 13:47

tlrmchlsmth mentioned this pull request Feb 18, 2025

set chunked_prefill off when use mla #13374

Closed

tlrmchlsmth and others added 2 commits February 19, 2025 15:55

Merge branch 'main' into lwilkinson/chunked-mla

609267b

fix basic model test

dfb3ada

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

tlrmchlsmth reviewed Feb 19, 2025

LucasWilkinson and others added 3 commits February 19, 2025 21:51

attempt to fix AMD build

9ca182b

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

attempt 2 fix amd build

d325935

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Merge remote-tracking branch 'origin/main' into lwilkinson/chunked-mla

6394a8a

LucasWilkinson mentioned this pull request Feb 20, 2025

[WIP][Kernel] Flashinfer MLA support #13630

Draft

mergify bot added the needs-rebase label Feb 21, 2025

Merge remote-tracking branch 'origin/main' into lwilkinson/chunked-mla

f17599e

mergify bot removed the needs-rebase label Feb 21, 2025

LucasWilkinson added 2 commits February 21, 2025 17:54

Merge remote-tracking branch 'origin/main' into lwilkinson/chunked-mla

c5fbdaa

Merge remote-tracking branch 'origin/main' into lwilkinson/chunked-mla

10c4e54

simon-mo disabled auto-merge February 21, 2025 23:30

simon-mo merged commit 288cc6c into vllm-project:main Feb 21, 2025
47 of 69 checks passed

qli88 mentioned this pull request Feb 23, 2025

[core] MLA performance boost for AMD GPUs and tuned MoE config for MI… #13439

Closed

ZhongYingMatrix mentioned this pull request Feb 23, 2025

[Bug]: Can't deploy DeepSeek R1 with lora failure on vLLM Engine V1 #12891

Closed

1 task

LucasWilkinson mentioned this pull request Feb 26, 2025

[Kernel] FlashMLA integration #13747

Merged

ApostaC mentioned this pull request Mar 1, 2025

[Bug]: Runtime error when running MLA models with "prefix caching enabled" and "chunked prefill disabled" #14069

Open

1 task

hmellor mentioned this pull request Apr 2, 2025

[Performance]: 0.8.1 vs 0.7.4dev122 R1 H20 performance benchmark test，0.8.1 What is the reason for the 14% performance improvement(throughput tokens/s) #15881

Closed

1 task
