
Paged Attention support for FA3 #1268
Merged: 1 commit merged on Nov 10, 2024

Conversation

kadeng (Contributor) commented Oct 10, 2024

Adds support for Paged Attention / block tables to the Flash-Attention 3 kernel.

Limits:

Test Plan:

cd hopper
pytest test_flash_attn.py -k "test_flash_attn_varlen_paged"  -s

Update
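
For context, here is a minimal sketch of what a paged KV cache plus block table looks like from the caller's side. The cache layout, shapes, and the `block_table` argument name are assumptions inferred from the PR description, not the verified hopper interface signature; the final kernel call is left as a hypothetical comment.

```python
# Sketch only (requires a CUDA device): paged KV cache layout and a block table,
# as assumed from the PR description. The FA3 call at the end is hypothetical.
import torch

num_blocks, block_size, nheads_k, d = 64, 256, 8, 128
# K/V live in fixed-size physical blocks instead of one contiguous buffer per sequence.
k_cache = torch.randn(num_blocks, block_size, nheads_k, d,
                      device="cuda", dtype=torch.bfloat16)
v_cache = torch.randn_like(k_cache)

# block_table[b, i] is the physical block that holds tokens
# [i * block_size, (i + 1) * block_size) of sequence b.
batch_size, max_blocks_per_seq = 2, 4
block_table = torch.randint(0, num_blocks, (batch_size, max_blocks_per_seq),
                            dtype=torch.int32, device="cuda")

# Hypothetical call shape (argument name assumed, not the actual signature):
# out = flash_attn_varlen_func(q, k_cache, v_cache,
#                              cu_seqlens_q=cu_seqlens_q, max_seqlen_q=max_seqlen_q,
#                              ..., block_table=block_table)
```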

kadeng marked this pull request as ready for review on October 10, 2024 16:30
kadeng force-pushed the main branch 2 times, most recently from a5bac6b to 956692c on October 17, 2024 12:46
kadeng (Contributor, Author) commented Oct 22, 2024 via email

alexngng commented:

Thank you for your response!

kadeng (Contributor, Author) commented Oct 29, 2024

I was investigating a flaky test failure on this PR and narrowed it down to a preexisting issue: the flash attention varlen implementation does not yet work correctly for head dimension d=256. I have disabled testing for d=256 for now, mirroring how it is handled in hopper/test_flash_attn.py::test_flash_attn_varlen_output.
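
For illustration, disabling a single parameter value in pytest can be done with a skip mark on that case; the head-dim values and test name below are illustrative, not copied from hopper/test_flash_attn.py.

```python
# Illustrative sketch: keep d=256 in the parameter list but skip it until the
# preexisting varlen issue at d=256 is resolved. Values/names are assumptions.
import pytest


@pytest.mark.parametrize(
    "d",
    [64, 128, 192,
     pytest.param(256, marks=pytest.mark.skip(reason="varlen path not yet correct for d=256"))],
)
def test_flash_attn_varlen_paged(d):
    ...  # build paged K/V cache, run the kernel, compare against a reference
```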
