Commit Graph

19 Commits

Author SHA1 Message Date
elvischenv
8dfb45ca33 [Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel (#21133) 2025-07-18 00:35:58 +00:00
Lucas Wilkinson
76b494444f [Attention] Refactor attention metadata builder interface (#20466)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-17 04:44:25 +00:00
Cyrus Leung
e8cc53af5e [Misc] Log the reason for falling back to FlexAttention (#20699)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-14 04:16:51 -07:00
Pavani Majety
7bd4c37ae7 [Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (#19825)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: shuw <shuw@nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-11 09:23:23 +00:00
Alexander Matveev
5b032352cc [Attention] MLA - Flashinfer Ragged Prefill (#20034) 2025-07-10 20:17:47 -07:00
Cyrus Leung
9fb52e523a [V1] Support any head size for FlexAttention backend (#20467)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-06 09:54:36 -07:00
Nicolò Lucchesi
4c8f64faa7 [V1][Kernel] Flashinfer HND KV cache layout (#19280)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-06-17 09:09:22 -04:00
22quinn
0b73736a0d [Kernel] Raise verbose error and consolidate num_heads/num_kv_heads divisibility check (#19339)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-06-15 13:43:48 +08:00
Luka Govedič
3597b06a4f [CUDA] Enable full cudagraph for FlashMLA (#18581)
Signed-off-by: luka <luka@neuralmagic.com>
2025-06-13 18:12:26 +00:00
Luka Govedič
f98548b9da [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (#16756)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
2025-06-12 08:31:04 -07:00
Rachel Guo
467bef18a3 [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword use_irope (#19134)
Signed-off-by: Yunqiu Guo <guorachel@meta.com>
2025-06-10 16:48:51 +08:00
Yong Hoon Shin
bdf13965ab [V1] Support cross-layer KV sharing (#18212)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-06-03 20:33:07 +00:00
Simon Mo
02f0c7b220 [Misc] Add SPDX-FileCopyrightText (#19100)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-06-03 11:20:17 -07:00
Michael Goin
12e6c0b41c [Bugfix][V1] Fix FlashInfer V1 backend using the wrong VllmConfig (#18086) 2025-05-13 20:36:17 -07:00
Chen Zhang
950751a987 [v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders (#17483)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-10 16:12:04 -07:00
Chen Zhang
cba31c47c4 [v1] AttentionMetadata for each layer (#17394)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-06 07:58:37 -07:00
Chen Zhang
24e6ad3f16 [V1] Remove num_input_tokens from attn_metadata (#17193)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-04-29 09:28:41 -07:00
Chen Zhang
838cedade7 [Bugfix] Get a specific type of layer from forward context (#17222)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-04-27 00:58:05 -07:00
Michael Goin
986537f1c3 [V1] V1 FlashInfer Attention (#16684)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
2025-04-22 00:38:41 +00:00