Commit Graph

624 Commits

Author SHA1 Message Date
Lucas Wilkinson
0ee3b7fc3d [Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking (#36178)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
(cherry picked from commit eb47454987)
2026-04-01 01:02:58 -07:00
Andreas Karatzas
43cc5138e5 [ROCm][CI] Fix cross-attention dispatch for encoder-decoder models (#38450)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-28 22:08:03 -07:00
Or Ozeri
7cc302dd87 [kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models (#37853)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-03-27 08:38:33 +03:00
Stig-Arne Grönroos
f26fcdfb9e [Bugfix][ROCm] Fix lru_cache on paged_mqa_logits_module (#37547)
Signed-off-by: Stig-Arne Grönroos <stig-arne.gronroos@amd.com>
2026-03-26 19:01:05 +00:00
jennyyyyzhen
a4cf9b22ba [ROCM][Bugfix] Use correct stride in cp_mha_gather_cache_kernel for hybrid model (#37228) (#37228)
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu>
Co-authored-by: yZhen <yZhen@fb.com>
2026-03-26 10:33:39 -07:00
haosdent
0aac2048bf [Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode (#35175)
Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-26 16:13:39 +00:00
Chuan (Richard) Li
cb2263218e [Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos (#35886)
Signed-off-by: Li <chuali@amd.com>
2026-03-26 11:59:24 -04:00
Chauncey
87f05d6880 [Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder (#38076)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-03-26 01:43:51 +00:00
Sathish Sanjeevi
978fc18bf0 [ROCm] Utilize persistent MLA kernel from AITER (#36574)
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com>
2026-03-26 03:00:42 +08:00
Gregory Shtrasberg
189ddefbfd [ROCm] Attention selector reordering (#36702)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2026-03-25 17:42:56 +08:00
Chauncey
09c3dc9186 [Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function (#37968)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-03-25 06:19:37 +00:00
liangel-02
8c47fdfdb1 [FlexAttention] allow custom mask mod (#37692)
Signed-off-by: Angel Li <liangel@meta.com>
2026-03-24 16:03:24 -04:00
Wentao Ye
c59a132f96 [V0 Deprecation] Refactor kv cache from list to element (#37487)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-23 20:10:11 -07:00
Ranran
dc6908ac6a [Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning (#35007)
Signed-off-by: Ranran <1012869439@qq.com>
Signed-off-by: Ranran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-03-23 18:31:14 -04:00
Chuan (Richard) Li
e99fb98867 [ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs (#36100)
Signed-off-by: Li <chuali@amd.com>
2026-03-23 15:48:31 +08:00
Kaihang Jiang
e5ed6c6c13 [BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks (#37475)
Signed-off-by: Kaihang Jiang <kaihangj@nvidia.com>
2026-03-20 16:14:55 -06:00
Lucas Wilkinson
e1d85e5c24 [Attention] Support distinguishing between short extends and decodes (#37303)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-03-20 10:49:36 -07:00
Divakar Verma
4ca3fa6bb4 [ROCm][Bugfix] fix cache block size mismatch for aiter unified attention (#37606)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2026-03-20 00:00:08 +00:00
Elvir Crnčević
ef2c4f778d [Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding (#37442)
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-19 00:28:37 +00:00
Andy Lo
577df69b26 [Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish (#37054)
Signed-off-by: Andy Lo <andy@mistral.ai>
2026-03-18 23:07:29 +00:00
Divakar Verma
e6c4797704 [ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn (#36927)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2026-03-18 08:49:32 +08:00
Andrey Talman
68f783a727 [Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility (#35673)
Signed-off-by: atalman <atalman@fb.com>
2026-03-17 18:47:59 +00:00
Benjamin Chislett
8a680463fa [Bugfix] Fix NemotronH MTP + Chunked Prefill (#35447) 2026-03-17 07:07:33 +01:00
Vadim Gimpelson
6c1cfbad32 Support non-contiguous KV cache in TRTLLM fp8 dequant kernel (#36867)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: Pavani Majety <pavanimajety@gmail.com>
2026-03-16 17:48:42 -07:00
Matthew Bonanni
93f3c8e531 [Misc] Add float16 to CacheDType (#37199)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-16 13:24:48 -07:00
Matthew Bonanni
c88ea8338b [MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible (#36982)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-16 13:51:21 -04:00
haosdent
ca1954d58c [Bugfix] Disable cross-layer KV cache for MLA attention backends (#37090)
Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
2026-03-16 19:03:10 +02:00
Itay Etelis
5ae685c1c8 [Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout (#34158)
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
2026-03-16 11:20:51 -04:00
haosdent
116ed130f4 [Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches (#34871)
Signed-off-by: haosdent <haosdent@gmail.com>
2026-03-16 10:30:23 +01:00
Vadim Gimpelson
8374387bd8 [FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell (#36987)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-03-16 09:04:29 +00:00
Dimitrios Bariamis
367cf5cd3e [Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype (#36931)
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
2026-03-13 16:41:16 -07:00
Rohan Potdar
a4ad9db541 Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) (#35786)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
2026-03-13 07:33:22 +00:00
Matthew Bonanni
f444c05c32 [Attention] Use FA4 for MLA prefill (#34732)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-12 12:10:17 -04:00
grimulkan
a1257fd1ea [Kernel] Add FP8 KV cache support to Triton MLA decode attention (#34597)
Signed-off-by: grimulkan <grimulkan@gmail.com>
2026-03-12 08:32:34 -07:00
Wuxun Zhang
e584dce52b Add XPU MLA Sparse backend for DeepSeek v3.2 (#33230)
Signed-off-by: Zhang, Wuxun <wuxun.zhang@intel.com>
2026-03-11 19:19:15 +08:00
JartX
a40ee486f2 [Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 (#35923)
Signed-off-by: JartX <sagformas@epdcenter.es>
Co-authored-by: akaratza <akaratza@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2026-03-11 07:45:57 +00:00
pschlan-amd
eac2dc2b41 AITER MLA backend: Avoid CPU sync in _build_decode (#35765)
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
2026-03-11 07:25:00 +00:00
Matthew Bonanni
5f77ef15ae [Misc][Attention] Clean up unused method in CPU_ATTN (#36673)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-10 21:27:22 -07:00
Benjamin Chislett
9040cd40af [DSV3.2][MTP] Optimize Indexer MTP handling (#36723)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2026-03-11 12:16:56 +08:00
Woosuk Kwon
195d1ca3e8 [Minor] Enhance error message for TRTLLM decode uniformity check (#36609)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-03-10 15:38:45 -07:00
Pleaplusone
82f3f30e26 [ROCm][Perf] Enable sparse_mla's cudagraph on ROCm platform (#35719)
Signed-off-by: ganyi <ygan@amd.com>
2026-03-10 09:14:35 -07:00
Matthew Bonanni
9095cbbfb6 [Bugfix][Sparse MLA] report indexer CG support properly (#36519)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-10 09:14:31 -07:00
Vadim Gimpelson
4ff8c3c8f9 [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-03-10 03:32:20 -07:00
Woosuk Kwon
006aea17d7 [BugFix] Remove incorrect assert in split_decodes_and_prefills (#36553)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-03-09 20:02:02 -07:00
Andreas Karatzas
c174d54f86 [ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks (#36292)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-09 12:02:41 -05:00
Roberto L. Castro
580864d81e [Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
2026-03-09 09:50:36 -07:00
Matthew Bonanni
77a73458e3 Reapply [Attention] Refactor check_and_update_config (#35122)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-09 07:17:14 -07:00
cong-or
747431044d feat(attention): extract KV-cache update from FlexAttention backend (#36263)
Signed-off-by: cong-or <conchubhar.gannon@gmail.com>
2026-03-08 20:40:12 -07:00
Wei Zhao
379689d533 [Perf] Support FP8 KV cache for Flashinfer MLA Sparse (#35891) 2026-03-07 13:51:54 -08:00
Mengtao (Martin) Yuan
1a9718085c Fix CUDA graph decode capture crash in AITER FlashAttention (#36042)
Signed-off-by: Martin Yuan <myuan@meta.com>
Co-authored-by: Martin Yuan <myuan@meta.com>
2026-03-06 18:12:07 -08:00