biondizzle/vllm - vllm

Author	SHA1	Message	Date
Lucas Wilkinson	0ee3b7fc3d	[Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking (#36178 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> (cherry picked from commit `eb47454987`)	2026-04-01 01:02:58 -07:00
Andreas Karatzas	43cc5138e5	[ROCm][CI] Fix cross-attention dispatch for encoder-decoder models (#38450 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-28 22:08:03 -07:00
Or Ozeri	7cc302dd87	[kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models (#37853 ) Signed-off-by: Or Ozeri <oro@il.ibm.com>	2026-03-27 08:38:33 +03:00
Stig-Arne Grönroos	f26fcdfb9e	[Bugfix][ROCm] Fix lru_cache on paged_mqa_logits_module (#37547 ) Signed-off-by: Stig-Arne Grönroos <stig-arne.gronroos@amd.com>	2026-03-26 19:01:05 +00:00
jennyyyyzhen	a4cf9b22ba	[ROCM][Bugfix] Use correct stride in cp_mha_gather_cache_kernel for hybrid model (#37228 ) Signed-off-by: jennyyyyzhen <yzhen@hmc.edu> Co-authored-by: yZhen <yZhen@fb.com>	2026-03-26 10:33:39 -07:00
haosdent	0aac2048bf	[Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode (#35175 ) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-26 16:13:39 +00:00
Chuan (Richard) Li	cb2263218e	[Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos (#35886 ) Signed-off-by: Li <chuali@amd.com>	2026-03-26 11:59:24 -04:00
Chauncey	87f05d6880	[Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder (#38076 ) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2026-03-26 01:43:51 +00:00
Sathish Sanjeevi	978fc18bf0	[ROCm] Utilize persistent MLA kernel from AITER (#36574 ) Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com>	2026-03-26 03:00:42 +08:00
Gregory Shtrasberg	189ddefbfd	[ROCm] Attention selector reordering (#36702 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com>	2026-03-25 17:42:56 +08:00
Chauncey	09c3dc9186	[Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function (#37968 ) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2026-03-25 06:19:37 +00:00
liangel-02	8c47fdfdb1	[FlexAttention] allow custom mask mod (#37692 ) Signed-off-by: Angel Li <liangel@meta.com>	2026-03-24 16:03:24 -04:00
Wentao Ye	c59a132f96	[V0 Deprecation] Refactor kv cache from list to element (#37487 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-03-23 20:10:11 -07:00
Ranran	dc6908ac6a	[Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning (#35007 ) Signed-off-by: Ranran <1012869439@qq.com> Signed-off-by: Ranran <hzz5361@psu.edu> Signed-off-by: ran <hzz5361@psu.edu> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2026-03-23 18:31:14 -04:00
Chuan (Richard) Li	e99fb98867	[ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs (#36100 ) Signed-off-by: Li <chuali@amd.com>	2026-03-23 15:48:31 +08:00
Kaihang Jiang	e5ed6c6c13	[BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks (#37475 ) Signed-off-by: Kaihang Jiang <kaihangj@nvidia.com>	2026-03-20 16:14:55 -06:00
Lucas Wilkinson	e1d85e5c24	[Attention] Support distinguishing between short extends and decodes (#37303 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-03-20 10:49:36 -07:00
Divakar Verma	4ca3fa6bb4	[ROCm][Bugfix] fix cache block size mismatch for aiter unified attention (#37606 ) Signed-off-by: Divakar Verma <divakar.verma@amd.com>	2026-03-20 00:00:08 +00:00
Elvir Crnčević	ef2c4f778d	[Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding (#37442 ) Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-19 00:28:37 +00:00
Andy Lo	577df69b26	[Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish (#37054 ) Signed-off-by: Andy Lo <andy@mistral.ai>	2026-03-18 23:07:29 +00:00
Divakar Verma	e6c4797704	[ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn (#36927 ) Signed-off-by: Divakar Verma <divakar.verma@amd.com>	2026-03-18 08:49:32 +08:00
Andrey Talman	68f783a727	[Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility (#35673 ) Signed-off-by: atalman <atalman@fb.com>	2026-03-17 18:47:59 +00:00
Benjamin Chislett	8a680463fa	[Bugfix] Fix NemotronH MTP + Chunked Prefill (#35447 )	2026-03-17 07:07:33 +01:00
Vadim Gimpelson	6c1cfbad32	Support non-contiguous KV cache in TRTLLM fp8 dequant kernel (#36867 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: Pavani Majety <pavanimajety@gmail.com>	2026-03-16 17:48:42 -07:00
Matthew Bonanni	93f3c8e531	[Misc] Add `float16` to `CacheDType` (#37199 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-16 13:24:48 -07:00
Matthew Bonanni	c88ea8338b	[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible (#36982 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-16 13:51:21 -04:00
haosdent	ca1954d58c	[Bugfix] Disable cross-layer KV cache for MLA attention backends (#37090 ) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Or Ozeri <oro@il.ibm.com>	2026-03-16 19:03:10 +02:00
Itay Etelis	5ae685c1c8	[Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout (#34158 ) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com>	2026-03-16 11:20:51 -04:00
haosdent	116ed130f4	[Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches (#34871 ) Signed-off-by: haosdent <haosdent@gmail.com>	2026-03-16 10:30:23 +01:00
Vadim Gimpelson	8374387bd8	[FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell (#36987 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-03-16 09:04:29 +00:00
Dimitrios Bariamis	367cf5cd3e	[Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype (#36931 ) Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com> Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>	2026-03-13 16:41:16 -07:00
Rohan Potdar	a4ad9db541	Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) (#35786 ) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>	2026-03-13 07:33:22 +00:00
Matthew Bonanni	f444c05c32	[Attention] Use FA4 for MLA prefill (#34732 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-12 12:10:17 -04:00
grimulkan	a1257fd1ea	[Kernel] Add FP8 KV cache support to Triton MLA decode attention (#34597 ) Signed-off-by: grimulkan <grimulkan@gmail.com>	2026-03-12 08:32:34 -07:00
Wuxun Zhang	e584dce52b	Add XPU MLA Sparse backend for DeepSeek v3.2 (#33230 ) Signed-off-by: Zhang, Wuxun <wuxun.zhang@intel.com>	2026-03-11 19:19:15 +08:00
JartX	a40ee486f2	[Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 (#35923 ) Signed-off-by: JartX <sagformas@epdcenter.es> Co-authored-by: akaratza <akaratza@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>	2026-03-11 07:45:57 +00:00
pschlan-amd	eac2dc2b41	AITER MLA backend: Avoid CPU sync in _build_decode (#35765 ) Signed-off-by: Patrick Schlangen <pschlan@amd.com>	2026-03-11 07:25:00 +00:00
Matthew Bonanni	5f77ef15ae	[Misc][Attention] Clean up unused method in `CPU_ATTN` (#36673 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-10 21:27:22 -07:00
Benjamin Chislett	9040cd40af	[DSV3.2][MTP] Optimize Indexer MTP handling (#36723 ) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>	2026-03-11 12:16:56 +08:00
Woosuk Kwon	195d1ca3e8	[Minor] Enhance error message for TRTLLM decode uniformity check (#36609 ) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>	2026-03-10 15:38:45 -07:00
Pleaplusone	82f3f30e26	[ROCm][Perf] Enable `sparse_mla`'s cudagraph on ROCm platform (#35719 ) Signed-off-by: ganyi <ygan@amd.com>	2026-03-10 09:14:35 -07:00
Matthew Bonanni	9095cbbfb6	[Bugfix][Sparse MLA] report indexer CG support properly (#36519 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-10 09:14:31 -07:00
Vadim Gimpelson	4ff8c3c8f9	[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-03-10 03:32:20 -07:00
Woosuk Kwon	006aea17d7	[BugFix] Remove incorrect assert in split_decodes_and_prefills (#36553 ) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>	2026-03-09 20:02:02 -07:00
Andreas Karatzas	c174d54f86	[ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks (#36292 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-09 12:02:41 -05:00
Roberto L. Castro	580864d81e	[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917 ) Signed-off-by: LopezCastroRoberto <rocastro@redhat.com> Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>	2026-03-09 09:50:36 -07:00
Matthew Bonanni	77a73458e3	Reapply [Attention] Refactor `check_and_update_config` (#35122 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-09 07:17:14 -07:00
cong-or	747431044d	feat(attention): extract KV-cache update from FlexAttention backend (#36263 ) Signed-off-by: cong-or <conchubhar.gannon@gmail.com>	2026-03-08 20:40:12 -07:00
Wei Zhao	379689d533	[Perf] Support FP8 KV cache for Flashinfer MLA Sparse (#35891 )	2026-03-07 13:51:54 -08:00
Mengtao (Martin) Yuan	1a9718085c	Fix CUDA graph decode capture crash in AITER FlashAttention (#36042 ) Signed-off-by: Martin Yuan <myuan@meta.com> Co-authored-by: Martin Yuan <myuan@meta.com>	2026-03-06 18:12:07 -08:00

624 Commits