biondizzle/vllm - vllm

Author	SHA1	Message	Date
Elvir Crnčević	ef2c4f778d	[Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding (#37442) Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-19 00:28:37 +00:00
Andy Lo	577df69b26	[Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish (#37054) Signed-off-by: Andy Lo <andy@mistral.ai>	2026-03-18 23:07:29 +00:00
Divakar Verma	e6c4797704	[ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn (#36927) Signed-off-by: Divakar Verma <divakar.verma@amd.com>	2026-03-18 08:49:32 +08:00
Andrey Talman	68f783a727	[Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility (#35673) Signed-off-by: atalman <atalman@fb.com>	2026-03-17 18:47:59 +00:00
Benjamin Chislett	8a680463fa	[Bugfix] Fix NemotronH MTP + Chunked Prefill (#35447)	2026-03-17 07:07:33 +01:00
Vadim Gimpelson	6c1cfbad32	Support non-contiguous KV cache in TRTLLM fp8 dequant kernel (#36867) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: Pavani Majety <pavanimajety@gmail.com>	2026-03-16 17:48:42 -07:00
Matthew Bonanni	93f3c8e531	[Misc] Add `float16` to `CacheDType` (#37199) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-16 13:24:48 -07:00
Matthew Bonanni	c88ea8338b	[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible (#36982) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-16 13:51:21 -04:00
haosdent	ca1954d58c	[Bugfix] Disable cross-layer KV cache for MLA attention backends (#37090) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Or Ozeri <oro@il.ibm.com>	2026-03-16 19:03:10 +02:00
Itay Etelis	5ae685c1c8	[Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout (#34158) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com>	2026-03-16 11:20:51 -04:00
haosdent	116ed130f4	[Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches (#34871) Signed-off-by: haosdent <haosdent@gmail.com>	2026-03-16 10:30:23 +01:00
Vadim Gimpelson	8374387bd8	[FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell (#36987) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-03-16 09:04:29 +00:00
Dimitrios Bariamis	367cf5cd3e	[Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype (#36931) Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com> Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>	2026-03-13 16:41:16 -07:00
Rohan Potdar	a4ad9db541	Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) (#35786) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>	2026-03-13 07:33:22 +00:00
Matthew Bonanni	f444c05c32	[Attention] Use FA4 for MLA prefill (#34732) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-12 12:10:17 -04:00
grimulkan	a1257fd1ea	[Kernel] Add FP8 KV cache support to Triton MLA decode attention (#34597) Signed-off-by: grimulkan <grimulkan@gmail.com>	2026-03-12 08:32:34 -07:00
Wuxun Zhang	e584dce52b	Add XPU MLA Sparse backend for DeepSeek v3.2 (#33230) Signed-off-by: Zhang, Wuxun <wuxun.zhang@intel.com>	2026-03-11 19:19:15 +08:00
JartX	a40ee486f2	[Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 (#35923) Signed-off-by: JartX <sagformas@epdcenter.es> Co-authored-by: akaratza <akaratza@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>	2026-03-11 07:45:57 +00:00
pschlan-amd	eac2dc2b41	AITER MLA backend: Avoid CPU sync in _build_decode (#35765) Signed-off-by: Patrick Schlangen <pschlan@amd.com>	2026-03-11 07:25:00 +00:00
Matthew Bonanni	5f77ef15ae	[Misc][Attention] Clean up unused method in `CPU_ATTN` (#36673) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-10 21:27:22 -07:00
Benjamin Chislett	9040cd40af	[DSV3.2][MTP] Optimize Indexer MTP handling (#36723) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>	2026-03-11 12:16:56 +08:00
Woosuk Kwon	195d1ca3e8	[Minor] Enhance error message for TRTLLM decode uniformity check (#36609) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>	2026-03-10 15:38:45 -07:00
Pleaplusone	82f3f30e26	[ROCm][Perf] Enable `sparse_mla`'s cudagraph on ROCm platform (#35719) Signed-off-by: ganyi <ygan@amd.com>	2026-03-10 09:14:35 -07:00
Matthew Bonanni	9095cbbfb6	[Bugfix][Sparse MLA] report indexer CG support properly (#36519) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-10 09:14:31 -07:00
Vadim Gimpelson	4ff8c3c8f9	[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-03-10 03:32:20 -07:00
Woosuk Kwon	006aea17d7	[BugFix] Remove incorrect assert in split_decodes_and_prefills (#36553) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>	2026-03-09 20:02:02 -07:00
Andreas Karatzas	c174d54f86	[ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks (#36292) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-09 12:02:41 -05:00
Roberto L. Castro	580864d81e	[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917) Signed-off-by: LopezCastroRoberto <rocastro@redhat.com> Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>	2026-03-09 09:50:36 -07:00
Matthew Bonanni	77a73458e3	Reapply [Attention] Refactor `check_and_update_config` (#35122) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-09 07:17:14 -07:00
cong-or	747431044d	feat(attention): extract KV-cache update from FlexAttention backend (#36263) Signed-off-by: cong-or <conchubhar.gannon@gmail.com>	2026-03-08 20:40:12 -07:00
Wei Zhao	379689d533	[Perf] Support FP8 KV cache for Flashinfer MLA Sparse (#35891)	2026-03-07 13:51:54 -08:00
Mengtao (Martin) Yuan	1a9718085c	Fix CUDA graph decode capture crash in AITER FlashAttention (#36042) Signed-off-by: Martin Yuan <myuan@meta.com> Co-authored-by: Martin Yuan <myuan@meta.com>	2026-03-06 18:12:07 -08:00
Chuan (Richard) Li	c188749bcd	[ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) (#35850) Signed-off-by: Li <chuali@amd.com>	2026-03-06 20:24:03 +00:00
Andreas Karatzas	807d680337	[ROCm][CI] Fix tool use test stability - disable skinny GEMM, prefix caching, eliminate batch variance (#35553) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-06 15:15:12 +08:00
Rohan Potdar	c5362c739f	Reenable features for ROCm attention backends (#36185) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>	2026-03-05 20:21:06 -08:00
Dor Huri	ebed80a7c8	[Performance] Extract KV-cache update from TreeAttention backend (#35384) Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il>	2026-03-06 00:22:43 +00:00
Frank Wang	a57c877f18	[BugFix] Fallback from FA4->FA2 for Batch Invariance (#36059) Signed-off-by: frankwang28 <frank.wbb@hotmail.com>	2026-03-05 14:05:56 -05:00
Jiayi Yan	6a895197fa	[Bugfix][CI] fix typos (#34934) Signed-off-by: 1195343015 <1195343015@qq.com> Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-03-05 17:05:46 +00:00
Sage Moore	8c760b6ab6	[ROCm] Refactor ROCm attention backend selection logic (#35246) Signed-off-by: Sage Moore <sage@neuralmagic.com>	2026-03-05 10:51:26 -06:00
Harry Mellor	17dc9c7fc9	[CI] Bump `mypy` version (#34950) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-03-04 20:55:11 +00:00
sungsoo ha	6cb901093f	[Core] Add All-to-All communication backend for DCP (#34883) Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com> Signed-off-by: sungsoo ha <hasungsoo@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-03-04 10:01:57 -05:00
Lucas Wilkinson	28ef9ba399	[BugFix] Add support for MTP num_speculative_tokens > 1 with sparse MLA (#34552) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>	2026-03-03 07:21:57 -08:00
ElizaWszola	d9c7730877	[Performance] Extract kv update ops from MLA attention backends (#34627) Signed-off-by: ElizaWszola <ewszola@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Di Wu <dw2761@nyu.edu> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-03-02 10:43:19 -05:00
wangxiyuan	510bc9e1df	[Misc] Cleanup useless `current_platform` import (#35715) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-02 09:36:54 +00:00
Lucas Wilkinson	8b5014d3dd	[Attention] FA4 integration (#32974) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2026-03-01 23:44:57 +00:00
zhanqiuhu	57a96e26c9	Revert "[Bugfix] Disable TRTLLM attention with KV transfer enabled (#33192)" (#34832) Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>	2026-03-01 22:32:37 +00:00
Asaf Gardin	bbf81f9a92	[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching (#34798) Signed-off-by: Josephasafg <ajgard7@gmail.com>	2026-03-01 20:40:23 +08:00
Chauncey	7e08c22b8c	[Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function (#35271) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2026-02-28 10:12:00 +00:00
Micah Williamson	0edf101d2b	[ROCm] Add `stablelm` Head Size 80 To Supported Head Sizes For ROCM_ATTN (#35527) Signed-off-by: Micah Williamson <micah.williamson@amd.com>	2026-02-28 12:16:34 +08:00
Gregory Shtrasberg	9fa6c68fa6	[ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends (#35334) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>	2026-02-27 21:32:55 +00:00

606 Commits