Commit Graph

612 Commits

Author SHA1 Message Date
Francesco Fusco
5b55c0bea7 [Attention] Clarify comment explaining attn_logits +1 dimension (#33427)
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
2026-01-31 04:50:30 +00:00
Russell Bryant
a2ef06e1b3 [Misc] offest -> offset in comments and variable names (#33444)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2026-01-30 20:19:22 -08:00
Lucas Wilkinson
0a3c71e7e5 [BugFix] Fix whisper FA2 + full cudagraphs (#33360)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-31 12:15:06 +08:00
Matthew Bonanni
aaa901ad55 [Attention] Move MLA forward from backend to layer (#33284)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-30 19:30:00 -08:00
杨朱 · Kiki
1a7894dbdf [Misc] Replace Optional[X] with X | None syntax (#33332)
Signed-off-by: carlory <baofa.fan@daocloud.io>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 01:56:59 -08:00
Gregory Shtrasberg
ab597c869a [Bugfix] Add missing encoder only guard for do_kv_cache_update (#33269)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2026-01-28 21:25:07 +00:00
Gregory Shtrasberg
22ad649501 [ROCm] Enabling forward_includes_kv_cache on ROCm MHA backends (#33106)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2026-01-28 14:36:14 +08:00
Harry Mellor
2eb673a088 Add flake8-implicit-str-concat rules to Ruff (#33191)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-01-28 04:56:10 +00:00
Wentao Ye
3a6d5cbefd [Perf] Optimize dcp allocate tensor (#33102)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-27 17:24:41 -05:00
Matthew Bonanni
a608b4c6c2 [5/N][Attention] Finish eliminating vllm/attention folder (#32064)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-27 10:02:51 -05:00
Nicolò Lucchesi
1f3a2c2944 [Bugfix] Disable CG for Whisper+FA2 (#33164)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-27 21:46:51 +08:00
Strahinja Stamenkovic
c568581ff3 Fix IndexError with encoder-decoder models when using Custom Paged Attention (#33112)
Signed-off-by: sstamenk <strahinja.stamenkovic@amd.com>
2026-01-27 10:33:37 +08:00
ElizaWszola
a28b94e6ef [Performance] Split FlashAttn attention and cache update (#25954)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Luka Govedič <luka.govedic@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <luka.govedic@gmail.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
2026-01-23 17:28:06 -08:00
Markus / Mark
586a57ad7e fix: Add glm4_moe_lite to MLA detection (#32614)
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2026-01-23 12:38:57 -08:00
Harry Huang
5206e5e28c [V1][Hybrid] Mamba Prefix Caching with align mode (#30877)
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2026-01-23 09:56:48 -08:00
tianshu-Michael-yu
13d8746c54 [Feature]: Remove DtoH Copy for lfm2_vl On Default Stream (#32815)
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>
2026-01-23 13:20:30 +00:00
Nicolò Lucchesi
160c6fa387 [Misc] Add get_name to missing AttentionBackends (#32698)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-23 10:35:44 +00:00
Andreas Karatzas
a8eb1182f1 [CI][Models] Add VLM Support for Sequence Classification Conversion (#32885)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-23 16:22:51 +08:00
Wentao Ye
7ef5873752 [CI] Fix mypy for vllm/v1/structured_output (#32722)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-23 11:55:51 +08:00
Eldar Kurtić
44f08af3a7 Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) (#30141)
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>
2026-01-22 13:29:57 -07:00
Matthew Bonanni
955b43a5a5 [Bugfix][Attention] Explicitly report support for kv_cache_dtype bfloat16 (#32795)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-22 19:05:18 +00:00
Lucas Wilkinson
889722f3bf [FlashMLA] Update FlashMLA to expose new arguments (#32810)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-21 22:02:39 -07:00
Nick Hill
24dc30f7ff [ModelRunner V2] Don't pin reused flashinfer tensors (#32799)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-21 13:17:43 -08:00
elvischenv
808d6fd7b9 Bump Flashinfer to v0.6.1 (#30993)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2026-01-21 08:49:50 -08:00
Pleaplusone
6c20e89c02 [ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp (#29287)
Signed-off-by: ganyi <ygan@amd.com>
2026-01-21 23:16:30 +08:00
Robert Shaw
42135d6898 [MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414) 2026-01-21 08:22:33 -05:00
Lucas Wilkinson
b4f64e5b02 Update FlashMLA (#32491)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-21 13:03:37 +08:00
Tomas Ruiz
4a5299c93f feat: spec decode with draft models (#24322)
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
2026-01-19 16:05:46 -05:00
Vadim Gimpelson
6101a26dc9 [BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 (#32417)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-01-18 16:57:32 -08:00
Wentao Ye
16de822c71 [Refactor] Remove unused file pallas_kv_cache_update.py (#32433)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-18 12:46:39 -08:00
Li Xie
c826c72a96 [Model] Support Step1 Model (#32511)
Signed-off-by: xieli <xieli@stepfun.com>
2026-01-18 10:20:46 +00:00
Isotr0py
8cc26acd8b [Performance] Improve Triton prefill attention kernel's performance (#32403)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-17 20:19:59 -08:00
Guofang.Tang
2b99f210f5 [Misc] Fix typo: seperator -> separator in flashmla_sparse.py (#32411)
Signed-off-by: Guofang Tang <tinggofun@gmail.com>
Co-authored-by: Guofang Tang <tinggofun@gmail.com>
2026-01-17 12:18:30 +00:00
Matthias Gehre
047413375c [Attention][AMD] Make flash-attn optional (#30361)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
2026-01-15 17:18:24 +00:00
Pleaplusone
130d6c9514 [ROCm][Perf] Enable shuffle kv cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887)
Signed-off-by: ganyi <ygan@amd.com>
2026-01-15 15:29:53 +00:00
vllmellm
e27078ea80 [Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten (#32336)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-14 19:32:48 +00:00
Matthew Bonanni
2263d44b68 [4/N][Attention] Move MLA common to model_executor (#32060)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-13 09:08:45 -08:00
Matthew Bonanni
98f60e5acb [6/N][Attention] Move utils to more appropriate locations (#32215)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-13 05:38:52 -08:00
Mickaël Seznec
a5bbbd2f24 [Quantization] fix: overflow with static per-tensor scaling (#29867)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-13 12:56:01 +00:00
cjackal
15b33ff064 [Misc] improve warning/assert messages (#32226)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2026-01-13 03:11:23 +00:00
Matthew Bonanni
20228cb851 [3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py (#32054)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-12 09:13:56 -08:00
Asaf Joseph Gardin
8fb2c135be [Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode (#32118)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
2026-01-12 17:02:38 +00:00
Isotr0py
9dbe1fe960 [Bugfix] Fix missing scale passing for encoder Triton Attention implementation (#32149)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-12 11:13:41 +00:00
Vadim Gimpelson
e15a5ff07b [MISC] Add strict contiguity check for FlashInfer attention tensors (#32008)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
2026-01-10 12:40:05 -08:00
jvlunteren
b8bf5c45bb [Kernel] Optimize Sliding Window Attention in 3D Triton Kernel (#31984)
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
2026-01-10 18:13:44 +00:00
Lucas Wilkinson
da6709c9fe [Misc] Delay deprecation of CommonAttentionMetadata properties (#32074)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-09 21:06:44 -08:00
Lucas Kabela
ea6d067a2a [Misc][LLaMa4] Compile LLaMa Vision Encoder (#30709)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-01-09 22:01:38 -05:00
Matthew Bonanni
0308901975 [2/N][Attention] Fix pre-commit errors (#32052)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-10 00:27:15 +00:00
Matthew Bonanni
2612ba9285 [1/N][Attention] Restructure attention: move files (#31916)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-09 13:10:24 -08:00
R3hankhan
8e27663b6a [CPU] Add head sizes 80 and 112 with vec16 fallback (#31968)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2026-01-09 22:14:46 +08:00