Commit Graph

29 Commits

Author SHA1 Message Date
haosdent
116ed130f4 [Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches (#34871)
Signed-off-by: haosdent <haosdent@gmail.com>
2026-03-16 10:30:23 +01:00
Harry Mellor
17dc9c7fc9 [CI] Bump mypy version (#34950)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-04 20:55:11 +00:00
Woosuk Kwon
0916e7960b [GDN] Use CPU tensors to build GDN metadata (#34498)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-02-13 01:24:45 -08:00
Vadim Gimpelson
000214c4bb [BUGFIX] Fix accuracy bugs in Qwen3-Next MTP (#34077)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-02-10 10:57:11 -05:00
Harry Huang
5206e5e28c [V1][Hybrid] Mamba Prefix Caching with align mode (#30877)
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2026-01-23 09:56:48 -08:00
tianshu-Michael-yu
13d8746c54 [Feature]: Remove DtoH Copy for lfm2_vl On Default Stream (#32815)
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>
2026-01-23 13:20:30 +00:00
Nicolò Lucchesi
160c6fa387 [Misc] Add get_name to missing AttentionBackends (#32698)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-23 10:35:44 +00:00
Matthew Bonanni
20228cb851 [3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py (#32054)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-12 09:13:56 -08:00
Matthew Bonanni
2612ba9285 [1/N][Attention] Restructure attention: move files (#31916)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-09 13:10:24 -08:00
Cyrus Leung
b665bbc2d4 [Chore] Migrate V0 attention utils (#31891)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 13:44:36 +00:00
Jack Yang
0a2c2dc3f1 fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround (#31465)
Signed-off-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-07 04:08:47 +00:00
Lucas Wilkinson
4c73be14e0 [Attention][2/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31774)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-06 17:32:14 +00:00
Benjamin Chislett
85aff45e24 [Perf] Remove blocking copy in GDN Attention (#31167)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-12-22 14:25:22 -08:00
drslark
add1b9d3de [main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#30632)
Signed-off-by: drslark <slarksblood@qq.com>
2025-12-14 01:32:16 -08:00
Lucas Wilkinson
abe93bce59 [Attention] Make seq_lens_cpu optional in CommonAttentionMetadata to enable true async spec-decode (#29624)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
2025-12-09 17:18:10 -08:00
Matthew Bonanni
1d93f11675 [Attention][CUDAGraph] Remove CG padding from attention backends (#29352)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-02 13:48:08 -05:00
Benjamin Chislett
304419576a [Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer (#28479)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-11-13 01:56:40 +09:00
fhl2000
284cc92275 [MISC] cudagraph_capture_sizes related improvements (#26016)
Signed-off-by: fhl <2410591650@qq.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-24 05:11:05 -07:00
Vadim Gimpelson
785d8b6410 [PERF] Qwen3-next MTP speedup (change bool mask indexing to index_select / index_copy to reduce d2h) (#26437)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-10-16 12:18:31 +08:00
Harry Mellor
8fcaaf6a16 Update Optional[x] -> x | None and Union[x, y] to x | y (#26633)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-12 09:51:31 -07:00
Roger Wang
43c146ca42 [Misc] Clean up unnecessary E501 ignore (#26274)
Signed-off-by: Roger Wang <hey@rogerw.io>
2025-10-06 07:29:18 +00:00
Harry Mellor
d6953beb91 Convert formatting to use ruff instead of yapf + isort (#26247)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-05 07:06:22 -07:00
Tao He
99b3a504c5 [Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. (#25743)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
2025-09-26 01:18:58 -07:00
Benjamin Chislett
c30b405b8f [Spec Decode] Enable FlashInfer Spec Decoding (#25196)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: lhsjohn <huashuoli@tencent.com>
2025-09-23 22:29:58 -04:00
Thomas Parnell
a903669e10 [V1] Remove V0 code paths for Hybrid models (#25400)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-09-23 08:26:13 -07:00
Vadim Gimpelson
072d7e53e5 [PERF] Add conv1d metadata to GDN attn (#25105)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-09-18 14:27:49 +00:00
Tao He
dd6a910aac [Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. (#24957)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
2025-09-17 21:59:09 +08:00
Tao He
8226dd56bf [Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660) (#24667)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
2025-09-12 22:31:32 +00:00
Tao He
e93f4cc9e3 Add the support for the qwen3 next model (a hybrid attention model). (#24526)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-09-11 15:32:09 +08:00