haosdent
|
116ed130f4
|
[Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches (#34871)
Signed-off-by: haosdent <haosdent@gmail.com>
|
2026-03-16 10:30:23 +01:00 |
|
Harry Mellor
|
17dc9c7fc9
|
[CI] Bump mypy version (#34950)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-03-04 20:55:11 +00:00 |
|
Woosuk Kwon
|
0916e7960b
|
[GDN] Use CPU tensors to build GDN metadata (#34498)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-02-13 01:24:45 -08:00 |
|
Vadim Gimpelson
|
000214c4bb
|
[BUGFIX] Fix accuracy bugs in Qwen3-Next MTP (#34077)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
|
2026-02-10 10:57:11 -05:00 |
|
Harry Huang
|
5206e5e28c
|
[V1][Hybrid] Mamba Prefix Caching with align mode (#30877)
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
|
2026-01-23 09:56:48 -08:00 |
|
tianshu-Michael-yu
|
13d8746c54
|
[Feature]: Remove DtoH Copy for lfm2_vl On Default Stream (#32815)
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>
|
2026-01-23 13:20:30 +00:00 |
|
Nicolò Lucchesi
|
160c6fa387
|
[Misc] Add get_name to missing AttentionBackends (#32698)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2026-01-23 10:35:44 +00:00 |
|
Matthew Bonanni
|
20228cb851
|
[3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py (#32054)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-01-12 09:13:56 -08:00 |
|
Matthew Bonanni
|
2612ba9285
|
[1/N][Attention] Restructure attention: move files (#31916)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-01-09 13:10:24 -08:00 |
|
Cyrus Leung
|
b665bbc2d4
|
[Chore] Migrate V0 attention utils (#31891)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2026-01-07 13:44:36 +00:00 |
|
Jack Yang
|
0a2c2dc3f1
|
fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround (#31465)
Signed-off-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2026-01-07 04:08:47 +00:00 |
|
Lucas Wilkinson
|
4c73be14e0
|
[Attention][2/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31774)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2026-01-06 17:32:14 +00:00 |
|
Benjamin Chislett
|
85aff45e24
|
[Perf] Remove blocking copy in GDN Attention (#31167)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
|
2025-12-22 14:25:22 -08:00 |
|
drslark
|
add1b9d3de
|
[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#30632)
Signed-off-by: drslark <slarksblood@qq.com>
|
2025-12-14 01:32:16 -08:00 |
|
Lucas Wilkinson
|
abe93bce59
|
[Attention] Make seq_lens_cpu optional in CommonAttentionMetadata to enable true async spec-decode (#29624)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
|
2025-12-09 17:18:10 -08:00 |
|
Matthew Bonanni
|
1d93f11675
|
[Attention][CUDAGraph] Remove CG padding from attention backends (#29352)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-12-02 13:48:08 -05:00 |
|
Benjamin Chislett
|
304419576a
|
[Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer (#28479)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
|
2025-11-13 01:56:40 +09:00 |
|
fhl2000
|
284cc92275
|
[MISC] cudagraph_capture_sizes related improvements (#26016)
Signed-off-by: fhl <2410591650@qq.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-10-24 05:11:05 -07:00 |
|
Vadim Gimpelson
|
785d8b6410
|
[PERF] Qwen3-next MTP speedup (change bool mask indexing to index_select / index_copy to reduce d2h) (#26437)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
|
2025-10-16 12:18:31 +08:00 |
|
Harry Mellor
|
8fcaaf6a16
|
Update Optional[x] -> x | None and Union[x, y] to x | y (#26633)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-10-12 09:51:31 -07:00 |
|
Roger Wang
|
43c146ca42
|
[Misc] Clean up unnecessary E501 ignore (#26274)
Signed-off-by: Roger Wang <hey@rogerw.io>
|
2025-10-06 07:29:18 +00:00 |
|
Harry Mellor
|
d6953beb91
|
Convert formatting to use ruff instead of yapf + isort (#26247)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-10-05 07:06:22 -07:00 |
|
Tao He
|
99b3a504c5
|
[Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. (#25743)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
|
2025-09-26 01:18:58 -07:00 |
|
Benjamin Chislett
|
c30b405b8f
|
[Spec Decode] Enable FlashInfer Spec Decoding (#25196)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: lhsjohn <huashuoli@tencent.com>
|
2025-09-23 22:29:58 -04:00 |
|
Thomas Parnell
|
a903669e10
|
[V1] Remove V0 code paths for Hybrid models (#25400)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
|
2025-09-23 08:26:13 -07:00 |
|
Vadim Gimpelson
|
072d7e53e5
|
[PERF] Add conv1d metadata to GDN attn (#25105)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
|
2025-09-18 14:27:49 +00:00 |
|
Tao He
|
dd6a910aac
|
[Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. (#24957)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
|
2025-09-17 21:59:09 +08:00 |
|
Tao He
|
8226dd56bf
|
[Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660) (#24667)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
|
2025-09-12 22:31:32 +00:00 |
|
Tao He
|
e93f4cc9e3
|
Add the support for the qwen3 next model (a hybrid attention model). (#24526)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-09-11 15:32:09 +08:00 |
|