biondizzle/vllm - vllm

Author	SHA1	Message	Date
Lucas Wilkinson	abe93bce59	[Attention] Make seq_lens_cpu optional in CommonAttentionMetadata to enable true async spec-decode (#29624) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>	2025-12-09 17:18:10 -08:00
Matthew Bonanni	1d93f11675	[Attention][CUDAGraph] Remove CG padding from attention backends (#29352) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2025-12-02 13:48:08 -05:00
Benjamin Chislett	304419576a	[Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer (#28479) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>	2025-11-13 01:56:40 +09:00
fhl2000	284cc92275	[MISC] `cudagraph_capture_sizes` related improvements (#26016) Signed-off-by: fhl <2410591650@qq.com> Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-10-24 05:11:05 -07:00
Vadim Gimpelson	785d8b6410	[PERF] Qwen3-next MTP speedup (change bool mask indexing to index_select / index_copy to reduce d2h) (#26437) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2025-10-16 12:18:31 +08:00
Harry Mellor	8fcaaf6a16	Update `Optional[x]` -> `x | None` and `Union[x, y]` to `x | y` (#26633) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-10-12 09:51:31 -07:00
Roger Wang	43c146ca42	[Misc] Clean up unnecessary E501 ignore (#26274) Signed-off-by: Roger Wang <hey@rogerw.io>	2025-10-06 07:29:18 +00:00
Harry Mellor	d6953beb91	Convert formatting to use `ruff` instead of `yapf` + `isort` (#26247) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-10-05 07:06:22 -07:00
Tao He	99b3a504c5	[Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. (#25743) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>	2025-09-26 01:18:58 -07:00
Benjamin Chislett	c30b405b8f	[Spec Decode] Enable FlashInfer Spec Decoding (#25196) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: lhsjohn <huashuoli@tencent.com>	2025-09-23 22:29:58 -04:00
Thomas Parnell	a903669e10	[V1] Remove V0 code paths for Hybrid models (#25400) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2025-09-23 08:26:13 -07:00
Vadim Gimpelson	072d7e53e5	[PERF] Add `conv1d` metadata to GDN attn (#25105) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2025-09-18 14:27:49 +00:00
Tao He	dd6a910aac	[Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. (#24957) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>	2025-09-17 21:59:09 +08:00
Tao He	8226dd56bf	[Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660) (#24667) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>	2025-09-12 22:31:32 +00:00
Tao He	e93f4cc9e3	Add the support for the qwen3 next model (a hybrid attention model). (#24526) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2025-09-11 15:32:09 +08:00

15 Commits