biondizzle/vllm

Author	SHA1	Message	Date
Wentao Ye	3d2a026fd0	[Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement (#33368 ) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2026-02-13 16:38:16 +08:00
Jaewon	4453ba8d9e	[Core] Profiler improvements and lazy initialization (#33198 ) Signed-off-by: Jaewon Lee <jaewon@meta.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>	2026-02-12 16:16:38 -08:00
bnellnm	d1481ba783	[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner (#32344 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2026-02-10 19:51:07 -05:00
Ilya Markov	bb2fc8b5e7	[BugFix] Fix async EPLB hang with DeepEP LL all2all backend (#32860 ) Signed-off-by: ilmarkov <markovilya197@gmail.com>	2026-02-10 22:34:47 +00:00
J Seppänen	506ad7d7c1	[Bugfix] Fix weights offloading for sleep mode (#32947 ) Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2026-02-10 20:38:17 +00:00
Wentao Ye	67a746e87f	[Log] Optimize duplicate startup log (#33944 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-02-06 17:49:56 +00:00
emricksini-h	325ab6b0a8	[Feature] OTEL tracing during loading (#31162 )	2026-02-05 16:59:28 -08:00
Aaron Hao	c1858b7ec8	[Feat][RL][1/2] Native Weight Syncing API: NCCL (#31943 ) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Co-authored-by: SumanthRH <sumanthrh99@gmail.com>	2026-02-05 12:13:23 -05:00
kourosh hakhamaneshi	2f6d17cb2f	[rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm (#33308 ) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>	2026-02-04 10:09:14 -08:00
jma99_2333	22d9a056d5	Support clear mm and encoder cache (#33452 ) Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roger Wang <hey@rogerw.io>	2026-01-31 15:22:25 +00:00
Kyle Sayers	f857a03f6b	[QeRL] Layerwise Reloading (#32133 ) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2026-01-30 08:50:05 -07:00
Chendi.Xue	8c8ebeb941	[BUGFIX][XPU] fix memory check after XPU reuse GPU_worker (#33358 ) Signed-off-by: Chendi Xue <chendi.xue@intel.com>	2026-01-29 09:56:30 -08:00
Nick Hill	6bf3b46d78	[ModelRunner V2] Misc code simplification and cleanup (#33266 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-01-28 14:41:23 -08:00
Reagan Lee	06b557ecd9	feat(benchmark): add encoder forward pass benchmarking to mm-processor (#31655 ) Signed-off-by: Reagan <reaganjlee@gmail.com> Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com> Co-authored-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com>	2026-01-24 08:24:44 +00:00
Nick Hill	8518b30447	[Model Runner V2] Add KV Connector support (#32742 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-01-23 10:49:17 -08:00
Woosuk Kwon	43fada5360	[Model Runner V2] Refactor `dummy_run` (#32533 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2026-01-19 14:50:59 -08:00
Shanshan Shen	ce0946249d	[Misc] Make mem utils can be reused by other platforms (#32322 ) Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-14 03:46:01 -08:00
Max Hu	6ebe34d6fa	[Feature] Add iteration level logging and enhance nvtx marker (#31193 ) Signed-off-by: Max Hu <maxhu@nvidia.com> Signed-off-by: Max Hu <hyoung2991@gmail.com> Co-authored-by: Max Hu <maxhu@nvidia.com>	2026-01-09 00:13:39 +00:00
Lucas Wilkinson	6cdf015c3c	[Misc] Fix `Current vLLM config is not set.` warnings, assert to avoid issues in the future (#31747 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-01-08 15:20:49 -08:00
Nick Hill	a3d909ad2b	[Misc] Tidy up some spec decode logic in GPUModelRunner (#31591 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-01-08 09:10:07 -08:00
Ning Xie	c907d22158	[refactor] refactor memory constants usage (#31865 ) Signed-off-by: Andy Xie <andy.xning@gmail.com>	2026-01-07 18:37:31 +00:00
Cyrus Leung	aafd4d2354	[Chore] Try remove `init_cached_hf_modules` (#31786 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-01-07 12:34:04 +08:00
Ning Xie	6f5e653383	[Log] add log about gpu worker init snapshot and requested memory (#29493 ) Signed-off-by: Andy Xie <andy.xning@gmail.com>	2026-01-06 17:32:55 +00:00
Wentao Ye	af9a7ec255	[Bug] Revert torch warning fix (#31585 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-01-05 22:31:21 +00:00
wangxiyuan	bb4337b34c	[Platform] Deprecate seed_everything (#31659 ) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-04 18:34:04 -08:00
Nick Hill	bd877162eb	[BugFix] Support online dense model DP without overhead (#30739 ) Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: njhill <nickhill123@gmail.com>	2026-01-02 23:36:38 +08:00
Nick Hill	6c2cfb62ff	[BugFix] Fix async scheduling for pooling models (#31584 ) Signed-off-by: njhill <nickhill123@gmail.com>	2025-12-31 14:48:51 -08:00
Yifan Qiao	52bf066516	[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector (#30166 ) Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu> Co-authored-by: KuntaiDu <kuntai@uchicago.edu>	2025-12-26 18:25:46 -08:00
Michael Goin	8ee90c83f8	Add `--max-model-len auto` to auto-fit context to available memory (#29431 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-12-23 21:37:14 -08:00
Boyuan Feng	8dd0db687b	[UX] improve profiler error message (#31125 ) Signed-off-by: Boyuan Feng <boyuan@meta.com>	2025-12-22 08:45:59 -08:00
Cyrus Leung	2497228ad4	[Chore] Factor out logic for requesting initial memory (#30868 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-12-17 07:32:17 -08:00
Matthew Bonanni	60dbf7d8f1	Update batch invariant to use attention config (#30704 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-15 15:24:16 -05:00
Lucas Wilkinson	3e41992fec	[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2 (#27532 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-12-12 05:57:47 -08:00
Wentao Ye	d6464f2679	[Chore] Fix torch precision warning (#30428 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2025-12-11 04:05:56 +00:00
Benjamin Chislett	e858bfe051	[Cleanup] Refactor profiling env vars into a CLI config (#29912 ) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-12-09 13:29:33 -05:00
Wentao Ye	83319b44c2	[Compile] Fix torch warning `TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled` (#29897 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2025-12-09 10:40:37 -05:00
Ilya Markov	4e26d3b09e	[Compile] Conditional compilation. Introduce compile_ranges (#24252 ) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: ilmarkov <markovilya197@gmail.com> Signed-off-by: Luka Govedič <luka.govedic@gmail.com> Signed-off-by: ProExpertProg <lgovedic@redhat.com> Co-authored-by: Luka Govedič <lgovedic@redhat.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Luka Govedič <luka.govedic@gmail.com>	2025-12-05 18:17:32 +00:00
Nick Hill	dc264bcea1	[BugFix] Eagerly abort cancelled final-step requests (#29987 ) Currently, when requests are cancelled while executing their final step, "completion" is handled based on normal stop processing (e.g. length or stop token), so the abort has no effect. This is typically not a problem, but when a kv connector is involved it thinks the request completed successfully rather than being aborted. This is problematic for disaggregated prefill which will free kv cache blocks if the request was aborted but not if it completed successfully—since the cancelled request will never be sent to the decode side, kv cache blocks remain pinned until the fall-back timeout expires. The problem is exacerbated when many requests are cancelled and/or there are large prefills whose forward pass takes a long time (since the window is bigger). This PR fixes the problem by processing pending aborts immediately prior to processing model output each step; we process only aborts, not new requests, since it's preferable for latency to process model outputs before new incoming requests. Fixes #26400. Signed-off-by: Nick Hill <nhill@redhat.com>	2025-12-05 17:28:32 +00:00
Yong Hoon Shin	69520bc695	Add logging for cudagraph related info (#29825 ) Signed-off-by: Yong Hoon Shin <yhshin@meta.com>	2025-12-03 01:01:48 -08:00
Arpit Khandelwal	d7284a2604	[Core] Rename PassConfig flags as per RFC #27995 (#29646 ) Signed-off-by: arpitkh101 <arpit5khandelwal@gmail.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2025-12-03 03:38:55 +00:00
Vensen	66b5840287	[Bugfix][sleepmode][fp8 kv cache]: Fix FP8 KV cache + sleep(level=2) gibberish output (#28783 ) Signed-off-by: vensen <vensenmu@gmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>	2025-11-30 14:24:25 +08:00
Lucas Wilkinson	56539cddac	[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building (#28579 )	2025-11-26 14:07:13 -05:00
Woosuk Kwon	30b44a1598	GPU Model Runner V2 (#25266 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-11-21 08:20:55 -08:00
Wentao Ye	56669c1f29	[CI] Fix mypy for `vllm/v1/worker` (#29037 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2025-11-21 11:36:07 +08:00
Benjamin Chislett	fcbcba6c70	[Feat] Iteration-level profiling for Torch and CUDA profiler (#28987 ) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-11-19 19:17:48 -08:00
Julien Denize	cdeec2e606	[BugFix] Ray with multiple nodes (#28873 ) Signed-off-by: Julien Denize <julien.denize@mistral.ai>	2025-11-19 21:20:58 +00:00
Qiu	2fd893b4ce	[Feature] Prefill Context Parallel (PCP) basic support (#28718 ) Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com> Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com> Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com> Co-authored-by: LookAround <lixushi@huawei.com> Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com> Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com>	2025-11-19 15:52:44 -05:00
Nick Hill	7765e5ba75	[BugFix] Fix PP performance and PP kv connector output regression (#28768 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-11-17 14:08:50 -08:00
Lucia Fang	b316ac6589	[V1] Support MP Executor for multi node distributed inference (#23691 ) Signed-off-by: Lu Fang <fanglu@fb.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Signed-off-by: Lucia Fang <fanglu@fb.com> Signed-off-by: Lucia Fang <116399278+luccafong@users.noreply.github.com> Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-11-16 09:01:21 +00:00
rasmith	ba041d980b	[Log] Save profiler results to file instead of stdout (#28144 ) Signed-off-by: Randall Smith <ransmith@amd.com> Co-authored-by: Randall Smith <ransmith@amd.com>	2025-11-14 23:26:39 +00:00

174 Commits