Commit Graph

199 Commits

Author SHA1 Message Date
Matthew Bonanni
116f4be405 [1/N][Cleanup] Standardize on use of is_quantized_kv_cache (#38659)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-04-01 04:08:01 +00:00
BadrBasowid
077a9a8e37 [torch.compile] Refactor Attention Quant Fusion Pass and Remove Boilerplate (#37373)
Signed-off-by: BadrBasowid <badr.basowid@gmail.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-03-31 14:15:50 -04:00
Itay Alroy
c57d38d603 elastic_ep: Fix issues with repeated scale up/down cycles (#37131)
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
2026-03-20 23:13:02 +00:00
Peter Pan
79eb9369c5 fix CUDAGraph memory being counted twice (#37426)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <peter.pan@daocloud.io>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-20 17:36:32 +00:00
JartX
e8f9dbc369 [Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling (#36720)
Signed-off-by: JartX <sagformas@epdcenter.es>
2026-03-17 17:55:34 -04:00
Benjamin Chislett
f63ed7b5ac [Bugfix] Fix DP MTP Dummy Run (#35243)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2026-03-17 11:16:48 -04:00
Kunshang Ji
747b068136 [Hardware] Replace memory related torch.cuda APIs (#37031)
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
2026-03-16 10:24:48 +00:00
Kunshang Ji
53ec16a705 [Hardware] Replace torch.cuda.device_count/current_device/set_device API (#36145)
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-12 07:57:47 -07:00
Aaron Hao
d6b61e5166 [BUG] Fix async rlhf tests (#35811)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
2026-03-11 18:06:10 -04:00
Hongxin Xu
bea02cdf93 Fix routed experts capture for hybrid models (Mamba + Attention) (#35744)
Signed-off-by: arlenxu <arlenxu@tencent.com>
Signed-off-by: xhx1022 <1737006628@qq.com>
Co-authored-by: arlenxu <arlenxu@tencent.com>
2026-03-11 08:53:10 -07:00
Hongbin Guo
4bf533623b [Doc] Fix duplicate words in comments (#36713)
Signed-off-by: Hongbin10 <jdmjdm1998@163.com>
2026-03-10 21:28:31 -07:00
Nick Hill
65b2f405dc [Core] Simplify core kv-cache blocks initialization logic (#36521)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-10 20:20:02 +00:00
Vadim Gimpelson
4ff8c3c8f9 [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-03-10 03:32:20 -07:00
Matthew Bonanni
ebb9cc5f2b [UX][Startup] Account for CUDA graphs during memory profiling (#30515) 2026-03-07 13:49:23 -08:00
Nick Hill
b354686524 [Model Runner V2] Fix warmup for pipeline parallel (#36280)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-06 16:58:51 -08:00
Shiyan Deng
03a49bb8f0 [Feature] Add --distributed-timeout-seconds CLI option (#36047)
Signed-off-by: Shiyan Deng <dsy842974287@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2026-03-05 20:57:51 -08:00
Nick Hill
a73af584fe [Model Runner V2] Fix warmup for very small kvcache and/or blocksizes (#36176)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-05 14:48:10 -08:00
Kunshang Ji
16d2ad1d38 [Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache (#30681)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-04 09:49:47 +00:00
Nick Hill
d15c3b90fc [Core] Move save_tensorized_model logic to Worker (#35825)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-03 15:31:59 -08:00
Martin Hickey
87c98b0236 [MyPy][BugFix] Check profiler is assigned before calling start() on it (#35505)
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-02 13:23:42 +00:00
Huy Do
7b346ba8ed [Bugfix] Propagate compilation_time from workers to main process for TP>1 (#35503)
Signed-off-by: Huy Do <huydhn@gmail.com>
2026-02-28 05:03:22 +00:00
Itay Alroy
dea268336f [1/N] Elastic EP Milestone 2 (#34861)
Signed-off-by: Yongji Wu <wuyongji317@gmail.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Ron Tourgeman <rtourgeman@nvidia.com>
Co-authored-by: Yongji Wu <wuyongji317@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
2026-02-28 04:46:42 +00:00
Woosuk Kwon
86ac7bcf84 [Model Runner V2] Support pooling models (#35120)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-02-27 18:03:01 -08:00
Nick Hill
b1d9f5372d [Model Runner V2] Warmup kernels (#35172)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-02-27 10:43:30 -08:00
Nick Hill
876312f0b5 [Core] Fix gpu_worker.py pre-commit errors (#35312)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-02-27 07:54:24 -08:00
Wentao Ye
3d2a026fd0 [Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement (#33368)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2026-02-13 16:38:16 +08:00
Jaewon
4453ba8d9e [Core] Profiler improvements and lazy initialization (#33198)
Signed-off-by: Jaewon Lee <jaewon@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2026-02-12 16:16:38 -08:00
bnellnm
d1481ba783 [MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner (#32344)
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-02-10 19:51:07 -05:00
Ilya Markov
bb2fc8b5e7 [BugFix] Fix async EPLB hang with DeepEP LL all2all backend (#32860)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
2026-02-10 22:34:47 +00:00
J Seppänen
506ad7d7c1 [Bugfix] Fix weights offloading for sleep mode (#32947)
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2026-02-10 20:38:17 +00:00
Wentao Ye
67a746e87f [Log] Optimize duplicate startup log (#33944)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-02-06 17:49:56 +00:00
emricksini-h
325ab6b0a8 [Feature] OTEL tracing during loading (#31162) 2026-02-05 16:59:28 -08:00
Aaron Hao
c1858b7ec8 [Feat][RL][1/2] Native Weight Syncing API: NCCL (#31943)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>
2026-02-05 12:13:23 -05:00
kourosh hakhamaneshi
2f6d17cb2f [rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm (#33308)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
2026-02-04 10:09:14 -08:00
jma99_2333
22d9a056d5 Support clear mm and encoder cache (#33452)
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-01-31 15:22:25 +00:00
Kyle Sayers
f857a03f6b [QeRL] Layerwise Reloading (#32133)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2026-01-30 08:50:05 -07:00
Chendi.Xue
8c8ebeb941 [BUGFIX][XPU] fix memory check after XPU reuse GPU_worker (#33358)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2026-01-29 09:56:30 -08:00
Nick Hill
6bf3b46d78 [ModelRunner V2] Misc code simplification and cleanup (#33266)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-28 14:41:23 -08:00
Reagan Lee
06b557ecd9 feat(benchmark): add encoder forward pass benchmarking to mm-processor (#31655)
Signed-off-by: Reagan <reaganjlee@gmail.com>
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Co-authored-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com>
2026-01-24 08:24:44 +00:00
Nick Hill
8518b30447 [Model Runner V2] Add KV Connector support (#32742)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-23 10:49:17 -08:00
Woosuk Kwon
43fada5360 [Model Runner V2] Refactor dummy_run (#32533)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-19 14:50:59 -08:00
Shanshan Shen
ce0946249d [Misc] Make mem utils can be reused by other platforms (#32322)
Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-14 03:46:01 -08:00
Max Hu
6ebe34d6fa [Feature] Add iteration level logging and enhance nvtx marker (#31193)
Signed-off-by: Max Hu <maxhu@nvidia.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: Max Hu <maxhu@nvidia.com>
2026-01-09 00:13:39 +00:00
Lucas Wilkinson
6cdf015c3c [Misc] Fix Current vLLM config is not set. warnings, assert to avoid issues in the future (#31747)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-08 15:20:49 -08:00
Nick Hill
a3d909ad2b [Misc] Tidy up some spec decode logic in GPUModelRunner (#31591)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-08 09:10:07 -08:00
Ning Xie
c907d22158 [refactor] refactor memory constants usage (#31865)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-07 18:37:31 +00:00
Cyrus Leung
aafd4d2354 [Chore] Try remove init_cached_hf_modules (#31786)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 12:34:04 +08:00
Ning Xie
6f5e653383 [Log] add log about gpu worker init snapshot and requested memory (#29493)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-06 17:32:55 +00:00
Wentao Ye
af9a7ec255 [Bug] Revert torch warning fix (#31585)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-05 22:31:21 +00:00
wangxiyuan
bb4337b34c [Platform] Deprecate seed_everything (#31659)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-04 18:34:04 -08:00