116f4be405 | Matthew Bonanni | 2026-04-01 04:08:01 +00:00
    [1/N][Cleanup] Standardize on use of is_quantized_kv_cache (#38659)
    Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

077a9a8e37 | BadrBasowid | 2026-03-31 14:15:50 -04:00
    [torch.compile] Refactor Attention Quant Fusion Pass and Remove Boilerplate (#37373)
    Signed-off-by: BadrBasowid <badr.basowid@gmail.com>
    Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>

c57d38d603 | Itay Alroy | 2026-03-20 23:13:02 +00:00
    elastic_ep: Fix issues with repeated scale up/down cycles (#37131)
    Signed-off-by: Itay Alroy <ialroy@nvidia.com>
    Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>

79eb9369c5 | Peter Pan | 2026-03-20 17:36:32 +00:00
    fix CUDAGraph memory being counted twice (#37426)
    Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
    Signed-off-by: Peter Pan <peter.pan@daocloud.io>
    Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>

e8f9dbc369 | JartX | 2026-03-17 17:55:34 -04:00
    [Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling (#36720)
    Signed-off-by: JartX <sagformas@epdcenter.es>

f63ed7b5ac | Benjamin Chislett | 2026-03-17 11:16:48 -04:00
    [Bugfix] Fix DP MTP Dummy Run (#35243)
    Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

747b068136 | Kunshang Ji | 2026-03-16 10:24:48 +00:00
    [Hardware] Replace memory related torch.cuda APIs (#37031)
    Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>

53ec16a705 | Kunshang Ji | 2026-03-12 07:57:47 -07:00
    [Hardware] Replace torch.cuda.device_count/current_device/set_device API (#36145)
    Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
    Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

d6b61e5166 | Aaron Hao | 2026-03-11 18:06:10 -04:00
    [BUG] Fix async rlhf tests (#35811)
    Signed-off-by: ahao-anyscale <ahao@anyscale.com>

bea02cdf93 | Hongxin Xu | 2026-03-11 08:53:10 -07:00
    Fix routed experts capture for hybrid models (Mamba + Attention) (#35744)
    Signed-off-by: arlenxu <arlenxu@tencent.com>
    Signed-off-by: xhx1022 <1737006628@qq.com>
    Co-authored-by: arlenxu <arlenxu@tencent.com>

4bf533623b | Hongbin Guo | 2026-03-10 21:28:31 -07:00
    [Doc] Fix duplicate words in comments (#36713)
    Signed-off-by: Hongbin10 <jdmjdm1998@163.com>

65b2f405dc | Nick Hill | 2026-03-10 20:20:02 +00:00
    [Core] Simplify core kv-cache blocks initialization logic (#36521)
    Signed-off-by: Nick Hill <nickhill123@gmail.com>

4ff8c3c8f9 | Vadim Gimpelson | 2026-03-10 03:32:20 -07:00
    [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219)
    Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

ebb9cc5f2b | Matthew Bonanni | 2026-03-07 13:49:23 -08:00
    [UX][Startup] Account for CUDA graphs during memory profiling (#30515)

b354686524 | Nick Hill | 2026-03-06 16:58:51 -08:00
    [Model Runner V2] Fix warmup for pipeline parallel (#36280)
    Signed-off-by: Nick Hill <nickhill123@gmail.com>

03a49bb8f0 | Shiyan Deng | 2026-03-05 20:57:51 -08:00
    [Feature] Add --distributed-timeout-seconds CLI option (#36047)
    Signed-off-by: Shiyan Deng <dsy842974287@meta.com>
    Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>

a73af584fe | Nick Hill | 2026-03-05 14:48:10 -08:00
    [Model Runner V2] Fix warmup for very small kvcache and/or blocksizes (#36176)
    Signed-off-by: Nick Hill <nickhill123@gmail.com>

16d2ad1d38 | Kunshang Ji | 2026-03-04 09:49:47 +00:00
    [Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache (#30681)
    Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
    Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
    Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

d15c3b90fc | Nick Hill | 2026-03-03 15:31:59 -08:00
    [Core] Move save_tensorized_model logic to Worker (#35825)
    Signed-off-by: Nick Hill <nickhill123@gmail.com>

87c98b0236 | Martin Hickey | 2026-03-02 13:23:42 +00:00
    [MyPy][BugFix] Check profiler is assigned before calling start() on it (#35505)
    Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
    Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

7b346ba8ed | Huy Do | 2026-02-28 05:03:22 +00:00
    [Bugfix] Propagate compilation_time from workers to main process for TP>1 (#35503)
    Signed-off-by: Huy Do <huydhn@gmail.com>

dea268336f | Itay Alroy | 2026-02-28 04:46:42 +00:00
    [1/N] Elastic EP Milestone 2 (#34861)
    Signed-off-by: Yongji Wu <wuyongji317@gmail.com>
    Signed-off-by: Itay Alroy <ialroy@nvidia.com>
    Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
    Signed-off-by: Ron Tourgeman <rtourgeman@nvidia.com>
    Co-authored-by: Yongji Wu <wuyongji317@gmail.com>
    Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
    Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>

86ac7bcf84 | Woosuk Kwon | 2026-02-27 18:03:01 -08:00
    [Model Runner V2] Support pooling models (#35120)
    Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>

b1d9f5372d | Nick Hill | 2026-02-27 10:43:30 -08:00
    [Model Runner V2] Warmup kernels (#35172)
    Signed-off-by: Nick Hill <nickhill123@gmail.com>

876312f0b5 | Nick Hill | 2026-02-27 07:54:24 -08:00
    [Core] Fix gpu_worker.py pre-commit errors (#35312)
    Signed-off-by: Nick Hill <nickhill123@gmail.com>

3d2a026fd0 | Wentao Ye | 2026-02-13 16:38:16 +08:00
    [Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement (#33368)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>
    Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
    Co-authored-by: Nick Hill <nhill@redhat.com>

4453ba8d9e | Jaewon | 2026-02-12 16:16:38 -08:00
    [Core] Profiler improvements and lazy initialization (#33198)
    Signed-off-by: Jaewon Lee <jaewon@meta.com>
    Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>

d1481ba783 | bnellnm | 2026-02-10 19:51:07 -05:00
    [MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner (#32344)
    Signed-off-by: Bill Nell <bnell@redhat.com>

bb2fc8b5e7 | Ilya Markov | 2026-02-10 22:34:47 +00:00
    [BugFix] Fix async EPLB hang with DeepEP LL all2all backend (#32860)
    Signed-off-by: ilmarkov <markovilya197@gmail.com>

506ad7d7c1 | J Seppänen | 2026-02-10 20:38:17 +00:00
    [Bugfix] Fix weights offloading for sleep mode (#32947)
    Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
    Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

67a746e87f | Wentao Ye | 2026-02-06 17:49:56 +00:00
    [Log] Optimize duplicate startup log (#33944)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>

325ab6b0a8 | emricksini-h | 2026-02-05 16:59:28 -08:00
    [Feature] OTEL tracing during loading (#31162)

c1858b7ec8 | Aaron Hao | 2026-02-05 12:13:23 -05:00
    [Feat][RL][1/2] Native Weight Syncing API: NCCL (#31943)
    Signed-off-by: ahao-anyscale <ahao@anyscale.com>
    Signed-off-by: Aaron Hao <ahao@anyscale.com>
    Co-authored-by: SumanthRH <sumanthrh99@gmail.com>

2f6d17cb2f | kourosh hakhamaneshi | 2026-02-04 10:09:14 -08:00
    [rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm (#33308)
    Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

22d9a056d5 | jma99_2333 | 2026-01-31 15:22:25 +00:00
    Support clear mm and encoder cache (#33452)
    Signed-off-by: Roger Wang <hey@rogerw.io>
    Co-authored-by: Roger Wang <hey@rogerw.io>

f857a03f6b | Kyle Sayers | 2026-01-30 08:50:05 -07:00
    [QeRL] Layerwise Reloading (#32133)
    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

8c8ebeb941 | Chendi.Xue | 2026-01-29 09:56:30 -08:00
    [BUGFIX][XPU] fix memory check after XPU reuse GPU_worker (#33358)
    Signed-off-by: Chendi Xue <chendi.xue@intel.com>

6bf3b46d78 | Nick Hill | 2026-01-28 14:41:23 -08:00
    [ModelRunner V2] Misc code simplification and cleanup (#33266)
    Signed-off-by: Nick Hill <nickhill123@gmail.com>

06b557ecd9 | Reagan Lee | 2026-01-24 08:24:44 +00:00
    feat(benchmark): add encoder forward pass benchmarking to mm-processor (#31655)
    Signed-off-by: Reagan <reaganjlee@gmail.com>
    Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
    Co-authored-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com>

8518b30447 | Nick Hill | 2026-01-23 10:49:17 -08:00
    [Model Runner V2] Add KV Connector support (#32742)
    Signed-off-by: Nick Hill <nickhill123@gmail.com>

43fada5360 | Woosuk Kwon | 2026-01-19 14:50:59 -08:00
    [Model Runner V2] Refactor dummy_run (#32533)
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

ce0946249d | Shanshan Shen | 2026-01-14 03:46:01 -08:00
    [Misc] Make mem utils can be reused by other platforms (#32322)
    Signed-off-by: shen-shanshan <467638484@qq.com>

6ebe34d6fa | Max Hu | 2026-01-09 00:13:39 +00:00
    [Feature] Add iteration level logging and enhance nvtx marker (#31193)
    Signed-off-by: Max Hu <maxhu@nvidia.com>
    Signed-off-by: Max Hu <hyoung2991@gmail.com>
    Co-authored-by: Max Hu <maxhu@nvidia.com>

6cdf015c3c | Lucas Wilkinson | 2026-01-08 15:20:49 -08:00
    [Misc] Fix Current vLLM config is not set. warnings, assert to avoid issues in the future (#31747)
    Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
    Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
    Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

a3d909ad2b | Nick Hill | 2026-01-08 09:10:07 -08:00
    [Misc] Tidy up some spec decode logic in GPUModelRunner (#31591)
    Signed-off-by: Nick Hill <nickhill123@gmail.com>

c907d22158 | Ning Xie | 2026-01-07 18:37:31 +00:00
    [refactor] refactor memory constants usage (#31865)
    Signed-off-by: Andy Xie <andy.xning@gmail.com>

aafd4d2354 | Cyrus Leung | 2026-01-07 12:34:04 +08:00
    [Chore] Try remove init_cached_hf_modules (#31786)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

6f5e653383 | Ning Xie | 2026-01-06 17:32:55 +00:00
    [Log] add log about gpu worker init snapshot and requested memory (#29493)
    Signed-off-by: Andy Xie <andy.xning@gmail.com>

af9a7ec255 | Wentao Ye | 2026-01-05 22:31:21 +00:00
    [Bug] Revert torch warning fix (#31585)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>

bb4337b34c | wangxiyuan | 2026-01-04 18:34:04 -08:00
    [Platform] Deprecate seed_everything (#31659)
    Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>