Chenguang Zheng
|
4ccffe561f
|
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#25233)
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: herotai214 <herotai214@gmail.com>
Signed-off-by: Khuong Le <khuong.le.manh@huawei.com>
Signed-off-by: Khuong Le <lemanhkhuong2611@gmail.com>
Co-authored-by: n00909098 <nguyen.kha.long@huawei.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: herotai214 <herotai214@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Khuong Le <khuong.le.manh@huawei.com>
Co-authored-by: Khuong Le <lemanhkhuong2611@gmail.com>
|
2025-11-11 18:58:33 -08:00 |
|
Benjamin Chislett
|
975676d174
|
[Feat] Drop-in Torch CUDA Profiler (#27841)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
|
2025-11-08 14:07:37 -08:00 |
|
Nick Hill
|
67a2da890e
|
[PerfFix] Avoid separate thread for MP executor shm spin (take 2) (#28319)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-11-07 22:11:03 +00:00 |
|
Nicolò Lucchesi
|
68a72a5cc1
|
Revert "[PerfFix] Avoid separate thread for MP executor shm spin (#28012)" (#28289)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-11-07 15:07:01 +00:00 |
|
Dayeol Lee
|
1767658559
|
[Debugging] Add annotation for easier trace analysis (#22496)
|
2025-11-05 16:52:52 -08:00 |
|
Ilya Markov
|
e50c454672
|
[BugFix] Support EP/DP + EPLB with MTP (#25311)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
|
2025-11-05 15:22:17 +00:00 |
|
Nick Hill
|
c9f66da8fd
|
[PerfFix] Avoid separate thread for MP executor shm spin (#28012)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-11-04 08:33:55 -08:00 |
|
Mark McLoughlin
|
58279c60b5
|
[KV Connector] Make KVCacheConfig an explicit constructor argument (#27887)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-11-03 23:00:49 -08:00 |
|
Nick Hill
|
0cdbe7b744
|
[Core] Async scheduling + structured outputs compatibility (#26866)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-11-01 00:35:04 +00:00 |
|
GuanLuo
|
d6517be3cd
|
[Bugfix] Missing NIXL metadata for handshake initialization if instance spans multi-node (#26338)
Signed-off-by: Guan Luo <gluo@nvidia.com>
Signed-off-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>
Signed-off-by: Guan Luo <41310872+GuanLuo@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
|
2025-10-31 10:16:00 -07:00 |
|
Ilya Markov
|
60f76baa66
|
[Misc] Replace CUDA_VISIBLE_DEVICES in DP with torch.cuda.set_device for device selection on cuda-like devices (#27564)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
|
2025-10-30 11:41:44 -04:00 |
|
Cyrus Leung
|
6ebffafbb6
|
[Misc] Clean up more utils (#27567)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-10-27 15:30:38 +00:00 |
|
Kuntai Du
|
b853540388
|
[Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector (#25712)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
|
2025-10-24 23:34:18 -07:00 |
|
Zhuohan Li
|
56ed7609a9
|
Revert "[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selectio… (#27502)
|
2025-10-25 05:31:43 +00:00 |
|
Wentao Ye
|
52efc34ebf
|
[Log] Optimize Startup Log (#26740)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-10-24 19:27:04 -04:00 |
|
Ilya Markov
|
237cf6d32a
|
[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selection (fix DP slow startup time &c) (#26709)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
|
2025-10-23 20:58:39 +08:00 |
|
iAmir97
|
1d165d6d85
|
[Chore] Separate out vllm.utils.mem_utils (#27143)
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com>
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com>
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-10-18 10:06:59 +00:00 |
|
Bram Wasti
|
7d8975de84
|
Deepseek-v3 Batch Invariant on 8xH100 (#26609)
Signed-off-by: Bram Wasti <bwasti@meta.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-10-15 22:06:02 -07:00 |
|
Harry Mellor
|
8fcaaf6a16
|
Update Optional[x] -> x | None and Union[x, y] to x | y (#26633)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-10-12 09:51:31 -07:00 |
|
wang.yuqi
|
76852017ea
|
[MISC] Rename the torch profiler filename as instance_id+rank_id for merging the Profiler results of each Rank (#25867)
Signed-off-by: wang.yuqi <noooop@126.com>
|
2025-10-12 09:29:08 +00:00 |
|
Cyrus Leung
|
ad430a67ca
|
[Metrics] Log multi-modal cache stats and fix reset (#26285)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-10-10 01:45:55 -07:00 |
|
Grant Holmes (Ren)
|
d100d78eb3
|
Optimize KV cache distribution for asymmetric pipeline parallelism (#25164)
Signed-off-by: gholmes829 <g.holmes429@gmail.com>
|
2025-10-07 09:20:30 +00:00 |
|
Harry Mellor
|
d6953beb91
|
Convert formatting to use ruff instead of yapf + isort (#26247)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-10-05 07:06:22 -07:00 |
|
David Ben-David
|
9a9f48dff7
|
[V1] [P/D] Add Support for KV Load Failure Recovery (#19330)
Signed-off-by: David Ben-David <davidb@pliops.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
|
2025-09-30 14:57:08 -07:00 |
|
Tyler Michael Smith
|
2e5df88c92
|
[Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning (#25532)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2025-09-25 15:16:06 +00:00 |
|
kourosh hakhamaneshi
|
abad204be6
|
[BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory accounting (#25359)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
|
2025-09-23 15:49:09 -07:00 |
|
Wentao Ye
|
846197f505
|
[Log] Optimize kv cache memory log from Bytes to GiB (#25204)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-09-23 12:44:37 -04:00 |
|
Michael Goin
|
2c58742dff
|
[UX] Change kv-cache-memory log level to debug (#25479)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
|
2025-09-23 08:01:24 -07:00 |
|
Matthew Bonanni
|
ac0048c0ae
|
[BugFix] [DP/EP] Fix slow execution when BS <= DP (#25407)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
|
2025-09-22 17:26:17 -07:00 |
|
Nick Hill
|
a25ade5d47
|
[BugFix] Ensure appropriate guards in destructors (#25284)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-09-20 09:06:34 +08:00 |
|
cascade
|
17871983a2
|
[Bugfix] Fix sequence parallelism bug when enable pipeline parallelism (#24021)
Signed-off-by: cascade812 <cascade812@outlook.com>
|
2025-09-16 04:32:32 +00:00 |
|
Boyuan Feng
|
94e6b2d55f
|
Allow users to specify kv cache memory size (#21489)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-09-11 13:41:07 +00:00 |
|
Chauncey
|
61aa4b2901
|
[P/D] Add a shutdown method to the Connector API (#22699)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
2025-09-07 23:07:00 -07:00 |
|
yzds
|
ac201a0eaf
|
[Feature] Support Decode Context Parallel (DCP) for MLA (#23734)
Signed-off-by: hongchao <hongchao@msh.team>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: hongchao <hongchao@msh.team>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2025-09-06 13:24:05 +08:00 |
|
Benjamin Chislett
|
cee182b297
|
[Perf][V1] Fully overlap model execution (#23569)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
|
2025-09-05 18:20:17 -07:00 |
|
bnellnm
|
e9b92dcd89
|
[Kernels] Overlap shared experts with send/recv (#23273)
Signed-off-by: Bill Nell <bnell@redhat.com>
|
2025-09-03 12:35:18 -04:00 |
|
Wentao Ye
|
98aee612aa
|
[Log] Only Print Profiler Results on Rank 0 (#23370)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-09-02 18:53:34 +00:00 |
|
Didier Durand
|
9701352e4b
|
[Doc]: fix typos in Python comments (#24001)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
|
2025-08-31 08:21:59 +00:00 |
|
Andy Lo
|
038e9be4eb
|
[LoRA] Much faster startup when LoRA is enabled (#23777)
Signed-off-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-08-30 15:37:39 +00:00 |
|
Nick Hill
|
d90d8eb674
|
[BugFix] Async scheduling and PP compatibility with DP (#23770)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-08-29 08:17:27 -07:00 |
|
weiliang
|
ae067888d6
|
Update Flashinfer to 0.2.14.post1 (#23537)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: siyuanf <siyuanf@nvidia.com>
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-08-25 18:30:44 -07:00 |
|
Chaojun Zhang
|
8a044754bd
|
[XPU] Delay BF16 check to worker init for spawn compatibility (#22979)
Signed-off-by: chzhang <chaojun.zhang@intel.com>
|
2025-08-25 13:09:26 -07:00 |
|
22quinn
|
2a167b2eeb
|
[test][RL] Add sleep level 2 test and fix reload with sleep mode (#23521)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
|
2025-08-26 00:25:52 +08:00 |
|
Ning Xie
|
325aa3dee9
|
[Misc] local import code clean (#23420)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
|
2025-08-22 14:01:35 +00:00 |
|
rongfu.leng
|
4fbda0b20c
|
[Feature] use --eplb_config to set eplb param (#20562)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-08-20 14:07:28 -07:00 |
|
Woosuk Kwon
|
c9b38be8aa
|
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT (#23041)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-18 17:20:38 -07:00 |
|
fhl2000
|
74f441f4b5
|
[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer (#20059)
Signed-off-by: fhl <2410591650@qq.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
|
2025-08-15 10:01:39 -04:00 |
|
nvjullin
|
279a5f31b3
|
[Kernel] Add nvfp4 gemm flashinfer backends (#22346)
Signed-off-by: Julien Lin <jullin@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-08-14 16:03:55 -04:00 |
|
Varun Sundar Rabindranath
|
f703b923f3
|
[Misc] DeepGEMM : Avoid JIT generation in the hot-path (#22215)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
|
2025-08-08 16:09:59 -07:00 |
|
David Ben-David
|
aefeea0fde
|
[V1] [P/D] Refactor KV Connector Path (#21980)
Signed-off-by: David Ben-David <davidb@pliops.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
|
2025-08-03 04:03:40 -07:00 |
|