SungMinCho
a0b782f9cc
[Metrics] Model FLOPs Utilization estimation ( #30738 )
...
Signed-off-by: SungMinCho <tjdals4565@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2025-12-18 01:40:51 +00:00
Roger Wang
f5f51e5931
[Core][MM] Optimize encoder cache manager by operating with embeddings only ( #30475 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Sun Kim <sunytokki@gmail.com >
2025-12-16 14:18:17 -08:00
Nicolò Lucchesi
75eb302a2e
[Bugfix] Whisper fix number of allocated CrossAttn blocks per-request ( #30772 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-16 14:20:19 +00:00
Nick Hill
1cec5b7ea9
[Scheduer] Simplify stop checking for pooling models ( #30591 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-13 09:45:26 +00:00
Nicolò Lucchesi
0efd9f867c
[Core] Whisper Enable Encoder Batching ( #29421 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-11 21:06:51 +00:00
Will Eaton
a9e4106f28
[P/D] KV Load Failure Recovery/Abort Configuration ( #26813 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
Signed-off-by: Will Eaton <me@wseaton.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-10 11:00:52 -08:00
Yifan Qiao
1b0482b9d1
[Misc][Core] Remove unused req_index increment in scheduler ( #30176 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2025-12-07 08:39:21 +00:00
Cyrus Leung
e83b7e379c
Revert "[Renderer] Separate out RendererConfig from ModelConfig ( #30145 )" ( #30199 )
2025-12-07 00:00:22 -08:00
Cyrus Leung
27f4c2fd46
[Renderer] Separate out RendererConfig from ModelConfig ( #30145 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-06 23:15:42 -08:00
Tova Movshovitz
adb315060c
[KVConnector][Feature] Support KV connector cache reset via /reset_prefix_cache ( #27170 )
...
Signed-off-by: tovam <tovam@pliops.com >
Signed-off-by: Tova Movshovitz <tovam@pliops.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-05 18:33:26 +00:00
Yong Hoon Shin
69520bc695
Add logging for cudagraph related info ( #29825 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-12-03 01:01:48 -08:00
Zhuohan Li
d0cd728907
[Core] Support reseting all running requests' KV while calling reset_prefix_cache ( #28827 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-12-02 02:25:05 +00:00
Nick Hill
44822d7ff2
[BugFix] Preserve spec decoding uniform decode when scheduling ( #29759 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-01 17:15:52 -08:00
shivampr
cabc77cc86
[Core][Observability] Add KV cache residency metrics ( #27793 )
...
Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:
- vllm:kv_block_lifetime_seconds: total lifetime from allocation to free
- vllm:kv_block_idle_before_evict_seconds: idle duration before eviction
- vllm:kv_block_reuse_gap_seconds: time between consecutive reuses of the same block
These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.
The implementation uses monotonic timestamps for accuracy and 1% sampling to keep overhead low (~48 bytes/block). It is fully thread-safe and has zero runtime cost when disabled.
Two new runtime flags are introduced:
- --kv-cache-metrics: enable KV cache residency metrics
- --kv-cache-metrics-sample: control sampling ratio (default: 0.01)
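To make the three measurements concrete, here is a minimal sketch of how sampled residency tracking of this kind could work. It is not the code from this commit: the class, method, and attribute names are hypothetical, free and eviction are treated as a single event for simplicity, plain lists stand in for the Prometheus histograms, and thread safety is omitted.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BlockTimes:
    """Timestamps kept for one sampled block (hypothetical structure)."""
    allocated_at: float
    last_access_at: float


@dataclass
class ResidencySampler:
    """Illustrative tracker only; names and layout are assumptions."""
    sample_rate: float = 0.01                      # mirrors the 0.01 default ratio
    clock: Callable[[], float] = time.monotonic    # monotonic, immune to wall-clock jumps
    rng: random.Random = field(default_factory=random.Random)
    tracked: dict = field(default_factory=dict)
    lifetimes: list = field(default_factory=list)
    idle_before_evict: list = field(default_factory=list)
    reuse_gaps: list = field(default_factory=list)

    def on_alloc(self, block_id: int) -> None:
        # Track only a random fraction of blocks to bound the overhead.
        if self.rng.random() < self.sample_rate:
            now = self.clock()
            self.tracked[block_id] = BlockTimes(now, now)

    def on_access(self, block_id: int) -> None:
        times = self.tracked.get(block_id)
        if times is None:
            return  # block was not sampled
        now = self.clock()
        self.reuse_gaps.append(now - times.last_access_at)
        times.last_access_at = now

    def on_free(self, block_id: int) -> None:
        times = self.tracked.pop(block_id, None)
        if times is None:
            return  # block was not sampled
        now = self.clock()
        self.lifetimes.append(now - times.allocated_at)
        self.idle_before_evict.append(now - times.last_access_at)
```

With sample_rate=1.0 every block is tracked, which is convenient for tests. With sample_rate=0.0 no block is ever tracked, so the access and free hooks return immediately.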
Signed-off-by: Shivam <shivamprasad91@gmail.com >
2025-12-01 18:27:53 +00:00
Mickaël Seznec
86e178f7c4
[crashfix] Eagle + multimodal can crash on mm cache miss ( #29750 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-12-01 17:29:33 +08:00
Nick Hill
8e7a891602
[BugFix] Fix spec decoding max_tokens scheduling perf issue ( #29542 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-28 20:52:23 +08:00
Nick Hill
4e57c6587f
[Core] Support logprobs with spec decode + async scheduling ( #29223 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-25 12:55:24 -08:00
Yifan Qiao
48ddb02b79
[Hybrid Allocator] Support KV cache groups with different block_size ( #29143 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-11-25 10:30:57 -05:00
wang.yuqi
67fc16cd8c
[Bugfix] If chunked_prefill is disabled, end the scheduling early. ( #28911 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-11-25 16:06:09 +08:00
Cyrus Leung
5a4802588e
[Misc] Further clean up chunked prefill and prefix caching init ( #29186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-22 19:34:15 +08:00
Mark McLoughlin
c6fa3895e9
[KV Connector] Fix async connector prefix cache metrics ( #28585 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2025-11-21 17:45:00 -05:00
Woosuk Kwon
30b44a1598
GPU Model Runner V2 ( #25266 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-21 08:20:55 -08:00
Jialin Ouyang
30b9c67743
Revert "[Redo] #26368 ( #28771 )" ( #29121 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-11-20 21:27:45 -08:00
Qiu
2fd893b4ce
[Feature] Prefill Context Parallel (PCP) basic support ( #28718 )
...
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com >
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com >
Signed-off-by: LookAround <lixushi@huawei.com >
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com >
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com >
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com >
Co-authored-by: LookAround <lixushi@huawei.com >
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com >
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com >
Co-authored-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com >
2025-11-19 15:52:44 -05:00
Zhuohan Li
552cac95b5
[Misc] Fix wrong comment in scheduler ( #28880 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-11-17 15:32:22 -08:00
Ronald
d8874c61a5
[Core] Async Scheduling X Spec Decoding Compatibility ( #24799 )
...
Signed-off-by: Ronald1995 <ronaldautomobile@163.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
2025-11-17 12:16:20 -08:00
Nick Hill
80b6080ddc
[BugFix] Fix async scheduling + chunked prefill + preemption ( #28787 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-17 06:46:46 +08:00
Cyrus Leung
638e4196d1
[Misc] Make SchedulerConfig.max_model_len init-only ( #28733 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-15 01:59:31 -08:00
Cyrus Leung
98b4d389ed
[Redo] #26368 ( #28771 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-11-14 22:47:41 -08:00
Nick Hill
ac86bff8cb
Revert "[Core] Performance: Use list[np.ndarray] instead of list[list… ( #28773 )
2025-11-14 20:24:00 -08:00
Jialin Ouyang
186352b270
[Core] Performance: Use list[np.ndarray] instead of list[list[int]] for output tokens for GC optimization ( #26368 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-11-14 16:04:04 -08:00
Cyrus Leung
e2741f6cbc
[Chore] Rename SchedulerConfig.chunked_prefill_enabled ( #28735 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-14 18:39:57 +00:00
Andy Lo
58ce8d12b7
[BugFix] Priority scheduling and spec tokens preemption ( #28558 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-11-12 20:29:21 +00:00
Chenguang Zheng
4ccffe561f
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation ( #25233 )
...
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com >
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Signed-off-by: herotai214 <herotai214@gmail.com >
Signed-off-by: Khuong Le <khuong.le.manh@huawei.com >
Signed-off-by: Khuong Le <lemanhkhuong2611@gmail.com >
Co-authored-by: n00909098 <nguyen.kha.long@huawei.com >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: herotai214 <herotai214@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Khuong Le <khuong.le.manh@huawei.com >
Co-authored-by: Khuong Le <lemanhkhuong2611@gmail.com >
2025-11-11 18:58:33 -08:00
Wei Wei
bf6a3d0ff5
[Misc] Add more scoping for improved trace ( #28329 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-11-10 21:03:21 +00:00
Andy Lo
47604137a2
[Bugfix] Spec decode + structured output + spec model max len edge case ( #28298 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-11-08 19:44:25 +00:00
Snehlata
e15601789b
[Feature]: Add corrupted request metric to V1 metrics system. ( #27306 )
...
Signed-off-by: atalhens <sneh.lata@nutanix.com >
2025-11-05 13:45:29 -08:00
Nick Hill
938a81692e
[AsyncScheduling] Don't schedule past request max_tokens ( #27922 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-04 17:06:28 +00:00
Mark McLoughlin
58279c60b5
[KV Connector] Make KVCacheConfig an explicit constructor argument ( #27887 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-11-03 23:00:49 -08:00
Nick Hill
0cdbe7b744
[Core] Async scheduling + structured outputs compatibility ( #26866 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-01 00:35:04 +00:00
Tyler Michael Smith
ab98f6556f
[Bugfix] Fix 2 precommit issues - (mamba_block_size, kv_cache_config) ( #27811 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-10-30 11:52:18 -07:00
Wentao Ye
c01f6e525f
[CI] Fix mypy for vllm/v1/core and vllm/v1/engine ( #27108 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-30 11:32:17 +00:00
Nick Hill
2ce5c5d3d6
[BugFix] Handle unscheduled requests properly when async scheduling ( #27756 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-10-29 21:04:25 -07:00
Or Ozeri
111faf1118
[Core] Scheduler: Publish connector events after output ( #25875 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-10-28 21:01:33 +00:00
usberkeley
69f064062b
Code quality improvements: version update, type annotation enhancement, and enum usage simplification ( #27581 )
...
Signed-off-by: Bradley <bradley.b.pitt@gmail.com >
2025-10-27 17:50:22 +00:00
Kuntai Du
b853540388
[Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector ( #25712 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-10-24 23:34:18 -07:00
Tova Movshovitz
88afa11010
[Metrics] [KVConnector] Add connector prefix cache hit rate stats ( #26245 )
...
Signed-off-by: tovam <tovam@pliops.com >
2025-10-23 12:21:08 +02:00
PiteXChen
243ed7d32e
[Bugfix][Core] running queue index leakage exception ( #26754 )
...
Signed-off-by: CLFutureX <chenyongqyl@163.com >
2025-10-22 21:40:12 -07:00
Nick Hill
4aed506b65
[Core] Streamline some structured output related code ( #26737 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-10-14 23:27:44 +00:00
Jialin Ouyang
acaa2c0a4a
[Core] Reuse empty block lists whenever possible in KVCacheBlocks to mitigate GC costs ( #24964 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-10-14 12:58:43 -07:00