Commit Graph

203 Commits

Author SHA1 Message Date
zhongdaor-nv
a0fe7ea2f0 [feat] Add per-block extra_keys to KV events (#33304)
Signed-off-by: zhongdaor-nv <zhongdaor@nvidia.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-02-20 20:11:40 -08:00
Ekagra Ranjan
cd81cdb399 [Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs (#31058)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2026-02-16 11:08:44 +00:00
haosdent
4137c5dfa7 [Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode (#34418)
Signed-off-by: haosdent <haosdent@gmail.com>
2026-02-13 00:13:22 -08:00
Cyrus Leung
ea5ff3a1f6 [Refactor] Simplify BOS/EOS token handling (#34435)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-02-12 18:18:24 -08:00
junuxyz
c5a66d1697 [Core][BugFix] Fix PP KV cache sharding memory validation (#33698)
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
2026-02-10 10:46:24 -05:00
Krish Gupta
748625cdaf [V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input (#34220)
Signed-off-by: KrxGu <krishom70@gmail.com>
2026-02-10 13:05:32 +00:00
Chen Zhang
97fa8f6590 [BugFix] Avoid prefix cache hit in the same schedule step for mamba layers (#29387)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2026-02-10 07:41:16 +00:00
Roger Wang
8a5e0e2b2b [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching (#34183)
Signed-off-by: Roger Wang <hey@rogerw.io>
2026-02-10 13:03:32 +08:00
Cyrus Leung
48312e579a [Misc] Make PlaceholderRange.get_num_embeds a method (#34035)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-02-07 05:30:17 +00:00
Mark McLoughlin
2abd97592f [KV Connector][Metrics] Do not count local prefix cache hits in connector queries (#30522)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2026-02-05 09:57:27 +02:00
Nick Hill
fa4e0fb028 [Core] Don't schedule spec tokens with prefill chunks (#33652)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-02-04 23:40:22 +00:00
Or Ozeri
8e32690869 [KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads (#32255)
Fixes a not-yet-reported case where it was possible for blocks to be
freed by an abort before an async transfer completed, resulting
in corrupted KV data.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-02-04 11:16:34 +00:00
Nick Hill
52ee21021a [BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-02-03 23:34:41 +00:00
Yifan Qiao
a01ef3fa51 [Fix] prefix cache hit rate == 0 bug with gpt-oss style models (#33524)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
2026-02-02 01:59:58 +00:00
jma99_2333
22d9a056d5 Support clear mm and encoder cache (#33452)
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-01-31 15:22:25 +00:00
Cyrus Leung
c6e7404cc5 [Multimodal] Simplify MM input definitions (#33331)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-29 13:32:04 +00:00
Wentao Ye
3e440786af [Feature] Fully support for async scheduling + PP, 30.8% E2E throughput improvement, 31.8% TPOT improvement (#32618)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
2026-01-28 20:30:32 +00:00
Joshua Deng
91601ff478 [Feature] add session based streaming input support to v1 (#28973)
Signed-off-by: Joshua Deng <joshuakdeng@gmail.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
2026-01-24 12:06:28 -08:00
Harry Huang
5206e5e28c [V1][Hybrid] Mamba Prefix Caching with align mode (#30877)
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2026-01-23 09:56:48 -08:00
knlnguyen1802
378385b90c [EC Connector] Optimize remote cache check in scheduler (#32585)
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
2026-01-22 03:30:59 +00:00
Wentao Ye
b34474bf2c [Feature] Support async scheduling + PP (#32359)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-15 12:06:23 -05:00
Lumosis
66652e8082 [BugFix] Assign page_size_padded when unifying kv cache spec. (#32283)
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
2026-01-14 20:10:01 +00:00
Or Ozeri
2be765b68a [BugFix] scheduler: Fix ordering preserving of skipped requests (#32173)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-12 18:39:38 +00:00
Or Ozeri
028599739d [BugFix] scheduler: Fix resuming of preempted requests after async load (#31583)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-10 12:39:25 -08:00
Yifan Qiao
cd4a95e3aa [Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator (#31707)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
2026-01-09 10:53:20 -08:00
Lumosis
b634e619bb Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility. (#31635)
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: Lumosis <30372757+Lumosis@users.noreply.github.com>
2026-01-08 09:00:07 +00:00
Nick Hill
32f4e4db00 [Cleanup] Remove deprecated fields from CachedRequestData class (#31734)
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-05 21:07:14 +00:00
Yifan Qiao
52bf066516 [Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector (#30166)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: KuntaiDu <kuntai@uchicago.edu>
2025-12-26 18:25:46 -08:00
Michael Goin
8ee90c83f8 Add --max-model-len auto to auto-fit context to available memory (#29431)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-23 21:37:14 -08:00
Chen Zhang
538e830caa [KVEvent] User request.block_hash for parent block_hash (#30544)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: Yifan Qiao <yifanqiao@berkeley.edu>
2025-12-23 18:23:43 -08:00
Roger Wang
f5f51e5931 [Core][MM] Optimize encoder cache manager by operating with embeddings only (#30475)
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Sun Kim <sunytokki@gmail.com>
2025-12-16 14:18:17 -08:00
Or Ozeri
4c6fd25880 kv_transfer: Rename the shared storage connectors (#30201)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2025-12-08 20:46:09 -08:00
Cyrus Leung
e83b7e379c Revert "[Renderer] Separate out RendererConfig from ModelConfig (#30145)" (#30199) 2025-12-07 00:00:22 -08:00
Cyrus Leung
27f4c2fd46 [Renderer] Separate out RendererConfig from ModelConfig (#30145)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-06 23:15:42 -08:00
rasmith
feecba09af [CI/Build][AMD] Use float16 in test_reset_prefix_cache_e2e to avoid accuracy issues (#29997)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-12-05 08:42:25 +00:00
Mark McLoughlin
899e2ef558 [Core] Fix standalone runs of test_reset_prefix_cache_e2e (#29899)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-12-04 16:22:03 +08:00
Chauncey
0a9caca9f5 [Bugfix] fix --scheduling-policy=priority & n>1 crashes engine (#29764)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-12-02 22:42:28 +00:00
Harry Mellor
951445a52d Remove default values from InitVars so that they're not stored (#29859)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-02 12:16:37 +00:00
Zhuohan Li
d0cd728907 [Core] Support reseting all running requests' KV while calling reset_prefix_cache (#28827)
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-12-02 02:25:05 +00:00
shivampr
cabc77cc86 [Core][Observability] Add KV cache residency metrics (#27793)
Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:

vllm:kv_block_lifetime_seconds — total lifetime from allocation to free
vllm:kv_block_idle_before_evict_seconds — idle duration before eviction
vllm:kv_block_reuse_gap_seconds — time between consecutive reuses of the same block

These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.

Implementation uses monotonic timestamps for accuracy, 1% sampling for minimal overhead (~48 bytes/block), and is fully thread-safe with zero runtime cost when disabled.

Two new runtime flags are introduced:

--kv-cache-metrics – enable KV cache residency metrics
--kv-cache-metrics-sample – control sampling ratio (default: 0.01)

Signed-off-by: Shivam <shivamprasad91@gmail.com>
2025-12-01 18:27:53 +00:00
Marcin Ostrowski
5cfa967efa [Bugfix] TypeError: 'NoneType' object is not callable (#29414)
Signed-off-by: Marcin Ostrowski <marcinx.ostrowski@intel.com>
2025-12-01 13:16:44 +00:00
maang-h
c7ba1f6bc7 [BugFix] Fix ValueError in NewRequestData repr methods (#29392)
Signed-off-by: maang <maang_h@163.com>
2025-11-28 13:42:30 +08:00
Yifan Qiao
48ddb02b79 [Hybrid Allocator] Support KV cache groups with different block_size (#29143)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-11-25 10:30:57 -05:00
wang.yuqi
67fc16cd8c [Bugfix] If chunked_prefill is disabled, end the scheduling early. (#28911)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2025-11-25 16:06:09 +08:00
Chen Zhang
71df2a57ef [Hybrid Allocator] Better layer padding strategy for gpt-oss eagle (#29303)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-11-24 14:28:32 -08:00
Cyrus Leung
5a4802588e [Misc] Further clean up chunked prefill and prefix caching init (#29186)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-22 19:34:15 +08:00
Mark McLoughlin
c6fa3895e9 [KV Connector] Fix async connector prefix cache metrics (#28585)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-11-21 17:45:00 -05:00
Jialin Ouyang
30b9c67743 Revert "[Redo] #26368 (#28771)" (#29121)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-20 21:27:45 -08:00
Cyrus Leung
98b4d389ed [Redo] #26368 (#28771)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-14 22:47:41 -08:00
Nick Hill
ac86bff8cb Revert "[Core] Performance: Use list[np.ndarray] instead of list[list… (#28773) 2025-11-14 20:24:00 -08:00