333 Commits

Author SHA1 Message Date
Cyrus Leung
766bc8162c [Core] Store only the keys for multi-modal data in P0 (#22198)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-07 01:45:04 -07:00
lkchen
f4f4e7ef27 [V0 deprecation][P/D] Deprecate v0 KVConnectorBase code (1/2) (#21785)
Signed-off-by: Linkun Chen <github@lkchen.net>
2025-08-04 19:11:33 -07:00
Woosuk Kwon
7175817637 Revert "[Bugfix] V1 Fix the cursor leakage issue during request scheduling." (#22223)
2025-08-04 18:37:06 -07:00
PiteXChen
2dffac464c [Bugfix] V1 Fix the cursor leakage issue during request scheduling. (#21173)
Signed-off-by: CLFutureX <775523362@qq.com>
2025-08-04 18:34:10 -07:00
David Ben-David
aefeea0fde [V1] [P/D] Refactor KV Connector Path (#21980)
Signed-off-by: David Ben-David <davidb@pliops.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
2025-08-03 04:03:40 -07:00
Zebing Lin
e0f63e4a35 [Core] Avoid repeated len(block_token_ids) check in hash_request_tokens (#21781)
Signed-off-by: linzebing <linzebing1995@gmail.com>
2025-08-01 00:23:29 -07:00
Ruixiang Tan
8f4a1c9a04 [Misc] Improve code readability of KVCacheManager (#21673)
Signed-off-by: tanruixiang <tanruixiang0104@gmail.com>
Signed-off-by: Ruixiang Tan <819464715@qq.com>
Signed-off-by: GitHub <noreply@github.com>
2025-07-30 07:20:43 -07:00
MingzhenHan
b7b23da4d2 [Bugfix] Fix comment typo of get_num_common_prefix_blocks() (#21827)
Signed-off-by: MingzhenHan <hanmingzhen2002@outlook.com>
2025-07-29 20:35:33 -07:00
Chen Zhang
755fa8b657 [KVCache] Make KVCacheSpec hashable (#21791)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-07-29 19:58:29 +08:00
Kuntai Du
b18b417fbf Revert "[V1] Exception Handling when Loading KV Cache from Remote Store" (#21778)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
2025-07-28 20:15:18 +00:00
Adeline
15a72ac478 [V1] Exception Handling when Loading KV Cache from Remote Store (#21534)
Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
2025-07-27 20:34:17 -07:00
Zhou Fang
fc5f756db4 [v1][Core] Clean up usages of SpecializedManager (#21407)
Signed-off-by: Zhou Fang <fang.github@gmail.com>
2025-07-24 00:40:11 -07:00
Raushan Turganbay
f38ee34a0a [feat] Enable mm caching for transformers backend (#21358)
Signed-off-by: raushan <raushan@huggingface.co>
2025-07-22 08:18:46 -07:00
Jialin Ouyang
ed25054577 [Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool (#21222)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-22 06:17:47 -07:00
Simon Mo
32142b3c62 [Bugfix] Fix eviction cached blocked logic (#21357)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-07-22 01:18:40 -07:00
Jialin Ouyang
af376ca19d [Core] Minimize number of dict lookup in _maybe_evict_cached_block (#21281)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-21 22:37:34 -07:00
Lucia Fang
9a9fda1423 [Core] Support Local Chunked Attention for Hybrid KV Cache (#19351)
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@meta.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Lu Fang <fanglu@meta.com>
2025-07-18 20:48:38 -07:00
JialinOuyang-Meta
0f199f197b [Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue (#21005)
Signed-off-by: Jialin Ouyang <jialino@meta.com>
2025-07-18 12:34:40 -07:00
Lucas Wilkinson
89cab4d01f [Attention] Make local attention backend agnostic (#21093)
2025-07-18 00:10:42 -04:00
Christian Pinto
4ffd963fa0 [v1][core] Support for attention free models (#20811)
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
2025-07-15 14:20:01 +00:00
Woosuk Kwon
d4d309409f Implement Async Scheduling (#19970)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-14 23:01:46 -07:00
Maroon Ayoub
66f6fbd393 [Prefix Cache] Add reproducible prefix-cache block hashing using SHA-256 + CBOR (64bit) (#20511)
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
2025-07-14 02:45:31 +00:00
nopperl
4bbfc36b16 [V1] Hybrid allocator without prefix caching (#20661)
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
2025-07-13 16:55:14 +00:00
Woosuk Kwon
f45a332886 [Sched] Enhance the logic to remove stopped requests from queues (#20739)
2025-07-12 15:33:13 -07:00
Woosuk Kwon
7c12a765aa [Misc] Simplify the prefix caching logic on draft tokens (#20701)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-09 14:48:35 -07:00
Woosuk Kwon
31c5d0a1b7 [Optimize] Don't send token ids when kv connector is not used (#20586)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-07 19:04:54 -07:00
Peter Pan
edd270bc78 [Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled (#20486)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2025-07-07 09:41:15 -07:00
Thomas Parnell
2f35a022e6 Enable V1 for Hybrid SSM/Attention Models (#20016)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-07-04 17:46:53 +00:00
Jee Jee Li
1caca5a589 [Misc] Add SPDX-FileCopyrightText (#20428)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-04 07:40:42 +00:00
Woosuk Kwon
7f280d69c9 [Optimization] Cache sampled token ids in model runner (#20291)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-01 11:01:31 -07:00
Woosuk Kwon
0e96cc9b7e [Misc] Minor refactoring for scheduler (#20299)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-01 07:55:32 -07:00
Woosuk Kwon
2863befce3 [Optimization] Use Shared CachedRequestData Instance Across All Requests (#20232)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-30 09:07:50 -07:00
amit
4a0f7888a3 [Core] feat: Implement Priority Scheduling in V1 Engine (#19057)
Signed-off-by: amit <amit.man@gmail.com>
Co-authored-by: Roger Wang <Rogerw0108@gmail.com>
2025-06-22 20:18:08 -07:00
Vlad Tiberiu Mihailescu
2e3e3c86dc Export NaNs in logits to scheduler_stats if output is corrupted (#18777)
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
2025-06-20 22:47:16 +08:00
Maximilien de Bayser
799397ee4f Support embedding models in V1 (#16188)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-06-18 21:36:33 -07:00
Chen Zhang
a89209b78d [v1] Support mamba2 (#19327)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-06-18 20:34:15 +00:00
Russell Bryant
5f52a84685 [V1] Add API docs for EncoderCacheManager (#19294)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-06-18 13:37:01 +08:00
Saheli Bhattacharjee
d1e34cc9ac [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (#18354)
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
2025-06-14 11:07:36 +08:00
Nick Hill
7e8d97dd3f [BugFix] Honor enable_caching in connector-delayed kvcache load case (#19435)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-06-13 09:46:32 +00:00
jmswen
c9280e6346 [Bugfix] Respect num-gpu-blocks-override in v1 (#19503)
Signed-off-by: Jon Swenson <jmswen@gmail.com>
2025-06-12 11:00:23 +00:00
Nick Hill
646d62f636 [Core] Use tuple for kv cache group block ids (#19175)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-06-10 07:01:17 +02:00
Nicolò Lucchesi
b6a3a9f76d [Core] Fix abrupt request abort (#18485)
Signed-off-by: nicklucche <nlucches@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-06-06 16:27:59 -07:00
Nick Hill
aad30bd306 [BugFix] Fix MultiConnector test after HMA changes (#19291)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-06-06 20:16:24 +00:00
Nick Hill
65c69444b1 [Docs] Improve V1 KVConnector interface documentation (#19172)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-06-06 16:22:45 +08:00
Jinghui Zhang
90b78ec5f9 [v1][P/D] Fix a edge case in kv cache schedule (#19182)
Co-authored-by: jinghui <jinghui@fb.com>
2025-06-05 23:32:55 -07:00
Chen Zhang
f8a1a2d108 [v1] Hybrid Memory Allocator (#17996)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-06-05 20:47:09 -07:00
Robert Shaw
c56ed8bb0e [Bugfix][Nixl] Fix full prefix cache hit bug (#18632)
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-06-05 02:07:32 +00:00
Yan Ru Pei
b712be98c7 feat: add data parallel rank to KVEventBatch (#18925)
2025-06-03 17:14:20 -07:00
Chen Zhang
a8da78eac9 [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (#19029)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-06-04 00:14:06 +00:00
Chen Zhang
b5fd9506c1 [Bugfix] get_num_blocks_to_allocate with null_block (#19031)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-06-03 15:30:55 -07:00