wwl2755
|
463bbb1835
|
[Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process (#15367)
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
|
2025-04-03 07:32:10 +00:00 |
|
Nicolò Lucchesi
|
bd7599d34a
|
[V1][TPU] Do not compile sampling more than needed (#15883)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-04-03 01:36:01 +00:00 |
|
Chengji Yao
|
01b6113659
|
[TPU] optimize the all-reduce performance (#15903)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
|
2025-04-03 00:25:14 +00:00 |
|
Eric Tang
|
ddb94c2605
|
[core] Add tags parameter to wake_up() (#15500)
Signed-off-by: Eric <erictang000@gmail.com>
|
2025-04-02 01:59:27 -07:00 |
|
Woosuk Kwon
|
e75a6301bd
|
[V1][Spec Decode] Implement Eagle Proposer [1/N] (#15729)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-04-01 12:33:16 -07:00 |
|
Chen Zhang
|
3a5f0afcd2
|
[V1] Implement sliding window attention in kv_cache_manager (#14097)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-04-01 00:33:17 -07:00 |
|
Alexander Matveev
|
9a2160fa55
|
[V1] TPU CI - Add basic perf regression test (#15414)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
|
2025-03-31 13:25:20 -04:00 |
|
yihong
|
2de4118243
|
fix: change GB to GiB in logging close #14979 (#15807)
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
|
2025-03-31 10:00:50 -07:00 |
|
Cyrus Leung
|
09e974d483
|
[Bugfix] Check dimensions of multimodal embeddings in V1 (#15816)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-03-31 09:01:35 -07:00 |
|
youkaichao
|
555aa21905
|
[V1] Fully Transparent Implementation of CPU Offloading (#15354)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-03-31 20:22:34 +08:00 |
|
Chengyang LIU
|
18ed3132d2
|
[Misc] update the comments (#15780)
Signed-off-by: chengyang liu <lcy4869@gmail.com>
Co-authored-by: chengyang liu <lcy4869@gmail.com>
|
2025-03-30 19:39:56 -07:00 |
|
Cyrus Leung
|
803d5c35f3
|
[V1] Override mm_counts for dummy data creation (#15703)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-03-30 03:20:42 -07:00 |
|
Woosuk Kwon
|
2bc4be4e32
|
[V1][Minor] Simplify rejection sampler's parse_output (#15741)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-29 09:25:17 -07:00 |
|
yarongmu-google
|
7c1f760024
|
[Kernel][TPU][ragged-paged-attn] vLLM code change for PR#8896 (#15659)
Signed-off-by: Yarong Mu <ymu@google.com>
|
2025-03-28 21:13:15 -07:00 |
|
Nicolò Lucchesi
|
da461f3cbf
|
[TPU][V1][Bugfix] Fix w8a8 recompilation with GSM8K (#15714)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-03-28 21:13:06 -07:00 |
|
Alexander Matveev
|
c3f687ac22
|
[V1] TPU - Fix the chunked prompt bug (#15713)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
|
2025-03-28 20:19:04 +00:00 |
|
Robert Shaw
|
038bededba
|
[TPU] [Perf] Improve Memory Usage Estimation (#15671)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
|
2025-03-28 17:37:52 +00:00 |
|
Cyrus Leung
|
c6bc0034d0
|
[Misc] Remove unused utils and clean up imports (#15708)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-03-28 09:41:16 -07:00 |
|
Cyrus Leung
|
355f66348c
|
[V1] Remove legacy input registry (#15673)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-03-27 23:34:34 -07:00 |
|
Nicolò Lucchesi
|
4098b72210
|
[Bugfix][TPU][V1] Fix recompilation (#15553)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-03-27 19:15:06 +00:00 |
|
Cyrus Leung
|
13ac9cab21
|
[Misc] Avoid direct access of global mm_registry in compute_encoder_budget (#15621)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-03-27 17:52:00 +00:00 |
|
Hiroaki Sugiyama
|
8958217ad5
|
[Bugfix] Fix use_cascade_attention handling for Alibi-based models on vllm/v1 (#15211)
Signed-off-by: h-sugi <h.sugi@ieee.org>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-27 22:29:29 +08:00 |
|
Chengji Yao
|
619d3de8bd
|
[TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS (#15583)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
|
2025-03-26 22:46:26 -07:00 |
|
Cody Yu
|
54aa619459
|
[V1] Refactor num_computed_tokens logic (#15307)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-27 04:54:36 +00:00 |
|
Chengji Yao
|
e74ff409e0
|
[TPU] support disabling xla compilation cache (#15567)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
|
2025-03-27 00:09:28 +00:00 |
|
Alexander Matveev
|
b2e85e26f4
|
[V1] TPU - Revert to exponential padding by default (#15565)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
|
2025-03-26 21:35:05 +00:00 |
|
Chenyaaang
|
ac3cd6e83c
|
[core] add bucket padding to tpu_model_runner (#14995)
Signed-off-by: Chenyaaang <llccyy1212@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
|
2025-03-25 17:27:22 -04:00 |
|
Nicolò Lucchesi
|
a0dd7dcd49
|
[TPU][V1] Fix Sampler recompilation (#15309)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-03-25 16:43:54 -04:00 |
|
Siyuan Liu
|
4157f563b4
|
[Hardware][TPU][Bugfix] Fix v1 mp profiler (#15409)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
|
2025-03-25 01:43:00 -07:00 |
|
Woosuk Kwon
|
25f560a62c
|
[V1][Spec Decode] Update target_logits in place for rejection sampling (#15427)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-24 21:04:41 -07:00 |
|
Nick Hill
|
3aee6573dc
|
[V1] Aggregate chunked prompt logprobs in model runner (#14875)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-03-24 12:27:57 -04:00 |
|
Woosuk Kwon
|
b9bd76ca14
|
[V1][Spec Decode] Respect prompt_lookup_max (#15348)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-23 10:41:44 -07:00 |
|
shangmingc
|
50c9636d87
|
[V1][Usage] Refactor speculative decoding configuration and tests (#14434)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-03-22 19:28:10 -10:00 |
|
Chen Zhang
|
93a00d7dde
|
[v1] Refactor KVCacheConfig (#14079)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-03-21 04:56:27 -07:00 |
|
Siyuan Liu
|
b15fd2be2a
|
[Hardware][TPU] Add check for no additional graph compilation during runtime (#14710)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
|
2025-03-21 03:05:28 +00:00 |
|
Woosuk Kwon
|
0c6f5023c3
|
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
|
2025-03-20 17:50:43 -07:00 |
|
Woosuk Kwon
|
2b22290ce0
|
[V1] Add flag to disable cascade attention (#15243)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-20 15:24:16 -07:00 |
|
Mickaël Seznec
|
a597a57595
|
[Attention] Flash Attention 3 - fp8 (#14570)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
|
2025-03-20 01:14:20 -04:00 |
|
Nicolò Lucchesi
|
d8c6d7d6b5
|
[V1][TPU] Support V1 Sampler for ragged attention (#14227)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-03-19 21:00:39 -07:00 |
|
Woosuk Kwon
|
99abb8b650
|
[V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels (#14930)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-18 14:31:54 -07:00 |
|
Nicolò Lucchesi
|
af35d3a3cc
|
[TPU][V1][Bugfix] Fix chunked prefill with padding (#15037)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-03-18 07:34:45 -07:00 |
|
Varun Sundar Rabindranath
|
400d483e87
|
[Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2025-03-18 09:47:53 +00:00 |
|
iefgnoix
|
b4ad56c1bd
|
[V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. (#14846)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
|
2025-03-17 01:48:28 -07:00 |
|
Cyrus Leung
|
b539222d4e
|
[V1] Remove input cache client (#14864)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2025-03-16 23:42:06 -07:00 |
|
Lily Liu
|
8d6cf89526
|
[V1] [Spec Decode] Support random sampling for spec decode (#13933)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-16 22:00:20 -07:00 |
|
Woosuk Kwon
|
faa0275730
|
[V1] Optimize the overhead of rewinding (#14905)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-16 20:19:30 -07:00 |
|
Woosuk Kwon
|
31060b2757
|
[V1][BugFix] Detect interleaved sliding window attention (#14896)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-03-16 14:53:53 -07:00 |
|
Nick Hill
|
fc1f67715d
|
[BugFix][V1] Fix overhead related to bad_words sampling when not in use (#14894)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-03-16 14:53:34 -07:00 |
|
Roger Wang
|
ad19c8a003
|
[V1] Move OOM check into sampler run (#14728)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2025-03-13 20:40:23 -07:00 |
|
Benjamin Chislett
|
5c538c37b2
|
[V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing (#14645)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
|
2025-03-11 22:12:41 -07:00 |
|