Commit Graph

165 Commits

Author SHA1 Message Date
Hiroaki Sugiyama
8958217ad5 [Bugfix] Fix use_cascade_attention handling for Alibi-based models on vllm/v1 (#15211)
Signed-off-by: h-sugi <h.sugi@ieee.org>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-27 22:29:29 +08:00
Chengji Yao
619d3de8bd [TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS (#15583)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-03-26 22:46:26 -07:00
Cody Yu
54aa619459 [V1] Refactor num_computed_tokens logic (#15307)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-27 04:54:36 +00:00
Chengji Yao
e74ff409e0 [TPU] support disabling xla compilation cache (#15567)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-03-27 00:09:28 +00:00
Alexander Matveev
b2e85e26f4 [V1] TPU - Revert to exponential padding by default (#15565)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-03-26 21:35:05 +00:00
Chenyaaang
ac3cd6e83c [core] add bucket padding to tpu_model_runner (#14995)
Signed-off-by: Chenyaaang <llccyy1212@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-03-25 17:27:22 -04:00
Nicolò Lucchesi
a0dd7dcd49 [TPU][V1] Fix Sampler recompilation (#15309)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-25 16:43:54 -04:00
Siyuan Liu
4157f563b4 [Hardware][TPU][Bugfix] Fix v1 mp profiler (#15409)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-03-25 01:43:00 -07:00
Woosuk Kwon
25f560a62c [V1][Spec Decode] Update target_logits in place for rejection sampling (#15427)
Some checks failed
Create Release / Create Release (push) Has been cancelled
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-24 21:04:41 -07:00
Nick Hill
3aee6573dc [V1] Aggregate chunked prompt logprobs in model runner (#14875)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-24 12:27:57 -04:00
Woosuk Kwon
b9bd76ca14 [V1][Spec Decode] Respect prompt_lookup_max (#15348)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-23 10:41:44 -07:00
shangmingc
50c9636d87 [V1][Usage] Refactor speculative decoding configuration and tests (#14434)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-03-22 19:28:10 -10:00
Chen Zhang
93a00d7dde [v1] Refactor KVCacheConfig (#14079)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-03-21 04:56:27 -07:00
Siyuan Liu
b15fd2be2a [Hardware][TPU] Add check for no additional graph compilation during runtime (#14710)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-03-21 03:05:28 +00:00
Woosuk Kwon
0c6f5023c3 [V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-20 17:50:43 -07:00
Woosuk Kwon
2b22290ce0 [V1] Add flag to disable cascade attention (#15243)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-20 15:24:16 -07:00
Mickaël Seznec
a597a57595 [Attention] Flash Attention 3 - fp8 (#14570)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
2025-03-20 01:14:20 -04:00
Nicolò Lucchesi
d8c6d7d6b5 [V1][TPU] Support V1 Sampler for ragged attention (#14227)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-19 21:00:39 -07:00
Woosuk Kwon
99abb8b650 [V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels (#14930)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-18 14:31:54 -07:00
Nicolò Lucchesi
af35d3a3cc [TPU][V1][Bugfix] Fix chunked prefill with padding (#15037)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-18 07:34:45 -07:00
Varun Sundar Rabindranath
400d483e87 [Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-18 09:47:53 +00:00
iefgnoix
b4ad56c1bd [V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. (#14846)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
2025-03-17 01:48:28 -07:00
Cyrus Leung
b539222d4e [V1] Remove input cache client (#14864)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-16 23:42:06 -07:00
Lily Liu
8d6cf89526 [V1] [Spec Decode] Support random sampling for spec decode (#13933)
Some checks failed
Create Release / Create Release (push) Has been cancelled
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 22:00:20 -07:00
Woosuk Kwon
faa0275730 [V1] Optimize the overhead of rewinding (#14905)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 20:19:30 -07:00
Woosuk Kwon
31060b2757 [V1][BugFix] Detect interleaved sliding window attention (#14896)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 14:53:53 -07:00
Nick Hill
fc1f67715d [BugFix][V1] Fix overhead related to bad_words sampling when not in use (#14894)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-16 14:53:34 -07:00
Roger Wang
ad19c8a003 [V1] Move OOM check into sampler run (#14728)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-03-13 20:40:23 -07:00
Benjamin Chislett
5c538c37b2 [V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing (#14645)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-03-11 22:12:41 -07:00
Cody Yu
b706d898af [Bugfix][V1][PP] Only warmup sampler at last PP rank (#14643)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-11 23:40:07 +00:00
iefgnoix
863d315c86 [V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly (#14597) 2025-03-11 19:12:26 -04:00
Roger Wang
1fc973c0b5 [V1][Core] Fix memory issue with logits & sampling (#14508)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Varun Sundar Rabindranath <3337719+varun-sundar-rabindranath@users.noreply.github.com>
2025-03-11 04:03:41 +00:00
Varun Sundar Rabindranath
5ff0d32580 [V1] LoRA - Add triton kernels for V1 (#13096)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-10 17:27:53 -04:00
Chengji Yao
212007b168 [Hardware][TPU] Fix the recompiling issue in logits processor after warmup (#14510)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-03-09 05:44:39 -04:00
iefgnoix
10f7552789 [V1][TPU] Remove unnecessary padding for running on TPU. (#14467) 2025-03-08 21:56:04 -05:00
Robert Shaw
5f0b53c6ea Revert "[V1][Core] Fix memory issue with logits & sampling" (#14504)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-08 17:43:37 -08:00
22quinn
eb8b5eb183 [V1] Support bad_words in sampler (#13376)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-08 14:50:26 -08:00
Roger Wang
8d5aa466fb [V1][Core] Fix memory issue with logits & sampling (#13776)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-08 06:11:04 -08:00
Alexander Matveev
cb8bdfade2 [V1] TPU - Add tensor parallel support via Ray (#13618)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-03-08 08:19:38 -05:00
Tyler Michael Smith
333681408f [Bugfix][V1] Handle MLA in kv_cache_interface (#14462)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-07 22:18:25 -08:00
yarongmu-google
66e16a038e [Bugfix] Fix torch_xla which can't handle None seed introduced in #14274 (#14459)
Signed-off-by: Yarong Mu <ymu@google.com>
2025-03-07 23:17:04 +00:00
Nick Hill
8ed5421aaa [V1] Eagerly remove finished requests from the batch (#14388)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-07 10:56:00 -08:00
Aaron Pham
80e9afb5bc [V1][Core] Support for Structured Outputs (#12388)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-07 07:19:11 -08:00
Chengji Yao
0578e5a462 [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue (#14310)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-03-06 23:31:05 +00:00
Lucas Wilkinson
f6bb18fd9a [BugFix] MLA + V1, illegal memory access and accuracy issues (#14253)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-05 17:10:13 -08:00
Nick Hill
ac60dc7fe1 [V1][BugFix] Fix for mixed top_k batch (#14301)
Signed-off-by: Nick Hill <nhill@redhat.com>


Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com>
2025-03-05 20:43:04 +00:00
Nick Hill
a32c8669ca [V1][Minor] Remove obsolete FIXME comment (#14304)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-05 11:59:23 -08:00
Robert Shaw
257e200a25 [V1][Frontend] Add Testing For V1 Runtime Parameters (#14159)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-03-05 14:18:55 +00:00
Lu Fang
8d6cd32b7b [Bugfix][V1] Fix allowed_token_ids for v1 Sampler (#14169)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-05 08:49:44 +00:00
Tyler Michael Smith
72c62eae5f [V1] EP/TP MoE + DP Attention (#13931) 2025-03-04 21:27:26 -08:00