Chen Zhang
|
6cac54f4d1
|
[v1] Re-init input batch for multiple kv cache groups (#18654)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-06-03 21:41:36 +00:00 |
|
Yong Hoon Shin
|
bdf13965ab
|
[V1] Support cross-layer KV sharing (#18212)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
|
2025-06-03 20:33:07 +00:00 |
|
Simon Mo
|
02f0c7b220
|
[Misc] Add SPDX-FileCopyrightText (#19100)
Signed-off-by: simon-mo <simon.mo@hey.com>
|
2025-06-03 11:20:17 -07:00 |
|
Siyuan Liu
|
9112b443a0
|
[Hardware][TPU] Initial support of model parallelism with single worker using SPMD (#18011)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Hossein Sarshar <hossein.sarshar@gmail.com>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
|
2025-06-03 00:06:20 +00:00 |
|
22quinn
|
9760fd8f6a
|
[Core] Support inplace model weights loading (#18745)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
|
2025-06-02 17:38:50 +08:00 |
|
zhrrr
|
d6fd3a33b8
|
[Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context (#18935)
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
|
2025-06-01 19:41:18 +00:00 |
|
Yu Guo
|
7782464a17
|
create util function for batched arange (#18937)
|
2025-05-31 13:50:38 +08:00 |
|
Carol Zheng
|
fba02e3bd1
|
[Bugfix][TPU] Fix tpu model runner testcase failure (#18810)
Signed-off-by: Carol Zheng <cazheng@google.com>
|
2025-05-30 18:04:03 +08:00 |
|
Nicolò Lucchesi
|
32ce3cf7c9
|
[V1] Allocate kv_cache with stride order for V1 (#18775)
Signed-off-by: nicklucche <nlucches@redhat.com>
|
2025-05-29 17:54:16 +00:00 |
|
Varun Sundar Rabindranath
|
7951d78738
|
[Core] Enable CUDA graphs for DP + All2All kernels (#18724)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2025-05-28 22:55:30 +00:00 |
|
Akshat Tripathi
|
643622ba46
|
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend (#15655)
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Signed-off-by: xihajun <junfan@krai.ai>
Signed-off-by: Jorge de Freitas <jorge.de-freitas22@imperial.ac.uk>
Signed-off-by: Jorge de Freitas <jorge@krai.ai>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: xihajun <junfan@krai.ai>
Co-authored-by: Jorge de Freitas <jorge.de-freitas22@imperial.ac.uk>
Co-authored-by: Jorge de Freitas <jorge@krai.ai>
|
2025-05-28 19:59:09 +00:00 |
|
Aaron Pham
|
a09c7ca9f2
|
[Chore][Spec Decode] Update check NoneType instead of assigning variables (#18836)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
|
2025-05-28 18:57:19 +00:00 |
|
Divakar Verma
|
774c5fde30
|
[V1] fix torch profiling for V1 offline scenarios (#18445)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
|
2025-05-28 04:16:30 +00:00 |
|
Cyrus Leung
|
696259ca01
|
[Core] Automatically cast multi-modal input dtype (#18756)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-05-27 23:45:48 +08:00 |
|
Hyogeun Oh (오효근)
|
a68e293cb9
|
[Doc] Convert Sphinx directives ( {class}, {meth}, {attr}, ...) to MkDocs format for better documentation linking (#18663)
Signed-off-by: Zerohertz <ohg3417@gmail.com>
|
2025-05-27 01:44:20 -07:00 |
|
qizixi
|
c1e4a4052d
|
[V1][Spec Decode] Support multi-layer eagle draft model (#18030)
Signed-off-by: qizixi <qizixi@meta.com>
|
2025-05-24 09:45:34 +00:00 |
|
qizixi
|
d55e446d13
|
[V1][Spec Decode] Small refactors to improve eagle bookkeeping performance (#18424)
Signed-off-by: qizixi <qizixi@meta.com>
|
2025-05-24 06:51:22 +00:00 |
|
Jiayi Yao
|
2628a69e35
|
[V1] Support Deepseek MTP (#18435)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Co-authored-by: Rui Qiao <ruisearch42@gmail.com>
|
2025-05-23 10:26:28 -07:00 |
|
Chen Zhang
|
6550114c9c
|
[v1] Redo "Support multiple KV cache groups in GPU model runner (#17945)" (#18593)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-05-23 09:39:47 -07:00 |
|
youkaichao
|
6a7988c55b
|
Refactor pplx init logic to make it modular (prepare for deepep) (#18200)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-05-23 23:43:43 +08:00 |
|
Harry Mellor
|
2edb533af2
|
Replace {func} with mkdocs style links (#18610)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-05-23 05:51:38 -07:00 |
|
Harry Mellor
|
a1fe24d961
|
Migrate docs from Sphinx to MkDocs (#18145)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-05-23 02:09:53 -07:00 |
|
Sanger Steel
|
c32e249a23
|
[Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization (#17926)
Signed-off-by: Sanger Steel <sangersteel@gmail.com>
|
2025-05-22 18:44:18 -07:00 |
|
Bowen Wang
|
4e04eceb58
|
[Bugfix] Use random hidden states in dummy sampler run (#18543)
Signed-off-by: Bowen Wang <abmfy@icloud.com>
|
2025-05-22 06:48:56 -07:00 |
|
Mark McLoughlin
|
bb0a311213
|
Revert "[v1] Support multiple KV cache groups in GPU model runner (#17945)" (#18459)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-05-21 10:25:23 -07:00 |
|
cascade
|
9ab2c02ff8
|
Support sequence parallelism combined with pipeline parallelism (#18243)
Signed-off-by: cascade812 <cascade812@outlook.com>
|
2025-05-17 22:47:25 +00:00 |
|
Siyuan Liu
|
48ac2bed5b
|
[Hardware][TPU] Optionally import for TPU backend (#18269)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: Carol Zheng <cazheng@google.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: Hongmin Fan <fanhongmin@google.com>
|
2025-05-17 15:23:12 +08:00 |
|
Lucia Fang
|
3d2779c29a
|
[Feature] Support Pipeline Parallelism in torchrun SPMD offline inference for V1 (#17827)
Signed-off-by: Lucia Fang <fanglu@fb.com>
|
2025-05-15 22:28:27 -07:00 |
|
Sky Lee
|
f4937a51c1
|
[Model] vLLM v1 supports Medusa (#17956)
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: skylee-01 <497627264@qq.com>
Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com>
|
2025-05-15 21:05:31 -07:00 |
|
Chen Zhang
|
e60f550b38
|
[v1] Support multiple KV cache groups in GPU model runner (#17945)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-05-14 18:54:54 -07:00 |
|
bnellnm
|
f9c069c85e
|
Modularize fused experts and integrate PPLX kernels (#15956)
|
2025-05-14 13:11:54 -07:00 |
|
youkaichao
|
6266c57bae
|
[core][distributed] add ep group and all2all interface (#18077)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-05-14 10:46:49 +08:00 |
|
Roger Wang
|
6e27c6d86b
|
[Misc] Remove unused numpy tensor (#18084)
Signed-off-by: Roger Wang <hey@rogerw.me>
|
2025-05-13 19:33:40 -07:00 |
|
Jin Huang
|
8dd0671bac
|
[Bugfix][V1] Only get input embeddings w/ multi-modal models if first PP (#17916)
Signed-off-by: Jin Huang <jinhun@amazon.com>
Co-authored-by: Jin Huang <jinhun@amazon.com>
|
2025-05-13 15:10:07 +08:00 |
|
Robert Shaw
|
d19110204c
|
[P/D] NIXL Integration (#17751)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Brent Salisbury <bsalisbu@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Brent Salisbury <bsalisbu@redhat.com>
|
2025-05-12 09:46:16 -07:00 |
|
Siyuan Liu
|
430783018c
|
[Bugfix][TPU] Use np array when updating cache slot_mapping (#17971)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
|
2025-05-12 12:58:33 +08:00 |
|
Chen Zhang
|
950751a987
|
[v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders (#17483)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-05-10 16:12:04 -07:00 |
|
Chanh Nguyen
|
7ea2adb802
|
[Core] Support full cuda graph in v1 (#16072)
Signed-off-by: Chanh Nguyen <cnguyen@linkedin.com>
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com>
|
2025-05-07 22:30:15 -07:00 |
|
Akshat Tripathi
|
c20ef40fd0
|
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend (#14238)
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
|
2025-05-07 16:28:47 -04:00 |
|
Jee Jee Li
|
822de7fb94
|
[Misc] Split model loader (#17712)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-05-07 12:42:26 +08:00 |
|
Chen Zhang
|
cba31c47c4
|
[v1] AttentionMetadata for each layer (#17394)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-05-06 07:58:37 -07:00 |
|
Li, Jiang
|
a6fed02068
|
[V1][PP] Support PP for MultiprocExecutor (#14219)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
|
2025-05-06 07:58:05 -07:00 |
|
Nicolò Lucchesi
|
5941e0b7ea
|
[TPU][V1] Add support for top-logprobs (#17072)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-05-05 14:20:15 -07:00 |
|
Harry Mellor
|
d6484ef3c3
|
Add full API docs and improve the UX of navigating them (#17485)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-05-03 19:42:43 -07:00 |
|
Benjamin Chislett
|
34120f5acd
|
[V1][Feature] Enable Speculative Decoding with Structured Outputs (#14702)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
|
2025-04-30 00:02:10 +00:00 |
|
Bryan Lu
|
70788bdbdc
|
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE (#17211)
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
|
2025-04-29 21:10:00 +00:00 |
|
Chen Zhang
|
24e6ad3f16
|
[V1] Remove num_input_tokens from attn_metadata (#17193)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-04-29 09:28:41 -07:00 |
|
cascade
|
690fe019f0
|
[Feature] support sequence parallelism using compilation pass (#16155)
Signed-off-by: cascade812 <cascade812@outlook.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2025-04-27 06:29:35 -07:00 |
|
Chen Zhang
|
838cedade7
|
[Bugfix] Get a specific type of layer from forward context (#17222)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-04-27 00:58:05 -07:00 |
|
Nick Hill
|
df6f3ce883
|
[Core] Remove prompt string from engine core data structures (#17214)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-04-25 23:41:05 -07:00 |
|