Boyuan Feng
|
94e6b2d55f
|
Allow users to specify kv cache memory size (#21489)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-09-11 13:41:07 +00:00 |
|
Tao He
|
e93f4cc9e3
|
Add the support for the qwen3 next model (a hybrid attention model). (#24526)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-09-11 15:32:09 +08:00 |
|
Nick Hill
|
e2d8c27f68
|
[BugFix] Fix pipeline parallel (#24621)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-09-10 23:05:30 -07:00 |
|
Michael Goin
|
fba7856581
|
[Perf] Warmup FlashInfer attention during startup (#23439)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com>
|
2025-09-10 15:03:17 -07:00 |
|
Russell Bryant
|
37e8182bfe
|
[v1] Add Whisper model support (encoder-decoder) (#21088)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
|
2025-09-10 13:53:35 -07:00 |
|
Nick Hill
|
f4f1a8df22
|
[BugFix] Ensure integrity of reused CPU tensors during async scheduling (#24527)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: guoze.lin <guozelin@tencent.com>
|
2025-09-10 21:15:14 +08:00 |
|
Lucas Wilkinson
|
0ae43dbf8c
|
[Attention] add DCP support for FLASH_ATTN_MLA backend (#24453)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-09-10 17:19:26 +08:00 |
|
Micah Williamson
|
1c63a16b65
|
[Core] Run garbage collector after CUDA graph capture to fix throughput regression (#24128)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
|
2025-09-09 10:38:10 -04:00 |
|
Woosuk Kwon
|
2e5d21378d
|
Skip MM Encoder for non-first PP ranks (#24387)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-09-07 09:38:35 -07:00 |
|
youkaichao
|
558f0907dc
|
[attention][DCP] use AttentionImpl.need_to_return_lse_for_decode (#24372)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-09-07 01:18:59 +00:00 |
|
Bangsheng Tang
|
848562bd49
|
break execute_model in gpu_model_runner into sub-functions for custom scopes (#24265)
Co-authored-by: Bangsheng Tang <bangsheng@meta.com>
|
2025-09-06 14:02:47 -07:00 |
|
Andrew Sansom
|
305a1cc0d2
|
refactor: Turn GPUModelRunner.inputs_embeds to a CpuGpuBuffer (#24345)
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
|
2025-09-05 23:01:23 -07:00 |
|
yzds
|
ac201a0eaf
|
[Feature] Support Decode Context Parallel (DCP) for MLA (#23734)
Signed-off-by: hongchao <hongchao@msh.team>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: hongchao <hongchao@msh.team>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2025-09-06 13:24:05 +08:00 |
|
Didier Durand
|
35bf193864
|
[Doc]: fix typos in Python comments (#24294)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-09-05 19:41:12 -07:00 |
|
Benjamin Chislett
|
cee182b297
|
[Perf][V1] Fully overlap model execution (#23569)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
|
2025-09-05 18:20:17 -07:00 |
|
liuzhenwei
|
e599e2c65e
|
[XPU][P/D] Add XPU support in NixlConnector (#22436)
Signed-off-by: zhenwei <zhenwei.liu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2025-09-04 21:03:12 -07:00 |
|
co63oc
|
1bd007f234
|
fix some typos (#24071)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
|
2025-09-02 20:44:50 -07:00 |
|
Russell Bryant
|
e32a0e8678
|
Upgrade xgrammar to 0.1.23 (#22988)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
|
2025-09-03 02:32:59 +00:00 |
|
Maximilien de Bayser
|
d59c986444
|
Remove runtime checks based on pooling params (#24051)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
|
2025-09-02 11:54:37 +08:00 |
|
Woosuk Kwon
|
b55713683c
|
[Misc] Move fast prefill logic to separate method (#24013)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-09-01 05:40:38 +00:00 |
|
Woosuk Kwon
|
8c742a66d1
|
[Misc] Avoid redundant copy for encoder-only models (#24012)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-09-01 04:02:43 +00:00 |
|
Didier Durand
|
9701352e4b
|
[Doc]: fix typos in Python comments (#24001)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
|
2025-08-31 08:21:59 +00:00 |
|
Andy Lo
|
038e9be4eb
|
[LoRA] Much faster startup when LoRA is enabled (#23777)
Signed-off-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-08-30 15:37:39 +00:00 |
|
wang.yuqi
|
d9e00dbd1f
|
[Performance] V1 Classify Models E2E Performance Optimization (#23541)
Signed-off-by: wang.yuqi <noooop@126.com>
|
2025-08-29 03:12:32 -07:00 |
|
Yong Hoon Shin
|
cb293f6a79
|
[V1] Enable prefill optimization for Gemma3n (#22628)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
|
2025-08-28 14:54:30 -07:00 |
|
Wentao Ye
|
321938e9ac
|
[Feature] Add VLLM_DISABLE_PAD_FOR_CUDAGRAPH to Avoid Hang Issue (#23595)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-08-27 21:52:24 +00:00 |
|
Hyogeun Oh (오효근)
|
4e4d017b6f
|
[Docs] Fix warnings in mkdocs build (continued) (#23743)
Signed-off-by: Zerohertz <ohg3417@gmail.com>
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com>
|
2025-08-27 17:17:29 +00:00 |
|
Cyrus Leung
|
52883ed084
|
[Model] Merge SupportsMultiModalWithRawInput with SupportsMultiModal (#23749)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-27 10:01:50 -07:00 |
|
Woosuk Kwon
|
04ff1e43fb
|
[Misc] Move CpuGpuBuffer to vllm/v1/utils.py (#23728)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-27 03:25:00 -07:00 |
|
Cyrus Leung
|
69244e67e6
|
[Core] Use key-only cache for BaseMultiModalProcessor (#23018)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-27 14:19:13 +08:00 |
|
Li, Jiang
|
9b0187003e
|
[Bugfix] Fix cuda event usage with CPU model runner (#23643)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-08-26 17:10:42 +00:00 |
|
Chen Zhang
|
2b4fc9bd9b
|
Support FlashAttention Backend for Hybrid SSM Models (#23299)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-08-26 12:41:52 +00:00 |
|
Zijing Liu
|
b395b3b0a3
|
[Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT (#22760)
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Signed-off-by: Zijing Liu <liuzijing2014@users.noreply.github.com>
|
2025-08-25 21:06:00 -07:00 |
|
Woosuk Kwon
|
0ff902f3b4
|
[Refactor] Refactor persistent buffers with CpuGpuBuffer (#23515)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-08-25 08:44:48 -07:00 |
|
Ayush Satyam
|
5c4b6e66fe
|
[Attention] Unify mamba and attention backend selection (#23171)
Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com>
|
2025-08-25 09:09:36 +00:00 |
|
Chenguang Zheng
|
d765cf01fe
|
[Core][Multimodal] Track encode cache entries by mm_hash and enable embedding sharing between requests (#22711)
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
|
2025-08-25 00:41:17 -07:00 |
|
Woosuk Kwon
|
ad78868450
|
[Misc] Remove unused slot_mapping buffer (#23502)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-24 14:03:36 -07:00 |
|
Nick Hill
|
c80c53a30f
|
[BugFix] Fix batch updates for pooling models (#23398)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-08-23 08:20:41 +08:00 |
|
Woosuk Kwon
|
808d2e9aa0
|
[Misc] Move M-RoPE init logic to _init_mrope_positions (#23422)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-22 03:07:22 -07:00 |
|
Chen Zhang
|
17373dcd93
|
[Attention] Refactor AttentionMetadata Preparation for Encoder-only Models (#23154)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
|
2025-08-22 05:05:59 +00:00 |
|
Nick Hill
|
603fbbbce0
|
[Misc] Misc code cleanup/simplification (#23304)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-08-21 17:22:55 +00:00 |
|
wang.yuqi
|
d70a16625d
|
[Performance] V1 Pooling Models E2E Performance Optimization (#23162)
Signed-off-by: wang.yuqi <noooop@126.com>
|
2025-08-21 13:26:09 +00:00 |
|
Cyrus Leung
|
0c6e40bbaa
|
[Refactor] Simplify code for MM budget (#23310)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-21 08:00:16 +00:00 |
|
Woosuk Kwon
|
b029de9902
|
[Optimization] Make new_block_ids None if empty (#23262)
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>
|
2025-08-20 18:25:56 -07:00 |
|
rongfu.leng
|
4fbda0b20c
|
[Feature] use --eplb_config to set eplb param (#20562)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-08-20 14:07:28 -07:00 |
|
Woosuk Kwon
|
d6d13bd49e
|
[Misc] Add max_seq_len to CommonAttentionMetadata (#23216)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-20 09:05:29 -07:00 |
|
Woosuk Kwon
|
40f26734b9
|
[Misc] Fix seq_lens for graph capture (#23175)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-19 03:58:16 -07:00 |
|
Woosuk Kwon
|
21bcc8263f
|
[Misc] Avoid accessing req_ids inside a loop (#23159)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-19 09:39:38 +00:00 |
|
Woosuk Kwon
|
c9b38be8aa
|
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT (#23041)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-18 17:20:38 -07:00 |
|
Woosuk Kwon
|
0dd3f4f5ab
|
[Misc] Minor refactoring for prepare_inputs (#23116)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-18 16:58:05 -07:00 |
|