Varun Sundar Rabindranath | 0b1cfa6180 | 2025-03-13 20:42:04 -07:00
    [Kernel] LoRA - Enable CUDAGraphs for V1 (#14626)
    Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

Cyrus Leung | f53a0586b9 | 2025-03-13 11:37:17 +00:00
    [Bugfix] Fix prompt format of GLM4V (#14539)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Mathis Felardos | 1bd32bc8dd | 2025-03-12 20:15:20 -07:00
    [Config][Disaggregated] Add timeout configuration for the torch.store and add KVTransferConfig.kv_connector_extra_config (#14367)
    Signed-off-by: Mathis Felardos <mathis@mistral.ai>

Woosuk Kwon | 53be4a8634 | 2025-03-12 11:21:19 -07:00
    [V1] Allow sliding window + prefix caching (#13069)
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Sage Moore | d9f83d6206 | 2025-03-12 15:51:20 +00:00
    [ROCm] Enable chunked prefill/paged attention in MLA on ROCm (#14316)
    Signed-off-by: Sage Moore <sage@neuralmagic.com>

Woosuk Kwon | c0c25e25fa | 2025-03-12 08:36:33 -07:00
    [Model] Add support for Gemma 3 (#14660)
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
    Signed-off-by: Roger Wang <ywang@roblox.com>
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
    Co-authored-by: Roger Wang <ywang@roblox.com>
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

Pavani Majety | debd6bbf09 | 2025-03-12 05:13:11 +00:00
    [Kernel] Add ModelOpt FP4 Checkpoint Support (#12520)
    Signed-off-by: Pavani Majety <pmajety@nvidia.com>

Roger Wang | 1fc973c0b5 | 2025-03-11 04:03:41 +00:00
    [V1][Core] Fix memory issue with logits & sampling (#14508)
    Signed-off-by: Roger Wang <ywang@roblox.com>
    Co-authored-by: Varun Sundar Rabindranath <3337719+varun-sundar-rabindranath@users.noreply.github.com>

Harry Mellor | 3b352a2f92 | 2025-03-10 16:36:21 +00:00
    Correct capitalisation: VLLM -> vLLM (#14562)
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Aaron Pham | 0b7f06b447 | 2025-03-08 05:57:46 -08:00
    [Misc] add use_tqdm_on_load to reduce logs (#14407)
    Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

Harry Mellor | 47512b3200 | 2025-03-08 14:46:15 +08:00
    Default to generation_config from model (#12622)
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Cyrus Leung | 05fb6718f0 | 2025-03-07 10:33:38 +00:00
    [Bugfix] Clean up multi-modal processors (#14417)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Tyler Michael Smith | cc2f9b32c8 | 2025-03-06 18:54:45 +00:00
    [Distributed] Add enable_expert_parallel arg (#14305)
    Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

youkaichao | 151b08e0fe | 2025-03-07 00:32:46 +08:00
    [RLHF] use worker_extension_cls for compatibility with V0 and V1 (#14185)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

Congcong Chen | 0a995d5434 | 2025-03-04 20:57:01 -08:00
    [Model] New model support for Phi-4-multimodal-instruct (#14119)

Harry Mellor | cf069aa8aa | 2025-03-02 17:34:51 -08:00
    Update deprecated Python 3.8 typing (#13971)

Ce Gao | bf33700ecd | 2025-03-02 14:49:42 -05:00
    [v0][structured output] Support reasoning output (#12955)
    Signed-off-by: Ce Gao <cegao@tensorchord.ai>

Luka Govedič | bd56c983d6 | 2025-02-28 16:20:11 -07:00
    [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902)
    Signed-off-by: luka <luka@neuralmagic.com>

Roger Wang | 6c85da3a18 | 2025-02-27 20:02:15 -05:00
    [V1]SupportsV0Only protocol for model definitions (#13959)
    Signed-off-by: Roger Wang <ywang@roblox.com>

Benjamin Chislett | 9804145cac | 2025-02-27 15:28:08 -08:00
    [Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict (#13626)
    Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

Cyrus Leung | a2dd48c386 | 2025-02-27 19:14:55 +00:00
    [VLM] Deprecate legacy input mapper for OOT multimodal models (#13979)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Szymon Ożóg | 7f0be2aa24 | 2025-02-27 02:08:35 -08:00
    [Model] Deepseek GGUF support (#13167)

Sage Moore | 1d35662e6d | 2025-02-26 14:56:58 +08:00
    [ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms (#13844)
    Signed-off-by: Sage Moore <sage@neuralmagic.com>

cjackal | 51010a1807 | 2025-02-25 10:26:12 +08:00
    [Misc] set single whitespace between log sentences (#13771)
    Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>

Robert Shaw | f61528d46d | 2025-02-24 16:39:07 -08:00
    [Misc][Chore] Clean Up AsyncOutputProcessing Logs (#13780)

Robert Shaw | 1f0ae3ed0a | 2025-02-24 13:52:21 -05:00
    [Misc] Clean Up EngineArgs.create_engine_config (#13734)
    Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

Nicolò Lucchesi | 444b0f0f62 | 2025-02-24 10:43:21 -05:00
    [Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set (#12513)
    Signed-off-by: NickLucche <nlucches@redhat.com>

Jongseok Park | 781096e385 | 2025-02-24 07:33:20 -08:00
    Expert Parallelism (EP) Support for DeepSeek V2 (#12583)

youkaichao | eb24dc4a45 | 2025-02-23 22:47:24 +08:00
    [v1] torchrun compatibility (#13642)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

youkaichao | 2382ad29d1 | 2025-02-22 20:28:59 +08:00
    [ci] fix linter (#13701)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

youkaichao | 3e472d882a | 2025-02-22 19:28:59 +08:00
    [core] set up data parallel communication (#13591)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

Mark McLoughlin | 2cb8c1540e | 2025-02-22 00:20:45 -08:00
    [Metrics] Add --show-hidden-metrics-for-version CLI arg (#13295)

Mark McLoughlin | 1cd981da4f | 2025-02-22 00:20:00 -08:00
    [V1][Metrics] Support vllm:cache_config_info (#13299)

Lucas Wilkinson | 288cc6c234 | 2025-02-21 15:30:12 -08:00
    [Attention] MLA with chunked prefill (#12639)
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
    Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
    Co-authored-by: Patrick Horn <patrick.horn@gmail.com>
    Co-authored-by: simon-mo <xmo@berkeley.edu>
    Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

Michael Goin | 71face8540 | 2025-02-20 17:45:20 -08:00
    [Bugfix] Fix max_num_batched_tokens for MLA (#13620)
    Signed-off-by: mgoin <mgoin64@gmail.com>

Joe Runde | bfbc0b32c6 | 2025-02-20 15:07:58 -05:00
    [Frontend] Add backend-specific options for guided decoding (#13505)
    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

Yannick Schnider | 423330263b | 2025-02-19 17:16:38 +08:00
    [Feature] Pluggable platform-specific scheduler (#13161)
    Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
    Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>

Lucia Fang | f525c0be8b | 2025-02-19 17:06:23 +08:00
    [Model][Speculative Decoding] DeepSeek MTP spec decode (#12755)
    Signed-off-by: Lu Fang <fanglu@fb.com>
    Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>

Kevin H. Luu | d5d214ac7f | 2025-02-19 07:34:59 +00:00
    [1/n][CI] Load models in CI from S3 instead of HF (#13205)
    Signed-off-by: <>
    Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>

shangmingc | 46cdd59577 | 2025-02-16 19:32:26 -08:00
    [Feature][Spec Decode] Simplify the use of Eagle Spec Decode (#12304)
    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

Joe Runde | 3bcb8c75da | 2025-02-14 15:36:07 -08:00
    [Core] Reduce TTFT with concurrent partial prefills (#10235)
    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
    Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
    Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

Michael Goin | f0b2da72a8 | 2025-02-13 22:19:22 -08:00
    Expand MLA to support most types of quantization (#13181)

Nicolò Lucchesi | d84cef76eb | 2025-02-13 07:23:45 -08:00
    [Frontend] Add /v1/audio/transcriptions OpenAI API endpoint (#12909)

Keyun Tong | 3ee696a63d | 2025-02-12 12:25:58 +08:00
    [RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM (#12518)
    Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

wangxiyuan | 2e3b969ec0 | 2025-02-11 22:06:46 +08:00
    [Platform] add pre_register_and_update function (#12432)
    Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

youkaichao | 91dd8f7aa6 | 2025-02-08 16:17:08 +08:00
    [bugfix] respect distributed_executor_backend in world_size=1 (#12934)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

youkaichao | 09b95e36ab | 2025-02-07 01:09:07 +08:00
    [torch.compile] PyTorch 2.6 and nightly compatibility (#12393)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

Michael Goin | 449d1bce02 | 2025-02-05 23:16:20 -08:00
    [Misc] Remove duplicated DeepSeek V2/V3 model definition (#12793)

Kyle Sayers | 4896d0c2dd | 2025-02-03 23:27:11 -08:00
    [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711)

Kyle Sayers | 6dd5e52823 | 2025-02-03 13:29:56 -08:00
    Squelch MLA warning for Compressed-Tensors Models (#12704)
    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>