Commit Graph

15344 Commits

Author SHA1 Message Date
Mark McLoughlin
e38817fadb [Core][KV Connector] Remove use of num_cached_tokens in error handling (#38096)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2026-03-25 18:20:48 +00:00
Nick Hill
72cad44d3c [Frontend] Move APIServerProcessManager target server fn (#38115)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-25 18:14:41 +00:00
Cyrus Leung
ba2f0acc2d [Misc] Reorganize inputs (#35182)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-25 10:22:54 -07:00
Yongye Zhu
678b3c99e8 [MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration (#38050) 2026-03-25 10:16:40 -07:00
mikaylagawarecki
bf4cc9ed2d [2/n] Migrate per_token_group_quant to torch stable ABI (#36058)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
2026-03-25 10:15:13 -07:00
Ben Browning
1ac2ef2e53 [CI/Docs] Improve aarch64/DGX Spark support for dev setup (#38057)
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-25 09:24:42 -07:00
Richard Zou
6e37c46b35 [compile] Add some more startup tests for top models (#38046)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2026-03-25 12:02:22 -04:00
Wentao Ye
1bf2ddd0ee [Refactor] Rename WAITING_FOR_FSM to WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR (#38048)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-25 11:41:44 -04:00
Necofish
e7221180e1 [Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM (#37970)
Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-03-25 08:20:04 -07:00
RobTand
4a76ad12e0 [Bugfix] Preserve CUDA arch suffix (a/f) for SM12x — fixes NVFP4 NaN on desktop Blackwell (#37725)
Signed-off-by: Rob Tand <robert.tand@icloud.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
2026-03-25 08:18:25 -07:00
Wentao Ye
d7e93e13fb [Feature] EPLB Support for GPU Model Runner v2 (#37488)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
2026-03-25 08:16:39 -07:00
Andrii Skliar
cd7643015e [Feature] Support per-draft-model MoE backend via --speculative-config (#37880)
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Signed-off-by: [Andrii Skliar] <askliar@nvidia.com>
Co-authored-by: Andrii Skliar <askliar@nvidia.com>
2026-03-25 14:31:52 +00:00
Ben Browning
a1a2566447 [Docs] Add guide for editing agent instruction files (#37819)
Signed-off-by: Ben Browning <bbrownin@redhat.com>
2026-03-25 13:54:09 +00:00
yjz
b745e8b5d3 [KVTransfer][Mooncake] Add heterogeneous TP support for disaggregated P/D in MooncakeConnector (#36869)
Signed-off-by: JianDan0212 <zhangyj0212@gmail.com>
2026-03-25 14:24:07 +01:00
Harry Mellor
d215d1efca [Mypy] Better fixes for the mypy issues in vllm/config (#37902)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-25 06:14:43 -07:00
Fadi Arafeh
34d317dcec [CPU][UX][Perf] Enable tcmalloc by default (#37607)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2026-03-25 20:39:57 +08:00
grYe99
7ac48fd357 [Model] Add AutoWeightsLoader support for jais (#38074)
Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
2026-03-25 12:38:40 +00:00
Harry Mellor
d6bb2a9d9a Fix Plamo 2/3 & LFM2 for Transformers v5 (#38090)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-25 12:29:49 +00:00
Harry Mellor
1e673a43ce Better weight tying check for multimodal models (#38035)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-25 12:07:23 +00:00
Andreas Karatzas
04417ecd5f [ROCm][CI] Rename filepath test to point to correct file (#38102)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 20:05:46 +08:00
R0CKSTAR
242c93f744 [Docs] Adds vllm-musa to custom_op.md (#37840)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
2026-03-25 11:54:36 +00:00
Matthias Gehre
a889b7f584 [Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 (#37280)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
2026-03-25 11:42:58 +00:00
Harry Mellor
ba2910f73a Fix offline mode test for Transformers v5 (#38095)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-25 11:39:48 +00:00
Andreas Karatzas
f262a62aa1 [ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test (#37616)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 10:55:51 +00:00
Andreas Karatzas
9ac2fcafbb [CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors (#37483)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 11:24:33 +01:00
Kunshang Ji
e9ae3f8077 [Hardware][XPU] Align memory usage with cuda on xpu (#37029)
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-25 18:14:29 +08:00
Andreas Karatzas
04cec4f927 [ROCm][CI] Increase OpenAPI schema test timeouts (#38088)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 18:06:58 +08:00
Kunshang Ji
14771f7150 [XPU] support MLA model on Intel GPU (#37143)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-25 17:43:42 +08:00
Gregory Shtrasberg
189ddefbfd [ROCm] Attention selector reordering (#36702)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2026-03-25 17:42:56 +08:00
Chauncey
09c3dc9186 [Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function (#37968)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-03-25 06:19:37 +00:00
vllmellm
42e9547976 [ROCm][Test] Fix ROCM_AITER_UNIFIED_ATTN attn+quant fusion test (#37640)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-03-25 05:06:15 +00:00
Chauncey
a32783bb35 [Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser (#37958)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-03-25 12:06:21 +08:00
Baorun (Lauren) Mu
9d0351c91d [Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc (#37914)
Signed-off-by: Baorun Mu <bmu@nvidia.com>
2026-03-24 19:53:24 -07:00
Artem Perevedentsev
a93a53f8a1 [Performance] Auto-enable prefetch on NFS with RAM guard (#37673)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
2026-03-24 17:31:14 -07:00
Andreas Karatzas
679c6a3ecc [Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 (#37787)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 08:17:33 +08:00
Andreas Karatzas
8bbb7c7f20 [ROCm][CI][PD] Add Hybrid SSM integration tests to CI (#37924)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 07:58:39 +08:00
Kevin H. Luu
af945615b5 [release] Move the rest of release jobs to release queue (#38044)
Signed-off-by: khluu <khluu000@gmail.com>
2026-03-24 16:40:58 -07:00
Terry Gao
82580b10ac [Perf] Disable inductor runtime asserts by default for serving perfor… (#37485)
Signed-off-by: tianrengao <terrygao87@gmail.com>
Co-authored-by: Tianren Gao <tianren@fb.com>
2026-03-24 19:37:51 -04:00
Netanel Haber
a0d487b2e1 nano_nemotron_vl: suppress readonly torch.from_numpy() warning in image and video resize paths (#37903)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2026-03-24 23:25:56 +00:00
Junhao
b73b5b0629 Make microbatch optimization (DBO) work with general models (#37926)
Signed-off-by: Junhao Li <junhao@ubicloud.com>
2026-03-24 14:40:08 -07:00
Michael Goin
0f0e03890e [UX] Add flashinfer-cubin as CUDA default dep (#37233)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-03-24 14:13:08 -07:00
Woosuk Kwon
4b53740d7f [MRV2] Fix for DS v3.2 (#38030)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-03-24 14:03:24 -07:00
Nick Hill
4e824d1c83 [Model Runner V2][Minor] Simplify PP logic (#38031)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-24 13:57:17 -07:00
amey asgaonkar
0c1809c806 Add Ubuntu 24.04 support for Docker builds (#35386)
Signed-off-by: aasgaonkar <aasgaonkar@nvidia.com>
2026-03-24 13:34:44 -07:00
liangel-02
8c47fdfdb1 [FlexAttention] allow custom mask mod (#37692)
Signed-off-by: Angel Li <liangel@meta.com>
2026-03-24 16:03:24 -04:00
Javier De Jesus
54b0578ada [Bugfix] Pass hf_token through config loading paths for gated model support (#37920)
Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com>
2026-03-24 15:22:05 -04:00
Richard Zou
89f572dbc0 [BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 (#38015)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2026-03-24 19:08:26 +00:00
Richard Zou
71a4a2fbd0 [BugFix] Fix order of compile logging (#38012)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2026-03-24 18:58:18 +00:00
Nick Cao
935c46dd9b [Model] Add Granite 4.0 1B speech to supported models (#38019)
Signed-off-by: Nick Cao <ncao@redhat.com>
2026-03-24 18:23:41 +00:00
Willy Hardy
057fc94cbd [Bugfix] Fix structured output crash on CPU due to pin_memory=True (#37706)
Signed-off-by: Willy Hardy <whardy@redhat.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 17:44:17 +00:00