Commit Graph

4938 Commits

Author SHA1 Message Date
Matthias Gehre
a889b7f584 [Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 (#37280)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
2026-03-25 11:42:58 +00:00
Harry Mellor
ba2910f73a Fix offline mode test for Transformers v5 (#38095)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-25 11:39:48 +00:00
Andreas Karatzas
f262a62aa1 [ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test (#37616)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 10:55:51 +00:00
Andreas Karatzas
9ac2fcafbb [CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors (#37483)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 11:24:33 +01:00
Andreas Karatzas
04cec4f927 [ROCm][CI] Increase OpenAPI schema test timeouts (#38088)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 18:06:58 +08:00
Gregory Shtrasberg
189ddefbfd [ROCm] Attention selector reordering (#36702)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2026-03-25 17:42:56 +08:00
vllmellm
42e9547976 [ROCm][Test] Fix ROCM_AITER_UNIFIED_ATTN attn+quant fusion test (#37640)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-03-25 05:06:15 +00:00
Andreas Karatzas
679c6a3ecc [Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 (#37787)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 08:17:33 +08:00
Terry Gao
82580b10ac [Perf] Disable inductor runtime asserts by default for serving perfor… (#37485)
Signed-off-by: tianrengao <terrygao87@gmail.com>
Co-authored-by: Tianren Gao <tianren@fb.com>
2026-03-24 19:37:51 -04:00
liangel-02
8c47fdfdb1 [FlexAttention] allow custom mask mod (#37692)
Signed-off-by: Angel Li <liangel@meta.com>
2026-03-24 16:03:24 -04:00
Richard Zou
89f572dbc0 [BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 (#38015)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2026-03-24 19:08:26 +00:00
Nick Cao
935c46dd9b [Model] Add Granite 4.0 1B speech to supported models (#38019)
Signed-off-by: Nick Cao <ncao@redhat.com>
2026-03-24 18:23:41 +00:00
Dhruv Singal
4df5fa7439 [Bugfix] Force continuous usage stats when CLI override is enabled (#37923)
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: OpenCode <noreply@openai.com>
2026-03-24 10:29:50 -07:00
Sungjae Lee
4731884796 [Feature] limit thinking tokens (hard limit) (#20859)
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-24 09:53:07 -07:00
wang.yuqi
1b6cb920e6 [Deprecate] Deprecate pooling multi task support. (#37956)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-03-24 14:07:47 +00:00
Ronen Schaffer
e3c6c10cad [KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package (#37874)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
2026-03-24 07:02:51 +02:00
Wentao Ye
c59a132f96 [V0 Deprecation] Refactor kv cache from list to element (#37487)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-23 20:10:11 -07:00
roikoren755
56777b5c89 [Test] E2E Nemotron-3-Super tests (#36803)
Signed-off-by: Roi Koren <roik@nvidia.com>
2026-03-23 17:49:56 -07:00
Ranran
dc6908ac6a [Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning (#35007)
Signed-off-by: Ranran <1012869439@qq.com>
Signed-off-by: Ranran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-03-23 18:31:14 -04:00
Kyle Sayers
38364a7e32 [Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels (#36799)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2026-03-23 16:03:29 -04:00
Matthew Bonanni
fafe76b4af [Async][Spec Decoding] Zero-bubble async scheduling + spec decoding (#32951)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
2026-03-23 15:37:22 -04:00
Nicolò Lucchesi
1cbbcfe8a3 [CI][PD] Add Hybrid SSM integration tests to CI (#37657)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-03-23 23:58:19 +08:00
Baorun (Lauren) Mu
f85e479e66 [Feature] ViT Full CUDA Graph (#35963)
Signed-off-by: Baorun Mu <bmu@nvidia.com>
2026-03-23 13:01:10 +08:00
Jee Jee Li
1f0d210641 [CI/Build][LoRA] Update Qwen35 LoRA testing (#37816)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-03-23 12:55:49 +08:00
Ben Browning
3bbe2e1e6e [Test] Consolidate tool parser unit tests to tests/tool_parsers (#37834)
Signed-off-by: Ben Browning <bbrownin@redhat.com>
2026-03-23 04:24:25 +00:00
Augusto Yao
6e04e79326 always use embed&token_classify for bge-m3 (#37632)
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-03-23 03:10:57 +00:00
Lasha Koroshinadze
e7767eccae Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling (#37643)
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
2026-03-23 10:29:07 +08:00
Wentao Ye
eaf4978621 [Test] Only Run MLA model when user explicitly set for batch invariance (#37719)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-22 09:09:12 -04:00
Andreas Karatzas
cd1242d82a [ROCm][CI] Stabilize ROCm speech-to-text translation test with lower min acc threshold (#37723)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-22 17:32:08 +08:00
Robert Shaw
4383f1532e [MoE] Move PF Methods to Folder (#35927) 2026-03-22 02:42:59 -06:00
Andreas Karatzas
6ecba840d7 [ROCm][CI] get_cu_count was renamed to num_compute_units in #35042 (#37764)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-22 16:02:21 +08:00
Andreas Karatzas
3b06c55c78 [ROCm][CI] Fix MEGA_AOT_ARTIFACT fallback when PyTorch < 2.10.0 lacks AOT support (#37763)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-22 16:02:03 +08:00
Andreas Karatzas
c862481c02 [CI] Skip ISAAC multimodal tests due to broken upstream HF model weights (#37781)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-22 13:23:32 +08:00
Andreas Karatzas
c86b17cfe6 [ROCm][CI] Add large_gpu_mark to test_max_tokens_none for ROCm (#37717)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-22 12:25:16 +08:00
Andreas Karatzas
66f927f205 [Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing (#37775)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-22 03:22:24 +00:00
Robert Shaw
6b2fa3a762 [MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ (#37759)
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
2026-03-21 19:15:16 -04:00
Robert Shaw
eeee5b262d [Quantization][Deprecation] Remove PTPC FP8 (#32700)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-03-21 22:10:16 +00:00
Brandon Pelfrey
80b70884eb Add tensor IPC transfer mechanism for multimodal data (#32104)
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
Signed-off-by: Brandon Pelfrey <brandonpelfrey@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
2026-03-21 20:10:20 +00:00
Francesco Fusco
298e510848 [Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() (#37318)
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
2026-03-21 09:29:43 +00:00
Andreas Karatzas
0d50fa1db6 [ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250 (#37610)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-21 12:57:25 +08:00
Simon Mo
1fa1e53a73 Revert "[compile] Initialize passes at VllmBackend init" (#37733) 2026-03-20 21:35:49 -07:00
Andreas Karatzas
3ffa52009f [ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds (#37617)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-21 11:58:58 +08:00
Yongye Zhu
87bd91892f [MoE Refactor] Mxfp4 oracle rebased (#37128)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 03:37:04 +00:00
Isotr0py
c7f98b4d0a [Frontend] Remove librosa from audio dependency (#37058)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-03-21 11:36:15 +08:00
Santino Ramos
85f671b8e1 [Model Runner V2] Support Streaming Inputs (#37028)
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com>
2026-03-20 20:42:25 +00:00
Vadim Gimpelson
4f16ebbbd3 [Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591) (#37605)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-03-20 12:19:26 -07:00
Angela Yi
12fd17eb51 [compile] Initialize passes at VllmBackend init (#35216)
Signed-off-by: angelayi <yiangela7@gmail.com>
2026-03-20 11:40:33 -07:00
Lucas Wilkinson
e1d85e5c24 [Attention] Support distinguishing between short extends and decodes (#37303)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-03-20 10:49:36 -07:00
Xin Yang
d0532bf38d [Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels (#37683)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-03-20 11:28:41 -06:00
Andreas Karatzas
fb4e8bf442 [ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests (#37613)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-20 10:16:59 -07:00