Commit Graph

14685 Commits

Author SHA1 Message Date
elvischenv
7d6abdd022 [Fix] Use torch.empty for output in attention+quant fusion (#31785)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2026-03-10 21:26:14 -07:00
Wentao Ye
a8ff2cca92 [Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement (#35781)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
2026-03-10 21:25:30 -07:00
tunglinwood
42fadebecb [Model] Add support for moonshotai/Kimi-Audio-7B-Instruct (#36127)
Signed-off-by: tunglinwood <tunglinwood@gmail.com>
Signed-off-by: tunglinwood <tomwu.tunglin@gmail.com>
Signed-off-by: tunglinwood <113751333+tunglinwood@users.noreply.github.com>
2026-03-10 21:24:48 -07:00
tianshu-Michael-yu
a197eda9c3 Add tuned H100 MoE configs for LFM2 8B and 24B (#36699) 2026-03-10 21:22:02 -07:00
Kevin H. Luu
82b110d50e [ci] Bound nvidia-cudnn-frontend version (#36719)
Signed-off-by: khluu <khluu000@gmail.com>
2026-03-11 12:17:35 +08:00
Benjamin Chislett
9040cd40af [DSV3.2][MTP] Optimize Indexer MTP handling (#36723)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2026-03-11 12:16:56 +08:00
fangyuchu
fa0d353acf [Bugfix] Surface exceptions from non-blocking execute_model in UniProcExecutor to avoid DP deadlocks (#35194)
Signed-off-by: fangyuchu <fangyuchu@qq.com>
2026-03-11 03:22:21 +00:00
Augusto Yao
b386bb3d7c fix bugs when token_classify & classify run concurrently (#36614)
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
2026-03-10 20:16:34 -07:00
Ning Xie
fe714dd507 [openapi server] log exception in exception handler(2/N) (#36201)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-03-10 20:16:30 -07:00
Matthew Bonanni
8ab3d7427c [Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling (#36691)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-11 03:01:07 +00:00
Wei Zhao
84e436ed1c [Bug] Fix TRTLLM Block FP8 MoE Monolithic (#36296)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-03-10 22:04:47 -04:00
Andreas Karatzas
81939e7733 [ROCm][CI] Making some tests optional to reduce workload (#36090)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-10 16:45:27 -07:00
Woosuk Kwon
195d1ca3e8 [Minor] Enhance error message for TRTLLM decode uniformity check (#36609)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-03-10 15:38:45 -07:00
Nick Hill
8d983d7cd6 [Model Runner V2] Add initial CI tests (#36041)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-10 14:55:21 -07:00
Nick Hill
65b2f405dc [Core] Simplify core kv-cache blocks initialization logic (#36521)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-10 20:20:02 +00:00
Nick Hill
2a68464c5b [Test] test_async_scheduling.py improvements (#36340)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-10 11:17:26 -07:00
Zhengxu Chen
bdd8981dab [compile] Apply stored functorch config while finalizing loaded artifacts. (#36582)
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
2026-03-10 09:34:35 -07:00
Woosuk Kwon
f088a831dd [Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata (#36626)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-03-10 09:30:56 -07:00
Harry Mellor
f83b933b84 [CI] Bump mypy version to 1.19.1 (#36104)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
v0.17.1rc0
2026-03-10 09:18:28 -07:00
Pleaplusone
82f3f30e26 [ROCm][Perf] Enable sparse_mla's cudagraph on ROCm platform (#35719)
Signed-off-by: ganyi <ygan@amd.com>
2026-03-10 09:14:35 -07:00
Matthew Bonanni
9095cbbfb6 [Bugfix][Sparse MLA] report indexer CG support properly (#36519)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-10 09:14:31 -07:00
Hashem Hashemi
721ae79f50 Improvements to wvSplitKrc skinny GEMM solution (#34304)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
2026-03-10 09:14:27 -07:00
AllenDou
aefc59f088 FunASR model bugfix (#36633)
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
2026-03-10 08:14:21 -07:00
Harry Mellor
d88f28da05 Fix hf_override_fn when it modifies model_type (#35200)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-10 15:03:18 +00:00
Srinivasoo7
106ff69c4e feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency (#35342)
Signed-off-by: srinivas_oo7 <Sriusa4414@gmail.com>
Signed-off-by: Sriusa4414@gmail.com
Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com>
Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com>
Co-authored-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
2026-03-10 14:43:40 +00:00
Jiangyun Zhu
ca5fb4bbd8 [Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs (#36595)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2026-03-10 07:39:01 -07:00
Alvin Tang
cf88b23749 fix: check HTTP status in batch read_file to prevent silent failures (#36397)
Signed-off-by: gambletan <ethanchang32@gmail.com>
Co-authored-by: gambletan <ethanchang32@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 07:22:40 -07:00
wang.yuqi
a3189a08b0 [Model] Consolidate score logic by introduce score_type (#36479)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-03-10 13:32:25 +00:00
SoluMilken
409c4e632d [Misc] fix typo: homogenous-> homogeneous (2 lines change) (#36508)
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
2026-03-10 06:25:37 -07:00
Raushan Turganbay
8850738b70 [Bugfix] Fix processor signature (#36630)
Signed-off-by: raushan <raushan@huggingface.co>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-10 06:20:47 -07:00
Mark McLoughlin
234860399b [Frontend][Core] Revert "Add shutdown timeout" (#34730 and #36270) (#36628)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2026-03-10 06:20:41 -07:00
Harry Mellor
c88510083b Fix Qwen2.5-VL test for Transformers v5 (#36532)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-10 12:05:34 +00:00
Vadim Gimpelson
4ff8c3c8f9 [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-03-10 03:32:20 -07:00
Chang Su
507ddbe992 feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add --grpc flag to vllm serve (#36169)
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2026-03-10 03:29:59 -07:00
Nick Hill
ddbb0d230a [Model Runner V2] Fix mm input embeddings lookup (#36588)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-10 00:24:58 -07:00
Nick Hill
9efc3bdcd6 [Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill (#36580)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-03-10 00:23:42 -07:00
amirkl94
156e33553c Fix: Re-Enable EP for trtllm MoE FP8 backend (#36494)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2026-03-09 23:11:27 -07:00
hallerite
d0cd736caa [Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. (#36557)
Signed-off-by: hallerite <hallerite@users.noreply.github.com>
Signed-off-by: hallerite <git@hallerite.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-03-09 22:30:51 -07:00
Harry Mellor
195c997203 Fix LFM2 MoE test for Transformers v5 (#36534)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-09 22:29:17 -07:00
Zhuohan Li
04b67d8f62 Remove unused disable_fallback field (#36546) 2026-03-09 20:56:54 -07:00
Wentao Ye
7279374f91 [Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement (#36159)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-09 20:55:58 -07:00
Woosuk Kwon
006aea17d7 [BugFix] Remove incorrect assert in split_decodes_and_prefills (#36553)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-03-09 20:02:02 -07:00
Hojin Yang
0836be3b03 [Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support (#31471)
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-03-10 10:59:19 +08:00
Ajay Anubolu
4e95ec111c [Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 (#36242)
Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
2026-03-09 19:16:26 -07:00
Andreas Karatzas
179547d62c [ROCm][CI] Fix ROCm GPT-OSS Eval test group (#36179)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-09 17:55:20 -07:00
youkaichao
f85b4eda3a [bugfix] fix nvlink for nixl/ucx (#36475)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2026-03-10 07:49:47 +08:00
Woosuk Kwon
2a194ddd72 [Model Runner V2] Add model_state inputs to CUDA graph capture (#36544)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-03-09 15:14:51 -07:00
Shaun Kotek
203a7f27da add nemotron v3 reasoning parser (#36393)
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com>
Co-authored-by: root <root@gpu-259.slurm-workers-slurm.slurm.svc.cluster.local>
2026-03-09 15:11:41 -07:00
Lucas Wilkinson
483463f735 [MRV2] Extensible CG dispatch rework (#35959)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-03-09 13:58:45 -07:00
Matthew Bonanni
4e571ce643 [MTP][Misc] Clean up dead code (#36507)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-09 14:43:06 -04:00