15652 Commits

Author SHA1 Message Date
Yongye Zhu
e8ebbdde83 [Quantization] Add FlashInfer CuteDSL batched experts backend for NVFP4 MoE (#38251)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-04-06 11:57:53 -07:00
namgyu-youn
94fbb09894 [EASY] Drop duplicate KV-cache initialization (#38799)
Signed-off-by: namgyu-youn <namgyu.dev@gmail.com>
2026-04-06 18:05:39 +00:00
Wentao Ye
419e73cdfa [Bug] Fix mistral version dependency (#39086)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-04-06 13:31:19 -04:00
bnellnm
f01482408c [MoE Refactor][Test] FusedMoE layer test (#24675)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-04-06 17:17:23 +00:00
zhanqiuhu
bfdc0a3a99 [NIXL][Mamba][3/N] Heterogeneous TP: 3-read conv state transfer (#37635)
2026-04-06 19:07:02 +02:00
bnellnm
93bada494f [MoE Refactor] Split of DefaultMoERunner class (#35326)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-04-06 12:41:59 -04:00
Frederik Gossen
608914de30 [Core] Re-enable Inductor pre-grad passes in standalone compile (torch>=2.12) (#38944)
Signed-off-by: Frederik Gossen <frgossen@meta.com>
2026-04-06 09:37:13 -07:00
Wentao Ye
4ae218c122 [Refactor] Remove unused dead code (#38842)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-04-06 11:52:05 -04:00
Lukas Geiger
f40d9879f2 [Models][GDN] Remove GPU/CPU syncs in GDNAttentionMetadata.build during speculative decoding (#38047)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2026-04-06 15:39:37 +00:00
Lucas Wilkinson
47e605092b [Gemma4] Enable Fast Prefill Optimization (#38879)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-04-06 11:19:39 -04:00
Walter Beller-Morales
e69a265135 [Feat][Core] safely abort requests when FSM fails to advance (#38663)
Signed-off-by: walterbm <walter.beller.morales@gmail.com>
2026-04-06 08:00:16 -07:00
Julien Denize
fef56c1855 [Mistral Grammar] Support Grammar Factory (#38150)
Signed-off-by: juliendenize <julien.denize@mistral.ai>
2026-04-06 10:28:51 -04:00
bhargav-patel-29
c5e3454e5a [Model] Add support for BharatGen's Param2MoE model (#38000)
Signed-off-by: bhargav-patel-29 <bhargav.patel@tihiitb.org>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-06 16:19:56 +08:00
liuchenbing2026
f6983f01de MiniMax-M2: add Eagle3 speculative decoding support (#37512)
Signed-off-by: liuchenbing <chenliumail@163.com>
Signed-off-by: liucb <liuchengbao_work@163.com>
Co-authored-by: liuchenbing <chenliumail@163.com>
2026-04-05 19:50:18 -07:00
Andreas Karatzas
780ba37458 [ROCm][Quantization] Add asymmetric INT8 quantization support to TritonInt8ScaledMMLinearKernel (#38501)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-04-06 09:42:10 +08:00
Micah Williamson
9570654c6d [ROCm][CI] Run Kernels Core Operation Test On MI325 and mitigate flakiness (#38184)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2026-04-06 09:42:02 +08:00
Netanel Haber
d56e952239 nano_nemotron_vl: fix tensor device mismatch exception when video profiling (#39029)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2026-04-05 22:23:45 +00:00
Kevin H. Luu
56de443db1 [ci] Switch some CI jobs to H200 MIG slices (#38956)
2026-04-05 13:26:11 -07:00
Greg Pereira
4dd49b06f8 [Bug] Fix Import paths for encoder_cudagraph modules (#38997)
Signed-off-by: greg pereira <grpereir@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-04-05 19:11:58 +00:00
Greg Pereira
f53fa26e05 [Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992)
Signed-off-by: greg pereira <grpereir@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-04-05 17:11:18 +00:00
Wei Zhao
1af6f78ae5 [Perf] Change Trtllm fp8 MoE to use Shuffled Weights and BlockMajorK Layout (#38993)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-04-05 10:54:31 -04:00
Martin Vit
228023b3a5 [Bugfix][MoE] Fix 6-8% decode regression: prefer multi-stream shared expert overlap (#38990)
Signed-off-by: Martin Vit <martin@voipmonitor.org>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-04-05 10:28:31 -04:00
Aaron Batilo
9a528260ef [Bugfix][Spec Decode] Fix extract_hidden_states for VLM models (#38987)
Signed-off-by: Aaron Batilo <abatilo@coreweave.com>
2026-04-05 02:41:54 -07:00
Robert Shaw
968ed02ace [Quantization][Deprecation] Remove Petit NVFP4 (#32694)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-04-05 00:07:45 +00:00
Robert Shaw
7d266abb22 Revert "[vLLM IR] gemma_rms_norm" (#38998)
2026-04-04 17:48:08 -04:00
Xiaoshuang Wang
156405d243 [vLLM IR] gemma_rms_norm (#38780)
Signed-off-by: Icey <1790571317@qq.com>
2026-04-04 13:55:52 -04:00
Artem Perevedentsev
99e5539a67 [Perf][GDN] Align TMA usage with upstream FLA (#38981)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-05 00:38:02 +08:00
Linkun
a88ce94bbb [IR][RmsNorm] pass None if not has_weight (#38961)
Signed-off-by: Linkun Chen <github@lkchen.net>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-04-04 11:02:30 -04:00
Ziming Qi
2a36d8fb72 [Bugfix][CPU] Fix macOS compatibility broken by #36487 (#38970)
Signed-off-by: Ziming (2imi9) <148090931+2imi9@users.noreply.github.com>
2026-04-04 14:05:58 +00:00
lalit10
93726b2a1c Refactor Arctic loading to use AutoWeightsLoader (#38955)
Signed-off-by: Lalit Laxminarayan Bangad <lalitbangad@gmail.com>
Co-authored-by: Lalit Laxminarayan Bangad <lalitbangad@meta.com>
2026-04-04 05:01:09 +00:00
Yongye Zhu
8617f8676b [Bugfix] Fix DSV32 weight loading (#38870)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2026-04-03 19:57:52 -07:00
Andreas Karatzas
06fd9ffcc4 [ROCm][CI] Fix ROCm Dockerfile conftest generation for older Docker parsers (#38959)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-04-04 10:41:41 +08:00
Wentao Ye
cab4064cd5 [Bug] Fix workspace manager _current_workspaces size (#38853)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-04-04 01:29:45 +00:00
Wentao Ye
062f1a2d70 [Bug] Fix compile error for swap_blocks_batch in CUDA 13 (#38915)
2026-04-03 16:56:38 -07:00
elenalil-aws
81994e1d0e [Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… (#38927)
Signed-off-by: elenalil-aws <elenalil@amazon.com>
2026-04-03 23:30:09 +00:00
Andreas Karatzas
4b506ff90a [ROCm][CI] Minor missing import patch (#38951)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-04-03 23:01:20 +00:00
Andreas Karatzas
5875bb2e9c [ROCm][CI] Added back missing common deps (#38937)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-04-03 15:58:57 -07:00
Kevin H. Luu
f0d3ad9f3e [ci] Remove soft fail for AMD image build job (#38941)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
2026-04-03 20:42:33 +00:00
Divin Honnappa
121ea5a21f Removed GPU state confirmation and cleanup steps. (#38238)
Signed-off-by: Divin Honnappa <divin.honnappa@amd.com>
2026-04-03 13:11:08 -07:00
Jeffrey Wang
ab79863e6c Remove MQ multi-node tests (#38934)
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
2026-04-03 20:00:08 +00:00
Nick Hill
5f1de2b14b [Model Runner V2] Add config validation for not-yet-supported features (#38758)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-04-03 12:08:08 -07:00
yzong-rh
a5a623d961 [Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts (#38859)
Signed-off-by: Yifan Zong <yzong@redhat.com>
2026-04-04 01:48:17 +08:00
Xiaoshuang Wang
f8c3af2d85 [vLLM IR] add import_ir_kernels() to support OOT platforms (#38807)
Signed-off-by: Icey <1790571317@qq.com>
2026-04-03 17:25:19 +00:00
danisereb
50cd5674b3 Fix invalid logprobs with MTP enabled and sync scheduling (#38711)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
2026-04-03 12:24:37 -04:00
Vasiliy Kuznetsov
7b1a7423be [Frontend] new online quantization frontend (#38138)
Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
2026-04-03 11:58:39 -04:00
Nicolò Lucchesi
97f92c6b47 [KVConnector] Skip register_kv_caches on profiling (#38558)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-04-03 15:40:16 +00:00
Yusuf Mohammad
46f02e00f2 [Bugfix] Fix AWQ models batch invariance issues (#38670)
Signed-off-by: yusuf <yusuf@deeplearningmachine.mynet>
Signed-off-by: <>
Co-authored-by: yusuf <yusuf@deeplearningmachine.mynet>
2026-04-03 14:54:15 +00:00
Qiming Zhang
6b4872240f [XPU] bump up xpu-kernel v0.1.5, transpose moe weights (#38342)
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Signed-off-by: Qiming Zhang <qiming1.zhang@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-04-03 14:10:02 +00:00
Necofish
580090db6b [Kernel] Add swapAB support for SM120 CUTLASS blockwise FP8 GEMM (#38325)
2026-04-03 15:49:59 +02:00
Artem Perevedentsev
cb10b7e80b [GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill (#38361)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
2026-04-03 13:38:02 +00:00