Benjamin Chislett
|
8332078cfd
|
[Bugfix] FlashInfer MXINT4 MoE crashes, missing do_finalize (#39315)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-04-08 20:36:33 -04:00 |
|
Kai Song
|
f3c7941ec8
|
[Bugfix] Fix EP precision for Qwen3.5, Qwen3-Next (#39181)
Signed-off-by: Song Kai <songkai05@baidu.com>
|
2026-04-09 01:47:48 +04:00 |
|
Wentao Ye
|
2018137242
|
[Feature] Batch invariant nvfp4 linear support (#39322)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-04-08 16:29:13 -04:00 |
|
Jackmin801
|
a776a48b1c
|
[MoE] Move DEEP_GEMM into experts/ subdirectory (#39005)
Signed-off-by: Jackmin801 <ongjackm@gmail.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2026-04-08 19:23:08 +00:00 |
|
Roberto L. Castro
|
b55d830ec7
|
[Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode (#37421)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
|
2026-04-08 13:35:57 -04:00 |
|
rasmith
|
78434b923c
|
[CI][AMD][BugFix][Kernel] Cast induction variable to int64 on MI350 for chunk_gated_delta_rule_fwd_kernel_h_blockdim64 to avoid illegal memory access (#39087)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
|
2026-04-08 16:57:18 +08:00 |
|
Andrey Talman
|
2111997f96
|
[release 2.11] Update to torch 2.11 (#34644)
|
2026-04-07 18:55:48 -07:00 |
|
zofia
|
ad3304425b
|
[XPU] add xpu backend implementation of mxfp8 quant (#38682)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2026-04-08 08:30:35 +08:00 |
|
Lucas Wilkinson
|
70406eb1dc
|
[Attention][V0 Deprecation] Deprecate accept output buffer (#39125)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2026-04-07 17:14:58 -04:00 |
|
Yubo Wang
|
08bfedc152
|
[Bugfix] Fix extract_hidden_states crash with quantized KV cache dtype (#39160)
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
|
2026-04-07 11:18:33 -07:00 |
|
rasmith
|
83d09d36b5
|
[CI][Bugfix][AMD] Ensure weights created when emulating OCP MXFP4 (#36993)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
|
2026-04-08 00:37:16 +08:00 |
|
Chendi.Xue
|
92b9afeecd
|
[XPU] Quick fix for TritonMLA to remove cuda hardcode (#39088)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2026-04-08 00:17:58 +08:00 |
|
Jinzhen Lin
|
7310555482
|
[Bugfix] Fix marlin nvfp4 rescaling (#37502)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
|
2026-04-07 08:57:17 -07:00 |
|
kkyyxhll
|
98e1a43af7
|
[Bugfix][Quantization] Fix PerTensorScale loading with tuple shard_id in MergedColumnParallelLinear (#38517)
Signed-off-by: loukang <loukang@xiaohongshu.com>
|
2026-04-07 11:16:26 -04:00 |
|
Wei Zhao
|
0be9516ea4
|
[Bug] Fix Trtllm Fp8 MoE Weight Shuffle Memory Fragmentation (#39054)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
|
2026-04-07 08:04:08 -04:00 |
|
Jiangyun Zhu
|
8060bb0333
|
[vLLM IR] rework gemma_rms_norm (#39014)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2026-04-07 01:37:00 -07:00 |
|
Rishapveer Singh
|
da4c0e4db9
|
[Model] Use AutoWeightsLoader for FalconH1 (#39092)
Signed-off-by: Rishapveer Singh <215205492+rishaps@users.noreply.github.com>
|
2026-04-07 16:25:17 +08:00 |
|
Netanel Haber
|
a9a0e0551f
|
nano-nemotron-vl: get_mm_max_tokens_per_item for audio, video, image == seq_len (#38727)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
|
2026-04-07 00:23:29 -07:00 |
|
Andreas Karatzas
|
2df2c85be4
|
[Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path (#38504)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
2026-04-07 10:57:09 +08:00 |
|
bnellnm
|
b2b2c5239e
|
[MoE Refactor] Split up compressed_tensors_moe.py (#38960)
Signed-off-by: Bill Nell <bnell@redhat.com>
|
2026-04-06 20:07:54 -04:00 |
|
fxmarty-amd
|
00d7b497b3
|
[NVFP4] Support NVFP4 dense models from modelopt and compressed-tensors on AMD Instinct MI300, MI355X and Hopper through emulation (#35733)
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Signed-off-by: fxmarty-amd <felmarty@amd.com>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
|
2026-04-06 16:18:27 -06:00 |
|
Netanel Haber
|
dfa5062a8f
|
NemotronH default mamba_ssm_cache_dtype=float32; enable auto-hook for NemotronHNanoVLV2Config (#39032)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
|
2026-04-06 19:47:46 +00:00 |
|
Yongye Zhu
|
e8ebbdde83
|
[Quantization] Add FlashInfer CuteDSL batched experts backend for NVFP4 MoE (#38251)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
|
2026-04-06 11:57:53 -07:00 |
|
namgyu-youn
|
94fbb09894
|
[EASY] Drop duplicate KV-cache initialization (#38799)
Signed-off-by: namgyu-youn <namgyu.dev@gmail.com>
|
2026-04-06 18:05:39 +00:00 |
|
bnellnm
|
f01482408c
|
[MoE Refactor][Test] FusedMoE layer test (#24675)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2026-04-06 17:17:23 +00:00 |
|
bnellnm
|
93bada494f
|
[MoE Refactor] Split of DefaultMoERunner class (#35326)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2026-04-06 12:41:59 -04:00 |
|
Wentao Ye
|
4ae218c122
|
[Refactor] Remove unused dead code (#38842)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-04-06 11:52:05 -04:00 |
|
Lucas Wilkinson
|
47e605092b
|
[Gemma4] Enable Fast Prefill Optimization (#38879)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2026-04-06 11:19:39 -04:00 |
|
bhargav-patel-29
|
c5e3454e5a
|
[Model] Add support for BharatGen's Param2MoE model (#38000)
Signed-off-by: bhargav-patel-29 <bhargav.patel@tihiitb.org>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-04-06 16:19:56 +08:00 |
|
liuchenbing2026
|
f6983f01de
|
MiniMax-M2: add Eagle3 speculative decoding support (#37512)
Signed-off-by: liuchenbing <chenliumail@163.com>
Signed-off-by: liucb <liuchengbao_work@163.com>
Co-authored-by: liuchenbing <chenliumail@163.com>
|
2026-04-05 19:50:18 -07:00 |
|
Andreas Karatzas
|
780ba37458
|
[ROCm][Quantization] Add asymmetric INT8 quantization support to TritonInt8ScaledMMLinearKernel (#38501)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
2026-04-06 09:42:10 +08:00 |
|
Netanel Haber
|
d56e952239
|
nano_nemotron_vl: fix tensor device mismatch exception when video profiling (#39029)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
|
2026-04-05 22:23:45 +00:00 |
|
Greg Pereira
|
4dd49b06f8
|
[Bug] Fix Import paths for encoder_cudagraph modules (#38997)
Signed-off-by: greg pereira <grpereir@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2026-04-05 19:11:58 +00:00 |
|
Wei Zhao
|
1af6f78ae5
|
[Perf] Change Trtllm fp8 MoE to use Shuffled Weights and BlockMajorK Layout (#38993)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2026-04-05 10:54:31 -04:00 |
|
Martin Vit
|
228023b3a5
|
[Bugfix][MoE] Fix 6-8% decode regression: prefer multi-stream shared expert overlap (#38990)
Signed-off-by: Martin Vit <martin@voipmonitor.org>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2026-04-05 10:28:31 -04:00 |
|
Robert Shaw
|
968ed02ace
|
[Quantization][Deprecation] Remove Petit NVFP4 (#32694)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
|
2026-04-05 00:07:45 +00:00 |
|
Robert Shaw
|
7d266abb22
|
Revert "[vLLM IR] gemma_rms_norm" (#38998)
|
2026-04-04 17:48:08 -04:00 |
|
Xiaoshuang Wang
|
156405d243
|
[vLLM IR] gemma_rms_norm (#38780)
Signed-off-by: Icey <1790571317@qq.com>
|
2026-04-04 13:55:52 -04:00 |
|
Artem Perevedentsev
|
99e5539a67
|
[Perf][GDN] Align TMA usage with upstream FLA (#38981)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-04-05 00:38:02 +08:00 |
|
Linkun
|
a88ce94bbb
|
[IR][RmsNorm] pass None if not has_weight (#38961)
Signed-off-by: Linkun Chen <github@lkchen.net>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2026-04-04 11:02:30 -04:00 |
|
lalit10
|
93726b2a1c
|
Refactor Arctic loading to use AutoWeightsLoader (#38955)
Signed-off-by: Lalit Laxminarayan Bangad <lalitbangad@gmail.com>
Co-authored-by: Lalit Laxminarayan Bangad <lalitbangad@meta.com>
|
2026-04-04 05:01:09 +00:00 |
|
Yongye Zhu
|
8617f8676b
|
[Bugfix] Fix DSV32 weight loading (#38870)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
|
2026-04-03 19:57:52 -07:00 |
|
elenalil-aws
|
81994e1d0e
|
[Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… (#38927)
Signed-off-by: elenalil-aws <elenalil@amazon.com>
|
2026-04-03 23:30:09 +00:00 |
|
yzong-rh
|
a5a623d961
|
[Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts (#38859)
Signed-off-by: Yifan Zong <yzong@redhat.com>
|
2026-04-04 01:48:17 +08:00 |
|
Vasiliy Kuznetsov
|
7b1a7423be
|
[Frontend] new online quantization frontend (#38138)
Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
|
2026-04-03 11:58:39 -04:00 |
|
Yusuf Mohammad
|
46f02e00f2
|
[Bugfix] Fix AWQ models batch invariance issues (#38670)
Signed-off-by: yusuf <yusuf@deeplearningmachine.mynet>
Co-authored-by: yusuf <yusuf@deeplearningmachine.mynet>
|
2026-04-03 14:54:15 +00:00 |
|
Qiming Zhang
|
6b4872240f
|
[XPU] bump up xpu-kernel v0.1.5, transpose moe weights (#38342)
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Signed-off-by: Qiming Zhang <qiming1.zhang@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2026-04-03 14:10:02 +00:00 |
|
Artem Perevedentsev
|
cb10b7e80b
|
[GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill (#38361)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
|
2026-04-03 13:38:02 +00:00 |
|
Mieszko Dziadowiec
|
bf8b022e60
|
[Intel][Triton] Support round_int8 for Intel backend (#38825)
Signed-off-by: Mieszko Dziadowiec <mdziadowiec@habana.ai>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2026-04-03 20:47:35 +08:00 |
|
Netanel Haber
|
fa9e68022d
|
Fix Nano Nemotron VL regressions (#38655)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
|
2026-04-03 15:22:06 +08:00 |
|