Zhewen Li
be1a85b7a2
Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" (#38050) (#38169)
...
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
2026-03-26 07:59:09 -07:00
Vadim Gimpelson
52069012fe
[Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083)
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-03-26 01:21:47 -07:00
BadrBasowid
e6bf9f15ec
[Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format (#38092)
...
Signed-off-by: BadrBasowid <Badr.Basowid@gmail.com>
Signed-off-by: BadrBasowid <61441185+BadrBasowid@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-25 21:11:43 -07:00
Jacob Platin
d7d51a7ee5
[Bugfix] Fix Qwen3.5-FP8 Weight Loading Error on TPU (#37348)
...
Signed-off-by: Jacob Platin <jacobplatin@google.com>
2026-03-26 00:46:01 +00:00
Andreas Karatzas
7d6917bef5
[ROCm] Fix MoE kernel test failures on gfx950 (#37833)
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
2026-03-25 13:46:40 -05:00
Yongye Zhu
678b3c99e8
[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration (#38050)
2026-03-25 10:16:40 -07:00
Kunshang Ji
14771f7150
[XPU] support MLA model on Intel GPU (#37143)
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-25 17:43:42 +08:00
Chauncey
09c3dc9186
[Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function (#37968)
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-03-25 06:19:37 +00:00
Andreas Karatzas
679c6a3ecc
[Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 (#37787)
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 08:17:33 +08:00
Harry Mellor
b3601da6e7
[Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) (#37904)
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-24 17:14:01 +00:00
Li, Jiang
352b90c4a4
[Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU (#37987)
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2026-03-24 07:00:20 -07:00
Wentao Ye
c59a132f96
[V0 Deprecation] Refactor kv cache from list to element (#37487)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-23 20:10:11 -07:00
Ranran
dc6908ac6a
[Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning (#35007)
...
Signed-off-by: Ranran <1012869439@qq.com>
Signed-off-by: Ranran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-03-23 18:31:14 -04:00
yzong-rh
e85f8f0932
[Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts (#36728)
...
Signed-off-by: Yifan Zong <yzong@redhat.com>
2026-03-23 17:02:57 -04:00
Robert Shaw
5bf3c42d4c
[Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision (#36725)
...
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-03-23 20:19:06 +00:00
Kyle Sayers
38364a7e32
[Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels (#36799)
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2026-03-23 16:03:29 -04:00
Kunshang Ji
debd6e768c
[XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle (#37784)
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-23 11:10:41 +00:00
Kunshang Ji
27d5ee3e6f
[FP8]add FP8 WoQ kernel abstraction. (#32929)
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
2026-03-23 09:47:47 +00:00
Chuan (Richard) Li
e99fb98867
[ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs (#36100)
...
Signed-off-by: Li <chuali@amd.com>
2026-03-23 15:48:31 +08:00
Artem Perevedentsev
a16133a0f1
[Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 (#37338)
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
2026-03-23 00:37:58 -07:00
Matthias Gehre
410d300893
[ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel (#36505)
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-03-23 15:36:08 +08:00
Yongye Zhu
c058ff44d4
[Bigfix]fix lora test by pass padded size back to the layer (#37811)
2026-03-22 13:20:13 -06:00
Wentao Ye
77d24c4bfe
[Bug] Fix fp8 deepgemm batch invariant (#37718)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-22 08:57:20 -04:00
Robert Shaw
4383f1532e
[MoE] Move PF Methods to Folder (#35927)
2026-03-22 02:42:59 -06:00
Robert Shaw
6b2fa3a762
[MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ (#37759)
...
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
2026-03-21 19:15:16 -04:00
Robert Shaw
eeee5b262d
[Quantization][Deprecation] Remove PTPC FP8 (#32700)
...
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-03-21 22:10:16 +00:00
Robert Shaw
5ad0446572
Revert "Consolidate AWQ quantization into single awq_marlin.py file" (#37768)
2026-03-21 17:20:41 -04:00
Robert Shaw
8cc700dd6a
Consolidate AWQ quantization into single awq_marlin.py file
...
Merge awq.py and awq_marlin.py into a single file, eliminating the
circular import between them. awq.py becomes a backward-compat shim.
Follows the same structure as gptq_marlin.py.
Co-authored-by: Claude
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
2026-03-21 17:09:17 -04:00
Chaitanya Sri Krishna Lolla
3982bc2cd0
[ROCm] Enable DeepEP ROCm as all2allbackend for AMD GPUs. (#34692)
...
Signed-off-by: Tej Kiran <vpolamre@amd.com>
Co-authored-by: Tej Kiran <vpolamre@amd.com>
2026-03-21 00:32:31 -07:00
Yongye Zhu
87bd91892f
[MoE Refactor] Mxfp4 oracle rebased (#37128)
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 03:37:04 +00:00
Wentao Ye
b3d0b37908
[Refactor] Remove unused dead code (#36171)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-20 16:12:51 -06:00
Vadim Gimpelson
4f16ebbbd3
[Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591) (#37605)
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-03-20 12:19:26 -07:00
Xin Yang
d0532bf38d
[Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels (#37683)
...
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-03-20 11:28:41 -06:00
L.B.R.
1779c09898
[ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode (#34709)
...
Signed-off-by: L.B.R. <lbr@mmonad.com>
Co-authored-by: L.B.R. <lbr@mmonad.com>
2026-03-20 10:11:23 -05:00
xuebwang-amd
44eea10f68
[ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization (#36232)
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
2026-03-20 10:10:03 -05:00
wang.yuqi
ed359c497a
[Model] Deprecate the score task (this will not affect users). (#37537)
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-03-20 08:07:56 +00:00
bnellnm
91be5f9be3
[MoE Refactor] Rename "naive" all2all backend (#36294)
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-03-19 15:50:34 -04:00
bnellnm
9279c59a0e
[MoE Refactor] DefaultMoERunner simplifcation (#33049)
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-03-19 15:07:44 -04:00
Wei Zhao
e27b8ba3d1
[Bug] Fix fp8 trtllm MoE modular kernel supported routing methods (#37346)
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
2026-03-19 11:43:06 -04:00
Duyi-Wang
6a9cceb219
[Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant (#37418)
...
Signed-off-by: Duyi-Wang <duyi.wang@amd.com>
2026-03-19 09:49:27 +00:00
Michael Goin
9482b0b085
[Bugfix] Remove assertion for NVFP4 scale dynamic range (#37465)
...
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-03-18 15:37:49 -07:00
Wentao Ye
0d81a1fe61
[V0 Deprecation] Deprecate virtual engine (#37195)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-18 14:30:14 -07:00
Wentao Ye
0ef7f79054
[Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement (#37340)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-18 14:18:34 -04:00
Xin Yang
b1169d7be8
[Kernel] Add gpt-oss Router GEMM kernel (#37205)
...
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-03-18 08:15:56 -07:00
elvischenv
296839a1b0
[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE (#30647)
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2026-03-18 15:01:26 +00:00
Wentao Ye
c373b5c00d
[Log] Reduce duplicate log (#37313)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-18 10:57:44 -04:00
Michael Goin
09e4576f65
[Kernel] Add non-gated support for NVFP4 CUTLASS MoE (#37320)
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-03-17 18:12:04 -04:00
Chao-Ju Chen
245758992e
[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow (#34577)
...
Signed-off-by: ricky-chaoju <ricky.chen@infinirc.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-03-17 20:48:42 +00:00
Andrey Talman
68f783a727
[Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility (#35673)
...
Signed-off-by: atalman <atalman@fb.com>
2026-03-17 18:47:59 +00:00
Augusto Yao
9c7cab5ebb
[Feature]: Support for multiple embedding types in a single inference call (#35829)
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
2026-03-17 17:05:42 +08:00