biondizzle/vllm - vllm

Author	SHA1	Message	Date
Zhewen Li	be1a85b7a2	Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" (#38050 ) (#38169 ) Co-authored-by: Zhewen Li <zhewenli@inferact.ai>	2026-03-26 07:59:09 -07:00
Vadim Gimpelson	52069012fe	[Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-03-26 01:21:47 -07:00
BadrBasowid	e6bf9f15ec	[Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format (#38092 ) Signed-off-by: BadrBasowid <Badr.Basowid@gmail.com> Signed-off-by: BadrBasowid <61441185+BadrBasowid@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-25 21:11:43 -07:00
Jacob Platin	d7d51a7ee5	[Bugfix] Fix Qwen3.5-FP8 Weight Loading Error on TPU (#37348 ) Signed-off-by: Jacob Platin <jacobplatin@google.com>	2026-03-26 00:46:01 +00:00
Andreas Karatzas	7d6917bef5	[ROCm] Fix MoE kernel test failures on gfx950 (#37833 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>	2026-03-25 13:46:40 -05:00
Yongye Zhu	678b3c99e8	[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration (#38050 )	2026-03-25 10:16:40 -07:00
Kunshang Ji	14771f7150	[XPU] support MLA model on Intel GPU (#37143 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2026-03-25 17:43:42 +08:00
Chauncey	09c3dc9186	[Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function (#37968 ) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2026-03-25 06:19:37 +00:00
Andreas Karatzas	679c6a3ecc	[Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 (#37787 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-25 08:17:33 +08:00
Harry Mellor	b3601da6e7	[Mypy] Fix mypy for `vllm/model_executor` (except `vllm/model_executor/layers`) (#37904 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-03-24 17:14:01 +00:00
Li, Jiang	352b90c4a4	[Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU (#37987 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2026-03-24 07:00:20 -07:00
Wentao Ye	c59a132f96	[V0 Deprecation] Refactor kv cache from list to element (#37487 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-03-23 20:10:11 -07:00
Ranran	dc6908ac6a	[Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning (#35007 ) Signed-off-by: Ranran <1012869439@qq.com> Signed-off-by: Ranran <hzz5361@psu.edu> Signed-off-by: ran <hzz5361@psu.edu> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2026-03-23 18:31:14 -04:00
yzong-rh	e85f8f0932	[Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts (#36728 ) Signed-off-by: Yifan Zong <yzong@redhat.com>	2026-03-23 17:02:57 -04:00
Robert Shaw	5bf3c42d4c	[Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision (#36725 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-03-23 20:19:06 +00:00
Kyle Sayers	38364a7e32	[Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels (#36799 ) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2026-03-23 16:03:29 -04:00
Kunshang Ji	debd6e768c	[XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle (#37784 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2026-03-23 11:10:41 +00:00
Kunshang Ji	27d5ee3e6f	[FP8]add FP8 WoQ kernel abstraction. (#32929 ) Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>	2026-03-23 09:47:47 +00:00
Chuan (Richard) Li	e99fb98867	[ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs (#36100 ) Signed-off-by: Li <chuali@amd.com>	2026-03-23 15:48:31 +08:00
Artem Perevedentsev	a16133a0f1	[Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 (#37338 ) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>	2026-03-23 00:37:58 -07:00
Matthias Gehre	410d300893	[ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel (#36505 ) Signed-off-by: Matthias Gehre <matthias.gehre@amd.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-03-23 15:36:08 +08:00
Yongye Zhu	c058ff44d4	[Bigfix]fix lora test by pass padded size back to the layer (#37811 )	2026-03-22 13:20:13 -06:00
Wentao Ye	77d24c4bfe	[Bug] Fix fp8 deepgemm batch invariant (#37718 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-03-22 08:57:20 -04:00
Robert Shaw	4383f1532e	[MoE] Move PF Methods to Folder (#35927 )	2026-03-22 02:42:59 -06:00
Robert Shaw	6b2fa3a762	[MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ (#37759 ) Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>	2026-03-21 19:15:16 -04:00
Robert Shaw	eeee5b262d	[Quantization][Deprecation] Remove PTPC FP8 (#32700 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-03-21 22:10:16 +00:00
Robert Shaw	5ad0446572	Revert "Consolidate AWQ quantization into single awq_marlin.py file" (#37768 )	2026-03-21 17:20:41 -04:00
Robert Shaw	8cc700dd6a	Consolidate AWQ quantization into single awq_marlin.py file Merge awq.py and awq_marlin.py into a single file, eliminating the circular import between them. awq.py becomes a backward-compat shim. Follows the same structure as gptq_marlin.py. Co-authored-by: Claude Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>	2026-03-21 17:09:17 -04:00
Chaitanya Sri Krishna Lolla	3982bc2cd0	[ROCm] Enable DeepEP ROCm as all2allbackend for AMD GPUs. (#34692 ) Signed-off-by: Tej Kiran <vpolamre@amd.com> Co-authored-by: Tej Kiran <vpolamre@amd.com>	2026-03-21 00:32:31 -07:00
Yongye Zhu	87bd91892f	[MoE Refactor] Mxfp4 oracle rebased (#37128 ) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 03:37:04 +00:00
Wentao Ye	b3d0b37908	[Refactor] Remove unused dead code (#36171 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-03-20 16:12:51 -06:00
Vadim Gimpelson	4f16ebbbd3	[Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591 ) (#37605 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-03-20 12:19:26 -07:00
Xin Yang	d0532bf38d	[Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels (#37683 ) Signed-off-by: Xin Yang <xyangx@amazon.com>	2026-03-20 11:28:41 -06:00
L.B.R.	1779c09898	[ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode (#34709 ) Signed-off-by: L.B.R. <lbr@mmonad.com> Co-authored-by: L.B.R. <lbr@mmonad.com>	2026-03-20 10:11:23 -05:00
xuebwang-amd	44eea10f68	[ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization (#36232 ) Signed-off-by: xuebwang-amd <xuebwang@amd.com>	2026-03-20 10:10:03 -05:00
wang.yuqi	ed359c497a	[Model] Deprecate the score task (this will not affect users). (#37537 ) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>	2026-03-20 08:07:56 +00:00
bnellnm	91be5f9be3	[MoE Refactor] Rename "naive" all2all backend (#36294 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2026-03-19 15:50:34 -04:00
bnellnm	9279c59a0e	[MoE Refactor] DefaultMoERunner simplifcation (#33049 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2026-03-19 15:07:44 -04:00
Wei Zhao	e27b8ba3d1	[Bug] Fix fp8 trtllm MoE modular kernel supported routing methods (#37346 ) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>	2026-03-19 11:43:06 -04:00
Duyi-Wang	6a9cceb219	[Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant (#37418 ) Signed-off-by: Duyi-Wang <duyi.wang@amd.com>	2026-03-19 09:49:27 +00:00
Michael Goin	9482b0b085	[Bugfix] Remove assertion for NVFP4 scale dynamic range (#37465 ) Signed-off-by: Michael Goin <mgoin64@gmail.com>	2026-03-18 15:37:49 -07:00
Wentao Ye	0d81a1fe61	[V0 Deprecation] Deprecate virtual engine (#37195 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-03-18 14:30:14 -07:00
Wentao Ye	0ef7f79054	[Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement (#37340 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-03-18 14:18:34 -04:00
Xin Yang	b1169d7be8	[Kernel] Add gpt-oss Router GEMM kernel (#37205 ) Signed-off-by: Xin Yang <xyangx@amazon.com>	2026-03-18 08:15:56 -07:00
elvischenv	296839a1b0	[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE (#30647 ) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>	2026-03-18 15:01:26 +00:00
Wentao Ye	c373b5c00d	[Log] Reduce duplicate log (#37313 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-03-18 10:57:44 -04:00
Michael Goin	09e4576f65	[Kernel] Add non-gated support for NVFP4 CUTLASS MoE (#37320 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-03-17 18:12:04 -04:00
Chao-Ju Chen	245758992e	[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow (#34577 ) Signed-off-by: ricky-chaoju <ricky.chen@infinirc.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-03-17 20:48:42 +00:00
Andrey Talman	68f783a727	[Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility (#35673 ) Signed-off-by: atalman <atalman@fb.com>	2026-03-17 18:47:59 +00:00
Augusto Yao	9c7cab5ebb	[Feature]: Support for multiple embedding types in a single inference call (#35829 ) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>	2026-03-17 17:05:42 +08:00

2088 Commits