biondizzle/vllm - vllm

Author	SHA1	Message	Date
csy0225	8b45c58fe9	[Models] Step-3.5-Flash (#33523 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: i-zhangmingming <i-zhangmingming@stepfun.com> Co-authored-by: xiewuxun <xiewuxun@stepfun.com> Co-authored-by: zetaohong <i-hongzetao@stepfun.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> (cherry picked from commit `c3b40dc3e7`)	2026-02-02 02:16:23 -08:00
Gregory Shtrasberg	5f45b0b7e0	[Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 (#33366 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> (cherry picked from commit `31aedfe7d6`)	2026-02-02 00:13:45 -08:00
Roberto L. Castro	fcb9df99bd	[Perf][Kernel] Optimize FP4 quantization kernels (SM100F) (#32520 ) Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>	2026-01-24 18:45:27 -07:00
7. Sun	0ccecf8833	[Tests] Standardize RNG seed utility across test files (#32982 ) Signed-off-by: 7. Sun <jhao.sun@gmail.com>	2026-01-24 06:47:14 +00:00
Michael Goin	4561f13985	[Refactor] Rename `gptq_marlin` to `marlin` to match MoE (#32952 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-01-23 16:48:12 -05:00
Matt	305e53ade8	[Hardware][AMD][CI][Bugfix] Fix Kernels Attention Cache test (#32904 ) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>	2026-01-23 16:24:26 +00:00
Fadi Arafeh	aac0b817fa	[CPU Backend][BugFix] Fix failing CPU MoE test (#32876 ) Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>	2026-01-23 12:06:51 +00:00
Karan Bansal	fa6e599a61	[Bugfix] Fix _CPU_MOE_ACT AssertionError when vLLM config not set (#32777 ) Signed-off-by: Karan Bansal <karanb192@gmail.com>	2026-01-23 08:22:37 +00:00
Luka Govedič	5e4e0e51f4	[torch.compile] Compile `CustomOp.forward_native` for `SiluAndMul` and `QuantFP8` to avoid raw torch ops inside opaque custom ops (#32806 ) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-01-22 19:52:26 -08:00
bnellnm	dc917cceb8	[MoE Refactor] Move `select_experts` from `FusedMoEQuantMethod` -> `FusedMoE` (#31996 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2026-01-22 18:21:35 -05:00
Xin Yang	d08b356ee0	[Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper (#32619 ) Signed-off-by: Xin Yang <xyangx@amazon.com>	2026-01-22 15:47:04 -05:00
Eldar Kurtić	44f08af3a7	Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) (#30141 ) Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com> Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>	2026-01-22 13:29:57 -07:00
Richard Zou	654a71fc3c	[torch.compile] Improve Cold Start for MoEs (#32805 ) Signed-off-by: Richard Zou <zou3519@gmail.com>	2026-01-22 10:44:40 -05:00
Or Ozeri	421012b63a	OffloadingConnector: Support kernel_block_size != block_size (#30692 ) Signed-off-by: Or Ozeri <oro@il.ibm.com>	2026-01-22 12:30:04 +00:00
Alex Sun	49a1262267	[AMD][ROCm] MoRI EP: a high-performance all2all backend (#28664 ) Signed-off-by: Alex Sun <alex.s@amd.com>	2026-01-22 16:33:18 +08:00
Xin Yang	63227accf5	[Kernel] Add topk_sigmoid kernel (#31246 ) Signed-off-by: Xin Yang <xyangx@amazon.com>	2026-01-21 22:49:51 +00:00
elvischenv	808d6fd7b9	Bump Flashinfer to v0.6.1 (#30993 ) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>	2026-01-21 08:49:50 -08:00
whx	1861ae8aae	[PluggableLayer][1/N] Define PluggableLayer (Fix ci) (#32744 ) Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-01-21 11:38:04 -05:00
Robert Shaw	85f55c943c	[Quantization][Deprecation] Deprecate HQQ (#32681 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-01-21 09:32:40 -05:00
Robert Shaw	42135d6898	[MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414 )	2026-01-21 08:22:33 -05:00
Lucas Wilkinson	b4f64e5b02	Update FlashMLA (#32491 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-21 13:03:37 +08:00
linhaifeng	7901109ea5	[Bugfix] Fix Off-by-one error in _num_tokens_to_min_blocks calculation (#32603 ) Signed-off-by: linhaifeng <1371675203@qq.com>	2026-01-20 11:13:39 -05:00
vllmellm	148117ea2e	[Refactor] Make FP8 Linear Ops use kernel abstraction (#27814 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-20 14:48:20 +08:00
Yanan Cao	9d1e611f0e	[CI] Add Helion as an optional dependency (#32482 ) Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>	2026-01-19 19:09:56 +00:00
Matt	11bbf86f6a	[CI][Hardware][AMD] Fix test_rotary_embedding_mla_cache_fused (#32408 ) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>	2026-01-19 08:25:47 +00:00
bnellnm	327a02d8db	[MoE Refactor] Separate Router into OO Classes (#30623 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2026-01-18 11:40:49 -05:00
Robert Shaw	4a6af8813f	[MoE Refactor] Move Test Impl into Test Dirs (#32129 ) Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2026-01-18 12:16:59 +08:00
Hashem Hashemi	7a1030431a	Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. (#29843 ) Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>	2026-01-16 11:45:04 -06:00
rasmith	6ca4f400d8	[CI][AMD] Skip test_permute_cols since the kernel is not used and not built for ROCm (#32444 ) Signed-off-by: Randall Smith <ransmith@amd.com>	2026-01-16 16:22:53 +08:00
TomerBN-Nvidia	c277fbdf31	[Feat] Support non-gated MoE with Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors (#32257 ) Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Tomer Natan <tbarnatan@ipp1-1429.ipp1a1.colossus.nvidia.com>	2026-01-15 16:15:05 -08:00
rasmith	8853a50af2	[CI][BugFix][AMD][FP8] Fix test_rms_norm so it runs correctly on ROCm (#32372 ) Signed-off-by: Randall Smith <ransmith@amd.com> Co-authored-by: Randall Smith <ransmith@amd.com>	2026-01-15 19:05:54 +08:00
rasmith	3c2685645e	[CI][AMD][Quantization][BugFix] Fix fp8 max in quant_utils.py and update test_fp8_quant.::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol (#32201 ) Signed-off-by: Randall Smith <ransmith@amd.com>	2026-01-15 05:04:34 +00:00
Hongxia Yang	048bb59728	AMD CI Test - unskip moe_sum test and moe_align_block_size tests (#32039 ) Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>	2026-01-13 23:25:10 -08:00
Roberto L. Castro	8ef50d9a6b	[Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding (#30885 ) Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es> Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com> Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>	2026-01-13 15:22:53 -08:00
Rabi Mishra	69f8a0ea37	fix(rocm): Use refresh_env_variables() for rocm_aiter_ops in test_moe (#31711 ) Signed-off-by: rabi <ramishra@redhat.com>	2026-01-13 19:11:54 +00:00
danielafrimi	3f72639d36	[FIX] Add NO_MUL activation support for modular kernel path (#31528 ) Signed-off-by: dafrimi <dafrimi@nvidia.com> Signed-off-by: <> Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local> Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: root <root@pool0-01777.cm.cluster>	2026-01-12 11:55:49 -05:00
Jiangyun Zhu	3df619ac94	[CI] fix `test_concat_and_cache_mla_rope_fused` (#32117 ) Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>	2026-01-11 15:11:11 +00:00
shyeh25	1c46dea001	Revert "[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#308… (#31617 ) Signed-off-by: shyeh25 <206795756+shyeh25@users.noreply.github.com>	2026-01-10 12:39:59 -08:00
PatrykSaffer	80fead8bf6	Fuse RoPE and MLA KV-cache write (#25774 ) Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com> Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai> Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-01-09 19:18:37 -08:00
Matthew Bonanni	2612ba9285	[1/N][Attention] Restructure attention: move files (#31916 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-09 13:10:24 -08:00
Lucas Wilkinson	0a0aa07747	[Quant] Make static quant support all group shapes (#30833 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-09 12:49:27 -08:00
Runkai Tao	a4d5d663e2	Add unpermute-aware fused MoE path and small-batch fallback (#29354 ) Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2026-01-09 12:58:39 -07:00
vllmellm	1a19e9cd87	[Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten (#31380 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-09 19:28:02 +08:00
Lucas Wilkinson	6cdf015c3c	[Misc] Fix `Current vLLM config is not set.` warnings, assert to avoid issues in the future (#31747 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-01-08 15:20:49 -08:00
Michael Goin	87e07a6b46	Revert "feat(moe): Add is_act_and_mul=False support for Triton MoE kernels" (#31978 )	2026-01-08 11:31:53 -08:00
Robert Shaw	9f6dcb71ae	[MoE Refactor][16/N] Apply Refactor to NVFP4 (#31692 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Pavani Majety <pmajety@nvidia.com>	2026-01-08 03:46:27 +00:00
Rabi Mishra	25eef3dc2e	feat(moe): Add is_act_and_mul=False support for Triton MoE kernels (#31645 ) Signed-off-by: rabi <ramishra@redhat.com>	2026-01-08 10:27:09 +08:00
Robert Shaw	5dcd7ef1f2	[MoE Refactor][15/N] Apply Refactor to Fp8 (#31415 )	2026-01-07 19:42:33 -05:00
Xin Yang	0ada960a20	[Kernel] Support bias type in grouped_topk kernel (#31781 ) Signed-off-by: Xin Yang <xyangx@amazon.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-01-07 12:16:32 -08:00
Kate Cheng	cc6dafaef2	[Perf][Kernels] Enable FlashInfer DeepGEMM swapAB on SM90 (for W8A8 Linear Op) (#29213 ) Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com> Co-authored-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2026-01-07 10:53:54 -05:00

682 Commits