biondizzle/vllm - vllm

Author	SHA1	Message	Date
Matthew Bonanni	bcf2333cd6	[CI] Fix LM Eval Large Models (H100) (#32423 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-16 00:52:49 +00:00
Michael Goin	83239ff19a	Add thread_n=64 support to Marlin MoE (#32360 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-01-15 16:45:44 -08:00
TomerBN-Nvidia	c277fbdf31	[Feat] Support non-gated MoE with Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors (#32257 ) Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Tomer Natan <tbarnatan@ipp1-1429.ipp1a1.colossus.nvidia.com>	2026-01-15 16:15:05 -08:00
Yongye Zhu	31c29257c8	[MoE Refactor][17/N] Apply Refactor to Bf16 (#31827 ) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2026-01-15 12:53:40 -08:00
Aleksandr Malyshev	8c11001ba2	[ROCM] DSfp4 mla projection gemms weight dynamic quantization (#32238 ) Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>	2026-01-15 14:13:08 -06:00
Lucas Wilkinson	c36ba69bda	[BugFix] Fix `assert x_s.shape[-1] == x_q.shape[-1] // group_shape[1]` in Blackwell Quantized MoE Test (#32362 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-01-15 10:19:12 -08:00
Dipika Sikka	361dfdc9d8	[Quant] Support MXFP4 W4A16 for compressed-tensors MoE models (#32285 ) Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-01-15 07:25:55 -08:00
Matthew Bonanni	8ebfacaa75	[Attention][MLA] Make `FLASHINFER_MLA` the default MLA backend on Blackwell, and TRTLLM the default prefill (#32339 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2026-01-15 09:49:57 -05:00
brian033	b89275d018	[ROCm] Improve error handling while loading quantized model on gfx120… (#31715 ) Signed-off-by: brian033 <85883730+brian033@users.noreply.github.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com>	2026-01-15 04:16:00 -08:00
Lucas Wilkinson	2c9b4cf5bf	[BugFix] Fix DeepSeek-V3.1 + DeepGEMM incompatible scale shapes (#32361 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>	2026-01-15 06:32:22 +00:00
rasmith	3c2685645e	[CI][AMD][Quantization][BugFix] Fix fp8 max in quant_utils.py and update test_fp8_quant.::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol (#32201 ) Signed-off-by: Randall Smith <ransmith@amd.com>	2026-01-15 05:04:34 +00:00
Michael Goin	6388b50058	[Docs] Add docs about OOT Quantization Plugins (#32035 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-01-14 15:25:45 +08:00
Yi Liu	50632adc58	Consolidate Intel Quantization Toolkit Integration in vLLM (#31716 ) Signed-off-by: yiliu30 <yi4.liu@intel.com>	2026-01-14 07:11:30 +00:00
Roberto L. Castro	8ef50d9a6b	[Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding (#30885 ) Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es> Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com> Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>	2026-01-13 15:22:53 -08:00
Matthew Bonanni	2263d44b68	[4/N][Attention] Move MLA common to model_executor (#32060 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-01-13 09:08:45 -08:00
Matthew Bonanni	98f60e5acb	[6/N][Attention] Move utils to more appropriate locations (#32215 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-13 05:38:52 -08:00
Mickaël Seznec	a5bbbd2f24	[Quantization] fix: overflow with static per-tensor scaling (#29867 ) Signed-off-by: Mickael Seznec <mickael@mistral.ai> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-01-13 12:56:01 +00:00
Cyrus Leung	232214b2ae	[Bugfix] Replace `PoolingParams.normalize` with `use_activation` (#32243 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-01-13 10:45:42 +00:00
Andreas Karatzas	11b6af5280	[ROCm][Bugfix] Fix Mamba batched decode producing incorrect output (#32099 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-01-13 05:46:53 +00:00
xuebwang-amd	629584bfc9	[Kernel][MoE] fix computation order of MoE weight multiplication and improve flow (#31962 ) Signed-off-by: xuebwang-amd <xuebwang@amd.com>	2026-01-12 17:17:30 -05:00
Lucas Kabela	ad8818bb5e	[Misc][BE] Type coverage for vllm/compilation [3/3] (#31748 ) Signed-off-by: Lucas Kabela <lucaskabela@meta.com>	2026-01-12 19:24:38 +00:00
Matthew Bonanni	20228cb851	[3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py (#32054 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-12 09:13:56 -08:00
Cyrus Leung	8863c2b25c	[Model] Standardize pooling heads (#32148 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-01-12 17:01:49 +00:00
danielafrimi	3f72639d36	[FIX] Add NO_MUL activation support for modular kernel path (#31528 ) Signed-off-by: dafrimi <dafrimi@nvidia.com> Signed-off-by: <> Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local> Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: root <root@pool0-01777.cm.cluster>	2026-01-12 11:55:49 -05:00
Hongxin Xu	49e6b86c91	[Feature] Support recording expert indices for rollout router replay (#28284 ) Signed-off-by: xhx1022 <1737006628@qq.com> Signed-off-by: Hongxin Xu <70438206+xhx1022@users.noreply.github.com> Signed-off-by: arlenxu <arlenxu@tencent.com> Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com> Co-authored-by: arlenxu <arlenxu@tencent.com>	2026-01-12 06:23:04 -08:00
Matt	bde57ab2ed	[Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group (#31713 ) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>	2026-01-10 23:19:46 -08:00
Vensen	6ea001cfb7	[Bugfix][Quantization] Ensure input contiguity in per_token_quant_int8 (#31637 ) Signed-off-by: vensen <vensenmu@gmail.com>	2026-01-10 12:40:02 -08:00
gnovack	d1fd802fa3	fused_moe_kernel - cast accumulator after applying router weights (#32002 ) Signed-off-by: gnovack <gnovack@amazon.com>	2026-01-11 04:36:45 +08:00
Michael Goin	e6c6f2c79d	[Quant] Support MXFP4 W4A16 for compressed-tensors dense models (#31926 ) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com>	2026-01-10 06:44:35 -08:00
Cyrus Leung	583a90e005	[Refactor] Separate sequence and token pooling types (#32026 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-01-10 04:53:24 +00:00
maang	52d428295d	[Core] Refactor ColumnParallelLinear: remove unused parameter and optimize forward (#31939 ) Signed-off-by: maang <maang_h@163.com>	2026-01-10 04:19:49 +00:00
Lucas Kabela	ea6d067a2a	[Misc][LLaMa4] Compile LLaMa Vision Encoder (#30709 ) Signed-off-by: Lucas Kabela <lucaskabela@meta.com>	2026-01-09 22:01:38 -05:00
Matthew Bonanni	2612ba9285	[1/N][Attention] Restructure attention: move files (#31916 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-09 13:10:24 -08:00
Lucas Wilkinson	0a0aa07747	[Quant] Make static quant support all group shapes (#30833 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-09 12:49:27 -08:00
jiahanc	f9e2a75a1e	[fix] add cutedsl to global sf (#32001 ) Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>	2026-01-09 12:03:02 -08:00
Runkai Tao	a4d5d663e2	Add unpermute-aware fused MoE path and small-batch fallback (#29354 ) Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2026-01-09 12:58:39 -07:00
Wentao Ye	308feab33f	[Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement (#31830 ) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-01-09 11:13:43 -08:00
Michael Goin	d5ec6c056f	[UX] Add vLLM model inspection view (#29450 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-01-09 10:12:35 -07:00
Shanshan Shen	08d954f036	[Doc] Add developer guide for CustomOp (#30886 ) Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-09 16:21:11 +00:00
Michael Goin	34cd32fe30	[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe (#31832 ) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com>	2026-01-09 07:40:33 -07:00
Xin Yang	e7b68f4d6c	[Bugfix] Fix Triton FusedMoE LoRA (#30585 ) Signed-off-by: Xin Yang <xyangx@amazon.com>	2026-01-09 11:46:59 +00:00
Cyrus Leung	c8ed39b9dd	[Model] Reorganize pooling layers (#31973 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-01-09 11:02:14 +00:00
Robert Shaw	0fa8dd24d2	[Bugfix] Fix Typo from NVFP4 Refactor (#31977 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-01-08 16:18:50 -08:00
Robert Shaw	5825bbc1f7	[Quantization] Deprecate Long Tail of Schemes (#31688 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2026-01-08 19:07:45 -05:00
Yongye Zhu	d62cfe546d	[MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection (#31752 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-01-08 19:01:30 -05:00
Lucas Wilkinson	6cdf015c3c	[Misc] Fix `Current vLLM config is not set.` warnings, assert to avoid issues in the future (#31747 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-01-08 15:20:49 -08:00
Dipika Sikka	5d3b6097ad	[Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs (#30881 )	2026-01-08 17:45:17 -05:00
bnellnm	e74698c27a	[Misc][Refactor] Add FusedMoERouter object (#30519 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2026-01-08 20:52:55 +00:00
Michael Goin	87e07a6b46	Revert "feat(moe): Add is_act_and_mul=False support for Triton MoE kernels" (#31978 )	2026-01-08 11:31:53 -08:00
danisereb	b8112c1d85	[Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 (#31960 ) Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>	2026-01-08 16:08:37 +00:00

1731 Commits