biondizzle/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
fxmarty-amd	00d7b497b3	[NVFP4] Support NVFP4 dense models from `modelopt` and `compressed-tensors` on AMD Instinct MI300, MI355X and Hopper through emulation (#35733 ) Signed-off-by: Felix Marty <Felix.Marty@amd.com> Signed-off-by: fxmarty-amd <felmarty@amd.com> Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>	2026-04-06 16:18:27 -06:00
Vasiliy Kuznetsov	7b1a7423be	[Frontend] new online quantization frontend (#38138 ) Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>	2026-04-03 11:58:39 -04:00
JartX	2ce3d0ce36	[Feature] KV cache per-token-head INT8/FP8 quantization (#38378 ) Signed-off-by: JartX <sagformas@epdcenter.es> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: yangyang4991 <yangyang4991@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Isotr0py <2037008807@qq.com>	2026-04-02 08:13:26 -04:00
yzong-rh	d9b90a07ac	[MoE Refactor] Migrate Unquantized to Full Oracle Flow (#36286 ) Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: yzong-rh <yzong@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-03-31 15:43:33 -04:00
Hongxia Yang	dbdd9ae067	[ROCm][Bugfix] fix exception related to trust_remote_code for MiniMax-M2.1-MXFP4 (#37698 ) Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com> Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>	2026-03-30 15:49:23 +00:00
Matthias Gehre	e8b055a5ac	[Bugfix] Handle ParallelLMHead in compressed-tensors get_quant_method (#37291 ) Signed-off-by: Matthias Gehre <matthias.gehre@amd.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-03-30 07:30:52 -07:00
Jonas M. Kübler	98e7f223b9	enable skipping of SW attention layers when using FP8 KV cache (#33695 ) Signed-off-by: Jonas Kuebler <kuebj@amazon.com>	2026-03-27 07:25:02 -06:00
Li, Jiang	becaed6ec8	[CPU] Support CT W4A16 on CPU MP kernel (#38219 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2026-03-27 14:15:28 +08:00
Kyle Sayers	38364a7e32	[Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels (#36799 ) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2026-03-23 16:03:29 -04:00
Robert Shaw	eeee5b262d	[Quantization][Deprecation] Remove PTPC FP8 (#32700 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-03-21 22:10:16 +00:00
Andreas Karatzas	040a505ff5	[ROCm][CI] Cleaning and restructuring amd-ci legacy pipeline (#34839 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-03-19 14:30:58 -05:00
Kunshang Ji	53ec16a705	[Hardware] Replace torch.cuda.device_count/current_device/set_device API (#36145 ) Signed-off-by: Kunshang Ji <jikunshang95@gmail.com> Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2026-03-12 07:57:47 -07:00
Harry Mellor	f83b933b84	[CI] Bump `mypy` version to 1.19.1 (#36104 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-03-10 09:18:28 -07:00
Jiayi Yan	6a895197fa	[Bugfix][CI] fix typos (#34934 ) Signed-off-by: 1195343015 <1195343015@qq.com> Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-03-05 17:05:46 +00:00
Kunshang Ji	66a2209645	[Hardware] Replace `torch.cuda.synchronize()` api with `torch.accelerator.synchronize` (#36085 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2026-03-05 10:36:39 +00:00
Robert Shaw	97995f6376	[MoE Refactor] Create MK for TRTLLM Kernels (#32564 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> Signed-off-by: Robert Shaw <robertgshaw2@gmail.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2026-03-03 10:39:50 -08:00
Hongxia Yang	f26650d649	[ROCm] add amd-quark package in requirements for rocm to use quantized models (#35658 ) Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com> Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>	2026-03-02 06:02:43 +00:00
Jesse Cai	a60985b07e	Fix deprecated v1 config tests (#35327 ) Signed-off-by: Jesse Cai <jessecai@fb.com>	2026-03-01 20:32:03 -05:00
Michael Goin	de527e1cec	[UX] Add `--moe-backend` arg for explicit kernel selection (#33807 ) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2026-02-25 17:44:44 -08:00
Matthias Gehre	4e2c7caf2d	[Bugfix] Add regression test for MoE quant_config under torch.compile (#34335 ) Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>	2026-02-20 13:27:26 +08:00
Linda	275e0d2a99	[NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE (#33715 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com> Co-authored-by: Pavani Majety <pmajety@nvidia.com>	2026-02-11 12:38:11 +00:00
Kunshang Ji	cb9574eb85	[XPU][9/N] clean up existing ipex code/doc (#34111 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2026-02-11 00:27:15 -08:00
Vasiliy Kuznetsov	0130223bd9	fix memory for online fp8 quantization with streaming weight load (#31914 ) Signed-off-by: vasiliy <vasiliy@fb.com>	2026-02-02 14:17:42 -05:00
Michael Goin	67ebaff528	Refactor NVFP4 Linear utils for ModelOpt and CT (#33201 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-01-30 16:37:42 -08:00
Kyle Sayers	f857a03f6b	[QeRL] Layerwise Reloading (#32133 ) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2026-01-30 08:50:05 -07:00
Robert Shaw	af9b69f977	[Quantization][Deprecation] Remove Marlin 24 (#32688 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-01-28 15:54:59 +00:00
Eldar Kurtić	44f08af3a7	Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) (#30141 ) Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com> Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>	2026-01-22 13:29:57 -07:00
Robert Shaw	4e31b7f228	[Quantization][Deprecation] Remove RTN (#32697 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>	2026-01-21 16:34:42 +00:00
Vasiliy Kuznetsov	d2389c1262	fp8 online quant: split out Fp8OnlineLinearMethod (#32189 )	2026-01-20 18:13:22 -05:00
vllmellm	148117ea2e	[Refactor] Make FP8 Linear Ops use kernel abstraction (#27814 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-20 14:48:20 +08:00
Yi Liu	50632adc58	Consolidate Intel Quantization Toolkit Integration in vLLM (#31716 ) Signed-off-by: yiliu30 <yi4.liu@intel.com>	2026-01-14 07:11:30 +00:00
Matt	bde57ab2ed	[Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group (#31713 ) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>	2026-01-10 23:19:46 -08:00
Robert Shaw	5825bbc1f7	[Quantization] Deprecate Long Tail of Schemes (#31688 ) Signed-off-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2026-01-08 19:07:45 -05:00
Lucas Wilkinson	6cdf015c3c	[Misc] Fix `Current vLLM config is not set.` warnings, assert to avoid issues in the future (#31747 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-01-08 15:20:49 -08:00
Zhiwei	573a1d1119	[ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch (#31905 ) Signed-off-by: ZhiweiYan-96 <zhiwei.yan@amd.com>	2026-01-08 15:47:44 +08:00
Robert Shaw	5dcd7ef1f2	[MoE Refactor][15/N] Apply Refactor to Fp8 (#31415 )	2026-01-07 19:42:33 -05:00
Li, Jiang	8becf146bd	[Quantization][Refactor] Move CPU GPTQ kernel into MP linear (#31801 ) Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: Li, Jiang <bigpyj64@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-01-06 19:10:18 +00:00
Kevin McKay	42b42824ae	[Misc] Fix grammar errors in comments and messages (#31115 ) Signed-off-by: c0de128 <kevin.mckay@outlook.com>	2025-12-21 21:14:02 -08:00
Kevin McKay	ec58c10ce1	[Misc] Fix quantization-related typos (#31116 ) Signed-off-by: c0de128 <kevin.mckay@outlook.com>	2025-12-21 21:13:48 -08:00
CedricHuang	19cc9468fd	[Feature]: Support NVIDIA ModelOpt HF FP8 variants FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO in vLLM (#30957 )	2025-12-21 22:34:49 -05:00
Robert Shaw	b471092d3a	[MoE Refactor][4/N] Marlin Fp8 Mk (#31036 )	2025-12-21 12:37:42 -05:00
Roberto L. Castro	4fa7ce46f3	[Feature] Add SM103 (Blackwell Ultra) Support to vLLM (#30484 ) Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com> Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com> Co-authored-by: youkaichao <youkaichao@gmail.com>	2025-12-12 19:34:23 -08:00
Cyrus Leung	5a87d8b9b1	[Deprecation] Remove deprecated plugin and compilation fields for v0.13 release (#30396 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-12-10 19:59:35 -08:00
Kyle Sayers	fccd532587	[Quantization] FP8 Weight Reloading for Quantized RL Rollout (#28480 ) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2025-12-09 13:54:32 -08:00
HDCharles	df01eda4dc	[Bugfix] Make compressed-tensors MoEs respect ignored layers (#28878 ) Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>	2025-11-26 21:35:13 -05:00
liangel-02	1d642872a2	[torchao] fix safetensors for sharding (#28169 ) Signed-off-by: Angel Li <liangel@meta.com>	2025-11-19 16:39:45 -08:00
Li, Jiang	20852c8f4c	[CPU] Refactor CPU WNA16 (#28826 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2025-11-19 10:32:00 +08:00
Alex	f6aa122698	[CI Sprint] Quantization CI Cleanup (#24130 ) Signed-off-by: Alex Yun <alexyun04@gmail.com>	2025-11-18 09:21:48 -05:00
Hank_	4d5943bda6	[quantization][config] enable override existing quant_config (#28510 ) Signed-off-by: Hank <hcc.mayday@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2025-11-14 01:24:10 +00:00
Andreas Karatzas	9f0247cfa4	`VLLM_USE_TRITON_FLASH_ATTN` V0 variable deprecation (#27611 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Andreas Karatzas <Andreas.Karatzas@amd.com>	2025-11-11 18:34:36 -08:00

1 2 3 4 5

211 Commits