Commit Graph

4495 Commits

Author SHA1 Message Date
Kyle Sayers
648edcf729 [QeRL] Compose online quantization with quantized reloading (#38032)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2026-03-27 13:22:33 -07:00
Gregory Shtrasberg
b8665383df [ROCm] Fix GPT-OSS import for triton 3.6 (#37453)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2026-03-27 18:00:57 +00:00
Rohan Potdar
0e9358c11d [ROCm]: gpt-oss fusion/padding fixes (#38043)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
2026-03-27 12:19:15 -04:00
Jonas M. Kübler
98e7f223b9 enable skipping of SW attention layers when using FP8 KV cache (#33695)
Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
2026-03-27 07:25:02 -06:00
Bowen Bao
0ae89f18fd [Refactor] Move FusedMoE hidden_size roundup to quant_method (#34285)
Signed-off-by: Bowen Bao <bowenbao@amd.com>
2026-03-26 23:38:26 -07:00
Li, Jiang
becaed6ec8 [CPU] Support CT W4A16 on CPU MP kernel (#38219)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2026-03-27 14:15:28 +08:00
Xiaoshuang Wang
a8eab8f30d [Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 (#37975)
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
2026-03-27 14:13:21 +08:00
Andreas Karatzas
a8e48a7b85 [CI] Fix conch kernel crash on 3D input by reshaping to 2D before GEMM (#38178)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-26 11:46:03 -05:00
Chuan (Richard) Li
cb2263218e [Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos (#35886)
Signed-off-by: Li <chuali@amd.com>
2026-03-26 11:59:24 -04:00
zhang-prog
0f5b526040 [Fix] Remove unused packing_position_embedding from PaddleOCRVL for better checkpoint compatibility (#38232)
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
2026-03-26 15:34:49 +00:00
Zhewen Li
be1a85b7a2 Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" (#38050) (#38169)
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
2026-03-26 07:59:09 -07:00
Jared Wen
757eafcf37 [bug-fix] GLM OCR Patch Merger context_dim (#37962)
Signed-off-by: JaredforReal <w13431838023@gmail.com>
2026-03-26 05:11:21 -07:00
Andreas Karatzas
f2d16207c7 [ROCm][CI] Fix flaky GPTQ compile correctness test (#38161)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-26 19:57:00 +08:00
Cyrus Leung
502c41a8f6 [Model] Use helper function to run MM processors with token inputs (where applicable) (#38018)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-26 16:44:04 +08:00
Vadim Gimpelson
52069012fe [Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-03-26 01:21:47 -07:00
Terry Gao
38de822310 [Model] Add torch.compile support for InternVL vision encoder (#38049)
Signed-off-by: tianrengao <terrygao87@gmail.com>
2026-03-25 23:52:29 -07:00
BadrBasowid
e6bf9f15ec [Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format (#38092)
Signed-off-by: BadrBasowid <Badr.Basowid@gmail.com>
Signed-off-by: BadrBasowid <61441185+BadrBasowid@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-25 21:11:43 -07:00
Xin Yang
9704a5c310 Disable dual stream execution of input projection for Qwen3 (#38152)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-03-26 01:20:39 +00:00
Wei Zhao
74056039b7 Fix minimax m2.5 nvfp4 kv scales weight loading (#37214)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
2026-03-26 00:48:06 +00:00
Jacob Platin
d7d51a7ee5 [Bugfix] Fix Qwen3.5-FP8 Weight Loading Error on TPU (#37348)
Signed-off-by: Jacob Platin <jacobplatin@google.com>
2026-03-26 00:46:01 +00:00
Harry Mellor
3c3c084240 Various Transformers v5 fixes (#38127)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-26 00:10:08 +00:00
Ekagra Ranjan
7b54f60db0 [Cohere] Enable Cohere-Transcribe (#38120)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
2026-03-25 16:13:51 -07:00
Andreas Karatzas
7d6917bef5 [ROCm] Fix MoE kernel test failures on gfx950 (#37833)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
2026-03-25 13:46:40 -05:00
Cyrus Leung
ba2f0acc2d [Misc] Reorganize inputs (#35182)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-25 10:22:54 -07:00
Yongye Zhu
678b3c99e8 [MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration (#38050)
2026-03-25 10:16:40 -07:00
grYe99
7ac48fd357 [Model] Add AutoWeightsLoader support for jais (#38074)
Signed-off-by: grYe99 <guorongye99@gmail.com>
Co-authored-by: grYe99 <guorongye99@gmail.com>
2026-03-25 12:38:40 +00:00
Harry Mellor
d6bb2a9d9a Fix Plamo 2/3 & LFM2 for Transformers v5 (#38090)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-25 12:29:49 +00:00
Matthias Gehre
a889b7f584 [Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 (#37280)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
2026-03-25 11:42:58 +00:00
Kunshang Ji
14771f7150 [XPU] support MLA model on Intel GPU (#37143)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-25 17:43:42 +08:00
Chauncey
09c3dc9186 [Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function (#37968)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-03-25 06:19:37 +00:00
Artem Perevedentsev
a93a53f8a1 [Performance] Auto-enable prefetch on NFS with RAM guard (#37673)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
2026-03-24 17:31:14 -07:00
Andreas Karatzas
679c6a3ecc [Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 (#37787)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-25 08:17:33 +08:00
Nick Cao
935c46dd9b [Model] Add Granite 4.0 1B speech to supported models (#38019)
Signed-off-by: Nick Cao <ncao@redhat.com>
2026-03-24 18:23:41 +00:00
Harry Mellor
b3601da6e7 [Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) (#37904)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-24 17:14:01 +00:00
Li, Jiang
352b90c4a4 [Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU (#37987)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2026-03-24 07:00:20 -07:00
Wentao Ye
c59a132f96 [V0 Deprecation] Refactor kv cache from list to element (#37487)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-23 20:10:11 -07:00
Ranran
dc6908ac6a [Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning (#35007)
Signed-off-by: Ranran <1012869439@qq.com>
Signed-off-by: Ranran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-03-23 18:31:14 -04:00
yzong-rh
e85f8f0932 [Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts (#36728)
Signed-off-by: Yifan Zong <yzong@redhat.com>
2026-03-23 17:02:57 -04:00
Robert Shaw
5bf3c42d4c [Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision (#36725)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-03-23 20:19:06 +00:00
Kyle Sayers
38364a7e32 [Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels (#36799)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2026-03-23 16:03:29 -04:00
Yufeng He
ec2280611a [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding (#37884)
2026-03-23 15:15:12 +00:00
Kunshang Ji
debd6e768c [XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle (#37784)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-23 11:10:41 +00:00
Kunshang Ji
27d5ee3e6f [FP8] add FP8 WoQ kernel abstraction. (#32929)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
2026-03-23 09:47:47 +00:00
Chuan (Richard) Li
e99fb98867 [ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs (#36100)
Signed-off-by: Li <chuali@amd.com>
2026-03-23 15:48:31 +08:00
Artem Perevedentsev
a16133a0f1 [Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 (#37338)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
2026-03-23 00:37:58 -07:00
Hojin Yang
54ab804e87 [Bugfix] Store Qwen3Next A_log in fp32 (#37810)
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-03-23 15:36:57 +08:00
r266-tech
02e6efe56d [Bugfix] JAIS: Only apply ALiBi when position_embedding_type='alibi' (#37820)
Co-authored-by: r266-tech <r266-tech@users.noreply.github.com>
2026-03-23 07:36:34 +00:00
Matthias Gehre
410d300893 [ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel (#36505)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-03-23 15:36:08 +08:00
Baorun (Lauren) Mu
f85e479e66 [Feature] ViT Full CUDA Graph (#35963)
Signed-off-by: Baorun Mu <bmu@nvidia.com>
2026-03-23 13:01:10 +08:00
Lasha Koroshinadze
e7767eccae Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling (#37643)
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
2026-03-23 10:29:07 +08:00