Commit Graph

3978 Commits

Author SHA1 Message Date
Yueqian Lin
f8516a1ab9 [Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni (#33605)
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-02-04 12:15:29 +00:00
Vadim Gimpelson
824058076c [PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] (#33291)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-02-04 11:20:52 +00:00
Kunshang Ji
f79f777803 [XPU][2/N] add support unquantized moe support for xpu (#33659)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-02-04 02:12:25 -08:00
Frank Wang
45f8fd6f97 [Feature] Enable TRITON_ATTN for Batch Invariance (#33688)
Signed-off-by: frankwang28 <frank.wbb@hotmail.com>
2026-02-04 13:27:34 +08:00
Michael Goin
eb5ed20743 [Bugfix] Define router_logits_dtype for remaining MoE models (#33737)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-02-04 13:24:14 +08:00
Shanshan Shen
9fb27dd3b3 [MM] Align the prefix of MMEncoderAttention with Attention (#33750)
Signed-off-by: shen-shanshan <467638484@qq.com>
2026-02-04 04:07:30 +00:00
Matthew Bonanni
bd8da29a66 [Bugfix] Fix sparse MLA metadata building (#33579)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-02-03 15:29:48 -08:00
Michael Goin
2a99c5a6c8 [Bugfix] Disable TRTLLM FP8 MoE if router_logits_dtype==float32 and routing_method!=DeepSeekV3 (#33613)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-02-03 13:26:51 -08:00
Vadim Gimpelson
a372f3f40a [MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 (#33257)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-02-03 15:10:31 -05:00
Patrick von Platen
f0d5251715 [Voxtral models] Skip warm-up to skip confusing error message in warm-up (#33576)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-02-03 07:22:34 -08:00
Shanshan Shen
5c4f2dd6ef [MM] Pass prefix parameter to MMEncoderAttention (#33674)
Signed-off-by: shen-shanshan <467638484@qq.com>
2026-02-03 06:47:41 -08:00
Harry Mellor
2a8d84e66d Fix Gemma3n audio encoder for Transformers v5 (#33673)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-02-03 05:49:49 -08:00
zxy
a3acfa1071 [Models] Intern-S1-Pro (#33636)
Signed-off-by: zxy <zhou0493@e.ntu.edu.sg>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-02-03 05:49:45 -08:00
Song Zhixin
ceab70c89d [Bugfix] fix qwen3-asr response error (#33644)
Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-02-03 03:33:56 -08:00
Michael Goin
e346e2d056 [Bugfix] Disable RoutingMethodType.[Renormalize,RenormalizeNaive] TRTLLM per-tensor FP8 MoE (#33620)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-02-03 10:37:15 +00:00
Kunshang Ji
e10604480b [XPU][1/N] Deprecate ipex and switch to vllm-xpu-kernels for xpu platform (#33379)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-02-02 22:46:10 -08:00
Shengliang Xu
f1cb9b5544 Fix quantized Falcon-H1 model loading issues (#32728)
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-02-02 22:31:27 -08:00
Patrick von Platen
5019c59dd2 [Voxtral Realtime] Introduce global log mel max (#33574)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 17:01:47 -05:00
Vasiliy Kuznetsov
0130223bd9 fix memory for online fp8 quantization with streaming weight load (#31914)
Signed-off-by: vasiliy <vasiliy@fb.com>
2026-02-02 14:17:42 -05:00
Yang Liu
199e3cb476 [Model] Use mm_position to compute mrope positions for GLM-4.xV (#33039)
Signed-off-by: Yang <lymailforjob@gmail.com>
2026-02-02 16:55:48 +00:00
Isotr0py
4061dcf4c5 [Bugfix] Enable Kimi k25 processor test (#33562)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-02-02 14:25:25 +00:00
danielafrimi
0aca8b8c62 [MoE] Enable Shared/Routed Overlap For Latent MoE (Nemotron-H) (#32790)
Signed-off-by: dafrimi <dafrimi@nvidia.com>
2026-02-02 09:18:50 -05:00
Cyrus Leung
b10d05b8a8 [Model] Use explicit types in get_generation_prompt (#33551)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-02-02 12:38:49 +00:00
Borushiki
b398e5c819 Update get_expert_mapping to include self parameter (#33525)
Signed-off-by: Borushiki <38628261+Otsutsukii@users.noreply.github.com>
2026-02-02 20:29:07 +08:00
Grzegorz K. Karch
78061ef584 Fix accessing hidden_act from model config (#32686)
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
2026-02-02 11:11:33 +00:00
RED
808dd87b30 [Model] Support DeepSeek-OCR-2 (#33165)
Signed-off-by: liuli <ll407707@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: liuli <ll407707@alibaba-inc.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-02-02 06:24:10 +00:00
csy0225
c3b40dc3e7 [Models] Step-3.5-Flash (#33523)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: i-zhangmingming <i-zhangmingming@stepfun.com>
Co-authored-by: xiewuxun <xiewuxun@stepfun.com>
Co-authored-by: zetaohong <i-hongzetao@stepfun.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-02-02 10:21:18 +08:00
will b.
46b4a02794 Fix DeepSeek V2 RoPE initialization error (#33501)
Signed-off-by: Eduardo Salinas <edus@microsoft.com>
Signed-off-by: catswe <212922539+catswe@users.noreply.github.com>
Co-authored-by: Eduardo Salinas <edus@microsoft.com>
2026-02-01 21:00:56 +00:00
shaharmor98
8869cd8ec1 Add MoE config for Super B200 TP2 (#33510) 2026-02-01 18:48:37 +00:00
JartX
cd86fff38f [BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling allow inference Omni on ROCM (#33077)
Signed-off-by: JartX <sagformas@epdcenter.es>
2026-02-01 13:36:25 +00:00
Maral
b5f8c3092d [W8A8 Block Linear Refactor][1/N] Keep all quantization types into QuantFP8 class. (#33047)
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-02-01 09:28:01 +00:00
Eduardo Salinas
302ecf64ff [Models]: lfm2_siglip2 return intermediate encoder layers (#33370)
Signed-off-by: Eduardo Salinas <edus@microsoft.com>
2026-02-01 06:17:49 +00:00
René Honig
079781177a fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels (#33417)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2026-01-31 14:06:42 -08:00
Roy Wang
63c0889416 [Misc] Fix flashinfer related tests (#33462)
Signed-off-by: esmeetu <jasonailu87@gmail.com>
2026-01-31 16:10:24 -05:00
Cyrus Leung
88c3e114d8 [Refactor] Move MM data parsing outside processor (#33408)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-31 16:46:14 +00:00
Cyrus Leung
92924b2ddd [Deprecation] Remove deprecated items related to pooling (#33477)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-31 08:44:40 -08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
13b842f271 [BugFix][Router Replay] Capture Logical Experts with EPLB (#33013)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2026-01-31 10:12:17 -05:00
Cyrus Leung
f0a1c8453a [Frontend] Use new Renderer for Completions and Tokenize API (#32863)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-31 04:51:15 -08:00
Jinwu
f68e3ea4e1 [BugFix] Add synchronize in CutlassW4A8LinearKernel to ensure data is ready for use. (#33078)
Co-authored-by: jinwuguo <jinwuguo@tencent.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-31 08:14:54 +00:00
Fadi Arafeh
1618e25492 [CPU][Feat] Enable KleidiAI accelerated int4 dynamic quant with BF16 activations on Arm CPUs (#33122)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2026-01-31 07:16:22 +00:00
AutumnAurelium
f3888aca83 Add EAGLE3 support for AFMoE (#33111)
Signed-off-by: AutumnAurelium <88015631+AutumnAurelium@users.noreply.github.com>
2026-01-31 06:53:08 +00:00
Dimitrios Bariamis
f0bca83ee4 Add support for Mistral Large 3 inference with Flashinfer MoE (#33174)
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-01-30 22:48:27 -08:00
Matthias Gehre
73419abfae [Bugfix] Handle Asym W4A16 (ConchLinearKernel) for CT (#33200)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-01-31 06:21:51 +00:00
Nicolò Lucchesi
e77f162cf5 [Bugfix] Fix Qwen3ASR language asr tag in output (#33410)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-31 05:24:49 +00:00
Patrick von Platen
15e0bb9c42 [Streaming -> Realtime] Rename all voxtral related classes, fn, files (#33415)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
2026-01-31 04:49:00 +00:00
Matthew Bonanni
aaa901ad55 [Attention] Move MLA forward from backend to layer (#33284)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-30 19:30:00 -08:00
Wentao Ye
010ec0c30e [Deprecation] Deprecate seed_everything and scatter_mm_placeholders in v0.15 (#33362)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-31 02:54:16 +00:00
Gregory Shtrasberg
31aedfe7d6 [Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 (#33366)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2026-01-30 19:05:23 -06:00
Michael Goin
67ebaff528 Refactor NVFP4 Linear utils for ModelOpt and CT (#33201)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-30 16:37:42 -08:00
Pavani Majety
c3a9752b0c [Hardware][SM100] Add TRTLLM Kernel for INT4 W4A16 Kernel. (#32437)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2026-01-30 10:30:46 -08:00