Commit Graph

3905 Commits

Author SHA1 Message Date
JJJYmmm
04a9e064db [Bugfix] fix the ima issue of qwen-vit (#32687)
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
2026-01-20 17:21:25 +00:00
whx
4ca62a0dbd [PluggableLayer][1/N] Define PluggableLayer (#32331)
Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-01-20 16:19:21 +00:00
Cyrus Leung
fda3f03eb2 [4/N] Initialize MM components in context managers (M-P) (#32663)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 14:06:32 +00:00
Chauncey
c4e5bdf61b [Bugfix] Fix the fp8_mqa_logits dim mismatch (#32652)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-01-20 18:48:07 +08:00
Cyrus Leung
7f1bcd18ff [3/N] Initialize MM components in context managers (I-L) (#32650)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 10:21:56 +00:00
Cyrus Leung
e1a34c3a5d [2/N] Initialize MM components in context managers (E-H) (#32641)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 08:12:56 +00:00
vllmellm
148117ea2e [Refactor] Make FP8 Linear Ops use kernel abstraction (#27814)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-20 14:48:20 +08:00
Cyrus Leung
b75e85dede [1/N] Initialize MM components in context managers (A-D) (#32632)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 14:12:42 +08:00
Cyrus Leung
4753f3bf69 [Model] Use context managers for encoder- and LM-only mode (#32605)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 11:43:38 +08:00
Matthew Bonanni
1a1fc3bbc0 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill (#32615)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-19 18:41:34 -05:00
Tomas Ruiz
4a5299c93f feat: spec decode with draft models (#24322)
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
2026-01-19 16:05:46 -05:00
jiahanc
7350331718 [BugFix] Fix TRT-LLM NVFP4 DP/EP (#32349)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-19 14:32:24 -05:00
Netanel Haber
cd3ac5b797 support dynamic resolution image encoding for Nemotron Nano VL (#32121)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2026-01-19 18:15:58 +00:00
Jee Jee Li
2636d76257 [Misc] Remove unused ModelKeys (#32608)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-19 17:34:59 +00:00
danisereb
aa7f37ccfa Add support for LoRA adapters in Nemotron-H models (#30802)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
2026-01-19 22:30:44 +08:00
Nicolò Lucchesi
74c583bc50 [Core] Whisper support torch.compile (#30385)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-19 10:02:31 +00:00
Yuxuan Zhang
71832ba71e [GLM-4.7] GLM Model support for GLM-Lite (#31386)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Yuxuan Zhang <2448370773@qq.com>
2026-01-19 01:18:38 -08:00
honglyua
976af2f314 [BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGeneration (#32462)
2026-01-19 03:06:02 +00:00
Iryna Boiko
f5d1740030 [Bugfix] Add OOT backend option (#32471)
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
2026-01-18 22:20:39 +00:00
Andrey Khalyavin
ba29ab441e Use the same memory for workspace13 and fused_output. (#31531)
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru>
2026-01-18 19:14:22 +00:00
bnellnm
327a02d8db [MoE Refactor] Separate Router into OO Classes (#30623)
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-01-18 11:40:49 -05:00
tjp_zju
2f03035a61 "refactor: refactor_repeated_interfaces" (#32486)
Signed-off-by: tom-zju <tanjianpingzju1990@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-18 22:07:01 +08:00
Isotr0py
38bf2ffb21 [Bugfix] Fix GLM-ASR audio encoder RoPE dim (#32540)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-18 19:17:59 +08:00
Li Xie
c826c72a96 [Model] Support Step1 Model (#32511)
Signed-off-by: xieli <xieli@stepfun.com>
2026-01-18 10:20:46 +00:00
Canlin Guo
fe36bf5e80 [Model] Remove the unnecessary dtype conversion in MiniCPM (#32523)
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-01-18 08:07:28 +00:00
Robert Shaw
4a6af8813f [MoE Refactor] Move Test Impl into Test Dirs (#32129)
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2026-01-18 12:16:59 +08:00
Kim Hee Su
1646fea672 [Model] Molmo2: Enable quantized weight mapping for vision backbone (#32385)
Signed-off-by: kimheesu <wlskaka4@gmail.com>
2026-01-17 09:33:05 +00:00
Paul Pak
d3317bbba4 [Models] Lfm2Moe: minor name changes for resolving lora conflicts (#29063)
Signed-off-by: Paul Pak <paulpak58@gmail.com>
2026-01-16 22:12:55 -08:00
Matthew Bonanni
2e7c89e708 Revert "[Attention][MLA] Make FLASHINFER_MLA the default MLA backen… (#32484)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-17 04:42:39 +00:00
Hashem Hashemi
7a1030431a Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. (#29843)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
2026-01-16 11:45:04 -06:00
Cyrus Leung
180e981d56 [Chore] Replace swish with silu (#32459)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-16 08:22:45 +00:00
Rabi Mishra
b66b0d6abb fix(rocm): Enable non-gated MoE (is_act_and_mul=False) support on ROCm (#32244)
Signed-off-by: rabi <ramishra@redhat.com>
2026-01-16 15:31:10 +08:00
Hongxin Xu
03da3b52ef [Bugfix] Refactor to support DP parallel in R3 (#32306)
Signed-off-by: xhx1022 <1737006628@qq.com>
Co-authored-by: arlenxu <arlenxu@tencent.com>
2026-01-16 15:13:58 +08:00
XiongfeiWei
73f635a75f [Bug] Add TPU backend option (#32438)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
2026-01-16 05:17:12 +00:00
Kebe
5de6dd0662 [Bugfix] [DeepSeek-V3.2] fix sparse_attn_indexer padding (#32175)
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-16 03:21:55 +00:00
ltd0924
709502558c [Model] Add Step3vl 10b (#32329)
Signed-off-by: luotingdan <luotingdan@stepfun.com>
Signed-off-by: ltd0924 <32387785+ltd0924@users.noreply.github.com>
Co-authored-by: luotingdan <luotingdan@stepfun.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-01-15 19:04:16 -08:00
Matthew Bonanni
bcf2333cd6 [CI] Fix LM Eval Large Models (H100) (#32423)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-16 00:52:49 +00:00
Michael Goin
83239ff19a Add thread_n=64 support to Marlin MoE (#32360)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-15 16:45:44 -08:00
TomerBN-Nvidia
c277fbdf31 [Feat] Support non-gated MoE with Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors (#32257)
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Tomer Natan <tbarnatan@ipp1-1429.ipp1a1.colossus.nvidia.com>
2026-01-15 16:15:05 -08:00
Yongye Zhu
31c29257c8 [MoE Refactor][17/N] Apply Refactor to Bf16 (#31827)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-01-15 12:53:40 -08:00
Aleksandr Malyshev
8c11001ba2 [ROCM] DSfp4 mla projection gemms weight dynamic quantization (#32238)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2026-01-15 14:13:08 -06:00
Lucas Wilkinson
c36ba69bda [BugFix] Fix assert x_s.shape[-1] == x_q.shape[-1] // group_shape[1] in Blackwell Quantized MoE Test (#32362)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-15 10:19:12 -08:00
Dipika Sikka
361dfdc9d8 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models (#32285)
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-15 07:25:55 -08:00
Matthew Bonanni
8ebfacaa75 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill (#32339)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-15 09:49:57 -05:00
brian033
b89275d018 [ROCm] Improve error handling while loading quantized model on gfx120… (#31715)
Signed-off-by: brian033 <85883730+brian033@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2026-01-15 04:16:00 -08:00
Cyrus Leung
cdba4c74b3 [Model] Avoid token selection in SigLIP pooling head (#32389)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-15 17:01:59 +08:00
Lucas Wilkinson
2c9b4cf5bf [BugFix] Fix DeepSeek-V3.1 + DeepGEMM incompatible scale shapes (#32361)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
2026-01-15 06:32:22 +00:00
rasmith
3c2685645e [CI][AMD][Quantization][BugFix] Fix fp8 max in quant_utils.py and update test_fp8_quant.::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol (#32201)
Signed-off-by: Randall Smith <ransmith@amd.com>
2026-01-15 05:04:34 +00:00
Cyrus Leung
9ea07b41da [1/N] Reorganize multimodal processing code (#32327)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-14 15:25:31 +00:00
Roger Wang
b8199f6049 [Model] Re-implement Qwen3Omni Audio Encoder (#32167)
Signed-off-by: Roger Wang <hey@rogerw.io>
2026-01-14 15:40:30 +08:00