HZY | 32693db8ce | 2026-02-26 18:26:15 +08:00
[Bugfix][Qwen3.5] Fix Qwen3.5 FP8 quantization: tuple shard_id weight loading (#35289)
Signed-off-by: daowu.hzy <daowu.hzy@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Jiangyun Zhu | ab87f85231 | 2026-02-26 02:17:11 -08:00
[Model] Ring 2.5 (#35102)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

Fadi Arafeh | 4171ff6dd9 | 2026-02-26 05:00:10 +00:00
[CPU][Feat] Enable KleidiAI INT8_W4A8 for all input dtypes (#34890)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Michael Goin | de527e1cec | 2026-02-25 17:44:44 -08:00
[UX] Add --moe-backend arg for explicit kernel selection (#33807)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

Yongye Zhu | 1976356ee6 | 2026-02-25 17:32:39 -08:00
[MoE Refactor] MXFP4 Cutlass Experts to MK (#34542)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>

Ming Yang | 6831650c40 | 2026-02-25 17:20:59 -08:00
[offloader] v2: Hide weight onloading latency via prefetching (#29941)
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Elizabeth Thomas | c97234c08b | 2026-02-25 13:33:42 -08:00
fix(mxfp4): Disable monolithic path for TRITON backend with EP (#34270)
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Lucas Wilkinson | 15d76f74e2 | 2026-02-25 09:20:15 -08:00
Revert "[Misc] Enable weights loading tracking for quantized models" (#35309)

Laura Wang | 2465071510 | 2026-02-24 23:01:53 -08:00
[Perf] Add opt-in SM100 Oink RMSNorm custom-op path (#31828)
Signed-off-by: Laura Wang <3700467+Laurawly@users.noreply.github.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>

Kunshang Ji | 8ad54a991b | 2026-02-24 22:22:49 -08:00
[Platform] Add current_platform.num_compute_units interface (#35042)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>

Isotr0py | a6c137521c | 2026-02-24 22:12:28 -08:00
[Misc] Add shard_id validation for MergedColumnLinear (#35055)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Isotr0py | 4572a06afe | 2026-02-24 22:11:03 -08:00
[Misc] Enable weights loading tracking for quantized models (#35074)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Rohan Potdar | f38f8c9742 | 2026-02-25 04:36:40 +00:00
[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (#35180)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

yugong333 | 576fe50333 | 2026-02-24 15:56:38 -08:00
Adding Nemotron fp8 Triton MoE Config (#34674)
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Benjamin Chislett | f5972a872f | 2026-02-24 09:49:56 -08:00
[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support (#33726)
Signed-off-by: Shahar Mor <smor@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Shahar Mor <smor@nvidia.com>
Co-authored-by: Roi Koren <roik@nvidia.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>

Harry Mellor | c38b8d5a31 | 2026-02-24 08:04:46 -08:00
Remove padding_index from models that don't use it for better Transformers v5 compatibility (#35189)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

danisereb | 9609b1f18d | 2026-02-24 08:45:13 -07:00
Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 (#35053)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>

danisereb | a0c7081695 | 2026-02-24 07:25:44 -08:00
Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe (#35088)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>

Eldar Kurtić | a87cc50859 | 2026-02-24 09:02:43 -05:00
[Attn,KV-cache] Use per-head scales in the attention selector (#34281)
Signed-off-by: Your Name <you@example.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
Co-authored-by: Eldar Kurtic <research@neuralmagic.com>
Co-authored-by: Your Name <you@example.com>

Dor Huri | 012dee9233 | 2026-02-24 04:10:32 -08:00
[Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) (#35147)
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

Tugsbayasgalan Manlaibaatar | f1c664545b | 2026-02-24 09:33:35 +01:00
Make voxtral compile friendly (#33959)
Signed-off-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

BadrBasowid | 6af03f2394 | 2026-02-24 06:47:22 +00:00
[Refactor] [1/N] Reorganize kernel abstraction directory (#34055)
Signed-off-by: BadrBasowid <badr.basowid@gmail.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>

Vadim Gimpelson | 33a0d43c71 | 2026-02-23 19:42:24 -08:00
[BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable (#35156)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

Jia Guo | ec85340531 | 2026-02-23 19:07:47 -08:00
[Quantization] Support FP8 MoE bias for models like GPT-OSS (#34906)
Signed-off-by: jasperjiaguo <jasperg662@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Rohan Potdar | 2ff4e51152 | 2026-02-23 19:06:00 -08:00
[ROCm] AITER fused RoPE+KVCache (#33443)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: charlifu <charlifu@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>

Michael Goin | 22a97e6613 | 2026-02-23 16:01:28 -08:00
[Perf] Improve default triton fused moe configs (#34846)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

Harry Mellor | c4f38696f7 | 2026-02-23 08:19:30 -08:00
Use Xet high performance mode for Transformers v5 (#35098)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

haosdent | a7f341c323 | 2026-02-23 16:05:52 +00:00
[Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling (#35080)
Signed-off-by: haosdent <haosdent@gmail.com>

Eldar Kurtić | b95bb6927f | 2026-02-23 07:37:55 -07:00
[kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies (#34254)
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>

Cyrus Leung | 392645454b | 2026-02-23 14:15:50 +00:00
[Refactor] Decouple TimingContext from InputProcessingContext (#35083)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Eldar Kurtić | 1e8438a89a | 2026-02-23 09:04:34 -05:00
[Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests (#35033)
Signed-off-by: Eldar Kurtic <you@example.com>
Co-authored-by: Eldar Kurtic <you@example.com>

Robert Shaw | 8435b2e049 | 2026-02-23 14:02:26 +00:00
[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) (#34302)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>

Yan Ma | b1b5e045df | 2026-02-23 05:06:44 -08:00
[XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend (#35010)
Signed-off-by: Yan Ma <yan.ma@intel.com>

Wentao Ye | 7f40e9e516 | 2026-02-23 05:05:20 -08:00
[Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item (#35068)
Signed-off-by: yewentao256 <zhyanwentao@126.com>

Harry Mellor | 103e614b14 | 2026-02-23 05:04:47 -08:00
Fix pipeline parallel with embed scaling in the Transformers modelling backend (#35094)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Gabe Goodhart | e631f8e78e | 2026-02-23 00:42:46 -08:00
fix: Apply embedding_multiplier to inputs_embeds (#34813)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Cyrus Leung | 987506bca6 | 2026-02-22 20:55:27 -08:00
[Refactor] Simplify dummy data generation (#35025)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

qizixi | 2bcf71b9c0 | 2026-02-22 14:59:16 -08:00
[Spec Decode] Reduce TP communication for speculative decoding draft token generation (#34049)
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>

tacos8me | b7892a3bef | 2026-02-22 12:30:46 -07:00
[Model] Add NVFP4 quantization support for Step3.5-Flash (#34478)
Signed-off-by: tacos8me <ian@cloudhabit.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Athrael Soju | 970861ac0c | 2026-02-22 12:23:41 +08:00
[New Model] Add ColModernVBERT (#34558)
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Signed-off-by: athrael-soju <athrael-soju@users.noreply.github.com>

Vadim Gimpelson | 74d90b1ce4 | 2026-02-21 19:28:01 -05:00
[Model Bash][DSR1] Add selective dynamic shape marking for CustomOp (#34900)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

Roman | 98b0205c3c | 2026-02-21 04:49:41 -08:00
[Frontend] Add automatic language detection for Whisper transcription (#34342)
Signed-off-by: space_check <roman.vuskov@rwth-aachen.de>
Signed-off-by: Roman <45857014+spacecheck@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

BADAOUI Abdennacer | 8dc8a99b56 | 2026-02-21 00:34:55 -08:00
[ROCm] Enable bitsandbytes quantization support on ROCm (#34688)
Signed-off-by: badaoui <abdennacerbadaoui0@gmail.com>

Kata Coder | 5719a4e4e6 | 2026-02-20 20:01:40 -08:00
[Frontend] Support multimodal inputs for late-interaction scoring (ColQwen3) + NewModel: nvidia/nemotron-colembed (#34574)
Signed-off-by: craftsangjae <craftsangjae@gmail.com>

pougetat | 11be2c74dc | 2026-02-20 19:59:42 -08:00
[Realtime] Add Qwen3-ASR realtime streaming support (#34613)
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com>
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

Rohan Potdar | ded333fb9b | 2026-02-20 19:56:16 -08:00
[ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion (#34636)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Lucas Wilkinson | 0e22cd618b | 2026-02-20 17:19:19 -08:00
Revert "[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers" (#34997)

Wei Zhao | ea5f903f80 | 2026-02-20 13:37:31 -08:00
Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion (#34899)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Lucas Wilkinson | aaefc58ee0 | 2026-02-20 13:25:50 -08:00
[CI] Revert PRs 34818 and 33600 (#34979)

Frank Wang | 059779231f | 2026-02-19 22:07:57 -08:00
[Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend (#34916)
Signed-off-by: frankwang28 <frank.wbb@hotmail.com>
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>