Li, Jiang | e3ab93c896 | 2025-12-18 14:36:49 +08:00
    [CPU] Refactor CPU fused MOE (#30531)
    Signed-off-by: jiang1.li <jiang1.li@intel.com>

Vadim Gimpelson | 717ac33d9c | 2025-12-18 13:16:04 +08:00
    [PERF] Qwen3-next. Add fp8 cutlass MoE tuned configs. chmod -x *MI308X.json (#29553)
    Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

zzhxxx | b166ef20e1 | 2025-12-18 04:45:56 +00:00
    [refactor] Add prefix support to embed_tokens in DeepSeek MTP (#30788)
    Signed-off-by: zzhx1 <zzh_201018@outlook.com>

Matthew Bonanni | 4a8412f773 | 2025-12-17 20:21:51 -08:00
    [UX] Reduce DeepGEMM warmup log output to single progress bar (#30903)
    Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Bowen Bao | 0c738b58bc | 2025-12-18 04:20:42 +00:00
    [Quantization] Support Quark int4-fp8 w4a8 for MoE (#30071)
    Signed-off-by: Bowen Bao <bowenbao@amd.com>

Isotr0py | 6fe5887652 | 2025-12-17 19:54:39 -08:00
    [Chore] Remove v0 dead code for Qwen2.5-omni (#30883)
    Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Isotr0py | 74a1ac38b0 | 2025-12-17 16:05:24 -08:00
    [v1] Add PrefixLM support to TritonAttention backend (#30386)

Varun Sundar Rabindranath | e3fc374a9a | 2025-12-17 15:00:59 -08:00
    [BugFix] Workspace allocation during profile run : DeepEPHighThroughput + DeepGEMM (#30899)

Andrey Talman | e06d0bf0aa | 2025-12-17 12:20:22 -08:00
    2.9.1 PyTorch release update (#28495)

KimHyemin | 196cdc3224 | 2025-12-17 07:11:18 -08:00
    [Model] Gemma3: Support untied word embeddings (#30827)
    Signed-off-by: www-spam <panmahm@naver.com>

baoqian426 | 84896fda22 | 2025-12-17 03:32:34 -08:00
    [Bugfix] deepseek-V3.2 self.weights_proj has no bias (#30841)
    Signed-off-by: baoqian <1354987947@qq.com>
    Signed-off-by: baoqian426 <1354987947@qq.com>

Wentao Ye | f284d7bd0c | 2025-12-17 02:00:35 -08:00
    [Bug] Fix AttributeError: 'ColumnParallelLinear' object has no attribute weight_scale_inv (#30823)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>

Xinyu Chen | 3b1d440ede | 2025-12-17 17:43:00 +08:00
    CustomOp: grouped topk (#29575)
    Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>

Asaf Joseph Gardin | a9e15c21ef | 2025-12-17 08:48:53 +00:00
    [Mamba] Removed disable cascade attn in MambaModelConfig (#30712)
    Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>

Yan Ma | 4f735babb7 | 2025-12-17 00:28:13 -08:00
    [XPU] fix broken fp8 online quantization for XPU platform (#30831)
    Signed-off-by: Yan Ma <yan.ma@intel.com>

Li, Jiang | 0cd5353644 | 2025-12-16 23:25:12 -08:00
    [Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models (#30829)
    Signed-off-by: jiang1.li <jiang1.li@intel.com>
    Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Michael Goin | d4d2751732 | 2025-12-16 21:29:03 -08:00
    Update note comment for flashinfer attention warmup (#30711)
    Signed-off-by: mgoin <mgoin64@gmail.com>

Grzegorz K. Karch | f5db6385a1 | 2025-12-17 01:06:28 +00:00
    Fix nemotron_nas intermediate_size computation (#30795)
    Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

Jinzhen Lin | ce96857fdd | 2025-12-16 14:35:28 -08:00
    [Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) (#29901)
    Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
    Co-authored-by: Michael Goin <mgoin64@gmail.com>

Roger Wang | f5f51e5931 | 2025-12-16 14:18:17 -08:00
    [Core][MM] Optimize encoder cache manager by operating with embeddings only (#30475)
    Signed-off-by: Roger Wang <hey@rogerw.io>
    Co-authored-by: Sun Kim <sunytokki@gmail.com>

jiahanc | 254a7f8fd6 | 2025-12-16 13:01:48 -08:00
    [Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE (#30014)
    Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

Michael Goin | 10ee1c64cf | 2025-12-16 14:28:34 -05:00
    [CI] Generalize gsm8k test args and add Qwen3-Next MTP B200 test (#30723)
    Signed-off-by: mgoin <mgoin64@gmail.com>

Harry Mellor | 0b0acc758e | 2025-12-16 08:02:41 -08:00
    Remove head_mask from Ultravox and Swin (#30764)
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Ming Yang | ce12b407f2 | 2025-12-16 11:01:38 -05:00
    [TRTLLM] Remove the MoE GEMM weight name change (#30713)
    Signed-off-by: Ming Yang <minos.future@gmail.com>

Wentao Ye | 59bd5f6a71 | 2025-12-16 10:33:52 -05:00
    [Feat] Enable eplb with default all2all backend (#30559)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>

Harry Mellor | 6f15ac5de7 | 2025-12-16 13:40:26 +00:00
    Don'e assume position_embedding_type will be present for BERT and RoBERTa models (#30770)
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Isotr0py | e94384bbad | 2025-12-16 05:24:32 +00:00
    [Bugfix] Fix broken ViT attention selection for Blackwell device (#30731)
    Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Shanshan Shen | 3bd9c49158 | 2025-12-15 19:08:16 -08:00
    [CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic (#29873)
    Signed-off-by: shen-shanshan <467638484@qq.com>
    Co-authored-by: gcanlin <canlinguosdu@gmail.com>
    Co-authored-by: TJian <tunjian.tan@embeddedllm.com>

Matthew Bonanni | 60dbf7d8f1 | 2025-12-15 15:24:16 -05:00
    Update batch invariant to use attention config (#30704)
    Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Robert Shaw | d0502b4928 | 2025-12-15 06:54:53 -08:00
    [MoE][Refactor 1/N] Separate Online Quantization (#30627)
    Signed-off-by: Robert Shaw <robshaw@redhat.com>
    Co-authored-by: Robert Shaw <robshaw@redhat.com>

Max Hu | 3f175f18a2 | 2025-12-15 14:06:01 +00:00
    [Bugfix] Fix multimodal configuration for Qwen3VL MOE model (#30670)
    Signed-off-by: Max Hu <hyoung2991@gmail.com>

duke | e4806d973a | 2025-12-15 10:38:29 +00:00
    [BugFix] Add embed_input_ids method to make QWenLMHeadModel a vllm model (#30674)
    Signed-off-by: root <iwzbi@zju.edu.cn>
    Co-authored-by: root <iwzbi@zju.edu.cn>

wang.yuqi | 4429d934de | 2025-12-15 08:13:00 +00:00
    [Model] Automatic conversion of TokenClassification model (#30666)
    Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>

汪志鹏 | 1adeb3b84c | 2025-12-15 14:58:23 +08:00
    [New Model] BAGEL support (AR only) (#28439)
    Signed-off-by: princepride <wangzhipeng628@gmail.com>
    Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
    Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Wentao Ye | 3778673ea8 | 2025-12-15 04:21:36 +00:00
    [Feat] Refactor for parallel_config in FusedMoEModularKernel (#30282)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>
    Signed-off-by: Robert Shaw <robshaw@redhat.com>
    Co-authored-by: Robert Shaw <robshaw@redhat.com>
    Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

Shanshan Shen | 87b4d1557d | 2025-12-15 11:13:32 +08:00
    [CustomOp][MM] Extract MMEncoderAttention as CustomOp and replace the backend of QwenVisionAttention with it. (#30125)
    Signed-off-by: shen-shanshan <467638484@qq.com>
    Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
    Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
    Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
    Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>

Shanshan Shen | 738648fb81 | 2025-12-15 11:02:09 +08:00
    [CustomOp] Support object-level enable for CustomOp (#30547)
    Signed-off-by: shen-shanshan <467638484@qq.com>

ZiTian Zhao | ae88aada38 | 2025-12-14 05:24:56 -08:00
    [Feature]Add EVS (Efficient Video Sampling) Support for Qwen3-VL (#29752)
    Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
    Co-authored-by: deitxfge <huhaibo1990@126.com>

zifeitong | 48b8456ff9 | 2025-12-14 05:20:08 -08:00
    [Bugfix] Revert Qwen2-VL part of change in #28271 (#30542)
    Signed-off-by: Zifei Tong <zifeitong@gmail.com>

tjp_zju | 6ecc1e411b | 2025-12-14 02:20:51 -08:00
    [Bugfix] fix _get_quant_method of FusedMoE for deepseekV3.2 on non-NV… (#30057)
    Signed-off-by: tjp_zju <tanjianpingzju1990@gmail.com>

Shengliang Xu | 0bb0bae436 | 2025-12-14 18:18:31 +08:00
    Nvidia ModelOpt workaround for issue 28072 (#30164)
    Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
    Co-authored-by: Pavani Majety <pmajety@nvidia.com>

Ilya Markov | 3224ea9915 | 2025-12-14 18:15:11 +08:00
    [torch.compile] Add encoder tag for compilation (#30489)
    Signed-off-by: ilmarkov <markovilya197@gmail.com>

Lasha Koroshinadze | 3a20450d31 | 2025-12-14 02:14:55 -08:00
    Add AudioFlamingo3 model support (#30539)
    Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
    Signed-off-by: Lasha Koroshinadze <26011196+lashahub@users.noreply.github.com>
    Co-authored-by: Isotr0py <2037008807@qq.com>
    Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
    Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Didier Durand | 1a55cfafcb | 2025-12-14 02:14:37 -08:00
    [Doc]: fixing typos in various files (#30540)
    Signed-off-by: Didier Durand <durand.didier@gmail.com>
    Signed-off-by: Didier Durand <2927957+didier-durand@users.noreply.github.com>
    Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

Wentao Ye | 6e78ed6ba7 | 2025-12-13 16:12:53 -05:00
    [Logs] Optimize startup logs 4 (#29903)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>
    Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Chen Zhang | ace34e3783 | 2025-12-13 22:12:45 +08:00
    [Bugfix] Qwen3-next with --hf-overrides {"num_hidden_layers":8} (#30433)
    Signed-off-by: Chen Zhang <zhangch99@outlook.com>

Cyrus Leung | 64251f48df | 2025-12-13 04:42:39 -08:00
    [Chore] Adjust tokenizer import to avoid circular imports (#30601)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Tsukasa OI | fdc135d768 | 2025-12-13 13:55:14 +08:00
    [Misc][Quantization] Clarify the intent of GGUF FusedMoE weight materialization (#30310)
    Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>

Roberto L. Castro | 4fa7ce46f3 | 2025-12-12 19:34:23 -08:00
    [Feature] Add SM103 (Blackwell Ultra) Support to vLLM (#30484)
    Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
    Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
    Co-authored-by: youkaichao <youkaichao@gmail.com>

rasmith | 08f8a5627e | 2025-12-12 18:41:56 -05:00
    [CI/Build][Kernel][BugFix][AMD] Fix per_token_group_quant_fp8 to use correct fp8 min/max values and update atol/rtol in test_quantfp8_group_functionality (#30292)
    Signed-off-by: Randall Smith <ransmith@amd.com>
    Co-authored-by: Randall Smith <ransmith@amd.com>