68c09efc37 | zhrrr | 2025-11-11 12:00:31 -05:00
    [Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (#27165)
    Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>

f9a4087182 | Michael Goin | 2025-11-11 11:46:04 -05:00
    Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel (#28431)
    Signed-off-by: mgoin <mgoin64@gmail.com>

b886068056 | Fanli Lin | 2025-11-11 15:29:33 +00:00
    [BugFix] Fix RuntimeError in PixtralHFAttention on CPU/XPU (#28444)
    Signed-off-by: Lin, Fanli <fanli.lin@intel.com>

a1448b4b69 | bnellnm | 2025-11-11 07:29:02 -07:00
    [Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (#28064)

afffd3cc8a | Cyrus Leung | 2025-11-11 21:14:48 +08:00
    [Model] Pass mm_features directly into get_mrope_input_positions (#28399)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

b30dfa03c5 | Matthew Bonanni | 2025-11-11 07:40:44 -05:00
    [Attention] Refactor CUDA attention backend selection logic (#24794)
    Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
    Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
    Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

9973e6e04a | Lukas Geiger | 2025-11-11 10:35:10 +00:00
    [Model][Qwen3VL] Slightly speed up fast_pos_embed_interpolate (#28434)
    Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

c7991269dd | Fanli Lin | 2025-11-11 08:45:38 +00:00
    [BugFix] Fix "'DeepseekV2Config' object has no attribute 'use_mla'" (#28387)
    Signed-off-by: Lin, Fanli <fanli.lin@intel.com>

f0359fffa4 | Jiangyun Zhu | 2025-11-11 08:24:28 +00:00
    [Bugfix] Fix qwen3-next crash (#28202)
    Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

4fd4b743a2 | Roger Wang | 2025-11-11 08:07:24 +00:00
    [Bugfix] Fix max image size for PaddleOCR-VL (#28442)
    Signed-off-by: Roger Wang <hey@rogerw.io>

e605e8e323 | Robert Shaw | 2025-11-11 05:59:08 +00:00
    [Bugfix] Fix Stream Sync for Shared Expert Overlap (#28430)
    Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
    Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
    Co-authored-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

de540c0354 | Wentao Ye | 2025-11-11 02:29:48 +00:00
    [Feature] Add env var VLLM_MOE_USE_DEEP_GEMM (#28422)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>

35d801f13f | Wentao Ye | 2025-11-11 00:08:40 +00:00
    [Feature] Refactor batch invariant fp8 DeepGEMM (#27606)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>

d17ecc6b19 | Ilya Markov | 2025-11-10 18:33:11 -05:00
    [PERF] Allreduce fusion: support torch native matching, tune thresholds (#24248)
    Signed-off-by: Luka Govedič <lgovedic@redhat.com>
    Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
    Signed-off-by: ilmarkov <markovilya197@gmail.com>
    Co-authored-by: Luka Govedič <lgovedic@redhat.com>
    Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
    Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

021143561f | Yong Hoon Shin | 2025-11-10 23:13:36 +00:00
    [ROCm] Add missing gemm_a8w8_blockscale import (#28378)
    Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

6dec9f6109 | Lucas Wilkinson | 2025-11-10 17:01:17 -05:00
    [BugFix] Fix DeepGEMM over-allocating workspace (#28254)
    Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

40d33264c6 | Sage Moore | 2025-11-10 20:39:19 +00:00
    [Bugfix][EPLB] Disable shared expert overlap when EPLB is enabled (#28377)
    Signed-off-by: Sage Moore <sage@neuralmagic.com>
    Signed-off-by: Sage Moore <sagemoore@utexas.edu>
    Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
    Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

34553b9d27 | jiahanc | 2025-11-10 12:34:57 -05:00
    [Performance] Support FP8 FlashInfer TRTLLM MoE on Qwen3 and Qwen3-Next (#27492)
    Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

b039bfda8f | Varun Sundar Rabindranath | 2025-11-10 09:21:52 -08:00
    [Bugfix] Fix persistent_masked_m_silu_mul_quant tests (#28366)
    Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
    Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

d0e186c16f | Cyrus Leung | 2025-11-11 00:30:06 +08:00
    [V0 Deprecation] Remove unused context_len and seq_len from M-RoPE (#28395)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

f080a83511 | vllmellm | 2025-11-10 08:20:53 -08:00
    [RFC][ROCm][AITER] Keep all AITER kernels in _aiter_ops class like _custom_ops and _ipex_ops (#24490)
    Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
    Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

b06b9470ca | zejunchen-zejun | 2025-11-10 10:38:56 -05:00
    [ROCm][fused_moe][fp4] View weight as torch.float4_e2m1fn_x2 when running AITER fused MoE for fp4 models (#27474)
    Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

912744d066 | Ferrebo | 2025-11-10 13:23:49 +00:00
    [Fix] Optimize visual token mask with caching and multi-token support (#28374)
    Signed-off-by: Ferrebo <itachi971009@gmail.com>
    Signed-off-by: kebo01 <kebo01@baidu.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

15be507c86 | Yu Jiaqi | 2025-11-10 21:21:15 +08:00
    [Bugfix] Fix SigLIP batch text output error (#28365)
    Signed-off-by: piood <2477084691@qq.com>

03fa4d3fb3 | Xiake Sun | 2025-11-10 04:53:40 +00:00
    [Hardware][AMD][Model] Add Triton MoE tuning support and optimized configs for Qwen3 Omni on MI308X (#28373)
    Signed-off-by: Xiake Sun <xiake.sun@amd.com>
    Signed-off-by: Xiake Sun <xisun@amd.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

c4768dcf47 | Jiangyun Zhu | 2025-11-09 14:26:35 -07:00
    [Kernel] Fix fused_gdn_gating (#28343)
    Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

7ae5a5fb11 | Jiangyun Zhu | 2025-11-08 23:59:24 -08:00
    [Misc] Add some comments in qwen3-next (#28267)
    Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

de2b78305f | Yong Hoon Shin | 2025-11-08 22:27:00 -08:00
    [ROCm] Add env to enable/disable AITER Triton GEMM (#28321)
    Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

404d7a9d14 | Mohammad Miadh Angkad | 2025-11-08 15:50:10 -07:00
    [Performance][gpt-oss] Revert gpt-oss max cudagraph size to 1024 (#28345)
    Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>

26990d25dc | Robert Shaw | 2025-11-08 19:01:11 +00:00
    [Bugfix] Update device name for H200 detection (#28349)
    Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

934a9c3b79 | Isotr0py | 2025-11-08 05:01:27 +00:00
    [Model] Consolidate Deepseek-MoE implementation with DeepSeek-V2 (#28101)
    Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
    Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
    Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

0852527647 | Michael Goin | 2025-11-07 18:20:55 -08:00
    [Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM (#28124)
    Signed-off-by: mgoin <mgoin64@gmail.com>
    Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
    Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

1aaecda078 | Kunshang Ji | 2025-11-08 00:33:11 +00:00
    [XPU] Enable expert parallel for MoE models (#28263)
    Signed-off-by: Yan Ma <yan.ma@intel.com>
    Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

72b1c2ae2c | Pavani Majety | 2025-11-07 04:18:39 -08:00
    [Bugfix] Use latency MoE backend as default for FlashInfer and other misc fixes (#27439)
    Signed-off-by: Pavani Majety <pmajety@nvidia.com>

e0919f331d | Lukas Geiger | 2025-11-07 12:14:29 +00:00
    [Core][MM] Add mechanism to configure multimodal fields which should stay on CPU (#28168)
    Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

8e19d470af | Kevin H. Luu | 2025-11-07 12:09:09 +00:00
    [fix] Revert "fixing mm placeholder replacement issue with gemma3" (#28285)
    Signed-off-by: Kevin H. Luu <khluu000@gmail.com>

1958bda9b4 | Mengqing Cao | 2025-11-07 19:38:38 +08:00
    [Misc][Model][Refactor] Pass the prefix into Linear layers (#28259)
    Signed-off-by: MengqingCao <cmq0113@163.com>

11fd69dd54 | smit kadvani | 2025-11-07 05:27:42 +00:00
    [amd][gptoss] Perf gain because of block alignment (#28024)
    Signed-off-by: Smit Kadvani <smit.kadvani@gmail.com>
    Co-authored-by: Smit Shaileshbhai Kadvani <kadvani@meta.com>

c0a4b95d64 | Harry Mellor | 2025-11-07 04:23:17 +00:00
    Fix issues from #28242 (#28257)
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

4bf56c79cc | Lucas Kabela | 2025-11-07 00:16:03 +00:00
    [Multimodal][torch.compile] Add compilation config field for turning off ViT/MM compile (#28242)
    Signed-off-by: Lucas Kabela <lucaskabela@meta.com>

449de9001a | Aleksandr Malyshev | 2025-11-06 14:46:44 -05:00
    [ROCm] Triton fp8 kernel (#27058)
    Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
    Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
    Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

7a8375f8a0 | Julien Denize | 2025-11-06 18:55:17 +00:00
    Add Llama 4 scaling support (#28145)
    Signed-off-by: Julien Denize <julien.denize@mistral.ai>

0370679ce9 | Eric Yue | 2025-11-06 07:29:46 -08:00
    [Kernel][Model] Tune fused_moe Triton configs for MiniMax-M2 on H100 (#28200)
    Signed-off-by: minatoaquaMK2 <jiacheng.yue@foxmail.com>

c757a15f0f | xiangze-arm | 2025-11-06 11:04:18 +00:00
    [CPU] Improve CPU fused MoE perf (#27244)
    Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>

201dc98acc | Seungduk Kim | 2025-11-05 23:07:36 -08:00
    Fix hard-coded parameter name in gemma3n.py (#27946)
    Signed-off-by: Seungduk Kim <seungduk.kim@yanolja.com>
    Signed-off-by: Biswa Panda <biswa.panda@gmail.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
    Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
    Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

e31946f86e | Xiaozhu Meng | 2025-11-06 05:52:16 +00:00
    [flashinfer] Fix FI all2all with FI CUTLASS MoE (#28166)
    Signed-off-by: Xiaozhu <mxz297@gmail.com>

43ecd0a900 | Isotr0py | 2025-11-06 03:46:30 +00:00
    [Chore] Clean up DeepSeek V2/V3 config copy (#28055)
    Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

d71af5f502 | Wentao Ye | 2025-11-05 17:21:08 -08:00
    [Feature] Enable TP + EP shared_experts overlap with router, 3.7% E2E performance improvement (#28164)
    Signed-off-by: yewentao256 <zhyanwentao@126.com>

b6a248bdd7 | Vadim Gimpelson | 2025-11-05 17:01:12 -08:00
    [PERF] Decouple projections from GDN custom op, attempt 2 (#28083)
    Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

802748bddb | wang.yuqi | 2025-11-05 18:33:50 +00:00
    [Bugfix] Fix Qwen3-Reranker-8B load (#28117)
    Signed-off-by: wang.yuqi <noooop@126.com>