jiahanc | 5f96c00c55 | 2025-11-23 00:39:30 +00:00
  [Fix] Add SM check to flashinfer MOE backend (#29144)
  Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
  Signed-off-by: mgoin <mgoin64@gmail.com>
  Co-authored-by: mgoin <mgoin64@gmail.com>
Federico | f55c76c2b3 | 2025-11-22 08:42:48 -08:00
  chore: add RTX_PRO_6000 GLM4.6-FP8 kernel tuning (#29240)
Bram Wasti | 5f7209a793 | 2025-11-22 21:00:50 +08:00
  [tiny] Remove unsupported TRITON_MLA backend from batch invariance (#28832)
  Signed-off-by: Bram Wasti <bwasti@meta.com>
  Signed-off-by: Bram Wasti <bwasti@fb.com>
  Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
jinghanhu | 988ee66b0d | 2025-11-22 10:07:50 +00:00
  Handle triton kernel import exception (#29062)
FlintyLemming | 052950e5b3 | 2025-11-21 17:37:51 -08:00
  Add fused MoE config for H200 E160 N192 fp8 (#29182)
  Signed-off-by: FlintyLemming <admin@flinty.moe>
Lukas Geiger | d045e22dfe | 2025-11-21 17:30:55 -08:00
  [Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s (#29217)
  Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Varun Sundar Rabindranath | 3137991f55 | 2025-11-21 14:28:17 -08:00
  [BugFix] EPLB + B200 + DeepGEMM : Handle column-major scales tensor (#29162)
  Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
  Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Lucas Wilkinson | 1840c5cb18 | 2025-11-21 11:41:52 -08:00
  [BugFix] Make sure to allocate worst case MoE workspace during profile run in the DP + EP case (#27426)
  Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Mingyuan Ma | b4c8fbaae2 | 2025-11-21 09:54:11 -07:00
  Add TRTLLM MoE NVFP4 kernel to CompressedTensorsW4A4MoeMethod (#28892)
  Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
  Signed-off-by: mgoin <mgoin64@gmail.com>
  Co-authored-by: mgoin <mgoin64@gmail.com>
rasmith | e99e467384 | 2025-11-21 11:53:09 -05:00
  [CI/Build][Kernel][AMD] Move extra dim to after load in _fwd_kv_parallel in lighting_attn.py (#29132)
  Signed-off-by: Randall Smith <ransmith@amd.com>
  Co-authored-by: Randall Smith <ransmith@amd.com>
Wentao Ye | a42ab317ac | 2025-11-21 08:46:20 -08:00
  [Log] Optimize startup log (#28948)
  Signed-off-by: yewentao256 <zhyanwentao@126.com>
  Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
  Co-authored-by: Nick Hill <nhill@redhat.com>
Aleksandr Malyshev | b7f1f490a6 | 2025-11-21 11:34:46 -05:00
  Upstream triton fp4 weight preshuffle (#28888)
  Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
  Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Cyrus Leung | aab0102a26 | 2025-11-21 11:56:59 +00:00
  [V0 deprecation] Remove more V0 references (#29088)
  Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Hongxia Yang | 3f5f36da3f | 2025-11-21 03:30:07 +00:00
  [ROCm] Fix for import when building with upstream triton for gfx1100 for gpt-oss serving (#29127)
  Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Wentao Ye | e1eefa4c40 | 2025-11-21 01:54:59 +00:00
  [Bug] Fix torch warning of tf32 usage (#29112)
  Signed-off-by: yewentao256 <zhyanwentao@126.com>
Wentao Ye | df44df0143 | 2025-11-20 18:41:49 -07:00
  [Feature] Shared Experts Overlap with FI deepgemm swap kernel, 2.2% throughput improvement and 3.6% TTFT improvement (#28879)
  Signed-off-by: yewentao256 <zhyanwentao@126.com>
Anna Shors | 6eb745d9bd | 2025-11-20 18:53:50 +08:00
  Add truncate arg to yarn to match openai implementation of gpt-oss (#28244)
  Signed-off-by: ashors1 <ashors@nvidia.com>
  Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Wentao Ye | 2c52c7fd9a | 2025-11-20 16:52:23 +08:00
  [Bug] Fix torch dynamo warning Dynamo detected a call to a functools.lru_cache (#29038)
  Signed-off-by: yewentao256 <zhyanwentao@126.com>
Shengliang Xu | a8c536829c | 2025-11-19 22:39:36 -05:00
  Consolidate Nvidia ModelOpt quant config handling for all quantization methods (#28076)
  Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Wentao Ye | 5031cd5d55 | 2025-11-19 18:53:15 -05:00
  [Refactor] Optimize select_experts (#28069)
  Signed-off-by: yewentao256 <zhyanwentao@126.com>
JartX | 8e38e99829 | 2025-11-19 18:30:08 -05:00
  [Feature] EPLB on Qwen3VLMoe and CompressedTensorsWNA16MoEMethod (#28849)
Max Hu | cb0a7b4bea | 2025-11-19 21:54:15 +00:00
  [Bugfix] Move flashinfer kernel check into `__init__` function of `FusedMoE` (#29018)
  Signed-off-by: Max Hu <hyoung2991@gmail.com>
Yongye Zhu | 88f5b19f0b | 2025-11-19 16:30:04 -05:00
  [DeepSeek] Fix DeepSeek V3.2 Rope Embedding (#28968)
  Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Shu Wang | 613abb50d5 | 2025-11-19 13:29:06 -08:00
  [MoE] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#25990)
  Signed-off-by: Shu Wang. <shuw@nvidia.com>
  Signed-off-by: mgoin <mgoin64@gmail.com>
  Co-authored-by: Michael Goin <mgoin64@gmail.com>
Wentao Ye | 1607e664f0 | 2025-11-19 21:18:32 +00:00
  [Bug] Fix Batch Invariant MLA test (#28967)
  Signed-off-by: yewentao256 <zhyanwentao@126.com>
Qiu | 2fd893b4ce | 2025-11-19 15:52:44 -05:00
  [Feature] Prefill Context Parallel (PCP) basic support (#28718)
  Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
  Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
  Signed-off-by: LookAround <lixushi@huawei.com>
  Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
  Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
  Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com>
  Co-authored-by: LookAround <lixushi@huawei.com>
  Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
  Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
  Co-authored-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com>
杰兮 | 9d2d561257 | 2025-11-19 19:30:57 +00:00
  [Bugfix] Fix precision corruption when shared_experts_stream=None (#28942)
  Signed-off-by: zhyajie <yajizhan@amd.com>
  Co-authored-by: zhyajie <yajizhan@amd.com>
Robert Shaw | fe69f331f8 | 2025-11-19 19:23:54 +00:00
  [Kernels] Improve H200 Fused MoE Config (#28992)
  Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Harry Mellor | a8b70304d6 | 2025-11-19 09:06:36 -08:00
  Update rope_scaling to rope_parameters in preparation for Transformers v5 (#28542)
  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Shanshan Shen | d44e9df7d4 | 2025-11-19 16:24:55 +00:00
  [Model][Mamba] Add selector for mamba attention backend and make it pluggable for other device (#26487)
  Signed-off-by: shen-shanshan <467638484@qq.com>
Chen Bruce | da2f6800e0 | 2025-11-19 13:46:24 +01:00
  [Feat][Perf] Enable deepep-low-latency with round-robin expert placement. (#28449)
  Signed-off-by: bruceszchen <bruceszchen@tencent.com>
  Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Xin Yang | 468a8d72ba | 2025-11-19 13:05:22 +08:00
  [Bugfix] Fix FusedMoEModularKernel for triton backend (#28913)
  Signed-off-by: Xin Yang <xyangx@amazon.com>
Li, Jiang | 20852c8f4c | 2025-11-19 10:32:00 +08:00
  [CPU] Refactor CPU WNA16 (#28826)
  Signed-off-by: jiang1.li <jiang1.li@intel.com>
tomeras91 | 1395461f5f | 2025-11-18 16:49:36 -08:00
  [Hybrid][torch.compile] Refactor mamba2 forward to avoid obscuring linear projections under custom op (#28587)
  Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Varun Sundar Rabindranath | 9912b8ccb8 | 2025-11-18 16:45:20 -08:00
  [Build] Add OpenAI triton_kernels (#28788)
  Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
  Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Isotr0py | e4bb2684bc | 2025-11-18 18:56:04 +00:00
  [Models] Replace all nn.Conv2d with vLLM's Conv2dLayer (#28842)
  Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Luciano Martins | c2612371ad | 2025-11-18 08:56:29 -08:00
  [Model] Add Gemma3 GGUF multimodal support (#27772)
  Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
  Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
  Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
  Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Canlin Guo | b9489f51e1 | 2025-11-18 11:51:54 +00:00
  [Model][Perf] Use cos and sin cache in QwenVL (#28798)
  Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Wentao Ye | 3ddcf46011 | 2025-11-17 20:29:29 -08:00
  [Refactor] Remove Unused Func in Batch Invariant (#28881)
  Signed-off-by: yewentao256 <zhyanwentao@126.com>
xuebwang-amd | d0a73620cc | 2025-11-18 11:16:45 +08:00
  [ROCm][Quantization] add apply_vllm_mapper in quark config for models like gpt-oss (#28638)
  Signed-off-by: xuebwang-amd <xuebwang@amd.com>
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Zhewen Li | f8b19c0ffd | 2025-11-17 13:15:26 -05:00
  [Bugfix] Fix GPT-OSS on AMD after #28603 (#28816)
  Signed-off-by: zhewenli <zhewenli@meta.com>
jiahanc | 561253b37f | 2025-11-16 18:02:42 -08:00
  [Performance][Fix] update nvfp4 code to support renorm routing (#28569)
  Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
  Co-authored-by: Michael Goin <mgoin64@gmail.com>
amirkl94 | 03ee48111d | 2025-11-16 13:39:44 -05:00
  Feature: Support Relu2 in FusedMoE fp8 cutlass path (#27261)
Zhewen Li | 1ec978c209 | 2025-11-15 01:10:48 -08:00
  [Kernel][Moe Configs] llama4 maverick fp8 moe config tp8 on mi325 (#28709)
  Signed-off-by: Zhewen Li <zhewenli@meta.com>
Varun Sundar Rabindranath | 6965ef436f | 2025-11-15 13:52:14 +08:00
  [Performance][DeepGEMM] Estimate expected_m (#28694)
  Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
  Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Thomas Parnell | e0c910bb89 | 2025-11-14 22:55:42 +00:00
  [Hybrid] [Kernel] Fix chunk scan kernel when BLOCK_SIZE_DSTATE > 128 (#28295)
  Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Alexander Matveev | e5c78956c0 | 2025-11-14 14:13:46 -08:00
  [Bugfix] Fix incorrect use of hidden_states for shared_experts due to do_naive_dispatch_combine (#28740)
  Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Andrey Khalyavin | fd4555089a | 2025-11-14 10:58:18 -08:00
  [BugFix] Fix misprint introduced by modular_kernel refactoring. (#28728)
  Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru>
TJian | a425dc256e | 2025-11-14 10:30:50 -08:00
  [Bugfix] [ROCm] [AITER]: Fix aiter block quant not compatible with torch compile dynamo (#28716)
  Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Duncan Moss | 3f8a874065 | 2025-11-14 08:02:44 -08:00
  [Kernels] Enable FlashInfer FP8 Blockscale on SM90 (for TEP DSR1) (#27134)
  Signed-off-by: Duncan Moss <djm.moss@gmail.com>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
  Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>