Kevin McKay
|
c60578de0a
|
[Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process (#31295)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
|
2026-01-10 03:57:38 +00:00 |
|
PatrykSaffer
|
80fead8bf6
|
Fuse RoPE and MLA KV-cache write (#25774)
Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com>
Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai>
Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-01-09 19:18:37 -08:00 |
|
Lucas Wilkinson
|
0a0aa07747
|
[Quant] Make static quant support all group shapes (#30833)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2026-01-09 12:49:27 -08:00 |
|
Wentao Ye
|
308feab33f
|
[Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement (#31830)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2026-01-09 11:13:43 -08:00 |
|
Michael Goin
|
34cd32fe30
|
[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe (#31832)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
|
2026-01-09 07:40:33 -07:00 |
|
R3hankhan
|
8e27663b6a
|
[CPU] Add head sizes 80 and 112 with vec16 fallback (#31968)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
|
2026-01-09 22:14:46 +08:00 |
|
Xin Yang
|
0ada960a20
|
[Kernel] Support bias type in grouped_topk kernel (#31781)
Signed-off-by: Xin Yang <xyangx@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-01-07 12:16:32 -08:00 |
|
Michael Goin
|
f347ac6c34
|
[Perf] Fuse stride preparation for NVFP4 cutlass_moe (#31837)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-01-07 13:31:26 -05:00 |
|
Jinzhen Lin
|
2f4bdee61e
|
[Quantization][MoE] remove unused ep logic from moe marlin (#31571)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-01-06 09:07:19 -08:00 |
|
Andreas Karatzas
|
3ecfdc3776
|
[ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition (#30719)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
2025-12-29 01:13:14 -08:00 |
|
skaraban3807
|
7cd288a4b3
|
[PERF] Add interleaved memory allocation to NUMA module (#30800)
|
2025-12-24 13:47:49 +00:00 |
|
Matthew Bonanni
|
369f47aa0f
|
[DeepSeek v3.2] Remove unnecessary syncwarps (#31047)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-12-23 21:33:30 -08:00 |
|
rongfu.leng
|
4ed11105d7
|
[Misc] Remove unused custom ops copy_blocks and copy_blocks_mla (#30967)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
|
2025-12-23 18:22:35 -08:00 |
|
danielafrimi
|
b94f80ffb8
|
[FIX] FP4 quantization kernel padding initialization bug (#31097)
Signed-off-by: <>
Co-authored-by: root <root@gpu-193.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-951.slurm-workers-slurm.slurm.svc.cluster.local>
|
2025-12-23 08:45:18 -08:00 |
|
TJian
|
022f3cea53
|
[ROCm] [Critical]: Remove unused variable (#31156)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
|
2025-12-22 08:28:22 -08:00 |
|
Jee Jee Li
|
097978a15d
|
[Kernel] Enable fused_qknorm_rope_kernel supports partial rope (#30821)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-12-21 18:39:22 -08:00 |
|
Michael Goin
|
06d490282f
|
[NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size (#30897)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-12-21 09:41:57 -08:00 |
|
Robert Shaw
|
83a317f650
|
[MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) (#30990)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
|
2025-12-19 13:09:54 -08:00 |
|
Nishidha Panpaliya
|
bd2b52fc2d
|
[CPU][Bugfix] Fix ppc64le CPU build (#30871)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
|
2025-12-19 12:26:35 +00:00 |
|
Li, Jiang
|
f90d3636e2
|
[Bugfix][CPU] Fix Mac CPU build (#30955)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-12-18 01:38:22 -08:00 |
|
Li, Jiang
|
e3ab93c896
|
[CPU] Refactor CPU fused MOE (#30531)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-12-18 14:36:49 +08:00 |
|
Sheng Lin
|
f4e884f222
|
[NIXL][Bugfix] Fix NIXL/RDMA registration failure over CuMemAllocator (#29569)
Signed-off-by: Somoku <linsh0@protonmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
|
2025-12-17 01:52:58 -08:00 |
|
Michael Goin
|
0a1ab1e565
|
[Perf][Kernels] Vectorize csrc/activations_kernels.cu (#29512)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-12-16 14:56:02 -08:00 |
|
Jinzhen Lin
|
ce96857fdd
|
[Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) (#29901)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2025-12-16 14:35:28 -08:00 |
|
Daniel Cámpora
|
eaa82a709a
|
[Bugfix][DSV32] Fix overflow in topk. (#30754)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-12-16 14:21:17 -08:00 |
|
Wentao Ye
|
f21f5ea38c
|
[Refactor] Small refactor for group topk (#30562)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2025-12-16 14:50:59 -05:00 |
|
Wentao Ye
|
1e6b115300
|
[Refactor] Reduce duplicate code in per_token_group_quant cuda kernels (#30496)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-12-12 16:45:23 -05:00 |
|
Lucas Wilkinson
|
3e41992fec
|
[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2 (#27532)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2025-12-12 05:57:47 -08:00 |
|
Bhanu Prakash Voutharoja
|
6a6fc41c79
|
gptq marlin quantization support for fused moe with lora (#30254)
Signed-off-by: Bhanu068 <voutharoja.bhanu06@gmail.com>
|
2025-12-12 02:27:22 +00:00 |
|
Wentao Ye
|
61249b177d
|
[Refactor] Remove useless syncwarp (#30510)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-12-11 17:43:41 -05:00 |
|
Aditya Tewari
|
cebda2a4af
|
[CPU] Support for Whisper (#30062)
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
|
2025-12-10 04:58:42 -08:00 |
|
Wilson Wu
|
3bdd426636
|
Fix typos in comments across multiple files (#30345)
Signed-off-by: Wilson Wu <iwilsonwu@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-12-09 20:05:28 -08:00 |
|
Hashem Hashemi
|
2e7054da06
|
Improve wvsplitK tile and balance heuristics. (#29937)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2025-12-09 23:51:32 +00:00 |
|
czhu-cohere
|
f6227c22ab
|
[Kernel] Support W4A8 Grouped GEMM on Hopper (#29691)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
|
2025-12-08 19:29:06 -08:00 |
|
gnovack
|
ea657f2078
|
Lora MoE Align Improvements (#29257)
Signed-off-by: gnovack <gnovack@amazon.com>
|
2025-12-09 10:35:16 +08:00 |
|
Wentao Ye
|
0ee6416f67
|
[Perf] Optimize group_topk kernel, 1.9% Throughput improvement, 2.1% TPOT improvement (#30159)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-12-08 19:44:01 -05:00 |
|
Daniel Cámpora
|
184076c3fe
|
[DeepSeek v3.2] Make top-k work for any logit values. (#27568)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2025-12-08 06:55:58 -08:00 |
|
ElizaWszola
|
af0444bf40
|
[Performance] Fused blockwise quant RMS norm (#27883)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
|
2025-12-07 16:38:04 +00:00 |
|
Wentao Ye
|
541a2ef892
|
[Perf] Deepgemm fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. (#29546)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-12-07 20:31:14 +08:00 |
|
Jinzhen Lin
|
879ddb09c3
|
[Kernel][MoE] optimize moe_align_block_size (#29642)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-12-07 01:58:47 -08:00 |
|
Elham
|
9843e332da
|
[CPU][Perf] Add fast vectorized exp impl from Arm Optimized Routines (#30068)
Signed-off-by: Ubuntu <ubuntu@ip-10-252-30-150.eu-west-1.compute.internal>
Signed-off-by: Elham Harirpoush <elham.harirpoush@arm.com>
Co-authored-by: Ubuntu <ubuntu@ip-10-252-30-150.eu-west-1.compute.internal>
|
2025-12-05 13:09:20 +00:00 |
|
Zhang Xiangze
|
13ea39bc09
|
[CPU] Parallelize over tokens in int4 moe (#29600)
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
|
2025-12-02 06:21:39 +00:00 |
|
Hendrik Holtmann
|
c0dfc89485
|
SM120 / NVFP4: add device guard and runtime SM dispatch to cutlass_scaled_fp4_mm (#29711)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-12-01 17:24:18 -08:00 |
|
Jinzhen Lin
|
1656ad3704
|
[Kernel][Quantization] add w4a8 support for marlin kernel (#24722)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
|
2025-11-29 07:19:33 -08:00 |
|
Li, Jiang
|
e2f56c309d
|
[CPU] Update torch 2.9.1 for CPU backend (#29664)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-11-28 13:37:54 +00:00 |
|
Jinzhen Lin
|
a67dec7cba
|
[Bugfix] fix IMA issue in certain cases of the moe marlin kernel (#28619)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-11-26 19:02:21 -08:00 |
|
Pleaplusone
|
d9d342d214
|
[Performance][MLA][ROCm] Remove redundant D2D copy in deepseek (#27457)
Signed-off-by: ganyi <ygan@amd.com>
|
2025-11-26 12:45:28 +08:00 |
|
Michael Goin
|
e502098643
|
[Kernel] Add NVFP4 MoE CUTLASS support for SM120 (#29242)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
|
2025-11-25 06:59:07 -08:00 |
|
Pleaplusone
|
77e10c9cab
|
[Perf][Deepseek] optimize gather_and_maybe_dequant_cache kernel's perf for extremely long sequence (#28029)
Signed-off-by: ganyi <ygan@amd.com>
|
2025-11-24 19:05:46 -07:00 |
|
R3hankhan
|
4de87866a8
|
[CPU][IBM Z] Fix BF16 support and vectorize math operations for s390x (#28926)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
|
2025-11-24 12:08:09 +00:00 |
|