Xin Yang
|
b1169d7be8
|
[Kernel] Add gpt-oss Router GEMM kernel (#37205)
Signed-off-by: Xin Yang <xyangx@amazon.com>
|
2026-03-18 08:15:56 -07:00 |
|
Michael Goin
|
09e4576f65
|
[Kernel] Add non-gated support for NVFP4 CUTLASS MoE (#37320)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-03-17 18:12:04 -04:00 |
|
Terry Gao
|
3e6a1e1686
|
[Custom Ops] Add functional + out variant for scaled_fp4_quant (#34389)
Signed-off-by: tianrengao <terrygao87@gmail.com>
|
2026-03-16 18:51:46 -04:00 |
|
Krish Gupta
|
c0f011918d
|
[Bugfix] opcheck false mutation error in rms_norm_per_block_quant (#36688) (#36779)
Signed-off-by: Krish Gupta <krishom70@gmail.com>
|
2026-03-16 21:11:33 +00:00 |
|
Matthew Bonanni
|
c88ea8338b
|
[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible (#36982)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-16 13:51:21 -04:00 |
|
Wentao Ye
|
ce8cf9161d
|
[Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh (#36693)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-16 11:12:15 -04:00 |
|
xjx
|
18be11fd59
|
[BUGFIX]fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 (#35594)
Signed-off-by: xjx <493337577@qq.com>
|
2026-03-16 15:10:42 +00:00 |
|
Wentao Ye
|
e855d380fa
|
[Compile] Fix compile warning in moe_permute (#36529)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-16 10:16:14 -04:00 |
|
Luka Govedič
|
9556af87d5
|
[torch.compile] Add support for non-contiguous fused RMSNorm + group quant (#36551)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com>
|
2026-03-11 10:56:55 -07:00 |
|
Julien Denize
|
a5d06dc557
|
Add 320 dimension size support to MLA (#36161)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
|
2026-03-11 10:21:22 -07:00 |
|
typer-J
|
4184653775
|
feat: add RISC-V support for CPU backend (v2) (#36578)
Signed-off-by: typer-J <2236066784@qq.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2026-03-10 21:51:39 -07:00 |
|
Hashem Hashemi
|
721ae79f50
|
Improvements to wvSplitKrc skinny GEMM solution (#34304)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-03-10 09:14:27 -07:00 |
|
Roberto L. Castro
|
580864d81e
|
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
|
2026-03-09 09:50:36 -07:00 |
|
Roberto L. Castro
|
2b28b9b269
|
[Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 (#35290)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
|
2026-03-09 09:46:57 -07:00 |
|
Jiayi Yan
|
6a895197fa
|
[Bugfix][CI] fix typos (#34934)
Signed-off-by: 1195343015 <1195343015@qq.com>
Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-03-05 17:05:46 +00:00 |
|
Tianmu Li
|
8e7820131e
|
[Perf] Use dummy M for weight prepacking on x86 (#35890)
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
|
2026-03-05 04:56:49 +00:00 |
|
EdalatiAli
|
cb21972a97
|
[Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448)
Signed-off-by: EdalatiAli <aliedalati@cohere.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-03-01 23:31:19 -08:00 |
|
Asaf Gardin
|
bbf81f9a92
|
[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching (#34798)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
|
2026-03-01 20:40:23 +08:00 |
|
Hashem Hashemi
|
7600642eae
|
Add padding support to wvSplitK solution for skinny GEMMs (#33762)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-28 09:02:05 +00:00 |
|
Ma Jian
|
90805ff464
|
[CI/Build] CPU release supports both of AVX2 and AVX512 (#35466)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: jiang1.li <jiang1.li@intel.com>
|
2026-02-28 04:35:21 +00:00 |
|
Roberto L. Castro
|
a201ad72d8
|
[Refactor][Kernel] Add global helper to deduplicate vectorized memory ops (#35105)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
|
2026-02-27 16:28:17 -08:00 |
|
roikoren755
|
38c498b8e3
|
[Performance] Cublas Bf16 Gate with Fp32 Output (#35121)
Signed-off-by: Roi Koren <roik@nvidia.com>
|
2026-02-26 16:51:28 -08:00 |
|
Tyler Michael Smith
|
eb19955c37
|
[WideEP] Remove pplx all2all backend (#33724)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-02-26 14:30:10 -08:00 |
|
Asaf Gardin
|
ec13e549d3
|
[Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic (#35275)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
|
2026-02-26 12:22:06 +00:00 |
|
Roberto L. Castro
|
86c3b5a808
|
[BugFix] Fix fp4 quant kernel on CUDA 12.8 (#35210)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
|
2026-02-25 18:32:50 -08:00 |
|
wenshuai
|
cd43673668
|
[Perf] Optimize FP8 gemm of sm120. (#34424)
Signed-off-by: wenshuai <wenshuai@xiaomi.com>
|
2026-02-24 22:25:24 -08:00 |
|
Xin Yang
|
3bbb2046ff
|
[Bugfix] Fix expert_ids padding values in moe_align_block_size kernel (#35161)
Signed-off-by: Xin Yang <xyangx@amazon.com>
|
2026-02-24 17:14:24 -08:00 |
|
Hashem Hashemi
|
a0e50a4260
|
Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. (#34100)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-24 23:35:21 +00:00 |
|
R3hankhan
|
34ce0ffd1f
|
[CPU][Perf] Accelerate Attention head for s390x using vector intrinsics (#34434)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2026-02-24 07:25:39 -08:00 |
|
Michael Goin
|
3ef9fd0f98
|
[Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches (#35123)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-02-23 17:11:27 -08:00 |
|
Robert Shaw
|
8435b2e049
|
[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) (#34302)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
|
2026-02-23 14:02:26 +00:00 |
|
Huy Do
|
272b535ab3
|
[Bugfix] Gate 256-bit instructions to CUDA 12.9+ (#34791)
Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2026-02-21 04:48:14 -08:00 |
|
Xin Yang
|
b1c4f0b265
|
[Kernel] Optimize grouped topk kernel (#34206)
Signed-off-by: Xin Yang <xyangx@amazon.com>
|
2026-02-20 01:34:45 -08:00 |
|
Robert Shaw
|
6874638bc4
|
[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) (#34758)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
|
2026-02-18 07:42:36 -08:00 |
|
ElizaWszola
|
a88b3be7c4
|
[Bugfix] Fix quant RMS norm fusion for quantization with TMA-aligned scales (#33255)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2026-02-17 23:35:04 -08:00 |
|
Hongxia Yang
|
4a00a511bb
|
[BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site (#34653)
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
|
2026-02-17 19:19:41 -08:00 |
|
Pushpinder Singh
|
bcd65c1f6a
|
[Bugfix] Replace c10::optional with std::optional in topk kernel (#34467)
Signed-off-by: Pushpinder Singh <pushpindersingh135@gmail.com>
|
2026-02-13 08:30:23 -08:00 |
|
Wei Zhao
|
59d53066d8
|
[Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation (#32993)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-02-13 08:11:26 -08:00 |
|
Hashem Hashemi
|
fac4e96940
|
small adjustment to wvSplitKrc (#34410)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-12 20:26:36 +00:00 |
|
Kyle Sayers
|
e9cd691132
|
[Bugfix] Fix Sparse24 Compressed Tensors models (#33446)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-02-11 23:15:16 -08:00 |
|
Li, Jiang
|
05339a7b20
|
[Bugfix][CPU] Fix llama4 inference on CPU (#34321)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2026-02-11 19:07:23 +08:00 |
|
R3hankhan
|
d1b837f0ae
|
[CPU] Enable FP16 (Half dtype) support for s390x (#34116)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
|
2026-02-11 14:41:42 +08:00 |
|
Дзержи́нский
|
1485396abb
|
[Kernel] Apply 256bit LDG/STG To Activation Kernels (#33022)
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2026-02-10 19:31:51 -08:00 |
|
Kebe
|
5ee5c86eeb
|
[Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast (#33884)
Signed-off-by: Kebe <mail@kebe7jun.com>
|
2026-02-10 19:31:36 -08:00 |
|
Roberto L. Castro
|
afdce12c89
|
[Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
|
2026-02-10 10:29:52 -05:00 |
|
Nikhil Gupta
|
caad9f1e01
|
[Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul (#33901)
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
|
2026-02-09 18:04:41 +08:00 |
|
ihb2032
|
5a5c43511a
|
fix(cpu): fix mla_decode compilation on x86 without AVX512 (#34052)
Signed-off-by: ihb2032 <hebome@foxmail.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
|
2026-02-09 08:55:41 +00:00 |
|
Hashem Hashemi
|
ed17f54c8b
|
Perf tuning and expansion of cases covered for wvSplitKrc (#33493)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-07 05:33:11 -08:00 |
|
Vel
|
bc32444b23
|
[Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support (#33517)
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
|
2026-02-06 20:28:01 -08:00 |
|
Wentao Ye
|
77c09e1130
|
[Refactor] Remove align block size logic in moe_permute (#33449)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-02-06 10:57:06 -08:00 |
|