typer-J
|
4184653775
|
feat: add RISC-V support for CPU backend (v2) (#36578)
Signed-off-by: typer-J <2236066784@qq.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2026-03-10 21:51:39 -07:00 |
|
Hashem Hashemi
|
721ae79f50
|
Improvements to wvSplitKrc skinny GEMM solution (#34304)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-03-10 09:14:27 -07:00 |
|
Roberto L. Castro
|
580864d81e
|
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
|
2026-03-09 09:50:36 -07:00 |
|
Roberto L. Castro
|
2b28b9b269
|
[Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 (#35290)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
|
2026-03-09 09:46:57 -07:00 |
|
Jiayi Yan
|
6a895197fa
|
[Bugfix][CI] fix typos (#34934)
Signed-off-by: 1195343015 <1195343015@qq.com>
Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-03-05 17:05:46 +00:00 |
|
Tianmu Li
|
8e7820131e
|
[Perf] Use dummy M for weight prepacking on x86 (#35890)
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
|
2026-03-05 04:56:49 +00:00 |
|
EdalatiAli
|
cb21972a97
|
[Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448)
Signed-off-by: EdalatiAli <aliedalati@cohere.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-03-01 23:31:19 -08:00 |
|
Asaf Gardin
|
bbf81f9a92
|
[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching (#34798)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
|
2026-03-01 20:40:23 +08:00 |
|
Hashem Hashemi
|
7600642eae
|
Add padding support to wvSplitK solution for skinny GEMMs (#33762)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-28 09:02:05 +00:00 |
|
Ma Jian
|
90805ff464
|
[CI/Build] CPU release supports both of AVX2 and AVX512 (#35466)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: jiang1.li <jiang1.li@intel.com>
|
2026-02-28 04:35:21 +00:00 |
|
Roberto L. Castro
|
a201ad72d8
|
[Refactor][Kernel] Add global helper to deduplicate vectorized memory ops (#35105)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
|
2026-02-27 16:28:17 -08:00 |
|
roikoren755
|
38c498b8e3
|
[Performance] Cublas Bf16 Gate with Fp32 Output (#35121)
Signed-off-by: Roi Koren <roik@nvidia.com>
|
2026-02-26 16:51:28 -08:00 |
|
Tyler Michael Smith
|
eb19955c37
|
[WideEP] Remove pplx all2all backend (#33724)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-02-26 14:30:10 -08:00 |
|
Asaf Gardin
|
ec13e549d3
|
[Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic (#35275)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
|
2026-02-26 12:22:06 +00:00 |
|
Roberto L. Castro
|
86c3b5a808
|
[BugFix] Fix fp4 quant kernel on CUDA 12.8 (#35210)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
|
2026-02-25 18:32:50 -08:00 |
|
wenshuai
|
cd43673668
|
[Perf] Optimize FP8 gemm of sm120. (#34424)
Signed-off-by: wenshuai <wenshuai@xiaomi.com>
|
2026-02-24 22:25:24 -08:00 |
|
Xin Yang
|
3bbb2046ff
|
[Bugfix] Fix expert_ids padding values in moe_align_block_size kernel (#35161)
Signed-off-by: Xin Yang <xyangx@amazon.com>
|
2026-02-24 17:14:24 -08:00 |
|
Hashem Hashemi
|
a0e50a4260
|
Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. (#34100)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-24 23:35:21 +00:00 |
|
R3hankhan
|
34ce0ffd1f
|
[CPU][Perf] Accelerate Attention head for s390x using vector intrinsics (#34434)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2026-02-24 07:25:39 -08:00 |
|
Michael Goin
|
3ef9fd0f98
|
[Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches (#35123)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-02-23 17:11:27 -08:00 |
|
Robert Shaw
|
8435b2e049
|
[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) (#34302)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
|
2026-02-23 14:02:26 +00:00 |
|
Huy Do
|
272b535ab3
|
[Bugfix] Gate 256-bit instructions to CUDA 12.9+ (#34791)
Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2026-02-21 04:48:14 -08:00 |
|
Xin Yang
|
b1c4f0b265
|
[Kernel] Optimize grouped topk kernel (#34206)
Signed-off-by: Xin Yang <xyangx@amazon.com>
|
2026-02-20 01:34:45 -08:00 |
|
Robert Shaw
|
6874638bc4
|
[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) (#34758)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
|
2026-02-18 07:42:36 -08:00 |
|
ElizaWszola
|
a88b3be7c4
|
[Bugfix] Fix quant RMS norm fusion for quantization with TMA-aligned scales (#33255)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2026-02-17 23:35:04 -08:00 |
|
Hongxia Yang
|
4a00a511bb
|
[BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site (#34653)
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
|
2026-02-17 19:19:41 -08:00 |
|
Pushpinder Singh
|
bcd65c1f6a
|
[Bugfix] Replace c10::optional with std::optional in topk kernel (#34467)
Signed-off-by: Pushpinder Singh <pushpindersingh135@gmail.com>
|
2026-02-13 08:30:23 -08:00 |
|
Wei Zhao
|
59d53066d8
|
[Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation (#32993)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-02-13 08:11:26 -08:00 |
|
Hashem Hashemi
|
fac4e96940
|
small adjustment to wvSplitKrc (#34410)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-12 20:26:36 +00:00 |
|
Kyle Sayers
|
e9cd691132
|
[Bugfix] Fix Sparse24 Compressed Tensors models (#33446)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-02-11 23:15:16 -08:00 |
|
Li, Jiang
|
05339a7b20
|
[Bugfix][CPU] Fix llama4 inference on CPU (#34321)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2026-02-11 19:07:23 +08:00 |
|
R3hankhan
|
d1b837f0ae
|
[CPU] Enable FP16 (Half dtype) support for s390x (#34116)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
|
2026-02-11 14:41:42 +08:00 |
|
Дзержи́нский
|
1485396abb
|
[Kernel] Apply 256bit LDG/STG To Activation Kernels (#33022)
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2026-02-10 19:31:51 -08:00 |
|
Kebe
|
5ee5c86eeb
|
[Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast (#33884)
Signed-off-by: Kebe <mail@kebe7jun.com>
|
2026-02-10 19:31:36 -08:00 |
|
Roberto L. Castro
|
afdce12c89
|
[Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
|
2026-02-10 10:29:52 -05:00 |
|
Nikhil Gupta
|
caad9f1e01
|
[Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul (#33901)
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
|
2026-02-09 18:04:41 +08:00 |
|
ihb2032
|
5a5c43511a
|
fix(cpu): fix mla_decode compilation on x86 without AVX512 (#34052)
Signed-off-by: ihb2032 <hebome@foxmail.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
|
2026-02-09 08:55:41 +00:00 |
|
Hashem Hashemi
|
ed17f54c8b
|
Perf tuning and expansion of cases covered for wvSplitKrc (#33493)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-07 05:33:11 -08:00 |
|
Vel
|
bc32444b23
|
[Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support (#33517)
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
|
2026-02-06 20:28:01 -08:00 |
|
Wentao Ye
|
77c09e1130
|
[Refactor] Remove align block size logic in moe_permute (#33449)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-02-06 10:57:06 -08:00 |
|
Gassan Salama
|
1363e3d6d5
|
[cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation (#32263)
Signed-off-by: Gassan <gassan.salama@arm.com>
|
2026-02-06 15:01:48 +08:00 |
|
R3hankhan
|
ac04dd374f
|
[CPU] Add BF16 Kernel type for s390x (#33788)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
|
2026-02-06 04:57:02 +00:00 |
|
Hashem Hashemi
|
d5c4800112
|
Adds padding and perf improvements to wvSplitK_fp8 (#33527)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-05 22:16:02 +00:00 |
|
R3hankhan
|
4dffc5e044
|
[CPU] Split attention dispatch by head_dim alignment (#32161)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
|
2026-02-03 19:37:15 -08:00 |
|
Radu Salavat
|
e69c990c21
|
[Feature][CPU Backend]: Optimize ARM vectorization backend (#30329)
Signed-off-by: Radu Salavat <radu.salavat@arm.com>
|
2026-02-02 20:17:56 -08:00 |
|
Lain
|
089cd4f002
|
fix cutlass_3x_gemm_fp8_blockwise on sm103a (#32224)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
|
2026-02-02 11:47:46 -08:00 |
|
Kebe
|
528e9b1490
|
[Feature][Core] Support Fabric detection to adapt the MNNVL protocol for the GB series (#33540)
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Thomas Vegas <tvegas@nvidia.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2026-02-02 22:55:46 +08:00 |
|
linhaifeng
|
fedf64332e
|
[Bugfix]: Fix display errors in TORCH_CHECK messages (#32942)
Signed-off-by: linhaifeng <1371675203@qq.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2026-01-31 09:48:48 -08:00 |
|
Li, Jiang
|
8311f083bd
|
[Bugfix][CPU] Fix thread num for shared memory communication (#33317)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-01-29 03:26:58 -08:00 |
|
Didier Durand
|
31b25f6516
|
[Doc]: fixing multiple typos in diverse files (#33256)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Signed-off-by: Didier Durand <2927957+didier-durand@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-01-29 16:52:03 +08:00 |
|