Anton Ivanov
|
abebd9323d
|
[CPU] Replace OMP initialization (#36487)
Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com>
|
2026-04-03 18:42:43 +08:00 |
|
Itay Etelis
|
4a06e1246e
|
[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync (#38460)
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
|
2026-04-03 03:13:23 +00:00 |
|
Carl Y
|
3bc2734dd0
|
[Kernel] Fuse FP8 output quantization into merge_attn_states (#36518)
Signed-off-by: Carl You <4531192+carlyou@users.noreply.github.com>
|
2026-04-03 01:47:04 +00:00 |
|
Li, Jiang
|
c6f722b93e
|
[CPU] Support gelu act in cpu_fused_moe (#38770)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2026-04-02 14:14:32 +08:00 |
|
Xin Yang
|
9bd7231106
|
Revert "[Kernel] Add gpt-oss Router GEMM kernel (#37205)" (#38778)
Signed-off-by: Xin Yang <xyangx@amazon.com>
|
2026-04-01 22:02:32 -07:00 |
|
Gregory Shtrasberg
|
3aab680e3e
|
[ROCm][Bugfix] Fix ROCm runtime failure due to missing symbol (#38750)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com>
|
2026-04-01 21:30:11 -07:00 |
|
Monishver
|
c09ad767cd
|
Feature/silu block quant fusion v1 (#32996)
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
|
2026-04-01 18:50:43 +00:00 |
|
Wentao Ye
|
c9a9db0e02
|
[Compile] Fix nvfp4 compile warning (#38573)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-04-01 18:28:57 +00:00 |
|
Michael Goin
|
db5d0719e1
|
[Kernel] Add MXFP8 to Marlin GEMM/MoE and refactor Mxfp8LinearOp (#34664)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-04-01 09:41:42 -07:00 |
|
Li, Jiang
|
36d7f19897
|
[CPU] Support head_size 512 in cpu_attn (#38676)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2026-04-01 05:42:27 +00:00 |
|
Olya Kozlova
|
598190aac3
|
[fix] Remove trtllm ragged mla prefills (#36540)
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>
|
2026-03-31 12:30:27 -07:00 |
|
mikaylagawarecki
|
7c080dd3c5
|
[4/n] Migrate FP4/W4A8 CUTLASS kernels to torch stable ABI (#37503)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
|
2026-03-31 10:21:13 -07:00 |
|
Yintong Lu
|
f09daea261
|
[CPU] Support int8 compute mode in CPU AWQ (#35697)
Signed-off-by: Yintong Lu <yintong.lu@intel.com>
|
2026-03-31 15:27:37 +08:00 |
|
SandishKumarHN
|
bcc6f67447
|
[Bugfix] Use null block (0) for padded block table entries (#35431)
Signed-off-by: SandishKumarHN <sandish@fb.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-30 14:02:51 -07:00 |
|
mikaylagawarecki
|
ab1a6a43fa
|
[3/n] Migrate cutlass/scaled_mm_entry.cu torch stable ABI (#37221)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
|
2026-03-30 11:20:13 -07:00 |
|
Johnny
|
b4a2f3ac36
|
[NVIDIA] Bugfix NVFP4 DGX Spark and RTX50 (#38423)
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: Johnny <johnnynuca14@gmail.com>
|
2026-03-30 09:36:18 -07:00 |
|
IriKa
|
148a5c1226
|
[Bugfix]fix output Nan/Inf in marlin if dtype=float16 (#33972)
Signed-off-by: IriKa Qiu <qiujie.jq@gmail.com>
|
2026-03-27 16:36:08 -07:00 |
|
mikaylagawarecki
|
bf4cc9ed2d
|
[2/n] Migrate per_token_group_quant to torch stable ABI (#36058)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
|
2026-03-25 10:15:13 -07:00 |
|
Necofish
|
e7221180e1
|
[Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM (#37970)
Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-03-25 08:20:04 -07:00 |
|
Li, Jiang
|
352b90c4a4
|
[Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU (#37987)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2026-03-24 07:00:20 -07:00 |
|
Kyle Sayers
|
38364a7e32
|
[Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels (#36799)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
|
2026-03-23 16:03:29 -04:00 |
|
Zhaodong Bing
|
10a1018c12
|
[ROCm] fix sleep mode not releasing GPU memory problem on ROCm (#37533)
Signed-off-by: bingzhaodong <aaab8b@gmail.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
|
2026-03-23 06:07:19 -07:00 |
|
L.B.R.
|
1779c09898
|
[ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode (#34709)
Signed-off-by: L.B.R. <lbr@mmonad.com>
Co-authored-by: L.B.R. <lbr@mmonad.com>
|
2026-03-20 10:11:23 -05:00 |
|
mikaylagawarecki
|
8b10e4fb31
|
[1/n] Migrate permute_cols to libtorch stable ABI (#31509)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
|
2026-03-19 11:27:26 -04:00 |
|
yassha
|
199f914183
|
fix(cpu): add null check for aligned_alloc in ScratchPadManager (#37369)
Signed-off-by: yassha <50112520+yassha@users.noreply.github.com>
|
2026-03-19 17:45:06 +08:00 |
|
Xin Yang
|
b1169d7be8
|
[Kernel] Add gpt-oss Router GEMM kernel (#37205)
Signed-off-by: Xin Yang <xyangx@amazon.com>
|
2026-03-18 08:15:56 -07:00 |
|
Michael Goin
|
09e4576f65
|
[Kernel] Add non-gated support for NVFP4 CUTLASS MoE (#37320)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-03-17 18:12:04 -04:00 |
|
Terry Gao
|
3e6a1e1686
|
[Custom Ops] Add functional + out variant for scaled_fp4_quant (#34389)
Signed-off-by: tianrengao <terrygao87@gmail.com>
|
2026-03-16 18:51:46 -04:00 |
|
Krish Gupta
|
c0f011918d
|
[Bugfix] opcheck false mutation error in rms_norm_per_block_quant (#36688) (#36779)
Signed-off-by: Krish Gupta <krishom70@gmail.com>
|
2026-03-16 21:11:33 +00:00 |
|
Matthew Bonanni
|
c88ea8338b
|
[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible (#36982)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-16 13:51:21 -04:00 |
|
Wentao Ye
|
ce8cf9161d
|
[Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh (#36693)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-16 11:12:15 -04:00 |
|
xjx
|
18be11fd59
|
[BUGFIX]fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 (#35594)
Signed-off-by: xjx <493337577@qq.com>
|
2026-03-16 15:10:42 +00:00 |
|
Wentao Ye
|
e855d380fa
|
[Compile] Fix compile warning in moe_permute (#36529)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-16 10:16:14 -04:00 |
|
Luka Govedič
|
9556af87d5
|
[torch.compile] Add support for non-contiguous fused RMSNorm + group quant (#36551)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com>
|
2026-03-11 10:56:55 -07:00 |
|
Julien Denize
|
a5d06dc557
|
Add 320 dimension size support to MLA (#36161)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
|
2026-03-11 10:21:22 -07:00 |
|
typer-J
|
4184653775
|
feat: add RISC-V support for CPU backend (v2) (#36578)
Signed-off-by: typer-J <2236066784@qq.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2026-03-10 21:51:39 -07:00 |
|
Hashem Hashemi
|
721ae79f50
|
Improvements to wvSplitKrc skinny GEMM solution (#34304)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-03-10 09:14:27 -07:00 |
|
Roberto L. Castro
|
580864d81e
|
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
|
2026-03-09 09:50:36 -07:00 |
|
Roberto L. Castro
|
2b28b9b269
|
[Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 (#35290)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
|
2026-03-09 09:46:57 -07:00 |
|
Jiayi Yan
|
6a895197fa
|
[Bugfix][CI] fix typos (#34934)
Signed-off-by: 1195343015 <1195343015@qq.com>
Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-03-05 17:05:46 +00:00 |
|
Tianmu Li
|
8e7820131e
|
[Perf] Use dummy M for weight prepacking on x86 (#35890)
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
|
2026-03-05 04:56:49 +00:00 |
|
EdalatiAli
|
cb21972a97
|
[Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448)
Signed-off-by: EdalatiAli <aliedalati@cohere.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-03-01 23:31:19 -08:00 |
|
Asaf Gardin
|
bbf81f9a92
|
[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching (#34798)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
|
2026-03-01 20:40:23 +08:00 |
|
Hashem Hashemi
|
7600642eae
|
Add padding support to wvSplitK solution for skinny GEMMs (#33762)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-02-28 09:02:05 +00:00 |
|
Ma Jian
|
90805ff464
|
[CI/Build] CPU release supports both of AVX2 and AVX512 (#35466)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: jiang1.li <jiang1.li@intel.com>
|
2026-02-28 04:35:21 +00:00 |
|
Roberto L. Castro
|
a201ad72d8
|
[Refactor][Kernel] Add global helper to deduplicate vectorized memory ops (#35105)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
|
2026-02-27 16:28:17 -08:00 |
|
roikoren755
|
38c498b8e3
|
[Performance] Cublas Bf16 Gate with Fp32 Output (#35121)
Signed-off-by: Roi Koren <roik@nvidia.com>
|
2026-02-26 16:51:28 -08:00 |
|
Tyler Michael Smith
|
eb19955c37
|
[WideEP] Remove pplx all2all backend (#33724)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-02-26 14:30:10 -08:00 |
|
Asaf Gardin
|
ec13e549d3
|
[Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic (#35275)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
|
2026-02-26 12:22:06 +00:00 |
|
Roberto L. Castro
|
86c3b5a808
|
[BugFix] Fix fp4 quant kernel on CUDA 12.8 (#35210)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
|
2026-02-25 18:32:50 -08:00 |
|