Radu Salavat
|
e69c990c21
|
[Feature][CPU Backend]: Optimize ARM vectorization backend (#30329)
Signed-off-by: Radu Salavat <radu.salavat@arm.com>
|
2026-02-02 20:17:56 -08:00 |
|
Lain
|
089cd4f002
|
fix cutlass_3x_gemm_fp8_blockwise on sm103a (#32224)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
|
2026-02-02 11:47:46 -08:00 |
|
Kebe
|
528e9b1490
|
[Feature][Core] Support Fabric detection to adapt the MNNVL protocol for the GB series (#33540)
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Thomas Vegas <tvegas@nvidia.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2026-02-02 22:55:46 +08:00 |
|
linhaifeng
|
fedf64332e
|
[Bugfix]: Fix display errors in TORCH_CHECK messages (#32942)
Signed-off-by: linhaifeng <1371675203@qq.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2026-01-31 09:48:48 -08:00 |
|
Li, Jiang
|
8311f083bd
|
[Bugfix][CPU] Fix thread num for shared memory communication (#33317)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-01-29 03:26:58 -08:00 |
|
Didier Durand
|
31b25f6516
|
[Doc]: fixing multiple typos in diverse files (#33256)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Signed-off-by: Didier Durand <2927957+didier-durand@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-01-29 16:52:03 +08:00 |
|
Wentao Ye
|
c4e744dbd4
|
[Perf] Optimize moe_permute for CUTLASS FP8 (#32892)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-01-28 10:15:24 -08:00 |
|
Robert Shaw
|
af9b69f977
|
[Quantization][Deprecation] Remove Marlin 24 (#32688)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-01-28 15:54:59 +00:00 |
|
Harry Mellor
|
2eb673a088
|
Add flake8-implicit-str-concat rules to Ruff (#33191)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-01-28 04:56:10 +00:00 |
|
rasmith
|
58996f3589
|
[AMD][Kernel][BugFix] Use correct scale in concat_and_cache_ds_mla_kernel when on gfx942 (#32976)
Signed-off-by: Randall Smith <ransmith@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
|
2026-01-27 07:16:43 +00:00 |
|
dolpm
|
58a05b0ca1
|
[fix] CPUDNNLGEMMHandler pointer baked into inductor artifact (#32913)
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com>
|
2026-01-26 16:59:44 -05:00 |
|
Roberto L. Castro
|
fcb9df99bd
|
[Perf][Kernel] Optimize FP4 quantization kernels (SM100F) (#32520)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
|
2026-01-24 18:45:27 -07:00 |
|
Michael Goin
|
4561f13985
|
[Refactor] Rename gptq_marlin to marlin to match MoE (#32952)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-01-23 16:48:12 -05:00 |
|
rasmith
|
6cc6d92be5
|
[CI][AMD][BugFix] Update wvSplitK (and other skinny_gemm wrappers) to ensure tensors passed will be made contiguous for the kernel (#32831)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
|
2026-01-23 13:35:48 -08:00 |
|
Li, Jiang
|
5da4c7d789
|
[CI/Build][CPU] Fix failed pooling tests and macos smoke test (#32907)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-01-23 10:48:20 +00:00 |
|
Eldar Kurtić
|
44f08af3a7
|
Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) (#30141)
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>
|
2026-01-22 13:29:57 -07:00 |
|
Fadi Arafeh
|
744ef30484
|
[CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inference across NUMA domains on Arm (#32792)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
|
2026-01-22 18:55:23 +00:00 |
|
Or Ozeri
|
421012b63a
|
OffloadingConnector: Support kernel_block_size != block_size (#30692)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
|
2026-01-22 12:30:04 +00:00 |
|
Xin Yang
|
63227accf5
|
[Kernel] Add topk_sigmoid kernel (#31246)
Signed-off-by: Xin Yang <xyangx@amazon.com>
|
2026-01-21 22:49:51 +00:00 |
|
Robert Shaw
|
85f55c943c
|
[Quantization][Deprecation] Deprecate HQQ (#32681)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
|
2026-01-21 09:32:40 -05:00 |
|
Wentao Ye
|
6c97b9b9b6
|
[Perf] Only clone when needed for moe_permute (#32273)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-01-20 11:34:39 -05:00 |
|
Wentao Ye
|
eebc58df0c
|
[Refactor] Remove unused cutlass moe problem size function (#32047)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-01-18 12:46:59 -08:00 |
|
Hashem Hashemi
|
7a1030431a
|
Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. (#29843)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
|
2026-01-16 11:45:04 -06:00 |
|
Michael Goin
|
83239ff19a
|
Add thread_n=64 support to Marlin MoE (#32360)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-01-15 16:45:44 -08:00 |
|
Wentao Ye
|
f28125d87b
|
[Perf] Optimize grouped topk kernel, 1.2%~2% E2E Throughput improvement (#32058)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-01-13 10:58:18 -08:00 |
|
Kevin McKay
|
c60578de0a
|
[Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process (#31295)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
|
2026-01-10 03:57:38 +00:00 |
|
PatrykSaffer
|
80fead8bf6
|
Fuse RoPE and MLA KV-cache write (#25774)
Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com>
Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai>
Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-01-09 19:18:37 -08:00 |
|
Lucas Wilkinson
|
0a0aa07747
|
[Quant] Make static quant support all group shapes (#30833)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2026-01-09 12:49:27 -08:00 |
|
Wentao Ye
|
308feab33f
|
[Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement (#31830)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2026-01-09 11:13:43 -08:00 |
|
Michael Goin
|
34cd32fe30
|
[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe (#31832)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
|
2026-01-09 07:40:33 -07:00 |
|
R3hankhan
|
8e27663b6a
|
[CPU] Add head sizes 80 and 112 with vec16 fallback (#31968)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
|
2026-01-09 22:14:46 +08:00 |
|
Xin Yang
|
0ada960a20
|
[Kernel] Support bias type in grouped_topk kernel (#31781)
Signed-off-by: Xin Yang <xyangx@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-01-07 12:16:32 -08:00 |
|
Michael Goin
|
f347ac6c34
|
[Perf] Fuse stride preparation for NVFP4 cutlass_moe (#31837)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-01-07 13:31:26 -05:00 |
|
Jinzhen Lin
|
2f4bdee61e
|
[Quantization][MoE] remove unused ep logic from moe marlin (#31571)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-01-06 09:07:19 -08:00 |
|
Andreas Karatzas
|
3ecfdc3776
|
[ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition (#30719)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
2025-12-29 01:13:14 -08:00 |
|
skaraban3807
|
7cd288a4b3
|
[PERF] Add interleaved memory allocation to NUMA module (#30800)
|
2025-12-24 13:47:49 +00:00 |
|
Matthew Bonanni
|
369f47aa0f
|
[DeepSeek v3.2] Remove unnecessary syncwarps (#31047)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-12-23 21:33:30 -08:00 |
|
rongfu.leng
|
4ed11105d7
|
[Misc] Remove unused custom ops copy_blocks and copy_blocks_mla (#30967)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
|
2025-12-23 18:22:35 -08:00 |
|
danielafrimi
|
b94f80ffb8
|
[FIX] FP4 quantization kernel padding initialization bug (#31097)
Signed-off-by: <>
Co-authored-by: root <root@gpu-193.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-951.slurm-workers-slurm.slurm.svc.cluster.local>
|
2025-12-23 08:45:18 -08:00 |
|
TJian
|
022f3cea53
|
[ROCm] [Critical]: Remove unused variable (#31156)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
|
2025-12-22 08:28:22 -08:00 |
|
Jee Jee Li
|
097978a15d
|
[Kernel] Enable fused_qknorm_rope_kernel supports partial rope (#30821)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-12-21 18:39:22 -08:00 |
|
Michael Goin
|
06d490282f
|
[NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size (#30897)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-12-21 09:41:57 -08:00 |
|
Robert Shaw
|
83a317f650
|
[MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) (#30990)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
|
2025-12-19 13:09:54 -08:00 |
|
Nishidha Panpaliya
|
bd2b52fc2d
|
[CPU][Bugfix] Fix ppc64le CPU build (#30871)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
|
2025-12-19 12:26:35 +00:00 |
|
Li, Jiang
|
f90d3636e2
|
[Bugfix][CPU] Fix Mac CPU build (#30955)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-12-18 01:38:22 -08:00 |
|
Li, Jiang
|
e3ab93c896
|
[CPU] Refactor CPU fused MOE (#30531)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-12-18 14:36:49 +08:00 |
|
Sheng Lin
|
f4e884f222
|
[NIXL][Bugfix] Fix NIXL/RDMA registration failure over CuMemAllocator (#29569)
Signed-off-by: Somoku <linsh0@protonmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
|
2025-12-17 01:52:58 -08:00 |
|
Michael Goin
|
0a1ab1e565
|
[Perf][Kernels] Vectorize csrc/activations_kernels.cu (#29512)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-12-16 14:56:02 -08:00 |
|
Jinzhen Lin
|
ce96857fdd
|
[Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) (#29901)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2025-12-16 14:35:28 -08:00 |
|
Daniel Cámpora
|
eaa82a709a
|
[Bugfix][DSV32] Fix overflow in topk. (#30754)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-12-16 14:21:17 -08:00 |
|