Harry Mellor
|
6c04638214
|
Fix per file ruff ignores related to line length (#26262)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-10-06 05:12:40 +00:00 |
|
Harry Mellor
|
4e256cadc2
|
Remove all references to yapf as it's no longer used (#26251)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-10-05 09:18:11 -07:00 |
|
Harry Mellor
|
d6953beb91
|
Convert formatting to use ruff instead of yapf + isort (#26247)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-10-05 07:06:22 -07:00 |
|
Fadi Arafeh
|
9705fba7b7
|
[cpu][perf] Accelerate unquantized-linear for AArch64 through oneDNN/ACL and weight prepack (#25948)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2025-10-04 12:16:38 +08:00 |
|
Matthew Bonanni
|
300a59c4c3
|
Avoid division by zero in cache DS MLA kernel (#26174)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-10-03 17:35:17 +00:00 |
|
Jun Jiang
|
4f8f47e87e
|
Fix undefined symbol: cutlass_moe_mm_sm100 (#26098)
Signed-off-by: Jun Jiang <jasl9187@hotmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2025-10-03 15:48:32 +00:00 |
|
Jiangyun Zhu
|
eb0fa43868
|
[Perf] Optimize reshape_and_cache CUDA Kernel (#25955)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Co-authored-by: Liu-congo <1502632128@qq.com>
|
2025-10-03 01:33:46 -07:00 |
|
Matthew Bonanni
|
47b9339546
|
[DeepSeek] Improve performance of DS MLA cache kernel (#26132)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-10-02 20:35:47 -07:00 |
|
Michael Goin
|
5d5146eee3
|
[CI/Build] Conditionally register cutlass_fp4_group_mm to fix building on Hopper (#26138)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-10-02 20:32:38 -07:00 |
|
Hosang
|
aac622e0cd
|
[ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (#25908)
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com>
|
2025-10-01 21:39:49 +00:00 |
|
Lucas Wilkinson
|
1726e93ef1
|
[BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
|
2025-10-01 12:30:00 -07:00 |
|
Johnny
|
5234dc7451
|
[NVIDIA] Blackwell Family (#24673)
Signed-off-by: Johnny <johnnynuca14@gmail.com>
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: Johnny <johnnync13@gmail.com>
Signed-off-by: Salvatore Cena <cena@cenas.it>
Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Co-authored-by: Salvatore Cena <cena@cenas.it>
|
2025-10-01 10:50:54 -07:00 |
|
Zhewen Li
|
e952eee698
|
[Bugfix] Fix __syncwarp on ROCM (#25996)
|
2025-09-30 21:15:11 -07:00 |
|
Param
|
99028fda44
|
Fix INT8 quantization error on Blackwell GPUs (SM100+) (#25935)
Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
|
2025-09-30 19:19:53 -07:00 |
|
Salvatore Cena
|
a73f6491c8
|
Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning (#25843)
Signed-off-by: Salvatore Cena <cena@cenas.it>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-09-30 19:18:19 -07:00 |
|
Yongye Zhu
|
fa7e254a7f
|
[New Model] DeepSeek-V3.2 (Rebased to Main) (#25896)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Lucia Fang <fanglu@meta.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com>
Co-authored-by: Lucia Fang <fanglu@meta.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Xiaozhu Meng <mxz297@gmail.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
|
2025-09-30 17:14:41 +08:00 |
|
Bram Wasti
|
dc48ba0c75
|
Kernel-override Determinism [1/n] (#25603)
Signed-off-by: Bram Wasti <bwasti@meta.com>
|
2025-09-26 16:59:09 -07:00 |
|
chenlang
|
1e9a77e037
|
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar (#22112)
Signed-off-by: chenlang <chen.lang5@zte.com.cn>
Co-authored-by: chenlang <10346245@zte.com.cn>
|
2025-09-25 20:46:11 +08:00 |
|
XuruiYang
|
845adb3ec6
|
[Model] Add LongCat-Flash (#23991)
Signed-off-by: yangxurui <yangxurui@meituan.com>
Co-authored-by: yangxurui <yangxurui@meituan.com>
|
2025-09-24 21:53:40 -07:00 |
|
Shiyan Deng
|
5c1e496a75
|
[MISC] replace c10::optional with std::optional (#25602)
Signed-off-by: Shiyan Deng <dsy842974287@meta.com>
|
2025-09-24 16:56:21 -07:00 |
|
Nikhil Gupta
|
359d293006
|
[fix]: add Arm 4bit fused moe support (#23809)
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
|
2025-09-24 01:32:22 +00:00 |
|
Wentao Ye
|
bf68fd76a9
|
[Compile] Fix AMD Compile Error (#25518)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-09-24 00:42:48 +00:00 |
|
Alexander Matveev
|
1210e4d95b
|
[Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1 (#25509)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
|
2025-09-23 16:57:55 -07:00 |
|
Hashem Hashemi
|
a3a7828010
|
[ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 (#24988)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com>
|
2025-09-23 14:31:45 -04:00 |
|
Ming Yang
|
527821d191
|
Use macro guard CUDA functions for back compatibility in grouped_topk_kernel.cu (#25346)
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Co-authored-by: Rahul Tuli <rtuli@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
|
2025-09-23 09:45:39 -07:00 |
|
rivos-shreeasish
|
2357480b1a
|
[BugFix] Fix UB in per_token_group_quant.cu (#24913)
Signed-off-by: Shreeasish Kumar <shreeasish@rivosinc.com>
|
2025-09-23 09:14:22 -07:00 |
|
Wentao Ye
|
711e912946
|
[Compile] Fix Compile Warning for Ignoring MIN_BLOCK_PER_SM (#25193)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-09-19 16:23:19 -06:00 |
|
Elvir Crnčević
|
abdfcd4f3d
|
silu-v1: Fix EPS not being used during max-reduction (#25069)
Signed-off-by: elvircrn <elvircrn@gmail.com>
|
2025-09-18 10:25:12 +00:00 |
|
ihb2032
|
4f02b77de4
|
Fix: Add explicit #include <omp.h> for OpenMP compatibility on certain toolchains (#24951)
Signed-off-by: lyd1992 <liuyudong@iscas.ac.cn>
Signed-off-by: ihb2032 <1355790728@qq.com>
|
2025-09-18 17:43:23 +08:00 |
|
Lumina
|
81b16a2bc9
|
[Kernel] Better inf handling for grouped topk cu (#24886)
Signed-off-by: lumina37 <starry.qvq@gmail.com>
|
2025-09-18 05:53:55 +00:00 |
|
elvischenv
|
e6585ddb45
|
[Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel (#24833)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-09-17 16:37:23 -07:00 |
|
Alexander Matveev
|
fedb75fa27
|
[Bugfix][B200] Fix cutlass_mla hang (#24966)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2025-09-17 18:06:38 -04:00 |
|
czhu-cohere
|
3c068c637b
|
[Kernel] Faster pre-processing time for W4A8 (#23972)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
|
2025-09-17 14:35:32 -07:00 |
|
Matthew Bonanni
|
8f3616f422
|
Remove old cutlass mla (#23961)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-09-17 14:31:43 +00:00 |
|
Aidyn-A
|
bfe9380161
|
Apply fixes for CUDA 13 (#24599)
Signed-off-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
|
2025-09-17 09:15:42 -04:00 |
|
Li, Jiang
|
9fccd04e30
|
[Bugfix] Fix Stream usage in CPU model runner and OneDNN kernel check (#25046)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-09-17 05:54:02 -07:00 |
|
Wentao Ye
|
e757a629e7
|
[Bug] Fix Cutlass Scaled MM Compilation Error (#24887)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-09-15 17:21:17 -04:00 |
|
Kyle Sayers
|
a0b26701c9
|
[Transform] Deterministic Hadacore Transforms (#24106)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
|
2025-09-15 12:59:31 -06:00 |
|
xiao-llm
|
01413e0cf5
|
Fp8 paged attention update (#22222)
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Signed-off-by: xiao-llm <xiao.yu.dc@outlook.com>
Co-authored-by: Xiao Yu <xiao.yu@metamaterial.com>
Co-authored-by: Xiao Yu <xiao.yu@amd.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
|
2025-09-15 10:43:26 -04:00 |
|
Michael Goin
|
59d7ffc17f
|
[CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe (#24750)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-09-13 07:29:19 +00:00 |
|
Elvir Crnčević
|
98229db244
|
[Kernels][DP/EP] Optimize Silu Kernel for R1 (#24054)
Signed-off-by: elvircrn <elvircrn@gmail.com>
|
2025-09-13 00:17:27 -07:00 |
|
elvischenv
|
dbeee3844c
|
[Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization (#24757)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
|
2025-09-13 00:16:24 -07:00 |
|
Woosuk Kwon
|
5febdc8750
|
[Chore] Remove unused batched RoPE op & kernel (#24789)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-09-13 00:08:20 -07:00 |
|
Didier Durand
|
bcb06d7baf
|
[Doc]: fix typos in various files (#24726)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
|
2025-09-12 06:43:12 -07:00 |
|
Michael Goin
|
c3aea10dc8
|
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel (#23280)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-09-11 15:43:14 -07:00 |
|
Ilya Markov
|
1fdd5c42d7
|
[Kernels] Enable Torch Symmetric Memory All-Reduce By Default (#24111)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2025-09-11 09:45:31 -07:00 |
|
TaehyunKim
|
9bd831f501
|
[Model] New model support for Motif-1-Tiny (#23414)
Signed-off-by: ca1207 <ca1207zzz@gmail.com>
Signed-off-by: TaehyunKim <73943231+ca1207@users.noreply.github.com>
Co-authored-by: WyldeCat <skan1543@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-09-10 23:29:40 -07:00 |
|
Ming Yang
|
86173ad593
|
[Kernel] Support decode context parallelism on Blackwell with CUTLASS MLA (#24385)
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2025-09-08 09:27:12 +08:00 |
|
mohankku
|
0eadaeff7e
|
[Bugfix] Avoid uninitialized usage of azp_val when AZP is false. (#24335)
Signed-off-by: Mohan Kumar Kumar <mohan.cbein@gmail.com>
Signed-off-by: mohankku <mohan.cbein@gmail.com>
|
2025-09-06 08:17:03 -07:00 |
|
yzds
|
ac201a0eaf
|
[Feature] Support Decode Context Parallel (DCP) for MLA (#23734)
Signed-off-by: hongchao <hongchao@msh.team>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: hongchao <hongchao@msh.team>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2025-09-06 13:24:05 +08:00 |
|