-
5e389b5ed9
fix: remove duplicate desc declaration
biondizzle
2026-05-28 05:10:43 +00:00
-
7ac2499266
fix: defer UMMA descriptor — use placeholder for now
biondizzle
2026-05-28 05:10:15 +00:00
-
db17d8db9a
fix: cvta.to.shared PTX for SMEM address
biondizzle
2026-05-28 05:09:50 +00:00
-
e12a81ae36
fix: include cstdint
biondizzle
2026-05-28 05:09:28 +00:00
-
0c73a024ba
fix: guard CUTLASS includes with __CUDA_ARCH__ for host compilation
biondizzle
2026-05-28 05:09:07 +00:00
-
41e59a2423
FMHA SM100: Add SMEM descriptor construction for tcgen05.mma
biondizzle
2026-05-28 05:08:25 +00:00
-
3eb432d064
fix: CUTLASS path /root/cutlass
biondizzle
2026-05-28 05:06:48 +00:00
-
66d9f5c60f
fix: --x cu for .cuh compilation
biondizzle
2026-05-28 05:06:13 +00:00
-
4dcd80ea0d
fix: use full nvcc path
biondizzle
2026-05-28 05:05:55 +00:00
-
fac7275f2b
test: nvcc compilation test for FMHA SM100 kernel
biondizzle
2026-05-28 05:05:31 +00:00
-
230c350c77
FMHA SM100: Raw CUDA C++ decode kernel — initial skeleton
biondizzle
2026-05-28 05:04:44 +00:00
-
b2d0417a46
NVFP4-1.1: Mark fp4_quant.py as toolchain-blocked, clean up test files
biondizzle
2026-05-28 04:59:01 +00:00
-
650bcdcccf
test: f32 vs i32 GMEM store
biondizzle
2026-05-28 04:57:45 +00:00
-
cc37ce6dbf
test: absolute minimum CuTeDSL int store + float cmp
biondizzle
2026-05-28 04:56:16 +00:00
-
c4fdfc7789
test: isolate which fp4_quant function causes LLVM ERROR
biondizzle
2026-05-28 04:55:23 +00:00
-
b3eb46d4ec
NVFP4-1.1: Restore threshold RNE approach — inline PTX blocked by toolchain
biondizzle
2026-05-28 04:54:27 +00:00
-
71ee1485ea
test: constraints runner
biondizzle
2026-05-28 04:50:14 +00:00
-
c55c237fcd
test: different constraint strings + bitcast approach
biondizzle
2026-05-28 04:50:09 +00:00
-
4806e9ba11
test: llvm.inline_asm with Int32._mlir_type matching cvt_i8_bf16 pattern
biondizzle
2026-05-28 04:49:02 +00:00
-
ade49d964d
fix: test_ptx_runner path
biondizzle
2026-05-28 04:47:04 +00:00
-
dc9596c6bc
test: sub-process isolation for each f32→i32 approach
biondizzle
2026-05-28 04:46:45 +00:00
-
136a89f4e3
test: compare nvvm.inline_ptx approaches + arith.fptosi
biondizzle
2026-05-28 04:46:06 +00:00
-
eebf33b97d
test: clean minimal nvvm.inline_ptx test
biondizzle
2026-05-28 04:45:21 +00:00
-
882d48588b
test: debug nvvm.inline_ptx with CUTLASS_LOG_LEVEL=DEBUG
biondizzle
2026-05-28 04:44:35 +00:00
-
3ffb3b807a
test: minimal nvvm.inline_ptx isolation test
biondizzle
2026-05-28 04:43:18 +00:00
-
e33c48e44c
NVFP4-1.1: Use nvvm.inline_ptx instead of llvm.inline_asm for f32→i32
biondizzle
2026-05-28 04:42:33 +00:00
-
74dba6ab9d
auto: pre-test commit
biondizzle
2026-05-28 04:40:20 +00:00
-
1cbb3cf752
NVFP4-1.1: Replace threshold rounding with inline PTX cvt.rni/rz/rmi
biondizzle
2026-05-28 04:40:17 +00:00
-
2777ebfe8e
NVFP4-1.1: ultra-minimal test — Float32 comparison + Int32 select
biondizzle
2026-05-28 04:35:06 +00:00
-
2087eaef49
NVFP4-1.1: minimal threshold rounding test
biondizzle
2026-05-28 04:33:38 +00:00
-
1828a71cde
NVFP4-1.1: test kernel uses Float32 input (avoids BF16 scalar load issue)
biondizzle
2026-05-28 04:32:08 +00:00
-
d2aa93aad7
NVFP4-1.1: fix Int32 clamping — use comparisons instead of fmin/fmax (float-only ops)
biondizzle
2026-05-28 04:30:06 +00:00
-
accc66741d
NVFP4-1.1: update test kernel with threshold rounding API
biondizzle
2026-05-28 04:27:29 +00:00
-
dabcc415a8
NVFP4-1.1: threshold rounding for float-to-int — avoids CuTeDSL limitation
biondizzle
2026-05-28 04:26:40 +00:00
-
acf46c494c
NVFP4-1.1: update approach doc and fp4_quant with CuTeDSL API fixes
biondizzle
2026-05-28 04:09:58 +00:00
-
f3a2b37d70
NVFP4-1.1: document CuTeDSL float-to-int limitation, revise approach to compact SwiGLU output
biondizzle
2026-05-28 04:06:27 +00:00
-
c3d5a7b82f
NVFP4-1.1: try .to(Int32) for float-to-int conversion
biondizzle
2026-05-28 04:02:45 +00:00
-
dc35d29811
NVFP4-1.1: fix cute.arch.store signature - store(ptr, val) not store(ptr, val, dtype)
biondizzle
2026-05-28 04:01:38 +00:00
-
a05a76bb6b
NVFP4-1.1: add Int32 cast diagnostic test
biondizzle
2026-05-28 03:59:01 +00:00
-
e565ebce91
NVFP4-1.1: replace cute.math.fmin with cute.arch.fmin (correct API)
biondizzle
2026-05-28 03:55:54 +00:00
-
20d5ddfa3d
NVFP4-1.1: fix indentation for @cute.jit decorators
biondizzle
2026-05-28 03:52:46 +00:00
-
f6f59d34cb
NVFP4-1.1: add @cute.jit decorator to fp4_quant functions for CuTeDSL if-block support
biondizzle
2026-05-28 03:50:11 +00:00
-
0ecb98daee
auto: pre-test commit
biondizzle
2026-05-28 03:49:03 +00:00
-
6f94925491
NVFP4-1.1: fix cute.math.fmax -> cute.arch.fmax (correct CuTeDSL API)
biondizzle
2026-05-28 03:48:51 +00:00
-
60790564f0
NVFP4-1.1: fix test - two-pass kernel, cute.arch.store confirmed on B200
biondizzle
2026-05-28 03:46:45 +00:00
-
ca9f920414
auto: pre-test commit
biondizzle
2026-05-28 03:42:39 +00:00
-
a41de129cb
NVFP4-1.1: fix test kernel - use cute.copy instead of cute.arch.store
biondizzle
2026-05-28 03:42:24 +00:00
-
3a78bdf570
NVFP4-1.1: add CuTeDSL kernel test for FP4 quantization
biondizzle
2026-05-28 03:40:54 +00:00
-
80b6b79f9e
NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels
biondizzle
2026-05-28 03:39:55 +00:00
-
b9f15c250f
Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API
biondizzle
2026-05-27 15:15:03 +00:00
-
2412a5431b
MQA/GQA: batch Q heads into kernel batch dim, shared K/V per KV group
biondizzle
2026-05-27 08:31:23 +00:00
-
06a895ff99
Clean test suite for production attention (1/2/4 segments, multi-head)
biondizzle
2026-05-27 07:12:02 +00:00
-
778d9d4f4f
Compile with row_sums tensor so kernel writes per-row row_sums
biondizzle
2026-05-27 07:10:00 +00:00
-
0736a04d9b
Fix KV merge: use NORMALIZED O (O_unnorm/row_sum) with LSE
biondizzle
2026-05-27 07:07:51 +00:00
-
06e7f7ab48
Debug: print LSE values for 2-segment merge
biondizzle
2026-05-27 07:04:39 +00:00
-
8f8d14c300
Match tensor slicing exactly to test_d1_kv_merge (2D slices, 3D unsqueeze)
biondizzle
2026-05-27 06:58:28 +00:00
-
6ee61717c0
Match tensor shapes from working test_d1_kv_merge
biondizzle
2026-05-27 06:56:04 +00:00
-
3a25c7feff
Test multi-KV merge (2 segments) separately from multi-head
biondizzle
2026-05-27 06:54:16 +00:00
-
36a6f07a7e
Fix: unsqueeze k/v when dim==2
biondizzle
2026-05-27 06:52:43 +00:00
-
fc4172937c
Clean production wrapper: always normalize=False + KV merge
biondizzle
2026-05-27 06:51:14 +00:00
-
8f87109f86
Single-segment: use normalize=False + per-row normalization from row_sums
biondizzle
2026-05-27 06:48:56 +00:00
-
fe55bf23a0
Split single-segment (normalized) and multi-segment (KV merge) paths
biondizzle
2026-05-27 06:46:30 +00:00
-
e45b94c01b
Test: compare both normalized and un-normalized reference
biondizzle
2026-05-27 06:44:37 +00:00
-
b70ab2a6ee
Return o_accum directly (un-normalized merge result)
biondizzle
2026-05-27 06:42:58 +00:00
-
6111db571c
Match working test: don't pass row_sums to kernel
biondizzle
2026-05-27 06:41:44 +00:00
-
312ac52d15
Normalize O_accum by exp(lse) before returning
biondizzle
2026-05-27 06:39:36 +00:00
-
ddc701af9b
Use exact merge formula from working test_d1_kv_merge.py
biondizzle
2026-05-27 06:38:04 +00:00
-
8321ccf9c1
Fix production KV merge: use normalized O for log-sum-exp merge
biondizzle
2026-05-27 06:36:24 +00:00
-
98c93c1cd8
Stage E: production attention wrapper + Python KV merge, clean fmha_smem_acc
biondizzle
2026-05-27 06:34:10 +00:00
-
51e456df44
Slice MMA tile coords from tOgO for TMA copy
biondizzle
2026-05-27 05:39:42 +00:00
-
1caa737b09
Move sC_flat_staged creation before const_expr guard
biondizzle
2026-05-27 05:38:39 +00:00
-
3c9dbc0c5d
Staged sC_flat with (128, pv_n_tile//2, 2) to match TMA atom
biondizzle
2026-05-27 05:37:05 +00:00
-
de2028b106
Split sC_flat into staged layout to match TMA atom decomposition
biondizzle
2026-05-27 05:35:56 +00:00
-
a0e9f7534b
Use tCgC_epi (transformed) for GMEM side of TMA partition
biondizzle
2026-05-27 05:34:40 +00:00
-
b02e103ac0
Add c_simple GMEM tensor (non-dynamic) for SMEM accumulator TMA store
biondizzle
2026-05-27 05:33:30 +00:00
-
2438826eee
Use tma_partition with group_modes on both sC_flat and gO
biondizzle
2026-05-27 05:31:47 +00:00
-
603f52de78
Fix gO creation: use slice_(pv_mma_tiler) like fmha.py
biondizzle
2026-05-27 05:30:50 +00:00
-
b39d7f1a14
Try cute.copy(tma_c, sC_flat, gO) directly
biondizzle
2026-05-27 05:29:51 +00:00
-
2af767a90c
Try full tensor TMA copy without slicing
biondizzle
2026-05-27 05:28:43 +00:00
-
7d14a2f764
sC_flat with simple (128, pv_n_tile) layout for full epi_tile coverage
biondizzle
2026-05-27 05:27:51 +00:00
-
6fb0e6a417
Use sC_flat (non-swizzled epi_s layout) for TMA store from SMEM accumulator
biondizzle
2026-05-27 05:26:50 +00:00
-
4a2a06f9e1
Fix gO slice: use separate Int32(0) instead of tuple
biondizzle
2026-05-27 05:25:33 +00:00
-
bf36979a8d
Use CUTLASS FMHA reference pattern for sC->GMEM TMA store (flat_divide + tma_partition)
biondizzle
2026-05-27 05:24:39 +00:00
-
97bc6d8d2f
Add c_direct GMEM tensor for direct writes in SMEM accumulator path
biondizzle
2026-05-27 05:15:47 +00:00
-
3d349b497b
SME accumulator: direct GMEM write from sO_acc (bypass TMA for multi-kt)
biondizzle
2026-05-27 05:14:31 +00:00
-
7d1e0a605d
Different coordinate dims for bSG_sC (2D) and bSG_gC (3D)
biondizzle
2026-05-27 05:13:38 +00:00
-
75b272c5f2
2D coordinate for bSG_sC TMA copy
biondizzle
2026-05-27 05:12:58 +00:00
-
72dff90165
3D coordinate for bSG_sC/gC TMA copy
biondizzle
2026-05-27 05:12:11 +00:00
-
b8b6e8cc0b
Slice bSG_gC MMA tile coords for TMA copy
biondizzle
2026-05-27 05:11:26 +00:00
-
754740d5e5
Try bSG_sC[(None, 0)] for TMA copy coordinate
biondizzle
2026-05-27 05:10:40 +00:00
-
23a2b49daf
Add SMEM accumulator for n_kv_tiles>1: O load from TMEM, accumulate in sO_acc, TMA store from sC
biondizzle
2026-05-27 05:09:54 +00:00
-
a858ed1c14
Fix test: normalize=False for un-normalized O comparison
biondizzle
2026-05-27 05:06:52 +00:00
-
2e262d2b99
Reset fmha_smem_acc.py to working fmha.py base
biondizzle
2026-05-27 05:05:41 +00:00
-
b43ffe9dac
Guard sO_acc allocation/zero-init with n_kv_tiles>1
biondizzle
2026-05-27 05:05:01 +00:00
-
101840c78c
Guard SMEM accumulation with n_kv_tiles>1 to avoid TMEM destructive read
biondizzle
2026-05-27 05:02:51 +00:00
-
02a34512cb
Use epilogue_tma_store for n_kv_tiles=1; TODO for multi-tile
biondizzle
2026-05-27 05:01:39 +00:00
-
4652cab8b4
Fix: 3D coords for TMA copy (bSG_sC has 3 modes)
biondizzle
2026-05-27 05:00:39 +00:00
-
b0ebf41ee3
Slice bSG_gC with mma_tile_coord (like epilogue_tma_store)
biondizzle
2026-05-27 05:00:04 +00:00
-
eb0bf0cce0
Fix TMA store: use bSG_sC[(None,0)] indexing pattern from epilogue_tma_store
biondizzle
2026-05-27 04:59:29 +00:00
-
7ea77a121f
Use cpasync.tma_partition for SMEM->GMEM TMA store (like epilogue_tma_store)
biondizzle
2026-05-27 04:58:47 +00:00