Commit Graph

1408 Commits

Author SHA1 Message Date
4cb656e583 test: try idesc=0 (same as gau-nernst) 2026-05-28 09:40:19 +00:00
cfba8484da test: try idesc with N=128 (full extent) + 128 TMEM cols 2026-05-28 09:39:19 +00:00
30f0056b11 test: clean rewrite with SMEM Q/K verification and dot product check 2026-05-28 09:38:26 +00:00
7eb85a71fc test: add Q SMEM verification output + bf16_to_f32_host 2026-05-28 09:37:07 +00:00
8f23c2aaf6 test: verify SMEM Q layout by reading back canonical data 2026-05-28 09:35:58 +00:00
004046a6a8 test: read only 1 TMEM column after MMA 2026-05-28 09:35:02 +00:00
41128122e3 test: clean rewrite, 32 TMEM cols, MMA N=32, tmem_load loop 2026-05-28 09:33:45 +00:00
58be79957d test: 32 TMEM cols, add MMA call with N=32, read S from TMEM 2026-05-28 09:32:33 +00:00
22fb861447 test: 2 tmem_stores with syncwarp between 2026-05-28 09:30:37 +00:00
a87f20a4ae test: just 1 tmem_store, no fence, no loop 2026-05-28 09:29:46 +00:00
2b57f28968 test: zero 128 TMEM columns, skip fence 2026-05-28 09:29:14 +00:00
25c9b70591 test: zero 2 TMEM columns 2026-05-28 09:28:31 +00:00
01c4097ccc test: zero 32 TMEM columns 2026-05-28 09:27:59 +00:00
3694f63ba4 test: re-enable full TMEM zeroing (128 columns) 2026-05-28 09:27:25 +00:00
c3b6c3a5e6 test: minimal tmem_store debug (1 column + sentinels) 2026-05-28 09:26:52 +00:00
f1aaa50326 test: re-enable TMEM zeroing with tmem_base debug 2026-05-28 09:26:16 +00:00
a7f81331f8 test: skip TMEM zeroing again, alloc+dealloc only 2026-05-28 09:25:37 +00:00
3f5dcd481e test: zero only 32 TMEM columns 2026-05-28 09:25:05 +00:00
2b1c8ce7df test: re-enable all TMEM ops (alloc, zero, dealloc) 2026-05-28 09:24:28 +00:00
acc7424a48 test: skip TMEM zeroing, just alloc+dealloc 2026-05-28 09:23:48 +00:00
ca419c52f3 test: re-enable TMEM alloc + zero 2026-05-28 09:23:10 +00:00
09e8ea5933 test: fix compile error, skip TMEM read 2026-05-28 09:22:17 +00:00
69bbc21300 test: skip all TMEM ops, just test SMEM layout + descriptor 2026-05-28 09:21:52 +00:00
a6c0ce51a2 test: skip MMA, just test descriptor values 2026-05-28 09:20:59 +00:00
ea6b42e649 test_umma_qk: add descriptor debug output 2026-05-28 09:20:12 +00:00
0f6907b001 UMMA: fix descriptor + idesc — use gau-nernst tutorial values
- LBO = BLOCK_MN * 16 (bytes), SBO = 128 (bytes) for K-major NONE
- Canonical SMEM layout: column-major interleaving of core matrices
- idesc is SEPARATE 32-bit value (was using desc_a>>32 = WRONG)
- idesc encodes dtype/atype/btype/MMA_M/MMA_N
- This was the root cause of 'misaligned address' errors
2026-05-28 09:18:45 +00:00
9b458d2a6c test_umma_qk: clean rewrite, hardcoded HD=16, explicit core-matrix layout writes 2026-05-28 09:16:37 +00:00
427410d94a UMMA: Rewrite fmha_umma_desc.cuh with correct K-major core-matrix layout + minimal QK GEMM test
- Core-matrix layout: each 8x8 BF16 tile (128B) contiguous in SMEM
- K-major NONE descriptor: LBO=1 (16B), SBO=block_k/8, lbo_mode=0
- MMA K-tiling: tcgen05.mma uses K=16 per call, tile for hd>16
- write_smem_kmajor: converts row-major to core-matrix layout
- write_smem_ktile: extracts single K-tile in core-matrix layout
- test_umma_qk.cu: minimal hd=16, sk=128 test (single MMA call)
- Previous UMMA descriptors were wrong (row-major SMEM, wrong LBO/SBO)
2026-05-28 09:15:40 +00:00
68b4151d21 dump SMEM layout info 2026-05-28 08:59:19 +00:00
fe0588d906 fix: simplify UMMA dump script 2026-05-28 08:57:49 +00:00
948a3f8a7a add UMMA descriptor dump script 2026-05-28 08:55:43 +00:00
e5ba0ca119 debug: clean QK verify with scalar sanity + MMA result 2026-05-28 08:53:35 +00:00
a04d794979 debug: skip TMEM alloc — test SMEM loads only 2026-05-28 08:49:37 +00:00
72c97f2546 debug: minimal UMMA descriptor (just start_addr + version) 2026-05-28 08:48:01 +00:00
9a51bfa578 fix: align SMEM layout properly (128B aligned tmem + Q) 2026-05-28 08:46:56 +00:00
2a765be715 fix: correct SMEM size for row-major (not swizzled) 2026-05-28 08:44:55 +00:00
c64bd7b875 debug: read Q/K directly from SMEM 2026-05-28 08:43:39 +00:00
58b610c96c fix: proper early return for SMEM load test 2026-05-28 08:41:30 +00:00
82bc2c4a49 debug: verify SMEM loads + scalar QK sanity check 2026-05-28 08:40:16 +00:00
53139d24bf debug: verify TMEM r/w works before MMA 2026-05-28 08:39:12 +00:00
a9d71ff6ab debug: print TMEM values after MMA 2026-05-28 08:38:08 +00:00
bfb1e177ce debug: try all-lane MMA + print tmem_base 2026-05-28 08:37:02 +00:00
d3510980e4 feat: SWIZZLE_NONE UMMA descriptors with row-major SMEM
Canonical UMMA layout for SWIZZLE_NONE:
- MN-major (128, 64): LBO=16, SBO=128 (from logical_divide Tile(1,8))
- K-major (128, 64): LBO=16, SBO=32 (from logical_divide Tile(8,2))

Using simple row-major SMEM layout (no swizzle, no permutation).
Data is written directly to SMEM in row-major order.
The descriptor strides describe the canonical layout.
2026-05-28 08:35:30 +00:00
8c67c31497 add CuTe descriptor printing script 2026-05-28 08:23:34 +00:00
d29d6b575f add UMMA descriptor diagnostic script 2026-05-28 08:20:56 +00:00
ab84ad0f86 feat: implement canonical UMMA SMEM layout with SWIZZLE_128B
Proper implementation of the SMEM layout that tcgen05.mma expects:
- SWIZZLE_128B (layout_type=2) for both MN-major A and K-major B
- Swizzle<3,4,3> applied to element offsets before SMEM write
- MN_SW128 atom: (1024, 8) BF16, stride (1, 1024)
- K_SW128 atom: (8, 1024) BF16, stride (1, 8)
- umma_smem_write/read functions for both MN and K major
- Descriptor with correct leading_byte_offset and stride_byte_offset

This is the RIGHT WAY. No shortcuts.
2026-05-28 08:18:47 +00:00
ecbc75255c fix: correct UMMA descriptor format from CUTLASS source
The descriptor bitfield is completely different from what I assumed:
- [0,14) start_address (smem_ptr >> 4)
- [16,30) leading_byte_offset (row stride bytes >> 4)
- [32,46) stride_byte_offset
- [46,48) version = 1 (Blackwell)
- [61,64) layout_type (0=NONE, 2=128B, 4=64B, 6=32B)
- idescE = desc >> 32, passed as separate arg to MMA PTX

The 64-bit descriptor uses byte offsets (not log2 or element counts).
The upper 32 bits are reinterpreted by the MMA hardware as idescE.
2026-05-28 08:07:52 +00:00
fe7d561143 debug: print UMMA descriptor values for diagnosis 2026-05-28 08:03:53 +00:00
c5f7a9a15c fix: align SMEM buffers to 16 bytes for UMMA descriptors 2026-05-28 08:02:53 +00:00
3549a2388b fix: constexpr HD for template param 2026-05-28 08:01:18 +00:00