abfe9dbaa1
test: only 1 tmem_store to verify single column works
2026-05-28 09:49:21 +00:00
5795589abc
test: TMEM 4 columns, individual store calls + loop load
2026-05-28 09:48:27 +00:00
8a428f6127
test: TMEM column addressing test (128 cols, store+load)
2026-05-28 09:46:49 +00:00
ee3fe6d6b2
test: tmem_load column 1 only
2026-05-28 09:45:34 +00:00
6c38c6e442
test: read 8 TMEM columns individually (no loop)
2026-05-28 09:44:30 +00:00
bcc6ed114d
test: add 8KB padding after sQ to prevent MMA read overrun
2026-05-28 09:43:17 +00:00
764ed01d6f
test: try M=64 in descriptor + idesc to debug 4x factor
2026-05-28 09:41:50 +00:00
4cb656e583
test: try idesc=0 (same as gau-nernst)
2026-05-28 09:40:19 +00:00
cfba8484da
test: try idesc with N=128 (full extent) + 128 TMEM cols
2026-05-28 09:39:19 +00:00
30f0056b11
test: clean rewrite with SMEM Q/K verification and dot product check
2026-05-28 09:38:26 +00:00
7eb85a71fc
test: add Q SMEM verification output + bf16_to_f32_host
2026-05-28 09:37:07 +00:00
8f23c2aaf6
test: verify SMEM Q layout by reading back canonical data
2026-05-28 09:35:58 +00:00
004046a6a8
test: read only 1 TMEM column after MMA
2026-05-28 09:35:02 +00:00
41128122e3
test: clean rewrite, 32 TMEM cols, MMA N=32, tmem_load loop
2026-05-28 09:33:45 +00:00
58be79957d
test: 32 TMEM cols, add MMA call with N=32, read S from TMEM
2026-05-28 09:32:33 +00:00
22fb861447
test: 2 tmem_stores with syncwarp between
2026-05-28 09:30:37 +00:00
a87f20a4ae
test: just 1 tmem_store, no fence, no loop
2026-05-28 09:29:46 +00:00
2b57f28968
test: zero 128 TMEM columns, skip fence
2026-05-28 09:29:14 +00:00
25c9b70591
test: zero 2 TMEM columns
2026-05-28 09:28:31 +00:00
01c4097ccc
test: zero 32 TMEM columns
2026-05-28 09:27:59 +00:00
3694f63ba4
test: re-enable full TMEM zeroing (128 columns)
2026-05-28 09:27:25 +00:00
c3b6c3a5e6
test: minimal tmem_store debug (1 column + sentinels)
2026-05-28 09:26:52 +00:00
f1aaa50326
test: re-enable TMEM zeroing with tmem_base debug
2026-05-28 09:26:16 +00:00
a7f81331f8
test: skip TMEM zeroing again, alloc+dealloc only
2026-05-28 09:25:37 +00:00
3f5dcd481e
test: zero only 32 TMEM columns
2026-05-28 09:25:05 +00:00
2b1c8ce7df
test: re-enable all TMEM ops (alloc, zero, dealloc)
2026-05-28 09:24:28 +00:00
acc7424a48
test: skip TMEM zeroing, just alloc+dealloc
2026-05-28 09:23:48 +00:00
ca419c52f3
test: re-enable TMEM alloc + zero
2026-05-28 09:23:10 +00:00
09e8ea5933
test: fix compile error, skip TMEM read
2026-05-28 09:22:17 +00:00
69bbc21300
test: skip all TMEM ops, just test SMEM layout + descriptor
2026-05-28 09:21:52 +00:00
a6c0ce51a2
test: skip MMA, just test descriptor values
2026-05-28 09:20:59 +00:00
ea6b42e649
test_umma_qk: add descriptor debug output
2026-05-28 09:20:12 +00:00
0f6907b001
UMMA: fix descriptor + idesc — use gau-nernst tutorial values
- LBO = BLOCK_MN * 16 (bytes), SBO = 128 (bytes) for K-major NONE
- Canonical SMEM layout: column-major interleaving of core matrices
- idesc is SEPARATE 32-bit value (was using desc_a>>32 = WRONG)
- idesc encodes dtype/atype/btype/MMA_M/MMA_N
- This was the root cause of 'misaligned address' errors
2026-05-28 09:18:45 +00:00
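Note on 0f6907b001: the LBO/SBO values in the message imply a byte offset for every BF16 element in the K-major, no-swizzle layout. The sketch below is a host-side model of that mapping, not the repository's code. The function name and constants are hypothetical, and the assumption that a core matrix is 8 rows of 16 bytes is read off the SBO = 128 figure.

```cpp
#include <cstdint>

// Hypothetical host-side model of the canonical K-major, no-swizzle layout.
// Assumptions (from the commit message): SBO = 128 bytes, LBO = block_mn * 16 bytes.
// A core matrix is taken to be 8 rows of 16 bytes (8 BF16 values per row).
constexpr uint32_t kCoreRows = 8;         // rows per core matrix
constexpr uint32_t kCoreRowBytes = 16;    // bytes per core-matrix row
constexpr uint32_t kElemBytes = 2;        // sizeof(BF16)
constexpr uint32_t kElemsPerCoreRow = kCoreRowBytes / kElemBytes;  // 8

// Byte offset of element (row, k) inside a (block_mn x K) tile.
inline uint32_t canonical_offset_bytes(uint32_t row, uint32_t k, uint32_t block_mn) {
    const uint32_t sbo = kCoreRows * kCoreRowBytes;  // 128: next core matrix along M/N
    const uint32_t lbo = block_mn * kCoreRowBytes;   // next core matrix along K
    return (k / kElemsPerCoreRow) * lbo      // which column of core matrices
         + (row / kCoreRows) * sbo           // which core matrix within that column
         + (row % kCoreRows) * kCoreRowBytes // row inside the core matrix
         + (k % kElemsPerCoreRow) * kElemBytes;
}
```

With block_mn = 128 and K = 16 the offsets cover 0..4094 in steps of 2, so the tile occupies exactly 128 * 16 * 2 = 4096 bytes with no gaps.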
9b458d2a6c
test_umma_qk: clean rewrite, hardcoded HD=16, explicit core-matrix layout writes
2026-05-28 09:16:37 +00:00
427410d94a
UMMA: Rewrite fmha_umma_desc.cuh with correct K-major core-matrix layout + minimal QK GEMM test
- Core-matrix layout: each 8x8 BF16 tile (128B) contiguous in SMEM
- K-major NONE descriptor: LBO=1 (16B), SBO=block_k/8, lbo_mode=0
- MMA K-tiling: tcgen05.mma uses K=16 per call, tile for hd>16
- write_smem_kmajor: converts row-major to core-matrix layout
- write_smem_ktile: extracts single K-tile in core-matrix layout
- test_umma_qk.cu: minimal hd=16, sk=128 test (single MMA call)
- Previous UMMA descriptors were wrong (row-major SMEM, wrong LBO/SBO)
2026-05-28 09:15:40 +00:00
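Note on 427410d94a: the K-tiling bullet (K=16 per call, tile for hd>16) can be modeled on the CPU as a loop over K-tiles in which the first tile overwrites the accumulator and later tiles add to it. This is only a sketch of the arithmetic; the function names are hypothetical, hd is assumed to be a multiple of 16, and the boolean stands in for whatever accumulate control the real MMA call uses.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kMmaK = 16;  // K extent per MMA call, per the commit message

// Number of MMA calls along K for head dimension hd (assumed multiple of 16).
inline std::size_t num_k_tiles(std::size_t hd) { return hd / kMmaK; }

// CPU model of K-tiled accumulation: S[m][n] = sum_k Q[m][k] * K[n][k].
// q is (m_rows x hd) row-major, k is (n_rows x hd) row-major.
inline std::vector<float> qk_ktiled(const std::vector<float>& q,
                                    const std::vector<float>& k,
                                    std::size_t m_rows, std::size_t n_rows,
                                    std::size_t hd) {
    std::vector<float> s(m_rows * n_rows, 0.0f);
    for (std::size_t kt = 0; kt < num_k_tiles(hd); ++kt) {
        const bool accumulate = kt > 0;  // first tile overwrites, later tiles add
        for (std::size_t m = 0; m < m_rows; ++m) {
            for (std::size_t n = 0; n < n_rows; ++n) {
                float partial = 0.0f;
                for (std::size_t kk = 0; kk < kMmaK; ++kk) {
                    const std::size_t col = kt * kMmaK + kk;
                    partial += q[m * hd + col] * k[n * hd + col];
                }
                float& acc = s[m * n_rows + n];
                acc = accumulate ? acc + partial : partial;
            }
        }
    }
    return s;
}
```

For hd = 16 the loop runs once, which matches the single-MMA-call test case named in the message.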
e5ba0ca119
debug: clean QK verify with scalar sanity + MMA result
2026-05-28 08:53:35 +00:00
9a51bfa578
fix: align SMEM layout properly (128B aligned tmem + Q)
2026-05-28 08:46:56 +00:00
2a765be715
fix: correct SMEM size for row-major (not swizzled)
2026-05-28 08:44:55 +00:00
ab84ad0f86
feat: implement canonical UMMA SMEM layout with SWIZZLE_128B
Proper implementation of the SMEM layout that tcgen05.mma expects:
- SWIZZLE_128B (layout_type=2) for both MN-major A and K-major B
- Swizzle<3,4,3> applied to element offsets before SMEM write
- MN_SW128 atom: (1024, 8) BF16, stride (1, 1024)
- K_SW128 atom: (8, 1024) BF16, stride (1, 8)
- umma_smem_write/read functions for both MN and K major
- Descriptor with correct leading_byte_offset and stride_byte_offset
This is the RIGHT WAY. No shortcuts.
2026-05-28 08:18:47 +00:00
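Note on ab84ad0f86: an XOR swizzle of the Swizzle<3,4,3> kind can be sketched as below. It follows the CuTe Swizzle<BBits, MBase, SShift> convention as I understand it (take BBits bits starting at bit MBase + SShift and XOR them into the bits starting at MBase). The function names are hypothetical, and whether the offset is counted in bytes or in elements is left to the caller; the message says element offsets, which this sketch does not verify.

```cpp
#include <cstdint>

// Generic XOR swizzle: XOR bits [MBase+SShift, MBase+SShift+BBits) into
// bits [MBase, MBase+BBits). Only non-negative shifts are modeled here.
template <int BBits, int MBase, int SShift>
inline uint32_t swizzle(uint32_t offset) {
    static_assert(SShift >= 0, "this sketch only handles non-negative shifts");
    constexpr uint32_t bit_mask = (1u << BBits) - 1u;
    constexpr uint32_t src_mask = bit_mask << (MBase + SShift);
    return offset ^ ((offset & src_mask) >> SShift);
}

// The <3,4,3> instance named in the commit message:
// bits 7..9 are XORed into bits 4..6.
inline uint32_t swizzle_343(uint32_t offset) {
    return swizzle<3, 4, 3>(offset);
}
```

Because the source bits are never modified, applying the function twice returns the original offset, so the same routine can serve both the write and the read-back path.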
3549a2388b
fix: constexpr HD for template param
2026-05-28 08:01:18 +00:00
7436315309
feat: add tcgen05.mma QK GEMM verification kernel + test
Step 1 of tensor-core acceleration:
- fmha_umma_desc.cuh: UMMA SMEM descriptor construction (raw bitfield)
- fmha_qk_verify.cuh: QK GEMM using tcgen05.mma SS (SMEM A, SMEM B → TMEM C)
- test_qk_mma.cu: standalone test comparing MMA output vs CPU reference
Key design decisions:
- UMMA descriptors built from raw bitfield (no CuTe dependency)
- tcgen05.mma called by one lane per warp (elect_one_sync pattern)
- Q: (128, HD) MN-major, K: (128, HD) K-major (transposed via descriptor)
- S: (128, 128) in TMEM, row 0 read back via tcgen05.ld
2026-05-28 08:00:42 +00:00
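Note on 7436315309: the CPU reference side of such a test might look like the sketch below, which computes one row of S = Q * K^T from BF16 inputs and reports the largest deviation. The names bf16_to_f32_host and max_abs_diff appear in later commit subjects, but their signatures here are assumptions; f32_to_bf16_host and qk_reference_row are hypothetical helpers.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// BF16 is the upper 16 bits of an IEEE-754 float.
inline float bf16_to_f32_host(uint16_t b) {
    const uint32_t bits = static_cast<uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Round-to-nearest-even conversion. NaN inputs are not handled specially.
inline uint16_t f32_to_bf16_host(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    const uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>((bits + rounding) >> 16);
}

// One row of S = Q * K^T. q and k are row-major with hd columns each;
// k has n_rows rows. Accumulation is in float.
inline std::vector<float> qk_reference_row(const std::vector<uint16_t>& q,
                                           const std::vector<uint16_t>& k,
                                           std::size_t row, std::size_t n_rows,
                                           std::size_t hd) {
    std::vector<float> s(n_rows, 0.0f);
    for (std::size_t n = 0; n < n_rows; ++n) {
        for (std::size_t d = 0; d < hd; ++d) {
            s[n] += bf16_to_f32_host(q[row * hd + d]) * bf16_to_f32_host(k[n * hd + d]);
        }
    }
    return s;
}

// Largest absolute element-wise difference. NaN entries compare false and
// are skipped, so a separate NaN count is still needed.
inline float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const float d = std::fabs(a[i] - b[i]);
        if (d > m) m = d;
    }
    return m;
}
```

Reading back only row 0 from TMEM, as the message describes, pairs with qk_reference_row(q, k, 0, 128, HD) on the host.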
9524b674ab
test: enable both reference + TMEM epilogue tests at hd=64/128
2026-05-28 07:49:48 +00:00
146e4f0282
debug: print NaN positions in test
2026-05-28 07:46:57 +00:00
a12607b0bd
test: add NaN counter to FMHA test
2026-05-28 07:45:32 +00:00
53c676c8a6
test: add max_abs_diff to FMHA test output
2026-05-28 07:44:45 +00:00
593bc25afa
test: add TMEM lane mapping diagnostics
2026-05-28 07:42:16 +00:00
0ddcc6bafd
debug: add printf to TMEM kernel to find hang point
2026-05-28 07:39:53 +00:00
44fb04fa1f
test: disable tmem epilogue test (debugging reference hang)
2026-05-28 07:38:47 +00:00
224d7e24c6
harness: add fire_b200_cuda_test + check_b200_cuda, update README
Two new turnkey harness scripts for .cu tests:
- fire_b200_cuda_test: compile+run+poll, kills everything first,
deletes old logs, one test at a time, screen-based, timeout
- check_b200_cuda: peek at running test log, or kill hung test
README updated with CUDA harness documentation.
Removed janky tests/run_cuda_test.sh.
2026-05-28 07:36:10 +00:00
cec505ce14
add CUDA test runner script (screen-based, follows harness pattern)
2026-05-28 07:31:41 +00:00