ba2e390e1e
test: debug single K-tile from full (128,64) SMEM
2026-05-28 12:55:52 +00:00
a7e8b483cd
test: HD=64 multi-K-tile with correct source stride in SMEM writes
2026-05-28 12:54:57 +00:00
926ae5d7bf
test: fix K source stride mismatch in manual SMEM write
2026-05-28 12:54:03 +00:00
7d16a30cb6
test: exact HD=16 pattern with HD=64 data
2026-05-28 12:53:13 +00:00
db4f661843
test: debug with (128,16) SMEM matching HD=16 exactly
2026-05-28 12:52:19 +00:00
b703dc0a50
test: debug single K-tile with offset descriptor
2026-05-28 12:51:33 +00:00
435ca037cf
test: use accumulate=false for first K-tile, skip TMEM zero
2026-05-28 12:50:44 +00:00
e8ac2120ad
test: HD=64 QK with contiguous SMEM + offset descriptors
2026-05-28 12:50:07 +00:00
1c01e8e412
test: fix inline asm line continuation for nvcc
2026-05-28 12:48:45 +00:00
71c774027c
test: fix HD=64 QK — zero TMEM, fence after MMA, single-thread MMA call
2026-05-28 12:47:51 +00:00
1bf76388c8
test: always accumulate, separate SMEM per K-tile, TMEM starts at 0
2026-05-28 12:23:47 +00:00
8707f555c2
test: add extra syncwarp + syncthreads for MMA safety
2026-05-28 12:20:01 +00:00
5a65d46c26
test: HD=64 with separate SMEM per K-tile — no offset descriptors needed
2026-05-28 12:18:06 +00:00
526fafb808
test: revert volatile, fix wid==0, full 4 K-tiles
2026-05-28 12:16:09 +00:00
de879342dd
test: 1 K-tile, volatile writes, verify SMEM
2026-05-28 12:13:23 +00:00
bd6440fd83
test: volatile SMEM writes + 2 K-tiles
2026-05-28 12:11:47 +00:00
c2e41a858e
test: force 2 K-tiles for debug
2026-05-28 12:09:45 +00:00
8b2200a6d3
test: HD=64 full 4 K-tile accumulate + full-HD scalar reference
2026-05-28 12:07:50 +00:00
afb18caf2d
test: clean HD=64, 1 K-tile only, verify SMEM writes + compare vs scalar
2026-05-28 12:04:54 +00:00
e587e26b06
test: log canonical indices we write Q to
2026-05-28 12:01:28 +00:00
facd509c3c
test: remove sanity check (zeroing loop overwrites), fix verify offsets
2026-05-28 11:59:08 +00:00
20ae390d32
test: fix compile error
2026-05-28 11:57:08 +00:00
7b16eceb91
test: more detailed SMEM sanity check
2026-05-28 11:56:07 +00:00
eb0ca18e23
test: sanity check sQ[0] write+read
2026-05-28 11:54:13 +00:00
8936a2dec7
test: clean SMEM write loops for HD=64
2026-05-28 11:52:51 +00:00
2ffbfda47d
test: print SMEM verify data
2026-05-28 11:51:08 +00:00
4fd41365de
test: add SMEM verify for HD=64 K-tile offsets
2026-05-28 11:49:44 +00:00
4483539f01
test: HD=64 random data, 4 K-tiles, accumulate
2026-05-28 11:47:56 +00:00
73bd21ce01
test: force 1 K-tile for HD=64 debug
2026-05-28 11:46:12 +00:00
abe1870429
test: HD=64 all-ones, expected S[0,j]=64 (unscaled) or 8.0 scaled
2026-05-28 11:44:31 +00:00
73f9ff98c9
test: UMMA QK HD=64 (4 K-tiles, accumulate) — multi-K-tile test
2026-05-28 11:42:29 +00:00
df34cae9c6
UMMA QK GEMM WORKING! Update docs — 4x was scale factor, not bug
...
Major milestone: UMMA QK GEMM produces correct attention scores at HD=16!
- MMA computes raw dot product; apply 1/sqrt(HD) scaling manually
- tcgen05.fence::after_thread_sync for MMA→TMEM fence
- 32x32b.x8 TMEM reads for Layout D output
- 4 warps (128 threads) required for M=128
- Next: HD=64 multi-K-tile, PV GEMM, full FMHA pipeline
2026-05-28 11:41:19 +00:00
1874a70a6d
test: fix var ref
2026-05-28 11:39:15 +00:00
8426d13285
test: fix comparison — row 0 is S[0,c], rows 1-127 should be zero
2026-05-28 11:38:22 +00:00
6f40fafa91
test: verify ALL 128 rows × 8 cols match scalar reference
2026-05-28 11:36:46 +00:00
3c7d9d9303
test: apply 1/sqrt(HD) scale to MMA output — 4x was the scale factor, not a bug!
2026-05-28 11:34:45 +00:00
013f370046
test: all-ones data, expected S[0,j]=16.0 for every j
2026-05-28 11:32:56 +00:00
f5a0966afc
test: 4 warp leaders (lane==0) call MMA simultaneously
2026-05-28 11:30:19 +00:00
c01d6fddf4
test: gau-nernst pattern — fence::after_thread_sync, 4 warps, 128 threads, 32x32b.x8 loop
2026-05-28 11:28:47 +00:00
a048b56886
test: single-thread MMA + 0.25 scaling for 4× factor
2026-05-28 10:23:06 +00:00
57d67e6b51
test: revert to 64-bit descriptors, 4 warp leaders, 32x32b read
2026-05-28 10:21:06 +00:00
32f7fa7bce
Update CURRENT_ISSUE.md and MEMORY.md with UMMA 4× bug details
...
- MMA produces exactly 4× scalar reference for all output values
- SMEM data verified correct, descriptor values correct
- 4× persists across different N, warp counts, padding
- TMEM multi-store bug documented (16x256b.x1 crashes on 2nd store)
- Layout D read with 32x32b.x8 works
- Next: study CUTLASS FMHA TMEM output layout to fix 4× factor
2026-05-28 10:15:14 +00:00
3f95f1c5d4
test: try LBO with block_mn=32 (1/4 of M=128)
2026-05-28 10:11:38 +00:00
d03e353972
test: 4 warp leaders call MMA (Layout D requires 4 warps)
2026-05-28 10:10:07 +00:00
8059ed15ad
test: explicitly zero padding between Q and K
2026-05-28 10:08:35 +00:00
9e98c067ab
test: Layout D TMEM read using 32x32b.x8 format, 4 warps
2026-05-28 10:07:15 +00:00
68d1a7920c
test: M=64 in both desc and idesc
2026-05-28 10:04:17 +00:00
0f51fda0da
test: try N=8 in idesc
2026-05-28 10:02:52 +00:00
4f7c9649fd
test: clean UMMA QK test, debug 4x factor, 8KB padding, 128 TMEM cols
2026-05-28 10:01:39 +00:00
ac65ece33b
test: TMEM 2-store with fence outside wid guard, 64 threads
2026-05-28 09:59:43 +00:00