-
7cb133c5bf
test: exact copy of working TS test (verify it still passes)
biondizzle
2026-05-28 13:49:04 +00:00
-
0dcaa648b3
test: properly aligned V SMEM buffer
biondizzle
2026-05-28 13:47:47 +00:00
-
1c14ada386
test: write O to tb (overwriting P), same as isolated test
biondizzle
2026-05-28 13:46:04 +00:00
-
a944f90040
test: match isolated TS test exactly (V=all-1, BLOCK_MN=16)
biondizzle
2026-05-28 13:44:36 +00:00
-
482328160a
test: single PV K-tile debug
biondizzle
2026-05-28 13:43:24 +00:00
-
3a40ed6d69
test: skip QK+softmax, write P directly to TMEM for PV debug
biondizzle
2026-05-28 13:41:50 +00:00
-
f24bc583dc
test: zero O TMEM before PV GEMM
biondizzle
2026-05-28 13:40:30 +00:00
-
2885b3f2ed
test: full FMHA HD=16 with PV GEMM via tcgen05.mma TS
biondizzle
2026-05-28 13:39:34 +00:00
-
dc2130cb12
test: cleanup TS MMA test
biondizzle
2026-05-28 13:38:07 +00:00
-
a767e90a12
test: B=2.0 to understand TS MMA scale factor
biondizzle
2026-05-28 13:36:30 +00:00
-
b7c6971720
test: use 32x32b.x8 for A write (avoids 16x256b misalign)
biondizzle
2026-05-28 13:34:50 +00:00
-
a7c81d66ba
test: step-by-step TMEM write/read debug for TS MMA
biondizzle
2026-05-28 13:33:36 +00:00
-
c05cc1ac93
test: separate TMEM regions for A and C in TS MMA
biondizzle
2026-05-28 13:32:22 +00:00
-
37a502e476
test: minimal tcgen05.mma TS debug (PV GEMM)
biondizzle
2026-05-28 13:31:18 +00:00
-
efa03f53d4
docs: update CURRENT_ISSUE and MEMORY — full FMHA HD=64 pipeline working
biondizzle
2026-05-28 13:11:32 +00:00
-
654a2ae7f4
test: merge softmax+PV into single warp0 block (s_vals scope fix)
biondizzle
2026-05-28 13:10:02 +00:00
-
5c9e3c41af
test: full FMHA HD=64 — QK+softmax+PV(register math)
biondizzle
2026-05-28 13:09:20 +00:00
-
0672373e51
test: debug — just QK+softmax+P read (no PV)
biondizzle
2026-05-28 13:08:06 +00:00
-
5d75decd57
test: full FMHA HD=16 — PV via register math (decode T=1)
biondizzle
2026-05-28 13:06:52 +00:00
-
f62772992b
test: full FMHA HD=16 with PV GEMM (separate TMEM for P and O)
biondizzle
2026-05-28 13:05:27 +00:00
-
bd15bce853
test: HD=16 QK+softmax (no PV)
biondizzle
2026-05-28 13:04:10 +00:00
-
38d7bcd776
test: HD=16 FMHA softmax only (skip PV for now)
biondizzle
2026-05-28 13:03:06 +00:00
-
834d682443
test: full FMHA HD=16 pipeline (QK→softmax→PV→epilogue)
biondizzle
2026-05-28 13:02:00 +00:00
-
3b8be4b2db
test: FMHA softmax (QK→read S→softmax→write P→read P→verify)
biondizzle
2026-05-28 13:00:37 +00:00
-
c936940428
test: separate (128,16) SMEM per K-tile with correct source stride
biondizzle
2026-05-28 12:57:38 +00:00
-
f244c4fdd2
test: single-thread MMA (tid==0) for Layout D
biondizzle
2026-05-28 12:56:39 +00:00
-
ba2e390e1e
test: debug single K-tile from full (128,64) SMEM
biondizzle
2026-05-28 12:55:52 +00:00
-
a7e8b483cd
test: HD=64 multi-K-tile with correct source stride in SMEM writes
biondizzle
2026-05-28 12:54:57 +00:00
-
926ae5d7bf
test: fix K source stride mismatch in manual SMEM write
biondizzle
2026-05-28 12:54:03 +00:00
-
7d16a30cb6
test: exact HD=16 pattern with HD=64 data
biondizzle
2026-05-28 12:53:13 +00:00
-
db4f661843
test: debug with (128,16) SMEM matching HD=16 exactly
biondizzle
2026-05-28 12:52:19 +00:00
-
b703dc0a50
test: debug single K-tile with offset descriptor
biondizzle
2026-05-28 12:51:33 +00:00
-
435ca037cf
test: use accumulate=false for first K-tile, skip TMEM zero
biondizzle
2026-05-28 12:50:44 +00:00
-
e8ac2120ad
test: HD=64 QK with contiguous SMEM + offset descriptors
biondizzle
2026-05-28 12:50:07 +00:00
-
1c01e8e412
test: fix inline asm line continuation for nvcc
biondizzle
2026-05-28 12:48:45 +00:00
-
71c774027c
test: fix HD=64 QK — zero TMEM, fence after MMA, single-thread MMA call
biondizzle
2026-05-28 12:47:51 +00:00
-
1bf76388c8
test: always accumulate, separate SMEM per K-tile, TMEM starts at 0
biondizzle
2026-05-28 12:23:47 +00:00
-
8707f555c2
test: add extra syncwarp + syncthreads for MMA safety
biondizzle
2026-05-28 12:20:01 +00:00
-
5a65d46c26
test: HD=64 with separate SMEM per K-tile — no offset descriptors needed
biondizzle
2026-05-28 12:18:06 +00:00
-
526fafb808
test: revert volatile, fix wid==0, full 4 K-tiles
biondizzle
2026-05-28 12:16:09 +00:00
-
de879342dd
test: 1 K-tile, volatile writes, verify SMEM
biondizzle
2026-05-28 12:13:23 +00:00
-
bd6440fd83
test: volatile SMEM writes + 2 K-tiles
biondizzle
2026-05-28 12:11:47 +00:00
-
c2e41a858e
test: force 2 K-tiles for debug
biondizzle
2026-05-28 12:09:45 +00:00
-
8b2200a6d3
test: HD=64 full 4 K-tile accumulate + full-HD scalar reference
biondizzle
2026-05-28 12:07:50 +00:00
-
afb18caf2d
test: clean HD=64, 1 K-tile only, verify SMEM writes + compare vs scalar
biondizzle
2026-05-28 12:04:54 +00:00
-
e587e26b06
test: log canonical indices we write Q to
biondizzle
2026-05-28 12:01:28 +00:00
-
facd509c3c
test: remove sanity check (zeroing loop overwrites), fix verify offsets
biondizzle
2026-05-28 11:59:08 +00:00
-
20ae390d32
test: fix compile error
biondizzle
2026-05-28 11:57:08 +00:00
-
7b16eceb91
test: more detailed SMEM sanity check
biondizzle
2026-05-28 11:56:07 +00:00
-
eb0ca18e23
test: sanity check sQ[0] write+read
biondizzle
2026-05-28 11:54:13 +00:00
-
8936a2dec7
test: clean SMEM write loops for HD=64
biondizzle
2026-05-28 11:52:51 +00:00
-
2ffbfda47d
test: print SMEM verify data
biondizzle
2026-05-28 11:51:08 +00:00
-
4fd41365de
test: add SMEM verify for HD=64 K-tile offsets
biondizzle
2026-05-28 11:49:44 +00:00
-
4483539f01
test: HD=64 random data, 4 K-tiles, accumulate
biondizzle
2026-05-28 11:47:56 +00:00
-
73bd21ce01
test: force 1 K-tile for HD=64 debug
biondizzle
2026-05-28 11:46:12 +00:00
-
abe1870429
test: HD=64 all-ones, expected S[0,j]=64 (unscaled) or 8.0 scaled
biondizzle
2026-05-28 11:44:31 +00:00
-
73f9ff98c9
test: UMMA QK HD=64 (4 K-tiles, accumulate) — multi-K-tile test
biondizzle
2026-05-28 11:42:29 +00:00
-
df34cae9c6
UMMA QK GEMM WORKING! Update docs — 4x was scale factor, not bug
biondizzle
2026-05-28 11:41:19 +00:00
-
1874a70a6d
test: fix var ref
biondizzle
2026-05-28 11:39:15 +00:00
-
8426d13285
test: fix comparison — row 0 is S[0,c], rows 1-127 should be zero
biondizzle
2026-05-28 11:38:22 +00:00
-
6f40fafa91
test: verify ALL 128 rows × 8 cols match scalar reference
biondizzle
2026-05-28 11:36:46 +00:00
-
3c7d9d9303
test: apply 1/sqrt(HD) scale to MMA output — 4x was the scale factor, not a bug!
biondizzle
2026-05-28 11:34:45 +00:00
-
013f370046
test: all-ones data, expected S[0,j]=16.0 for every j
biondizzle
2026-05-28 11:32:56 +00:00
-
f5a0966afc
test: 4 warp leaders (lane==0) call MMA simultaneously
biondizzle
2026-05-28 11:30:19 +00:00
-
c01d6fddf4
test: gau-nernst pattern — fence::after_thread_sync, 4 warps, 128 threads, 32x32b.x8 loop
biondizzle
2026-05-28 11:28:47 +00:00
-
a048b56886
test: single-thread MMA + 0.25 scaling for 4× factor
biondizzle
2026-05-28 10:23:06 +00:00
-
57d67e6b51
test: revert to 64-bit descriptors, 4 warp leaders, 32x32b read
biondizzle
2026-05-28 10:21:06 +00:00
-
32f7fa7bce
Update CURRENT_ISSUE.md and MEMORY.md with UMMA 4× bug details
biondizzle
2026-05-28 10:15:14 +00:00
-
3f95f1c5d4
test: try LBO with block_mn=32 (1/4 of M=128)
biondizzle
2026-05-28 10:11:38 +00:00
-
d03e353972
test: 4 warp leaders call MMA (Layout D requires 4 warps)
biondizzle
2026-05-28 10:10:07 +00:00
-
8059ed15ad
test: explicitly zero padding between Q and K
biondizzle
2026-05-28 10:08:35 +00:00
-
9e98c067ab
test: Layout D TMEM read using 32x32b.x8 format, 4 warps
biondizzle
2026-05-28 10:07:15 +00:00
-
68d1a7920c
test: M=64 in both desc and idesc
biondizzle
2026-05-28 10:04:17 +00:00
-
0f51fda0da
test: try N=8 in idesc
biondizzle
2026-05-28 10:02:52 +00:00
-
4f7c9649fd
test: clean UMMA QK test, debug 4x factor, 8KB padding, 128 TMEM cols
biondizzle
2026-05-28 10:01:39 +00:00
-
ac65ece33b
test: TMEM 2-store with fence outside wid guard, 64 threads
biondizzle
2026-05-28 09:59:43 +00:00
-
2c89eea6be
test: fence+sync between 2 tmem_stores
biondizzle
2026-05-28 09:58:51 +00:00
-
24c5afe1dc
test: 64 threads, 2 stores to col 0
biondizzle
2026-05-28 09:57:53 +00:00
-
987f2c8917
test: 2 tmem_stores to SAME column 0
biondizzle
2026-05-28 09:57:07 +00:00
-
494149f034
test: 32 threads (1 warp), no guards, all participate
biondizzle
2026-05-28 09:56:17 +00:00
-
f0cb71da5c
test: TMEM 2-col with fence+sync between stores, separate wid==0 blocks
biondizzle
2026-05-28 09:54:19 +00:00
-
b69a538ab1
test: add fence+sync between 2 tmem_stores
biondizzle
2026-05-28 09:53:10 +00:00
-
7a21fa4bd8
test: add 2nd tmem_store to column 1
biondizzle
2026-05-28 09:52:05 +00:00
-
4b129c146e
test: add 1 tmem_load back
biondizzle
2026-05-28 09:51:21 +00:00
-
61f19ce891
test: skip tmem_load, only store+dealloc
biondizzle
2026-05-28 09:50:48 +00:00
-
2513e1a692
test: use 64 threads, fence outside warp guard, 1 store
biondizzle
2026-05-28 09:50:09 +00:00
-
abfe9dbaa1
test: only 1 tmem_store to verify single column works
biondizzle
2026-05-28 09:49:21 +00:00
-
5795589abc
test: TMEM 4 columns, individual store calls + loop load
biondizzle
2026-05-28 09:48:27 +00:00
-
8a428f6127
test: TMEM column addressing test (128 cols, store+load)
biondizzle
2026-05-28 09:46:49 +00:00
-
ee3fe6d6b2
test: tmem_load column 1 only
biondizzle
2026-05-28 09:45:34 +00:00
-
6c38c6e442
test: read 8 TMEM columns individually (no loop)
biondizzle
2026-05-28 09:44:30 +00:00
-
bcc6ed114d
test: add 8KB padding after sQ to prevent MMA read overrun
biondizzle
2026-05-28 09:43:17 +00:00
-
764ed01d6f
test: try M=64 in descriptor + idesc to debug 4x factor
biondizzle
2026-05-28 09:41:50 +00:00
-
4cb656e583
test: try idesc=0 (same as gau-nernst)
biondizzle
2026-05-28 09:40:19 +00:00
-
cfba8484da
test: try idesc with N=128 (full extent) + 128 TMEM cols
biondizzle
2026-05-28 09:39:19 +00:00
-
30f0056b11
test: clean rewrite with SMEM Q/K verification and dot product check
biondizzle
2026-05-28 09:38:26 +00:00
-
7eb85a71fc
test: add Q SMEM verification output + bf16_to_f32_host
biondizzle
2026-05-28 09:37:07 +00:00
-
8f23c2aaf6
test: verify SMEM Q layout by reading back canonical data
biondizzle
2026-05-28 09:35:58 +00:00
-
004046a6a8
test: read only 1 TMEM column after MMA
biondizzle
2026-05-28 09:35:02 +00:00
-
41128122e3
test: clean rewrite, 32 TMEM cols, MMA N=32, tmem_load loop
biondizzle
2026-05-28 09:33:45 +00:00