ba2e390e1e  2026-05-28 12:55:52 +00:00  test: debug single K-tile from full (128,64) SMEM
a7e8b483cd  2026-05-28 12:54:57 +00:00  test: HD=64 multi-K-tile with correct source stride in SMEM writes
926ae5d7bf  2026-05-28 12:54:03 +00:00  test: fix K source stride mismatch in manual SMEM write
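The stride fix in 926ae5d7bf (and its follow-up a7e8b483cd) comes down to indexing: when one 16-column K-tile is copied out of a row-major (rows, HD) source, consecutive rows are HD elements apart, not 16. Below is a minimal host-side sketch of that copy, assuming a row-major float source and a K-tile width of 16 (HD=64 split into 4 K-tiles). `extract_k_tile` and `kTileK` are hypothetical names, and the sketch writes a dense row-major tile rather than the canonical UMMA SMEM layout the tests actually target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical K-tile width: HD=64 split into 4 K-tiles of 16 columns.
constexpr int kTileK = 16;

// Copy K-tile `kt` out of a row-major (rows, hd) source into a dense
// row-major (rows, kTileK) buffer. This models only the source indexing,
// not the canonical core-matrix SMEM layout.
std::vector<float> extract_k_tile(const std::vector<float>& src,
                                  int rows, int hd, int kt) {
    assert(hd % kTileK == 0 && kt >= 0 && kt < hd / kTileK);
    assert(src.size() == static_cast<std::size_t>(rows) * hd);
    std::vector<float> tile(static_cast<std::size_t>(rows) * kTileK);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < kTileK; ++c) {
            // Source row stride is hd; using kTileK here is the stride-mismatch bug.
            tile[r * kTileK + c] = src[r * hd + kt * kTileK + c];
        }
    }
    return tile;
}
```

For example, with `src[i] = i` and hd=64, row 1 of K-tile 2 starts at source element 64 + 32 = 96, not at 16 + 32.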
7d16a30cb6  2026-05-28 12:53:13 +00:00  test: exact HD=16 pattern with HD=64 data
db4f661843  2026-05-28 12:52:19 +00:00  test: debug with (128,16) SMEM matching HD=16 exactly
b703dc0a50  2026-05-28 12:51:33 +00:00  test: debug single K-tile with offset descriptor
435ca037cf  2026-05-28 12:50:44 +00:00  test: use accumulate=false for first K-tile, skip TMEM zero
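435ca037cf swaps the earlier "zero TMEM, then always accumulate" approach (71c774027c, 1bf76388c8) for "accumulate=false on the first K-tile". The two are equivalent if the accumulate flag means D = A·B + D when set and D = A·B when clear. The host-side emulation below checks that equivalence over HD/16 K-tiles, and against a plain full-HD dot product. The flag semantics are an assumption modeled on the MMA's accumulate input, and `mma_step` and `qk_tiled` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One emulated MMA step over K-columns [k0, k0 + tile_k) of row-major
// Q (m x hd) and K (n x hd), updating an m x n fp32 accumulator d:
//   accumulate == true : d = Q_tile * K_tile^T + d
//   accumulate == false: d = Q_tile * K_tile^T   (old contents ignored)
void mma_step(std::vector<float>& d, const std::vector<float>& q,
              const std::vector<float>& k, int m, int n, int hd,
              int k0, int tile_k, bool accumulate) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = accumulate ? d[i * n + j] : 0.0f;
            for (int x = k0; x < k0 + tile_k; ++x) acc += q[i * hd + x] * k[j * hd + x];
            d[i * n + j] = acc;
        }
    }
}

// S = Q K^T issued as hd / 16 K-tile steps, in one of two styles:
//   zero_first == true : zero the accumulator, then always accumulate
//   zero_first == false: keep stale contents, clear accumulate on tile 0 only
std::vector<float> qk_tiled(const std::vector<float>& q, const std::vector<float>& k,
                            int m, int n, int hd, bool zero_first) {
    const int tile_k = 16;
    assert(hd % tile_k == 0);
    // 123.0f stands in for whatever the accumulator held before the tile loop.
    std::vector<float> d(static_cast<std::size_t>(m) * n, 123.0f);
    if (zero_first) std::fill(d.begin(), d.end(), 0.0f);
    for (int kt = 0; kt < hd / tile_k; ++kt) {
        const bool accumulate = zero_first || kt != 0;
        mma_step(d, q, k, m, n, hd, kt * tile_k, tile_k, accumulate);
    }
    return d;
}
```

Both styles perform the same sequence of float additions, so with small integer inputs they agree exactly with each other and with a sequential full-HD reference.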
e8ac2120ad  2026-05-28 12:50:07 +00:00  test: HD=64 QK with contiguous SMEM + offset descriptors
1c01e8e412  2026-05-28 12:48:45 +00:00  test: fix inline asm line continuation for nvcc
71c774027c  2026-05-28 12:47:51 +00:00  test: fix HD=64 QK — zero TMEM, fence after MMA, single-thread MMA call
1bf76388c8  2026-05-28 12:23:47 +00:00  test: always accumulate, separate SMEM per K-tile, TMEM starts at 0
8707f555c2  2026-05-28 12:20:01 +00:00  test: add extra syncwarp + syncthreads for MMA safety
5a65d46c26  2026-05-28 12:18:06 +00:00  test: HD=64 with separate SMEM per K-tile — no offset descriptors needed
526fafb808  2026-05-28 12:16:09 +00:00  test: revert volatile, fix wid==0, full 4 K-tiles
de879342dd  2026-05-28 12:13:23 +00:00  test: 1 K-tile, volatile writes, verify SMEM
bd6440fd83  2026-05-28 12:11:47 +00:00  test: volatile SMEM writes + 2 K-tiles
c2e41a858e  2026-05-28 12:09:45 +00:00  test: force 2 K-tiles for debug
8b2200a6d3  2026-05-28 12:07:50 +00:00  test: HD=64 full 4 K-tile accumulate + full-HD scalar reference
afb18caf2d  2026-05-28 12:04:54 +00:00  test: clean HD=64, 1 K-tile only, verify SMEM writes + compare vs scalar
e587e26b06  2026-05-28 12:01:28 +00:00  test: log canonical indices we write Q to
facd509c3c  2026-05-28 11:59:08 +00:00  test: remove sanity check (zeroing loop overwrites), fix verify offsets
20ae390d32  2026-05-28 11:57:08 +00:00  test: fix compile error
7b16eceb91  2026-05-28 11:56:07 +00:00  test: more detailed SMEM sanity check
eb0ca18e23  2026-05-28 11:54:13 +00:00  test: sanity check sQ[0] write+read
8936a2dec7  2026-05-28 11:52:51 +00:00  test: clean SMEM write loops for HD=64
2ffbfda47d  2026-05-28 11:51:08 +00:00  test: print SMEM verify data
4fd41365de  2026-05-28 11:49:44 +00:00  test: add SMEM verify for HD=64 K-tile offsets
4483539f01  2026-05-28 11:47:56 +00:00  test: HD=64 random data, 4 K-tiles, accumulate
73bd21ce01  2026-05-28 11:46:12 +00:00  test: force 1 K-tile for HD=64 debug
abe1870429  2026-05-28 11:44:31 +00:00  test: HD=64 all-ones, expected S[0,j]=64 (unscaled) or 8.0 scaled
73f9ff98c9  2026-05-28 11:42:29 +00:00  test: UMMA QK HD=64 (4 K-tiles, accumulate) — multi-K-tile test
1874a70a6d  2026-05-28 11:39:15 +00:00  test: fix var ref
8426d13285  2026-05-28 11:38:22 +00:00  test: fix comparison — row 0 is S[0,c], rows 1-127 should be zero
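8426d13285 and 6f40fafa91 compare all 128 rows × 8 columns of the MMA output against the scalar reference, where only row 0 of the reference is nonzero. A minimal sketch of such a full-tile comparison, assuming a mixed absolute/relative tolerance is wanted because reduced-precision inputs (bf16 is an assumption here) will not match fp32 math bit for bit. `first_mismatch` is a hypothetical helper, not code from the log.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the flat index of the first element where `got` differs from `ref`
// by more than atol + rtol * |ref|, or -1 if every element passes.
// Written as !(err <= tol) so a NaN in `got` counts as a mismatch.
long first_mismatch(const std::vector<float>& got, const std::vector<float>& ref,
                    float atol, float rtol) {
    assert(got.size() == ref.size());
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float tol = atol + rtol * std::fabs(ref[i]);
        if (!(std::fabs(got[i] - ref[i]) <= tol)) return static_cast<long>(i);
    }
    return -1;
}
```

Checking every row, not just row 0, is what catches a stray nonzero in rows 1-127; the flat index converts back to (row, col) as (i / 8, i % 8) for an 8-column tile.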
6f40fafa91  2026-05-28 11:36:46 +00:00  test: verify ALL 128 rows × 8 cols match scalar reference
3c7d9d9303  2026-05-28 11:34:45 +00:00  test: apply 1/sqrt(HD) scale to MMA output — 4x was the scale factor, not a bug!
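The arithmetic behind 3c7d9d9303: with all-ones inputs the raw dot product over the head dimension is HD, and the 1/sqrt(HD) softmax scale turns it into sqrt(HD). An unscaled MMA output therefore exceeds a scaled reference by exactly sqrt(HD): 4 for HD=16 (the 0.25 in a048b56886) and 8 for HD=64 (the 64 vs 8.0 in abe1870429). A small check of those numbers, with hypothetical function names:

```cpp
#include <cassert>
#include <cmath>

// Expected S[i][j] when every element of Q and K is 1.0: the raw dot product
// over the head dimension is hd, and the 1/sqrt(hd) scale reduces it to sqrt(hd).
double expected_all_ones(int hd, bool scaled) {
    assert(hd > 0);
    const double raw = static_cast<double>(hd);  // hd products of 1 * 1
    return scaled ? raw / std::sqrt(raw) : raw;
}

// Factor by which an unscaled result exceeds the scaled reference: sqrt(hd).
double unscaled_over_scaled(int hd) {
    return expected_all_ones(hd, false) / expected_all_ones(hd, true);
}
```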
013f370046  2026-05-28 11:32:56 +00:00  test: all-ones data, expected S[0,j]=16.0 for every j
f5a0966afc  2026-05-28 11:30:19 +00:00  test: 4 warp leaders (lane==0) call MMA simultaneously
c01d6fddf4  2026-05-28 11:28:47 +00:00  test: gau-nernst pattern — fence::after_thread_sync, 4 warps, 128 threads, 32x32b.x8 loop
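c01d6fddf4 reads the accumulator with 4 warps and a 32x32b.x8 loop. A host-side model of which element each register receives, under these assumptions: M=128 with one CTA, so row i of D lives in TMEM lane i and output column j in TMEM column j; in the 32x32b shape each thread of a warp reads one lane, and .x8 gives it 8 consecutive 32-bit columns; warp w may only access lanes 32*(w % 4) through 32*(w % 4) + 31. `tmem_32x32b_x8_elem` and `Elem` are hypothetical names.

```cpp
#include <cassert>

// (row, col) of the accumulator element held in one register.
struct Elem {
    int row;
    int col;
};

// Element delivered to register `reg` of thread `lane` in warp `warp` by one
// 32x32b.x8 load starting at accumulator column `col_base`.
Elem tmem_32x32b_x8_elem(int warp, int lane, int reg, int col_base) {
    assert(warp >= 0 && lane >= 0 && lane < 32 && reg >= 0 && reg < 8);
    // Each warp owns a 32-lane quarter of TMEM; each thread owns one lane (row).
    return Elem{32 * (warp % 4) + lane, col_base + reg};
}
```

Under this model, one load per warp with col_base = 0 covers a 128 × 8 tile exactly once; a loop stepping col_base by 8 walks wider N.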
a048b56886  2026-05-28 10:23:06 +00:00  test: single-thread MMA + 0.25 scaling for 4× factor
57d67e6b51  2026-05-28 10:21:06 +00:00  test: revert to 64-bit descriptors, 4 warp leaders, 32x32b read
3f95f1c5d4  2026-05-28 10:11:38 +00:00  test: try LBO with block_mn=32 (1/4 of M=128)
d03e353972  2026-05-28 10:10:07 +00:00  test: 4 warp leaders call MMA (Layout D requires 4 warps)
8059ed15ad  2026-05-28 10:08:35 +00:00  test: explicitly zero padding between Q and K
9e98c067ab  2026-05-28 10:07:15 +00:00  test: Layout D TMEM read using 32x32b.x8 format, 4 warps
68d1a7920c  2026-05-28 10:04:17 +00:00  test: M=64 in both desc and idesc
0f51fda0da  2026-05-28 10:02:52 +00:00  test: try N=8 in idesc
4f7c9649fd  2026-05-28 10:01:39 +00:00  test: clean UMMA QK test, debug 4x factor, 8KB padding, 128 TMEM cols
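4f7c9649fd allocates 128 TMEM columns. As we understand the PTX rules, the column count passed to tcgen05.alloc must be a power of two between 32 and 512, and an fp32 accumulator with M=128 uses one 32-bit TMEM column per output column N (one lane per row). A sketch of picking the smallest legal allocation under those assumptions; `tmem_alloc_cols` is a hypothetical helper.

```cpp
#include <cassert>

// Smallest legal TMEM allocation (in 32-bit columns) for an fp32 accumulator
// n_cols wide, assuming allocations must be a power of two in [32, 512].
int tmem_alloc_cols(int n_cols) {
    assert(n_cols > 0 && n_cols <= 512);
    int cols = 32;                     // minimum allocation
    while (cols < n_cols) cols *= 2;   // round up to the next power of two
    return cols;
}
```

Under these assumptions, N=8 (as tried in 0f51fda0da) still needs a 32-column allocation, and 128 columns covers any N up to 128.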
ac65ece33b  2026-05-28 09:59:43 +00:00  test: TMEM 2-store with fence outside wid guard, 64 threads
2c89eea6be  2026-05-28 09:58:51 +00:00  test: fence+sync between 2 tmem_stores
24c5afe1dc  2026-05-28 09:57:53 +00:00  test: 64 threads, 2 stores to col 0