Commit Graph

1541 Commits

Author SHA1 Message Date
75bdcbf728 Debug: override P with uniform 1/128 2026-05-28 14:46:21 +00:00
af93c283c7 Enable all 8 PV K-tiles 2026-05-28 14:45:13 +00:00
6f5be8a4e4 Debug: print P values 2026-05-28 14:44:09 +00:00
3d15f5bb21 Debug: 1 PV K-tile 2026-05-28 14:43:01 +00:00
284a06ddf1 FMHA v5: clean rewrite with QK + softmax + PV SS per K-tile 2026-05-28 14:42:13 +00:00
342193e0b4 Fix tb scope 2026-05-28 14:40:55 +00:00
a6f7ef7c45 Add softmax read from TMEM 2026-05-28 14:40:35 +00:00
38b0ff0bf8 Add QK GEMM to minimal PV test 2026-05-28 14:39:51 +00:00
e9f8f9e6e3 Minimal PV with s_p_vals in SMEM 2026-05-28 14:38:58 +00:00
97ebb964a2 Move s_p_vals to dynamic SMEM 2026-05-28 14:38:03 +00:00
d2387dd858 Full FMHA v4: per-K-tile P fill into reusable (128,16) buffer 2026-05-28 14:37:11 +00:00
78b470317f PV accumulation debug with detailed TMEM read 2026-05-28 14:35:29 +00:00
dacbf53081 Test K-tiles 0-1 accumulated 2026-05-28 14:33:31 +00:00
bad31d9476 Test K-tile 1 2026-05-28 14:32:51 +00:00
9198ed734f Test 1 PV K-tile from (128,128) P at offset 0 2026-05-28 14:32:10 +00:00
ce88cd6e9e Zero TMEM manually, all K-tiles accumulate=true 2026-05-28 14:31:22 +00:00
727c509454 PV SS MMA with 8 K-tile accumulation 2026-05-28 14:30:09 +00:00
d5b0941f2e PV SS MMA with (128,128) P layout 2026-05-28 14:29:13 +00:00
f94693fdc2 Fix: add back cudaDeviceSynchronize 2026-05-28 14:28:24 +00:00
fb8af865f4 Check launch error 2026-05-28 14:28:02 +00:00
738e39cb63 Debug: add printf at kernel start 2026-05-28 14:27:12 +00:00
9e13096bf8 Debug: skip QK, write P directly to SMEM, 1 PV K-tile 2026-05-28 14:26:36 +00:00
11da4daa01 Debug: single PV K-tile 2026-05-28 14:25:47 +00:00
8cb32cabc9 Fix asm constraint typo 2026-05-28 14:25:04 +00:00
36a50962b3 Full FMHA SMEM-P with scale calibration 2026-05-28 14:24:53 +00:00
4a36da9845 Minimal PV SS MMA test: A=128x16, B=16x16 2026-05-28 14:23:42 +00:00
77901834a9 Fix P K-tile offset: 2048 BF16 per (128,16) tile, not 1024 2026-05-28 14:22:27 +00:00
0bfc943cec FMHA with SMEM-P approach: PV via SS MMA avoids TMEM layout mismatch 2026-05-28 14:21:42 +00:00
faeedd3643 Test TS MMA with non-uniform A data 2026-05-28 14:19:45 +00:00
570c5b5154 Test softmax→PV with 1 K-tile in isolation 2026-05-28 14:18:39 +00:00
a29ef77b64 QK→PV layout test: skip softmax to test TMEM layout compatibility 2026-05-28 14:17:37 +00:00
acf17e001e Fix SMEM allocation (was half the needed size) + re-enable full pipeline 2026-05-28 14:16:43 +00:00
fa6c124163 Debug: QK only, skip softmax+PV 2026-05-28 14:15:18 +00:00
79cee32125 Debug: skip PV step entirely 2026-05-28 14:14:34 +00:00
47e9b8a413 Debug: single PV K-tile 2026-05-28 14:13:57 +00:00
414b3f4f92 Full FMHA HD=16 with PV GEMM via tcgen05.mma TS 2026-05-28 14:13:11 +00:00
ed8f48dddf Add systematic SS+TS sequence test to debug MMA coexistence crash 2026-05-28 14:10:07 +00:00
6a3159dfd9 test: PV then QK to find ordering issue 2026-05-28 13:53:35 +00:00
640233cb87 test: PV GEMM first (before QK) to test ordering 2026-05-28 13:52:18 +00:00
d4ed3fa06f test: QK GEMM + PV GEMM combined test 2026-05-28 13:50:47 +00:00
7cb133c5bf test: exact copy of working TS test (verify it still passes) 2026-05-28 13:49:04 +00:00
0dcaa648b3 test: properly aligned V SMEM buffer 2026-05-28 13:47:47 +00:00
1c14ada386 test: write O to tb (overwriting P), same as isolated test 2026-05-28 13:46:04 +00:00
a944f90040 test: match isolated TS test exactly (V=all-1, BLOCK_MN=16) 2026-05-28 13:44:36 +00:00
482328160a test: single PV K-tile debug 2026-05-28 13:43:24 +00:00
3a40ed6d69 test: skip QK+softmax, write P directly to TMEM for PV debug 2026-05-28 13:41:50 +00:00
f24bc583dc test: zero O TMEM before PV GEMM 2026-05-28 13:40:30 +00:00
2885b3f2ed test: full FMHA HD=16 with PV GEMM via tcgen05.mma TS 2026-05-28 13:39:34 +00:00
dc2130cb12 test: cleanup TS MMA test 2026-05-28 13:38:07 +00:00
a767e90a12 test: B=2.0 to understand TS MMA scale factor 2026-05-28 13:36:30 +00:00