|
|
78b470317f
|
PV accumulation debug with detailed TMEM read
|
2026-05-28 14:35:29 +00:00 |
|
|
|
dacbf53081
|
Test K-tiles 0-1 accumulated
|
2026-05-28 14:33:31 +00:00 |
|
|
|
bad31d9476
|
Test K-tile 1
|
2026-05-28 14:32:51 +00:00 |
|
|
|
9198ed734f
|
Test 1 PV K-tile from (128,128) P at offset 0
|
2026-05-28 14:32:10 +00:00 |
|
|
|
ce88cd6e9e
|
Zero TMEM manually, all K-tiles accumulate=true
|
2026-05-28 14:31:22 +00:00 |
|
|
|
727c509454
|
PV SS MMA with 8 K-tile accumulation
|
2026-05-28 14:30:09 +00:00 |
|
|
|
d5b0941f2e
|
PV SS MMA with (128,128) P layout
|
2026-05-28 14:29:13 +00:00 |
|
|
|
f94693fdc2
|
Fix: add back cudaDeviceSynchronize
|
2026-05-28 14:28:24 +00:00 |
|
|
|
fb8af865f4
|
Check launch error
|
2026-05-28 14:28:02 +00:00 |
|
|
|
738e39cb63
|
Debug: add printf at kernel start
|
2026-05-28 14:27:12 +00:00 |
|
|
|
9e13096bf8
|
Debug: skip QK, write P directly to SMEM, 1 PV K-tile
|
2026-05-28 14:26:36 +00:00 |
|
|
|
11da4daa01
|
Debug: single PV K-tile
|
2026-05-28 14:25:47 +00:00 |
|
|
|
8cb32cabc9
|
Fix asm constraint typo
|
2026-05-28 14:25:04 +00:00 |
|
|
|
36a50962b3
|
Full FMHA SMEM-P with scale calibration
|
2026-05-28 14:24:53 +00:00 |
|
|
|
4a36da9845
|
Minimal PV SS MMA test: A=128x16, B=16x16
|
2026-05-28 14:23:42 +00:00 |
|
|
|
77901834a9
|
Fix P K-tile offset: 2048 BF16 per (128,16) tile, not 1024
|
2026-05-28 14:22:27 +00:00 |
|
|
|
0bfc943cec
|
FMHA with SMEM-P approach: PV via SS MMA avoids TMEM layout mismatch
|
2026-05-28 14:21:42 +00:00 |
|
|
|
faeedd3643
|
Test TS MMA with non-uniform A data
|
2026-05-28 14:19:45 +00:00 |
|
|
|
570c5b5154
|
Test softmax→PV with 1 K-tile in isolation
|
2026-05-28 14:18:39 +00:00 |
|
|
|
a29ef77b64
|
QK→PV layout test: skip softmax to test TMEM layout compatibility
|
2026-05-28 14:17:37 +00:00 |
|
|
|
acf17e001e
|
Fix SMEM allocation (was half the needed size) + re-enable full pipeline
|
2026-05-28 14:16:43 +00:00 |
|
|
|
fa6c124163
|
Debug: QK only, skip softmax+PV
|
2026-05-28 14:15:18 +00:00 |
|
|
|
79cee32125
|
Debug: skip PV step entirely
|
2026-05-28 14:14:34 +00:00 |
|
|
|
47e9b8a413
|
Debug: single PV K-tile
|
2026-05-28 14:13:57 +00:00 |
|
|
|
414b3f4f92
|
Full FMHA HD=16 with PV GEMM via tcgen05.mma TS
|
2026-05-28 14:13:11 +00:00 |
|
|
|
ed8f48dddf
|
Add systematic SS+TS sequence test to debug MMA coexistence crash
|
2026-05-28 14:10:07 +00:00 |
|
|
|
6a3159dfd9
|
test: PV then QK to find ordering issue
|
2026-05-28 13:53:35 +00:00 |
|
|
|
640233cb87
|
test: PV GEMM first (before QK) to test ordering
|
2026-05-28 13:52:18 +00:00 |
|
|
|
d4ed3fa06f
|
test: QK GEMM + PV GEMM combined test
|
2026-05-28 13:50:47 +00:00 |
|
|
|
7cb133c5bf
|
test: exact copy of working TS test (verify it still passes)
|
2026-05-28 13:49:04 +00:00 |
|
|
|
0dcaa648b3
|
test: properly aligned V SMEM buffer
|
2026-05-28 13:47:47 +00:00 |
|
|
|
1c14ada386
|
test: write O to tb (overwriting P), same as isolated test
|
2026-05-28 13:46:04 +00:00 |
|
|
|
a944f90040
|
test: match isolated TS test exactly (V=all-1, BLOCK_MN=16)
|
2026-05-28 13:44:36 +00:00 |
|
|
|
482328160a
|
test: single PV K-tile debug
|
2026-05-28 13:43:24 +00:00 |
|
|
|
3a40ed6d69
|
test: skip QK+softmax, write P directly to TMEM for PV debug
|
2026-05-28 13:41:50 +00:00 |
|
|
|
f24bc583dc
|
test: zero O TMEM before PV GEMM
|
2026-05-28 13:40:30 +00:00 |
|
|
|
2885b3f2ed
|
test: full FMHA HD=16 with PV GEMM via tcgen05.mma TS
|
2026-05-28 13:39:34 +00:00 |
|
|
|
dc2130cb12
|
test: cleanup TS MMA test
|
2026-05-28 13:38:07 +00:00 |
|
|
|
a767e90a12
|
test: B=2.0 to understand TS MMA scale factor
|
2026-05-28 13:36:30 +00:00 |
|
|
|
b7c6971720
|
test: use 32x32b.x8 for A write (avoids 16x256b misalign)
|
2026-05-28 13:34:50 +00:00 |
|
|
|
a7c81d66ba
|
test: step-by-step TMEM write/read debug for TS MMA
|
2026-05-28 13:33:36 +00:00 |
|
|
|
c05cc1ac93
|
test: separate TMEM regions for A and C in TS MMA
|
2026-05-28 13:32:22 +00:00 |
|
|
|
37a502e476
|
test: minimal tcgen05.mma TS debug (PV GEMM)
|
2026-05-28 13:31:18 +00:00 |
|
|
|
efa03f53d4
|
docs: update CURRENT_ISSUE and MEMORY — full FMHA HD=64 pipeline working
|
2026-05-28 13:11:32 +00:00 |
|
|
|
654a2ae7f4
|
test: merge softmax+PV into single warp0 block (s_vals scope fix)
|
2026-05-28 13:10:02 +00:00 |
|
|
|
5c9e3c41af
|
test: full FMHA HD=64 — QK+softmax+PV(register math)
|
2026-05-28 13:09:20 +00:00 |
|
|
|
0672373e51
|
test: debug — just QK+softmax+P read (no PV)
|
2026-05-28 13:08:06 +00:00 |
|
|
|
5d75decd57
|
test: full FMHA HD=16 — PV via register math (decode T=1)
|
2026-05-28 13:06:52 +00:00 |
|
|
|
f62772992b
|
test: full FMHA HD=16 with PV GEMM (separate TMEM for P and O)
|
2026-05-28 13:05:27 +00:00 |
|
|
|
bd15bce853
|
test: HD=16 QK+softmax (no PV)
|
2026-05-28 13:04:10 +00:00 |
|