Commit Graph

1518 Commits

SHA1 Message Date
8cb32cabc9 Fix asm constraint typo 2026-05-28 14:25:04 +00:00
36a50962b3 Full FMHA SMEM-P with scale calibration 2026-05-28 14:24:53 +00:00
4a36da9845 Minimal PV SS MMA test: A=128x16, B=16x16 2026-05-28 14:23:42 +00:00
77901834a9 Fix P K-tile offset: 2048 BF16 per (128,16) tile, not 1024 2026-05-28 14:22:27 +00:00
0bfc943cec FMHA with SMEM-P approach: PV via SS MMA avoids TMEM layout mismatch 2026-05-28 14:21:42 +00:00
faeedd3643 Test TS MMA with non-uniform A data 2026-05-28 14:19:45 +00:00
570c5b5154 Test softmax→PV with 1 K-tile in isolation 2026-05-28 14:18:39 +00:00
a29ef77b64 QK→PV layout test: skip softmax to test TMEM layout compatibility 2026-05-28 14:17:37 +00:00
acf17e001e Fix SMEM allocation (was half the needed size) + re-enable full pipeline 2026-05-28 14:16:43 +00:00
fa6c124163 Debug: QK only, skip softmax+PV 2026-05-28 14:15:18 +00:00
79cee32125 Debug: skip PV step entirely 2026-05-28 14:14:34 +00:00
47e9b8a413 Debug: single PV K-tile 2026-05-28 14:13:57 +00:00
414b3f4f92 Full FMHA HD=16 with PV GEMM via tcgen05.mma TS 2026-05-28 14:13:11 +00:00
ed8f48dddf Add systematic SS+TS sequence test to debug MMA coexistence crash 2026-05-28 14:10:07 +00:00
6a3159dfd9 test: PV then QK to find ordering issue 2026-05-28 13:53:35 +00:00
640233cb87 test: PV GEMM first (before QK) to test ordering 2026-05-28 13:52:18 +00:00
d4ed3fa06f test: QK GEMM + PV GEMM combined test 2026-05-28 13:50:47 +00:00
7cb133c5bf test: exact copy of working TS test (verify it still passes) 2026-05-28 13:49:04 +00:00
0dcaa648b3 test: properly aligned V SMEM buffer 2026-05-28 13:47:47 +00:00
1c14ada386 test: write O to tb (overwriting P), same as isolated test 2026-05-28 13:46:04 +00:00
a944f90040 test: match isolated TS test exactly (V=all-1, BLOCK_MN=16) 2026-05-28 13:44:36 +00:00
482328160a test: single PV K-tile debug 2026-05-28 13:43:24 +00:00
3a40ed6d69 test: skip QK+softmax, write P directly to TMEM for PV debug 2026-05-28 13:41:50 +00:00
f24bc583dc test: zero O TMEM before PV GEMM 2026-05-28 13:40:30 +00:00
2885b3f2ed test: full FMHA HD=16 with PV GEMM via tcgen05.mma TS 2026-05-28 13:39:34 +00:00
dc2130cb12 test: cleanup TS MMA test 2026-05-28 13:38:07 +00:00
a767e90a12 test: B=2.0 to understand TS MMA scale factor 2026-05-28 13:36:30 +00:00
b7c6971720 test: use 32x32b.x8 for A write (avoids 16x256b misalign) 2026-05-28 13:34:50 +00:00
a7c81d66ba test: step-by-step TMEM write/read debug for TS MMA 2026-05-28 13:33:36 +00:00
c05cc1ac93 test: separate TMEM regions for A and C in TS MMA 2026-05-28 13:32:22 +00:00
37a502e476 test: minimal tcgen05.mma TS debug (PV GEMM) 2026-05-28 13:31:18 +00:00
efa03f53d4 docs: update CURRENT_ISSUE and MEMORY — full FMHA HD=64 pipeline working 2026-05-28 13:11:32 +00:00
654a2ae7f4 test: merge softmax+PV into single warp0 block (s_vals scope fix) 2026-05-28 13:10:02 +00:00
5c9e3c41af test: full FMHA HD=64 — QK+softmax+PV(register math) 2026-05-28 13:09:20 +00:00
0672373e51 test: debug — just QK+softmax+P read (no PV) 2026-05-28 13:08:06 +00:00
5d75decd57 test: full FMHA HD=16 — PV via register math (decode T=1) 2026-05-28 13:06:52 +00:00
f62772992b test: full FMHA HD=16 with PV GEMM (separate TMEM for P and O) 2026-05-28 13:05:27 +00:00
bd15bce853 test: HD=16 QK+softmax (no PV) 2026-05-28 13:04:10 +00:00
38d7bcd776 test: HD=16 FMHA softmax only (skip PV for now) 2026-05-28 13:03:06 +00:00
834d682443 test: full FMHA HD=16 pipeline (QK→softmax→PV→epilogue) 2026-05-28 13:02:00 +00:00
3b8be4b2db test: FMHA softmax (QK→read S→softmax→write P→read P→verify) 2026-05-28 13:00:37 +00:00
c936940428 test: separate (128,16) SMEM per K-tile with correct source stride 2026-05-28 12:57:38 +00:00
f244c4fdd2 test: single-thread MMA (tid==0) for Layout D 2026-05-28 12:56:39 +00:00
ba2e390e1e test: debug single K-tile from full (128,64) SMEM 2026-05-28 12:55:52 +00:00
a7e8b483cd test: HD=64 multi-K-tile with correct source stride in SMEM writes 2026-05-28 12:54:57 +00:00
926ae5d7bf test: fix K source stride mismatch in manual SMEM write 2026-05-28 12:54:03 +00:00
7d16a30cb6 test: exact HD=16 pattern with HD=64 data 2026-05-28 12:53:13 +00:00
db4f661843 test: debug with (128,16) SMEM matching HD=16 exactly 2026-05-28 12:52:19 +00:00
b703dc0a50 test: debug single K-tile with offset descriptor 2026-05-28 12:51:33 +00:00
435ca037cf test: use accumulate=false for first K-tile, skip TMEM zero 2026-05-28 12:50:44 +00:00