| Commit | Message | Date |
|---|---|---|
| 6a3159dfd9 | test: PV then QK to find ordering issue | 2026-05-28 13:53:35 +00:00 |
| 640233cb87 | test: PV GEMM first (before QK) to test ordering | 2026-05-28 13:52:18 +00:00 |
| d4ed3fa06f | test: QK GEMM + PV GEMM combined test | 2026-05-28 13:50:47 +00:00 |
| 7cb133c5bf | test: exact copy of working TS test (verify it still passes) | 2026-05-28 13:49:04 +00:00 |
| 0dcaa648b3 | test: properly aligned V SMEM buffer | 2026-05-28 13:47:47 +00:00 |
| 1c14ada386 | test: write O to tb (overwriting P), same as isolated test | 2026-05-28 13:46:04 +00:00 |
| a944f90040 | test: match isolated TS test exactly (V=all-1, BLOCK_MN=16) | 2026-05-28 13:44:36 +00:00 |
| 482328160a | test: single PV K-tile debug | 2026-05-28 13:43:24 +00:00 |
| 3a40ed6d69 | test: skip QK+softmax, write P directly to TMEM for PV debug | 2026-05-28 13:41:50 +00:00 |
| f24bc583dc | test: zero O TMEM before PV GEMM | 2026-05-28 13:40:30 +00:00 |
| 2885b3f2ed | test: full FMHA HD=16 with PV GEMM via tcgen05.mma TS | 2026-05-28 13:39:34 +00:00 |
| dc2130cb12 | test: cleanup TS MMA test | 2026-05-28 13:38:07 +00:00 |
| a767e90a12 | test: B=2.0 to understand TS MMA scale factor | 2026-05-28 13:36:30 +00:00 |
| b7c6971720 | test: use 32x32b.x8 for A write (avoids 16x256b misalign) | 2026-05-28 13:34:50 +00:00 |
| a7c81d66ba | test: step-by-step TMEM write/read debug for TS MMA | 2026-05-28 13:33:36 +00:00 |
| c05cc1ac93 | test: separate TMEM regions for A and C in TS MMA | 2026-05-28 13:32:22 +00:00 |
| 37a502e476 | test: minimal tcgen05.mma TS debug (PV GEMM) | 2026-05-28 13:31:18 +00:00 |
| efa03f53d4 | docs: update CURRENT_ISSUE and MEMORY — full FMHA HD=64 pipeline working | 2026-05-28 13:11:32 +00:00 |
| 654a2ae7f4 | test: merge softmax+PV into single warp0 block (s_vals scope fix) | 2026-05-28 13:10:02 +00:00 |
| 5c9e3c41af | test: full FMHA HD=64 — QK+softmax+PV(register math) | 2026-05-28 13:09:20 +00:00 |
| 0672373e51 | test: debug — just QK+softmax+P read (no PV) | 2026-05-28 13:08:06 +00:00 |
| 5d75decd57 | test: full FMHA HD=16 — PV via register math (decode T=1) | 2026-05-28 13:06:52 +00:00 |
| f62772992b | test: full FMHA HD=16 with PV GEMM (separate TMEM for P and O) | 2026-05-28 13:05:27 +00:00 |
| bd15bce853 | test: HD=16 QK+softmax (no PV) | 2026-05-28 13:04:10 +00:00 |
| 38d7bcd776 | test: HD=16 FMHA softmax only (skip PV for now) | 2026-05-28 13:03:06 +00:00 |
| 834d682443 | test: full FMHA HD=16 pipeline (QK→softmax→PV→epilogue) | 2026-05-28 13:02:00 +00:00 |
| 3b8be4b2db | test: FMHA softmax (QK→read S→softmax→write P→read P→verify) | 2026-05-28 13:00:37 +00:00 |
| c936940428 | test: separate (128,16) SMEM per K-tile with correct source stride | 2026-05-28 12:57:38 +00:00 |
| f244c4fdd2 | test: single-thread MMA (tid==0) for Layout D | 2026-05-28 12:56:39 +00:00 |
| ba2e390e1e | test: debug single K-tile from full (128,64) SMEM | 2026-05-28 12:55:52 +00:00 |
| a7e8b483cd | test: HD=64 multi-K-tile with correct source stride in SMEM writes | 2026-05-28 12:54:57 +00:00 |
| 926ae5d7bf | test: fix K source stride mismatch in manual SMEM write | 2026-05-28 12:54:03 +00:00 |
| 7d16a30cb6 | test: exact HD=16 pattern with HD=64 data | 2026-05-28 12:53:13 +00:00 |
| db4f661843 | test: debug with (128,16) SMEM matching HD=16 exactly | 2026-05-28 12:52:19 +00:00 |
| b703dc0a50 | test: debug single K-tile with offset descriptor | 2026-05-28 12:51:33 +00:00 |
| 435ca037cf | test: use accumulate=false for first K-tile, skip TMEM zero | 2026-05-28 12:50:44 +00:00 |
| e8ac2120ad | test: HD=64 QK with contiguous SMEM + offset descriptors | 2026-05-28 12:50:07 +00:00 |
| 1c01e8e412 | test: fix inline asm line continuation for nvcc | 2026-05-28 12:48:45 +00:00 |
| 71c774027c | test: fix HD=64 QK — zero TMEM, fence after MMA, single-thread MMA call | 2026-05-28 12:47:51 +00:00 |
| 1bf76388c8 | test: always accumulate, separate SMEM per K-tile, TMEM starts at 0 | 2026-05-28 12:23:47 +00:00 |
| 8707f555c2 | test: add extra syncwarp + syncthreads for MMA safety | 2026-05-28 12:20:01 +00:00 |
| 5a65d46c26 | test: HD=64 with separate SMEM per K-tile — no offset descriptors needed | 2026-05-28 12:18:06 +00:00 |
| 526fafb808 | test: revert volatile, fix wid==0, full 4 K-tiles | 2026-05-28 12:16:09 +00:00 |
| de879342dd | test: 1 K-tile, volatile writes, verify SMEM | 2026-05-28 12:13:23 +00:00 |
| bd6440fd83 | test: volatile SMEM writes + 2 K-tiles | 2026-05-28 12:11:47 +00:00 |
| c2e41a858e | test: force 2 K-tiles for debug | 2026-05-28 12:09:45 +00:00 |
| 8b2200a6d3 | test: HD=64 full 4 K-tile accumulate + full-HD scalar reference | 2026-05-28 12:07:50 +00:00 |
| afb18caf2d | test: clean HD=64, 1 K-tile only, verify SMEM writes + compare vs scalar | 2026-05-28 12:04:54 +00:00 |
| e587e26b06 | test: log canonical indices we write Q to | 2026-05-28 12:01:28 +00:00 |
| facd509c3c | test: remove sanity check (zeroing loop overwrites), fix verify offsets | 2026-05-28 11:59:08 +00:00 |