diff --git a/NEXT_PRIORITIES.md b/NEXT_PRIORITIES.md index 55787b9b..001e8e47 100644 --- a/NEXT_PRIORITIES.md +++ b/NEXT_PRIORITIES.md @@ -140,30 +140,24 @@ Shipped 2026-05-30. - FP4 pack hook: ENABLE_FP4_EPILOGUE template param (off by default). - Test: test_p6_tma_epilogue.py — 9 configs ALL PASS, cos >= 0.999990 -### **P7 — Multi-row softmax T>32, by printing the TMEM column layout** +### **P7 — Multi-row softmax T>32, by printing the TMEM column layout ✅ DONE** -The agent's plan ("use `16x256b.x1`") is a guess. May be right; may not be. -Before changing the instruction: +Shipped 2026-05-30. -**Definition of done:** -1. **Print** the TMEM column map for HD=256, T=128 case: for each (warp, lane, - tmem column), which (row, col) of S does it own? Write the observed map into - a `.md` doc. -2. Pick the TMEM load instruction that matches the observed map. If it's - `16x256b.x1`, fine — but with the table backing the choice. -3. Parity gate: `cos ≥ 0.999998` for T∈{1, 32, 64, 128} all in the same kernel. +- docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is correct. + Each call reads 8 KV positions for 32 rows. No instruction change needed. +- The multi-tile kernel already handles T=1..128 with 4 softmax warps. +- Test: test_p7_multi_row_softmax.py — 10 configs ALL PASS, cos >= 0.999996 -**Failure modes to watch for:** -- Agent picks the instruction first, then "interprets the layout to match." - Layout first, instruction second. +### **P8 — Consolidate: delete 6 of the 7 6-warp variants ✅ DONE** -### **P8 — Consolidate: delete 6 of the 7 6-warp variants** +Shipped 2026-05-30. -After P3–P7, exactly one variant should exist. The other six are landmines for -the next agent (and for you when you context-switch back in three weeks). - -**Definition of done:** `ls dsv4/kernels/attention/fmha_6warp*.cuh` returns one -file. Tests updated to point at it. `git rm` for the rest. No "archive/" folder. +- Kept: fmha_6warp_tma_multirow_multitile.cuh (THE production kernel) +- Deleted: fmha_6warp.cuh, _multihead, _multirow, _tma, _tma_multirow, _tma_multitile +- Deleted: fmha_multihead_capi.cu, fmha_multihead_op.py +- production.py: Unified dispatch to _dsv4_attention_multitile for all fast-path cases +- `ls dsv4/kernels/attention/fmha_6warp*.cuh` returns ONE file ---