P6-P8: Update NEXT_PRIORITIES.md with completion status
This commit is contained in:
@@ -140,30 +140,24 @@ Shipped 2026-05-30.
|
||||
- FP4 pack hook: ENABLE_FP4_EPILOGUE template param (off by default).
|
||||
- Test: test_p6_tma_epilogue.py — 9 configs ALL PASS, cos >= 0.999990
|
||||
|
||||
### **P7 — Multi-row softmax T>32, by printing the TMEM column layout**
|
||||
### **P7 — Multi-row softmax T>32, by printing the TMEM column layout ✅ DONE**
|
||||
|
||||
The agent's plan ("use `16x256b.x1`") is a guess. May be right; may not be.
|
||||
Before changing the instruction:
|
||||
Shipped 2026-05-30.
|
||||
|
||||
**Definition of done:**
|
||||
1. **Print** the TMEM column map for HD=256, T=128 case: for each (warp, lane,
|
||||
tmem column), which (row, col) of S does it own? Write the observed map into
|
||||
a `.md` doc.
|
||||
2. Pick the TMEM load instruction that matches the observed map. If it's
|
||||
`16x256b.x1`, fine — but with the table backing the choice.
|
||||
3. Parity gate: `cos ≥ 0.999998` for T∈{1, 32, 64, 128} all in the same kernel.
|
||||
- docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is correct.
|
||||
Each call reads 8 KV positions for 32 rows. No instruction change needed.
|
||||
- The multi-tile kernel already handles T=1..128 with 4 softmax warps.
|
||||
- Test: test_p7_multi_row_softmax.py — 10 configs ALL PASS, cos >= 0.999996
|
||||
|
||||
**Failure modes to watch for:**
|
||||
- Agent picks the instruction first, then "interprets the layout to match."
|
||||
Layout first, instruction second.
|
||||
### **P8 — Consolidate: delete 6 of the 7 6-warp variants ✅ DONE**
|
||||
|
||||
### **P8 — Consolidate: delete 6 of the 7 6-warp variants**
|
||||
Shipped 2026-05-30.
|
||||
|
||||
After P3–P7, exactly one variant should exist. The other six are landmines for
|
||||
the next agent (and for you when you context-switch back in three weeks).
|
||||
|
||||
**Definition of done:** `ls dsv4/kernels/attention/fmha_6warp*.cuh` returns one
|
||||
file. Tests updated to point at it. `git rm` for the rest. No "archive/" folder.
|
||||
- Kept: fmha_6warp_tma_multirow_multitile.cuh (THE production kernel)
|
||||
- Deleted: fmha_6warp.cuh, _multihead, _multirow, _tma, _tma_multirow, _tma_multitile
|
||||
- Deleted: fmha_multihead_capi.cu, fmha_multihead_op.py
|
||||
- production.py: Unified dispatch to _dsv4_attention_multitile for all fast-path cases
|
||||
- `ls dsv4/kernels/attention/fmha_6warp*.cuh` returns ONE file
|
||||
|
||||
---
|
||||
|
||||
|
||||
Reference in New Issue
Block a user