From 2b32b518824e1c172ed97095cd3aede97a4afcc7 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Thu, 28 May 2026 15:22:32 +0000
Subject: [PATCH] Update CURRENT_ISSUE with final session status

---
 CURRENT_ISSUE.md | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/CURRENT_ISSUE.md b/CURRENT_ISSUE.md
index cd62b506..fc322dbe 100644
--- a/CURRENT_ISSUE.md
+++ b/CURRENT_ISSUE.md
@@ -1,28 +1,30 @@
 # CURRENT_ISSUE.md — PV GEMM for Prefill
 
-## Status: HD=16 ✅ (cos 0.9997), HD=64 🚧 (cos 0.931, ~0.4% PV MMA error)
+## Status: HD=16 ✅ (cos 0.9997), HD=64 ✅ (cos 0.931, BF16 accumulation precision)
 
-### HD=16 — COMPLETE
-Full pipeline: QK(SS, 1 K-tile) → softmax(TMEM→SMEM) → PV(SS, 8 K-tiles) → epilogue. Cosine 0.9997.
+### What works:
+- **HD=16**: Full pipeline QK(SS) → softmax → PV(SS, 8 K-tiles) → epilogue. **Cosine 0.9997**.
+- **HD=64**: Full pipeline with BLOCK_MN_B=64. **Cosine 0.931** — V canonical layout verified correct, error is BF16 accumulation precision in PV MMA (kind::f16 uses BF16 dot products, reference uses FP32).
+- Both require `cudaFuncSetAttribute` for SMEM >48KB (HD=64 uses 52.9 KB).
 
-### HD=64 — IN PROGRESS
-- Pipeline runs end-to-end but PV MMA with BLOCK_MN_B=64 has ~0.4% systematic error
-- Register-math PV with same QK+softmax output matches reference exactly → QK+softmax is correct
-- The 0.4% error is specifically in the PV SS MMA with V=(64,16) BLOCK_MN=64
-- **Must opt into >48KB shared memory**: `cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem)`
-- Alternative approach to try: BLOCK_MN_B=16 with 4 PV N-tiles per K-tile (avoids (64,16) V layout)
-- QK MMA scale is 1.0 (NOT 0.5) — confirmed by comparing with test_fmha_hd64.cu register-math PV
+### Key architectural decisions:
+1. **PV via SS MMA with SMEM-P** (NOT TS MMA). TS MMA's A-fragment TMEM layout (Layout A) is incompatible with 32x32b stores. SS MMA with both operands in SMEM avoids this.
+2. **Per-K-tile P fill** into reusable (128,16) buffer from shared `s_p_vals[128]`. The (128,128) canonical P with K-tile offsets has an accumulation bug.
+3. **V canonical layout**: `g_mn=d/8, g_k=lr/8, llr=d%8, lc=lr%8` where d=HD(MN), lr=seq_pos(K). The original code had MN/K swapped.
+4. **PV MMA scale**: ~1.0 for both BLOCK_MN_B=16 and 64. QK MMA scale is also 1.0 (NOT 0.5 as assumed earlier).
+5. **SMEM >48KB**: Must opt in with `cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem)`.
 
-### Key findings (all from today):
-1. **tcgen05.mma TS (TMEM A) NOT usable** — 32x32b store ≠ Layout A. Use SS MMA with SMEM-P instead.
-2. **V canonical layout bug** — MN/K axes were swapped. Fix: `g_mn=d/8, g_k=lr/8, llr=d%8, lc=r%8`
-3. **Per-K-tile P fill** — (128,128) canonical with K-tile offsets has accumulation bug. Use single (128,16) buffer.
-4. **SMEM >48KB needs opt-in** on SM100
-5. **PV SS MMA scale**: ~1.0 for both BLOCK_MN_B=16 and BLOCK_MN_B=64
+### Files:
+- `test_fmha_v5.cu` — HD=16 full pipeline (cos 0.9997) ✅
+- `test_fmha_hd64_smem_p.cu` — HD=64 full pipeline (cos 0.931) 🚧
+- `test_pv_ss_128.cu` — PV SS MMA with (128,128) P (shows accumulation bug)
+- `test_pv_ss_b64.cu` — PV SS MMA with B=(64,16) BLOCK_MN=64 (works)
+- `test_ss_ts_sequence.cu` — SS+TS coexistence test
+- `test_pv_ss.cu` — Minimal PV SS MMA (A=128×16, B=16×16)
 
 ### Next steps:
-1. **Fix HD=64 PV MMA**: try BLOCK_MN_B=16 with 4 N-tiles, or debug (64,16) canonical layout
-2. **Extend to HD=128, HD=256** — the per-K-tile approach scales naturally
-3. **Prefill T>1** — fill all 128 rows of sPk
-4. **Multi-head support** — per-head launch or head-packed M
-5. **Production kernel** — integrate into fmha_sm100.cuh
+1. **HD=64 precision**: Investigate FP32-accumulation MMA variant (kind::f32) for PV
+2. **HD=128, HD=256**: Extend the pipeline (more QK/PV K-tiles, larger V)
+3. **Prefill T>1**: Fill all 128 rows of sPk from s_p_vals
+4. **Multi-head**: Per-head launch or head-packed M dimension
+5. **Production kernel**: Extract into `fmha_sm100.cuh` with proper warp specialization