Update STAGE_D.md: D5a done, CG-2/CG-3 status updated, tOrP0 offset rule added

2026-05-23 21:16:52 +00:00
parent d02069094a
commit 6edb7a91a7
1 changed file with 3 additions and 2 deletions
--- a/STAGE_D.md
+++ b/STAGE_D.md
@@ -30,6 +30,7 @@
 - **`p_cols_fp32` uses `pv_mma_tiler[2]` (K-dim), NOT `pv_mma_tiler[1]` (N-dim).** We got this wrong twice.
 - **PV A-operand major mode is `OperandMajorMode.K` for TMEM-P.** Not `a_major` from Q.
 - **`tOrP0` uses 3-dim indexing `(None, None, kb)`, NOT 4-dim `(None, None, kb, 0)`.** The 4th mode was already sliced away by `tOrP_base[(None,None,None,0)]`.
+- **`tOrP0` MUST include the `tmem_p0_offset` column offset.** The softmax warps store P at `tmem_p0_offset=32` (FP32 columns = 64 BF16 elements), and the PV MMA must read from the same offset. Missing this offset causes NaN/zeros: the MMA reads from column 0, where S is, not column 32, where P is. Guard the conditional with `const_expr`: when `const_expr(self.tOrP0_offset > 0)` holds, build `tOrP0 = cute.make_tensor(tOrP.iterator + self.tOrP0_offset, tOrP.layout)`; otherwise use `tOrP0 = tOrP`.
 - **After every P store to TMEM, call `cute.arch.fence_view_async_tmem_store()`.** Missing this produces NaN.
 - **PRINT THE SHAPES. ALWAYS.** Run `print(f"tensor: shape={cute.shape(tensor)}")` inside `@cute.kernel` at trace time. Reasoning about layouts without evidence is how we waste days.

@@ -131,7 +132,7 @@ acc_vec = cute.math.fmin(cute.math.fmax(acc_vec, -swiglu_limit), swiglu_limit)

 **Why it matters:** This is the D1 work. The path forward is correct (`make_tiled_copy_C(store_atom, qk_mma)` to partition P registers for SMEM staging). But TMEM column budget at hd=512 must be verified first (see budget section above).

-**Status:** 🔴 D1.2–D1.3 TODO. This document IS the plan.
+**Status:** 🟡 D1.3 SMEM-P is still a stub. hd=64 TMEM-P works (cosine 0.973). `make_tiled_copy_C` gives a rank mismatch. Needs a proper layout-aware P register→SMEM copy.

 ### CG-3: SWA + Sink Merge Not Fused in FMHA ⚠️ CRITICAL

@@ -145,7 +146,7 @@ acc_vec = cute.math.fmin(cute.math.fmax(acc_vec, -swiglu_limit), swiglu_limit)
 3. **D5c:** Fuse two passes into one kernel launch (Q stays in SMEM, two sequential MMA loops). Pure optimization.
 4. **D5d:** Fuse sink merge into kernel epilogue. Pure optimization.

-**Status:** 🔴 D5 TODO. D5a must be done FIRST — it unblocks D5b which gives us correctness.
+**Status:** 🟡 D5a DONE (May 23, 2026). `normalize` flag added, LSE output works (err=0.000000). Unnormalized O cosine is 0.963 (TMEM-P layout mismatch in `epilogue_tma_store`). D5b (Python SWA+sink merge) is NEXT.

 ### CG-4: Inverse RoPE Verification ⚠️ HIGH