D5b MILESTONE: SWA+sink merge works! cos 0.969

- Run FMHA twice (compressed KV + SWA KV) with normalized O + LSE
- Merge with sink weights in Python
- LSE err=0.0, merge cos=0.969 PASS
- Update STAGE_D.md: D5b done, D5c/D5d are optimizations
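
For readers who want to see the merge step concretely, here is a minimal NumPy sketch of the formula recorded in the STAGE_D.md change. It is not the code from this commit. The function name `merge_with_sink`, the argument names, and the tensor shapes are hypothetical, and the sketch assumes the LSE values are natural-log and that `sink` is a scalar (or broadcastable) logit applied to the SWA branch exactly as the formula writes it. Subtracting the running maximum before exponentiating is an addition for numerical stability and does not change the result.

```python
import numpy as np

def merge_with_sink(o_comp, lse_comp, o_swa, lse_swa, sink):
    """Merge two normalized attention outputs (hypothetical helper).

    Implements, in a max-subtracted form:
      O = (exp(lse1)*O1 + exp(sink)*exp(lse2)*O2) / (exp(lse1) + exp(sink)*exp(lse2))

    o_comp, o_swa:     arrays of shape (..., head_dim), already softmax-normalized
    lse_comp, lse_swa: arrays of shape (...), assumed natural-log LSE
    sink:              scalar or array broadcastable to the LSE shape (assumption)
    """
    o_comp, o_swa = np.asarray(o_comp), np.asarray(o_swa)
    a = np.asarray(lse_comp)             # log-weight of the compressed-KV pass
    b = np.asarray(lse_swa) + sink       # log-weight of the SWA pass, shifted by sink
    m = np.maximum(a, b)                 # shared max so exp() cannot overflow
    w_comp = np.exp(a - m)
    w_swa = np.exp(b - m)
    denom = w_comp + w_swa
    return (w_comp[..., None] * o_comp + w_swa[..., None] * o_swa) / denom[..., None]
```

With `sink = 0` this reduces to the standard two-block LSE merge, so the result should match a single attention pass over the concatenated KV; that makes a convenient sanity check before comparing against the reference.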
2026-05-23 21:36:26 +00:00
parent 3891f00b9a
commit 0fe8bc7355
1 changed files with 6 additions and 1 deletions
--- a/STAGE_D.md
+++ b/STAGE_D.md
@@ -146,7 +146,12 @@ acc_vec = cute.math.fmin(cute.math.fmax(acc_vec, -swiglu_limit), swiglu_limit)
 3. **D5c:** Fuse two passes into one kernel launch (Q stays in SMEM, two sequential MMA loops). Pure optimization.
 4. **D5d:** Fuse sink merge into kernel epilogue. Pure optimization.

-**Status:** 🟡 D5a DONE (May 23, 2026). `normalize` flag added, LSE output works (err=0.000000). Un-norm O cosine 0.963 (TME-P layout mismatch in epilogue_tma_store). D5b (Python SWA+sink merge) is NEXT.
+**Status:** 🟢 D5b DONE (May 23, 2026). Pipeline works at hd=64:
+- Run FMHA (normalize=True, LSE output) for compressed KV → O_comp, lse_comp
+- Run FMHA (normalize=True, LSE output) for SWA KV → O_swa, lse_swa
+- Merge: `O = (exp(lse_comp)*O_comp + exp(sink)*exp(lse_swa)*O_swa) / (exp(lse_comp) + exp(sink)*exp(lse_swa))`
+- Merge cos 0.969, individual attention cos 0.973/0.970, LSE err=0.0
+- D5c (fused kernel) and D5d (fused epilogue) are pure optimizations.

 ### CG-4: Inverse RoPE Verification ⚠️ HIGH