D5b MILESTONE: SWA+sink merge works! cos 0.969

- Run FMHA twice (compressed KV + SWA KV) with normalized O + LSE
- Merge with sink weights in Python
- LSE err=0.0, merge cos=0.969 PASS
- Update STAGE_D.md: D5b done, D5c/D5d are optimizations
2026-05-23 21:36:26 +00:00
parent 3891f00b9a
commit 0fe8bc7355

@@ -146,7 +146,12 @@ acc_vec = cute.math.fmin(cute.math.fmax(acc_vec, -swiglu_limit), swiglu_limit)
3. **D5c:** Fuse two passes into one kernel launch (Q stays in SMEM, two sequential MMA loops). Pure optimization.
4. **D5d:** Fuse sink merge into kernel epilogue. Pure optimization.
**Status:** 🟡 D5a DONE (May 23, 2026). `normalize` flag added, LSE output works (err=0.000000). Un-norm O cosine 0.963 (TME-P layout mismatch in epilogue_tma_store). D5b (Python SWA+sink merge) is NEXT.
**Status:** 🟢 D5b DONE (May 23, 2026). Pipeline works at hd=64:
- Run FMHA (normalize=True, LSE output) for compressed KV → O_comp, lse_comp
- Run FMHA (normalize=True, LSE output) for SWA KV → O_swa, lse_swa
- Merge: `O = (exp(lse1)*O1 + exp(sink)*exp(lse2)*O2) / (exp(lse1) + exp(sink)*exp(lse2))`
- Merge cos 0.969, individual attention cos 0.973/0.970, LSE err=0.0
- D5c (fused kernel) and D5d (fused epilogue) are pure optimizations.
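
A minimal sketch of the merge step above, for a single query row in plain Python. It assumes `O1`/`lse1` are the compressed-KV pass and `O2`/`lse2` the SWA pass (inferred from the order of the bullets), and that `sink` is a scalar logit applied to the SWA branch exactly as the formula is written. The function names (`attend`, `merge_two_passes`, `cosine`) are hypothetical helpers, not part of the FMHA kernel. The larger of the two exponents is subtracted before exponentiating, which gives the same result as the formula but avoids overflow.

```python
import math

def attend(q, keys, values):
    """Reference softmax attention for one query row.
    Returns the normalized output and the log-sum-exp of the scores."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(values[0])
    out = [sum(e * v[d] for e, v in zip(exps, values)) / z for d in range(dim)]
    return out, m + math.log(z)

def merge_two_passes(o1, lse1, o2, lse2, sink=0.0):
    """O = (exp(lse1)*O1 + exp(sink)*exp(lse2)*O2) / (exp(lse1) + exp(sink)*exp(lse2)).
    Assumption: pass 1 = compressed KV, pass 2 = SWA KV, sink is a scalar logit."""
    a1 = lse1
    a2 = lse2 + sink
    m = max(a1, a2)            # subtract the max so exp() cannot overflow
    w1 = math.exp(a1 - m)
    w2 = math.exp(a2 - m)
    denom = w1 + w2
    return [(w1 * x + w2 * y) / denom for x, y in zip(o1, o2)]

def cosine(a, b):
    """Cosine similarity, the metric reported in the status lines."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy data (made up for illustration): two key/value sets for one query.
q = [0.5, -1.0]
k_comp, v_comp = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
k_swa, v_swa = [[0.5, 0.5]], [[-1.0, 0.5]]

o_comp, lse_comp = attend(q, k_comp, v_comp)
o_swa, lse_swa = attend(q, k_swa, v_swa)
merged = merge_two_passes(o_comp, lse_comp, o_swa, lse_swa, sink=0.0)
full, _ = attend(q, k_comp + k_swa, v_comp + v_swa)
```

With `sink=0.0` the merged output should match single-pass attention over the concatenated keys, which makes `cosine(merged, full)` a quick sanity check before comparing against kernel output.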
### CG-4: Inverse RoPE Verification ⚠️ HIGH