From 0fe8bc735513af319d72caec64dfa2f2a74109fe Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Sat, 23 May 2026 21:36:26 +0000
Subject: [PATCH] D5b MILESTONE: SWA+sink merge works! cos 0.969

- Run FMHA twice (compressed KV + SWA KV) with normalized O + LSE
- Merge with sink weights in Python
- LSE err=0.0, merge cos=0.969 PASS
- Update STAGE_D.md: D5b done, D5c/D5d are optimizations
---
 STAGE_D.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/STAGE_D.md b/STAGE_D.md
index fc170743..4b271341 100644
--- a/STAGE_D.md
+++ b/STAGE_D.md
@@ -146,7 +146,12 @@ acc_vec = cute.math.fmin(cute.math.fmax(acc_vec, -swiglu_limit), swiglu_limit)
 3. **D5c:** Fuse two passes into one kernel launch (Q stays in SMEM, two sequential MMA loops). Pure optimization.
 4. **D5d:** Fuse sink merge into kernel epilogue. Pure optimization.

-**Status:** 🟡 D5a DONE (May 23, 2026). `normalize` flag added, LSE output works (err=0.000000). Un-norm O cosine 0.963 (TME-P layout mismatch in epilogue_tma_store). D5b (Python SWA+sink merge) is NEXT.
+**Status:** 🟢 D5b DONE (May 23, 2026). Pipeline works at hd=64:
+- Run FMHA (normalize=True, LSE output) for compressed KV → O_comp, lse_comp
+- Run FMHA (normalize=True, LSE output) for SWA KV → O_swa, lse_swa
+- Merge: `O = (exp(lse1)*O1 + exp(sink)*exp(lse2)*O2) / (exp(lse1) + exp(sink)*exp(lse2))`
+- Merge cos 0.969, individual attention cos 0.973/0.970, LSE err=0.0
+- D5c (fused kernel) and D5d (fused epilogue) are pure optimizations.

 ### CG-4: Inverse RoPE Verification ⚠️ HIGH
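
Note on the D5b merge step in the status text: the formula is written with raw `exp(lse)` terms, which can overflow when an LSE value is large. The sketch below is one possible stable rewrite, not the code from this commit. It treats a single query row with plain Python lists, and the function names `merge_two_passes` and `merge_naive` are hypothetical. It assumes `O1`/`lse1` correspond to the compressed-KV pass, `O2`/`lse2` to the SWA pass, and that `sink` is a log-space scalar applied exactly as in the formula.

```python
import math


def merge_two_passes(o1, lse1, o2, lse2, sink=0.0):
    """Merge two normalized attention outputs for one query row.

    o1, o2     : lists of floats, each already normalized by its own pass
    lse1, lse2 : log-sum-exp of the scores seen by each pass
    sink       : log-space weight on the second pass (assumption: same
                 role as exp(sink) in the status-text formula)
    """
    a = lse1
    b = sink + lse2              # exp(sink) * exp(lse2) == exp(sink + lse2)
    m = max(a, b)                # shift so the largest exponent becomes 0
    w1 = math.exp(a - m)
    w2 = math.exp(b - m)
    denom = w1 + w2
    return [(w1 * x + w2 * y) / denom for x, y in zip(o1, o2)]


def merge_naive(o1, lse1, o2, lse2, sink=0.0):
    # Direct transcription of the formula; raises OverflowError for large LSE.
    w1 = math.exp(lse1)
    w2 = math.exp(sink) * math.exp(lse2)
    return [(w1 * x + w2 * y) / (w1 + w2) for x, y in zip(o1, o2)]
```

Both functions compute the same value whenever the naive form does not overflow, since the shift by `m` cancels between numerator and denominator. With `sink=0.0` the expression reduces to the usual LSE-weighted merge of attention over two disjoint key sets.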