diff --git a/STAGE_D.md b/STAGE_D.md
index 4b271341..bec26390 100644
--- a/STAGE_D.md
+++ b/STAGE_D.md
@@ -147,10 +147,11 @@ acc_vec = cute.math.fmin(cute.math.fmax(acc_vec, -swiglu_limit), swiglu_limit)
 4. **D5d:** Fuse sink merge into kernel epilogue. Pure optimization.
 
 **Status:** 🟢 D5b DONE (May 23, 2026). Pipeline works at hd=64:
-- Run FMHA (normalize=True, LSE output) for compressed KV → O_comp, lse_comp
-- Run FMHA (normalize=True, LSE output) for SWA KV → O_swa, lse_swa
-- Merge: `O = (exp(lse1)*O1 + exp(sink)*exp(lse2)*O2) / (exp(lse1) + exp(sink)*exp(lse2))`
-- Merge cos 0.969, individual attention cos 0.973/0.970, LSE err=0.0
+- Run FMHA (normalize=False, LSE output) for compressed KV → O_unnorm_comp, lse_comp
+- Run FMHA (normalize=False, LSE output) for SWA KV → O_unnorm_swa, lse_swa
+- Un-normalized merge: `O = (O_unnorm_comp + exp(sink)*O_unnorm_swa) / (exp(lse1) + exp(sink)*exp(lse2))`
+- Merge cos 0.961, individual attention cos 0.963/0.960, LSE err=0.000000
+- LSE formula verified: `lse = ln(row_sum) + row_max * ln(2)` (row_max in scale_log2 domain)
 - D5c (fused kernel) and D5d (fused epilogue) are pure optimizations.
 
 ### CG-4: Inverse RoPE Verification ⚠️ HIGH
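
Review note: the hunk swaps the normalized merge for an un-normalized one and adds an LSE identity. The sketch below is a pure-Python check, not the kernel code, that the two merge formulas agree and that the LSE identity holds. It assumes that "un-normalized" means `O_unnorm = exp(lse) * O_norm` (no row-max factor folded into the output), that `lse1`/`lse2` are `lse_comp`/`lse_swa`, and that `sink` is a scalar log-weight on the SWA branch. The helper names (`attend`, `merge_unnorm`, `merge_norm`, `lse_from_log2_stats`) and the toy data are hypothetical.

```python
import math

def attend(scores, values):
    """Return (un-normalized output, lse) for one query row.

    Assumption: un-normalized output is sum_j exp(s_j) * v_j, so that
    O_unnorm == exp(lse) * O_norm.
    """
    dim = len(values[0])
    weights = [math.exp(s) for s in scores]
    o_unnorm = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return o_unnorm, math.log(sum(weights))

def merge_unnorm(o1, lse1, o2, lse2, sink):
    """Merge from the '+' lines: inputs are un-normalized outputs."""
    denom = math.exp(lse1) + math.exp(sink) * math.exp(lse2)
    return [(a + math.exp(sink) * b) / denom for a, b in zip(o1, o2)]

def merge_norm(o1, lse1, o2, lse2, sink):
    """Merge from the '-' lines: inputs are normalized outputs."""
    w1, w2 = math.exp(lse1), math.exp(sink) * math.exp(lse2)
    return [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(o1, o2)]

def lse_from_log2_stats(scores):
    """lse = ln(row_sum) + row_max * ln(2), with row_max in the log2 domain."""
    scaled = [s * math.log2(math.e) for s in scores]     # scores * log2(e)
    row_max = max(scaled)
    row_sum = sum(2.0 ** (s - row_max) for s in scaled)  # exp2 accumulation
    return math.log(row_sum) + row_max * math.log(2.0)

# Toy single-query data (hypothetical values, head dim 2).
comp_scores, comp_values = [0.3, -1.2, 0.8], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
swa_scores, swa_values = [1.1, -0.4], [[-1.0, 2.0], [0.25, -0.75]]
sink = -0.7

o_c, lse_c = attend(comp_scores, comp_values)
o_s, lse_s = attend(swa_scores, swa_values)
merged = merge_unnorm(o_c, lse_c, o_s, lse_s, sink)

# Reference: normalize each branch first, then apply the old formula.
o_c_norm = [x / math.exp(lse_c) for x in o_c]
o_s_norm = [x / math.exp(lse_s) for x in o_s]
merged_ref = merge_norm(o_c_norm, lse_c, o_s_norm, lse_s, sink)
```

Under these assumptions the two merges are algebraically identical, so `merged` and `merged_ref` should match to floating-point rounding. If the kernel's `normalize=False` output is instead scaled relative to the row max, an extra rescale factor per branch would be needed; the diff does not say which convention applies.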