diff --git a/STAGE_D.md b/STAGE_D.md
index adeb1b62..da83cd5c 100644
--- a/STAGE_D.md
+++ b/STAGE_D.md
@@ -1,5 +1,22 @@
 # Stage D — Parameterized FMHA for DSV4
+
+## 🎉 VICTORY: D1.3 SOLVED! (2026-05-23)
+
+**After intensive debugging, SMEM-P rank mismatch issue resolved!**
+
+**Problem:** SMEM-P copy failed with "Expected source and destination tensors to have the same rank, but got 5 and 3"
+
+**Root Cause:** tensor used TMEM layout () with extra singleton modes, while SMEM copy expected QK C-fragment layout.
+
+**Solution:** Create tensor viewing same data with QK C-fragment layout ():
+
+
+**Impact:** Enables hd>64 support (128, 256, 512). Multi-PV-tile works for hd=512 (2 tiles of 256 each).
+
+**Status:** Kernel compiles and runs for all head dimensions. SMEM-P path enabled for hd>64.
+
+
 
 ## ⚠️ IKEA INSTRUCTIONS — READ EVERY TIME BEFORE CODING
 
 ### The Workflow (DO NOT SKIP STEPS)