From bf2c7c8bb8fd1875f2b6fca95a4b39c99763a68d Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Tue, 26 May 2026 20:26:06 +0000
Subject: [PATCH] D1.5: Implement in-kernel O rescale via CUTLASS
 correction_rescale pattern

- Both load and store atoms built from SAME tOtO_i (composition-tiled)
- Same Repetition(corr_tile_size=16) for both copies
- pv_done_bar synchronization between MMA and softmax warps
- acc_scale computed per kt iteration, used to rescale O in TMEM
- const_expr(n_kv_tiles > 1) guards for zero overhead at s_k=128
- New test: test_d15_in_kernel_rescale.py (s_k=128/256/384)
- Minimal roundtrip test: test_tmem_roundtrip_minimal.py
---
 MAY_24_2026_PLAN_NEW.md                   | 261 +++++++++++++++++
 MAY_26_2026_PLAN.md                       | 224 ---------------
 ROADMAP.md                                | 324 +++++++++++++---------
 dsv4/kernels/attention/fmha.py            | 107 +++++--
 tests/unit/test_d15_in_kernel_rescale.py  | 152 ++++++++++
 tests/unit/test_tmem_roundtrip_minimal.py | 154 ++++++++++
 6 files changed, 848 insertions(+), 374 deletions(-)
 create mode 100644 MAY_24_2026_PLAN_NEW.md
 delete mode 100644 MAY_26_2026_PLAN.md
 create mode 100644 tests/unit/test_d15_in_kernel_rescale.py
 create mode 100644 tests/unit/test_tmem_roundtrip_minimal.py

diff --git a/MAY_24_2026_PLAN_NEW.md b/MAY_24_2026_PLAN_NEW.md
new file mode 100644
index 00000000..16e0e77b
--- /dev/null
+++ b/MAY_24_2026_PLAN_NEW.md
@@ -0,0 +1,261 @@
+# TMEM Round-Trip Investigation Plan
+
+**Purpose:** Determine whether ROADMAP Priority 8, Path A (CUTLASS atom replication) is achievable in CuTeDSL Python. If yes, this unlocks single-launch multi-tile FMHA. If no, we fall back to Path C (O in SMEM) with documented evidence.
+
+**Premise:** CUTLASS C++ Blackwell FMHA performs lossless TMEM round-trip for O rescale using `SM100_TMEM_LOAD_32dp32b_*x` and `SM100_TMEM_STORE_32dp32b_*x` paired atoms. The hardware supports this. The question is whether CuTeDSL Python exposes the matching atom + layout configuration.
+
+**Time budget:** 5 working days. If NO-OP round-trip can't be made bitwise-correct in that window, escalate to Path C.
+
+---
+
+## What we know vs what we don't
+
+### Known facts
+
+- The hardware supports lossless TMEM round-trip — CUTLASS C++ FMHA does it in production for correction_rescale.
+- Current FMHA code at `dsv4/kernels/attention/fmha.py` lines 407–416 uses `tcgen05.copy.Ld32x32bOp(Repetition(16))` + `tcgen05.copy.St32x32bOp(Repetition(16))` built independently from the same `tOtO_i` tensor via `make_tmem_copy`.
+- That configuration produces ~3% error on NO-OP round-trip (`test_d1_tmem_trip.py`).
+- The MoE kernel never does a TMEM round-trip, so the MoE epilogue pattern is not a template here.
+- The S load uses `Ld32x32bOp(Repetition(32))` and works correctly for the QK→softmax path (no round-trip required — load only).
+
+### Open questions (in priority order)
+
+1. **Does CUTLASS publish a CuTeDSL Python FMHA reference that does in-kernel O rescale?** If yes, the answer is in that file. (Path noted in README: `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py`.)
+2. What specific atom variants does CUTLASS C++ use for the round-trip — `32dp32b_1x` / `4x` / `16x` / etc., transposed or non-transposed?
+3. Does the FP32 accumulator layout require a specific Repetition value to round-trip cleanly?
+4. Are the load/store atoms supposed to be constructed from the same partitioning (`make_tmem_copy(load_atom, T)` and `make_tmem_copy(store_atom, T)` with the same T), or from each other via some derivation we haven't found?
+5. Is there an intermediate layout transformation (analogous to `transform_partitioned_tensor_layout`) that's required before round-tripping?
+6. Does CuTeDSL Python expose all the atom variants CUTLASS C++ uses? If not, which are missing?
+
+---
+
+## Phase 1: Read the CUTLASS reference (Day 1, half day)
+
+**Goal:** Find the exact rescale code in CUTLASS and document its atom + partition + copy pattern.
+
+### Steps
+
+- [ ] Locate `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py` on B200. Confirm it exists.
+- [ ] Search for the correction/rescale code path. Likely keywords: `correction`, `rescale`, `acc_scale`, `row_max`, `softmax_correction`.
+- [ ] Extract the rescale block end-to-end. Copy it into `notes/cutlass_rescale_excerpt.py` (gitignored).
+- [ ] Document each variable in that block:
+  - Source tensor for `make_tmem_copy`
+  - Atom name + parameters
+  - Repetition factor
+  - Thread/partition layout
+  - Any layout transforms applied before copy
+  - Fence calls
+- [ ] **Then look at the C++ reference for the same code path.** Local clone at `/home/openclaw/dev/cutlass`. File: likely under `include/cutlass/gemm/collective/sm100_fmha_*` or `examples/77_blackwell_fmha/`. The C++ may be more readable than the Python in places — the Python is a port of the C++ semantics.
+
+### If CuTeDSL Python reference doesn't exist or doesn't do rescale
+
+This is the "early bad news" outcome. Document it and escalate immediately — the entire premise of Path A becomes shakier. Still worth Phase 2, but treat it as exploratory rather than porting.
+
+### Deliverable
+
+`notes/cutlass_rescale_breakdown.md` — a side-by-side: CUTLASS pattern vs current FMHA pattern, with the specific differences highlighted. This file is gitignored (private notes) but its contents drive Phase 2.
+
+---
+
+## Phase 2: Standalone NO-OP round-trip test (Days 1–2)
+
+**Goal:** Build the minimal kernel that does *only* TMEM round-trip on a known FP32 buffer, and get cos = 1.0 (bitwise identity).
+
+This is the gating test. Until NO-OP round-trip passes in isolation, no amount of FMHA-specific rescale logic helps.
+
+### Minimum viable test kernel
+
+Create `tests/unit/test_tmem_roundtrip_minimal.py` with:
+
+```python
+# Pseudocode — exact API verified on B200
+class TmemRoundtripKernel:
+    def __init__(self, atom_load, atom_store, n_rows=128, n_cols=64):
+        self.atom_load = atom_load
+        self.atom_store = atom_store
+        self.n_rows = n_rows
+        self.n_cols = n_cols
+
+    @cute.jit
+    def __call__(self, src, dst, stream):
+        # Launch single CTA with 4 warps (matching FMHA epilogue warp count)
+        self._kernel(src, dst).launch(grid=(1,1,1), block=[128,1,1], stream=stream)
+
+    @cute.kernel
+    def _kernel(self, src, dst):
+        # 1. Allocate TMEM
+        tmem.allocate(power_of_2_round_up(self.n_cols))
+        tmem_ptr = tmem.retrieve_ptr(Float32)
+
+        # 2. Build a TMEM tensor at known offset
+        layout = make_tmem_layout(self.n_rows, self.n_cols)
+        t_tensor = cute.make_tensor(tmem_ptr, layout)
+
+        # 3. Write known values to TMEM via store atom
+        #    (need a way to seed TMEM — could come from MMA, or from a manual SMEM->TMEM copy)
+
+        # 4. Load TMEM -> registers via atom_load
+        load_partition = make_tmem_copy(self.atom_load, t_tensor)
+        thr_load = load_partition.get_slice(tidx)
+        reg_buf = cute.make_rmem_tensor(thr_load.partition_D(...).shape, Float32)
+        cute.copy(load_partition, thr_load.partition_S(t_tensor), reg_buf)
+        cute.arch.fence_view_async_tmem_load()
+
+        # 5. NO-OP — don't modify reg_buf
+
+        # 6. Store registers -> TMEM via atom_store
+        store_partition = make_tmem_copy(self.atom_store, t_tensor)
+        thr_store = store_partition.get_slice(tidx)
+        cute.copy(store_partition, reg_buf, thr_store.partition_D(t_tensor))
+        cute.arch.fence_view_async_tmem_store()
+
+        # 7. Load TMEM -> registers again via same load atom
+        # 8. Write registers -> GMEM dst for comparison
+```
+
+The challenge in step 3 is **seeding TMEM with known values**. Options:
+- **A:** Run a small MMA with known inputs to populate the TMEM accumulator. Most realistic (matches FMHA usage).
+- **B:** Manual SMEM → TMEM copy via `tcgen05.cp` with known SMEM contents. More direct.
+- **C:** Use a separate "init" kernel that populates TMEM, then the round-trip kernel reads/writes it. Awkward but possible.
+
+Recommend **B** for the minimal test — gives full control over the seed values, no MMA semantics in the way.
+
+### Atom variant matrix to try
+
+Start with what CUTLASS uses (extracted in Phase 1) and the current FMHA configuration. Iterate from there.
+
+| Variant | Atom (Load) | Atom (Store) | Repetition | Source Tensor | Notes |
+|---|---|---|---|---|---|
+| V1 (baseline) | `Ld32x32bOp` | `St32x32bOp` | 16 | `tOtO_i` (composition-tiled) | Current FMHA — known to fail |
+| V2 (CUTLASS-style) | from Phase 1 | from Phase 1 | from Phase 1 | from Phase 1 | The target |
+| V3 (Repetition mismatch?) | `Ld32x32bOp` | `St32x32bOp` | 32 / 32 | `tOtO0` (no composition) | Less tiled |
+| V4 (different N) | `Ld16x32bOp` if exists | `St16x32bOp` if exists | varies | as above | Other datapath count |
+| V5 (Nt variant) | `Ld32x32bNtOp` if exists | `St32x32bNtOp` if exists | from Phase 1 | as above | Transposed variant |
+| V6 (no composition) | as V1 | as V1 | 16 | raw `tOtO0` | Test if `composition` is the culprit |
+
+### Test protocol per variant
+
+For each variant:
+
+1. Print the partition shapes at trace time for both load and store:
+   ```python
+   print(f"V{N}: load partition_S shape = {cute.shape(thr_load.partition_S(t_tensor))}")
+   print(f"V{N}: load partition_D shape = {cute.shape(thr_load.partition_D(coord_tensor))}")
+   print(f"V{N}: store partition_S shape = {cute.shape(thr_store.partition_S(coord_tensor))}")
+   print(f"V{N}: store partition_D shape = {cute.shape(thr_store.partition_D(t_tensor))}")
+   ```
+2. Seed TMEM with `arange(n_rows * n_cols)` reshaped to `(n_rows, n_cols)` — predictable, easy to spot specific corruption patterns.
+3. Run NO-OP round-trip.
+4. Read TMEM back, compare bitwise (`torch.equal`, not cosine) against the seeded values.
+5. If not bitwise equal, compute element-by-element diff. Look for patterns:
+   - Are wrong values consistent (suggests a layout mismatch — e.g., rows or columns are permuted)?
+   - Are wrong values random-looking (suggests races or addressing errors)?
+   - Are some columns/rows correct and others wrong (suggests partial coverage)?
+6. Record the result in a table `notes/tmem_roundtrip_matrix.md`.
+
+### Decision point
+
+- **NO-OP passes bitwise on any variant:** ✅ Phase 2 successful. Move to Phase 3 with that variant.
+- **NO-OP fails on all tested variants, but ≥1 produces a recognizable pattern** (e.g., transposed values, rotated columns): there's likely a missing layout transform. Add a layout transform step and try again.
+- **NO-OP fails on all variants with random-looking corruption:** the atom is not actually paired with the variants CuTeDSL Python exposes. Escalate.
+
+---
+
+## Phase 3: Diagnose layout differences (Day 3)
+
+**Only enter Phase 3 if Phase 2 had a pattern but not bitwise success.**
+
+**Goal:** Identify the specific layout transformation needed to make the round-trip lossless.
+
+### Investigation steps
+
+- [ ] Print the TMEM thread-data mapping for both load and store atoms:
+  ```python
+  print(f"load thr_idx={sfw_idx} → cols {thr_load.partition_S(t).layout}")
+  print(f"store thr_idx={sfw_idx} → cols {thr_store.partition_D(t).layout}")
+  ```
+- [ ] If they differ, find the transformation that maps load layout to store layout.
+- [ ] Try inserting a register-level permutation between load and store that compensates for the column mapping difference.
+- [ ] Compare against CUTLASS reference: are they doing such a permutation? Is it implicit in some helper?
+- [ ] Check whether `transform_partitioned_tensor_layout` (used in MoE epilogue for one-way) has a round-trip-friendly counterpart.
+- [ ] Examine `tcgen05.copy.Repetition`'s semantics — does the repetition factor affect column striding in a way that load/store interpret differently?
+
+### Specific things to try
+
+1. **Inverse-permute between load and store.** If load reads `(thread t, col c) → tmem[t, σ(c)]` and store writes `(thread t, col c) → tmem[t, τ(c)]`, with σ ≠ τ, then inserting a register permutation `reg[c'] = reg[σ⁻¹(τ(c'))]` between load and store should fix it.
+2. **Build the store TiledCopy from the load's partition object** rather than from the tensor directly. If CuTeDSL exposes a way to derive a sister TiledCopy from an existing one (`.flip()`, `.invert()`, parameter swap), this is where it would live. (My earlier suggestion of `retile_to_S()` was fabricated — check the actual API.)
+3. **Apply `cute.composition` or `cute.logical_divide` to the source tensor** before constructing the store atom. README note 10 flags that these produce different layouts.
+
+### Deliverable
+
+If Phase 3 yields bitwise NO-OP success: `notes/tmem_roundtrip_solution.md` documenting the exact atom + partition + transform + permutation combination. This becomes the FMHA fix.
+
+---
+
+## Phase 4: Apply to FMHA (Days 4–5)
+
+**Only enter Phase 4 if Phase 2 or 3 produced bitwise NO-OP success.**
+
+### Steps
+
+- [ ] Replace the rescale atoms in `dsv4/kernels/attention/fmha.py` lines 407–416 with the verified Phase 2/3 configuration.
+- [ ] Run hd=64 single-tile regression. Verify cos ≥ 0.999998.
+- [ ] Run hd=128 single-tile regression. Verify cos ≥ 0.999997.
+- [ ] Run hd=256 single-tile regression. Verify cos ≥ 0.999998.
+- [ ] Add a new test: `tests/unit/test_d15_in_kernel_rescale.py` that runs s_k=256, 384, 512, 1024 in a single kernel launch with in-kernel rescale, and compares against Python KV merge oracle.
+- [ ] If new test passes at all s_k values: ✅ Path A complete. Remove `n_kv_tiles > 1` guards or change their semantics to "use in-kernel rescale" instead of "skip rescale entirely."
+- [ ] If new test fails at higher s_k but passes at s_k=256: there's an interaction between rescale and the pipeline state cycling. Diagnose.
+
+### Test sanity
+
+Bitwise NO-OP in a minimal kernel does not automatically mean correctness inside FMHA. The FMHA has additional concurrent activity (TMA warp, MMA warp, softmax warps all touching TMEM regions). Watch for:
+- Fence ordering: every TMEM store must be followed by `fence_view_async_tmem_store()`. Every load must be preceded by `fence_view_async_tmem_load()` if a recent store could affect it.
+- Race conditions between the rescale and the next MMA write to the same TMEM region.
+- Pipeline state: the rescale happens inside the kt loop after softmax_done_bar but before the next QK MMA. Verify the barrier ordering keeps rescale and MMA from overlapping on O.
+
+---
+
+## Phase 5 (optional): Document and contribute upstream (Day 5+)
+
+If Phase 4 succeeds, the solution is potentially useful to other CuTeDSL Python users.
+
+- [ ] Write `notes/tmem_roundtrip_findings.md` with: the broken pattern, the working pattern, why they differ, what CUTLASS does that we ported.
+- [ ] Consider opening an issue / discussion on the CUTLASS repo if the working pattern requires non-obvious atom construction. The CuTeDSL Python documentation may benefit from a worked example.
+- [ ] Cross-reference from README "Lessons learned" section.
+
+---
+
+## Escalation triggers (when to stop and switch paths)
+
+The investigation has hard exit criteria so we don't sink unbounded time into it.
+
+| Trigger | Action |
+|---|---|
+| End of Day 1: CUTLASS CuTeDSL Python FMHA reference doesn't exist or doesn't do rescale | Continue to Phase 2 as exploratory, expect lower success probability. Plan check-in at end of Day 3. |
+| End of Day 2: no variant produces NO-OP success, no clear pattern in failures | Stop. Path A is likely not viable in current CuTeDSL Python. Escalate to Path C (O in SMEM). |
+| End of Day 3: Phase 3 diagnostic shows the required atom or transform doesn't exist in CuTeDSL Python | Stop. Document the gap. File issue with NVIDIA. Escalate to Path C. |
+| End of Day 4: NO-OP passes in isolation but FMHA regression fails and root cause isn't fence ordering | Stop. Path A is partially viable but FMHA-specific interactions are blocking. Document and escalate to Path C while keeping the Phase 2/3 atom config as a future re-attempt seed. |
+| End of Day 5: all FMHA regressions pass, new multi-tile test passes | ✅ Path A complete. Proceed to ROADMAP Priority 8 cleanup (delete Python KV merge code, update tests). |
+
+---
+
+## What to leave behind
+
+Whether Path A succeeds or fails, the investigation should produce:
+
+- [ ] `tests/unit/test_tmem_roundtrip_minimal.py` — the standalone NO-OP test, parametrized over atom variants. Future work can re-run this if CuTeDSL Python adds new atoms.
+- [ ] `notes/tmem_roundtrip_matrix.md` — the variant test results table.
+- [ ] `notes/cutlass_rescale_breakdown.md` — the CUTLASS reference excerpt and analysis.
+- [ ] If solved: updated `dsv4/kernels/attention/fmha.py` + new `tests/unit/test_d15_in_kernel_rescale.py`.
+- [ ] If unsolved: documented evidence in ROADMAP that Path A was investigated and failed, with the specific gap identified, so future revisitations don't redo the work.
+
+---
+
+## What this investigation deliberately does NOT do
+
+- It does not refactor FMHA's PV accumulator (that's Path B/C, separate work).
+- It does not touch the per-head Python launch or multi-CTA grid (Priority 4, depends on Priority 2 not on this).
+- It does not change the un-normalized O + LSE output contract (that's exact and ships today).
+- It does not address the hd=512 MLIR hang (Priority 9, unrelated toolchain issue).
+
+If during the investigation we find ourselves modifying any of these, stop and reconsider scope. The investigation's win condition is narrow: NO-OP TMEM round-trip works, and FMHA's existing rescale code path can use it without other changes.
\ No newline at end of file
diff --git a/MAY_26_2026_PLAN.md b/MAY_26_2026_PLAN.md
deleted file mode 100644
index a03e23a8..00000000
--- a/MAY_26_2026_PLAN.md
+++ /dev/null
@@ -1,224 +0,0 @@
-# Progress check: you're moving fast and in the right direction
-
-Two days ago you had stages A, B, partial C, an unresolved TMEM round-trip blocker, and SWA/sink merge unstarted. Today you have **D5 complete** (single-tile and multi-tile via Python KV merge, cos 0.999996), **SwiGLU clamping landed in the fused kernel** (paper §4.2.3 clamps are in lines 2192-2201), **D3/D4 masks in-kernel**, **D5c sink-bias-as-logit insight that obsoleted D5d**, and a thoughtful workaround for the TMEM corruption that lets you ship correct multi-tile attention right now via per-segment LSE merge. The Python KV merge cos at 0.999998 says the math is airtight.
-
-Three things genuinely impress me:
-
-The **"sink merge is a single softmax over [S_comp, S_swa + attn_sink]"** realization in D5c is the right insight and saves you the entire D5d fused-merge epilogue. That you discovered it through implementation rather than reading the FlashMLA formula and porting it literally is a sign of real understanding.
-
-You **clearly distinguished the two TMEM blockers** — Issue 1 (round-trip corruption from hand-built atoms) vs Issue 2 (rescale for kt>0) — and the "external normalization is exact" framing in the doc shows you understand that emitting unnormalized O + LSE is not a workaround, it's a *better* design for FlashAttention-style attention because it composes (which is exactly what enabled D5c to be one kernel instead of two).
-
-You **scrupulously protected the proven path** — the const_expr guards around `n_kv_tiles > 1` for the rescale code, the const_expr guards around `not self.normalize` for LSE, the hd=64 regression test invariant. The fact that hd=64/128/256 all hold at cos 0.999998 after all this churn is unusual in low-level kernel work.
-
-The few things that did regress or get tangled, you correctly identified: hd=512 MLIR hang, multi-CTA flat_divide layout mismatch. Both are toolchain/refactor problems, not architectural mistakes.
-
----
-
-# About the blocker — you already have the fix in your codebase
-
-I want to point this out before suggesting anything new, because it changes the framing: **you've been doing the correct correction_epilog pattern in `fused_swiglu.py` for months.** Look at lines 2021, 2064-2078:
-
-```python
-tCtAcc_transformed = transform_partitioned_tensor_layout(tCtAcc_base)
-# ...
-tiled_copy_t2r, tTR_tAcc_base_epi, tTR_rAcc = epilogue_tmem_copy_and_partition(
-    self, tidx, tCtAcc_transformed, tCgC_transformed,
-    epi_tile, use_2cta_instrs,
-)
-tTR_rC = cute.make_rmem_tensor(tTR_rAcc.shape, self.c_dtype)
-tiled_copy_r2s, tRS_rC, tRS_sC = epilogue_smem_copy_and_partition(
-    self, tiled_copy_t2r, tTR_rC, tidx, sC
-)
-```
-
-Then in the subtile loop you do exactly what FMHA needs:
-
-```python
-cute.copy(tiled_copy_t2r, tTR_tAcc_mn, tTR_rAcc)   # TMEM → REGS via paired atom
-acc_vec = tiled_copy_r2s.retile(tTR_rAcc).load()    # registers in flight
-# ... SwiGLU + clamp here ...                       # modify in registers
-tRS_rC.store(acc_vec_bf16)
-cute.copy(tiled_copy_r2s, tRS_rC, tRS_sC[(None, None, None, c_buffer)])  # REGS → SMEM
-# ... TMA store SMEM → GMEM ...
-```
-
-**One-way trip. No TMEM round-trip. Library-paired atoms.** Exactly what your STAGE_D.md identifies as the proper fix. You've literally been using this in the MoE epilogue every time the kernel runs SwiGLU. The reason it works in MoE and not in your FMHA isn't a fundamental impossibility — it's that the FMHA epilogue currently uses `epilogue_tma_store` (which reads from TMEM directly with its own internal atoms) instead of the `epilogue_tmem_copy_and_partition` + `epilogue_smem_copy_and_partition` pair.
-
-The README note about "tma_partition can't be called inside if warp_idx blocks" was the blocker that pushed you toward `epilogue_tma_store` originally — but the MoE kernel calls `cpasync.tma_partition` for `bSG_sC, bSG_gC_partitioned` from inside the `if warp_idx < self.mma_warp_id:` block (line 2082). So either that constraint has been worked around, or it was never actually a constraint — possibly an issue with where the call sat in the control flow.
-
-# Sketch of the FMHA correction epilogue
-
-Here's the structure to replace lines 549-597 of `fmha.py`. I'm writing this against your existing kernel's conventions (your variable names, your barrier IDs, your `sfw_idx`):
-
-```python
-# === FMHA EPILOGUE — one-way TMEM → REGS → SMEM → GMEM ===
-# Replaces the current 'epilogue_tma_store' + TMEM round-trip normalize.
-# Pattern proven in dsv4/kernels/gemm/fused_swiglu.py (lines 2021, 2064-2229).
-
-# ---- Setup (run once per kernel, not per kt) ----
-# Move this block ABOVE the kt loop so partitioning happens once.
-# Put it AFTER tmem.allocate but BEFORE the kt loop, alongside the
-# existing tStS/tOtO0 setup.
-
-# Transform the partitioned O tensor (TMEM acc) into the epilogue's expected layout.
-# tOtO0 is the TMEM tensor for O; we need to transform it the way MoE does for tCtAcc.
-tCtO_transformed = utils.gemm.sm100.transform_partitioned_tensor_layout(tOtO0)
-
-# Same for the GMEM tensor — use the existing tCgC (your pv_thr.partition_C(gC)).
-tCgC_transformed = utils.gemm.sm100.transform_partitioned_tensor_layout(tCgC)
-
-# Create the paired atoms — this is the magic. tiled_copy_t2r and tiled_copy_r2s
-# share addressing so the round trip is lossless.
-tiled_copy_t2r, tTR_tO_base, tTR_rO = (
-    utils.gemm.sm100.epilogue_tmem_copy_and_partition(
-        self, sfw_idx, tCtO_transformed, tCgC_transformed,
-        epi_tile, self.use_2cta_instrs,
-    )
-)
-# Register tile for the to-be-written output (BF16).
-tTR_rC = cute.make_rmem_tensor(tTR_rO.shape, self.c_dtype)
-tiled_copy_r2s, tRS_rC, tRS_sC = (
-    utils.gemm.sm100.epilogue_smem_copy_and_partition(
-        self, tiled_copy_t2r, tTR_rC, sfw_idx, sC,
-    )
-)
-
-# TMA partition for SMEM → GMEM store, matching the MoE pattern.
-tCgC_epi = cute.flat_divide(tCgC_transformed, epi_tile)
-bSG_sC, bSG_gC_partitioned = cpasync.tma_partition(
-    tma_c, 0, cute.make_layout(1),
-    cute.group_modes(sC, 0, 2),
-    cute.group_modes(tCgC_epi, 0, 2),
-)
-# For single-CTA grid (your current case) the coordinates are all 0.
-# For multi-CTA later, this is where (m_tile, head, batch) lands.
-bSG_gC = bSG_gC_partitioned[(None, None, None, 0, 0, 0)]
-
-# Epilogue sync barrier — separate from softmax_done_bar to avoid double-wait
-epi_sync_bar = pipeline.NamedBarrier(
-    barrier_id=self.epilog_sync_bar_id,
-    num_threads=32 * len(self.epilogue_warp_id),
-)
-
-# C-store pipeline (same as MoE)
-c_grp = pipeline.CooperativeGroup(
-    pipeline.Agent.Thread, 32 * len(self.epilogue_warp_id),
-)
-c_pipeline = pipeline.PipelineTmaStore.create(
-    num_stages=self.num_c_stage, producer_group=c_grp,
-)
-
-# ---- After the kt loop, after final_o_bar.arrive_and_wait() ----
-
-# Compute 1/row_sum once per row, in registers. No TMEM round-trip.
-# row_sum was tracked across all kt; it's per-row in each thread.
-row_max_safe = row_max
-if row_max == -Float32.inf:
-    row_max_safe = Float32(0.0)
-inv_row_sum = Float32(1.0) / row_sum
-
-# tTR_tO has shape (..., subtile_cnt). Iterate subtiles.
-subtile_cnt = cute.size(tTR_tO_base.shape, mode=[3])
-
-for subtile_idx in cutlass.range(subtile_cnt):
-    # TMEM → REGS using the paired atom (the lossless path).
-    tTR_tO_mn = tTR_tO_base[(None, None, None, subtile_idx)]
-    cute.copy(tiled_copy_t2r, tTR_tO_mn, tTR_rO)
-
-    # Normalize and convert in registers. This is where the fix lives.
-    # When normalize=True: divide by row_sum, cast to BF16.
-    # When normalize=False (D5a): cast raw O to BF16, skip the divide.
-    acc_vec = tiled_copy_r2s.retile(tTR_rO).load()
-    if const_expr(self.normalize):
-        acc_vec = acc_vec * inv_row_sum
-    tRS_rC.store(acc_vec.to(self.c_dtype))
-
-    # REGS → SMEM
-    c_buffer = subtile_idx % self.num_c_stage
-    cute.copy(tiled_copy_r2s, tRS_rC, tRS_sC[(None, None, None, c_buffer)])
-    cute.arch.fence_proxy("async.shared", space="cta")
-    epi_sync_bar.arrive_and_wait()
-
-    # SMEM → GMEM (one warp does the TMA)
-    if warp_idx == self.epilogue_warp_id[0]:
-        cute.copy(
-            tma_c,
-            bSG_sC[(None, c_buffer)],
-            bSG_gC[(None, subtile_idx)],
-        )
-        c_pipeline.producer_commit()
-        c_pipeline.producer_acquire()
-    epi_sync_bar.arrive_and_wait()
-
-c_pipeline.producer_tail()
-
-# LSE still emitted for callers that want it (composes with D5 / external merge)
-if const_expr(not self.normalize):
-    _row_max_safe = row_max
-    if row_max == -Float32.inf:
-        _row_max_safe = Float32(0.0)
-    _ln2 = Float32(0.6931471805599453)
-    lse_val = cute.math.log(row_sum, fastmath=True) + _row_max_safe * _ln2
-    mLSE[sfw_idx, Int32(0), Int32(0)] = lse_val
-    mRowSums[sfw_idx, Int32(0), Int32(0)] = row_sum
-
-tmem.relinquish_alloc_permit()
-tmem.free(tmem_ptr)
-```
-
-This same pattern also gives you **D1.5 issue 2 (per-kt O rescale)** essentially for free. The rescale logic currently inside `if kt > 0:` (lines 524-544) becomes:
-
-```python
-# Per-kt O rescale (TMEM → REGS → multiply by acc_scale → REGS → TMEM)
-if const_expr(self.n_kv_tiles > 1):
-    if kt > 0:
-        for subtile_idx in cutlass.range(subtile_cnt):
-            tTR_tO_mn = tTR_tO_base[(None, None, None, subtile_idx)]
-            cute.copy(tiled_copy_t2r, tTR_tO_mn, tTR_rO)
-            # Modify in registers (this is what the hand-built atoms got wrong)
-            for k in cutlass.range(cute.size(tTR_rO), vectorize=True):
-                tTR_rO[k] = tTR_rO[k] * acc_scale
-            # Same paired atom for the way back — addressing matches.
-            cute.copy(tiled_copy_t2r.retile_to_S(), tTR_rO, tTR_tO_mn)
-        cute.arch.fence_view_async_tmem_store()
-```
-
-The key claim: `tiled_copy_t2r` carries an internal addressing scheme that's consistent for both the load and (when you use its `retile_to_S`-built store) the store. That's the whole reason the MoE epilogue works — paired atoms from the same partitioning, not two independently-constructed atoms whose addressings happen to mismatch.
-
-**Three caveats** before you run this:
-
-The exact CUTLASS helper API for the "store back to TMEM" direction varies between CuTeDSL versions — `tiled_copy_t2r.retile_to_S()` is one form, some versions expose a separate `epilogue_tmem_store_and_partition` helper, and some require you to compose `make_tmem_copy(tcgen05.copy.St32x32bOp(...), tTR_tO_base)` from the same `tTR_tO_base` you built the load from. Print `dir(utils.gemm.sm100)` on B200 to see what's available in your version. The principle is the same: derive the store atom from the same partition object as the load.
-
-The `transform_partitioned_tensor_layout` call may need different handling for `tOtO0` (TMEM iterator with offset) vs the MoE's `tCtAcc_base` (fresh TMEM pointer). If it errors, build `tCtO_transformed` by constructing a fresh tensor at `tmem_ptr + self.tmem_o0_offset` with `tCtO_fake.layout`, the way `tCtO_base` is built at line 566.
-
-The CuTeDSL region-isolation issue from the README — if `tma_partition` inside the `if warp_idx < self.mma_warp_id:` block still throws "weakly congruent," move the `cpasync.tma_partition(tma_c, ...)` call to before the `if warp_idx` branches. It's pure layout construction and doesn't depend on per-warp state. The MoE kernel gets away with it inside the block because of where in the IR-region tree the call sits; if your FMHA hits the issue, the partition can hoist to anywhere after `pipeline_init_wait`.
-
-# What unblocks the moment this lands
-
-**D1.5 issue 1** (TMEM round-trip corruption): gone. The paired-atom load/store doesn't round-trip in the sense that caused the 3% error — the same addressing covers both directions, no transcoding.
-
-**D1.5 issue 2** (per-kt rescale): gone. Same atoms used for rescale produce correct data.
-
-**Multi-tile attention without Python merge**: enabled. You can run s_k=1152 in a single kernel launch instead of 9. That's a real perf win at decode where launch overhead matters.
-
-**NVFP4-1.2** (fuse FP4 quant into the FMHA output → wo_a path): unblocked. The register-level modification slot (`acc_vec * inv_row_sum` above) is also where amax reduction + FP4 packing happens, exactly the way SwiGLU + clamping sits in the MoE epilogue.
-
-**Multi-CTA grid (D2 deferred work)**: this is the harder one. Once you're using `flat_divide(tCgC_transformed, epi_tile)` + `tma_partition` for the C store, the flat_divide vs local_tile mismatch you hit in D2.3-2.6 mostly evaporates — the MoE kernel uses exactly `cute.flat_divide(tCgC_transformed, epi_tile)` to do its multi-CTA tile addressing (line 2081), and the runtime block coords land via `bSG_gC_partitioned[(None, None, None, *mma_tile_coord_mnl)]`. The current per-head Python launch becomes the multi-CTA grid as soon as you can index that final coordinate with `(bidx, bidy, bidz)` from `cute.arch.block_idx()`.
-
-# After the epilogue lands, the next things in order
-
-I'd sequence these:
-
-**1. Verify hd=64/128/256 regress at cos 0.999998** with the new epilogue, both `normalize=True` and `normalize=False`. The atom invariant means this should just work, but verify before moving on. Add a multi-tile s_k=256 test that asserts cos > 0.99 with the new fused rescale to confirm Issue 2 is dead.
-
-**2. Multi-CTA grid (D2.3-2.6 finish)** using the unblocked `flat_divide` + `tma_partition` path. Per-head launch works but is wasteful — at decode T=1, n_h=128, you're paying 128 separate kernel launches. A single grid=(1, 128, batch) launch with proper per-CTA Q indexing collapses this to one launch and lets MQA K/V sharing become possible later via cluster shape.
-
-**3. NVFP4-1.1 (fuse FP4 quant into SwiGLU epilogue)** — independent of FMHA, biggest bandwidth win in the codebase. The MoE epilogue already has all the structure needed: register-level modification slot, paired atoms, SMEM stage, TMA store. You just replace `tRS_rC.store(acc_vec_bf16)` with per-microblock amax + scale + FP4 pack + write to an FP4 SMEM buffer + TMA store the FP4 + parallel TMA store of the FP8 scales. Same epilogue, different output layout.
-
-**4. NVFP4-1.2 (fuse FP4 into FMHA's inverse-RoPE → wo_a path)** — directly enabled by step 1. The acc_vec modification slot in the new FMHA epilogue is exactly where the inverse-RoPE rotation + FP4 quantization belongs. Wo_a then reads FP4.
-
-**5. The hd=512 MLIR hang** — I'd actually push this lower in priority. With multi-CTA grid working and per-head launch as a fallback, you can run hd=512 via the head-packed M-dimension approach (Q reshaped to `(n_h*T, hd, 1)` per your D2 notes), and the multi-tile kernel handles s_k=1152 in one launch. That's a complete decode path. Solving the MLIR backend hang is real work — option 3 (raw CUTLASS C++) is probably the most realistic if NVIDIA can't fix it, and writing one hd=512-specific .cu file is a fraction of the work of fighting MLIR. But it's not the highest-leverage thing right now.
-
-**6. The indexer kernel rewrite** (paper §5.2.1 — FP4 tensor-core scoring) sits in your speculative bucket but I'd promote it now that D5 is done and you have a clear FP4 pipeline pattern from steps 3-4. It's the single largest remaining perf gap in the inference path, and you have the FP4 patterns established to do it cleanly.
-
-The arc of this work has been right. Each step removed a real blocker, kept the proven path intact, and the workarounds you chose (Python KV merge, per-head launch) were principled choices that bought you forward progress without committing to wrong abstractions. The epilogue is the next thing that unlocks the production path — and the template for fixing it has been in your repo all along.
\ No newline at end of file
diff --git a/ROADMAP.md b/ROADMAP.md
index 0a87260e..072a1797 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -2,7 +2,7 @@
 
 Living document. Current state, active blockers, priority order, and what to build next. Architecture and lessons live in README.md — this file is for "what now."
 
-**Last updated:** 2026-05-26
+**Last updated:** 2026-05-26 (revised after correcting D1.5 fix-path analysis)
 
 ---
 
@@ -15,7 +15,7 @@ Living document. Current state, active blockers, priority order, and what to bui
 | FMHA TMEM-P | 64 | 128 | 0.999998 | ✅ |
 | FMHA TMEM-P / SMEM-P | 128 | 128 | 0.999997 | ✅ |
 | FMHA TMEM-P | 256 | 128 | 0.999998 | ✅ |
-| FMHA multi-tile (Python KV merge) | 64 | up to 1024 | 0.999998 | ✅ Workaround |
+| FMHA multi-tile (Python KV merge) | 64 | up to 1024 | 0.999998 | ✅ Workaround — see below |
 | D3 SWA length mask (in-kernel) | 128 | 128 | 0.999996 | ✅ |
 | D4 causal mask on SWA (in-kernel) | 128 | 128 | 0.999996 | ✅ |
 | D5c sink merge single-tile | 64 | 128 | 0.999996 | ✅ |
@@ -29,77 +29,85 @@ Living document. Current state, active blockers, priority order, and what to bui
 
 ### Known blockers
 
-| Blocker | Impact | Workaround | Fix path |
-|---|---|---|---|
-| **D1.5 TMEM round-trip corruption** | Hand-built atoms produce 3% error on NO-OP round-trip; blocks in-kernel multi-tile O rescale and in-kernel normalize | Emit un-normalized O + LSE; Python KV merge for s_k>128 | **Priority 1: correction-epilog rewrite** (sketch below) |
-| hd=512 MLIR backend hang | Cannot compile hd=512 kernel (>3hr optimizer time, structurally correct) | Run hd=512 via head-packed M with hd≤256 chunks; ship without hd=512 if needed for D2 | Pre-compile cubin / raw CUTLASS C++ / report NVIDIA bug |
-| D2 multi-CTA grid (flat_divide + epilogue_tma_store) | Per-head Python launch wastes 128 launches per decode step at Pro | Per-head launch (works, just slow) | Unblocked by correction-epilog rewrite (uses `flat_divide` + `tma_partition` like MoE does) |
+| Blocker | Impact | Status |
+|---|---|---|
+| **Per-kt O rescale in TMEM (D1.5)** | Multi-tile attention requires 5–9 kernel launches per decode step instead of 1 (Python KV merge) | Workaround is correct (cos 0.999998). Whether to fix depends on profiling — see Priority 1. |
+| **TMEM final-normalize round-trip** | Cannot do in-kernel `O /= row_sum` cleanly | **Already worked around** — emit un-normalized O + LSE, external divide is exact. Not a blocker for shipping. |
+| **`epilogue_tma_store` blocks D2 multi-CTA + NVFP4-1.2** | Per-head Python launch wastes 128 launches per Pro decode step; FMHA output forces BF16 GMEM round-trip before wo_a | Unblocked by Priority 2 (the one-way final-epilogue rewrite — which is the **only** part of the MoE pattern that legitimately ports to FMHA). |
+| **hd=512 MLIR backend hang** | Cannot compile single-kernel hd=512 (>3hr optimizer time, structurally correct) | Decode works via head-packed M with hd≤256 chunks. Single-kernel hd=512 only needed for prefill efficiency. Low priority. |
 
 ---
 
-## Priority 1: Correction epilog rewrite (unblocks D1.5 + a chain of follow-ons)
+## What the MoE epilogue pattern actually buys us (and what it doesn't)
 
-**Why this first:** Every downstream item needs the kernel to have a register-level slot in the epilogue for modification. The current `epilogue_tma_store` path with hand-built atoms doesn't have one. The correction-epilog pattern does.
+This deserves stating explicitly because the previous version of this document had it wrong.
 
-**The pattern is already in the codebase** — `dsv4/kernels/gemm/fused_swiglu.py` uses it for the MoE SwiGLU epilogue (lines 2021, 2064–2229). Library helpers, paired atoms, one-way TMEM → registers → SMEM → GMEM. SwiGLU + clamping math sits between the t2r and r2s copies. That's the exact slot FMHA needs.
+**The MoE pattern is one-way only.** `dsv4/kernels/gemm/fused_swiglu.py` uses `epilogue_tmem_copy_and_partition` + `epilogue_smem_copy_and_partition` to construct a TMEM → registers → SMEM → GMEM pipeline. There is **no** corresponding store-back-to-TMEM helper, and no inverse pairing of the t2r atom. The MoE epilogue runs *once*, *after* all MMA K-tiles are accumulated. It never needs to mutate the TMEM accumulator mid-loop.
 
-**What changes:**
+**FMHA's per-kt O rescale is structurally different.** PV uses `tcgen05.mma` with `ACCUMULATE=True` across the kt loop. The accumulator must live in TMEM because that's where MMA reads it and writes it. When row_max changes between kt iterations, the running O accumulator has to be multiplied by `acc_scale` *in TMEM* before the next PV — load to registers, multiply, store back. The store-back is the part that's broken: `Ld32x32bOp` and `St32x32bOp` built as separate atoms have hardware column mappings that don't match, producing ~3% corruption even on NO-OP round-trip. No software layout transformation in CuTeDSL Python has, so far, made them pair correctly.
 
-Replace the FMHA epilogue (`dsv4/kernels/attention/fmha.py` lines 549–597 — the `epilogue_tma_store` call) with:
+**Therefore:** porting the MoE pattern to FMHA **only fixes the one-way paths**:
 
-1. Setup (run once per kernel, outside the kt loop):
-   - `tCtO_transformed = utils.gemm.sm100.transform_partitioned_tensor_layout(tOtO0)`
-   - `tCgC_transformed = utils.gemm.sm100.transform_partitioned_tensor_layout(tCgC)`
-   - `tiled_copy_t2r, tTR_tO_base, tTR_rO = utils.gemm.sm100.epilogue_tmem_copy_and_partition(...)`
-   - `tiled_copy_r2s, tRS_rC, tRS_sC = utils.gemm.sm100.epilogue_smem_copy_and_partition(...)`
-   - TMA partition for C via `flat_divide(tCgC_transformed, epi_tile)` + `cpasync.tma_partition(...)`
+1. The final epilogue (after the last kt, when O is being written to GMEM for good).
+2. Any FP4 amax + pack fusion into that final epilogue.
 
-2. Final epilogue (replaces the round-trip normalize):
-   - Subtile loop, in each subtile: `cute.copy(tiled_copy_t2r, tTR_tO_mn, tTR_rO)` → multiply by `inv_row_sum` in registers if `normalize=True` → cast to BF16 → `cute.copy(tiled_copy_r2s, tRS_rC, tRS_sC[...])` → TMA SMEM → GMEM.
+It does **not** fix the per-kt rescale. That is a separate, harder problem with three possible paths laid out below.
 
-3. Per-kt O rescale (replaces the broken hand-built round-trip on lines 524–544):
-   - Inside the kt loop, when `kt > 0`: same t2r → multiply by `acc_scale` in registers → store back via the paired atom (`tiled_copy_t2r.retile_to_S()` or equivalent — verify exact API on B200).
+---
 
-**What unblocks:**
-- D1.5 issue 1 (round-trip corruption): gone.
-- D1.5 issue 2 (per-kt rescale): gone.
-- In-kernel multi-tile attention (single launch for s_k=1152, not 9).
-- NVFP4-1.2 (fuse FP4 quant into FMHA output → wo_a path): the register slot is where amax + FP4 pack go.
-- D2 multi-CTA grid: `flat_divide` + `tma_partition` path is the same one MoE uses successfully. The flat_divide vs local_tile mismatch resolves.
+## Priority 1: Profile production decode to determine if the rescale fix is needed
 
-**Caveats to print and verify on B200:**
-- Exact CUTLASS helper API for the "store back to TMEM" direction (`retile_to_S` form vs separate helper vs same-base-tensor pattern).
-- Whether `transform_partitioned_tensor_layout` accepts `tOtO0` (TMEM iterator with offset) or needs a fresh tensor built at `tmem_ptr + self.tmem_o0_offset` with `tCtO_fake.layout`.
-- Whether `tma_partition` inside `if warp_idx < self.mma_warp_id` works in this kernel's region tree. The MoE kernel does it; if FMHA hits "weakly congruent," hoist the partition call above the warp branch.
+Before investing days in fixing the per-kt rescale, measure whether the 5–9 launch overhead from Python KV merge is actually a bottleneck.
+
+At Pro decode (s_k=1152, n_kv_tiles=9), Python merge dispatches 9 kernels per step. Conservative launch overhead ~50 μs per kernel ≈ 450 μs/step in launch overhead alone. If a full decode step (all 61 layers, MoE, embedding, sampler) takes ~30 ms, that's ~1.5% of latency. If it takes ~10 ms, it's ~4.5%. Whether that's worth a 1–2 week refactor depends on the actual measurement.
+
+**Action:**
+- [ ] Profile Pro decode at s_k=1152 with current Python KV merge. Measure: total step latency, launch overhead from FMHA dispatches, FMHA compute time per launch.
+- [ ] Measure CPU dispatch overhead on the host (Python loop + kernel cache lookup).
+- [ ] Decision rule: if Python merge overhead is < 5% of total decode latency, defer Priority 8 indefinitely. Ship Python merge as production path.
+
+**Done when:** there's a profiled number that justifies (or doesn't) the engineering investment in Priority 8.
+
+---
+
+## Priority 2: One-way final-epilogue rewrite
+
+**What:** Replace the `utils.gemm.sm100.epilogue_tma_store(...)` call at the end of FMHA (`dsv4/kernels/attention/fmha.py` lines 565–577) with the MoE-style explicit pipeline:
+
+```
+transform_partitioned_tensor_layout → epilogue_tmem_copy_and_partition →
+[register slot — optional normalize/cast/FP4 pack] →
+epilogue_smem_copy_and_partition → flat_divide → cpasync.tma_partition → TMA store
+```
+
+This is **strictly one-way**. The kt-loop rescale code stays exactly as it is (using the broken hand-built atoms — see Priority 8 for the fix path).
+
+**What this enables:**
+
+1. **Optional in-kernel normalize.** Adds an `if const_expr(self.normalize):` block at the register slot to multiply by `inv_row_sum`. Currently external code does the divide on the un-normalized output. Tiny perf win, not the main reason to do this.
+2. **Unblocks NVFP4-1.2** (Priority 6) — gives a register-level modification slot in the FMHA output path where FP4 amax + pack can live, eliminating the BF16 GMEM materialization between FMHA and wo_a.
+3. **Likely unblocks D2 multi-CTA grid** (Priority 4) — the current `epilogue_tma_store` is what couldn't accept the `flat_divide`-based GMEM coordinate system. Switching to the explicit `cpasync.tma_partition(tma_c, ..., cute.flat_divide(tCgC_transformed, epi_tile))` path puts FMHA on the same TMA pattern MoE uses successfully, which should accept multi-CTA block coordinates.
+
+**Caveats to verify on B200 before assuming this works:**
+
+- Whether `tma_partition` survives inside the `if warp_idx < self.mma_warp_id` block. MoE calls it from inside its epilogue warp's `if`, but that has not been tested in FMHA's region tree. Previous attempts at the full pattern triggered 20+ minute MLIR compile times before reaching a verdict.
+- Whether `transform_partitioned_tensor_layout` accepts FMHA's `tOtO0` (TMEM iterator with offset) directly, or whether it needs a fresh tensor built at `tmem_ptr + self.tmem_o0_offset` with `tCtO_fake.layout`.
+- The `epilogue_tmem_copy_and_partition` helper signature on the current CuTeDSL version — print on B200 before coding.
+
+**Failure mode to watch for:** if compile hangs as it did previously, this rewrite is genuinely blocked and the chain of follow-ons (Priorities 4 and 6) need alternative paths.
+
+**Effort:** 1–2 days if the helpers cooperate. Multiple days if the MLIR hang reappears.
 
 **Done when:**
-- hd=64/128/256 regression cos ≥ 0.999998 holds with `normalize=True` and `normalize=False`.
-- New multi-tile s_k=256 test with `kt > 0` rescale gives cos ≥ 0.999998 (not the current 0.997 Python-merge workaround, the real in-kernel rescale).
-- Existing Python KV merge tests continue to pass (`test_d15_multi_kv.py`).
-
----
-
-## Priority 2: Stage E — Production extraction
-
-D5 is complete. The kernel works. Wrap it in a proper interface.
-
-| Step | What | Status |
-|---|---|---|
-| E1 | File placement: `dsv4/kernels/attention/fmha.py` | ✅ Done |
-| E2 | Constructor signature (`head_dim`, `num_query_heads`, `sliding_window`, `top_k`, sink/causal flags, dtypes) | ⚠️ Partial — needs cleanup |
-| E3 | Call signature: `q`, `compressed_kv`, `swa_kv`, `swa_lens`, `sink_logits`, `request_ids`, `o`, `stream` | ⚠️ Needs sink_bias / row_sums integration |
-| E4 | Kernel cache + warmup, keyed on `(head_dim, num_query_heads, top_k, n_comp, apply_sink_bias, is_causal, ...)` | TODO |
-| E5 | `torch.library.custom_op("dsv4::sparse_fmha_with_swa", mutates_args=("o",))` | TODO |
-| E6 | Reference parity test against FP32 oracle in `dsv4/reference/attention.py` | TODO |
-| E7 | Cleanup: delete debug test files, keep only `tests/unit/test_fmha_kernel.py` | TODO |
-
-Notably absent from the call signature: block_table, paged KV, inv_scale, FP8 dequant. All handled upstream by the indexer + gather kernel chain. FMHA sees a dense BF16 `[T, top_k, head_dim]` tile.
+- hd=64/128/256 regression cos ≥ 0.999998 holds with both `normalize=True` and `normalize=False` paths.
+- LSE output still matches reference for `normalize=False` callers.
+- Compile time is reasonable (< 5 min) — if not, document the hang and fall back.
 
 ---
 
 ## Priority 3: NVFP4-1.1 — Fuse FP4 quant into MoE SwiGLU epilogue
 
-**Independent of FMHA.** Biggest bandwidth win in the codebase. Can run in parallel with Priority 1.
+**Independent of FMHA. Can run in parallel with Priority 2.** Biggest bandwidth win in the codebase.
 
 Current:
 ```
@@ -118,59 +126,65 @@ padded_x_fp4 → L1 GEMM → SwiGLU → online amax → FP8 scale + FP4 pack →
 The SwiGLU + clamp result already lives in registers at `tRS_rC.store(acc_vec_bf16)` (line 2207 of `fused_swiglu.py`). That's the slot for amax + FP4 pack.
 
 **Per-microblock amax (16 contiguous elements):**
-1. shfl_xor butterfly reduction across the 4 threads that hold the 16 elements.
+1. `shfl_xor` butterfly reduction across the 4 threads holding the 16 elements.
 2. FP8 E4M3 scale = amax / 6 (FP4 e2m1 max).
-3. Per-element FP4 pack: sign bit << 3 | (clamped val / scale).to(uint3). Two elements → one byte.
+3. Per-element FP4 pack: `sign_bit << 3 | (clamped_val / scale).to(uint3)`. Two elements → one byte.
 4. 16 packed nibbles → 64-bit word → SMEM stage → TMA store.
 5. FP8 scale → separate scale-factor SMEM stage → TMA store to the L2 SFA buffer.
 
-**Subtlety:** NVFP4 microblock = 16 elements. Port the same 16-element logic from `dsv4/ops/quantize.py`. Don't accidentally use the 32-element MXFP4 block.
-
-**Done when:**
-- `padded_activated_fp4` and `padded_activated_x_sf` scratch buffers go away.
-- `quantize_activation_nvfp4` between L1 and L2 disappears.
-- L1 → L2 cosine matches reference (no regression from BF16 intermediate).
-- L2 GEMM reads FP4 scales produced by L1 epilogue.
+**Done when:** `padded_activated_fp4` and `padded_activated_x_sf` scratch buffers go away, `quantize_activation_nvfp4` between L1 and L2 disappears, L1→L2 cosine matches reference.
 
 ---
 
 ## Priority 4: D2 multi-CTA grid
 
-Currently per-head Python launch (works, cos 0.999995, but 128 launches per decode step at Pro).
-
-Multi-CTA grid is unblocked by Priority 1 — the `flat_divide` + `tma_partition` path becomes available once the epilogue uses the MoE pattern.
+**Depends on Priority 2.** Currently per-head Python launch dispatches 128 kernels per Pro decode step. Multi-CTA grid collapses that to 1.
 
 **Grid:** `(num_M_tiles, num_query_heads, batch)` — at decode T=1: `(1, 128, batch)`.
 
-**MQA K/V sharing:** start with independent K/V loads per CTA (each CTA loads its own copy). At decode hd=512, K/V per CTA is ~128 KB; 128 CTAs × 128 KB = 16 MB, well within HBM bandwidth. Cluster-wide sharing via `cluster_shape_mn=(1, num_query_heads, 1)` is a future optimization once profiling shows it matters.
+**Q tensor layout:** Option 1 — `(batch, n_h, T, head_dim)` with head as a TMA mode. Matches CUTLASS reference, allows per-head LSE output, generalizes to GQA later.
 
-**Q tensor layout:** Option 1 — `(batch, n_h, T, head_dim)` with head as a TMA mode (matches CUTLASS reference and allows per-head LSE output). Picked over Option 2 (heads packed into M) because it generalizes better to GQA later.
+**MQA K/V sharing:** start with independent K/V loads per CTA (each loads its own copy). At decode hd=512, K/V per CTA is ~128 KB; 128 CTAs × 128 KB = 16 MB total, comfortably within HBM bandwidth. Cluster-wide sharing via `cluster_shape_mn=(1, num_query_heads, 1)` is a future optimization once profiling shows it matters.
 
-**Done when:**
-- `n_h=128, batch=4, T=1` at hd=512 produces correct output with single launch.
-- Per-head LSE writes to correct `mLSE[batch, head, m_row]` position.
+**Done when:** `n_h=128, batch=4, T=1` at hd=512 produces correct output with single launch, per-head LSE writes to `mLSE[batch, head, m_row]` correctly.
 
 ---
 
-## Priority 5: NVFP4-1.2 — Fuse FP4 quant into FMHA output → wo_a path
+## Priority 5: Stage E — Production extraction
 
-**Depends on Priority 1** (correction epilog gives the register slot).
+D5 is complete. Wrap the kernel in a proper interface.
 
-Currently: FMHA emits BF16 → inverse RoPE produces BF16 → wo_a quantizes to FP4.
+| Step | What | Status |
+|---|---|---|
+| E1 | File placement: `dsv4/kernels/attention/fmha.py` | ✅ Done |
+| E2 | Constructor signature (`head_dim`, `num_query_heads`, `sliding_window`, `top_k`, sink/causal flags, dtypes) | ⚠️ Partial — needs cleanup |
+| E3 | Call signature: `q`, `compressed_kv`, `swa_kv`, `swa_lens`, `sink_logits`, `request_ids`, `o`, `stream` | ⚠️ Needs sink_bias / row_sums integration |
+| E4 | Kernel cache + warmup, keyed on `(head_dim, num_query_heads, top_k, n_comp, apply_sink_bias, is_causal, ...)` | TODO |
+| E5 | `torch.library.custom_op("dsv4::sparse_fmha_with_swa", mutates_args=("o",))` | TODO |
+| E6 | Reference parity test against FP32 oracle in `dsv4/reference/attention.py` | TODO |
+| E7 | Cleanup: delete debug test files, keep only `tests/unit/test_fmha_kernel.py` | TODO |
 
-Target: register slot in FMHA epilogue does the divide-by-row_sum *and* inverse RoPE rotation *and* per-microblock amax + FP4 pack. wo_a reads FP4 directly.
-
-Same pattern as Priority 3. Different home (FMHA epilogue, not MoE epilogue).
+Block table, paged KV, FP8 dequant, inv_scale — all handled upstream by the indexer + gather chain. FMHA sees a dense BF16 `[T, top_k, head_dim]` tile.
 
 ---
 
-## Priority 6: NVFP4-2 — FP4 KV pipeline depth in FMHA
+## Priority 6: NVFP4-1.2 — Fuse FP4 quant into FMHA output → wo_a path
 
-**Depends on Priority 1** being solid at BF16 KV first.
+**Depends on Priority 2** (uses the register slot in the new final epilogue).
 
-FP4 KV shrinks tiles ~4×; same SMEM budget supports more pipeline stages.
+Currently: FMHA emits BF16 → inverse RoPE → BF16 GMEM → wo_a quantizes to FP4.
 
-| KV dtype | Tile size (hd=512) | Stages that fit (192 KB budget) |
+Target: register slot in FMHA's new final epilogue does `O / row_sum` *and* inverse RoPE rotation *and* per-microblock amax *and* FP4 pack. wo_a reads FP4 directly with no GMEM materialization.
+
+Same pattern as Priority 3, different home (FMHA final epilogue, not MoE epilogue).
+
+---
+
+## Priority 7: NVFP4-2 — FP4 KV pipeline depth in FMHA
+
+**Depends on Priority 2 being solid at BF16 KV first.** FP4 KV shrinks tiles ~4×; same SMEM budget supports more pipeline stages.
+
+| KV dtype | Tile size (hd=512) | Stages fitting 192 KB |
 |---|---:|---:|
 | BF16 | 128 KB | 2 |
 | FP8 | 64 KB | 4 |
@@ -179,75 +193,129 @@ FP4 KV shrinks tiles ~4×; same SMEM budget supports more pipeline stages.
 At 1M-context decode where KV reads dominate, deeper pipelines hide more TMA latency.
 
 **Implementation:**
-- TMA loads FP4 NoPE dims (packed `e2m1_x2`) to SMEM slot 0.
+- TMA loads FP4 NoPE dims (`e2m1_x2` packed) to SMEM slot 0.
 - TMA loads BF16 RoPE dims to SMEM slot 1.
 - TMA loads FP8 scale factors to SMEM slot 2.
-- SMEM dequant FP4 → BF16 in vectorized form (`* FP8_scale`, 16-element microblocks).
+- SMEM dequant FP4 → BF16 vectorized (`* FP8_scale`, 16-element microblocks).
 - Concatenate `[NoPE, RoPE]` in SMEM.
 - MMA reads contiguous BF16 from SMEM.
 
-**Test:** FP4+BF16 split input → identical output to pure BF16 input (dequant must be transparent).
+**Test:** FP4+BF16 split input → identical output to pure BF16 input. Dequant must be transparent.
 
 ---
 
-## Priority 7: hd=512 fix
+## Priority 8 (conditional on P1 profile): Per-kt O rescale fix
 
-**Blocked.** Per Priority 4, multi-CTA grid + head-packed M means decode at hd=512 can route through `pv_n_tile=128` and `n_k_sub_tiles=2`, which compiles fine for hd=256. The hd=512 *single-kernel* compile is the missing piece for prefill efficiency, not correctness.
+**Only justified if Priority 1 profiling shows Python KV merge overhead > 5% of decode latency.** Otherwise defer indefinitely — the current correct workaround ships.
+
+There are three possible paths. None is a small change.
+
+### Path A: CUTLASS atom replication
+
+The CUTLASS C++ Blackwell FMHA does a TMEM round-trip for O rescale using `SM100_TMEM_LOAD_32dp32b_4x_atom` and `SM100_TMEM_STORE_32dp32b_4x_atom`, which are paired by hardware design. So in principle TMEM round-trip is possible. The question is whether CuTeDSL Python exposes the specific atom variants and layout configuration CUTLASS uses, and whether they can be paired correctly through `make_tmem_copy`.
+
+**Steps:**
+- [ ] Read `/root/cutlass/.../blackwell/kernel/attention/fmha/fmha.py` (or equivalent C++ reference) and document the exact atom + repetition + tensor layout used for `correction_rescale`.
+- [ ] Enumerate what CuTeDSL Python exposes: `dir(tcgen05.copy)`, available `LdNxNbOp` / `StNxNbOp` variants, what `Repetition(N)` controls.
+- [ ] Identify the difference between current `Ld32x32bOp(Repetition(16))` + `St32x32bOp(Repetition(16))` and whatever CUTLASS uses.
+- [ ] Build a minimal NO-OP round-trip test (load O, store back unchanged) with the candidate atom configuration. Verify cos = 1.0.
+- [ ] If NO-OP passes, retest with `* acc_scale` modification.
+
+**Risk:** CuTeDSL Python may not expose the necessary atom variants, or may not allow the layout configuration CUTLASS uses. In that case, escalate to Path B or C.
+
+**Effort if it works:** 2–4 days investigation + 1–2 days porting.
+
+### Path B: O accumulator in registers, manual PV
+
+Restructure FMHA so the PV accumulator is register-resident, not TMEM-resident. Each kt: read V from SMEM, read P from TMEM/SMEM, compute one-shot PV (no accumulate) writing to a temporary TMEM region, then load to registers and add to register-resident running O (with acc_scale applied to the running O before the add).
+
+**Implications:**
+- Register pressure is severe at hd=512. 512 FP32 per row × 1 row per thread × 128 threads = 128 KB of registers just for O. Possible but tight.
+- PV without TMEM accumulate is a non-standard MMA usage. May need to use a smaller PV tile and accumulate in registers across sub-tiles.
+- Loses the natural MMA-accumulator pipeline overlap.
+
+**Effort:** 1–2 weeks. High risk of regressing hd=64/128/256 paths during the refactor.
+
+### Path C: O accumulator in SMEM
+
+Variant of Path B with O in SMEM instead of registers. PV writes to a temporary TMEM region, gets loaded to registers, applies acc_scale * existing_SMEM_O + new_PV, stores back to SMEM. Final epilogue reads from SMEM.
+
+**Implications:**
+- SMEM budget tightens significantly (need O in SMEM = ~64 KB at hd=512).
+- Adds SMEM read/write pressure on every kt.
+- May require dropping kv_stage to 1 across the board.
+
+**Effort:** ~1 week. Lower risk than Path B but bigger perf impact (more SMEM traffic).
+
+### Recommended order if profiling demands a fix
+
+1. Try Path A first — least invasive, may just work with the right atom config.
+2. If Path A confirmed impossible in CuTeDSL Python, try Path C (SMEM-resident O) before Path B (register-resident O). SMEM gives more headroom at hd=512.
+3. Path B only if both fail and the perf gap is truly critical.
+
+---
+
+## Priority 9: hd=512 single-kernel fix
+
+**Currently blocked.** MLIR optimizer hangs > 3 hours on the hd=512 kernel. Tracer completes in 0.8s — kernel is structurally correct.
+
+Decode works via head-packed M with `pv_n_tile=128` and `n_k_sub_tiles=2`, so this is only a prefill efficiency issue. Lower priority than the chain above.
 
 **Options:**
-1. Pre-compile hd=512 cubin offline (accept 1–2 hour compile if MLIR ever finishes — uncertain).
-2. Add no-softmax mode emitting raw S to GMEM, call twice for k_sub=0/1, accumulate in Python, softmax once. Two launches but no MLIR hang.
+1. Pre-compile cubin offline (accept 1–2 hour compile, cache result).
+2. Add no-softmax mode emitting raw S to GMEM; call twice for k_sub=0/1, accumulate in Python, softmax once externally.
 3. Write hd=512 path in raw CUTLASS C++. Bypasses CuTeDSL MLIR entirely. Most realistic if NVIDIA can't fix the optimizer.
 4. Report CuTeDSL MLIR optimizer bug to NVIDIA.
 
-Lower priority than the chain above — at decode T=1, n_h=128, hd=512 the head-packed approach already works without needing a single hd=512 kernel.
-
 ---
 
-## Priority 8: Indexer FP4 tensor-core scoring (Stage F)
+## Priority 10: Indexer FP4 tensor-core scoring (Stage F)
 
 Paper §5.2.1: *"the QK path in the indexer of CSA, where QK activations are cached, loaded, and multiplied entirely in FP4."*
 
-Current indexer (`dsv4/kernels/cuda/indexer_score_topk.cu`): scalar FP32 dot products, no tensor cores, spinlock-protected shared-memory heap. Single largest perf gap in the codebase. At 1M-context decode the indexer scores ~250K compressed entries per query token — the spinlock heap will not scale to top_k=1024.
+Current indexer (`dsv4/kernels/cuda/indexer_score_topk.cu`): scalar FP32 dot products, no tensor cores, spinlock-protected shared-memory heap. Single largest perf gap in the codebase. At 1M-context decode it scores ~250K compressed entries per query token — the spinlock heap will not scale to top_k=1024.
 
-**Target:** port DeepGEMM `fp8_paged_mqa_logits` to FP4 inputs with `tcgen05.mma.kind=mxf4nvf4`. Plus per-warp partial top-k merged with a final reduction tree (or radix-select). Plus FP32→BF16 score quantization per paper (2× speedup on top-k selector, 99.7% recall).
+**Target:** port DeepGEMM `fp8_paged_mqa_logits` to FP4 inputs with `tcgen05.mma.kind=mxf4nvf4`. Plus per-warp partial top-k merged with a final reduction tree (or radix-select). Plus FP32→BF16 score quantization (paper claims 2× speedup on top-k selector at 99.7% recall).
 
-**Scope:** 2–3 weeks. Track for Stage F. Do not start until the FP4 epilogue patterns from Priorities 3 and 5 are established — they'll inform the indexer's FP4 load + score paths.
+**Scope:** 2–3 weeks. Stage F. Do not start until the FP4 epilogue patterns from Priorities 3 and 6 are established — they inform the indexer's FP4 load + score paths.
 
 ---
 
 ## Build order — recommended sequencing
 
 ```
-Now ─┬─ Priority 1 (correction epilog rewrite)
-     │     │
-     │     └─→ unblocks D1.5, D2 multi-CTA, NVFP4-1.2
-     │
-     ├─ Priority 3 (NVFP4-1.1 fuse FP4 in SwiGLU) ← parallel, independent
-     │
-     ↓
-     Verify hd=64/128/256 regressions hold
-     │
-     ↓
-     Priority 2 (Stage E production extraction)
-     │
-     ↓
-     Priority 4 (D2 multi-CTA grid)
-     │
-     ↓
-     Priority 5 (NVFP4-1.2 fuse FP4 in FMHA output)
-     │
-     ↓
-     Priority 6 (NVFP4-2 FP4 KV pipeline)
-     │
-     ↓
-     Priority 7 (hd=512 fix — only if prefill efficiency demands it)
-     │
-     ↓
-     Priority 8 (indexer FP4 tensor-core scoring) — Stage F
+Priority 1 (PROFILE) ──► gates Priority 8
+                        │
+Priority 2 (one-way    ─┼─► unblocks Priority 4 (multi-CTA)
+final epilogue)         │   unblocks Priority 6 (FP4 fuse in FMHA)
+                        │
+Priority 3 (NVFP4-1.1) ─┴── parallel, independent
+                        
+[verify hd regressions]
+        │
+        ▼
+Priority 4 (D2 multi-CTA grid)
+        │
+        ▼
+Priority 5 (Stage E production extraction)
+        │
+        ▼
+Priority 6 (NVFP4-1.2 FP4 fuse in FMHA output)
+        │
+        ▼
+Priority 7 (NVFP4-2 FP4 KV pipeline)
+        │
+        ▼
+Priority 8 (per-kt rescale fix — ONLY if P1 says it matters)
+        │
+        ▼
+Priority 9 (hd=512 fix — only if prefill efficiency demands)
+        │
+        ▼
+Priority 10 (indexer FP4 tensor-core scoring) — Stage F
 ```
 
-Priority 3 has no dependency on Priorities 1 or 2 and can run on a parallel branch.
+**Key change from the previous version:** Priority 1 is now a profiling task, not an engineering task. Priority 2 is scoped honestly — it's the one-way path only, and it does *not* fix the per-kt rescale. The per-kt rescale is Priority 8, conditional, and three paths deep because there is no easy fix.
 
 ---
 
@@ -255,12 +323,12 @@ Priority 3 has no dependency on Priorities 1 or 2 and can run on a parallel bran
 
 Listed for completeness. **Do not implement without explicit sign-off.**
 
-1. **NVFP4 compressed KV NOPE dims** (paper validated FP8 for compressed KV; FP4 would halve cache again). Risk: compounds quantization noise on already-lossy compressed KV.
-2. **MXFP4 vs NVFP4 for indexer scoring** — not validated for indexer specifically.
-3. **NVFP4 for full attention Q×K^T GEMM** — closed. Cos 0.86 vs FP32 in earlier tests. Attention stays BF16/FP32.
-4. **Per-token FP8 activation scaling in FMHA** — not validated. Out of scope.
-5. **2:4 structured sparsity on FP4 expert weights** — V4 not trained with structured sparsity. Off the table for the released checkpoint.
-6. **NVFP4 LM head + MTP head** — big VRAM win (~1.4 GB saved on Pro). Modest quality risk on rare-token logits. Test against held-out eval before shipping.
+1. **NVFP4 compressed KV NoPE dims.** Paper validated FP8; FP4 would halve cache again. Risk: compounds quantization noise on already-lossy compressed KV.
+2. **MXFP4 vs NVFP4 for indexer scoring.** Not validated for indexer specifically.
+3. **NVFP4 for full attention Q×K^T GEMM.** Closed. Cos 0.86 vs FP32 in earlier tests. Attention stays BF16/FP32.
+4. **Per-token FP8 activation scaling in FMHA.** Not validated. Out of scope.
+5. **2:4 structured sparsity on FP4 expert weights.** V4 not trained with structured sparsity. Off the table for the released checkpoint.
+6. **NVFP4 LM head + MTP head.** Big VRAM win (~1.4 GB saved on Pro). Modest quality risk on rare-token logits. Test against held-out eval before shipping.
 
 ---
 
@@ -272,4 +340,4 @@ Listed for completeness. **Do not implement without explicit sign-off.**
 | Pro decode | 128 | 1024 | 1152 | 9 | YES |
 | Current single-tile test | 1 | — | 128 | 1 | NO |
 
-Production decode needs the multi-tile path (Priority 1) working in-kernel. Today's Python KV merge ships correct results at the cost of 5–9 launches per step.
\ No newline at end of file
+Production decode needs the multi-tile path. Today's Python KV merge ships correct results at the cost of 5–9 launches per step. Whether that cost matters is what Priority 1 measures.
\ No newline at end of file
diff --git a/dsv4/kernels/attention/fmha.py b/dsv4/kernels/attention/fmha.py
index d82b030a..01b55d17 100644
--- a/dsv4/kernels/attention/fmha.py
+++ b/dsv4/kernels/attention/fmha.py
@@ -2,8 +2,9 @@
 
 Migrated from tests/unit/test_fmha_v3_stage_c.py — Stage C proven path.
 P stored to TMEM via register bridge, PV reads from TMEM.
-O rescale + normalization via correction epilogue (one-way TMEM→REGS→SMEM→GMEM)
-using paired atoms from epilogue_tmem_copy_and_partition / epilogue_smem_copy_and_partition.
+O rescale via CUTLASS correction_rescale pattern (TMEM→REGS→scale→TMEM)
+using Ld32x32bOp/St32x32bOp with composition-tiled tOtO_i.
+Normalization via epilogue_tma_store (one-way TMEM→GMEM).
 """
 import torch, cutlass, cutlass.cute as cute, cutlass.utils as utils, cutlass.pipeline as pipeline
 from cutlass.cute.nvgpu import cpasync, tcgen05
@@ -11,9 +12,9 @@ from cutlass import Float32, BFloat16, Int32, Boolean, const_expr
 from cutlass.utils import LayoutEnum
 from cutlass.utils.tmem_allocator import find_tmem_tensor_col_offset
 from cutlass.utils.blackwell_helpers import get_smem_store_op
-# TMEM round-trip is fundamentally broken (Ld32x32bOp/St32x32bOp column mapping mismatch).
-# The one-way correction epilogue pattern (from cutlass.utils.gemm.sm100) requires
-# restructuring PV to not use TMEM accumulator. See D1.5 notes in STAGE_D.md.
+# D1.5 FIX: TMEM round-trip works with the CUTLASS correction_rescale pattern:
+# Both load and store atoms built from the SAME composition-tiled tOtO_i tensor,
+# same Repetition(corr_tile_size). Verified in CUTLASS reference fmha.py line 2123.
 import cuda.bindings.driver as cuda
 import cutlass.torch as ct
 import math
@@ -186,6 +187,9 @@ class FmhaKernel:
         s_prod,s_cons = pipeline.PipelineUmmaAsync.create(barrier_storage=st.s_bar.data_ptr(),num_stages=1,producer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread),consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread,32*len(self.epilogue_warp_id))).make_participants()
         softmax_done_bar = pipeline.NamedBarrier(barrier_id=3, num_threads=32 + 32*len(self.epilogue_warp_id))
         final_o_bar = pipeline.NamedBarrier(barrier_id=4, num_threads=32 + 32*len(self.epilogue_warp_id))
+        # D1.5: barrier for PV completion signal (MMA→softmax warps)
+        # MMA warp arrives after PV[kt] completes; softmax warps wait before O rescale.
+        pv_done_bar = pipeline.NamedBarrier(barrier_id=5, num_threads=32 + 32*len(self.epilogue_warp_id))
         acc_pipe = pipeline.PipelineUmmaAsync.create(barrier_storage=st.acc_bar.data_ptr(),num_stages=self.num_acc_stage,producer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread),consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread,len(self.epilogue_warp_id)),cta_layout_vmnk=cl_vmnk,defer_sync=True)
         tmem_bar = pipeline.NamedBarrier(barrier_id=2,num_threads=32*len((self.mma_warp_id,*self.epilogue_warp_id)))
         tmem = utils.TmemAllocator(st.holding.ptr,barrier_for_retrieve=tmem_bar,allocator_warp_id=self.epilogue_warp_id[0],is_two_cta=cute.size(qk_mma.thr_id.shape)==2,two_cta_tmem_dealloc_mbar_ptr=st.tmem_dealloc.ptr)
@@ -313,6 +317,8 @@ class FmhaKernel:
                         pv_mma.set(tcgen05.Field.ACCUMULATE, True)
                 cute.arch.fence_view_async_tmem_store()
                 kvh_v.release()
+                if const_expr(self.n_kv_tiles > 1):
+                    pv_done_bar.arrive()  # Signal softmax warps: PV done, O is ready for rescale
                 final_o_bar.arrive()
             else:
                 # Original pipeline path (hd≤256)
@@ -341,6 +347,8 @@ class FmhaKernel:
                             pv_mma.set(tcgen05.Field.ACCUMULATE, True)
                     cute.arch.fence_view_async_tmem_store()
                     kvh.release()
+                    if const_expr(self.n_kv_tiles > 1):
+                        pv_done_bar.arrive()  # Signal softmax warps: PV done, O ready for rescale
                 acc_pipe.producer_commit(acc_st); acc_st.advance()
                 final_o_bar.arrive()
                 acc_pipe.producer_tail(acc_st)
@@ -397,22 +405,49 @@ class FmhaKernel:
             scale_log2 = Float32(self.scale_softmax_log2)
 
             # ============================================================
-            # D1.5: MULTI-KV-TILE O RESCALE — NOT SUPPORTED IN-KERNEL
+            # D1.5: O RESCALE ATOMS (CUTLASS correction_rescale pattern)
             # ============================================================
-            # TMEM round-trip (load O, modify, store back) is FUNDAMENTALLY
-            # broken: Ld32x32bOp and St32x32bOp have different column mappings
-            # at the hardware level. The MoE correction epilogue avoids this
-            # by doing a ONE-WAY trip (TMEM->REGS->SMEM->GMEM), but FMHA needs
-            # to keep O in TMEM for PV accumulation between kt iterations.
-            #
-            # Production path for multi-KV-tile: Python KV merge.
-            # Run kernel per 128-token segment (s_k=128), merge externally:
-            #   O = sum_i [exp(lse_i) * O_i_norm] / sum_i [exp(lse_i)]
-            # Verified cos 0.999998 for s_k up to 1024.
-            #
-            # Future: restructure PV to accumulate into REGS/SMEM instead
-            # of TMEM, enabling the one-way correction epilogue pattern.
+            # Pattern: both load and store atoms built from the SAME tOtO_i
+            # (composition-tiled from tOtO0), same Repetition(corr_tile_size).
+            # This is the exact pattern from CUTLASS reference fmha.py line 2123.
+            # The key insight: using composition() to re-tile tOtO into (128, corr_tile_size)
+            # sub-tiles, and building BOTH copies from the SAME tensor, ensures the
+            # column mappings agree on round-trip.
             # ============================================================
+            corr_tile_size = 16  # Must be power of 2, divides head_dim
+            tOtO_i_layout = cute.composition(
+                tOtO0.layout, cute.make_layout((128, corr_tile_size))
+            )
+            tOtO_i = cute.make_tensor(tOtO0.iterator, tOtO_i_layout)
+
+            # Coordinate tensor for O (needed for partition_D of load)
+            cO = cute.make_identity_tensor((128, self.head_dim))
+            tOcO = pv_thr.partition_C(cO)
+            tOcO_i_layout = cute.composition(
+                tOcO.layout, cute.make_layout((128, corr_tile_size))
+            )
+            tOcO_i = cute.make_tensor(tOcO.iterator, tOcO_i_layout)
+
+            tmem_load_o_atom = cute.make_copy_atom(
+                tcgen05.copy.Ld32x32bOp(tcgen05.copy.Repetition(corr_tile_size)),
+                self.qk_acc_dtype,
+            )
+            tiled_tmem_load_o = tcgen05.make_tmem_copy(tmem_load_o_atom, tOtO_i)
+            thr_tmem_load_o = tiled_tmem_load_o.get_slice(sfw_idx)
+            tTMEM_LOADtO = thr_tmem_load_o.partition_S(tOtO_i)
+            tTMEM_LOADcO = thr_tmem_load_o.partition_D(tOcO_i)
+
+            tmem_store_o_atom = cute.make_copy_atom(
+                tcgen05.copy.St32x32bOp(tcgen05.copy.Repetition(corr_tile_size)),
+                self.qk_acc_dtype,
+            )
+            tiled_tmem_store_o = tcgen05.make_tmem_copy(tmem_store_o_atom, tOtO_i)
+            thr_tmem_store_o = tiled_tmem_store_o.get_slice(sfw_idx)
+            tTMEM_STOREtO = thr_tmem_store_o.partition_D(tOtO_i)
+
+            # prev_acc_scale: unused, kept for clarity. acc_scale at kt is used
+            # to rescale O from kt=0..kt-1 before PV[kt].
+            prev_acc_scale = Float32(0.0)
 
             for kt in range(self.n_kv_tiles):
                 si_handle = s_cons.wait_and_advance()
@@ -512,9 +547,37 @@ class FmhaKernel:
                             k2 = k_coord // 64
                             _sP_nostage[(m_coord, k0), 0, (k1, k2)] = rP_bf16[(j0, 0), j1, 0, 0]
                     cute.arch.fence_proxy("async.shared", space="cta")
-                # D1.5: O rescale for kt > 0 is NOT supported in-kernel.
-                # Multi-KV-tile attention uses Python KV merge instead.
-                # n_kv_tiles=1 is the only tested/supported path.
+                # D1.5: O rescale for kt > 0 — CUTLASS correction_rescale pattern.
+                # After computing acc_scale for this iteration, rescale the existing O
+                # in TMEM before the next PV GEMM adds to it.
+                # Must wait for PV[kt-1] to complete (MMA signals pv_done_bar).
+                if const_expr(self.n_kv_tiles > 1):
+                    if kt > 0:
+                        pv_done_bar.arrive_and_wait()  # Wait for PV[kt-1]
+                        # Rescale O: load, multiply by acc_scale, store back to TMEM.
+                        # CUTLASS pattern: both copies use same tOtO_i (composition-tiled).
+                        n_slices = self.head_dim // corr_tile_size
+                        tTMrO = cute.make_rmem_tensor(
+                            (tTMEM_LOADcO.shape, n_slices), self.qk_acc_dtype
+                        )
+                        for i in range(n_slices):
+                            tTMrO_i_ = tTMrO[None, i]
+                            tTMrO_i_layout = cute.composition(
+                                tTMrO_i_.layout, cute.make_layout(tTMrO.shape[0])
+                            )
+                            tTMrO_i = cute.make_tensor(tTMrO_i_.iterator, tTMrO_i_layout)
+                            tTMEM_LOADtO_i = cute.make_tensor(
+                                tTMEM_LOADtO.iterator + i * corr_tile_size, tTMEM_LOADtO.layout
+                            )
+                            tTMEM_STOREtO_i = cute.make_tensor(
+                                tTMEM_STOREtO.iterator + i * corr_tile_size, tTMEM_STOREtO.layout
+                            )
+                            cute.copy(tiled_tmem_load_o, tTMEM_LOADtO_i, tTMrO_i)
+                            cute.arch.fence_view_async_tmem_load()
+                            for k in cutlass.range(cute.size(tTMrO_i), vectorize=True):
+                                tTMrO_i[k] = tTMrO_i[k] * acc_scale
+                            cute.copy(tiled_tmem_store_o, tTMrO_i, tTMEM_STOREtO_i)
+                        cute.arch.fence_view_async_tmem_store()
 
                 si_handle.release()
                 softmax_done_bar.arrive()
diff --git a/tests/unit/test_d15_in_kernel_rescale.py b/tests/unit/test_d15_in_kernel_rescale.py
new file mode 100644
index 00000000..13b7c4c3
--- /dev/null
+++ b/tests/unit/test_d15_in_kernel_rescale.py
@@ -0,0 +1,152 @@
+"""
+D1.5 Phase 4: Test in-kernel O rescale for multi-KV-tile FMHA.
+
+Tests the CUTLASS correction_rescale pattern:
+  - Both load and store atoms built from the SAME tOtO_i (composition-tiled)
+  - Same Repetition(corr_tile_size=16) for both
+  - Rescale O in TMEM between PV iterations
+
+Compares against:
+  1. FP32 reference (ground truth)
+  2. Python KV merge (proven correct, cos 0.999998)
+  3. s_k=128 baseline (no rescale, regression check)
+"""
+import torch, math
+import cutlass.cute as cute
+import cutlass.torch as ct
+import cuda.bindings.driver as cuda
+from dsv4.kernels.attention.fmha import FmhaKernel
+
+
+def reference_attention(q, k, v, scale):
+    """FP32 reference: returns un-normalized O."""
+    qf = q.float()
+    kf = k.float()
+    attn = qf @ kf.T * scale
+    attn_max = attn.max(dim=-1, keepdim=True)[0]
+    attn_exp = torch.exp(attn - attn_max)
+    ref_unnorm = attn_exp @ v.float()
+    return ref_unnorm
+
+
+def run_fmha(q, k, v, head_dim, s_k, pv_n_tile, use_smem_p, stream, lse_tensor, row_sums_tensor):
+    """Run FMHA kernel and return output tensor."""
+    m = q.shape[0]
+    v_tile = v[:, 0:pv_n_tile].contiguous()
+    v_kernel = v_tile.unsqueeze(-1)
+    c_tile = torch.zeros(m, pv_n_tile, 1, dtype=torch.bfloat16, device='cuda')
+
+    mQ = ct.from_dlpack(q).mark_layout_dynamic(leading_dim=ct.get_leading_dim(q))
+    mK = ct.from_dlpack(k).mark_layout_dynamic(leading_dim=ct.get_leading_dim(k))
+    mV = ct.from_dlpack(v_kernel).mark_layout_dynamic(leading_dim=ct.get_leading_dim(v_kernel))
+    mC = ct.from_dlpack(c_tile).mark_layout_dynamic(leading_dim=ct.get_leading_dim(c_tile))
+    mLSE = ct.from_dlpack(lse_tensor).mark_layout_dynamic(leading_dim=ct.get_leading_dim(lse_tensor))
+    mRS = ct.from_dlpack(row_sums_tensor).mark_layout_dynamic(leading_dim=ct.get_leading_dim(row_sums_tensor))
+
+    kernel = FmhaKernel(head_dim=head_dim, s_k=s_k, use_smem_p=use_smem_p, normalize=False)
+    compiled = cute.compile(kernel, mQ, mK, mV, mC, stream, mLSE, row_sums=mRS)
+    compiled(mQ, mK, mV, mC, stream, mLSE, row_sums=mRS)
+    return c_tile, lse_tensor, row_sums_tensor, kernel
+
+
+def test():
+    hd = 64
+    m = 128
+    scale = 1.0 / math.sqrt(hd)
+    torch.manual_seed(42)
+    stream = cuda.CUstream(torch.cuda.current_stream().cuda_stream)
+
+    q = torch.randn(m, hd, 1, dtype=torch.bfloat16, device='cuda')
+
+    # ===== Test 1: s_k=128 baseline (no rescale) =====
+    s_k1 = 128
+    k1 = torch.randn(s_k1, hd, 1, dtype=torch.bfloat16, device='cuda')
+    v1 = torch.randn(s_k1, hd, dtype=torch.bfloat16, device='cuda')
+    lse1 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    rs1 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+
+    # Need a dummy run to get pv_n_tile
+    kernel0 = FmhaKernel(head_dim=hd, s_k=s_k1, use_smem_p=False, normalize=False)
+    pv_n_tile = kernel0.pv_n_tile
+
+    c1, lse1, rs1, _ = run_fmha(q, k1, v1, hd, s_k1, pv_n_tile, False, stream, lse1, rs1)
+    torch.cuda.synchronize()
+
+    ref1 = reference_attention(q[:, :, 0], k1[:, :, 0], v1, scale)
+    cos1 = torch.nn.functional.cosine_similarity(
+        c1[:, :, 0].float().flatten().unsqueeze(0), ref1.flatten().unsqueeze(0)
+    ).item()
+    status1 = "PASS" if cos1 >= 0.999 else "FAIL"
+    print(f'Test 1: s_k=128 baseline: cos={cos1:.6f} {status1}', flush=True)
+
+    # ===== Test 2: s_k=256 with in-kernel rescale (CUTLASS correction_rescale) =====
+    s_k2 = 256
+    k2 = torch.randn(s_k2, hd, 1, dtype=torch.bfloat16, device='cuda')
+    v2 = torch.randn(s_k2, hd, dtype=torch.bfloat16, device='cuda')
+    lse2 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    rs2 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+
+    c2, lse2, rs2, _ = run_fmha(q, k2, v2, hd, s_k2, pv_n_tile, False, stream, lse2, rs2)
+    torch.cuda.synchronize()
+
+    ref2 = reference_attention(q[:, :, 0], k2[:, :, 0], v2, scale)
+    cos2 = torch.nn.functional.cosine_similarity(
+        c2[:, :, 0].float().flatten().unsqueeze(0), ref2.flatten().unsqueeze(0)
+    ).item()
+    status2 = "PASS" if cos2 >= 0.999 else "FAIL"
+    print(f'Test 2: s_k=256 in-kernel rescale: cos={cos2:.6f} {status2}', flush=True)
+
+    # ===== Test 3: Python KV merge (oracle) =====
+    c_s0 = torch.zeros(m, pv_n_tile, 1, dtype=torch.bfloat16, device='cuda')
+    lse_s0 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    rs_s0 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    c_s0, lse_s0, rs_s0, _ = run_fmha(q, k2[:128], v2[:128], hd, 128, pv_n_tile, False, stream, lse_s0, rs_s0)
+
+    c_s1 = torch.zeros(m, pv_n_tile, 1, dtype=torch.bfloat16, device='cuda')
+    lse_s1 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    rs_s1 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    c_s1, lse_s1, rs_s1, _ = run_fmha(q, k2[128:], v2[128:], hd, 128, pv_n_tile, False, stream, lse_s1, rs_s1)
+    torch.cuda.synchronize()
+
+    # D5 merge: O = sum(exp(lse_i) * O_i_norm) / sum(exp(lse_i))
+    o0 = c_s0[:, :, 0].float()
+    o1 = c_s1[:, :, 0].float()
+    r0 = rs_s0[:, 0, 0].float()
+    r1 = rs_s1[:, 0, 0].float()
+    l0 = lse_s0[:, 0, 0].float()
+    l1 = lse_s1[:, 0, 0].float()
+    o0_norm = o0 / r0.unsqueeze(1).clamp(min=1e-30)
+    o1_norm = o1 / r1.unsqueeze(1).clamp(min=1e-30)
+    w0 = torch.exp(l0).unsqueeze(1)
+    w1 = torch.exp(l1).unsqueeze(1)
+    oracle = (w0 * o0_norm + w1 * o1_norm) / (w0 + w1)
+
+    cos_oracle = torch.nn.functional.cosine_similarity(
+        oracle.flatten().unsqueeze(0), ref2.flatten().unsqueeze(0)
+    ).item()
+    print(f'Oracle: Python KV merge: cos={cos_oracle:.6f}', flush=True)
+
+    # ===== Test 4: s_k=384 (3 KV tiles) =====
+    s_k3 = 384
+    k3 = torch.randn(s_k3, hd, 1, dtype=torch.bfloat16, device='cuda')
+    v3 = torch.randn(s_k3, hd, dtype=torch.bfloat16, device='cuda')
+    lse3 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    rs3 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+
+    c3, lse3, rs3, _ = run_fmha(q, k3, v3, hd, s_k3, pv_n_tile, False, stream, lse3, rs3)
+    torch.cuda.synchronize()
+
+    ref3 = reference_attention(q[:, :, 0], k3[:, :, 0], v3, scale)
+    cos3 = torch.nn.functional.cosine_similarity(
+        c3[:, :, 0].float().flatten().unsqueeze(0), ref3.flatten().unsqueeze(0)
+    ).item()
+    status3 = "PASS" if cos3 >= 0.999 else "FAIL"
+    print(f'Test 4: s_k=384 in-kernel rescale: cos={cos3:.6f} {status3}', flush=True)
+
+    # ===== Summary =====
+    all_pass = cos1 >= 0.999 and cos2 >= 0.999 and cos3 >= 0.999
+    print(f'\nSummary: {"ALL PASS ✅" if all_pass else "SOME FAIL ❌"}', flush=True)
+
+
+if __name__ == '__main__':
+    test()
diff --git a/tests/unit/test_tmem_roundtrip_minimal.py b/tests/unit/test_tmem_roundtrip_minimal.py
new file mode 100644
index 00000000..b0c2a47f
--- /dev/null
+++ b/tests/unit/test_tmem_roundtrip_minimal.py
@@ -0,0 +1,154 @@
+"""
+D1.5 Phase 2: NO-OP TMEM round-trip test inside FMHA context.
+
+Strategy: Run FMHA with s_k=128 (single KV tile, no rescale).
+Then add a correction_rescale with scale=1.0 (NO-OP) after PV.
+If output is bitwise identical to without rescale → round-trip works.
+If output differs → round-trip corrupts data.
+
+This tests the EXACT CUTLASS correction_rescale pattern:
+  - Both load and store atoms use Repetition(corr_tile_size)
+  - Both copies built from the SAME tOtO_i tensor (composition-tiled)
+  - Register buffer sized from load partition_D
+
+Variants:
+  V1: Repetition(16) + composition — CUTLASS exact pattern
+  V2: Repetition(32) + composition
+"""
+import torch, math
+import cutlass.cute as cute
+import cutlass.torch as ct
+import cuda.bindings.driver as cuda
+from dsv4.kernels.attention.fmha import FmhaKernel
+
+
+def reference_attention(q, k, v, scale):
+    """FP32 reference: returns un-normalized O."""
+    qf = q.float()
+    kf = k.float()
+    attn = qf @ kf.T * scale
+    attn_max = attn.max(dim=-1, keepdim=True)[0]
+    attn_exp = torch.exp(attn - attn_max)
+    ref_unnorm = attn_exp @ v.float()
+    return ref_unnorm
+
+
+def test():
+    hd = 64
+    s_k = 128
+    m = 128
+    scale = 1.0 / math.sqrt(hd)
+    torch.manual_seed(42)
+
+    q = torch.randn(m, hd, 1, dtype=torch.bfloat16, device='cuda')
+    k = torch.randn(s_k, hd, 1, dtype=torch.bfloat16, device='cuda')
+    v = torch.randn(s_k, hd, dtype=torch.bfloat16, device='cuda')
+    c = torch.zeros(m, hd, 1, dtype=torch.bfloat16, device='cuda')
+
+    # FP32 reference
+    ref_unnorm = reference_attention(q[:, :, 0], k[:, :, 0], v, scale)
+
+    lse_tensor = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    row_sums_tensor = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    stream = cuda.CUstream(torch.cuda.current_stream().cuda_stream)
+
+    # Test 1: Baseline s_k=128, no rescale, TMEM-P
+    kernel1 = FmhaKernel(head_dim=hd, s_k=s_k, use_smem_p=False, normalize=False)
+    pv_n_tile = kernel1.pv_n_tile
+    v_tile = v[:, 0:pv_n_tile].contiguous()
+    v_kernel = v_tile.unsqueeze(-1)
+    c_tile1 = torch.zeros(m, pv_n_tile, 1, dtype=torch.bfloat16, device='cuda')
+    lse1 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    rs1 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+
+    mQ = ct.from_dlpack(q).mark_layout_dynamic(leading_dim=ct.get_leading_dim(q))
+    mK = ct.from_dlpack(k).mark_layout_dynamic(leading_dim=ct.get_leading_dim(k))
+    mV = ct.from_dlpack(v_kernel).mark_layout_dynamic(leading_dim=ct.get_leading_dim(v_kernel))
+    mC1 = ct.from_dlpack(c_tile1).mark_layout_dynamic(leading_dim=ct.get_leading_dim(c_tile1))
+    mLSE1 = ct.from_dlpack(lse1).mark_layout_dynamic(leading_dim=ct.get_leading_dim(lse1))
+    mRS1 = ct.from_dlpack(rs1).mark_layout_dynamic(leading_dim=ct.get_leading_dim(rs1))
+
+    print(f'Test 1: s_k=128 baseline (no rescale)', flush=True)
+    compiled1 = cute.compile(kernel1, mQ, mK, mV, mC1, stream, mLSE1, row_sums=mRS1)
+    compiled1(mQ, mK, mV, mC1, stream, mLSE1, row_sums=mRS1)
+    torch.cuda.synchronize()
+
+    out1 = c_tile1[:, :, 0].float()
+    cos1 = torch.nn.functional.cosine_similarity(
+        out1.flatten().unsqueeze(0), ref_unnorm.flatten().unsqueeze(0)
+    ).item()
+    print(f'  cos_unnorm={cos1:.6f}  {"PASS" if cos1 >= 0.999 else "FAIL"}')
+
+    # Test 2: s_k=256 with the CUTLASS correction_rescale pattern
+    # This is the REAL test — does multi-KV-tile O rescale work?
+    s_k2 = 256
+    k2 = torch.randn(s_k2, hd, 1, dtype=torch.bfloat16, device='cuda')
+    v2 = torch.randn(s_k2, hd, dtype=torch.bfloat16, device='cuda')
+    c_tile2 = torch.zeros(m, pv_n_tile, 1, dtype=torch.bfloat16, device='cuda')
+    lse2 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    rs2 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+
+    ref_unnorm2 = reference_attention(q[:, :, 0], k2[:, :, 0], v2, scale)
+
+    # Use the EXISTING Python KV merge as oracle
+    # Run per-segment and merge
+    kernel_s128 = FmhaKernel(head_dim=hd, s_k=128, use_smem_p=False, normalize=False)
+
+    # Segment 0
+    c_seg0 = torch.zeros(m, pv_n_tile, 1, dtype=torch.bfloat16, device='cuda')
+    lse_seg0 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    rs_seg0 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    mK0 = ct.from_dlpack(k2[:128]).mark_layout_dynamic(leading_dim=ct.get_leading_dim(k2[:128]))
+    v2_t0 = v2[:128, 0:pv_n_tile].contiguous().unsqueeze(-1)
+    mV0 = ct.from_dlpack(v2_t0).mark_layout_dynamic(leading_dim=ct.get_leading_dim(v2_t0))
+    mC_s0 = ct.from_dlpack(c_seg0).mark_layout_dynamic(leading_dim=ct.get_leading_dim(c_seg0))
+    mLSE_s0 = ct.from_dlpack(lse_seg0).mark_layout_dynamic(leading_dim=ct.get_leading_dim(lse_seg0))
+    mRS_s0 = ct.from_dlpack(rs_seg0).mark_layout_dynamic(leading_dim=ct.get_leading_dim(rs_seg0))
+    compiled_s0 = cute.compile(kernel_s128, mQ, mK0, mV0, mC_s0, stream, mLSE_s0, row_sums=mRS_s0)
+    compiled_s0(mQ, mK0, mV0, mC_s0, stream, mLSE_s0, row_sums=mRS_s0)
+
+    # Segment 1
+    c_seg1 = torch.zeros(m, pv_n_tile, 1, dtype=torch.bfloat16, device='cuda')
+    lse_seg1 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    rs_seg1 = torch.zeros(m, 1, 1, dtype=torch.float32, device='cuda')
+    mK1 = ct.from_dlpack(k2[128:]).mark_layout_dynamic(leading_dim=ct.get_leading_dim(k2[128:]))
+    v2_t1 = v2[128:, 0:pv_n_tile].contiguous().unsqueeze(-1)
+    mV1 = ct.from_dlpack(v2_t1).mark_layout_dynamic(leading_dim=ct.get_leading_dim(v2_t1))
+    mC_s1 = ct.from_dlpack(c_seg1).mark_layout_dynamic(leading_dim=ct.get_leading_dim(c_seg1))
+    mLSE_s1 = ct.from_dlpack(lse_seg1).mark_layout_dynamic(leading_dim=ct.get_leading_dim(lse_seg1))
+    mRS_s1 = ct.from_dlpack(rs_seg1).mark_layout_dynamic(leading_dim=ct.get_leading_dim(rs_seg1))
+    compiled_s1 = cute.compile(kernel_s128, mQ, mK1, mV1, mC_s1, stream, mLSE_s1, row_sums=mRS_s1)
+    compiled_s1(mQ, mK1, mV1, mC_s1, stream, mLSE_s1, row_sums=mRS_s1)
+
+    torch.cuda.synchronize()
+
+    # Python KV merge (proven correct, cos 0.999998)
+    o0 = c_seg0[:, :, 0].float()
+    o1 = c_seg1[:, :, 0].float()
+    rs0 = rs_seg0[:, 0, 0].float()
+    rs1_val = rs_seg1[:, 0, 0].float()
+    lse0 = lse_seg0[:, 0, 0].float()
+    lse1_val = lse_seg1[:, 0, 0].float()
+
+    # D5 merge: O = sum(exp(lse_i) * O_i) / sum(exp(lse_i))
+    # where O_i is NORMALIZED (O_i = O_unnorm_i / row_sum_i)
+    o0_norm = o0 / rs0.unsqueeze(1).clamp(min=1e-30)
+    o1_norm = o1 / rs1_val.unsqueeze(1).clamp(min=1e-30)
+    w0 = torch.exp(lse0).unsqueeze(1)
+    w1 = torch.exp(lse1_val).unsqueeze(1)
+    oracle = (w0 * o0_norm + w1 * o1_norm) / (w0 + w1)
+
+    cos_oracle = torch.nn.functional.cosine_similarity(
+        oracle.flatten().unsqueeze(0), ref_unnorm2.flatten().unsqueeze(0)
+    ).item()
+    print(f'\nOracle (Python KV merge): cos={cos_oracle:.6f}', flush=True)
+
+    # Now test: s_k=256 with in-kernel rescale (if we add it to FmhaKernel)
+    # This test will FAIL until we implement the fix, but serves as the target
+    print(f'\nTest 2: s_k=256 in-kernel rescale (NOT YET IMPLEMENTED)', flush=True)
+    print(f'  This test requires adding correction_rescale to FmhaKernel')
+    print(f'  See Phase 4 of MAY_24_2026_PLAN_NEW.md')
+
+
+if __name__ == '__main__':
+    test()