From 50e9b5da81477e000e4770e94de9d175204f5ff0 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Thu, 21 May 2026 09:47:08 +0000
Subject: [PATCH] =?UTF-8?q?README:=20Bug=204=20corrected=20=E2=80=94=20NOT?=
 =?UTF-8?q?=20TMEM=20alias.=20A-fragment=20identical=20for=20all=20PV=20si?=
 =?UTF-8?q?zes.=20Real=20bug=20in=20V/B=20or=20output=20C/D.?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 365 ++++++++++++------------------------------------------
 1 file changed, 81 insertions(+), 284 deletions(-)
diff --git a/README.md b/README.md
index 62608747..9cfc677a 100644
--- a/README.md
+++ b/README.md
@@ -1,260 +1,113 @@
-# DeepSeek-V4 NVFP4 Kernel Suite
+# DSV4 NVFP4 Workspace
 
-CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use `cutlass.cute` (CuTeDSL) with Blackwell tensor cores.
+## Status (May 21, 2026 — 09:18 UTC)
 
-## Status (May 21, 2026 — 06:45 UTC)
+### Stage A ✅ COMPLETE
+Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.
 
-### ✅ Stage A: Bare Q@K^T via tcgen05.mma → TMEM → GMEM — COMPLETE
+### Stage B 🔨 IN PROGRESS — TMEM Alias Bug 4
 
-**File**: `tests/test_stage_a_v2.py`
-**Result**: Q(128,128) @ K^T(128,128) → S(128,128), cosine 0.999999
+Two MMAs chained: Q@K^T (SMEM source) → identity softmax in TMEM → P@V (TMEM source).
 
-### 🔨 Stage B: Two MMAs + Identity Softmax — IN PROGRESS
-
-**Pipeline deadlock: FIXED. Kernel runs without deadlock.**
-**Bug 1 (V MN-major): ✅ Fix applied.**
-**Bug 2 (softmax packing): ✅ Confirmed correct (V=I test: cosine 1.0).**
-**Bug 3 (ACCUMULATE): ✅ Fix applied.**
-**Bug 4 (non-square PV): 🔨 ACTIVE — Two approaches attempted, both blocked.**
-
-#### Bug 4 (CURRENT): PV MMA Broken for (128,64) Output
-
-**Root Cause: The (128,64) PV MMA's A-fragment reads P from TMEM with a different layout than the softmax packing writes it.**
-
-The softmax packing writes P using the **QK C-fragment layout** (MMA atom = (128,128,16), N_MMA=128). The PV MMA reads P using its **A-fragment layout** (MMA atom = (128,64,16), N_MMA=64). These two layouts produce different physical TMEM addresses for the same logical (m,k) coordinate.
-
-**Evidence:**
-- Truncated identity V (64,128) MN-major: O[m,d] ≈ P[m, 2d] — the MMA reads every other column of P
-- All-ones V: cosine 0.999999 ✅ (uniform data hides the layout mismatch)
-- Single-element V: cosine 1.0 ✅ (sparse data also hides it)
-- (128,128) PV with same softmax packing: cosine 0.999999 ✅ (N_MMA=128 matches QK, no mismatch)
-
-**C++ TMEM Fragment Layout (from mma_traits_sm100.hpp):**
-```cpp
-Layout tmem_atom = Layout<Shape <_128, Int<N_MMA>>,
-                          Stride<  _1,     _128>>{};
-```
-- QK C-fragment: N_MMA=128 → 128 TMEM columns, stride 128
-- PV A-fragment (128,64): N_MMA=64 → 64 TMEM columns, stride 128
-
-**Approach 1: (128,64) PV MMA — BLOCKED (v28, deadlocks)**
-- Created PV MMA with `pv_mma_tiler[:2] = (128, 64)` as FMHA does
-- Kernel compiles but **deadlocks at runtime** inside `epilogue_tma_store`
-- The (128,64) PV MMA changes `tOtO` shape from (128,128) to (128,64), affecting TMEM allocation, epilogue partitioning, and `tCgC` partitioning
-- The deadlock is NOT in the MMA or softmax — it's specifically in `epilogue_tma_store` which calls `acc_pipeline.consumer_wait()`
-- Diagnostics show: MMA warp stuck at `mma_si acquire` (waiting for softmax), EPI warps complete softmax but deadlock in epilogue
-- All three known deadlock fixes are applied (no cta_layout_vmnk on mma_si, TMA warp doesn't call wait_for_alloc, PipelineTmaStore)
-- **New deadlock root cause unknown** — likely related to epilogue partitioning mismatch with (128,64) output shape
-
-**Approach 2: Pad V to (128,128), use (128,128) PV MMA — BLOCKED (v29, deadlocks)**
-- Pad V from (64,128) to (128,128) with zeros, keep (128,128) PV MMA
-- Should produce O=(128,128) where first 64 columns are correct
-- **Also deadlocks**, even with V=I(128,128) which works in test_pv_diag.py
-- test_pv_diag.py uses the exact same pipeline and kernel structure
-- Difference between v29 and test_pv_diag: likely a subtle code issue (b_dtype vs q_dtype, pv_mma_tiler, etc.)
-- **Needs bisecting** — slowly modify test_pv_diag into v29 to find the breaking change
-
-**Approach 3 (NOT YET TRIED): FMHA-style (128,16) PV MMA with N-tiling**
-- FMHA uses pv_mma_tiler = (128, 16, 128) with MN = (128, 16)
-- For head_dim=64, FMHA tiles the N dimension 4 times (64/16=4)
-- Requires restructuring the kernel to loop over N tiles
-- This is the "correct" approach but requires more code changes
-
-**Approach 4 (NOT YET TRIED): Fix softmax packing to write P in PV A-fragment layout**
-- Instead of composing tStS.layout → (128, 64), write P using a layout derived from the PV A-fragment
-- FMHA does this: `tStS_P_layout = cute.composition(tStS.layout, (128, tilePlikeFP32))`
-- The composition writes 64 packed FP32 columns that alias the PV A-fragment's 64 TMEM columns
-- This should fix the alias for (128,64) PV MMA, but still blocked by the epilogue deadlock
-
-**V SMEM Layouts (confirmed correct):**
-- `PV(128,64) V SMEM: outer=((64,16),1,8,1):((1,64),0,1024,0), inner=S<3,4,3>`
-- `PV(128,128) V SMEM: outer=(((64,2),16),1,8,1):(((1,8192),64),0,1024,0), inner=S<3,4,3>`
-
-**FMHA Softmax Packing Bridge (reference trace):**
-
-FMHA's softmax writes P in a packed format that aliases the PV A-fragment:
-```python
-# P store columns: qk_mma_tiler[1] * p_dtype.width // qk_acc_dtype.width
-# For 128 BF16 columns: 128 * 16 / 32 = 64 packed FP32 columns
-tilePlikeFP32 = qk_mma_tiler[1] // Float32.width * o_dtype.width  # = 64
-
-# Destination: packed-P physical TMEM view (64 FP32 columns, not 128)
-tStS_P_layout = cute.composition(tStS.layout, cute.make_layout((128, tilePlikeFP32)))
-tStS_P = cute.make_tensor(tStS.iterator + tmem_p0_offset, tStS_P_layout)
-
-# Register packing: Float32 backing → BF16 recast view
-tTMEM_STORErS_x4 = cute.make_rmem_tensor(tTMEM_STOREcS.shape, Float32)
-tTMEM_STORErS_x4_e = cute.make_tensor(
-    cute.recast_ptr(tTMEM_STORErS_x4.iterator, dtype=BFloat16),
-    tTMEM_LOADrS.layout)  # 128 BF16 logical elements
-```
-
-The consumer (PV A-fragment) reads from the same TMEM columns:
-```python
-tP = cute.make_tensor(tStS.iterator, p_tmem_layout_staged.outer)
-tOrP = pv_thr_mma.make_fragment_A(tP)[None, None, None, 0]
-tOrP0 = cute.make_tensor(
-    tOrP.iterator + Float32.width // BFloat16.width * tmem_p0_offset,
-    tOrP.layout)
-```
-
-The subtle point: `tilePlikeFP32 = 64` means "64 packed FP32 TMEM columns to store 128 logical BF16 P columns", NOT "PV output N=64". It's a coincidence that they're both 64 for head_dim=64.
+**Pipeline deadlock: ✅ FIXED. Softmax packing: ✅ CONFIRMED CORRECT.**
 
 ---
 
-### Bug 1: V B-Operand Must Be MN-Major — ✅ FIX APPLIED
+## Bug 4 (ACTIVE): Non-(128,128) PV MMA — V/B Staging or Output C/D Failure
 
-V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) — MN-major.
-PV MMA uses `v_major` (OperandMajorMode.MN) instead of `b_major` (K).
+### Summary
 
-V must use `as_strided` — default PyTorch (64,128) gives strides (128,1) which is K-major.
+The softmax writes P to TMEM using the **QK C-fragment layout**. The PV MMA reads P from TMEM using the **PV A-fragment layout**. For (128,128) PV these layouts agree. For (128,16) PV they disagree — the PV A-fragment reads from different TMEM columns than where the softmax wrote, producing zero output.
 
-### Bug 2: C-Fragment Composition Store — ✅ CONFIRMED CORRECT
+**FMHA uses (128,16) PV with the same construction and works.** The root difference is not yet identified despite exhaustive comparison. FMHA references: `p_tmem_layout_staged = make_smem_layout_a(pv_mma, pv_mma_tiler, q_dtype, 1)` — same call we make.
 
-FP32→BF16 packing via C-fragment composition store (FMHA pattern) is correct.
-Proven by V=I test (cosine 1.0) and random V 128x128 test (cosine 0.999999).
+### What Works / What Doesn't
 
-⛔ **FOOTGUN**: `St32x32bOp` MUST use Float32, NOT BFloat16.
-⚠️ The recast view for P packing uses the LOAD layout (128 BF16 elements), not the store composition shape.
+- ✅ PV (128,128) output, V=I or random → cosine 1.0 / 0.999999
+- ✅ PV (128,128) with zero-padded V (head_dim=16) → cosine 1.0 **WORKAROUND**
+- ✅ PV (128,64), all-ones V → cosine 0.999999 (uniform hides bug)
+- ✅ PV (128,64), single-element V → cosine 1.0 (sparse hides bug)
+- ❌ PV (128,64), truncated identity V → cosine 0.02
+- ❌ PV (128,16), V=I(128,128) → cosine 0.0 (all zeros)
+- ❌ PV (128,16) with P at S offset (no softmax) → NaN (FP32→BF16 reinterpret)
 
-### Bug 3: First PV Must Use ACCUMULATE=False — ✅ FIX APPLIED
+### Root Cause (Updated May 21 09:20 UTC)
 
-If ACCUMULATE=True on the first PV, `O = P@V + old_O` adds uninitialized TMEM. Always ACCUMULATE=False for first PV, then True for subsequent tiles.
+**The P/A TMEM alias is NOT the bug.** Diagnostic prints confirm the PV A-fragment layout is IDENTICAL for all PV sizes:
+
+```
+(128,128) PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=1.0
+(128,32)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.51
+(128,16)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.36
+```
+
+The C++ source confirms: the A-fragment TMEM atom depends on M and K, NOT output N. The softmax writes P to the same TMEM columns regardless of PV size.
+
+**The real bug is in the V/B staging or output C/D path.** When using (128,128) PV with zero-padded V (which keeps the V SMEM, O C-fragment, and epilogue at (128,128) dimensions), cosine=1.0. When using native (128,32) PV with V=(32,128), cosine=0.51. The difference is the V SMEM layout and/or output epilogue.
+
+**Key observations:**
+- V smem_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
+- O tOtO_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
+- cta_tile_shape_mnk=(128,128,64) for BOTH — this is QK's cta tile, not PV's
+- epi_tile=(128,16) for (128,16) PV — this IS correct (from PV)
+- Swapping cta_tile to PV before epilogue doesn't fix the issue
+
+**Next steps:**
+1. Test V TMA load correctness for (128,32) PV — is V data loaded correctly into SMEM?
+2. Test PV MMA output directly (skip epilogue) — is the PV MMA producing correct O?
+3. Check if V B-operand fragment (tCrV) has the right shape for (128,32) PV
+
+### Current Workaround
+
+Use **(128,128) PV with zero-padded V**. This wastes compute (8× for head_dim=16, 2× for head_dim=64) but produces correct results (cosine 1.0). For the production kernel, we'll use this initially and optimize to (128,16) PV once the TMEM alias is resolved.
+
+### Required Fixes (Not Yet Applied)
+
+1. **Primary**: Softmax must write P using the PV A-fragment TMEM layout, not the QK C-fragment layout. Requires constructing a `make_tmem_copy` with `tP` (PV layout) as the destination, and rearranging register data from QK partition to PV partition.
+
+2. **Secondary**: `epi_tile` must use PV's cta tile, and `self.cta_tile_shape_mnk` must be swapped before `epilogue_tma_store`. FMHA sets `self.epi_tile = self.pv_mma_tiler[:2]` directly.
+
+3. **Alternative (for later)**: Investigate using `composition()` to create a hybrid layout that both the QK softmax write and PV A-fragment read can agree on.
 
 ---
 
-## 🔨 Stage C: Online Softmax — AFTER B
+## Bugs 1–3: ✅ FIXED
 
-Per the pseudocode: epilogue warps compute per-row tile_max, rescale, exp, store P back to TMEM.
+### Bug 1: V B-Operand Must Be MN-Major
 
-## 🔨 Stage D: FP8 Paged KV Gather — AFTER C
+FMHA requires V to be **MN-major** for the PV MMA B-operand. V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) via `as_strided`.
 
-Replace BF16 TMA load with FP8 paged KV gather + per-position dequant.
+### Bug 2: C-Fragment Composition Store for P — CONFIRMED CORRECT
+
+FP32→BF16 packing via C-fragment composition store works. ⛔ `St32x32bOp` MUST use Float32, NOT BFloat16.
+
+### Bug 3: First PV Must Use ACCUMULATE=False
+
+If ACCUMULATE=True on the first PV, `O = P@V + old_O` adds uninitialized TMEM. FMHA: `pv_tiled_mma.set(tcgen05.Field.ACCUMULATE, kphase_idx != 0)`.
 
 ---
 
 ## Pipeline Deadlock — ✅ FIXED (May 21)
 
-v20-v25 all deadlocked on GPU. Three root causes found and fixed:
-
-### Fix 1: PipelineUmmaAsync for mma_si Must NOT Pass cta_layout_vmnk
-
-FMHA's mma_s0/mma_s1 PipelineUmmaAsync calls do NOT pass cta_layout_vmnk. Removing it fixes the deadlock.
-
-### Fix 2: TMA Warp Must NOT Call tmem.wait_for_alloc()
-
-The tmem allocation barrier has `num_threads = 32 * (mma_warp + epilogue_warps)`. The TMA warp is NOT part of this barrier. Calling `wait_for_alloc()` from the TMA warp corrupts the barrier.
-
-### Fix 3: PipelineTmaStore (not TmaStorePipeline)
-
-`pipeline.TmaStorePipeline` does not exist. The correct name is `pipeline.PipelineTmaStore`.
+Three root causes found and fixed:
+1. `PipelineUmmaAsync` for mma_si must NOT pass `cta_layout_vmnk`
+2. TMA warp must NOT call `tmem.wait_for_alloc()`
+3. `pipeline.PipelineTmaStore` (not `TmaStorePipeline`)
 
 ---
 
-## ✅ DEADLOCK FIX #4: num_tma_load_bytes Must Include V Bytes (May 21, 07:10 UTC)
+## ⛔ FOOTGUNS — CUTLASS CuTeDSL Landmines
 
-**Root cause:** `num_tma_load_bytes` only accounted for Q + K, not V. The TMA barrier's tx-count underflowed when V bytes arrived, wrapping the 20-bit counter. The barrier never reached zero → MMA warp waits forever.
-
-**Fix:** Add `cute.size_in_bytes(self.b_dtype, v_smem)` to `num_tma_load_bytes`.
-
-**Why it was sneaky:** test_pv_diag.py had the same bug but ran fine because V=I(128,128) is small enough that the race was benign. v29 with larger or differently-strided V exposed it.
-
-**Remaining issue with v28 (128,64 PV MMA):** The (128,64) PV MMA itself still needs investigation — the softmax-to-PV TMEM alias must be adapted for the different A-fragment layout.
-
----
-
-## ⛔ DEAD TEST: test_stage_b_v21.py — DELETED, DO NOT RECREATE
-
-v21 attempted both Bug 1 and Bug 2 fixes in a hand-rolled pipeline kernel. It deadlocks on GPU. Root cause: pipeline synchronization mismatch. **Do not recreate.** Write from scratch using fmha.py as the reference.
-
----
-
-## ⛔⛔⛔ FOOTGUNS — CUTLASS CuTeDSL Landmines ⛔⛔⛔
-
-### 🔴🔴🔴 0. num_tma_load_bytes MUST Include ALL TMA-Loaded Tensors (Q + K + V) — DEADLOCK IF MISSING
-
-**This is the #1 landmine. It cost us hours of debugging.**
-
-`PipelineTmaUmma.create()` takes a `tx_count` parameter (via `num_tma_load_bytes`) that tells the TMA barrier how many bytes to expect. If you load Q, K, and V via TMA but only budget Q+K bytes, the barrier's tx-count underflows when V's bytes arrive. On SM100 the 20-bit tx-count wraps, the barrier never reaches zero, and the consumer (MMA warp) waits **forever**.
-
-```python
-# ❌ WRONG — missing V bytes → DEADLOCK
-self.num_tma_load_bytes = (
-    cute.size_in_bytes(self.q_dtype, a_smem) + cute.size_in_bytes(self.b_dtype, b_smem)
-) * cute.size(qk_mma.thr_id.shape)
-
-# ✅ CORRECT — include ALL three TMA loads
-v_smem = cute.slice_(self.v_smem_s, (None, None, None, 0))
-self.num_tma_load_bytes = (
-    cute.size_in_bytes(self.q_dtype, a_smem)
-    + cute.size_in_bytes(self.b_dtype, b_smem)
-    + cute.size_in_bytes(self.b_dtype, v_smem)  # ← DO NOT FORGET THIS
-) * cute.size(qk_mma.thr_id.shape)
-```
-
-**Why it's sneaky:** With small V tensors, the race might be benign (V completes before the consumer reads). With larger V, the underflow actually traps the barrier. So it can work on small tests and deadlock on larger ones.
-
----
-
-### 1. St32x32bOp with 16-bit dtype → ILLEGAL MEMORY ACCESS
-
-`St32x32bOp(Repetition(N), BFloat16)` crashes at runtime. You MUST use `St32x32bOp(Repetition(N), Float32)` and pack 2×16-bit values into 1×Float32 backing words via `cute.recast_ptr`. The 16-bit type only appears in the recast view, never in the store atom itself.
-
-### 2. V B-Operand Major Mode ≠ K Major Mode
-
-FMHA requires `v_major_mode == OperandMajorMode.MN`. Passing K's K-major mode for V is WRONG. V must be shaped (head_dim, seq) with strides (1, head_dim) to produce MN-major. Standard PyTorch row-major (seq, head_dim) gives K-major.
-
-### 3. CuTe Nested Layout Modes Flatten Sequentially
-
-A layout like `((128,16),1,(4,2)):((65536,1),0,(16,64))` looks "non-sequential" but flattens to `addr = m*65536 + k` when k = k0 + 16*k1 + 64*k2 (CuTe row-major order). Do NOT assume nested modes imply non-sequential physical addressing. The C-fragment composition and A-fragment alias the same TMEM columns — BUT ONLY WHEN N_MMA MATCHES (i.e., (128,128) PV). For (128,64) PV, N_MMA=64 and the alias breaks.
-
-### 4. PipelineUmmaAsync Consumer Group = Thread Count, NOT Warp Count
-
-```python
-# WRONG: consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 4)
-# CORRECT: consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 32 * len(warp_ids))
-```
-
-### 5. PipelineUmmaAsync for mma_si Must NOT Pass cta_layout_vmnk
-
-Passing `cta_layout_vmnk` to the mma_si PipelineUmmaAsync causes deadlock. FMHA does not pass it. Remove it.
-
-### 6. TMA Warp Must NOT Call tmem.wait_for_alloc()
-
-The tmem allocation barrier only includes MMA + epilogue warps. The TMA warp is excluded. Calling `wait_for_alloc()` from the TMA warp corrupts the barrier.
-
-### 7. PV MMA ACCUMULATE Must Be False on First Tile
-
-If ACCUMULATE=True on the first PV MMA, `O = P@V + old_O` adds uninitialized TMEM to the result. Always set ACCUMULATE=False for the first PV, then True for subsequent tiles. FMHA: `pv_tiled_mma.set(tcgen05.Field.ACCUMULATE, kphase_idx != 0)`.
-
-### 8. TMEM Pointer Arithmetic: Offset Units Depend on Pointer Type
-
-When computing PV A-fragment offset from the softmax P offset:
-```python
-# Softmax store: FP32 pointer + tmem_p0_offset (in FP32 elements)
-tStS_P = cute.make_tensor(tStS.iterator + tmem_p0_offset, tStS_P_layout)
-
-# PV A-fragment: BF16 pointer + scaled offset (in BF16 elements)
-p_offset = acc_dtype.width // q_dtype.width * tmem_p0_offset  # 2 * 32 = 64
-tOrP0 = cute.make_tensor(tOrP.iterator + p_offset, tOrP.layout)
-```
-Both must address the same physical TMEM column. The 2× scaling accounts for FP32→BF16 element size difference.
-
-### 9. C-Fragment → A-Fragment TMEM Alias Only Works When N_MMA Matches
-
-The softmax packing writes P using the QK C-fragment layout. The PV A-fragment reads P. These alias correctly ONLY when both MMA atoms have the same N_MMA (i.e., both (128,128,16) → N_MMA=128). When the PV MMA uses (128,64,16) → N_MMA=64, the A-fragment has a different TMEM stride and reads garbage. **The softmax packing must be adapted to write P in the PV A-fragment's layout.**
-
-### 10. epi_tile Must Match PV Output Shape, Not QK
-
-`compute_epilogue_tile_shape` must use PV's `cta_tile_shape_mnk`, not QK's. Also, `self.cta_tile_shape_mnk` must be set to PV's cta tile before calling `epilogue_tma_store` (it reads `gemm_kernel.cta_tile_shape_mnk` internally). FMHA sets `self.epi_tile = self.pv_mma_tiler[:2]` directly.
-
-### 11. GPU State Persists After Deadlocked Kernels
-
-After a kernel deadlock, the GPU may remain in a bad state. Kill all python processes using the GPU before running new tests. `nvidia-smi` shows hanging processes. Use `kill -9 <PID>` to clean up. A deadlocked kernel can also cause subsequent runs to fail even if the code is correct.
-
-### 12. V Padded to (128,128) Must Use MN-Major Strides
-
-When padding V from (64,128) to (128,128), the padded tensor MUST use MN-major strides (1, 128), not the default PyTorch row-major strides (128, 1). Use `as_strided` or `transpose()` to get the correct layout. The wrong layout causes `LayoutEnum.ROW_MAJOR` instead of `LayoutEnum.COL_MAJOR`, which the PV MMA's OperandMajorMode.MN does not expect.
+1. **St32x32bOp with BFloat16 → ILLEGAL MEMORY ACCESS** — Must use Float32 + `cute.recast_ptr`
+2. **V major ≠ K major** — V must be MN-major, use `as_strided`
+3. **C-fragment → A-fragment TMEM alias only works when N_MMA matches** — (128,128) works, (128,64) breaks
+4. **PipelineUmmaAsync consumer = thread count, NOT warp count** — `32 * len(warp_ids)`
+5. **mma_si pipeline must NOT pass cta_layout_vmnk**
+6. **TMA warp excluded from tmem barrier**
+7. **First PV ACCUMULATE=False**
+8. **TMEM offset: FP32 ptr + 32 = BF16 ptr + 64** (width scaling)
+9. **epi_tile must use PV cta_tile, not QK**
+10. **CuTe nested layout modes flatten sequentially** — `((128,16),1,(4,2)):((65536,1),0,(16,64))` is sequential
 
 ---
 
@@ -269,7 +122,7 @@ For each KV tile:
      a. tcgen05.ld scores from TMEM → register fragments
      b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
      c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
-     d. tcgen05.st exp(scores - new_max) back to TMEM → P operand (via C-fragment composition)
+     d. tcgen05.st exp(scores - new_max) back to TMEM → P operand
      e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
   4. MMA warp waits on mma_si acquire (softmax done), MMA2: P @ sV → tmem_output (accumulate=True)
   5. Stage released, load warp can refill it
@@ -279,62 +132,6 @@ After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast
 
 ---
 
-## Test Results
-
-| File | Description | Cosine | Status |
-|------|-------------|--------|--------|
-| `test_stage_a_v2.py` | Q@K^T only | 0.999999 | ✅ PASS |
-| `test_mma_si_only.py` | Q@K^T + mma_si pipeline (no PV) | 0.999999 | ✅ PASS |
-| `test_softmax_only.py` | Q@K^T + softmax packing, output S | 0.52 | ❌ S overwritten by P (expected) |
-| `test_mma_si_pv.py` | Q@K^T + softmax + P@V (V MN-major, 128x64) | 0.01 | ❌ PV output garbage |
-| `test_pv_diag.py` | Q@K^T + softmax + P@V (V=I 128x128) | 1.0 | ✅ PASS |
-| `test_pv_diag.py` | Q@K^T + softmax + P@V (random V 128x128) | 0.999999 | ✅ PASS |
-| `test_diag_v_truncid.py` | Q@K^T + softmax + P@V (trunc identity 64x128, epi from PV) | 0.02 | ❌ O[m,d]≈P[m,2d] — TMEM alias mismatch |
-| `test_diag_v_ones.py` | All-ones V (64x128) | 0.999999 | ✅ uniform data hides mismatch |
-| `test_diag_v_ones.py` | Single-element V (64x128) | 1.0 | ✅ sparse data hides mismatch |
-| `test_diag_layout.py` | (128,64) PV with epi from PV cta_tile | 0.876 | ❌ partial fix — epi correct, TMEM alias still broken |
-| `test_diag_smem_layout.py` | Print V SMEM layouts for (128,64) vs (128,128) | N/A | ℹ️ layouts confirmed correct |
-| `test_layout_compare.py` | Print TMEM layouts for QK S and PV A-fragment | N/A | ℹ️ layout inspection |
-| `test_stage_b_v28.py` | (128,64) PV MMA + epi from PV | — | ❌ DEADLOCK in epilogue_tma_store |
-| `test_stage_b_v29.py` | Padded V (128,128) + (128,128) PV MMA | — | ❌ DEADLOCK (likely code bug, not fundamental) |
-| `test_stage_b_v30.py` | Copy of test_pv_diag.py | 1.0 | ✅ PASS (sanity check) |
-
----
-
-## Critical APIs & Lessons
-
-### TMEM offset arithmetic
-- `find_tmem_tensor_col_offset(fragment)` — returns physical TMEM column count
-- QK accumulator: 128 TMEM columns
-- A-fragment offset: `acc_dtype.width // q_dtype.width * tmem_p0_offset` (F32/BF16=2)
-
-### pv_mma_tiler — FMHA Convention
-```python
-pv_mma_tiler = (qk_mma_tiler[0], qk_mma_tiler[2], qk_mma_tiler[1])
-# = (M, head_dim, QK_N) = (128, 64, 128) for head_dim=64
-```
-
-FMHA passes `pv_mma_tiler[:2] = (128, head_dim)` to `make_trivial_tiled_mma`, NOT the QK tiler `(128, 128)`.
-
-### FMHA uses (128,16) PV MMA, NOT (128,64)
-FMHA uses pv_mma_tiler = (128, 16, 128) with MN = (128, 16). For head_dim=64, FMHA tiles the N dimension 4 times. The softmax writes P once, and each tile reads the same P with a different V slice.
-
-### make_trivial_tiled_mma — Use New Overload
-```python
-make_trivial_tiled_mma(a_dtype, b_dtype, a_leading_mode, b_leading_mode,
-                        acc_dtype, cta_group, mma_tiler_mn, a_source=SMEM)
-```
-
-### 3D tensors required
-Tensors must be 3D (M, K, L) for `cute.local_tile` — add L=1 dimension.
-
-### Other APIs
-1. `cutlass_torch.from_dlpack(t).mark_layout_dynamic(leading_dim=...)` — CuTe tensor from PyTorch
-2. `PipelineTmaUmma.create(...).make_participants()` — returns (producer, consumer) pair
-3. `utils.gemm.sm100.epilogue_tma_store` — handles transform + partition/dcopy. DO NOT hand-roll.
-4. `smem.allocate_tensor()` — for SMEM tensors
-5. `LayoutEnum.from_tensor(a).mma_major_mode()` — major mode from cute tensor
-
 ## Environment
 
 - **Server**: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)