From ad24792fc723bd128d497e487c75815c8fed7d07 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Thu, 21 May 2026 15:36:06 +0000
Subject: [PATCH] Update both READMEs: Stage B complete, document TMEM overlap
 root cause
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Workspace README: full rewrite with Stage B ✅, Bug 4b root cause (P/O overlap),
  FMHA V reconstruction, TMEM layout diagram, softmax store pattern, updated footguns
- Kernel README: focused on the bug, fix, and current test status
- Key lesson documented: NEVER use find_tmem_tensor_col_offset() as O placement
---
 README.md | 185 ++++++++++++++++++++++--------------------------------
 1 file changed, 76 insertions(+), 109 deletions(-)

diff --git a/README.md b/README.md
index 4b5017de..257311ba 100644
--- a/README.md
+++ b/README.md
@@ -1,132 +1,99 @@
-# DSV4 NVFP4 Workspace
+# DSV4 NVFP4 Kernel
 
-## Status (May 21, 2026 — 09:18 UTC)
+## Status (May 21, 2026 — 15:35 UTC)
 
 ### Stage A ✅ COMPLETE
 Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.
 
-### Stage B 🔨 IN PROGRESS — TMEM Alias Bug 4
+### Stage B ✅ COMPLETE — QK → Softmax → PV pipeline working for (128,64) PV
+Cosine 0.999999 with identity softmax and random V.
 
-Two MMAs chained: Q@K^T (SMEM source) → identity softmax in TMEM → P@V (TMEM source).
-
-**Pipeline deadlock: ✅ FIXED. Softmax packing: ✅ CONFIRMED CORRECT.**
+### Stage C 🔨 NEXT
+Real softmax (exp, row max, row sum, rescale). Multi-tile with proper accumulation.
 
 ---
 
-## Bug 4 (ACTIVE): Non-(128,128) PV MMA — V/B Staging or Output C/D Failure
+## Stage B — Bug 4b Root Cause & Fix
 
-### Summary
+### The Bug: TMEM P/O Region Overlap
 
-The softmax writes P to TMEM using the **QK C-fragment layout**. The PV MMA reads P from TMEM using the **PV A-fragment layout**. For (128,128) PV these layouts agree. For (128,16) PV they disagree — the PV A-fragment reads from different TMEM columns than where the softmax wrote, producing zero output.
+**Symptom:** (128,64) PV produces NaN or zeros. (128,128) PV works fine.
 
-**FMHA uses (128,16) PV with the same construction and works.** The root difference is not yet identified despite exhaustive comparison. FMHA references: `p_tmem_layout_staged = make_smem_layout_a(pv_mma, pv_mma_tiler, q_dtype, 1)` — same call we make.
+**Root cause:** PV output accumulator O was placed at `find_tmem_tensor_col_offset(tOtO)`, which returns 64 for (128,64) PV. P occupies TMEM columns [32, 96). O at [64, 128) overlaps P at [64, 96). While PV MMA reads P (A-operand), it simultaneously writes O (D-operand) to overlapping TMEM columns. The A-operand gets corrupted mid-computation.
 
-### What Works / What Doesn't
+For (128,128) PV, `find_tmem_tensor_col_offset(tOtO)` returns 128, so O starts after P — no overlap. It worked by accident.
 
-- ✅ PV (128,128) output, V=I or random → cosine 1.0 / 0.999999
-- ✅ PV (128,128) with zero-padded V (head_dim=16) → cosine 1.0 **WORKAROUND**
-- ✅ PV (128,64), all-ones V → cosine 0.999999 (uniform hides bug)
-- ✅ PV (128,64), single-element V → cosine 1.0 (sparse hides bug)
-- ❌ PV (128,64), truncated identity V → cosine 0.02
-- ❌ PV (128,16), V=I(128,128) → cosine 0.0 (all zeros)
-- ❌ PV (128,16) with P at S offset (no softmax) → NaN (FP32→BF16 reinterpret)
+### The Fix
 
-### Root Cause (CONFIRMED May 21 09:50 UTC)
-
-**Bug is NOT the TMEM alias.** The PV A-fragment layout is identical for all PV sizes (confirmed by C++ source and diagnostics): all PV sizes produce tOrP2_s = (2048, 1, 8), size=16384.
-
-**The real bug: V SMEM only holds 1 K-tile (2048 BF16), but the PV MMA iterates 8 K-phases.** For non-(128,128) V, most K-phases read wrong V data.
-
-- (128,128) PV + V=I works by coincidence (V=I makes the projection self-consistent)
-- (128,32) PV + V=(32,128) fails because V SMEM only has V[0:16,:], K-phases 1-7 read wrong data
-- Zero-padded V works because V=(128,128) covers all 8 K-phases; rows beyond head_dim are zero
-
-**How FMHA avoids this:** FMHA interleaves QK and PV per KV-tile. Each tile loads 16 K-rows of V, and PV processes only that tile. This ensures V SMEM always has the correct data.
-
-**Workaround:** Zero-pad V to 128 K-rows (2-4x compute waste, but correct). Proper fix: FMHA-style KV-tile interleaving.
-
-### Current Workaround
-
-Use **(128,128) PV with zero-padded V**. This wastes compute (8× for head_dim=16, 2× for head_dim=64) but produces correct results (cosine 1.0). For the production kernel, we'll use this initially and optimize to (128,16) PV once the TMEM alias is resolved.
-
-### Required Fixes (Not Yet Applied)
-
-1. **Primary**: Softmax must write P using the PV A-fragment TMEM layout, not the QK C-fragment layout. Requires constructing a `make_tmem_copy` with `tP` (PV layout) as the destination, and rearranging register data from QK partition to PV partition.
-
-2. **Secondary**: `epi_tile` must use PV's cta tile, and `self.cta_tile_shape_mnk` must be swapped before `epilogue_tma_store`. FMHA sets `self.epi_tile = self.pv_mma_tiler[:2]` directly.
-
-3. **Alternative (for later)**: Investigate using `composition()` to create a hybrid layout that both the QK softmax write and PV A-fragment read can agree on.
-
----
-
-## Bugs 1–3: ✅ FIXED
-
-### Bug 1: V B-Operand Must Be MN-Major
-
-FMHA requires V to be **MN-major** for the PV MMA B-operand. V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) via `as_strided`.
-
-### Bug 2: C-Fragment Composition Store for P — CONFIRMED CORRECT
-
-FP32→BF16 packing via C-fragment composition store works. ⛔ `St32x32bOp` MUST use Float32, NOT BFloat16.
-
-### Bug 3: First PV Must Use ACCUMULATE=False
-
-If ACCUMULATE=True on the first PV, `O = P@V + old_O` adds uninitialized TMEM. FMHA: `pv_tiled_mma.set(tcgen05.Field.ACCUMULATE, kphase_idx != 0)`.
-
----
-
-## Pipeline Deadlock — ✅ FIXED (May 21)
-
-Three root causes found and fixed:
-1. `PipelineUmmaAsync` for mma_si must NOT pass `cta_layout_vmnk`
-2. TMA warp must NOT call `tmem.wait_for_alloc()`
-3. `pipeline.PipelineTmaStore` (not `TmaStorePipeline`)
-
----
-
-## ⛔ FOOTGUNS — CUTLASS CuTeDSL Landmines
-
-1. **St32x32bOp with BFloat16 → ILLEGAL MEMORY ACCESS** — Must use Float32 + `cute.recast_ptr`
-2. **V major ≠ K major** — V must be MN-major, use `as_strided`
-3. **C-fragment → A-fragment TMEM alias only works when N_MMA matches** — (128,128) works, (128,64) breaks
-4. **PipelineUmmaAsync consumer = thread count, NOT warp count** — `32 * len(warp_ids)`
-5. **mma_si pipeline must NOT pass cta_layout_vmnk**
-6. **TMA warp excluded from tmem barrier**
-7. **First PV ACCUMULATE=False**
-8. **TMEM offset: FP32 ptr + 32 = BF16 ptr + 64** (width scaling)
-9. **epi_tile must use PV cta_tile, not QK**
-10. **CuTe nested layout modes flatten sequentially** — `((128,16),1,(4,2)):((65536,1),0,(16,64))` is sequential
-
----
-
-## Architecture: Per-Tile Flow
+Place O after both S and P regions:
 
+```python
+p_cols_fp32 = pv_mma_tiler[2] * q_dtype.width // qk_acc_dtype.width  # 128*16/32 = 64
+p_end = tmem_p0_offset + p_cols_fp32  # 32 + 64 = 96
+s_cols = qk_mma_tiler[1]  # 128
+o_after = max(s_cols, p_end)  # 128
+tmem_o0_offset = ((o_after + 31) // 32) * 32  # align to 32 = 128
 ```
-For each KV tile:
-  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
-  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
-     Signals scores_full_mbar (via PipelineUmmaAsync commit)
-  3. Epilogue warps wait on mma_si consumer (scores ready), then:
-     a. tcgen05.ld scores from TMEM → register fragments
-     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
-     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
-     d. tcgen05.st exp(scores - new_max) back to TMEM → P operand
-     e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
-  4. MMA warp waits on mma_si acquire (softmax done), MMA2: P @ sV → tmem_output (accumulate=True)
-  5. Stage released, load warp can refill it
 
-After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM
+### Secondary Fix: FMHA-Style V Reconstruction
+
+V from DLPack has logical shape (n, hd) but PV B-operand expects (hd, n). Reconstruct inside CuTe:
+
+```python
+v_fmha = cute.make_tensor(
+    v.iterator,
+    cute.make_layout(
+        (HEAD_DIM, s_k, 1),
+        stride=(1, HEAD_DIM, HEAD_DIM * s_k),
+    ),
+)
+v_major = LayoutEnum.from_tensor(v_fmha).mma_major_mode()  # MN
+# Use v_fmha in make_tiled_tma_atom_B, NOT the DLPack v
 ```
 
 ---
 
-## Environment
+## TMEM Layout
 
-- **Server**: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
-- **venv**: `source /root/dsv4-nvfp4-workspace/venv/bin/activate`
-- **PYTHONPATH**: `/root/dsv4-nvfp4-workspace/kernel`
-- **Model**: `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4`
-- **vLLM repo**: `/root/dsv4-nvfp4-workspace/vllm` (modified for Blackwell)
-- **Pseudocode**: `/root/fragile-kernel-example/README.md`
-- **fmha.py reference**: `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py`
-- **fmha_bwd.py reference**: `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py`
+```
+Col:  0          32          64          96          128         192        256
+      |---- S ----|---- P ----|           |---- O ----|
+      |  QK acc   | Softmax P |  (gap)    |  PV acc   |
+      |  128 FP32 |  64 FP32  |  32 col   |  64 FP32  |
+```
+
+P aliases part of S (softmax overwrites S columns 32-95 with P). O must not overlap P or S.
+
+---
+
+## Softmax P Store (FMHA Pattern)
+
+Store uses QK C-fragment composition. Read uses PV A-fragment. These are two separate aliases of the same physical TMEM — the P/A alias works (proven by no-softmax test) because both layouts depend on M=128 and K, not on PV output N.
+
+```python
+# Store (softmax writes P)
+tStP = cute.make_tensor(tStS.iterator + tmem_p0_offset,
+    cute.composition(tStS.layout, cute.make_layout((128, p_cols_fp32))))
+tiled_tmem_store = tcgen05.make_tmem_copy(store_atom, tStP)
+
+# Read (PV MMA reads P)
+tP = cute.make_tensor(tStS.iterator, p_tmem_s.outer)
+tOrP = pv_thr.make_fragment_A(tP)[None,None,None,0]
+tOrP0 = cute.make_tensor(tOrP.iterator + width_scale * tmem_p0_offset, tOrP.layout)
+```
+
+Register bridge (FP32 backing + BF16 view):
+```python
+rP_words = cute.make_rmem_tensor(tScP.shape, qk_acc_dtype)
+rP_bf16 = cute.make_tensor(recast_ptr(rP_words.iterator, dtype=q_dtype), tTMEM_LOADrS.layout)
+```
+
+---
+
+## Test Files
+
+- **tests/test_fmha_v3.py** — Full pipeline with KV-tile interleaving. PASS.
+- **tests/test_pv64_with_softmax.py** — Single AB pipeline. PASS.
+- **tests/test_128_128_vdiag.py** — (128,128) PV baseline. PASS.
+- **tests/test_qkonly.py** — QK only. PASS.
+- **tests/test_qk_softmax.py** — QK + softmax (no PV). PASS.