- Workspace README: full rewrite with Stage B ✅, Bug 4b root cause (P/O overlap),
FMHA V reconstruction, TMEM layout diagram, softmax store pattern, updated footguns
- Kernel README: focused on the bug, fix, and current test status
- Key lesson documented: NEVER use find_tmem_tensor_col_offset() as O placement
DSV4 NVFP4 Kernel
Status (May 21, 2026 — 15:35 UTC)
Stage A ✅ COMPLETE
Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.
Stage B ✅ COMPLETE — QK → Softmax → PV pipeline working for (128,64) PV
Cosine 0.999999 with identity softmax and random V.
Stage C 🔨 NEXT
Real softmax (exp, row max, row sum, rescale). Multi-tile with proper accumulation.
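The row max, row sum, and rescale steps listed for Stage C match the standard online-softmax recurrence. The pure-Python reference below is a sketch for a single query row, not the kernel's implementation; the function names and tile size are illustrative assumptions. It may be useful as a host-side check once multi-tile accumulation lands.

```python
import math

def reference_softmax_pv(scores, values):
    """One-shot softmax(scores) @ values for a single query row."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    total = sum(p)
    hd = len(values[0])
    return [sum(p[k] * values[k][d] for k in range(len(scores))) / total
            for d in range(hd)]

def online_softmax_pv(scores, values, tile):
    """Same result, computed tile by tile with running max/sum and rescale."""
    hd = len(values[0])
    row_max = -math.inf
    row_sum = 0.0
    acc = [0.0] * hd                      # unnormalized output accumulator
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(row_max, max(s_tile))
        scale = math.exp(row_max - new_max)   # exp(-inf) == 0.0 on the first tile
        row_sum *= scale                      # rescale the running sum
        acc = [a * scale for a in acc]        # rescale the running accumulator
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            row_sum += p
            for d in range(hd):
                acc[d] += p * v[d]
        row_max = new_max
    return [a / row_sum for a in acc]

scores = [0.1, 2.0, -1.0, 0.5]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
tiled = online_softmax_pv(scores, values, tile=2)
direct = reference_softmax_pv(scores, values)
```

The two results should agree to floating-point tolerance for any tile size that covers the row.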
Stage B — Bug 4b Root Cause & Fix
The Bug: TMEM P/O Region Overlap
Symptom: (128,64) PV produces NaN or zeros. (128,128) PV works fine.
Root cause: PV output accumulator O was placed at find_tmem_tensor_col_offset(tOtO), which returns 64 for (128,64) PV. P occupies TMEM columns [32, 96). O at [64, 128) overlaps P at [64, 96). While PV MMA reads P (A-operand), it simultaneously writes O (D-operand) to overlapping TMEM columns. The A-operand gets corrupted mid-computation.
For (128,128) PV, find_tmem_tensor_col_offset(tOtO) returns 128, so O starts after P — no overlap. It worked by accident.
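The overlap can be checked with plain interval arithmetic. The sketch below uses the column numbers quoted above; the helper names are hypothetical, and the 128-column width of O in the (128,128) case is an assumption (one FP32 column per output N).

```python
def col_range(start, width):
    """Half-open TMEM column range [start, start + width)."""
    return (start, start + width)

def overlap(a, b):
    """Return the overlapping half-open range, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

P = col_range(32, 64)           # P occupies [32, 96)
O_PV64 = col_range(64, 64)      # (128,64) PV: buggy placement at offset 64
O_PV128 = col_range(128, 128)   # (128,128) PV: offset 128 (width assumed)

bad = overlap(P, O_PV64)        # columns shared by P and O
ok = overlap(P, O_PV128)        # no shared columns
```

Here `bad` covers columns [64, 96), the region where the MMA reads P while writing O, and `ok` is `None`.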
The Fix
Place O after both S and P regions:
p_cols_fp32 = pv_mma_tiler[2] * q_dtype.width // qk_acc_dtype.width # 128*16/32 = 64
p_end = tmem_p0_offset + p_cols_fp32 # 32 + 64 = 96
s_cols = qk_mma_tiler[1] # 128
o_after = max(s_cols, p_end) # 128
tmem_o0_offset = ((o_after + 31) // 32) * 32 # align to 32 = 128
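The same arithmetic as a standalone function, so it can be unit-tested on the host without the kernel. This is a sketch: the function name and parameter names are hypothetical, and the second call uses made-up sizes purely to exercise the `max` branch.

```python
def o_column_offset(tmem_p0_offset, pv_k, operand_bits, acc_bits, s_cols, align=32):
    """First aligned TMEM column at or past both the S and P regions."""
    p_cols_fp32 = pv_k * operand_bits // acc_bits   # P width in FP32 columns
    p_end = tmem_p0_offset + p_cols_fp32            # one past the last P column
    o_after = max(s_cols, p_end)                    # clear S and P
    return ((o_after + align - 1) // align) * align # round up to alignment

# Values from this README: P at 32, PV K = 128, BF16 operands, FP32 accumulator.
offset = o_column_offset(32, 128, 16, 32, 128)

# Hypothetical narrower S (64 columns): P's end, not S, decides the placement.
offset_small_s = o_column_offset(32, 128, 16, 32, 64)
```

With the README's values the result is 128; in the hypothetical narrower case it is 96, which still clears P's end.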
Secondary Fix: FMHA-Style V Reconstruction
V from DLPack has logical shape (n, hd) but PV B-operand expects (hd, n). Reconstruct inside CuTe:
v_fmha = cute.make_tensor(
    v.iterator,
    cute.make_layout(
        (HEAD_DIM, s_k, 1),
        stride=(1, HEAD_DIM, HEAD_DIM * s_k),
    ),
)
v_major = LayoutEnum.from_tensor(v_fmha).mma_major_mode() # MN
# Use v_fmha in make_tiled_tma_atom_B, NOT the DLPack v
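To see why this layout yields the (hd, n) view without copying, the shape/stride mapping can be emulated over a flat list. The sketch below is plain Python, not CuTe; the small sizes and helper names are illustrative assumptions, and it assumes the DLPack V buffer is row-major (n, hd).

```python
HEAD_DIM, S_K = 4, 3   # small illustrative sizes

# Row-major (n, hd) buffer: element (k, d) lives at k * HEAD_DIM + d.
# Each value encodes its own (k, d) so the view is easy to check.
flat = [k * 10 + d for k in range(S_K) for d in range(HEAD_DIM)]

def v_fmha_offset(d, k, b=0):
    """Offset for shape (HEAD_DIM, s_k, 1), stride (1, HEAD_DIM, HEAD_DIM * s_k)."""
    return d * 1 + k * HEAD_DIM + b * HEAD_DIM * S_K

def v_fmha(d, k):
    """Read the (hd, n) view from the same storage."""
    return flat[v_fmha_offset(d, k)]

# Every (d, k) in the view matches (k, d) in the original buffer.
matches = all(v_fmha(d, k) == k * 10 + d
              for d in range(HEAD_DIM) for k in range(S_K))
```

The view is the transpose of the stored tensor: the head dimension has stride 1, so the operand is contiguous along its first mode.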
TMEM Layout
Region | Columns    | Width    | Contents
S      | [0, 128)   | 128 FP32 | QK acc
P      | [32, 96)   | 64 FP32  | Softmax P (inside S)
(gap)  | [96, 128)  | 32 col   | tail of S past P
O      | [128, 192) | 64 FP32  | PV acc
P aliases part of S (softmax overwrites S columns 32-95 with P). O must not overlap P or S.
Softmax P Store (FMHA Pattern)
The store path writes P through a composition of the QK C-fragment layout; the read path consumes it through the PV A-fragment. These are two separate aliases of the same physical TMEM. The P/A alias works (proven by the no-softmax test) because both layouts depend on M=128 and K, not on the PV output N.
# Store (softmax writes P)
tStP = cute.make_tensor(tStS.iterator + tmem_p0_offset,
                        cute.composition(tStS.layout, cute.make_layout((128, p_cols_fp32))))
tiled_tmem_store = tcgen05.make_tmem_copy(store_atom, tStP)
# Read (PV MMA reads P)
tP = cute.make_tensor(tStS.iterator, p_tmem_s.outer)
tOrP = pv_thr.make_fragment_A(tP)[None,None,None,0]
tOrP0 = cute.make_tensor(tOrP.iterator + width_scale * tmem_p0_offset, tOrP.layout)
Register bridge (FP32 backing + BF16 view):
rP_words = cute.make_rmem_tensor(tScP.shape, qk_acc_dtype)
rP_bf16 = cute.make_tensor(recast_ptr(rP_words.iterator, dtype=q_dtype), tTMEM_LOADrS.layout)
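The FP32-backing-plus-BF16-view idea can be illustrated on the host: two 16-bit values share one 32-bit word, which is why 128 BF16 values of P fit in 64 FP32 columns. The sketch below is plain Python, not the kernel's code; the round-to-nearest-even conversion and the low-half-first packing order are assumptions, and it handles finite inputs only.

```python
import struct

def f32_to_bf16_bits(x):
    """Round a finite float to BF16 (nearest-even) and return its 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)   # ties go to the even result
    return ((bits + rounding) >> 16) & 0xFFFF

def pack_bf16_pairs(values):
    """Pack an even-length list of floats into 32-bit words, two BF16 per word."""
    assert len(values) % 2 == 0
    words = []
    for i in range(0, len(values), 2):
        lo = f32_to_bf16_bits(values[i])       # assumed: element 2i in the low half
        hi = f32_to_bf16_bits(values[i + 1])   # assumed: element 2i+1 in the high half
        words.append((hi << 16) | lo)
    return words

row = [0.5] * 128              # one row of P, 128 BF16 values
words = pack_bf16_pairs(row)   # 32-bit backing words for that row
```

`len(words)` is 64, matching `p_cols_fp32` for a 128-wide P row.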
Test Files
- tests/test_fmha_v3.py — Full pipeline with KV-tile interleaving. PASS.
- tests/test_pv64_with_softmax.py — Single AB pipeline. PASS.
- tests/test_128_128_vdiag.py — (128,128) PV baseline. PASS.
- tests/test_qkonly.py — QK only. PASS.
- tests/test_qk_softmax.py — QK + softmax (no PV). PASS.