Same P/O overlap bug: O at col 64 overlapped P at [32,96). Same fixes: O at col 128, FMHA V reconstruction, power-of-2 TMEM alloc.
DSV4 NVFP4 Workspace
Status (May 21, 2026 — 09:18 UTC)
Stage A ✅ COMPLETE
Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.
Stage B 🔨 IN PROGRESS — TMEM Alias Bug 4
Two MMAs chained: Q@K^T (SMEM source) → identity softmax in TMEM → P@V (TMEM source).
Pipeline deadlock: ✅ FIXED. Softmax packing: ✅ CONFIRMED CORRECT.
Bug 4 (ACTIVE): Non-(128,128) PV MMA — V/B Staging or Output C/D Failure
Summary
The softmax writes P to TMEM using the QK C-fragment layout. The PV MMA reads P from TMEM using the PV A-fragment layout. For (128,128) PV these layouts agree. For (128,16) PV they disagree — the PV A-fragment reads from different TMEM columns than where the softmax wrote, producing zero output.
FMHA uses (128,16) PV with the same construction and works. The root difference is not yet identified despite exhaustive comparison. FMHA references: p_tmem_layout_staged = make_smem_layout_a(pv_mma, pv_mma_tiler, q_dtype, 1) — same call we make.
What Works / What Doesn't
- ✅ PV (128,128) output, V=I or random → cosine 1.0 / 0.999999
- ✅ PV (128,128) with zero-padded V (head_dim=16) → cosine 1.0 WORKAROUND
- ✅ PV (128,64), all-ones V → cosine 0.999999 (uniform hides bug)
- ✅ PV (128,64), single-element V → cosine 1.0 (sparse hides bug)
- ❌ PV (128,64), truncated identity V → cosine 0.02
- ❌ PV (128,16), V=I(128,128) → cosine 0.0 (all zeros)
- ❌ PV (128,16) with P at S offset (no softmax) → NaN (FP32→BF16 reinterpret)
Root Cause (CONFIRMED May 21 09:50 UTC)
Bug is NOT the TMEM alias. The PV A-fragment layout is identical for all PV sizes (confirmed by C++ source and diagnostics): all PV sizes produce tOrP2_s = (2048, 1, 8), size=16384.
The real bug: V SMEM only holds 1 K-tile (2048 BF16), but the PV MMA iterates 8 K-phases. For non-(128,128) V, most K-phases read wrong V data.
- (128,128) PV + V=I works by coincidence (V=I makes the projection self-consistent)
- (128,32) PV + V=(32,128) fails because V SMEM only has V[0:16,:], K-phases 1-7 read wrong data
- Zero-padded V works because V=(128,128) covers all 8 K-phases; rows beyond head_dim are zero
How FMHA avoids this: FMHA interleaves QK and PV per KV-tile. Each tile loads 16 K-rows of V, and PV processes only that tile. This ensures V SMEM always has the correct data.
Workaround: Zero-pad V to 128 K-rows (2-4x compute waste, but correct). Proper fix: FMHA-style KV-tile interleaving.
Current Workaround
Use (128,128) PV with zero-padded V. This wastes compute (8× for head_dim=16, 2× for head_dim=64) but produces correct results (cosine 1.0). For the production kernel, we'll use this initially and optimize to (128,16) PV once the TMEM alias is resolved.
Required Fixes (Not Yet Applied)
-
Primary: Softmax must write P using the PV A-fragment TMEM layout, not the QK C-fragment layout. Requires constructing a
make_tmem_copywithtP(PV layout) as the destination, and rearranging register data from QK partition to PV partition. -
Secondary:
epi_tilemust use PV's cta tile, andself.cta_tile_shape_mnkmust be swapped beforeepilogue_tma_store. FMHA setsself.epi_tile = self.pv_mma_tiler[:2]directly. -
Alternative (for later): Investigate using
composition()to create a hybrid layout that both the QK softmax write and PV A-fragment read can agree on.
Bugs 1–3: ✅ FIXED
Bug 1: V B-Operand Must Be MN-Major
FMHA requires V to be MN-major for the PV MMA B-operand. V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) via as_strided.
Bug 2: C-Fragment Composition Store for P — CONFIRMED CORRECT
FP32→BF16 packing via C-fragment composition store works. ⛔ St32x32bOp MUST use Float32, NOT BFloat16.
Bug 3: First PV Must Use ACCUMULATE=False
If ACCUMULATE=True on the first PV, O = P@V + old_O adds uninitialized TMEM. FMHA: pv_tiled_mma.set(tcgen05.Field.ACCUMULATE, kphase_idx != 0).
Pipeline Deadlock — ✅ FIXED (May 21)
Three root causes found and fixed:
PipelineUmmaAsyncfor mma_si must NOT passcta_layout_vmnk- TMA warp must NOT call
tmem.wait_for_alloc() pipeline.PipelineTmaStore(notTmaStorePipeline)
⛔ FOOTGUNS — CUTLASS CuTeDSL Landmines
- St32x32bOp with BFloat16 → ILLEGAL MEMORY ACCESS — Must use Float32 +
cute.recast_ptr - V major ≠ K major — V must be MN-major, use
as_strided - C-fragment → A-fragment TMEM alias only works when N_MMA matches — (128,128) works, (128,64) breaks
- PipelineUmmaAsync consumer = thread count, NOT warp count —
32 * len(warp_ids) - mma_si pipeline must NOT pass cta_layout_vmnk
- TMA warp excluded from tmem barrier
- First PV ACCUMULATE=False
- TMEM offset: FP32 ptr + 32 = BF16 ptr + 64 (width scaling)
- epi_tile must use PV cta_tile, not QK
- CuTe nested layout modes flatten sequentially —
((128,16),1,(4,2)):((65536,1),0,(16,64))is sequential
Architecture: Per-Tile Flow
For each KV tile:
1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
Signals scores_full_mbar (via PipelineUmmaAsync commit)
3. Epilogue warps wait on mma_si consumer (scores ready), then:
a. tcgen05.ld scores from TMEM → register fragments
b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
d. tcgen05.st exp(scores - new_max) back to TMEM → P operand
e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
4. MMA warp waits on mma_si acquire (softmax done), MMA2: P @ sV → tmem_output (accumulate=True)
5. Stage released, load warp can refill it
After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM
Environment
- Server: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
- venv:
source /root/dsv4-nvfp4-workspace/venv/bin/activate - PYTHONPATH:
/root/dsv4-nvfp4-workspace/kernel - Model:
/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4 - vLLM repo:
/root/dsv4-nvfp4-workspace/vllm(modified for Blackwell) - Pseudocode:
/root/fragile-kernel-example/README.md - fmha.py reference:
/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py - fmha_bwd.py reference:
/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py