DSV4 NVFP4 Workspace

Status (May 21, 2026 — 09:18 UTC)

Stage A COMPLETE

Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.

Stage B 🔨 IN PROGRESS: Bug 4 (non-(128,128) PV MMA)

Two MMAs chained: Q@K^T (SMEM source) → identity softmax in TMEM → P@V (TMEM source).

Pipeline deadlock: FIXED. Softmax packing: CONFIRMED CORRECT.


Bug 4 (ACTIVE): Non-(128,128) PV MMA — V/B Staging or Output C/D Failure

Summary

Initial hypothesis (superseded by the Root Cause section): the softmax writes P to TMEM using the QK C-fragment layout, while the PV MMA reads P from TMEM using the PV A-fragment layout. For (128,128) PV these layouts agree. For (128,16) PV they were thought to disagree, so that the PV A-fragment would read TMEM columns other than the ones the softmax wrote, producing zero output.

FMHA uses (128,16) PV with the same construction and works. The root difference has not been identified despite exhaustive comparison. FMHA builds its staged P layout as p_tmem_layout_staged = make_smem_layout_a(pv_mma, pv_mma_tiler, q_dtype, 1), which is the same call we make.

What Works / What Doesn't

  • PV (128,128) output, V=I or random → cosine 1.0 / 0.999999
  • PV (128,128) with zero-padded V (head_dim=16) → cosine 1.0 WORKAROUND
  • PV (128,64), all-ones V → cosine 0.999999 (uniform hides bug)
  • PV (128,64), single-element V → cosine 1.0 (sparse hides bug)
  • PV (128,64), truncated identity V → cosine 0.02
  • PV (128,16), V=I(128,128) → cosine 0.0 (all zeros)
  • PV (128,16) with P at S offset (no softmax) → NaN (FP32→BF16 reinterpret)
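The "uniform hides bug" note can be illustrated with a small pure-Python sketch. The fault model here (V's columns rotated by one during staging) is hypothetical and is not claimed to be the actual bug; it only shows why an all-ones V cannot expose a data-placement error while an identity-like V can. The names `matmul`, `cosine`, and `rotate_columns` are illustrative helpers, not part of the kernel.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of a (M x K) and b (K x N)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def cosine(x, y):
    """Cosine similarity between two matrices, treated as flat vectors."""
    fx = [v for row in x for v in row]
    fy = [v for row in y for v in row]
    dot = sum(p * q for p, q in zip(fx, fy))
    norm = math.sqrt(sum(p * p for p in fx)) * math.sqrt(sum(q * q for q in fy))
    return dot / norm if norm else 0.0

def rotate_columns(v):
    """Hypothetical staging fault: every row of V is rotated left by one column."""
    return [row[1:] + row[:1] for row in v]

M, K, N = 4, 8, 4  # toy sizes, far smaller than the real (128, N, 128) tiles
P = [[(7 * i + 3 * k) % 5 + 1 for k in range(K)] for i in range(M)]
V_ONES = [[1] * N for _ in range(K)]                                 # uniform V
V_EYE = [[1 if k == j else 0 for j in range(N)] for k in range(K)]   # truncated identity

def score(v):
    """Cosine between the correct output and the output under the fault."""
    return cosine(matmul(P, v), matmul(P, rotate_columns(v)))

cos_ones = score(V_ONES)  # the fault is invisible: rotating all-ones changes nothing
cos_eye = score(V_EYE)    # the fault is visible: output columns land in the wrong place
print(cos_ones, cos_eye)
```

This is why the identity and truncated-identity cases in the list above are the more trustworthy probes.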

Root Cause (Updated May 21 09:20 UTC)

The P/A TMEM alias is NOT the bug. Diagnostic prints confirm the PV A-fragment layout is IDENTICAL for all PV sizes:

(128,128) PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=1.0
(128,32)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.51
(128,16)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.36

The C++ source confirms: the A-fragment TMEM atom depends on M and K, NOT output N. The softmax writes P to the same TMEM columns regardless of PV size.

The real bug is in the V/B staging or output C/D path. When using (128,128) PV with zero-padded V (which keeps the V SMEM, O C-fragment, and epilogue at (128,128) dimensions), cosine=1.0. When using native (128,32) PV with V=(32,128), cosine=0.51. The difference is the V SMEM layout and/or output epilogue.

Key observations:

  • V smem_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
  • O tOtO_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
  • cta_tile_shape_mnk=(128,128,64) for BOTH — this is QK's cta tile, not PV's
  • epi_tile=(128,16) for (128,16) PV — this IS correct (from PV)
  • Swapping cta_tile to PV before epilogue doesn't fix the issue
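The sizes reported above are consistent with simple operand arithmetic for an (M, N, K) MMA tile, assuming the PV tile uses K=128 (the KV sequence tile, inferred from the V shape in Bug 1). A minimal sketch, with `operand_sizes` as an illustrative helper:

```python
def operand_sizes(m, n, k):
    """Element counts of the A (M x K), B (N x K) and C (M x N) tiles of one MMA."""
    return {"A": m * k, "B": n * k, "C": m * n}

PV_K = 128  # assumption: PV reduces over a 128-wide KV tile

for n in (128, 32, 16):
    sizes = operand_sizes(128, n, PV_K)
    # A (the P operand) stays at 16384 for every N; only B (V) and C (O) shrink.
    print(f"PV (128,{n}): {sizes}")
```

Under that assumption, the P operand is 16384 elements for every PV size, while V and O drop to 2048 elements at N=16, matching the smem_size and tOtO_size values listed.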

Next steps:

  1. Test V TMA load correctness for (128,32) PV — is V data loaded correctly into SMEM?
  2. Test PV MMA output directly (skip epilogue) — is the PV MMA producing correct O?
  3. Check if V B-operand fragment (tCrV) has the right shape for (128,32) PV

Current Workaround

Use (128,128) PV with zero-padded V. This wastes compute (8× for head_dim=16, 2× for head_dim=64) but produces correct results (cosine 1.0). The production kernel will use this initially and switch to (128,16) PV once Bug 4 is resolved.
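A minimal sketch of why the zero-padding workaround is exact, using toy sizes and plain Python lists rather than the real tiles. `pad_columns` and `waste_factor` are illustrative names, not kernel functions.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (M x K) and b (K x N)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def pad_columns(v, width):
    """Zero-pad every row of V (K x head_dim) out to `width` columns."""
    return [row + [0] * (width - len(row)) for row in v]

def waste_factor(tile_n, head_dim):
    """How many times more output columns the padded MMA computes."""
    return tile_n // head_dim

M, K, HEAD_DIM, TILE_N = 4, 6, 2, 8  # toy stand-ins for 128, 128, 16, 128
P = [[(i + 2 * k) % 5 for k in range(K)] for i in range(M)]
V = [[(3 * k + j) % 7 for j in range(HEAD_DIM)] for k in range(K)]

o_native = matmul(P, V)
o_padded = matmul(P, pad_columns(V, TILE_N))
o_sliced = [row[:HEAD_DIM] for row in o_padded]  # keep only the real columns

print(o_sliced == o_native)
print(waste_factor(128, 16), waste_factor(128, 64))
```

The padded columns of V contribute only zeros to the extra output columns, so slicing recovers the native result exactly; the cost is the extra MMA work counted by `waste_factor`.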

Required Fixes (Not Yet Applied)

  1. Primary (proposed before the root-cause update above, which found the P/A alias is not the bug, so this may be unnecessary): Softmax must write P using the PV A-fragment TMEM layout, not the QK C-fragment layout. Requires calling make_tmem_copy with tP (PV layout) as the destination, and rearranging register data from the QK partition to the PV partition.

  2. Secondary: epi_tile must use PV's cta tile, and self.cta_tile_shape_mnk must be swapped before epilogue_tma_store. FMHA sets self.epi_tile = self.pv_mma_tiler[:2] directly.

  3. Alternative (for later): Investigate using composition() to create a hybrid layout that both the QK softmax write and PV A-fragment read can agree on.


Bugs 1-3: FIXED

Bug 1: V B-Operand Must Be MN-Major

FMHA requires V to be MN-major for the PV MMA B-operand. V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) via as_strided.
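The strided view can be sketched in plain Python. This assumes the underlying V buffer is stored row-major as (seq, head_dim) = (128, 64), which is what strides (1, 64) on a (64, 128) shape imply; `strided_view` is an illustrative stand-in for as_strided, not the real call.

```python
def strided_view(buf, shape, strides):
    """Read a 2-D view of a flat buffer: element (r, c) lives at r*s0 + c*s1."""
    rows, cols = shape
    s0, s1 = strides
    return [[buf[r * s0 + c * s1] for c in range(cols)] for r in range(rows)]

SEQ, HEAD_DIM = 128, 64
# Flat buffer holding V row-major as (seq, head_dim): V[s][d] is at s*HEAD_DIM + d.
buf = list(range(SEQ * HEAD_DIM))

# Shape (head_dim, seq) with strides (1, head_dim): a transposed view, no data movement.
v_mn_major = strided_view(buf, (HEAD_DIM, SEQ), (1, HEAD_DIM))

print(len(v_mn_major), len(v_mn_major[0]))
print(v_mn_major[5][7] == buf[7 * HEAD_DIM + 5])
```

The view only changes how indices map to offsets: walking the first (head_dim) mode is contiguous in memory, which is the MN-major property the B operand needs.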

Bug 2: C-Fragment Composition Store for P — CONFIRMED CORRECT

FP32→BF16 packing via C-fragment composition store works. St32x32bOp MUST use Float32, NOT BFloat16.
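As an illustration of the packing itself, the sketch below converts FP32 values to BF16 bit patterns and packs two of them into one 32-bit word, which is the unit a 32-bit store moves. The rounding mode (round to nearest even) and the lane order (first value in the low half) are assumptions, NaN inputs are not handled, and the function names are illustrative.

```python
import struct

def f32_to_bf16_bits(x):
    """BF16 bit pattern of a finite float, rounding to nearest even (assumed mode)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even mantissa
    return ((bits + rounding) >> 16) & 0xFFFF

def pack_bf16_pair(lo, hi):
    """Pack two BF16 values into one 32-bit word; `lo` in the low half (assumed order)."""
    return f32_to_bf16_bits(lo) | (f32_to_bf16_bits(hi) << 16)

def bf16_bits_to_f32(bits):
    """Expand a BF16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

word = pack_bf16_pair(1.0, 2.0)
print(hex(word))
print(bf16_bits_to_f32(word & 0xFFFF), bf16_bits_to_f32(word >> 16))
```

Because each packed word is 32 bits wide, the store can be typed as Float32 while the payload is really two BF16 values, which is one plausible reading of why the Float32 op type is required.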

Bug 3: First PV Must Use ACCUMULATE=False

If ACCUMULATE=True on the first PV, O = P@V + old_O adds uninitialized TMEM. FMHA: pv_tiled_mma.set(tcgen05.Field.ACCUMULATE, kphase_idx != 0).
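A toy model of the accumulate flag, in plain Python. The accumulator starts with an arbitrary value (999) standing in for uninitialized TMEM; `mma` and `run_pv` are illustrative helpers and do not correspond to CuTeDSL calls.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (M x K) and b (K x N)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mma(a, b, acc, accumulate):
    """Model of one MMA: overwrite the accumulator, or add the product to it."""
    prod = matmul(a, b)
    if not accumulate:
        return prod
    return [[p + c for p, c in zip(pr, cr)] for pr, cr in zip(prod, acc)]

def run_pv(p, v, k_block, first_accumulate):
    """Compute P @ V in k-phases of width k_block."""
    m, n = len(p), len(v[0])
    acc = [[999] * n for _ in range(m)]  # stand-in for uninitialized TMEM
    for kphase_idx, start in enumerate(range(0, len(v), k_block)):
        p_blk = [row[start:start + k_block] for row in p]
        v_blk = v[start:start + k_block]
        accumulate = first_accumulate or kphase_idx != 0
        acc = mma(p_blk, v_blk, acc, accumulate)
    return acc

P = [[(i + k) % 4 for k in range(6)] for i in range(3)]
V = [[(2 * k + j) % 5 for j in range(2)] for k in range(6)]

good = run_pv(P, V, k_block=2, first_accumulate=False)  # ACCUMULATE = kphase_idx != 0
bad = run_pv(P, V, k_block=2, first_accumulate=True)    # ACCUMULATE always True
print(good == matmul(P, V), bad == matmul(P, V))
```

With the flag cleared on the first phase the stale accumulator contents are overwritten; with it always set they are carried into every output element.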


Pipeline Deadlock — FIXED (May 21)

Three root causes found and fixed:

  1. PipelineUmmaAsync for mma_si must NOT pass cta_layout_vmnk
  2. TMA warp must NOT call tmem.wait_for_alloc()
  3. The TMA store pipeline class is pipeline.PipelineTmaStore, not TmaStorePipeline

FOOTGUNS — CUTLASS CuTeDSL Landmines

  1. St32x32bOp with BFloat16 → ILLEGAL MEMORY ACCESS — Must use Float32 + cute.recast_ptr
  2. V major ≠ K major — V must be MN-major, use as_strided
  3. C-fragment → A-fragment TMEM alias only works when N_MMA matches — (128,128) works, (128,64) breaks
  4. PipelineUmmaAsync consumer = thread count, NOT warp count: 32 * len(warp_ids)
  5. mma_si pipeline must NOT pass cta_layout_vmnk
  6. TMA warp excluded from tmem barrier
  7. First PV ACCUMULATE=False
  8. TMEM offset: FP32 ptr + 32 = BF16 ptr + 64 (width scaling)
  9. epi_tile must use PV cta_tile, not QK
  10. CuTe nested layout modes flatten sequentially: ((128,16),1,(4,2)):((65536,1),0,(16,64)) is sequential
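Item 10 can be checked with a small layout evaluator written in plain Python. It follows the usual CuTe convention that a linear index walks the flattened leaf modes left to right with the leftmost mode varying fastest; `flatten` and `layout_offset` are illustrative helpers, not CuTe APIs.

```python
import math

def flatten(t):
    """Flatten a nested tuple of ints into a flat list, left to right."""
    if isinstance(t, int):
        return [t]
    out = []
    for item in t:
        out.extend(flatten(item))
    return out

def layout_offset(idx, shape, stride):
    """Map a linear index to a memory offset for a (possibly nested) shape:stride layout."""
    offset = 0
    for extent, step in zip(flatten(shape), flatten(stride)):
        offset += (idx % extent) * step  # coordinate in this leaf mode
        idx //= extent                   # move on to the next leaf mode
    return offset

SHAPE = ((128, 16), 1, (4, 2))
STRIDE = ((65536, 1), 0, (16, 64))

size = math.prod(flatten(SHAPE))
print(size)                                 # total elements in the layout
print(layout_offset(1, SHAPE, STRIDE))      # next element along the 128-mode
print(layout_offset(128, SHAPE, STRIDE))    # first step along the 16-mode
print(layout_offset(2048, SHAPE, STRIDE))   # first step along the 4-mode
```

The top-level mode sizes are (2048, 1, 8) and the total is 16384, which lines up with the tOrP2_s shape and size printed in the Root Cause diagnostics.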

Architecture: Per-Tile Flow

For each KV tile:
  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
     Signals scores_full_mbar (via PipelineUmmaAsync commit)
  3. Epilogue warps wait on mma_si consumer (scores ready), then:
     a. tcgen05.ld scores from TMEM → register fragments
     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
     d. tcgen05.st exp(scores - new_max) back to TMEM → P operand
     e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
  4. MMA warp waits on mma_si acquire (softmax done), MMA2: P @ sV → tmem_output (accumulate=False on the first KV tile, True afterwards; see Bug 3)
  5. Stage released, load warp can refill it

After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM
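The numerics of the per-tile flow can be sketched on the host in plain Python, ignoring warps, pipelines, TMEM, and low-precision types. The running row_sum update (rescale the old sum, then add the new tile's P) is an assumption: the flow above only states that the final output is divided by row_sum. No softmax scale factor is applied, and all names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_tiled(q, k, v, tile):
    """Online-softmax attention, one KV tile at a time (steps 2 to 4 of the flow)."""
    d_v = len(v[0])
    out = [[0.0] * d_v for _ in q]
    row_max = [-math.inf] * len(q)
    row_sum = [0.0] * len(q)
    for start in range(0, len(k), tile):
        k_tile, v_tile = k[start:start + tile], v[start:start + tile]
        for i, qi in enumerate(q):
            scores = [dot(qi, kr) for kr in k_tile]           # MMA1: Q @ K^T
            new_max = max(row_max[i], max(scores))
            rescale = math.exp(row_max[i] - new_max)          # 0.0 on the first tile
            p = [math.exp(s - new_max) for s in scores]       # P operand
            row_sum[i] = row_sum[i] * rescale + sum(p)        # assumed bookkeeping
            for j in range(d_v):                              # rescale O, then MMA2: P @ V
                out[i][j] = out[i][j] * rescale + sum(
                    pj * vr[j] for pj, vr in zip(p, v_tile))
            row_max[i] = new_max
    return [[o / row_sum[i] for o in row] for i, row in enumerate(out)]

def attention_reference(q, k, v):
    """Single-pass softmax(Q @ K^T) @ V for comparison."""
    result = []
    for qi in q:
        scores = [dot(qi, kr) for kr in k]
        mx = max(scores)
        p = [math.exp(s - mx) for s in scores]
        total = sum(p)
        result.append([sum(pj * vr[j] for pj, vr in zip(p, v)) / total
                       for j in range(len(v[0]))])
    return result

Q = [[((3 * i + 5 * j) % 7 - 3) / 4 for j in range(4)] for i in range(3)]
K = [[((2 * i + 3 * j) % 5 - 2) / 3 for j in range(4)] for i in range(8)]
V = [[((i + 4 * j) % 6) / 5 for j in range(2)] for i in range(8)]

tiled = attention_tiled(Q, K, V, tile=4)
reference = attention_reference(Q, K, V)
max_err = max(abs(a - b) for ra, rb in zip(tiled, reference) for a, b in zip(ra, rb))
print(max_err)
```

A host-side reference like this is also a convenient source of expected values when checking the kernel's cosine against a known-good result.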

Environment

  • Server: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
  • venv: source /root/dsv4-nvfp4-workspace/venv/bin/activate
  • PYTHONPATH: /root/dsv4-nvfp4-workspace/kernel
  • Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
  • vLLM repo: /root/dsv4-nvfp4-workspace/vllm (modified for Blackwell)
  • Pseudocode: /root/fragile-kernel-example/README.md
  • fmha.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py
  • fmha_bwd.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py