biondizzle 7a8945eb76 Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage
Pipeline deadlock fixed:
- No cta_layout_vmnk on mma_si PipelineUmmaAsync
- TMA warp excluded from tmem.wait_for_alloc
- PipelineTmaStore (not TmaStorePipeline)

Bug 1 (V MN-major): fix applied
- PV MMA uses v_major=OperandMajorMode.MN
- V shaped (64,128) strides(1,64) via as_strided

Bug 2 (softmax packing): C-fragment composition store applied
- FP32 to BF16 packing works
- St32x32bOp uses Float32 (not BFloat16)

Bug 3 (PV garbage): investigating
- PV MMA cosine ~0.01 against reference
- Suspected TMEM layout mismatch between softmax P store and PV A-fragment read

Test results:
- test_mma_si_only: cosine 0.999999 PASS
- test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)
2026-05-21 04:10:07 +00:00
2026-05-19 09:37:38 +00:00

DeepSeek-V4 NVFP4 Kernel Suite

CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use cutlass.cute (CuTeDSL) with Blackwell tensor cores.

Status (May 21, 2026 — 04:10 UTC)

Stage A: Bare Q@K^T via tcgen05.mma → TMEM → GMEM — COMPLETE

File: tests/test_stage_a_v2.py Result: Q(128,128) @ K^T(128,128) → S(128,128), cosine 0.999999

🔨 Stage B: Two MMAs + Identity Softmax — IN PROGRESS

Pipeline deadlock: FIXED. Kernel runs without deadlock. Bug 1 (V MN-major): Fix applied. Bug 2 (softmax packing): Fix applied, but PV output is garbage.

Bug 1: V B-Operand Must Be MN-Major — FIX APPLIED

V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) — MN-major. PV MMA uses v_major (OperandMajorMode.MN) instead of b_major (K).

V must use as_strided — default PyTorch (64,128) gives strides (128,1) which is K-major.

Bug 2 (Packing): C-Fragment Composition Store — APPLIED, PV OUTPUT WRONG

FP32→BF16 packing via C-fragment composition store (FMHA pattern) runs without error. The softmax packing overwrites part of S in TMEM (P at tmem_p0_offset=32 overlaps S at offset 0). This is intentional — S is no longer needed after softmax.

FOOTGUN: St32x32bOp MUST use Float32, NOT BFloat16. ⚠️ The recast view for P packing uses the LOAD layout (128 BF16 elements), not the store composition shape.

Bug 3 (NEW): PV MMA Output Is Garbage — 🔨 INVESTIGATING

The PV MMA produces cosine ~0.01 against the reference. Suspected cause: TMEM layout mismatch between the softmax P store (C-fragment composition layout) and the PV MMA A-fragment read (p_tmem_s layout from make_smem_layout_a). These should alias the same physical TMEM columns by the sequential-flattening property, but the specific layout functions may compute different shapes/strides.

🔨 Stage C: Online Softmax — AFTER B

Per the pseudocode: epilogue warps compute per-row tile_max, rescale, exp, store P back to TMEM.

🔨 Stage D: FP8 Paged KV Gather — AFTER C

Replace BF16 TMA load with FP8 paged KV gather + per-position dequant.


Pipeline Deadlock — FIXED (May 21)

v20-v25 all deadlocked on GPU. Three root causes found and fixed:

Fix 1: PipelineUmmaAsync for mma_si Must NOT Pass cta_layout_vmnk

FMHA's mma_s0/mma_s1 PipelineUmmaAsync calls do NOT pass cta_layout_vmnk. Removing it fixes the deadlock.

Fix 2: TMA Warp Must NOT Call tmem.wait_for_alloc()

The tmem allocation barrier has num_threads = 32 * (mma_warp + epilogue_warps). The TMA warp is NOT part of this barrier. Calling wait_for_alloc() from the TMA warp corrupts the barrier.

Fix 3: PipelineTmaStore (not TmaStorePipeline)

pipeline.TmaStorePipeline does not exist. The correct name is pipeline.PipelineTmaStore.


DEAD TEST: test_stage_b_v21.py — DELETED, DO NOT RECREATE

v21 attempted both Bug 1 and Bug 2 fixes in a hand-rolled pipeline kernel. It deadlocks on GPU. Root cause: pipeline synchronization mismatch. Do not recreate. Write from scratch using fmha.py as the reference.


FOOTGUNS — CUTLASS CuTeDSL Landmines

1. St32x32bOp with 16-bit dtype → ILLEGAL MEMORY ACCESS

St32x32bOp(Repetition(N), BFloat16) crashes at runtime. You MUST use St32x32bOp(Repetition(N), Float32) and pack 2×16-bit values into 1×Float32 backing words via cute.recast_ptr. The 16-bit type only appears in the recast view, never in the store atom itself.

2. V B-Operand Major Mode ≠ K Major Mode

FMHA requires v_major_mode == OperandMajorMode.MN. Passing K's K-major mode for V is WRONG. V must be shaped (head_dim, seq) with strides (1, head_dim) to produce MN-major. Standard PyTorch row-major (seq, head_dim) gives K-major.

3. CuTe Nested Layout Modes Flatten Sequentially

A layout like ((128,16),1,(4,2)):((65536,1),0,(16,64)) looks "non-sequential" but flattens to addr = m*65536 + k when k = k0 + 16k1 + 64k2 (CuTe row-major order). Do NOT assume nested modes imply non-sequential physical addressing. The C-fragment composition and A-fragment alias the same TMEM columns.

4. PipelineUmmaAsync Consumer Group = Thread Count, NOT Warp Count

# WRONG: consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 4)
# CORRECT: consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 32 * len(warp_ids))

5. PipelineUmmaAsync for mma_si Must NOT Pass cta_layout_vmnk

Passing cta_layout_vmnk to the mma_si PipelineUmmaAsync causes deadlock. FMHA does not pass it. Remove it.

6. TMA Warp Must NOT Call tmem.wait_for_alloc()

The tmem allocation barrier only includes MMA + epilogue warps. The TMA warp is excluded. Calling wait_for_alloc() from the TMA warp corrupts the barrier.


Architecture: Per-Tile Flow

For each KV tile:
  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
     Signals scores_full_mbar (via PipelineUmmaAsync commit)
  3. Epilogue warps wait on mma_si consumer (scores ready), then:
     a. tcgen05.ld scores from TMEM → register fragments
     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
     d. tcgen05.st exp(scores - new_max) back to TMEM → P operand (via C-fragment composition)
     e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
  4. MMA warp waits on mma_si acquire (softmax done), MMA2: P @ sV → tmem_output (accumulate=True)
  5. Stage released, load warp can refill it

After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM

Test Results

File Description Cosine Status
test_stage_a_v2.py Q@K^T only 0.999999 PASS
test_mma_si_only.py Q@K^T + mma_si pipeline (no PV) 0.999999 PASS
test_softmax_only.py Q@K^T + softmax packing, output S 0.52 S overwritten by P (expected)
test_mma_si_pv.py Q@K^T + softmax + P@V (V MN-major) 0.01 PV output garbage
test_stage_b_v7.py Q@K^T + C-fragment softmax (V=K, wrong major) -0.02 wrong major + P packing
test_stage_b_v20.py Q@K^T + softmax (V=K, PipelineTmaStore bug) N/A compile error

Critical APIs & Lessons

TMEM offset arithmetic

  • find_tmem_tensor_col_offset(fragment) — returns physical TMEM column count
  • QK accumulator: 128 TMEM columns
  • A-fragment offset: acc_dtype.width // q_dtype.width * tmem_p0_offset (F32/BF16=2)

pv_mma_tiler — FMHA Convention

pv_mma_tiler = (qk_mma_tiler[0], qk_mma_tiler[2], qk_mma_tiler[1])
# = (M, head_dim, QK_N) = (128, 64, 128) for head_dim=64

make_trivial_tiled_mma — Use New Overload

make_trivial_tiled_mma(a_dtype, b_dtype, a_leading_mode, b_leading_mode,
                        acc_dtype, cta_group, mma_tiler_mn, a_source=SMEM)

3D tensors required

Tensors must be 3D (M, K, L) for cute.local_tile — add L=1 dimension.

Other APIs

  1. cutlass_torch.from_dlpack(t).mark_layout_dynamic(leading_dim=...) — CuTe tensor from PyTorch
  2. PipelineTmaUmma.create(...).make_participants() — returns (producer, consumer) pair
  3. utils.gemm.sm100.epilogue_tma_store — handles transform + partition/dcopy. DO NOT hand-roll.
  4. smem.allocate_tensor() — for SMEM tensors
  5. LayoutEnum.from_tensor(a).mma_major_mode() — major mode from cute tensor

Environment

  • Server: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
  • venv: source /root/dsv4-nvfp4-workspace/venv/bin/activate
  • PYTHONPATH: /root/dsv4-nvfp4-workspace/kernel
  • Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
  • vLLM repo: /root/dsv4-nvfp4-workspace/vllm (modified for Blackwell)
  • Pseudocode: /root/fragile-kernel-example/README.md
  • fmha.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py
  • fmha_bwd.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py
Description
No description provided
Readme 13 MiB
Languages
Python 74.9%
Cuda 25%