biondizzle 97656a5cd1 Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong
Key fixes:
- PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr

Status:
- Stage A: cosine 0.999999 
- Stage B: runs without crash, identity softmax cosine -0.02 
- Diagnostics: TMEM layout inspection, bisection results
2026-05-20 20:26:25 +00:00
2026-05-19 09:37:38 +00:00

DeepSeek-V4 NVFP4 Kernel Suite

CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use cutlass.cute (CuTeDSL) with Blackwell tensor cores.

File Map

cutedsl/
├── native_swa_decode.py         # SWA decode attention — IN PROGRESS (v3 tcgen05 rewrite)
├── native_sparse_decode.py      # Sparse (CSA/HCA) decode — NOT YET REWRITTEN
├── nvfp4_cutedsl.py             # NVFP4 MoE runner (CuTeDSL) — WORKING
├── moe_pipeline.py              # MoE fused SwiGLU pipeline — WORKING
├── blackwell_attention.py       # vLLM bridge for Blackwell attention path
├── csa_attention.py             # CSA/HCA sparse attention bridge
├── custom_ops.py                # Custom CUDA ops registration
└── kernel/
    └── blockscaled_gemm/
        └── dense_blockscaled_gemm_persistent.py  # REFERENCE: Blackwell TMEM/tcgen05 GEMM

tests/
├── test_stage_a_v2.py           # ✅ Stage A: bare Q@K^T via tcgen05.mma → TMEM → GMEM
├── test_stage_b_v7.py           # 🔨 Stage B: two MMAs + identity softmax (runs, wrong output)
├── test_stage_b_minimal.py      # ✅ Stage B minimal: two MMAs, no softmax (runs, NaN expected)
├── test_stage_b_pipeline_only.py # ✅ Stage B pipeline-only: PipelineUmmaAsync, no ld/st (runs, NaN expected)
├── diag_tmem.py                 # Diagnostic: TMEM layout inspection
├── test_stage_b_v6.py           # ❌ Stage B v6 (hardcoded offsets, crashes)
├── test_stage_a_qk.py           # ❌ Stage A v1 (broken, superseded by v2)
├── test_stage_a_minimal.py      # ❌ Stage A minimal (broken, superseded by v2)
├── test_attention_path_b200.py  # Full attention path test (uses naive BF16 attn)
└── ...

Current Status

Stage A: Bare Q@K^T via tcgen05.mma — COMPLETE (May 20)

File: tests/test_stage_a_v2.py Result: Q(128,128) @ K^T(128,128) → S(128,128), cosine 0.999999

Validates the full tcgen05.mma → TMEM → epilogue → GMEM path:

  • tcgen05.mma with BF16 inputs, FP32 TMEM accumulator
  • TMA load for A and B (cute.nvgpu.make_tiled_tma_atom_A/B)
  • TMA store for C (cpasync.CopyBulkTensorTileS2GOp)
  • Warp specialization: 4 epilogue warps + 1 MMA warp + 1 TMA warp = 192 threads
  • PipelineTmaUmma for AB pipeline, PipelineUmmaAsync for acc pipeline
  • TmemAllocator for TMEM allocation/deallocation
  • utils.gemm.sm100.epilogue_tma_store for the TMEM→reg→SMEM→TMA→GMEM epilogue

🔨 Stage B: Two MMAs + Identity Softmax — IN PROGRESS (May 20)

Latest: tests/test_stage_b_v7.py Status: Kernel compiles and runs without crashing. Identity softmax produces wrong output (cosine ≈ -0.02).

What was fixed today:

  1. PipelineUmmaAsync consumer group size crash (THE bug): PipelineUmmaAsync with Agent.Thread requires thread count (128), NOT warp count (4), for the consumer group. fmha.py uses 32 * len(softmax_warp_ids) = 128. Using 4 caused CUDA_ERROR_LAUNCH_FAILED (not a deadlock — the barrier reached wrong threshold causing illegal TMEM access).

    # WRONG (caused CUDA_ERROR_LAUNCH_FAILED):
    consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 4)
    # CORRECT (matches fmha.py):
    consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 128)
    
  2. TMEM offset computation (no more hardcoding):

    • s_cols = find_tmem_tensor_col_offset(tStS) = 128 — QK accumulator physical TMEM columns
    • o_cols = find_tmem_tensor_col_offset(tOtO) = 128 — PV accumulator physical TMEM columns
    • tmem_s0_offset = 0, tmem_p0_offset = 32, tmem_o0_offset = 128 — matches fmha.py
    • find_tmem_tensor_col_offset(tOrP_sliced) = 32800 = 0x8020 — 0x8000 is TMEM space tag, column offset = 32
    • Total: 256 TMEM cols (verified by get_num_tmem_alloc_cols)
  3. P fragment construction (matching fmha.py):

    tP = cute.make_tensor(tStS.iterator, p_tmem_s.outer)  # A-layout from PV MMA
    tOrP = pv_thr.make_fragment_A(tP)[None, None, None, 0]
    tOrP0 = cute.make_tensor(tOrP.iterator + 2 * tmem_p0_offset, tOrP.layout)
    

    Previously used cute.composition on C-layout — wrong, must use PV MMA's A-layout.

  4. V SMEM aliasing: V shares the same SMEM as K with a different layout interpretation:

    sV_ptr = cute.recast_ptr(sB.iterator, v_smem_s.inner)
    sV = cute.make_tensor(sV_ptr, v_smem_s.outer)
    tCrV = pv_mma.make_fragment_B(sV)  # Uses MN-major V layout
    

What's still broken:

The identity softmax C→A layout transform produces garbage output (cosine ≈ -0.02). The kernel runs, Stage A (Q@K^T) gives cosine 0.999999, but the full (Q@K^T)@V pipeline is wrong. The issue is in the tcgen05.ld/st identity softmax path — either the ld/st copy atoms, the register conversion, or the A-layout write positions are incorrect.

Bisection results:

  • Stage B minimal (no pipeline, no softmax): runs, NaN (expected — no C→A transform)
  • Stage B pipeline-only (PipelineUmmaAsync, no ld/st): runs, NaN (expected)
  • 🔨 Stage B full (identity softmax): runs, cosine -0.02 (wrong — softmax transform is broken)
  • All three crash with consumer_group=4, all run with consumer_group=128

TMEM layout diagnostic data:

QK accumulator C fragment:
  tStS.layout = ((128,128),1,1):((65536,1),0,0)
  cute.size = 16384, cute.cosize = 8323200
  find_tmem_tensor_col_offset = 128

PV A-fragment (P operand):
  tOrP_sliced.layout = ((128,16),1,4):((65536,1),0,16)
  cute.size = 8192, cute.cosize = 8323136
  find_tmem_tensor_col_offset = 32800 = 0x8020 (0x8000 tag + col 32)

🔨 Stage C: Online Softmax — AFTER B

The hard part. Per the pseudocode:

  • Epilogue warps tcgen05.ld scores from TMEM into register fragments
  • Compute per-row: tile_max, new_max, rescale = exp(old_max - new_max)
  • Apply rescale to tmem_output in place (tmem_output *= rescale)
  • Compute exp(scores - new_max), tcgen05.st back to TMEM as P operand for MMA2
  • Update row_sum = row_sum * rescale + new_tile_sum

The register fragment layout from tcgen05.ld is NOT (row, col). It's determined by the MMA instruction's partition of the accumulator. Need to figure out the mapping from fragment indices to logical (head, kv_pos) positions for per-row softmax operations. fmha.py uses tTMEM_LOADrS.load().reduce(cute.ReductionOp.MAX, row_max, 0) for the row max — a built-in reduction that handles the layout.

🔨 Stage D: FP8 Paged KV Gather — AFTER C

Replace BF16 TMA load of KV with:

  • Indexed cp.async gather from paged KV cache (fp8)
  • Per-position dequant scale (inv_scale) applied during or after gather
  • Keep KV in fp8 in SMEM, let the MMA's per-row scale handle dequant (like blockscaled GEMM)

Architecture: Per-Tile Flow (from /root/fragile-kernel-example/README.md)

For each KV tile:
  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
     Signals scores_full_mbar (via PipelineUmmaAsync commit)
  3. Epilogue warps wait on mma_si consumer (scores ready), then:
     a. tcgen05.ld scores from TMEM → register fragments
     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
     d. tcgen05.st exp(scores - new_max) back to TMEM → now it's the P operand
     e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
  4. MMA warp waits on mma_si acquire (softmax done), then MMA2: P @ sKV[stage] → tmem_output (accumulate=True)
  5. Stage released, load warp can refill it

After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM

NVFP4 MoE (CuTeDSL) — WORKING

  • nvfp4_cutedsl.py + moe_pipeline.py
  • CuTeDSL NVFP4 Linear (q_a, kv, q_b, o_b) — cosine 0.994+
  • CuTeDSL NVFP4 MoE (L1 gate+up, SiLU, L2 down) — cosine 0.988
  • Fused SwiGLU epilogue (granularity-8 weight interleave) — cosine 0.988

FP8 KV Quantize/Dequant — WORKING

  • FP8 KV: cosine 0.9997
  • NVFP4 KV: cosine 0.9943 (2x smaller than FP8)
  • Paged KV cache read/write: cosine 1.0

Sparse Decode Attention — NOT YET REWRITTEN

native_sparse_decode.py still has the scalar FMA bug. Needs the same tcgen05.mma rewrite.

Full Attention Pipeline (standalone tests) — WORKING

  • FP8 KV → full attention: cosine 0.9997
  • CSA sparse attention (cr=4): works
  • HCA sparse attention (cr=128): works
  • Merged CSA+SWA attention: works

Critical APIs & Lessons

PipelineUmmaAsync consumer group size — THE MAY 20 BUG

For Agent.Thread groups in PipelineUmmaAsync: use thread count, NOT warp count.

# WRONG (caused CUDA_ERROR_LAUNCH_FAILED):
consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 4)  # warp count

# CORRECT (matches fmha.py):
consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 32 * len(softmax_warp_ids))  # thread count

This applies to ALL PipelineUmmaAsync consumers where the consumer is multiple warps. fmha.py line 671: self.threads_per_warp * len(self.softmax0_warp_ids) = 32 * 4 = 128.

Note: The earlier README incorrectly stated that warp count was correct. That was wrong. The Agent.Thread agent type measures group size in threads.

TMEM offset arithmetic

  • find_tmem_tensor_col_offset(fragment) — returns physical TMEM column count (with 0x8000 tag for A-fragments)
  • QK accumulator C fragment: 128 TMEM columns
  • PV A-fragment: offset 0x8020 = tag(0x8000) + col(32) — the 0x8000 is a TMEM memory-space identifier
  • P OVERLAPS S in TMEM — P is written at column 32 within the S region (C-layout columns 0..127)
  • tOrP0 = cute.make_tensor(tOrP.iterator + acc_dtype.width // q_dtype.width * tmem_p0_offset, tOrP.layout) — A-fragment offset scaled by dtype width ratio (F32/BF16 = 2)

make_trivial_tiled_mma has two overloads

# New (preferred):
make_trivial_tiled_mma(a_dtype, b_dtype, a_leading_mode, b_leading_mode,
                        acc_dtype, cta_group, mma_tiler_mn, a_source=SMEM)

# Deprecated (still works, used by Stage A):
make_trivial_tiled_mma(ab_dtype, a_leading_mode, b_leading_mode,
                        acc_dtype, cta_group, mma_tiler_mn, a_source=SMEM)

V SMEM aliasing (K and V share SMEM)

# K and V share the same SMEM buffer, but with different layouts:
v_smem_s = utils.sm100.make_smem_layout_b(pv_mma, pv_mma_tiler, b_dtype, 1)
sV_ptr = cute.recast_ptr(sB.iterator, v_smem_s.inner)
sV = cute.make_tensor(sV_ptr, v_smem_s.outer)
tCrV = pv_mma.make_fragment_B(sV)

Other APIs discovered from Stage A

  1. cute.Tensor APIcutlass_torch.from_dlpack(t).mark_layout_dynamic(leading_dim=...)
  2. 3D tensors — Tensors must be 3D (M, K, L) for cute.local_tile — add L=1 dimension
  3. PipelineTmaUmma.create(...).make_participants() — returns (producer, consumer) pair
  4. utils.gemm.sm100.epilogue_tma_store — handles transform + partition/dcopy. DO NOT hand-roll.
  5. get_num_tmem_alloc_cols — correct TMEM allocation (accepts list of fragments, sums cols, rounds to power of 2)
  6. smem.allocate_tensor() — for SMEM tensors (not SharedStorage struct for A/B/C)
  7. LayoutEnum.from_tensor(a).mma_major_mode() — major mode from cute tensor
  8. Minimum valid N tile for tcgen05.mma BF16: 32 (step 32, range 32-256)

Environment

  • Server: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
  • venv: source /root/dsv4-nvfp4-workspace/venv/bin/activate
  • PYTHONPATH: /root/dsv4-nvfp4-workspace/kernel
  • Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
  • vLLM repo: /root/dsv4-nvfp4-workspace/vllm (modified for Blackwell)
  • Pseudocode: /root/fragile-kernel-example/README.md — authoritative per-tile attention flow
  • fmha.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py

4-Stage Build Plan

Stage Goal Status
A Bare Q@K^T via tcgen05.mma → TMEM → GMEM COMPLETE
B Two MMAs + identity softmax (validates TMEM A operand, shared KV, layout transform, barrier ordering) 🔨 Runs without crash, identity softmax produces wrong output
C Online softmax between MMA1 and MMA2 (the hard part) TODO
D FP8 paged KV gather + dequant (replace BF16 TMA load) TODO
Description
No description provided
Readme 13 MiB
Languages
Python 74.9%
Cuda 25%