Go to file

biondizzle 97656a5cd1 Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong

Key fixes:
- PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr

Status:
- Stage A: cosine 0.999999 ✅
- Stage B: runs without crash, identity softmax cosine -0.02 ❌
- Diagnostics: TMEM layout inspection, bisection results

2026-05-20 20:26:25 +00:00

cutedsl

Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong

2026-05-20 20:26:25 +00:00

cutedsl_loader

fix: dynamic buffer sizing in nvfp4_linear for varying token counts

2026-05-19 23:59:55 +00:00

reference

docs: add CuTeDSL rewrite plan + reference files

2026-05-16 02:41:51 +00:00

tests

Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong

2026-05-20 20:26:25 +00:00

.dockerignore

Patch utils.py the standard way: copy modified file into Docker image

2026-05-18 19:10:08 +00:00

.gitignore

feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

2026-05-15 11:38:18 +00:00

build_and_run.sh

Revert "feat: auto-warmup in build_and_run.sh"

2026-05-16 06:14:28 +00:00

docker-compose.yml

Reduce max_model_len to 256

2026-05-19 09:37:38 +00:00

Dockerfile

nuke vllm because this keep confusing people

2026-05-19 23:04:36 +00:00

pyproject.toml

fix: dynamic buffer sizing in nvfp4_linear for varying token counts

2026-05-19 23:59:55 +00:00

README.md

Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong

2026-05-20 20:26:25 +00:00

README.md

DeepSeek-V4 NVFP4 Kernel Suite

CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use cutlass.cute (CuTeDSL) with Blackwell tensor cores.

File Map

cutedsl/
├── native_swa_decode.py         # SWA decode attention — IN PROGRESS (v3 tcgen05 rewrite)
├── native_sparse_decode.py      # Sparse (CSA/HCA) decode — NOT YET REWRITTEN
├── nvfp4_cutedsl.py             # NVFP4 MoE runner (CuTeDSL) — WORKING
├── moe_pipeline.py              # MoE fused SwiGLU pipeline — WORKING
├── blackwell_attention.py       # vLLM bridge for Blackwell attention path
├── csa_attention.py             # CSA/HCA sparse attention bridge
├── custom_ops.py                # Custom CUDA ops registration
└── kernel/
    └── blockscaled_gemm/
        └── dense_blockscaled_gemm_persistent.py  # REFERENCE: Blackwell TMEM/tcgen05 GEMM

tests/
├── test_stage_a_v2.py           # ✅ Stage A: bare Q@K^T via tcgen05.mma → TMEM → GMEM
├── test_stage_b_v7.py           # 🔨 Stage B: two MMAs + identity softmax (runs, wrong output)
├── test_stage_b_minimal.py      # ✅ Stage B minimal: two MMAs, no softmax (runs, NaN expected)
├── test_stage_b_pipeline_only.py # ✅ Stage B pipeline-only: PipelineUmmaAsync, no ld/st (runs, NaN expected)
├── diag_tmem.py                 # Diagnostic: TMEM layout inspection
├── test_stage_b_v6.py           # ❌ Stage B v6 (hardcoded offsets, crashes)
├── test_stage_a_qk.py           # ❌ Stage A v1 (broken, superseded by v2)
├── test_stage_a_minimal.py      # ❌ Stage A minimal (broken, superseded by v2)
├── test_attention_path_b200.py  # Full attention path test (uses naive BF16 attn)
└── ...

Current Status

✅ Stage A: Bare Q@K^T via tcgen05.mma — COMPLETE (May 20)

File: tests/test_stage_a_v2.py Result: Q(128,128) @ K^T(128,128) → S(128,128), cosine 0.999999

Validates the full tcgen05.mma → TMEM → epilogue → GMEM path:

tcgen05.mma with BF16 inputs, FP32 TMEM accumulator
TMA load for A and B (cute.nvgpu.make_tiled_tma_atom_A/B)
TMA store for C (cpasync.CopyBulkTensorTileS2GOp)
Warp specialization: 4 epilogue warps + 1 MMA warp + 1 TMA warp = 192 threads
PipelineTmaUmma for AB pipeline, PipelineUmmaAsync for acc pipeline
TmemAllocator for TMEM allocation/deallocation
utils.gemm.sm100.epilogue_tma_store for the TMEM→reg→SMEM→TMA→GMEM epilogue

🔨 Stage B: Two MMAs + Identity Softmax — IN PROGRESS (May 20)

Latest: tests/test_stage_b_v7.py Status: Kernel compiles and runs without crashing. Identity softmax produces wrong output (cosine ≈ -0.02).

What was fixed today:

PipelineUmmaAsync consumer group size crash (THE bug): PipelineUmmaAsync with Agent.Thread requires thread count (128), NOT warp count (4), for the consumer group. fmha.py uses 32 * len(softmax_warp_ids) = 128. Using 4 caused CUDA_ERROR_LAUNCH_FAILED (not a deadlock — the barrier reached wrong threshold causing illegal TMEM access).
```
# WRONG (caused CUDA_ERROR_LAUNCH_FAILED):
consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 4)
# CORRECT (matches fmha.py):
consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 128)
```
TMEM offset computation (no more hardcoding):
- s_cols = find_tmem_tensor_col_offset(tStS) = 128 — QK accumulator physical TMEM columns
- o_cols = find_tmem_tensor_col_offset(tOtO) = 128 — PV accumulator physical TMEM columns
- tmem_s0_offset = 0, tmem_p0_offset = 32, tmem_o0_offset = 128 — matches fmha.py
- find_tmem_tensor_col_offset(tOrP_sliced) = 32800 = 0x8020 — 0x8000 is TMEM space tag, column offset = 32
- Total: 256 TMEM cols (verified by get_num_tmem_alloc_cols)

P fragment construction (matching fmha.py):

tP = cute.make_tensor(tStS.iterator, p_tmem_s.outer)  # A-layout from PV MMA
tOrP = pv_thr.make_fragment_A(tP)[None, None, None, 0]
tOrP0 = cute.make_tensor(tOrP.iterator + 2 * tmem_p0_offset, tOrP.layout)

Previously used cute.composition on C-layout — wrong, must use PV MMA's A-layout.

V SMEM aliasing: V shares the same SMEM as K with a different layout interpretation:

sV_ptr = cute.recast_ptr(sB.iterator, v_smem_s.inner)
sV = cute.make_tensor(sV_ptr, v_smem_s.outer)
tCrV = pv_mma.make_fragment_B(sV)  # Uses MN-major V layout

What's still broken:

The identity softmax C→A layout transform produces garbage output (cosine ≈ -0.02). The kernel runs, Stage A (Q@K^T) gives cosine 0.999999, but the full (Q@K^T)@V pipeline is wrong. The issue is in the tcgen05.ld/st identity softmax path — either the ld/st copy atoms, the register conversion, or the A-layout write positions are incorrect.

Bisection results:

✅ Stage B minimal (no pipeline, no softmax): runs, NaN (expected — no C→A transform)
✅ Stage B pipeline-only (PipelineUmmaAsync, no ld/st): runs, NaN (expected)
🔨 Stage B full (identity softmax): runs, cosine -0.02 (wrong — softmax transform is broken)
All three crash with consumer_group=4, all run with consumer_group=128

TMEM layout diagnostic data:

QK accumulator C fragment:
  tStS.layout = ((128,128),1,1):((65536,1),0,0)
  cute.size = 16384, cute.cosize = 8323200
  find_tmem_tensor_col_offset = 128

PV A-fragment (P operand):
  tOrP_sliced.layout = ((128,16),1,4):((65536,1),0,16)
  cute.size = 8192, cute.cosize = 8323136
  find_tmem_tensor_col_offset = 32800 = 0x8020 (0x8000 tag + col 32)

🔨 Stage C: Online Softmax — AFTER B

The hard part. Per the pseudocode:

Epilogue warps tcgen05.ld scores from TMEM into register fragments
Compute per-row: tile_max, new_max, rescale = exp(old_max - new_max)
Apply rescale to tmem_output in place (tmem_output *= rescale)
Compute exp(scores - new_max), tcgen05.st back to TMEM as P operand for MMA2
Update row_sum = row_sum * rescale + new_tile_sum

The register fragment layout from tcgen05.ld is NOT (row, col). It's determined by the MMA instruction's partition of the accumulator. Need to figure out the mapping from fragment indices to logical (head, kv_pos) positions for per-row softmax operations. fmha.py uses tTMEM_LOADrS.load().reduce(cute.ReductionOp.MAX, row_max, 0) for the row max — a built-in reduction that handles the layout.

🔨 Stage D: FP8 Paged KV Gather — AFTER C

Replace BF16 TMA load of KV with:

Indexed cp.async gather from paged KV cache (fp8)
Per-position dequant scale (inv_scale) applied during or after gather
Keep KV in fp8 in SMEM, let the MMA's per-row scale handle dequant (like blockscaled GEMM)

Architecture: Per-Tile Flow (from /root/fragile-kernel-example/README.md)

For each KV tile:
  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
     Signals scores_full_mbar (via PipelineUmmaAsync commit)
  3. Epilogue warps wait on mma_si consumer (scores ready), then:
     a. tcgen05.ld scores from TMEM → register fragments
     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
     d. tcgen05.st exp(scores - new_max) back to TMEM → now it's the P operand
     e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
  4. MMA warp waits on mma_si acquire (softmax done), then MMA2: P @ sKV[stage] → tmem_output (accumulate=True)
  5. Stage released, load warp can refill it

After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM

✅ NVFP4 MoE (CuTeDSL) — WORKING

nvfp4_cutedsl.py + moe_pipeline.py
CuTeDSL NVFP4 Linear (q_a, kv, q_b, o_b) — cosine 0.994+
CuTeDSL NVFP4 MoE (L1 gate+up, SiLU, L2 down) — cosine 0.988
Fused SwiGLU epilogue (granularity-8 weight interleave) — cosine 0.988

✅ FP8 KV Quantize/Dequant — WORKING

FP8 KV: cosine 0.9997
NVFP4 KV: cosine 0.9943 (2x smaller than FP8)
Paged KV cache read/write: cosine 1.0

❌ Sparse Decode Attention — NOT YET REWRITTEN

native_sparse_decode.py still has the scalar FMA bug. Needs the same tcgen05.mma rewrite.

✅ Full Attention Pipeline (standalone tests) — WORKING

FP8 KV → full attention: cosine 0.9997
CSA sparse attention (cr=4): works
HCA sparse attention (cr=128): works
Merged CSA+SWA attention: works

Critical APIs & Lessons

PipelineUmmaAsync consumer group size — THE MAY 20 BUG

For Agent.Thread groups in PipelineUmmaAsync: use thread count, NOT warp count.

# WRONG (caused CUDA_ERROR_LAUNCH_FAILED):
consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 4)  # warp count

# CORRECT (matches fmha.py):
consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 32 * len(softmax_warp_ids))  # thread count

This applies to ALL PipelineUmmaAsync consumers where the consumer is multiple warps. fmha.py line 671: self.threads_per_warp * len(self.softmax0_warp_ids) = 32 * 4 = 128.

Note: The earlier README incorrectly stated that warp count was correct. That was wrong. The Agent.Thread agent type measures group size in threads.

TMEM offset arithmetic

find_tmem_tensor_col_offset(fragment) — returns physical TMEM column count (with 0x8000 tag for A-fragments)
QK accumulator C fragment: 128 TMEM columns
PV A-fragment: offset 0x8020 = tag(0x8000) + col(32) — the 0x8000 is a TMEM memory-space identifier
P OVERLAPS S in TMEM — P is written at column 32 within the S region (C-layout columns 0..127)
tOrP0 = cute.make_tensor(tOrP.iterator + acc_dtype.width // q_dtype.width * tmem_p0_offset, tOrP.layout) — A-fragment offset scaled by dtype width ratio (F32/BF16 = 2)

`make_trivial_tiled_mma` has two overloads

# New (preferred):
make_trivial_tiled_mma(a_dtype, b_dtype, a_leading_mode, b_leading_mode,
                        acc_dtype, cta_group, mma_tiler_mn, a_source=SMEM)

# Deprecated (still works, used by Stage A):
make_trivial_tiled_mma(ab_dtype, a_leading_mode, b_leading_mode,
                        acc_dtype, cta_group, mma_tiler_mn, a_source=SMEM)

# K and V share the same SMEM buffer, but with different layouts:
v_smem_s = utils.sm100.make_smem_layout_b(pv_mma, pv_mma_tiler, b_dtype, 1)
sV_ptr = cute.recast_ptr(sB.iterator, v_smem_s.inner)
sV = cute.make_tensor(sV_ptr, v_smem_s.outer)
tCrV = pv_mma.make_fragment_B(sV)

Other APIs discovered from Stage A

cute.Tensor API — cutlass_torch.from_dlpack(t).mark_layout_dynamic(leading_dim=...)
3D tensors — Tensors must be 3D (M, K, L) for cute.local_tile — add L=1 dimension
PipelineTmaUmma.create(...).make_participants() — returns (producer, consumer) pair
utils.gemm.sm100.epilogue_tma_store — handles transform + partition/dcopy. DO NOT hand-roll.
get_num_tmem_alloc_cols — correct TMEM allocation (accepts list of fragments, sums cols, rounds to power of 2)
smem.allocate_tensor() — for SMEM tensors (not SharedStorage struct for A/B/C)
LayoutEnum.from_tensor(a).mma_major_mode() — major mode from cute tensor
Minimum valid N tile for tcgen05.mma BF16: 32 (step 32, range 32-256)

Environment

Server: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
venv: source /root/dsv4-nvfp4-workspace/venv/bin/activate
PYTHONPATH: /root/dsv4-nvfp4-workspace/kernel
Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
vLLM repo: /root/dsv4-nvfp4-workspace/vllm (modified for Blackwell)
Pseudocode: /root/fragile-kernel-example/README.md — authoritative per-tile attention flow
fmha.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py

4-Stage Build Plan

Stage	Goal	Status
A	Bare Q@K^T via tcgen05.mma → TMEM → GMEM	✅ COMPLETE
B	Two MMAs + identity softmax (validates TMEM A operand, shared KV, layout transform, barrier ordering)	🔨 Runs without crash, identity softmax produces wrong output
C	Online softmax between MMA1 and MMA2 (the hard part)	⬜ TODO
D	FP8 paged KV gather + dequant (replace BF16 TMA load)	⬜ TODO

README.md

DeepSeek-V4 NVFP4 Kernel Suite

File Map

Current Status

✅ Stage A: Bare Q@K^T via tcgen05.mma — COMPLETE (May 20)

🔨 Stage B: Two MMAs + Identity Softmax — IN PROGRESS (May 20)

🔨 Stage C: Online Softmax — AFTER B

🔨 Stage D: FP8 Paged KV Gather — AFTER C

Architecture: Per-Tile Flow (from /root/fragile-kernel-example/README.md)

✅ NVFP4 MoE (CuTeDSL) — WORKING

✅ FP8 KV Quantize/Dequant — WORKING

❌ Sparse Decode Attention — NOT YET REWRITTEN

✅ Full Attention Pipeline (standalone tests) — WORKING

Critical APIs & Lessons

PipelineUmmaAsync consumer group size — THE MAY 20 BUG

TMEM offset arithmetic

make_trivial_tiled_mma has two overloads

V SMEM aliasing (K and V share SMEM)

Other APIs discovered from Stage A

Environment

4-Stage Build Plan

`make_trivial_tiled_mma` has two overloads