
biondizzle 467ade37b2 Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)

2026-05-21 00:12:47 +00:00

cutedsl

Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong

2026-05-20 20:26:25 +00:00

cutedsl_loader

fix: dynamic buffer sizing in nvfp4_linear for varying token counts

2026-05-19 23:59:55 +00:00

reference

docs: add CuTeDSL rewrite plan + reference files

2026-05-16 02:41:51 +00:00

tests

Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

2026-05-21 00:12:47 +00:00

.dockerignore

Patch utils.py the standard way: copy modified file into Docker image

2026-05-18 19:10:08 +00:00

.gitignore

feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

2026-05-15 11:38:18 +00:00

build_and_run.sh

Revert "feat: auto-warmup in build_and_run.sh"

2026-05-16 06:14:28 +00:00

docker-compose.yml

Reduce max_model_len to 256

2026-05-19 09:37:38 +00:00

Dockerfile

nuke vllm because this keep confusing people

2026-05-19 23:04:36 +00:00

pyproject.toml

fix: dynamic buffer sizing in nvfp4_linear for varying token counts

2026-05-19 23:59:55 +00:00

README.md

Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed

2026-05-21 00:12:47 +00:00

README.md

DeepSeek-V4 NVFP4 Kernel Suite

CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use cutlass.cute (CuTeDSL) with Blackwell tensor cores.

File Map

cutedsl/
├── native_swa_decode.py         # SWA decode attention — IN PROGRESS (v3 tcgen05 rewrite)
├── native_sparse_decode.py      # Sparse (CSA/HCA) decode — NOT YET REWRITTEN
├── nvfp4_cutedsl.py             # NVFP4 MoE runner (CuTeDSL) — WORKING
├── moe_pipeline.py              # MoE fused SwiGLU pipeline — WORKING
├── blackwell_attention.py       # vLLM bridge for Blackwell attention path
├── csa_attention.py             # CSA/HCA sparse attention bridge
├── custom_ops.py                # Custom CUDA ops registration
└── kernel/
    └── blockscaled_gemm/
        └── dense_blockscaled_gemm_persistent.py  # REFERENCE: Blackwell TMEM/tcgen05 GEMM

tests/
├── test_stage_a_v2.py                # ✅ Stage A: bare Q@K^T via tcgen05.mma → TMEM → GMEM
├── test_stage_b_v7.py                # 🔨 Stage B: two MMAs + C-fragment softmax (runs, wrong output)
├── test_stage_b_afrag2.py            # 🔨 Stage B: A-fragment store pattern (compiles, wrong output)
├── test_tmem_pure_fp32.py            # ✅ FP32 ld→st roundtrip on C-fragment: cosine 0.999999
├── test_bf16_elemwise.py             # ✅ FP32→BF16→FP32 elemwise + FP32 st: cosine 0.999999
├── test_recast_minimal.py            # ✅ BF16 recast ld S0→st S1 via C-fragment: cosine 0.999999
├── test_bf16_recast_simple.py        # ❌ BF16 recast ld/st same region (S0): zero (can't overwrite MMA output)
├── test_tmem_copy_roundtrip.py       # ❌ BF16 recast + C→A mismatch: zero
├── test_stage_b_final.py             # ❌ C-fragment st + A-fragment read: NaN (physical layout mismatch)
├── test_afrag_roundtrip.py           # ❌ A-frag st corrupts S0 (overlapping TMEM region)
├── diag_tmem.py                      # Diagnostic: TMEM layout inspection
└── ...

Current Status

✅ Stage A: Bare Q@K^T via tcgen05.mma — COMPLETE (May 20)

File: tests/test_stage_a_v2.py
Result: Q(128,128) @ K^T(128,128) → S(128,128), cosine 0.999999

Validates the full tcgen05.mma → TMEM → epilogue → GMEM path:

tcgen05.mma with BF16 inputs, FP32 TMEM accumulator
TMA load for A and B (cute.nvgpu.make_tiled_tma_atom_A/B)
TMA store for C (cpasync.CopyBulkTensorTileS2GOp)
Warp specialization: 4 epilogue warps + 1 MMA warp + 1 TMA warp = 192 threads
PipelineTmaUmma for AB pipeline, PipelineUmmaAsync for acc pipeline
TmemAllocator for TMEM allocation/deallocation
utils.gemm.sm100.epilogue_tma_store for the TMEM→reg→SMEM→TMA→GMEM epilogue

🔨 Stage B: Two MMAs + Identity Softmax — IN PROGRESS (May 20-21)

Core Problem: The C-fragment (MMA accumulator) and A-fragment (MMA A-operand from TMEM) use different physical TMEM address mappings for the same logical (M,K) position. The softmax writes P via one mapping, but the PV MMA reads via the other. This produces garbage.

What's Been Proven

Test	Pattern	Result	Why
test_tmem_pure_fp32	FP32 ld→st, same C-fragment layout	✅ cos=0.999999	C-fragment addresses self-consistent
test_bf16_elemwise	FP32→BF16→FP32 elemwise, C-fragment st	✅ cos=0.999999	BF16 conversion works, C-fragment st works
test_recast_minimal	BF16 recast ld S0→st S1, C-fragment	✅ cos=0.999999	Recast works when writing to different region
test_bf16_recast_simple	BF16 recast ld/st same region S0	❌ zero	Can't overwrite MMA output in same region
test_stage_b_final	C-fragment st → A-fragment read (S1)	❌ NaN	C-layout ≠ A-layout physical addresses
test_stage_b_afrag2	A-fragment st (backward FMHA pattern)	❌ cos=-0.02	Store + PV MMA layout compatible, but register data flow wrong
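
The cos values in the table compare a kernel output against a reference. The sketch below shows one plausible way to compute that metric; it assumes plain cosine similarity over the flattened tensors, and `cosine_similarity` is a hypothetical helper, not a function from the test files. Returning 0.0 for an all-zero output is a choice made here so that the "zero" failures in the table stay distinguishable from NaN failures.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # All-zero output: report 0.0 instead of dividing by zero.
        return 0.0
    # A NaN anywhere in the inputs propagates into the result.
    return dot / (norm_a * norm_b)
```

With this definition a correctly scaled output scores close to 1, an all-zero output scores 0, and any NaN in the output makes the score NaN.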

Root Cause: C-fragment vs A-fragment Physical TMEM Layout

From the CUTLASS source (mma_traits_sm100.hpp):

C-fragment (MMA accumulator, FP32):

Layout: ((128,128),1,1):((65536,1),0,0) — virtual layout
Physical TMEM addresses determined by the MMA hardware's accumulator write path
St32x32bOp with C-fragment layout writes to C-fragment physical addresses

A-fragment (MMA A-operand from TMEM, BF16, K-major, M=128):

Layout: ((128,16),1,4):((65536,1),0,16) — physical TMEM layout
A[m, k] with k = 16*mma_k + k_inner → tmem[dp=m, col=base + 16*mma_k + k_inner]
BK=64 is covered by 4 MMA atoms of K=16 each, NOT by one K=64 atom
The 4D fragment partition order is NOT the physical TMEM order

The St32x32bOp with C-fragment composition writes to C-layout physical addresses. The PV MMA reads from A-layout physical addresses. These are different physical locations.
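
As a rough illustration of the A-fragment mapping quoted above, the sketch below evaluates the layout ((128,16),1,4):((65536,1),0,16) in plain Python. It assumes that the stride 65536 (1 << 16) places the data path in the upper 16 bits of the offset and the column in the lower 16 bits. The helper names are hypothetical, and the sketch covers only the layout algebra, not the hardware's accumulator write path.

```python
def layout_offset(coord, stride):
    """Inner product of a (possibly nested) coordinate with a matching stride tuple."""
    if isinstance(coord, tuple):
        return sum(layout_offset(c, s) for c, s in zip(coord, stride))
    return coord * stride

def split_offset(offset):
    """Split an offset into (data path, column); assumes dp sits in the upper 16 bits."""
    return offset >> 16, offset & 0xFFFF

# Strides of the A-fragment layout ((128,16),1,4):((65536,1),0,16).
A_STRIDE = ((65536, 1), 0, 16)

def a_fragment_location(m, k):
    """Logical (m, k) of the 128x64 P tile -> (dp, col) relative to the fragment base."""
    mma_k, k_inner = divmod(k, 16)  # which K=16 atom, and the position inside it
    return split_offset(layout_offset(((m, k_inner), 0, mma_k), A_STRIDE))
```

Under these assumptions `a_fragment_location(5, 37)` lands on data path 5, column 37: atom 2 starts at column 32, and the element sits 5 columns into it.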

Forward FMHA's Approach (FP16 Only!)

Forward FMHA uses a recast pattern to pack 2×FP16 into 1×FP32 register, then St32x32bOp writes to a C-fragment composition subview. But forward FMHA explicitly rejects BF16:

if in_dtype not in {cutlass.Float8E4M3FN, cutlass.Float16}:
    raise ValueError("in_dtype must be Float8E4M3FN or Float16")

The recast softmax path is validated for FP16, NOT BF16. Our BF16 use is outside the tested path.

Backward FMHA's Approach (BF16 Supported)

Backward FMHA writes dV to TMEM using the A-fragment layout:

tdVrP_iter = cute.recast_ptr(tSTtST.iterator, dtype=self.element_dtype) — recast C-fragment iterator to BF16
tdVrP = cute.make_tensor(tdVrP_iter, tOrP.layout) — A-fragment layout, C-fragment base
tmem_store_atom = cute.make_copy_atom(St32x32bOp(Repetition(8)), self.element_dtype) — BF16 store atom
Quantize via make_rmem_tensor(input.shape, element_dtype) + .load()/.store(v.to(element_dtype)) — true BF16 register, NOT recast
Reshape: cute.make_tensor(rBf16.iterator, cute.make_layout(tStcS.shape)) — match store partition shape

This compiles and runs for us (no crash), but the output is still wrong (cosine -0.02). The remaining issue is the register layout mismatch:

Load partition (C-fragment): 128 FP32 values per thread (full 128×128 QK tile)
Store partition (A-fragment): 64 BF16 values per thread (128×64 P tile for PV MMA K=64)
The backward FMHA uses quantize() + reshape, but our element counts differ because the QK tile is 128×128 while P only needs 128×64
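
The element-count mismatch above can be checked with simple arithmetic. This sketch assumes the tile is split evenly across the 128 threads of the 4 epilogue warps, as in Stage A; `values_per_thread` is a hypothetical helper used only for illustration.

```python
def values_per_thread(tile_m, tile_n, num_threads):
    """Elements each thread holds when a tile is split evenly across threads."""
    total = tile_m * tile_n
    if total % num_threads:
        raise ValueError("tile does not divide evenly across threads")
    return total // num_threads

EPILOGUE_THREADS = 4 * 32  # assumption: 4 epilogue warps, as in Stage A

load_count = values_per_thread(128, 128, EPILOGUE_THREADS)  # QK tile, FP32 load
store_count = values_per_thread(128, 64, EPILOGUE_THREADS)  # P tile, BF16 store
```

Here `load_count` is 128 and `store_count` is 64, so half of each thread's loaded values have no slot in the store partition unless the columns are subselected or the tiler is changed.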

Next Steps for Stage B

Fix the register data flow — properly subselect the P-relevant 64 BF16 columns from the 128 FP32 load columns, or use the backward FMHA's PdO MMA tiler (M=128, N=64) instead of (M=128, N=128)
Verify A-fragment store roundtrip — write known BF16 values via A-fragment store, have PV MMA read them back via A-fragment, confirm the physical TMEM addresses match
Once data flow is correct, add online softmax (Stage C)

🔨 Stage C: Online Softmax — AFTER B

The hard part. Per the pseudocode:

Epilogue warps tcgen05.ld scores from TMEM into register fragments
Compute per-row: tile_max, new_max, rescale = exp(old_max - new_max)
Apply rescale to tmem_output in place (tmem_output *= rescale)
Compute exp(scores - new_max), tcgen05.st back to TMEM as P operand for MMA2
Update row_sum = row_sum * rescale + new_tile_sum

The register fragment layout from tcgen05.ld is NOT (row, col). It's determined by the MMA instruction's partition of the accumulator. Need to figure out the mapping from fragment indices to logical (head, kv_pos) positions for per-row softmax operations. fmha.py uses tTMEM_LOADrS.load().reduce(cute.ReductionOp.MAX, row_max, 0) for the row max — a built-in reduction that handles the layout.
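
The per-row update in the list above can be sketched for a single row in plain Python. This is a reference for the arithmetic only: it ignores the fragment layout, and `online_row_stats` is a hypothetical name. It assumes every tile contains at least one finite score.

```python
import math

def online_row_stats(score_tiles):
    """Running (row_max, row_sum) for one row, visiting its scores tile by tile."""
    row_max = -math.inf
    row_sum = 0.0
    for tile in score_tiles:
        new_max = max(row_max, max(tile))
        # exp(-inf) is 0.0, so the first tile simply discards the empty state.
        rescale = math.exp(row_max - new_max)
        new_tile_sum = sum(math.exp(s - new_max) for s in tile)
        row_sum = row_sum * rescale + new_tile_sum
        row_max = new_max
    return row_max, row_sum
```

After the last tile, `row_sum` equals the sum of exp(score - row_max) over the whole row, which is the softmax denominator a single pass over all scores would produce.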

🔨 Stage D: FP8 Paged KV Gather — AFTER C

Replace BF16 TMA load of KV with:

Indexed cp.async gather from paged KV cache (fp8)
Per-position dequant scale (inv_scale) applied during or after gather
Keep KV in fp8 in SMEM, let the MMA's per-row scale handle dequant (like blockscaled GEMM)
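
A minimal host-side sketch of the gather described above, under several assumptions: the cache is a list of pages, a block table maps logical page index to physical page, and dequantization multiplies each row by one scale per position. All names are hypothetical, and plain Python numbers stand in for FP8 values.

```python
def gather_paged_kv(pages, scales, block_table, positions, page_size):
    """Gather KV rows for logical positions from a paged cache and dequantize them."""
    rows = []
    for pos in positions:
        page = block_table[pos // page_size]  # logical page -> physical page
        slot = pos % page_size                # row inside that page
        scale = scales[page][slot]            # assumed: one scale per position
        rows.append([value * scale for value in pages[page][slot]])
    return rows

# Two physical pages with two positions each; logical page 0 maps to physical page 1.
pages = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
scales = [[0.5, 1.0], [2.0, 0.25]]
gathered = gather_paged_kv(pages, scales, block_table=[1, 0],
                           positions=[0, 3], page_size=2)
```

The kernel would issue the same indexed reads with cp.async and either apply the scale during the gather or leave the data in FP8 for the MMA, as the list above describes.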

Architecture: Per-Tile Flow (from /root/fragile-kernel-example/README.md)

For each KV tile:
  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
     Signals scores_full_mbar (via PipelineUmmaAsync commit)
  3. Epilogue warps wait on mma_si consumer (scores ready), then:
     a. tcgen05.ld scores from TMEM → register fragments
     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
     d. tcgen05.st exp(scores - new_max) back to TMEM → now it's the P operand
     e. Release mma_si (softmax_done — MMA warp can re-acquire and issue PV MMA)
  4. MMA warp waits on mma_si acquire (softmax done), then MMA2: P @ sKV[stage] → tmem_output (accumulate=True)
  5. Stage released, load warp can refill it

After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM
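
The flow above can be checked numerically with a plain-Python reference before any TMEM or barrier details enter. The sketch below is an assumption-laden model, not the kernel: nested lists stand in for SMEM and TMEM, K and V are the same tensor (shared KV), no softmax scale is applied, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def tiled_attention(q, kv, tile):
    """Per-tile flow: MMA1, online softmax with in-place rescale, MMA2, final divide."""
    rows, dim = len(q), len(kv[0])
    out = [[0.0] * dim for _ in range(rows)]      # stands in for tmem_output
    row_max = [-math.inf] * rows
    row_sum = [0.0] * rows
    for start in range(0, len(kv), tile):
        kv_tile = kv[start:start + tile]          # step 1: load sKV[stage]
        scores = matmul(q, transpose(kv_tile))    # step 2: MMA1, accumulate=False
        p = []
        for r in range(rows):                     # step 3: softmax on this tile
            new_max = max(row_max[r], max(scores[r]))
            rescale = math.exp(row_max[r] - new_max)
            out[r] = [o * rescale for o in out[r]]
            p_row = [math.exp(s - new_max) for s in scores[r]]
            row_sum[r] = row_sum[r] * rescale + sum(p_row)
            row_max[r] = new_max
            p.append(p_row)
        pv = matmul(p, kv_tile)                   # step 4: MMA2, accumulate=True
        out = [[o + d for o, d in zip(o_row, d_row)] for o_row, d_row in zip(out, pv)]
    return [[o / row_sum[r] for o in out[r]] for r in range(rows)]

def reference_attention(q, kv):
    """Single-pass softmax(Q @ KV^T) @ KV for comparison."""
    probs = []
    for row in matmul(q, transpose(kv)):
        m = max(row)
        e = [math.exp(s - m) for s in row]
        probs.append([x / sum(e) for x in e])
    return matmul(probs, kv)
```

`tiled_attention` and `reference_attention` should agree to floating-point tolerance for any tile size, including a ragged final tile.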

✅ NVFP4 MoE (CuTeDSL) — WORKING

nvfp4_cutedsl.py + moe_pipeline.py
CuTeDSL NVFP4 Linear (q_a, kv, q_b, o_b) — cosine 0.994+
CuTeDSL NVFP4 MoE (L1 gate+up, SiLU, L2 down) — cosine 0.988
Fused SwiGLU epilogue (granularity-8 weight interleave) — cosine 0.988

✅ FP8 KV Quantize/Dequant — WORKING

FP8 KV: cosine 0.9997
NVFP4 KV: cosine 0.9943 (2x smaller than FP8)
Paged KV cache read/write: cosine 1.0

❌ Sparse Decode Attention — NOT YET REWRITTEN

native_sparse_decode.py still has the scalar FMA bug. Needs the same tcgen05.mma rewrite.

✅ Full Attention Pipeline (standalone tests) — WORKING

FP8 KV → full attention: cosine 0.9997
CSA sparse attention (cr=4): works
HCA sparse attention (cr=128): works
Merged CSA+SWA attention: works

Critical APIs & Lessons

C-fragment ≠ A-fragment TMEM Physical Layout — THE MAY 20-21 FINDING

The St32x32bOp with C-fragment composition writes to C-layout physical TMEM addresses. The PV MMA reads from A-layout physical TMEM addresses. These are DIFFERENT physical locations for the same logical (M,K) position.

For the softmax to work, P must be written to TMEM using the A-fragment's physical layout, not the C-fragment's. The backward FMHA does this correctly by:

Creating the store destination with A-fragment layout + recast C-fragment iterator
Using a BF16 St32x32bOp atom
True BF16 register (not FP32 recast) via quantize() pattern

Forward FMHA Recast Pattern — FP16 ONLY

The cute.recast_ptr + .store(v.to(FP16)) pattern for packing 2×16-bit into 1×FP32 register is validated for FP16 only. BF16 is rejected in forward FMHA. The BF16 recast produces zero output when writing to the same TMEM region as the MMA output, and NaN when writing to a different region read via A-fragment.

PipelineUmmaAsync consumer group size — thread count, NOT warp count

# WRONG (caused CUDA_ERROR_LAUNCH_FAILED):
consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 4)  # warp count

# CORRECT (matches fmha.py):
consumer_group=pipeline.CooperativeGroup(pipeline.Agent.Thread, 32 * len(softmax_warp_ids))  # thread count
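
The difference between the two counts is easy to sanity-check on the host. The warp id assignment below is hypothetical; it only mirrors the 4 epilogue + 1 MMA + 1 TMA split described for Stage A.

```python
WARP_SIZE = 32

# Hypothetical warp roles, following the Stage A split.
softmax_warp_ids = (0, 1, 2, 3)
mma_warp_id = 4
tma_warp_id = 5

def group_thread_count(warp_ids):
    """Thread count of a cooperative group made of whole warps."""
    return WARP_SIZE * len(warp_ids)

consumer_threads = group_thread_count(softmax_warp_ids)   # 128, not 4
threads_per_cta = WARP_SIZE * (len(softmax_warp_ids) + 2)  # 192
```

Passing `len(softmax_warp_ids)` (4) where `consumer_threads` (128) is expected is the mistake shown in the WRONG variant above.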

TMEM offset arithmetic

find_tmem_tensor_col_offset(fragment) — returns physical TMEM column count (with 0x8000 tag for A-fragments)
QK accumulator C fragment: 128 TMEM columns
PV A-fragment: offset 0x8020 = tag(0x8000) + col(32) — the 0x8000 is a TMEM memory-space identifier
tOrP0 = cute.make_tensor(tOrP.iterator + acc_dtype.width // q_dtype.width * tmem_p0_offset, tOrP.layout) — A-fragment offset scaled by dtype width ratio (F32/BF16 = 2)
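
The arithmetic in these notes can be reproduced with plain integers. The sketch assumes the tag is the single bit 0x8000 and that the remaining bits hold the column count, as described above. The helper names are hypothetical and not part of CuTeDSL.

```python
TMEM_A_TAG = 0x8000  # memory-space tag seen on A-fragment offsets

def split_tagged_offset(offset):
    """Split a tagged offset into (tag, column count)."""
    return offset & TMEM_A_TAG, offset & ~TMEM_A_TAG

def scaled_p_offset(tmem_p0_offset, acc_width_bits=32, q_width_bits=16):
    """Scale a column offset by the accumulator/operand width ratio (F32/BF16 = 2)."""
    return acc_width_bits // q_width_bits * tmem_p0_offset

tag, cols = split_tagged_offset(0x8020)
```

For 0x8020 this yields tag 0x8000 and 32 columns. An illustrative `scaled_p_offset(32)` returns 64 because the A-fragment iterator advances in 16-bit elements rather than 32-bit columns.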

A-fragment iterator must use recast C-fragment pointer

When creating the P tensor for PV MMA's A-operand, the iterator must be the C-fragment's iterator recast to BF16:

tP_iter = cute.recast_ptr(tStS.iterator, dtype=self.q_dtype)
tP = cute.make_tensor(tP_iter, p_tmem_s.outer)
tOrP = pv_thr.make_fragment_A(tP)[None, None, None, 0]

Without the recast, the A-fragment addresses are computed from an FP32 pointer base, giving wrong physical TMEM addresses (illegal memory access crash).

V SMEM aliasing (K and V share SMEM)

v_smem_s = utils.sm100.make_smem_layout_b(pv_mma, pv_mma_tiler, b_dtype, 1)
sV_ptr = cute.recast_ptr(sB.iterator, v_smem_s.inner)
sV = cute.make_tensor(sV_ptr, v_smem_s.outer)
tCrV = pv_mma.make_fragment_B(sV)

`make_trivial_tiled_mma` has two overloads

# New (preferred):
make_trivial_tiled_mma(a_dtype, b_dtype, a_leading_mode, b_leading_mode,
                        acc_dtype, cta_group, mma_tiler_mn, a_source=SMEM)

# Deprecated (still works, used by Stage A):
make_trivial_tiled_mma(ab_dtype, a_leading_mode, b_leading_mode,
                        acc_dtype, cta_group, mma_tiler_mn, a_source=SMEM)

Other APIs discovered from Stage A

cute.Tensor API — cutlass_torch.from_dlpack(t).mark_layout_dynamic(leading_dim=...)
3D tensors — Tensors must be 3D (M, K, L) for cute.local_tile — add L=1 dimension
PipelineTmaUmma.create(...).make_participants() — returns (producer, consumer) pair
utils.gemm.sm100.epilogue_tma_store — handles transform + partition/dcopy. DO NOT hand-roll.
get_num_tmem_alloc_cols — correct TMEM allocation (accepts list of fragments, sums cols, rounds to power of 2)
smem.allocate_tensor() — for SMEM tensors (not SharedStorage struct for A/B/C)
LayoutEnum.from_tensor(a).mma_major_mode() — major mode from cute tensor
Minimum valid N tile for tcgen05.mma BF16: 32 (step 32, range 32-256)
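
The rounding behaviour noted for get_num_tmem_alloc_cols can be modelled as below. This is a re-implementation sketch, not the library's code. The 32-column minimum and 512-column maximum are assumptions taken from the tcgen05.alloc limits as I understand them, and `round_up_tmem_cols` is a hypothetical name.

```python
def round_up_tmem_cols(fragment_cols, min_cols=32, max_cols=512):
    """Sum per-fragment column counts and round up to a power of two."""
    total = sum(fragment_cols)
    if total > max_cols:
        raise ValueError("fragments need more TMEM columns than are available")
    cols = min_cols
    while cols < total:
        cols *= 2
    return cols
```

For example, a 128-column QK accumulator plus a 64-column second accumulator (an illustrative size) needs 192 columns, which rounds up to 256.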

Environment

Server: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
venv: source /root/dsv4-nvfp4-workspace/venv/bin/activate
PYTHONPATH: /root/dsv4-nvfp4-workspace/kernel
Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
vLLM repo: /root/dsv4-nvfp4-workspace/vllm (modified for Blackwell)
Pseudocode: /root/fragile-kernel-example/README.md — authoritative per-tile attention flow
fmha.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py
fmha_bwd.py reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py

4-Stage Build Plan

Stage	Goal	Status
A	Bare Q@K^T via tcgen05.mma → TMEM → GMEM	✅ COMPLETE
B	Two MMAs + identity softmax (validates TMEM A operand, shared KV, layout transform, barrier ordering)	🔨 A-fragment store compiles, register data flow needs fixing
C	Online softmax between MMA1 and MMA2 (the hard part)	⬜ TODO
D	FP8 paged KV gather + dequant (replace BF16 TMA load)	⬜ TODO
