DSV4 Inference Kernel

⚠️⚠️⚠️ CRITICAL: TMA Partition Tensor Mode Ordering ⚠️⚠️⚠️

THIS BUG COST US AN ENTIRE DAY. READ THIS. BURN IT INTO YOUR BRAIN.

After cpasync.tma_partition(), the output GMEM tensor has 4 modes (verified on B200):

tBgK shape: (((64, 128), 1), ?, KV_tiles, ?)
                 mode 0      1  2        3

Mode 2 is the GMEM tile dimension: the one you index with kt to load different K/V tiles.

THE WRONG WAY (what we did — silently loads from tile 0 forever):

# ❌❌❌ (None,None,0,0) KEEPS MODES 0,1 FREE, SETS MODE 2 TO 0 ❌❌❌
# Mode 2 (the KV tile dim) gets collapsed to coordinate 0.
# TMA ALWAYS reads from tile 0.
tBgK = tBgK[(None, None, 0, 0)]  # ← WRONG! Mode 2 pinned to 0!

# The copy "works" but kv_coord indexes mode 1 (inner GEMM K, not KV tiles).
cute.copy(tma_k, tBgK[(None, kv_coord)], ...)  # ← kv_coord indexes wrong mode!

THE RIGHT WAY (verified on B200 at n=128 and n=256):

# ✅ (None,0,None,0) keeps modes 0 and 2 free → 2D tensor
# Mode 2 (KV tiles) survives as the second mode.
tBgK = tBgK[(None, 0, None, 0)]

# ✅ [None, kt] indexes the surviving mode 1 (originally mode 2 = KV tiles)
cute.copy(tma_k, tBgK[None, kt], ...)
#                       ^^ THIS IS THE KV TILE DIM

Verified shapes on B200 (May 22, n=256, inside @cute.kernel):

Before slice: tBgK = (((64,128),1), Int32(?), Int32(?), Int32(?))  — 4 modes
After (None,0,None,0): tBgK = (((64,128),1), Int32(?))             — 2 modes
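
The slicing rule can be modelled in plain Python, without CuTeDSL or a GPU. The sketch below is only an illustration: `slice_modes` is a hypothetical helper (not a CuTeDSL API), and the mode labels are copied from the shape documented above. It treats `None` as "keep this mode free" and an integer as "pin this mode to that coordinate".

```python
def slice_modes(shape, coord):
    """Model a CuTe-style slice: None keeps a mode free, an int pins it.

    Returns (surviving_shape, kept_mode_indices, pinned_modes).
    Hypothetical helper for illustration; not part of CuTeDSL.
    """
    if len(shape) != len(coord):
        raise ValueError("coordinate must have one entry per mode")
    kept = [i for i, c in enumerate(coord) if c is None]
    pinned = {i: c for i, c in enumerate(coord) if c is not None}
    return tuple(shape[i] for i in kept), kept, pinned


# Mode labels as documented above: mode 2 is the KV tile axis.
tBgK_shape = (((64, 128), 1), "?", "KV_tiles", "?")

wrong_shape, wrong_kept, wrong_pinned = slice_modes(tBgK_shape, (None, None, 0, 0))
right_shape, right_kept, right_pinned = slice_modes(tBgK_shape, (None, 0, None, 0))

print("wrong:", wrong_shape, "kept", wrong_kept, "pinned", wrong_pinned)
print("right:", right_shape, "kept", right_kept, "pinned", right_pinned)
```

In this model the wrong slice keeps modes 0 and 1 and pins mode 2, so no later index can reach the KV tile axis. The right slice keeps modes 0 and 2, so the second surviving mode is the KV tile axis.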

WHY THIS IS SO INSIDIOUS

  1. No error, no warning. The slice tBgK[(None,None,0,0)] silently sets mode 2 to 0.
  2. Single-tile (n=128) works perfectly. With only 1 KV tile, mode 2 is size 1, so the bug is invisible.
  3. Multi-tile tests produce "reasonable" output. The TMA loads from tile 0 every time, so you get a valid (but wrong) attention computation. Cosine similarity is 0.7-0.9, not NaN.
  4. The strides are all 0. Printing tBgK.layout.stride shows all zeros for TMA tensors. You can't detect the bug from strides alone.
  5. cute.printf shows kv_coord=0. We thought the JIT was constant-folding the variable. It wasn't — the variable was fine, but it was indexing the wrong mode.
  6. The 8-mode theory was wrong. We assumed tma_partition produced 8 TMA coordinate dimensions. It produces 4. The 8-None no-op slice fails with "weakly congruent" at JIT compile.
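
Points 2 and 3 can be reproduced with a toy CPU model. This is a minimal sketch, assuming a single query row and tiny hand-picked tiles; `attend` and its `pin_tile_zero` flag are hypothetical names that stand in for the good and bad slices, and nothing here touches TMA. The data is chosen so the numbers are easy to check by hand, not to match the kernel's measured cosines.

```python
import math


def attention_row(q, k_rows, v_rows):
    """Reference softmax attention for a single query row."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in k_rows]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(v_rows[0])
    return [sum(w * v[j] for w, v in zip(weights, v_rows)) / total for j in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def attend(q, k_tiles, v_tiles, pin_tile_zero):
    """Gather K/V tile by tile; pin_tile_zero models the bad slice."""
    k_rows, v_rows = [], []
    for kt in range(len(k_tiles)):
        src = 0 if pin_tile_zero else kt  # the bug: every load reads tile 0
        k_rows.extend(k_tiles[src])
        v_rows.extend(v_tiles[src])
    return attention_row(q, k_rows, v_rows)


q = [1.0, 0.0]
k_tiles = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
v_tiles = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]

# One KV tile: the pinned and correct paths read the same data.
cos_single = cosine(attend(q, k_tiles[:1], v_tiles[:1], False),
                    attend(q, k_tiles[:1], v_tiles[:1], True))
# Two KV tiles: the pinned path never sees tile 1.
cos_multi = cosine(attend(q, k_tiles, v_tiles, False),
                   attend(q, k_tiles, v_tiles, True))
print(cos_single, cos_multi)
```

With one tile the two paths agree exactly. With two tiles the pinned path still returns a finite, plausible vector, and the cosine against the correct result is about 0.707. A multi-tile test therefore needs per-tile data that differs and a tight cosine threshold.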

THE LESSON

PRINT THE SHAPES. ALWAYS. Run print(f"tBgK: shape={cute.shape(tBgK)}") inside @cute.kernel at trace time. The shapes are your ground truth. Reasoning about mode counts without evidence is how we wasted a day.

The correct pre-slice depends on which mode is the GMEM tile iteration axis. For our local_tile + partition_B + group_modes(0,3) pattern, mode 2 is the KV tile axis. (None,0,None,0) keeps it free. (None,None,0,0) collapses it to 0.

# ALWAYS verify the shape at trace time:
print(f"tBgK shape: {cute.shape(tBgK)}")  # 4 modes
print(f"tBgK after slice: {cute.shape(tBgK[(None,0,None,0)])}")  # 2 modes

# Then index the 2D tensor:
cute.copy(tma_k, tBgK[None, kt], ...)

IF YOU USE (None,None,0,0) INSTEAD OF (None,0,None,0), MULTI-TILE TMA WILL BE SILENTLY BROKEN.


Architecture

DSV4 is not MLA. It uses CSA (Compressed Sparse Attention, m=4) and HCA (Heavily Compressed Attention, m=128). KV latent is (T, 512) shared across all 128 heads. Sink weights merge sparse + SWA attention. vLLM misnames this as "MLA" — it is not. The architecture is fundamentally different.

DSV4 inference pipeline — component status
==========================================

Legend:
 [✓] built and tested
 [~] partial — reference or seam exists, native pending
 [✗] to build


 ┌────────────────────────────────────┐
 │ [✗] Embedding + mHC init          │
 │ token embed + n_hc=4 streams      │
 └────────────────┬───────────────────┘
                  │
                  ▼
┌─ Transformer layer × L ──────────────────────────────────────────────┐
│ HCA on layers 0-1 of Pro, alternating CSA / HCA after             │
│                                                                      │
│ ┌─ Attention sub-block ──────────────────────────────────────────┐  │
│ │ [✓] Residual mHC pre + post mix                               │  │
│ │ [~] Norms + RoPE             RMSNorm + partial RoPE           │  │
│ │ [✓] Q / KV projection        NVFP4 linears + LoRA             │  │
│ │ [~] Token compressor         CSA m=4 / HCA m=128             │  │
│ │ [✗] Indexer + top-k          CSA only, FP4 QK                 │  │
│ │ [~] FMHA core                QK → online softmax → PV         │  │
│ │                              + SWA branch + sink merge         │  │
│ │ [✓] Output projection        inv RoPE + wo_a grouped + wo_b   │  │
│ └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│ ┌─ FFN sub-block ────────────────────────────────────────────────┐  │
│ │ [✓] Residual mHC pre + post mix                               │  │
│ │ [~] Pre-FFN norm              RMSNorm                          │  │
│ │ [✗] Router                    sqrt(softplus) + topk + hash     │  │
│ │ [✓] Routed MoE               fused SwiGLU L1 + L2             │  │
│ │ [✓] Shared expert            NVFP4 single-group GEMM          │  │
│ └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬───────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│ [✗] Final RMSNorm → [✗] LM head → [✗] MTP (depth=1) → [✗] Sampler │
└──────────────────────────────────────────────────────────────────────┘

┌─ Supporting infrastructure ──────────────────────────────────────────┐
│ [✗] KV cache management                                             │
│ • state cache: SWA window + uncompressed tail per layer             │
│ • classical paged cache: lcm(4, 128) = 128 tokens per block       │
│ • heterogeneous layout per layer                                    │
└──────────────────────────────────────────────────────────────────────┘


Summary
-------
 Built  [✓] : 6 — mHC ×2, Q/KV proj, output proj, routed MoE,
               shared expert
 Partial [~] : 4 — norms+RoPE, token compressor, FMHA core,
               pre-FFN norm
 To build [✗] : 8 — embedding+init, indexer+top-k, router,
               final norm, LM head, MTP, sampler, KV cache
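
The 128-token block size listed for the classical paged cache is consistent with the least common multiple of the two compression factors from the Architecture section. The sketch below checks that arithmetic. The entries-per-block figures rest on an assumption that is not stated in this README: that m is the number of raw tokens folded into one compressed entry.

```python
import math

CSA_M = 4    # CSA compression factor, from the Architecture section
HCA_M = 128  # HCA compression factor, from the Architecture section

# Smallest token count that is a whole number of CSA and HCA groups.
block_tokens = math.lcm(CSA_M, HCA_M)

# Assumption: one compressed entry per m raw tokens.
csa_entries_per_block = block_tokens // CSA_M
hca_entries_per_block = block_tokens // HCA_M

print(block_tokens, csa_entries_per_block, hca_entries_per_block)
```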

Status (May 23, 2026 — 02:55 UTC)

Stage  Status                                  Description
A      COMPLETE                                Q@K^T via tcgen05.mma → TMEM → GMEM
B      COMPLETE                                QK → identity softmax → P@V pipeline (TMEM alias, KV-tile interleaving)
C      ⚠️ SINGLE-TILE OK, MULTI-TILE 3% ERROR  n=128 cos 0.973. n=256 cos 0.793. TMEM layout mismatch between MMA C-fragment and get_tmem_load_op. See below.
D      TODO                                    Full decode attention: paged KV cache, multi-query, causal mask
E      TODO                                    Production kernel: extract into dsv4/kernels/attention/, PyTorch custom op, vLLM bridge

Package Structure

dsv4/
├── kernels/          Pure GPU code (CuTeDSL @cute.jit, .cu files)
│   ├── gemm/           NVFP4 MoE GEMM kernels (grouped, fused_swiglu, dense, scheduler)
│   ├── attention/      FMHA kernel (stub — extraction is Stage E)
│   ├── compressor/     CSA/HCA token-level compressor
│   ├── decode/         Decode-time attention (sparse, SWA — future)
│   └── cuda/           Raw .cu files (deinterleave_quantize, sparse_topk_metadata)
├── ops/              PyTorch ↔ kernel bridges
│   ├── quantize.py      BF16 ↔ NVFP4 conversion, scale factors
│   ├── layouts.py       Scale swizzle, gate/up interleave, K-major, offsets
│   ├── gemm_runner.py   Warmup, compile, run grouped/fused GEMMs
│   ├── custom_ops.py    torch.library.custom_op registrations
│   ├── decode_sparse.py native_sparse_decode dispatcher
│   ├── decode_swa.py    native_swa_decode dispatcher
│   ├── rope.py          Forward + inverse RoPE
│   └── topk.py          Python wrapper for sparse_topk_metadata.cu
├── layers/           nn.Module-style components
│   ├── linear.py        Nvfp4Linear
│   ├── grouped_linear.py Nvfp4GroupedLinear
│   ├── moe.py           Nvfp4MoE
│   ├── shared_expert.py Nvfp4SharedExpert
│   ├── mhc.py           mHCLayer
│   └── (stubs: attention, ffn, router, norm, embedding)
├── model/            Model assembly (stubs — Phase 1)
├── cache/            KV cache infra (stubs — Phase 3)
├── loader/           Checkpoint I/O (stubs — Phase 1)
└── reference/        Slow PyTorch oracles (never imported by production code)
    ├── attention.py     RoPE, KV cache, causal attention, SWA
    ├── csa_attention.py CSA/HCA sparse attention
    ├── compressor.py    Compressor PyTorch example
    └── moe_pipeline.py  MoE pipeline reference

Mental model: kernels/ → ops/ → layers/ → model/ (dependency flows left to right). reference/ and loader/ are sidecars.
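
The mental model can be turned into a small import-direction check. This is a sketch under one assumption: "dependency flows left to right" is read here as "a package may import from packages to its left, never from packages to its right". `import_allowed` is a hypothetical helper, not something that exists in the repo.

```python
# Left-to-right order taken from the mental model above.
LAYER_ORDER = ["kernels", "ops", "layers", "model"]
# reference/ is documented as never imported by production code.
NEVER_IMPORTED = {"reference"}


def import_allowed(importer, imported):
    """Return True if package `importer` may import package `imported`."""
    if imported in NEVER_IMPORTED:
        return importer in NEVER_IMPORTED
    if importer not in LAYER_ORDER or imported not in LAYER_ORDER:
        return True  # sidecars such as loader/ and cache/ are not constrained here
    return LAYER_ORDER.index(imported) <= LAYER_ORDER.index(importer)


print(import_allowed("layers", "ops"))       # layers/ builds on ops/
print(import_allowed("kernels", "layers"))   # kernels/ must not reach rightwards
print(import_allowed("model", "reference"))  # oracles stay out of production code
```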


Active Test Files

FMHA (Stages A/B/C) — in tests/unit/

File                          Stage  Status
test_fmha_v3.py               A+B    Full QK→identity softmax→PV, cosine 0.999999
test_fmha_v3_12w.py           A+B    12-warp QK→PV, cosine 0.999999
test_fmha_v3_stage_c_full.py  C      Real online softmax + O normalization, cosine 0.993-0.996
test_fmha_v3_stage_c_min.py   C      🔨 Early 12-warp pipeline (broken pipeline state)
test_pv64_with_softmax.py     B      (128,64) PV, single AB pipeline
test_128_128_vdiag.py         A+B    (128,128) PV baseline
test_qkonly.py                A      QK with split Q/KV pipelines
test_qk_softmax.py            A+B    QK + identity softmax, no PV

MoE / GEMM — in tests/unit/

File                       What
test_cutedsl.py            NVFP4 grouped GEMM kernel
cudagraph_test.py          Cudagraph capture + replay
layertest.py               Per-layer correctness
test_custom_op.py          torch.library custom ops
test_compile_custom_op.py  Compile + warmup
test_fp4_roundtrip.py      BF16 → NVFP4 → BF16 roundtrip
test_interleave.py         Gate/up weight interleaving
test_interleave_gemm.py    Interleaved GEMM correctness
test_fused_step1.py        Fused SwiGLU GEMM

Archived Tests

tests/archive/ contains ~190 debug files from Stages A/B. Not maintained. Can be deleted.


Test Harness

Scripts in tests/ for running tests on the B200 (root@45.76.247.107):

run_test.sh — Run a test in a screen session

# On the B200:
cd /root/dsv4-nvfp4-workspace/kernel
bash tests/run_test.sh tests/unit/test_fmha_v3.py

What it does:

  1. Kills any existing kernel-test screen and SIGKILLs all child processes (handles deadlocked GPU procs that ignore SIGHUP)
  2. Deletes the old log file
  3. Starts a new screen -dmS kernel-test running the test
  4. Logs output to /tmp/kernel-test.log
  5. Verifies the screen started

check_log.sh — Check test progress

bash tests/check_log.sh

Shows the log contents and whether the screen is still running.

Local → B200 workflow

# 1. Edit locally, commit, push
cd ~/dev/nvfp4-megamoe-kernel
git add -A && git commit -m "my change" && git push

# 2. SSH to B200, pull, run
ssh root@45.76.247.107
cd /root/dsv4-nvfp4-workspace/kernel && git pull
bash tests/run_test.sh tests/unit/test_fmha_v3_stage_c_full.py

# 3. Check results
bash tests/check_log.sh

fire_b200_test — One-command local test runner

Lives in ~/.openclaw/workspace/fire_b200_test (NOT in the repo — project-specific tooling).

# From your local machine, one command to push, run, and get results:
~/.openclaw/workspace/fire_b200_test tests/unit/test_fmha_v3.py

What it does:

  1. Auto-commits and pushes any local changes
  2. SSHes to the B200, pulls, and starts run_test.sh in a screen
  3. Polls every 15s until the screen exits
  4. Dumps the full test log to your terminal
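
The polling step can be sketched independently of SSH and screen. The helper below is hypothetical and is not the actual fire_b200_test implementation: `wait_for_exit` takes a caller-supplied `is_running` callable (for example, a check that the kernel-test screen session still exists) and an injectable `sleep`, so it can be exercised without a B200.

```python
import time


def wait_for_exit(is_running, interval_s=15.0, sleep=time.sleep, max_polls=None):
    """Poll `is_running()` until it returns False; return the number of polls.

    `is_running` is a caller-supplied callable (hypothetical), e.g. a check
    for the kernel-test screen session over SSH.
    """
    polls = 0
    while is_running():
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("test still running after %d polls" % polls)
        sleep(interval_s)
    return polls


# Usage with a fake session that finishes after three polls, with no real sleeping.
remaining = [True, True, True, False]
polls = wait_for_exit(lambda: remaining.pop(0), sleep=lambda s: None)
print(polls)
```

The `max_polls` cap is an addition for this sketch. It gives a deadlocked GPU process a way to surface as a timeout instead of an endless wait.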

This is strictly for the DSV4 NVFP4 kernel project. It hardcodes the B200 IP, repo paths, and git remote.


Stage C: Online Softmax — TMEM Layout Mismatch Issue

Current Results (test_fmha_v3_stage_c.py)

n     cos    Status
128   0.973  ⚠️ 3% error from TMEM layout mismatch
256   0.793  ⚠️ Two TMEM round-trips compound the error
384+  N/A    Pipeline doesn't cycle past 2 KV tiles

Root Cause: TMEM Layout Mismatch

The MMA instruction writes O to TMEM using the C-fragment layout. The epilogue_tma_store helper reads O from TMEM using get_tmem_load_op, which uses a different TMEM column mapping. When epilogue_tma_store reads O directly after PV (no normalize), the layout matches perfectly — cos 0.999998 (raw PV output is correct).

The problem appears when we do a TMEM round-trip (load O → modify → store back) using hand-constructed Ld32x32bOp/St32x32bOp atoms:

  1. NO-OP round-trip (load + store unchanged) → cos 0.973. The hand-constructed atoms read/write using a different column mapping than get_tmem_load_op. The data gets "transcoded" — close but not exact.
  2. Normalize round-trip (load → multiply by 1/row_sum → store) → cos 0.973 (with preceding NO-OP) or cos 0.465 (without NO-OP). Without the NO-OP, epilogue_tma_store reads the MMA layout directly and produces garbage.
  3. O rescale (kt > 0) + normalize → cos 0.793 at n=256. Each round-trip compounds the layout mismatch error.

Why the NO-OP Round-Trip "Fixes" It

The MMA writes O in C-fragment TMEM layout. epilogue_tma_store reads in get_tmem_load_op layout. Without a round-trip, these layouts are incompatible → garbage output (cos 0.465).

A NO-OP round-trip through hand-constructed atoms reads the data using the hand-constructed layout (which can read the C-fragment data) and writes it back using the hand-constructed layout. After the round-trip, the data is in the hand-constructed layout, which is close to (but not identical to) the get_tmem_load_op layout → 3% error (cos 0.973).
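
The general failure mode, where a writer and a reader disagree about which logical element lives in which column, can be shown with a toy model. This is a sketch only: the two column mappings are made up, real TMEM layouts are far more complex, and the model does not reproduce the 3% figure. It only shows why a load and a store must share one mapping.

```python
def store(values, layout, num_cols):
    """Write logical element i to physical column layout[i]."""
    tmem = [None] * num_cols
    for i, v in enumerate(values):
        tmem[layout[i]] = v
    return tmem


def load(tmem, layout):
    """Read logical element i from physical column layout[i]."""
    return [tmem[col] for col in layout]


# Two made-up column mappings over 4 columns.
layout_a = [0, 1, 2, 3]
layout_b = [0, 2, 1, 3]

data = [10.0, 11.0, 12.0, 13.0]

paired = load(store(data, layout_a, 4), layout_a)      # same mapping both ways
mismatched = load(store(data, layout_a, 4), layout_b)  # writer and reader disagree
print(paired, mismatched)
```

The paired path returns the data unchanged. The mismatched path returns every value but with two elements swapped, which is the kind of "close but not exact" result that passes a loose sanity check.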

The Proper Fix: correction_epilog Pattern

The CUTLASS FMHA reference uses a one-way trip for the final epilogue:

TMEM --get_tmem_load_op--> reg (normalize + FP32→BF16) --get_smem_store_op--> SMEM --TMA--> GMEM

This reads O using get_tmem_load_op (same layout as epilogue_tma_store) and writes directly to SMEM. No TMEM round-trip. No layout mismatch. The CUTLASS reference uses this pattern and gets correct results.
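
The register-side step of that one-way trip (normalize, then FP32→BF16) can be sketched on the CPU. This is an illustration under stated assumptions, not the CUTLASS code: `to_bf16` and `normalize_row` are hypothetical helpers, BF16 is emulated with round-to-nearest-even on the float32 bit pattern, and NaN/infinity handling is left out.

```python
import struct


def to_bf16(x):
    """Round a Python float to the nearest BF16 value (round-to-nearest-even).

    NaN and infinity handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # bias so truncation rounds to nearest even
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def normalize_row(o_row, row_sum):
    """Scale one row of O by 1/row_sum, then convert FP32 -> BF16."""
    inv = 1.0 / row_sum
    return [to_bf16(v * inv) for v in o_row]


out = normalize_row([2.0, 6.0], 8.0)
print(out)
```

Because the scale is applied in registers on the way out, O is never written back to TMEM, so no store atom is involved.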

Why we can't use it yet: The TMA store from SMEM → GMEM requires tma_partition / flat_divide which hit CuTeDSL region isolation errors when called inside if warp_idx < self.mma_warp_id blocks. The epilogue_tma_store helper works because it's a regular Python function that inlines into the same MLIR region — but it always reads from TMEM, not SMEM.

Possible solutions:

  1. Call epilogue_tma_store but inject the 1/row_sum multiply into its pipeline (requires modifying the helper or replicating it inline with the scale)
  2. Pre-compute TMA partitioning outside the if block and pass the partitioned tensors through the kernel interface
  3. Use the experimental cutlass.cute.experimental.epilogue_tma_store API which has a cleaner structure

Verified Facts

  • Raw PV output is perfect: epilogue_tma_store with identity op gives cos 0.999998 at n=128
  • Softmax P values are correct: Unnormalized P@V matches reference exactly (cos 0.999998)
  • Online softmax computation is correct: row_max and row_sum tracking works
  • The ONLY issue is the TMEM round-trip for normalize/rescale
  • Stage A/B with identity softmax: cos 0.999999 — the pipeline works, softmax is the only addition
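
For reference, the row_max / row_sum tracking and the O rescale for kt > 0 follow the standard online-softmax recurrence. The sketch below is a plain-Python version for a single query row, written as a CPU oracle; it is not the kernel code, and the function names are made up for the example.

```python
import math


def online_softmax_attention(q, k_tiles, v_tiles):
    """One query row, processed one KV tile at a time."""
    dim = len(v_tiles[0][0])
    row_max = -math.inf
    row_sum = 0.0
    o = [0.0] * dim
    for k_rows, v_rows in zip(k_tiles, v_tiles):
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_rows]
        new_max = max(row_max, max(scores))
        scale = math.exp(row_max - new_max)  # 0.0 on the first tile
        p = [math.exp(s - new_max) for s in scores]
        row_sum = row_sum * scale + sum(p)
        # Rescale the running O, then accumulate this tile's P @ V.
        o = [o[j] * scale + sum(pi * v[j] for pi, v in zip(p, v_rows))
             for j in range(dim)]
        row_max = new_max
    return [x / row_sum for x in o]  # final normalize by 1/row_sum


def reference_attention(q, k_rows, v_rows):
    """Softmax attention over all keys at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in k_rows]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, v_rows)) / total
            for j in range(len(v_rows[0]))]


q = [1.0, 0.5]
k_tiles = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.5, 0.5]]]
v_tiles = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

tiled = online_softmax_attention(q, k_tiles, v_tiles)
full = reference_attention(q, k_tiles[0] + k_tiles[1], v_tiles[0] + v_tiles[1])
print(tiled, full)
```

The tiled and full results agree to floating-point rounding, which is the property a multi-tile test of the kernel should check against.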

Architecture (6-warp, current)

Warps 0-3: Softmax + Epilogue (row_max, row_sum, P store, O rescale, final normalize)
Warp 4: MMA (QK, PV)
Warp 5: TMA (Q/K/V load)

TMEM Layout

Col 0-31:   S (QK acc, 128 FP32 via Ld32x32bOp Repetition(32))
Col 32-95:  P (64 FP32 via St32x32bOp Repetition(32), register bridge BF16 view)
Col 128+:   O (PV acc, 64 FP32, rescale via Ld32x32bOp Repetition(16))

Remaining for Multi-Tile Production

  1. Fix TMEM layout mismatch — replace hand-constructed atom round-trips with correction_epilog pattern
  2. Pipeline state cycling for n≥384 — kv_stage=2 can only buffer 2 tiles
  3. 12-warp layout — separate softmax/correction/epilogue warps
  4. O rescale for kt > 0 — must also use paired atoms or correction_epilog
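
Item 2 amounts to reusing the two KV stages as a circular buffer. The sketch below shows the usual index/phase bookkeeping for that, assuming the common convention that the stage index wraps to 0 and a phase bit flips on each wrap. `pipeline_states` is a hypothetical helper, and the initial phase of 0 is an arbitrary choice for the example, not a statement about the kernel's barriers.

```python
def pipeline_states(num_tiles, num_stages):
    """Return (tile, stage_index, phase) for a circular buffer of num_stages slots."""
    index, phase = 0, 0
    states = []
    for kt in range(num_tiles):
        states.append((kt, index, phase))
        index += 1
        if index == num_stages:  # wrapped: reuse slot 0 with the phase flipped
            index = 0
            phase ^= 1
    return states


# Three KV tiles through kv_stage=2.
states = pipeline_states(num_tiles=3, num_stages=2)
print(states)
```

The third tile lands in stage 0 again with the opposite phase. That reuse is the step the current pipeline does not yet perform.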

CuTeDSL Constraints (hard-won)

  1. vectorize=True loops: ONLY load/store/print — no fmax, no cmpf, no inner loops, no carry
  2. .reduce(cute.ReductionOp.MAX): reduces ENTIRE C-fragment to scalar — global max, not per-row
  3. cute.arch.fmax: impure for vectorizer — use plain range() loop
  4. TMA partition tensors have 4 modes: (((64,128),1), ?, KV_tiles, ?). The slice (None,0,None,0) keeps mode 2 (KV tiles) free, and [None, kt] then indexes it.
  5. tBgK[(None, None, 0, 0)] pins mode 2 to 0 — silently reads tile 0 forever. Use (None,0,None,0) instead.
  6. softmax_done_bar NamedBarrier is reusable across tiles
  7. Hand-constructed TMEM atoms corrupt data on round-trip: Ld32x32bOp + St32x32bOp built independently introduce ~3% error. Use get_tmem_load_op + get_smem_store_op paired atoms for one-way trips.
  8. CuTeDSL region isolation: flat_divide and tma_partition can't be called inside if warp_idx blocks. Do partitioning outside if blocks or in regular (non-@cute.kernel) helper functions.
  9. composition vs logical_divide: Both re-tile a tensor, but produce different layouts. The CUTLASS correction_rescale uses composition, correction_epilog uses logical_divide. The copy atoms must match the tensor layout they were created with.

Key Lessons

  1. NEVER use find_tmem_tensor_col_offset() as TMEM placement. It returns footprint size, not a safe offset.
  2. FMHA never trusts DLPack tensor layouts. Reconstruct V as (hd, s_k) MN-major inside CuTe.
  3. TMEM allocation must be power of 2.
  4. Square hides bugs. (128,128) worked for every wrong approach. Always test non-square.
  5. St32x32bOp MUST use Float32, NOT BFloat16. BFloat16 causes illegal memory access.
  6. First PV ACCUMULATE=False. Otherwise adds uninitialized TMEM to output.
  7. FMHA P store uses QK C-fragment composition, NOT PV A-fragment. Two aliases, same TMEM.
  8. Register bridge: FP32 backing (store partition) + BF16 view (QK-load layout). Do not skip this.
  9. PRINT THE SHAPES. ALWAYS. Reasoning about TMEM layouts without evidence is how we waste days.
  10. Never assume TMEM round-trips are safe. Verify with NO-OP tests before adding logic.
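
Lesson 3 can be captured in a small sizing helper. This is a sketch: `tmem_alloc_columns` is a hypothetical function, and the 32-column minimum and 512-column maximum are assumptions for the example that should be checked against the PTX documentation for the target GPU.

```python
def tmem_alloc_columns(columns_needed, min_cols=32, max_cols=512):
    """Round a TMEM column count up to a power of two in [min_cols, max_cols].

    min_cols must itself be a power of two. Both bounds are assumptions
    for this sketch, not values taken from this README.
    """
    if columns_needed < 1:
        raise ValueError("columns_needed must be positive")
    cols = min_cols
    while cols < columns_needed:
        cols *= 2
    if cols > max_cols:
        raise ValueError("request exceeds max_cols")
    return cols


# Example: O starts at column 128; a width of 64 columns is hypothetical.
print(tmem_alloc_columns(128 + 64))
```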

Environment

  • Server: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
  • venv: source /root/dsv4-nvfp4-workspace/venv/bin/activate
  • PYTHONPATH: /root/dsv4-nvfp4-workspace/kernel
  • Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
  • vLLM repo: /root/dsv4-nvfp4-workspace/vllm (modified for Blackwell)
  • CUTLASS FMHA reference: /root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py
  • Local CUTLASS clone: /home/openclaw/dev/cutlass