more doc cleanup

This commit is contained in:
2026-06-03 07:37:13 +00:00
parent d463ac8512
commit 0bf276f8c9
2 changed files with 48 additions and 48 deletions

View File

@@ -2,7 +2,8 @@
Production-grade Blackwell SM100 inference kernel for **DeepSeek-V4-Pro NVFP4**, written in CuTeDSL with a CUDA fallback path. Target hardware: NVIDIA B200 (180 GiB HBM3e).
For what's done, what's blocked, and what's next, see **ROADMAP.md**. This file is the durable reference — architecture, design choices, package layout, workflow, and hard-won lessons. If you're touching the kernel, read the "Lessons learned" section every time.
This file is the durable reference — architecture, design choices, package layout, workflow, and hard-won lessons. If you're touching the kernel, read the "Lessons learned" section every time.
---
@@ -88,50 +89,6 @@ One pass, one kernel. No two-loop epilogue, no LSE arithmetic in the merge. This
---
## Our kernel design choices
### Attention kernel (FmhaKernel)
**6-warp specialization.** Warps 03 handle softmax + correction + epilogue. Warp 4 is the MMA warp (QK + PV). Warp 5 is the TMA warp (Q/K/V loads, output store via pipeline).
**P staging — two paths.**
- **TMEM-P** (hd ≤ 64): P stored to TMEM via register bridge (FP32 backing + BF16 view). PV reads P from TMEM. Used at the small head dims where QK C-fragment and PV A-fragment TMEM layouts agree.
- **SMEM-P** (hd > 64): P written to SMEM via coordinate-indexed store using `tTMEM_LOADcS` to map register indices to `(m, k)` then into `sP`'s subtile layout. PV reads P from SMEM with `OperandSource.SMEM`. Required because the QK ↔ PV TMEM layout disagreement at hd > 64 corrupts the round-trip.
**Un-normalized O + LSE output.** The kernel emits raw `sum(P · V)` and `lse = ln(row_sum) + row_max · ln(2)`. External code (or the next kernel pass) divides. This composes — D5 merge, multi-tile rescale, and the inverse-RoPE → wo_a fuse all rely on it.
**Per-head launch for multi-head.** Python loop dispatches the single-CTA kernel once per head. Multi-CTA grid using `flat_divide` + `tma_partition` is the next refactor (see ROADMAP); the path is unblocked once the correction-epilog rewrite lands.
**Head-packed M dimension for decode.** Q reshaped to `(n_h * T, hd, 1)`, all heads' rows packed into the 128-row M tile. Per-row softmax. At Pro decode (T=1, n_h=128) the M tile fits exactly.
**K-dim sub-tiling at hd > 256.** When `head_dim > 256` (MMA instruction K-dim limit), Q and K split into `n_k_sub_tiles = head_dim / 256` chunks along head_dim. QK accumulates in TMEM across sub-tiles (additive in logit space). The PV path uses `pv_n_tile = 128` for hd > 256 to keep sV+sC within the 232 KB SMEM budget.
**Sink bias as logit modification.** D3 (SWA length mask), D4 (causal mask on SWA), and D5c (attention sink) all live in the same post-QK, pre-softmax in-register code. They read `tTMEM_LOADcS` to get `(m, k)` coordinates and modify `tTMEM_LOADrS` before the row-max reduction. The sink bias is added in the raw-logit domain as `attn_sink / scale_softmax`, then the existing `* scale_log2` multiply converts to log2 space.
### MoE kernel (FusedSwiGLUScaledGroupedGemmKernel)
**7-warp specialization.** Warps 03 epilogue (TMEM → registers → SMEM → GMEM with global scale, SwiGLU, clamp). Warp 4 MMA (`tcgen05.mma.block_scale` with SFA/SFB in TMEM). Warp 5 TMA load (A, B, SFA, SFB). Warp 6 scheduler (`MoEStaticPersistentTileScheduler`).
**One-way TMEM → registers → SMEM → GMEM epilogue.** Uses `epilogue_tmem_copy_and_partition` + `epilogue_smem_copy_and_partition` (CUTLASS helpers, paired atoms). The SwiGLU + clamping math runs in registers between the t2r and r2s copies. No TMEM round-trip. This is the same pattern FMHA needs to adopt to fix the D1.5 blocker — see ROADMAP.
**Subtile-level gate/up pairing.** With granularity-8 interleaved L1 weights and `epi_tile_n=8`, even subtiles are gate and odd subtiles are up. `silu_gate_buf` register tensor carries the SiLU result across the subtile-pair boundary.
**`use_2cta_instrs` conditional** on `tokens_sum ≥ 256` and even `cluster_m`. Decode (small M) stays 1-CTA; prefill/batched gets 2-CTA UMMA with multicast B (1.71.9× throughput).
### Heterogeneous KV cache
- **State cache** per request: fixed-size block holding `(n_win SWA KV)` and `(uncompressed tail tokens awaiting compression)`. One block per request, lifetime managed by request scheduling.
- **Classical paged cache** per request: variable blocks holding `(k1 CSA compressed entries, k2 HCA compressed entries)` per layer. `k1 = lcm(m, m') / m = 32`, `k2 = lcm(m, m') / m' = 1`. Block covers 128 original tokens.
- Different layers can produce different KV cache sizes (CSA vs HCA vs SWA-only). The state cache + classical-pool split keeps PagedAttention-style alignment intact for the compressed pool.
### NVFP4 throughout
- **Weights**: NVFP4 (FP8 E4M3 scales, 16-element microblocks). Verified: `sf_dtype`, TMA element type, MMA kind (`mxf4nvf4`) all correct.
- **Activations**: BF16 today, FP4 after NVFP4-1.x epilogue fusion lands (see ROADMAP).
- **KV cache**: BF16 today; the FP8 (RoPE in BF16, NoPE in FP8) split per paper §2.3.4 is on the roadmap as NVFP4-2.
- **Indexer keys**: stored FP4 in the cache today, but scored with a scalar CUDA-core kernel. Tensor-core FP4 scoring (paper §5.2.1) is a Stage F priority.
---
## Package structure
@@ -262,7 +219,7 @@ These are surface-level traps. Get them wrong and the kernel silently produces g
4. **`cute.arch.fmax` is impure** for the vectorizer. Use it inside plain `range()`, never inside `vectorize=True`.
5. **Hand-constructed TMEM atoms corrupt data on round-trip.** Independently-built `Ld32x32bOp` + `St32x32bOp` atoms have addressing that doesn't match — even a NO-OP round-trip drops cos to ~0.97. Use paired atoms from `epilogue_tmem_copy_and_partition` / `epilogue_smem_copy_and_partition` for one-way trips. This is the D1.5 blocker in ROADMAP.
5. **Hand-constructed TMEM atoms corrupt data on round-trip.** Independently-built `Ld32x32bOp` + `St32x32bOp` atoms have addressing that doesn't match — even a NO-OP round-trip drops cos to ~0.97. Use paired atoms from `epilogue_tmem_copy_and_partition` / `epilogue_smem_copy_and_partition` for one-way trips.
6. **CuTeDSL `if` blocks are separate MLIR regions.** Variables defined inside one `if` are not visible in another, even when the condition is a compile-time constant. Define all variables unconditionally before any branching.
@@ -303,13 +260,13 @@ These cost real days to learn. They are listed in priority of how easy they are
- **FMHA P store uses QK C-fragment composition, not PV A-fragment.** Two aliases of the same TMEM region. Mixing them up gives valid-looking garbage.
- **Register bridge for P: FP32 backing (store partition) + BF16 view (QK-load layout).** Do not skip the dual view.
- **TMEM round-trip mismatch with `epilogue_tma_store`**: `epilogue_tma_store` reads O from TMEM using `get_tmem_load_op`'s layout. Hand-built atoms read with a different layout. Round-tripping through hand-built atoms transcodes the data, leaving 3% error.
- **The correction-epilog pattern is the fix.** TMEM → registers (via paired t2r atom) → modify in registers → SMEM (via paired r2s atom) → GMEM (via TMA). One-way trip, no round-trip, no transcoding. The MoE kernel uses this and gets perfect results. See ROADMAP.
- **The correction-epilog pattern is the fix.** TMEM → registers (via paired t2r atom) → modify in registers → SMEM (via paired r2s atom) → GMEM (via TMA). One-way trip, no round-trip, no transcoding. The MoE kernel uses this and gets perfect results.
### CuTeDSL & MLIR
- **CuTeDSL `if` blocks create separate MLIR regions.** Variables defined in `if not use_smem_p:` and read in another `if not use_smem_p:` inside a `for` inside an `if warp_idx < mma_warp_id:` are not visible. Define unconditionally before any branching.
- **CuTeDSL compiles both branches of Python `if`.** Wrap mode-specific dead code in `const_expr(condition)` to eliminate it. Critical for O rescale (`n_kv_tiles > 1`), LSE compute (`not normalize`), SMEM-P path.
- **CuTeDSL MLIR backend cannot handle complex pipeline loops at hd=512.** Both unrolled (Python `range`) and runtime (`cutlass.range unroll=1`) loops trigger exponential-or-worse optimizer time. Tracer is fast (~0.8s); MLIR optimizer chews for 3+ hours. Workaround options in ROADMAP.
- **CuTeDSL MLIR backend cannot handle complex pipeline loops at hd=512.** Both unrolled (Python `range`) and runtime (`cutlass.range unroll=1`) loops trigger exponential-or-worse optimizer time. Tracer is fast (~0.8s); MLIR optimizer chews for 3+ hours.
- **Don't mix Python loops and pipeline ops.** Python `for` unrolls at trace time — N copies of pipeline acquire/release + TMA + GEMM blow up the IR. Prefer `cutlass.range(unroll=1)` for pipeline loops.
### Math & merging

View File

@@ -0,0 +1,43 @@
## Our kernel design choices
### Attention kernel (FmhaKernel)
**6-warp specialization.** Warps 03 handle softmax + correction + epilogue. Warp 4 is the MMA warp (QK + PV). Warp 5 is the TMA warp (Q/K/V loads, output store via pipeline).
**P staging — two paths.**
- **TMEM-P** (hd ≤ 64): P stored to TMEM via register bridge (FP32 backing + BF16 view). PV reads P from TMEM. Used at the small head dims where QK C-fragment and PV A-fragment TMEM layouts agree.
- **SMEM-P** (hd > 64): P written to SMEM via coordinate-indexed store using `tTMEM_LOADcS` to map register indices to `(m, k)` then into `sP`'s subtile layout. PV reads P from SMEM with `OperandSource.SMEM`. Required because the QK ↔ PV TMEM layout disagreement at hd > 64 corrupts the round-trip.
**Un-normalized O + LSE output.** The kernel emits raw `sum(P · V)` and `lse = ln(row_sum) + row_max · ln(2)`. External code (or the next kernel pass) divides. This composes — D5 merge, multi-tile rescale, and the inverse-RoPE → wo_a fuse all rely on it.
**Per-head launch for multi-head.** Python loop dispatches the single-CTA kernel once per head. Multi-CTA grid using `flat_divide` + `tma_partition` is the next refactor; the path is unblocked once the correction-epilog rewrite lands.
**Head-packed M dimension for decode.** Q reshaped to `(n_h * T, hd, 1)`, all heads' rows packed into the 128-row M tile. Per-row softmax. At Pro decode (T=1, n_h=128) the M tile fits exactly.
**K-dim sub-tiling at hd > 256.** When `head_dim > 256` (MMA instruction K-dim limit), Q and K split into `n_k_sub_tiles = head_dim / 256` chunks along head_dim. QK accumulates in TMEM across sub-tiles (additive in logit space). The PV path uses `pv_n_tile = 128` for hd > 256 to keep sV+sC within the 232 KB SMEM budget.
**Sink bias as logit modification.** D3 (SWA length mask), D4 (causal mask on SWA), and D5c (attention sink) all live in the same post-QK, pre-softmax in-register code. They read `tTMEM_LOADcS` to get `(m, k)` coordinates and modify `tTMEM_LOADrS` before the row-max reduction. The sink bias is added in the raw-logit domain as `attn_sink / scale_softmax`, then the existing `* scale_log2` multiply converts to log2 space.
### MoE kernel (FusedSwiGLUScaledGroupedGemmKernel)
**7-warp specialization.** Warps 03 epilogue (TMEM → registers → SMEM → GMEM with global scale, SwiGLU, clamp). Warp 4 MMA (`tcgen05.mma.block_scale` with SFA/SFB in TMEM). Warp 5 TMA load (A, B, SFA, SFB). Warp 6 scheduler (`MoEStaticPersistentTileScheduler`).
**One-way TMEM → registers → SMEM → GMEM epilogue.** Uses `epilogue_tmem_copy_and_partition` + `epilogue_smem_copy_and_partition` (CUTLASS helpers, paired atoms). The SwiGLU + clamping math runs in registers between the t2r and r2s copies. No TMEM round-trip. This is the same pattern FMHA needs to adopt to fix the D1.5 blocker.
**Subtile-level gate/up pairing.** With granularity-8 interleaved L1 weights and `epi_tile_n=8`, even subtiles are gate and odd subtiles are up. `silu_gate_buf` register tensor carries the SiLU result across the subtile-pair boundary.
**`use_2cta_instrs` conditional** on `tokens_sum ≥ 256` and even `cluster_m`. Decode (small M) stays 1-CTA; prefill/batched gets 2-CTA UMMA with multicast B (1.71.9× throughput).
### Heterogeneous KV cache
- **State cache** per request: fixed-size block holding `(n_win SWA KV)` and `(uncompressed tail tokens awaiting compression)`. One block per request, lifetime managed by request scheduling.
- **Classical paged cache** per request: variable blocks holding `(k1 CSA compressed entries, k2 HCA compressed entries)` per layer. `k1 = lcm(m, m') / m = 32`, `k2 = lcm(m, m') / m' = 1`. Block covers 128 original tokens.
- Different layers can produce different KV cache sizes (CSA vs HCA vs SWA-only). The state cache + classical-pool split keeps PagedAttention-style alignment intact for the compressed pool.
### NVFP4 throughout
- **Weights**: NVFP4 (FP8 E4M3 scales, 16-element microblocks). Verified: `sf_dtype`, TMA element type, MMA kind (`mxf4nvf4`) all correct.
- **Activations**: BF16 today, FP4 after NVFP4-1.x epilogue fusion lands.
- **KV cache**: BF16 today; the FP8 (RoPE in BF16, NoPE in FP8) split per paper §2.3.4 is on the roadmap as NVFP4-2.
- **Indexer keys**: stored FP4 in the cache today, but scored with a scalar CUDA-core kernel. Tensor-core FP4 scoring (paper §5.2.1) is a Stage F priority.