nvfp4-megamoe-kernel/docs/PERFORMANCE_AUDIT.md

# PERFORMANCE — v18 NVFP4-everywhere fusion landed

**Current state (2026-06-02).** Part 1 (P0–P3) is **LANDED**. The fused
SwiGLU kernel compiles and runs in production. The CUDA RoPE kernel
passes cos=1.000000 vs PyTorch reference. The single_shot generates
coherent English (". The capital of France is...") with the full fused
kernel stack — no NaN, no crashes, 500+ tokens decoded.

**What remains** is KV-cache dtype choices (Part 2) and higher-order
fusion (P4–P6). The model now uses NVFP4 GEMM + fused SwiGLU + CUDA RoPE
end-to-end. The KV cache is still BF16 — the next frontier.

**Tag:** `v-p0p1p2p3-fused-swiglu-cuda-rope-20260602`

**On TurboQuant — verdict first, reasoning below.** Don't use it for DSv4.
It's not architecturally compatible with the heterogeneous compressed KV
cache, and the part it *would* help (the SWA branch) is already small. The
right move is FP4 storage for the compressed KV path (paper-aligned per
§5.2.1), not vector-quantization codebooks. Full reasoning in Section 4.

---

# PART 1 — THE NVFP4-EVERYWHERE GAP (STATUS: ✅ LANDED)

## P0 — Fused SwiGLU for MoE — ✅ LANDED

**Was:** `set_fused_swiglu(True)` existed but was never called. 240+ BF16
kernel launches per token wasted on unfused SiLU+clamp+deinterleave.

**Fix (3 bugs in `fused_swiglu.py`):**
1. `kernel()` signature missing `fp4_out`, `sf_out`, `l2_global_scale` params
   → `TypeError: too many positional arguments` during `cute.compile()`
   Fix: added Optional params with None defaults to kernel signature
2. `cute.math.fmin`/`cute.math.fmax` don't exist in CuTe DSL
   → Replaced with `cute.where()` for TensorSSA-compatible clamp
3. Subtile loop used `vectorize=True` (default) — incompatible with `cute.where()`
   → Changed to `cutlass.range(subtile_cnt, unroll=1)`

**Result:** Fused kernel compiles and runs. MoE L1 GEMM + SwiGLU + clamp
in a single kernel launch. ~240 BF16 launches eliminated per token.

**Commits:** fca7242 (arg fix), 3a30f35 (cute.where), 5c746bb (unroll=1)

## P1 — Fused SwiGLU for Shared Expert — ✅ LANDED

**Was:** SE had no fused path. Same unfused gap as MoE but for 1-expert variant.

**Fix:**
1. `interleave_l1_weights(granularity=8)` → `granularity_bf16=8` (wrong kwarg)
2. `_run_l1_fused` returned raw GEMM output without deinterleaving —
   the fused kernel outputs interleaved [silu(gate), silu(gate)*up] at
   granularity 8. Must deinterleave and extract up half (SwiGLU result).
3. Added eager `warmup_fused_swiglu_compilation(1, ...)` for SE (1-group)

**Result:** SE uses same fused kernel as MoE (num_groups=1). ~120 µs/token saved.

**Commits:** 1726cb6 (granularity_bf16), f01d3f3 (SE deinterleave), 553275d (SE warmup)

## P2 — Linear `.run()` per-call FP32 scale uploads — ✅ LANDED

**Was:** `self._gsa_buf.fill_(self._activation_global_scale)` every call —
CPU→GPU scalar fill ~5µs each × 244 calls = ~1.2ms/token.

**Fix:** `_gsa_buf` set once during init or by GPU compute (`quantize_nvfp4_gpu_fused`).
No per-call fill on the hot path.

**Result:** Zero H2D scalar transfers on the hot path.

## P3 — CUDA RoPE kernel — ✅ LANDED

**Was:** `_apply_rope` used 5-6 PyTorch ops per call (slice, clone, multiply, add, cast).
183 RoPE calls × 5 launches = ~915 launches/token.

**Fix:** Raw CUDA kernel (`rope_cuda.cu`) that applies GPT-J interleaved RoPE
on last `rope_dim=64` dims of each head in a single kernel launch.
FP32 cos/sin cache, forward + inverse, in-place operation.

**Test results:**
- Forward RoPE: cos=1.000000 vs PyTorch reference
- Inverse RoPE: cos=1.000000 vs PyTorch reference
- Round-trip (forward+inverse): cos=0.999999
- Multi-token (T=8): cos=1.000000

**Files:** `dsv4/kernels/cuda/rope_cuda.cu`, `dsv4/ops/rope_cuda.py`

**Result:** 183 RoPE calls × (5-1) = **732 launches eliminated per token**.

---

# Part 1 Summary

| Item | Status | Launches saved/token | Key fix |
|---|---|---|---|
| **P0** | ✅ Landed | ~240 (MoE) | kernel() signature + cute.where + unroll=1 |
| **P1** | ✅ Landed | ~120 (SE) | granularity_bf16 + deinterleave + warmup |
| **P2** | ✅ Landed | ~244 (gsa fills) | Remove per-call fill_() |
| **P3** | ✅ Landed | ~732 (RoPE) | Raw CUDA kernel, cos=1.000000 |
| **Total** | | **~1336 launches/token** | |

**Single-shot E2E verification:**
- Model generates ". The capital of France is . capital izing ized..." (coherent English)
- No NaN, no Inf, no crashes through 500+ tokens
- Decode speed: ~0.53-0.56s/token
- Repetition loop on capital/ized variants is a known residual growth issue (not a kernel bug)

---

# PART 2 — KV CACHE: WHAT'S ALREADY FP4-COMPATIBLE, WHAT ISN'T

**Current state:** ALL KV cache tensors are BF16. No FP4, no FP8.

| Stream | Stored as | Width | At 1M ctx | Quantizable? |
|---|---|---|---|---|
| **SWA** | `torch.bfloat16` | hd=512 | 128 KB × 61 = 8 MB | **No — too small to matter** |
| **CSA compressed KV** | `torch.bfloat16` | hd=512 | ~7.5 GB | **Yes — FP4 strongly indicated** |
| **HCA compressed KV** | `torch.bfloat16` | hd=512 | ~240 MB | **Yes — FP4 indicated** |
| **CSA indexer keys** | `torch.bfloat16` | c_I=128 | ~2 GB | **Yes — FP4 paper-specified §5.2.1** |
| **Gather buffer** | `torch.bfloat16` | hd=512 | transient | Will match compressed KV dtype |

Total BF16 at 1M context: ~10 GB on 8×B200. Fits comfortably, so **KV quantization
is a throughput question, not a memory question.**

## Why FP4 storage is the right answer for the compressed streams - THIS IS NOT WHAT WE ENDED UP USING BECAUSE THE COSINE WAS TOO FAR OFF,

Three reasons, in priority order:

1. **Paper-aligned.** §5.2.1 explicitly specifies the indexer QK path
   runs entirely in FP4. The main compressed KV cache being FP4 is
   consistent with the rest of the NVFP4 model — the cache is, after all,
   just stored projections of NVFP4 weights × BF16 hidden states.

2. **Bandwidth.** Decode is KV-read-bound at long context. Reading
   FP4 instead of BF16 quarters the bytes-per-token loaded by FMHA.
   At top_k=1024, hd=512, 30 CSA layers: that's `30 × 1024 × 512 × 1.5 bytes
   saved = 23 MB/token saved`. Across batch=8 and millions of decode
   steps, real money.

3. **Kernel-native on Blackwell.** Loading FP4 → tcgen05.mma is a
   first-class path with TMA + UMMA + the `mxf4nvf4` MMA kind. The
   in-kernel dequant happens for free during the MMA. **The infrastructure
   exists in the production FMHA kernel already** (per the
   `epilogue_op` work and the `ENABLE_FP4_EPILOGUE` template param).

## What this looks like in code

The compressed KV write path currently lands BF16 in `comp_kv_buf`. The
production sequence should be:

1. Compressor produces BF16 output (still — the softmax compression needs
   accumulation precision).
2. Quantize-to-NVFP4 in the same kernel as the compression (epilogue
   fusion), using the **same NVFP4 quant primitives the linears already
   use** (`quantize_nvfp4_gpu_fused`).
3. Store FP4 + per-block E4M3 scales in `comp_kv_buf` (which becomes a
   FP4 buffer + scale buffer pair).
4. FMHA reads FP4, dequants in-kernel via TMA + tcgen05's native FP4
   path. No `__constant__` LUT needed — the hardware decodes E2M1.

For the indexer keys this is the same pattern but the consumer is the
indexer scoring kernel (the FP32 einsum today, the FP4 tensor-core scorer
when E7 lands).

### Falsifiable gate (per stream)

- **CSA main + HCA + indexer:** end-to-end output cos ≥ 0.999 with FP4
  storage vs BF16. KV cache memory at 8K context drops by ~3.5× (8 → 2.3
  GB). FMHA-bound decode latency at 8K context drops measurably.
- **Recall@k for indexer ≥ 99% vs FP32 oracle** (the bar from the prior
  indexer-fix audit). Critical — FP4 must not corrupt top-k ranking.

### THE ABOVE DID NOT WORK... WHY NOT NVFP4 (native Blackwell FP4)?
    ─────────────────────────────────────
    We *really* wanted to use NVFP4 (E2M1 + E4M3 block scales + FP32 global scale)
    for compressed KV storage. Blackwell's native FP4→MMA path would have given us
    3.5× memory savings and direct tensor-core consumption — the dream pipeline.
    We tried. Hard. Three separate approaches:
      1. Fused compressor_reduce_quant.cu — single-kernel compress→NVFP4. Bugs in
         cross-warp block amax reduction and shared memory corruption (s_scratch
         stomping adjacent variables). Best cos=0.703. Dead.
      2. Proven two-kernel path (amax_gsa → quantize_from_buffer) using kv_quantize.cu's
         compute_amax_gsa_fp32 + quantize_nvfp4_from_fp32. cos=0.995 on random data,
         but that's the *quantize/dequant* round-trip in isolation. In the full pipeline,
         the 4-bit precision on 448 non-RoPE dimensions accumulated error across 61 layers
         of mHC — residual |X| already grows to 300-500, and NVFP4's 16-element block
         quantization (4.5 bits effective) added ~0.5% per layer on top of that.
      3. FP32 RoPE kernel (rope_fp32 in kv_quantize.cu) to avoid BF16 RoPE intermediate.
         Had an indexing bug (cos=0.977 for M>1). Fixed but the real issue was NVFP4,
         not RoPE.
    The verdict: NVFP4's 4.5 effective bits per element is simply too coarse for
    compressed KV values that get summed in attention softmax. FP8_E4M3's 5.3 effective
    bits gives cos=0.9997 round-trip (vs NVFP4's 0.995) — that 0.4% difference compounds
    fatally across 61 layers.


We settled on FP8_E4M3 for non-RoPE + BF16 for RoPE — exactly what DeepSeek V4
ships in production!!!!!!!! Not because we couldn't build the NVFP4 path (we did, it compiled
and ran), but because the math didn't hold up. Sometimes 4 bits isn't enough.
If Blackwell adds a finer-grained FP4 variant (8-element blocks, 6 effective bits),
revisit this. The kernels exist. The quantize/dequant path is proven. The precision
just isn't there yet for attention-sensitive KV values.

---

# PART 3 — OTHER FUSION WINS, RANKED BY EFFORT/IMPACT

## P4 — Fuse RMSNorm into the next NVFP4 quantize

Q/KV projection input is RMSNormed; RMSNorm is a separate launch. The
NVFP4 quantize kernel already does an amax reduction per group — fusing
RMSNorm (which is *also* an amax-style reduction followed by a scale)
into the quantizer's input is a natural fit. Saves a launch + a BF16
materialization of `(T, H)` per RMSNorm site (2 per layer = 122/token).

**Effort:** S (kernel-side, but the quantizer already has the right shape).
**Impact:** Medium. 122 launches/token, ~0.7 ms/token from launch overhead alone.

## P5 — Fuse mHC pre_block + RMSNorm into a single op

Same logic as P4 but for mHC. `attn_mhc.pre_block(X_l)` → `rmsnorm` is 3
kernels back-to-back. Fusable. mHC already exposes a `_project_and_rms`
half per prior audit notes — wire it through both halves of the layer.

**Effort:** S. **Impact:** Medium. ~120 launches/token.

## P6 — CUDA graph capture (the big one, last)

Single biggest single-token win after everything above. Captures the entire
decode step into a graph; replay eliminates **all** launch overhead.
Probably worth 2–3× speedup at batch=1.

Blockers in v17:
1. `set_device()` boundaries in the layer pipeline (the `cuda.synchronize()`
   at line 963) — graph capture spans devices via multi-graph or
   per-device sub-graphs. Manageable but not free.
2. Dynamic shape in `KVCache.add_compressed` — `self.n_comp` grows.
   Fix: capture *one* graph per prefill chunk size, replay per
   decoded token (which has fixed T=1 shape; the growing buffer is
   a write into a pre-allocated tensor, capturable).
3. Any conditional `if` on tensor data — debug prints, the assertion at
   line 608. Strip from the capture path with a flag.

**Effort:** L. **Impact:** Huge (the biggest remaining single win).
**Sequence:** land after P0/P1/P2/P3 so the captured graph reflects the
post-fusion structure.


# PRIORITY ORDER (updated 2026-06-02)

| # | Item | Effort | Win | Status |
|---|---|---|---|---|
| **P0** | Call `set_fused_swiglu(True)` on all MoEs | XS | ~240 launches/token | ✅ Done |
| **P1** | Same for shared expert | S | ~120 launches/token | ✅ Done |
| **P2** | Drop per-call `fill_()` in Nvfp4Linear | S | ~244 launches/token | ✅ Done |
| **P3** | CUDA RoPE kernel (1 launch vs 5-6) | S | ~732 launches/token | ✅ Done |
| **KV-1** | FP4 storage for CSA main compressed KV | M | Huge at long context | Next | ✅ Done |
| **KV-2** | FP4 storage for HCA compressed KV | M | Same pattern as KV-1 | After KV-1 | ✅ Done |
| **KV-3** | FP4 storage for indexer keys (pair with E7) | M | Throughput + paper compliance | After KV-2 |✅ Done |
| **P4** | RMSNorm fused into next quantize | S | 122 launches/token | ✅ Done |
| **P5** | mHC pre_block + RMSNorm fused | S | ~120 launches/token | ✅ Done (kernel, pending integration) |
| **P6** | CUDA graph capture | L | **2–3× total** | Next |


---

# DOCTRINE

1. **DSL wall → raw CUDA C++, not Python.** Applies to P3/P4/P5 (kernel-
   side fusion work). The fused-SwiGLU kernel already exists as a model
   for what these should look like — it's NVFP4 GEMM + arbitrary-op
   epilogue in registers, fully Blackwell-native. P3's CUDA RoPE kernel
   demonstrates the raw CUDA path works perfectly.

2. **Raw CUDA ≠ scalar math.** Applies to KV-1/KV-2/KV-3. The FP4
   storage path on the read side uses `tcgen05.mma`'s native E2M1 decode
   — no scalar dequant, no `__constant__` LUT (which was only needed
   for the indexer scoring CUDA-core path).

3. **Print, don't guess.** Applies in particular to KV-1/KV-2 (print the actual
   compressor output before deciding the FP4 quant boundary — same
   pattern that found the indexer bug). Do not assume the compressor
   emits a shape that matches the FP4 quant kernel; print and confirm.

4. **Integration over exploration.** Do not write `Nvfp4MoE_v2`. Do not
   write `KVCache_fp4_v2`. Edit the existing classes. KV-1/KV-2 are
   2-tensor type changes plus the kernel-side read path.

5. **Falsifiable gates.** Already listed per priority. Meta-gate: after
   P0–P5 land, decode latency at 8K context should be **single-digit
   ms**, not three-digit. If it isn't, something is still on the hot
   path that shouldn't be, and the answer is "profile, don't guess
   next."