nvfp4-megamoe-kernel/PERFORMANCE_AUDIT.md

# PERFORMANCE — v18 NVFP4-everywhere fusion landed

**Current state (2026-06-02).** Part 1 (P0–P3) is **LANDED**. The fused
SwiGLU kernel compiles and runs in production. The CUDA RoPE kernel
passes cos=1.000000 vs PyTorch reference. The single_shot generates
coherent English (". The capital of France is...") with the full fused
kernel stack — no NaN, no crashes, 500+ tokens decoded.

**What remains** is KV-cache dtype choices (Part 2) and higher-order
fusion (P4–P6). The model now uses NVFP4 GEMM + fused SwiGLU + CUDA RoPE
end-to-end. The KV cache is still BF16 — the next frontier.

**Tag:** `v-p0p1p2p3-fused-swiglu-cuda-rope-20260602`

**On TurboQuant — verdict first, reasoning below.** Don't use it for DSv4.
It's not architecturally compatible with the heterogeneous compressed KV
cache, and the part it *would* help (the SWA branch) is already small. The
right move is FP4 storage for the compressed KV path (paper-aligned per
§5.2.1), not vector-quantization codebooks. Full reasoning in Section 4.

---

# PART 1 — THE NVFP4-EVERYWHERE GAP (STATUS: ✅ LANDED)

## P0 — Fused SwiGLU for MoE — ✅ LANDED

**Was:** `set_fused_swiglu(True)` existed but was never called. 240+ BF16
kernel launches per token wasted on unfused SiLU+clamp+deinterleave.

**Fix (3 bugs in `fused_swiglu.py`):**
1. `kernel()` signature missing `fp4_out`, `sf_out`, `l2_global_scale` params
   → `TypeError: too many positional arguments` during `cute.compile()`
   Fix: added Optional params with None defaults to kernel signature
2. `cute.math.fmin`/`cute.math.fmax` don't exist in CuTe DSL
   → Replaced with `cute.where()` for TensorSSA-compatible clamp
3. Subtile loop used `vectorize=True` (default) — incompatible with `cute.where()`
   → Changed to `cutlass.range(subtile_cnt, unroll=1)`

**Result:** Fused kernel compiles and runs. MoE L1 GEMM + SwiGLU + clamp
in a single kernel launch. ~240 BF16 launches eliminated per token.

**Commits:** fca7242 (arg fix), 3a30f35 (cute.where), 5c746bb (unroll=1)

## P1 — Fused SwiGLU for Shared Expert — ✅ LANDED

**Was:** SE had no fused path. Same unfused gap as MoE but for 1-expert variant.

**Fix:**
1. `interleave_l1_weights(granularity=8)` → `granularity_bf16=8` (wrong kwarg)
2. `_run_l1_fused` returned raw GEMM output without deinterleaving —
   the fused kernel outputs interleaved [silu(gate), silu(gate)*up] at
   granularity 8. Must deinterleave and extract up half (SwiGLU result).
3. Added eager `warmup_fused_swiglu_compilation(1, ...)` for SE (1-group)

**Result:** SE uses same fused kernel as MoE (num_groups=1). ~120 µs/token saved.

**Commits:** 1726cb6 (granularity_bf16), f01d3f3 (SE deinterleave), 553275d (SE warmup)

## P2 — Linear `.run()` per-call FP32 scale uploads — ✅ LANDED

**Was:** `self._gsa_buf.fill_(self._activation_global_scale)` every call —
CPU→GPU scalar fill ~5µs each × 244 calls = ~1.2ms/token.

**Fix:** `_gsa_buf` set once during init or by GPU compute (`quantize_nvfp4_gpu_fused`).
No per-call fill on the hot path.

**Result:** Zero H2D scalar transfers on the hot path.

## P3 — CUDA RoPE kernel — ✅ LANDED

**Was:** `_apply_rope` used 5-6 PyTorch ops per call (slice, clone, multiply, add, cast).
183 RoPE calls × 5 launches = ~915 launches/token.

**Fix:** Raw CUDA kernel (`rope_cuda.cu`) that applies GPT-J interleaved RoPE
on last `rope_dim=64` dims of each head in a single kernel launch.
FP32 cos/sin cache, forward + inverse, in-place operation.

**Test results:**
- Forward RoPE: cos=1.000000 vs PyTorch reference
- Inverse RoPE: cos=1.000000 vs PyTorch reference
- Round-trip (forward+inverse): cos=0.999999
- Multi-token (T=8): cos=1.000000

**Files:** `dsv4/kernels/cuda/rope_cuda.cu`, `dsv4/ops/rope_cuda.py`

**Result:** 183 RoPE calls × (5-1) = **732 launches eliminated per token**.

---

# Part 1 Summary

| Item | Status | Launches saved/token | Key fix |
|---|---|---|---|
| **P0** | ✅ Landed | ~240 (MoE) | kernel() signature + cute.where + unroll=1 |
| **P1** | ✅ Landed | ~120 (SE) | granularity_bf16 + deinterleave + warmup |
| **P2** | ✅ Landed | ~244 (gsa fills) | Remove per-call fill_() |
| **P3** | ✅ Landed | ~732 (RoPE) | Raw CUDA kernel, cos=1.000000 |
| **Total** | | **~1336 launches/token** | |

**Single-shot E2E verification:**
- Model generates ". The capital of France is . capital izing ized..." (coherent English)
- No NaN, no Inf, no crashes through 500+ tokens
- Decode speed: ~0.53-0.56s/token
- Repetition loop on capital/ized variants is a known residual growth issue (not a kernel bug)

---

# PART 2 — KV CACHE: WHAT'S ALREADY FP4-COMPATIBLE, WHAT ISN'T

**Current state:** ALL KV cache tensors are BF16. No FP4, no FP8.

| Stream | Stored as | Width | At 1M ctx | Quantizable? |
|---|---|---|---|---|
| **SWA** | `torch.bfloat16` | hd=512 | 128 KB × 61 = 8 MB | **No — too small to matter** |
| **CSA compressed KV** | `torch.bfloat16` | hd=512 | ~7.5 GB | **Yes — FP4 strongly indicated** |
| **HCA compressed KV** | `torch.bfloat16` | hd=512 | ~240 MB | **Yes — FP4 indicated** |
| **CSA indexer keys** | `torch.bfloat16` | c_I=128 | ~2 GB | **Yes — FP4 paper-specified §5.2.1** |
| **Gather buffer** | `torch.bfloat16` | hd=512 | transient | Will match compressed KV dtype |

Total BF16 at 1M context: ~10 GB on 8×B200. Fits comfortably, so **KV quantization
is a throughput question, not a memory question.**

## Why FP4 storage is the right answer for the compressed streams

Three reasons, in priority order:

1. **Paper-aligned.** §5.2.1 explicitly specifies the indexer QK path
   runs entirely in FP4. The main compressed KV cache being FP4 is
   consistent with the rest of the NVFP4 model — the cache is, after all,
   just stored projections of NVFP4 weights × BF16 hidden states.

2. **Bandwidth.** Decode is KV-read-bound at long context. Reading
   FP4 instead of BF16 quarters the bytes-per-token loaded by FMHA.
   At top_k=1024, hd=512, 30 CSA layers: that's `30 × 1024 × 512 × 1.5 bytes
   saved = 23 MB/token saved`. Across batch=8 and millions of decode
   steps, real money.

3. **Kernel-native on Blackwell.** Loading FP4 → tcgen05.mma is a
   first-class path with TMA + UMMA + the `mxf4nvf4` MMA kind. The
   in-kernel dequant happens for free during the MMA. **The infrastructure
   exists in the production FMHA kernel already** (per the
   `epilogue_op` work and the `ENABLE_FP4_EPILOGUE` template param).

## What this looks like in code

The compressed KV write path currently lands BF16 in `comp_kv_buf`. The
production sequence should be:

1. Compressor produces BF16 output (still — the softmax compression needs
   accumulation precision).
2. Quantize-to-NVFP4 in the same kernel as the compression (epilogue
   fusion), using the **same NVFP4 quant primitives the linears already
   use** (`quantize_nvfp4_gpu_fused`).
3. Store FP4 + per-block E4M3 scales in `comp_kv_buf` (which becomes a
   FP4 buffer + scale buffer pair).
4. FMHA reads FP4, dequants in-kernel via TMA + tcgen05's native FP4
   path. No `__constant__` LUT needed — the hardware decodes E2M1.

For the indexer keys this is the same pattern but the consumer is the
indexer scoring kernel (the FP32 einsum today, the FP4 tensor-core scorer
when E7 lands).

### Falsifiable gate (per stream)

- **CSA main + HCA + indexer:** end-to-end output cos ≥ 0.999 with FP4
  storage vs BF16. KV cache memory at 8K context drops by ~3.5× (8 → 2.3
  GB). FMHA-bound decode latency at 8K context drops measurably.
- **Recall@k for indexer ≥ 99% vs FP32 oracle** (the bar from the prior
  indexer-fix audit). Critical — FP4 must not corrupt top-k ranking.

---

# PART 3 — OTHER FUSION WINS, RANKED BY EFFORT/IMPACT

## P4 — Fuse RMSNorm into the next NVFP4 quantize

Q/KV projection input is RMSNormed; RMSNorm is a separate launch. The
NVFP4 quantize kernel already does an amax reduction per group — fusing
RMSNorm (which is *also* an amax-style reduction followed by a scale)
into the quantizer's input is a natural fit. Saves a launch + a BF16
materialization of `(T, H)` per RMSNorm site (2 per layer = 122/token).

**Effort:** S (kernel-side, but the quantizer already has the right shape).
**Impact:** Medium. 122 launches/token, ~0.7 ms/token from launch overhead alone.

## P5 — Fuse mHC pre_block + RMSNorm into a single op

Same logic as P4 but for mHC. `attn_mhc.pre_block(X_l)` → `rmsnorm` is 3
kernels back-to-back. Fusable. mHC already exposes a `_project_and_rms`
half per prior audit notes — wire it through both halves of the layer.

**Effort:** S. **Impact:** Medium. ~120 launches/token.

## P6 — CUDA graph capture (the big one, last)

Single biggest single-token win after everything above. Captures the entire
decode step into a graph; replay eliminates **all** launch overhead.
Probably worth 2–3× speedup at batch=1.

Blockers in v17:
1. `set_device()` boundaries in the layer pipeline (the `cuda.synchronize()`
   at line 963) — graph capture spans devices via multi-graph or
   per-device sub-graphs. Manageable but not free.
2. Dynamic shape in `KVCache.add_compressed` — `self.n_comp` grows.
   Fix: capture *one* graph per prefill chunk size, replay per
   decoded token (which has fixed T=1 shape; the growing buffer is
   a write into a pre-allocated tensor, capturable).
3. Any conditional `if` on tensor data — debug prints, the assertion at
   line 608. Strip from the capture path with a flag.

**Effort:** L. **Impact:** Huge (the biggest remaining single win).
**Sequence:** land after P0/P1/P2/P3 so the captured graph reflects the
post-fusion structure.

---

# PART 4 — TURBOQUANT: ARCHITECTURAL VERDICT

Reading `turboquant/`: this is an **ICLR 2026 paper implementation** of
vector-quantization KV compression. Two algorithms:
- MSE-quantize keys/values via codebook (3 bit by default)
- Inner-product-aware quantize keys (preserves dot products) via Algorithm 2
- Per-vector L2-norm preserved separately, plus QJL sign sketch for
  residual recovery

Operational shape:
- Operates on **standard MHA/GQA shape** `(..., n_heads, head_dim)`,
  head_dim typically 128.
- Requires a `head_dim × head_dim` rotation matrix per layer (precomputed
  from random seed, shared across heads).
- Has a Triton fused-decode kernel that computes attention scores directly
  from packed codebook indices.
- vLLM integration via `turboquant/vllm_attn_backend.py`.

## Why it doesn't fit DSv4

Three structural mismatches, in order of severity:

### 1. The DSv4 KV cache is already a learned compression

DSv4 doesn't store per-token KV. The CSA compressor's whole job is to
reduce m=4 tokens into 1 compressed entry via a softmax-weighted mix.
That entry is what gets cached. TurboQuant quantizes the *post-projection
per-token KV* of standard attention — exactly the thing DSv4 has
already replaced with a learned compressor. **You'd be applying a lossy
compression on top of an already-lossy compression**, which (a) compounds
loss in an uncontrolled way and (b) attacks the wrong dimension. The
compressed entries are already 4× (CSA) or 128× (HCA) reduced in the
sequence dimension; further reducing the *head dimension* via codebook
gives little additional savings (you're already attending over very few
entries per query) at high quality cost.

### 2. Wrong shape, wrong primitive

TurboQuant operates on `(..., n_heads, head_dim=128)` per-token vectors
and uses a `128×128` random rotation. DSv4's compressed cache is shape
`(n_comp, head_dim=512)` — no head dimension. The whole "rotate the head
dim" abstraction needs to be reworked, and once you do, you're writing
new code that isn't TurboQuant anymore.

For the indexer keys, the storage *is* per-block 128-dim, which is closer
to TurboQuant's natural shape. But the indexer's scoring math is
`ReLU(q·k) · w_h` summed across heads — TurboQuant's "preserve inner
products" guarantee from Algorithm 2 doesn't compose with the ReLU
nonlinearity. The quantization error becomes worst-case at the threshold,
which is where top-k decisions get made. **Bad fit precisely where it
matters most.**

### 3. NVFP4 hardware exists; TurboQuant is software-only

TurboQuant runs as bit-packed uint8 + Triton kernels. It can't use
tcgen05 FP4 tensor cores because its values aren't FP4 — they're
codebook *indices*. So you'd be paying CPU/GPU cycles to dequant via
gathers and per-token rotation matrix-vector multiplies, when the same
storage cost (4 bits/value) is available natively as FP4 with hardware
dequant during MMA.

The TurboQuant benchmark numbers (+3–5% throughput at 3-bit) are
real, but they're against `bf16_kv` baselines on architectures that
don't have FP4 tensor cores. On Blackwell with NVFP4, the comparison
should be FP4 storage + FP4 MMA — which is strictly better in every
axis (bandwidth, capacity, dequant cost).

## Where TurboQuant *would* help, and the verdict on whether it's worth it

The only DSv4 stream where TurboQuant's shape is a natural fit is the
**SWA branch** — uncompressed per-token KV in the sliding window, 128
tokens × `n_layers` × `hd=512` = 8 MB at 1M context.

**It's 8 MB.** Not worth a new dependency, a paper-grade extra failure
mode, or the rotation overhead. The SWA branch fits in L2 cache on B200.

### Verdict

Don't use TurboQuant. The right move for DSv4's KV cache is **FP4 storage
+ FP4 MMA on the compressed streams**, fully Blackwell-native, paper-
aligned (§5.2.1), with no codebook lookup overhead. The infrastructure to
do this is already in your kernel library (the `ENABLE_FP4_EPILOGUE`
template, the FP4 MMA path).

If you want a paper to cite for "what's the state-of-the-art KV
compression in 2026," TurboQuant is one. If you want the highest-perf
production-grade DSv4 implementation, native FP4 is the answer.

---

# PRIORITY ORDER (updated 2026-06-02)

| # | Item | Effort | Win | Status |
|---|---|---|---|---|
| **P0** | Call `set_fused_swiglu(True)` on all MoEs | XS | ~240 launches/token | ✅ Done |
| **P1** | Same for shared expert | S | ~120 launches/token | ✅ Done |
| **P2** | Drop per-call `fill_()` in Nvfp4Linear | S | ~244 launches/token | ✅ Done |
| **P3** | CUDA RoPE kernel (1 launch vs 5-6) | S | ~732 launches/token | ✅ Done |
| **KV-1** | FP4 storage for CSA main compressed KV | M | Huge at long context | Next |
| **KV-2** | FP4 storage for HCA compressed KV | M | Same pattern as KV-1 | After KV-1 |
| **KV-3** | FP4 storage for indexer keys (pair with E7) | M | Throughput + paper compliance | After KV-2 |
| **P4** | RMSNorm fused into next quantize | S | 122 launches/token | After KV |
| **P5** | mHC pre_block + RMSNorm fused | S | ~120 launches/token | After P4 |
| **P6** | CUDA graph capture | L | **2–3× total** | After everything above |

**Part 1 complete.** The NVFP4-everywhere gap for the GEMM+activation+RoPE
path is closed. The remaining wins are KV-cache dtype (Part 2) and
higher-order fusion (P4–P6). Land all of those before attempting CUDA
graphs — the captured graph should reflect the final fused structure, not
the pre-fusion one.

---

# DOCTRINE

1. **DSL wall → raw CUDA C++, not Python.** Applies to P3/P4/P5 (kernel-
   side fusion work). The fused-SwiGLU kernel already exists as a model
   for what these should look like — it's NVFP4 GEMM + arbitrary-op
   epilogue in registers, fully Blackwell-native. P3's CUDA RoPE kernel
   demonstrates the raw CUDA path works perfectly.

2. **Raw CUDA ≠ scalar math.** Applies to KV-1/KV-2/KV-3. The FP4
   storage path on the read side uses `tcgen05.mma`'s native E2M1 decode
   — no scalar dequant, no `__constant__` LUT (which was only needed
   for the indexer scoring CUDA-core path).

3. **Print, don't guess.** Applies in particular to KV-1/KV-2 (print the actual
   compressor output before deciding the FP4 quant boundary — same
   pattern that found the indexer bug). Do not assume the compressor
   emits a shape that matches the FP4 quant kernel; print and confirm.

4. **Integration over exploration.** Do not write `Nvfp4MoE_v2`. Do not
   write `KVCache_fp4_v2`. Edit the existing classes. KV-1/KV-2 are
   2-tensor type changes plus the kernel-side read path.

5. **Falsifiable gates.** Already listed per priority. Meta-gate: after
   P0–P5 land, decode latency at 8K context should be **single-digit
   ms**, not three-digit. If it isn't, something is still on the hot
   path that shouldn't be, and the answer is "profile, don't guess
   next."

6. **Don't optimize for problems you don't have.** TurboQuant is the
   cautionary tale here. The KV cache at 1M is 10 GB on 8 × B200 — that
   is not a problem that needs solving with a new dependency. The
   problem is throughput, and the right answer is FP4 storage + FP4 MMA,
   which is hardware-native and doesn't require codebook lookups.