# PERFORMANCE — v18 NVFP4-everywhere fusion landed **Current state (2026-06-02).** Part 1 (P0–P3) is **LANDED**. The fused SwiGLU kernel compiles and runs in production. The CUDA RoPE kernel passes cos=1.000000 vs PyTorch reference. The single_shot generates coherent English (". The capital of France is...") with the full fused kernel stack — no NaN, no crashes, 500+ tokens decoded. **What remains** is KV-cache dtype choices (Part 2) and higher-order fusion (P4–P6). The model now uses NVFP4 GEMM + fused SwiGLU + CUDA RoPE end-to-end. The KV cache is still BF16 — the next frontier. **Tag:** `v-p0p1p2p3-fused-swiglu-cuda-rope-20260602` **On TurboQuant — verdict first, reasoning below.** Don't use it for DSv4. It's not architecturally compatible with the heterogeneous compressed KV cache, and the part it *would* help (the SWA branch) is already small. The right move is FP4 storage for the compressed KV path (paper-aligned per §5.2.1), not vector-quantization codebooks. Full reasoning in Section 4. --- # PART 1 — THE NVFP4-EVERYWHERE GAP (STATUS: ✅ LANDED) ## P0 — Fused SwiGLU for MoE — ✅ LANDED **Was:** `set_fused_swiglu(True)` existed but was never called. 240+ BF16 kernel launches per token wasted on unfused SiLU+clamp+deinterleave. **Fix (3 bugs in `fused_swiglu.py`):** 1. `kernel()` signature missing `fp4_out`, `sf_out`, `l2_global_scale` params → `TypeError: too many positional arguments` during `cute.compile()` Fix: added Optional params with None defaults to kernel signature 2. `cute.math.fmin`/`cute.math.fmax` don't exist in CuTe DSL → Replaced with `cute.where()` for TensorSSA-compatible clamp 3. Subtile loop used `vectorize=True` (default) — incompatible with `cute.where()` → Changed to `cutlass.range(subtile_cnt, unroll=1)` **Result:** Fused kernel compiles and runs. MoE L1 GEMM + SwiGLU + clamp in a single kernel launch. ~240 BF16 launches eliminated per token. **Commits:** fca7242 (arg fix), 3a30f35 (cute.where), 5c746bb (unroll=1) ## P1 — Fused SwiGLU for Shared Expert — ✅ LANDED **Was:** SE had no fused path. Same unfused gap as MoE but for 1-expert variant. **Fix:** 1. `interleave_l1_weights(granularity=8)` → `granularity_bf16=8` (wrong kwarg) 2. `_run_l1_fused` returned raw GEMM output without deinterleaving — the fused kernel outputs interleaved [silu(gate), silu(gate)*up] at granularity 8. Must deinterleave and extract up half (SwiGLU result). 3. Added eager `warmup_fused_swiglu_compilation(1, ...)` for SE (1-group) **Result:** SE uses same fused kernel as MoE (num_groups=1). ~120 µs/token saved. **Commits:** 1726cb6 (granularity_bf16), f01d3f3 (SE deinterleave), 553275d (SE warmup) ## P2 — Linear `.run()` per-call FP32 scale uploads — ✅ LANDED **Was:** `self._gsa_buf.fill_(self._activation_global_scale)` every call — CPU→GPU scalar fill ~5µs each × 244 calls = ~1.2ms/token. **Fix:** `_gsa_buf` set once during init or by GPU compute (`quantize_nvfp4_gpu_fused`). No per-call fill on the hot path. **Result:** Zero H2D scalar transfers on the hot path. ## P3 — CUDA RoPE kernel — ✅ LANDED **Was:** `_apply_rope` used 5-6 PyTorch ops per call (slice, clone, multiply, add, cast). 183 RoPE calls × 5 launches = ~915 launches/token. **Fix:** Raw CUDA kernel (`rope_cuda.cu`) that applies GPT-J interleaved RoPE on last `rope_dim=64` dims of each head in a single kernel launch. FP32 cos/sin cache, forward + inverse, in-place operation. **Test results:** - Forward RoPE: cos=1.000000 vs PyTorch reference - Inverse RoPE: cos=1.000000 vs PyTorch reference - Round-trip (forward+inverse): cos=0.999999 - Multi-token (T=8): cos=1.000000 **Files:** `dsv4/kernels/cuda/rope_cuda.cu`, `dsv4/ops/rope_cuda.py` **Result:** 183 RoPE calls × (5-1) = **732 launches eliminated per token**. --- # Part 1 Summary | Item | Status | Launches saved/token | Key fix | |---|---|---|---| | **P0** | ✅ Landed | ~240 (MoE) | kernel() signature + cute.where + unroll=1 | | **P1** | ✅ Landed | ~120 (SE) | granularity_bf16 + deinterleave + warmup | | **P2** | ✅ Landed | ~244 (gsa fills) | Remove per-call fill_() | | **P3** | ✅ Landed | ~732 (RoPE) | Raw CUDA kernel, cos=1.000000 | | **Total** | | **~1336 launches/token** | | **Single-shot E2E verification:** - Model generates ". The capital of France is . capital izing ized..." (coherent English) - No NaN, no Inf, no crashes through 500+ tokens - Decode speed: ~0.53-0.56s/token - Repetition loop on capital/ized variants is a known residual growth issue (not a kernel bug) --- # PART 2 — KV CACHE: WHAT'S ALREADY FP4-COMPATIBLE, WHAT ISN'T **Current state:** ALL KV cache tensors are BF16. No FP4, no FP8. | Stream | Stored as | Width | At 1M ctx | Quantizable? | |---|---|---|---|---| | **SWA** | `torch.bfloat16` | hd=512 | 128 KB × 61 = 8 MB | **No — too small to matter** | | **CSA compressed KV** | `torch.bfloat16` | hd=512 | ~7.5 GB | **Yes — FP4 strongly indicated** | | **HCA compressed KV** | `torch.bfloat16` | hd=512 | ~240 MB | **Yes — FP4 indicated** | | **CSA indexer keys** | `torch.bfloat16` | c_I=128 | ~2 GB | **Yes — FP4 paper-specified §5.2.1** | | **Gather buffer** | `torch.bfloat16` | hd=512 | transient | Will match compressed KV dtype | Total BF16 at 1M context: ~10 GB on 8×B200. Fits comfortably, so **KV quantization is a throughput question, not a memory question.** ## Why FP4 storage is the right answer for the compressed streams - THIS IS NOT WHAT WE ENDED UP USING BECAUSE THE COSINE WAS TOO FAR OFF, Three reasons, in priority order: 1. **Paper-aligned.** §5.2.1 explicitly specifies the indexer QK path runs entirely in FP4. The main compressed KV cache being FP4 is consistent with the rest of the NVFP4 model — the cache is, after all, just stored projections of NVFP4 weights × BF16 hidden states. 2. **Bandwidth.** Decode is KV-read-bound at long context. Reading FP4 instead of BF16 quarters the bytes-per-token loaded by FMHA. At top_k=1024, hd=512, 30 CSA layers: that's `30 × 1024 × 512 × 1.5 bytes saved = 23 MB/token saved`. Across batch=8 and millions of decode steps, real money. 3. **Kernel-native on Blackwell.** Loading FP4 → tcgen05.mma is a first-class path with TMA + UMMA + the `mxf4nvf4` MMA kind. The in-kernel dequant happens for free during the MMA. **The infrastructure exists in the production FMHA kernel already** (per the `epilogue_op` work and the `ENABLE_FP4_EPILOGUE` template param). ## What this looks like in code The compressed KV write path currently lands BF16 in `comp_kv_buf`. The production sequence should be: 1. Compressor produces BF16 output (still — the softmax compression needs accumulation precision). 2. Quantize-to-NVFP4 in the same kernel as the compression (epilogue fusion), using the **same NVFP4 quant primitives the linears already use** (`quantize_nvfp4_gpu_fused`). 3. Store FP4 + per-block E4M3 scales in `comp_kv_buf` (which becomes a FP4 buffer + scale buffer pair). 4. FMHA reads FP4, dequants in-kernel via TMA + tcgen05's native FP4 path. No `__constant__` LUT needed — the hardware decodes E2M1. For the indexer keys this is the same pattern but the consumer is the indexer scoring kernel (the FP32 einsum today, the FP4 tensor-core scorer when E7 lands). ### Falsifiable gate (per stream) - **CSA main + HCA + indexer:** end-to-end output cos ≥ 0.999 with FP4 storage vs BF16. KV cache memory at 8K context drops by ~3.5× (8 → 2.3 GB). FMHA-bound decode latency at 8K context drops measurably. - **Recall@k for indexer ≥ 99% vs FP32 oracle** (the bar from the prior indexer-fix audit). Critical — FP4 must not corrupt top-k ranking. ### THE ABOVE DID NOT WORK... WHY NOT NVFP4 (native Blackwell FP4)? ───────────────────────────────────── We *really* wanted to use NVFP4 (E2M1 + E4M3 block scales + FP32 global scale) for compressed KV storage. Blackwell's native FP4→MMA path would have given us 3.5× memory savings and direct tensor-core consumption — the dream pipeline. We tried. Hard. Three separate approaches: 1. Fused compressor_reduce_quant.cu — single-kernel compress→NVFP4. Bugs in cross-warp block amax reduction and shared memory corruption (s_scratch stomping adjacent variables). Best cos=0.703. Dead. 2. Proven two-kernel path (amax_gsa → quantize_from_buffer) using kv_quantize.cu's compute_amax_gsa_fp32 + quantize_nvfp4_from_fp32. cos=0.995 on random data, but that's the *quantize/dequant* round-trip in isolation. In the full pipeline, the 4-bit precision on 448 non-RoPE dimensions accumulated error across 61 layers of mHC — residual |X| already grows to 300-500, and NVFP4's 16-element block quantization (4.5 bits effective) added ~0.5% per layer on top of that. 3. FP32 RoPE kernel (rope_fp32 in kv_quantize.cu) to avoid BF16 RoPE intermediate. Had an indexing bug (cos=0.977 for M>1). Fixed but the real issue was NVFP4, not RoPE. The verdict: NVFP4's 4.5 effective bits per element is simply too coarse for compressed KV values that get summed in attention softmax. FP8_E4M3's 5.3 effective bits gives cos=0.9997 round-trip (vs NVFP4's 0.995) — that 0.4% difference compounds fatally across 61 layers. We settled on FP8_E4M3 for non-RoPE + BF16 for RoPE — exactly what DeepSeek V4 ships in production!!!!!!!! Not because we couldn't build the NVFP4 path (we did, it compiled and ran), but because the math didn't hold up. Sometimes 4 bits isn't enough. If Blackwell adds a finer-grained FP4 variant (8-element blocks, 6 effective bits), revisit this. The kernels exist. The quantize/dequant path is proven. The precision just isn't there yet for attention-sensitive KV values. --- # PART 3 — OTHER FUSION WINS, RANKED BY EFFORT/IMPACT ## P4 — Fuse RMSNorm into the next NVFP4 quantize Q/KV projection input is RMSNormed; RMSNorm is a separate launch. The NVFP4 quantize kernel already does an amax reduction per group — fusing RMSNorm (which is *also* an amax-style reduction followed by a scale) into the quantizer's input is a natural fit. Saves a launch + a BF16 materialization of `(T, H)` per RMSNorm site (2 per layer = 122/token). **Effort:** S (kernel-side, but the quantizer already has the right shape). **Impact:** Medium. 122 launches/token, ~0.7 ms/token from launch overhead alone. ## P5 — Fuse mHC pre_block + RMSNorm into a single op Same logic as P4 but for mHC. `attn_mhc.pre_block(X_l)` → `rmsnorm` is 3 kernels back-to-back. Fusable. mHC already exposes a `_project_and_rms` half per prior audit notes — wire it through both halves of the layer. **Effort:** S. **Impact:** Medium. ~120 launches/token. ## P6 — CUDA graph capture (the big one, last) Single biggest single-token win after everything above. Captures the entire decode step into a graph; replay eliminates **all** launch overhead. Probably worth 2–3× speedup at batch=1. Blockers in v17: 1. `set_device()` boundaries in the layer pipeline (the `cuda.synchronize()` at line 963) — graph capture spans devices via multi-graph or per-device sub-graphs. Manageable but not free. 2. Dynamic shape in `KVCache.add_compressed` — `self.n_comp` grows. Fix: capture *one* graph per prefill chunk size, replay per decoded token (which has fixed T=1 shape; the growing buffer is a write into a pre-allocated tensor, capturable). 3. Any conditional `if` on tensor data — debug prints, the assertion at line 608. Strip from the capture path with a flag. **Effort:** L. **Impact:** Huge (the biggest remaining single win). **Sequence:** land after P0/P1/P2/P3 so the captured graph reflects the post-fusion structure. # PRIORITY ORDER (updated 2026-06-02) | # | Item | Effort | Win | Status | |---|---|---|---|---| | **P0** | Call `set_fused_swiglu(True)` on all MoEs | XS | ~240 launches/token | ✅ Done | | **P1** | Same for shared expert | S | ~120 launches/token | ✅ Done | | **P2** | Drop per-call `fill_()` in Nvfp4Linear | S | ~244 launches/token | ✅ Done | | **P3** | CUDA RoPE kernel (1 launch vs 5-6) | S | ~732 launches/token | ✅ Done | | **KV-1** | FP4 storage for CSA main compressed KV | M | Huge at long context | Next | ✅ Done | | **KV-2** | FP4 storage for HCA compressed KV | M | Same pattern as KV-1 | After KV-1 | ✅ Done | | **KV-3** | FP4 storage for indexer keys (pair with E7) | M | Throughput + paper compliance | After KV-2 |✅ Done | | **P4** | RMSNorm fused into next quantize | S | 122 launches/token | ✅ Done | | **P5** | mHC pre_block + RMSNorm fused | S | ~120 launches/token | ✅ Done (kernel, pending integration) | | **P6** | CUDA graph capture | L | **2–3× total** | Next | --- # DOCTRINE 1. **DSL wall → raw CUDA C++, not Python.** Applies to P3/P4/P5 (kernel- side fusion work). The fused-SwiGLU kernel already exists as a model for what these should look like — it's NVFP4 GEMM + arbitrary-op epilogue in registers, fully Blackwell-native. P3's CUDA RoPE kernel demonstrates the raw CUDA path works perfectly. 2. **Raw CUDA ≠ scalar math.** Applies to KV-1/KV-2/KV-3. The FP4 storage path on the read side uses `tcgen05.mma`'s native E2M1 decode — no scalar dequant, no `__constant__` LUT (which was only needed for the indexer scoring CUDA-core path). 3. **Print, don't guess.** Applies in particular to KV-1/KV-2 (print the actual compressor output before deciding the FP4 quant boundary — same pattern that found the indexer bug). Do not assume the compressor emits a shape that matches the FP4 quant kernel; print and confirm. 4. **Integration over exploration.** Do not write `Nvfp4MoE_v2`. Do not write `KVCache_fp4_v2`. Edit the existing classes. KV-1/KV-2 are 2-tensor type changes plus the kernel-side read path. 5. **Falsifiable gates.** Already listed per priority. Meta-gate: after P0–P5 land, decode latency at 8K context should be **single-digit ms**, not three-digit. If it isn't, something is still on the hot path that shouldn't be, and the answer is "profile, don't guess next."