- Move test_*.py → tests/integration/ - Move probe_*.py, dump_*.py → helpers/ - Move PERFORMANCE_AUDIT.md → docs/ - Move single_shot_PYTORCH_REFERENCE.py → dsv4/reference/ - Fix 3 import references in test_layer_comparison, test_mhc_comparison, test_compressor_position_bias - Add helpers/import_closure.py (dead-code detection tool)
14 KiB
PERFORMANCE — v18 NVFP4-everywhere fusion landed
Current state (2026-06-02). Part 1 (P0–P3) is LANDED. The fused SwiGLU kernel compiles and runs in production. The CUDA RoPE kernel passes cos=1.000000 vs PyTorch reference. The single_shot generates coherent English (". The capital of France is...") with the full fused kernel stack — no NaN, no crashes, 500+ tokens decoded.
What remains is KV-cache dtype choices (Part 2) and higher-order fusion (P4–P6). The model now uses NVFP4 GEMM + fused SwiGLU + CUDA RoPE end-to-end. The KV cache is still BF16 — the next frontier.
Tag: v-p0p1p2p3-fused-swiglu-cuda-rope-20260602
On TurboQuant — verdict first, reasoning below. Don't use it for DSv4. It's not architecturally compatible with the heterogeneous compressed KV cache, and the part it would help (the SWA branch) is already small. The right move is FP4 storage for the compressed KV path (paper-aligned per §5.2.1), not vector-quantization codebooks. Full reasoning in Section 4.
PART 1 — THE NVFP4-EVERYWHERE GAP (STATUS: ✅ LANDED)
P0 — Fused SwiGLU for MoE — ✅ LANDED
Was: set_fused_swiglu(True) existed but was never called. 240+ BF16
kernel launches per token wasted on unfused SiLU+clamp+deinterleave.
Fix (3 bugs in fused_swiglu.py):
kernel()signature missingfp4_out,sf_out,l2_global_scaleparams →TypeError: too many positional argumentsduringcute.compile()Fix: added Optional params with None defaults to kernel signaturecute.math.fmin/cute.math.fmaxdon't exist in CuTe DSL → Replaced withcute.where()for TensorSSA-compatible clamp- Subtile loop used
vectorize=True(default) — incompatible withcute.where()→ Changed tocutlass.range(subtile_cnt, unroll=1)
Result: Fused kernel compiles and runs. MoE L1 GEMM + SwiGLU + clamp in a single kernel launch. ~240 BF16 launches eliminated per token.
Commits: fca7242 (arg fix), 3a30f35 (cute.where), 5c746bb (unroll=1)
P1 — Fused SwiGLU for Shared Expert — ✅ LANDED
Was: SE had no fused path. Same unfused gap as MoE but for 1-expert variant.
Fix:
interleave_l1_weights(granularity=8)→granularity_bf16=8(wrong kwarg)_run_l1_fusedreturned raw GEMM output without deinterleaving — the fused kernel outputs interleaved [silu(gate), silu(gate)*up] at granularity 8. Must deinterleave and extract up half (SwiGLU result).- Added eager
warmup_fused_swiglu_compilation(1, ...)for SE (1-group)
Result: SE uses same fused kernel as MoE (num_groups=1). ~120 µs/token saved.
Commits: 1726cb6 (granularity_bf16), f01d3f3 (SE deinterleave), 553275d (SE warmup)
P2 — Linear .run() per-call FP32 scale uploads — ✅ LANDED
Was: self._gsa_buf.fill_(self._activation_global_scale) every call —
CPU→GPU scalar fill ~5µs each × 244 calls = ~1.2ms/token.
Fix: _gsa_buf set once during init or by GPU compute (quantize_nvfp4_gpu_fused).
No per-call fill on the hot path.
Result: Zero H2D scalar transfers on the hot path.
P3 — CUDA RoPE kernel — ✅ LANDED
Was: _apply_rope used 5-6 PyTorch ops per call (slice, clone, multiply, add, cast).
183 RoPE calls × 5 launches = ~915 launches/token.
Fix: Raw CUDA kernel (rope_cuda.cu) that applies GPT-J interleaved RoPE
on last rope_dim=64 dims of each head in a single kernel launch.
FP32 cos/sin cache, forward + inverse, in-place operation.
Test results:
- Forward RoPE: cos=1.000000 vs PyTorch reference
- Inverse RoPE: cos=1.000000 vs PyTorch reference
- Round-trip (forward+inverse): cos=0.999999
- Multi-token (T=8): cos=1.000000
Files: dsv4/kernels/cuda/rope_cuda.cu, dsv4/ops/rope_cuda.py
Result: 183 RoPE calls × (5-1) = 732 launches eliminated per token.
Part 1 Summary
| Item | Status | Launches saved/token | Key fix |
|---|---|---|---|
| P0 | ✅ Landed | ~240 (MoE) | kernel() signature + cute.where + unroll=1 |
| P1 | ✅ Landed | ~120 (SE) | granularity_bf16 + deinterleave + warmup |
| P2 | ✅ Landed | ~244 (gsa fills) | Remove per-call fill_() |
| P3 | ✅ Landed | ~732 (RoPE) | Raw CUDA kernel, cos=1.000000 |
| Total | ~1336 launches/token |
Single-shot E2E verification:
- Model generates ". The capital of France is . capital izing ized..." (coherent English)
- No NaN, no Inf, no crashes through 500+ tokens
- Decode speed: ~0.53-0.56s/token
- Repetition loop on capital/ized variants is a known residual growth issue (not a kernel bug)
PART 2 — KV CACHE: WHAT'S ALREADY FP4-COMPATIBLE, WHAT ISN'T
Current state: ALL KV cache tensors are BF16. No FP4, no FP8.
| Stream | Stored as | Width | At 1M ctx | Quantizable? |
|---|---|---|---|---|
| SWA | torch.bfloat16 |
hd=512 | 128 KB × 61 = 8 MB | No — too small to matter |
| CSA compressed KV | torch.bfloat16 |
hd=512 | ~7.5 GB | Yes — FP4 strongly indicated |
| HCA compressed KV | torch.bfloat16 |
hd=512 | ~240 MB | Yes — FP4 indicated |
| CSA indexer keys | torch.bfloat16 |
c_I=128 | ~2 GB | Yes — FP4 paper-specified §5.2.1 |
| Gather buffer | torch.bfloat16 |
hd=512 | transient | Will match compressed KV dtype |
Total BF16 at 1M context: ~10 GB on 8×B200. Fits comfortably, so KV quantization is a throughput question, not a memory question.
Why FP4 storage is the right answer for the compressed streams - THIS IS NOT WHAT WE ENDED UP USING BECAUSE THE COSINE WAS TOO FAR OFF,
Three reasons, in priority order:
-
Paper-aligned. §5.2.1 explicitly specifies the indexer QK path runs entirely in FP4. The main compressed KV cache being FP4 is consistent with the rest of the NVFP4 model — the cache is, after all, just stored projections of NVFP4 weights × BF16 hidden states.
-
Bandwidth. Decode is KV-read-bound at long context. Reading FP4 instead of BF16 quarters the bytes-per-token loaded by FMHA. At top_k=1024, hd=512, 30 CSA layers: that's
30 × 1024 × 512 × 1.5 bytes saved = 23 MB/token saved. Across batch=8 and millions of decode steps, real money. -
Kernel-native on Blackwell. Loading FP4 → tcgen05.mma is a first-class path with TMA + UMMA + the
mxf4nvf4MMA kind. The in-kernel dequant happens for free during the MMA. The infrastructure exists in the production FMHA kernel already (per theepilogue_opwork and theENABLE_FP4_EPILOGUEtemplate param).
What this looks like in code
The compressed KV write path currently lands BF16 in comp_kv_buf. The
production sequence should be:
- Compressor produces BF16 output (still — the softmax compression needs accumulation precision).
- Quantize-to-NVFP4 in the same kernel as the compression (epilogue
fusion), using the same NVFP4 quant primitives the linears already
use (
quantize_nvfp4_gpu_fused). - Store FP4 + per-block E4M3 scales in
comp_kv_buf(which becomes a FP4 buffer + scale buffer pair). - FMHA reads FP4, dequants in-kernel via TMA + tcgen05's native FP4
path. No
__constant__LUT needed — the hardware decodes E2M1.
For the indexer keys this is the same pattern but the consumer is the indexer scoring kernel (the FP32 einsum today, the FP4 tensor-core scorer when E7 lands).
Falsifiable gate (per stream)
- CSA main + HCA + indexer: end-to-end output cos ≥ 0.999 with FP4 storage vs BF16. KV cache memory at 8K context drops by ~3.5× (8 → 2.3 GB). FMHA-bound decode latency at 8K context drops measurably.
- Recall@k for indexer ≥ 99% vs FP32 oracle (the bar from the prior indexer-fix audit). Critical — FP4 must not corrupt top-k ranking.
THE ABOVE DID NOT WORK... WHY NOT NVFP4 (native Blackwell FP4)?
─────────────────────────────────────
We *really* wanted to use NVFP4 (E2M1 + E4M3 block scales + FP32 global scale)
for compressed KV storage. Blackwell's native FP4→MMA path would have given us
3.5× memory savings and direct tensor-core consumption — the dream pipeline.
We tried. Hard. Three separate approaches:
1. Fused compressor_reduce_quant.cu — single-kernel compress→NVFP4. Bugs in
cross-warp block amax reduction and shared memory corruption (s_scratch
stomping adjacent variables). Best cos=0.703. Dead.
2. Proven two-kernel path (amax_gsa → quantize_from_buffer) using kv_quantize.cu's
compute_amax_gsa_fp32 + quantize_nvfp4_from_fp32. cos=0.995 on random data,
but that's the *quantize/dequant* round-trip in isolation. In the full pipeline,
the 4-bit precision on 448 non-RoPE dimensions accumulated error across 61 layers
of mHC — residual |X| already grows to 300-500, and NVFP4's 16-element block
quantization (4.5 bits effective) added ~0.5% per layer on top of that.
3. FP32 RoPE kernel (rope_fp32 in kv_quantize.cu) to avoid BF16 RoPE intermediate.
Had an indexing bug (cos=0.977 for M>1). Fixed but the real issue was NVFP4,
not RoPE.
The verdict: NVFP4's 4.5 effective bits per element is simply too coarse for
compressed KV values that get summed in attention softmax. FP8_E4M3's 5.3 effective
bits gives cos=0.9997 round-trip (vs NVFP4's 0.995) — that 0.4% difference compounds
fatally across 61 layers.
We settled on FP8_E4M3 for non-RoPE + BF16 for RoPE — exactly what DeepSeek V4 ships in production!!!!!!!! Not because we couldn't build the NVFP4 path (we did, it compiled and ran), but because the math didn't hold up. Sometimes 4 bits isn't enough. If Blackwell adds a finer-grained FP4 variant (8-element blocks, 6 effective bits), revisit this. The kernels exist. The quantize/dequant path is proven. The precision just isn't there yet for attention-sensitive KV values.
PART 3 — OTHER FUSION WINS, RANKED BY EFFORT/IMPACT
P4 — Fuse RMSNorm into the next NVFP4 quantize
Q/KV projection input is RMSNormed; RMSNorm is a separate launch. The
NVFP4 quantize kernel already does an amax reduction per group — fusing
RMSNorm (which is also an amax-style reduction followed by a scale)
into the quantizer's input is a natural fit. Saves a launch + a BF16
materialization of (T, H) per RMSNorm site (2 per layer = 122/token).
Effort: S (kernel-side, but the quantizer already has the right shape). Impact: Medium. 122 launches/token, ~0.7 ms/token from launch overhead alone.
P5 — Fuse mHC pre_block + RMSNorm into a single op
Same logic as P4 but for mHC. attn_mhc.pre_block(X_l) → rmsnorm is 3
kernels back-to-back. Fusable. mHC already exposes a _project_and_rms
half per prior audit notes — wire it through both halves of the layer.
Effort: S. Impact: Medium. ~120 launches/token.
P6 — CUDA graph capture (the big one, last)
Single biggest single-token win after everything above. Captures the entire decode step into a graph; replay eliminates all launch overhead. Probably worth 2–3× speedup at batch=1.
Blockers in v17:
set_device()boundaries in the layer pipeline (thecuda.synchronize()at line 963) — graph capture spans devices via multi-graph or per-device sub-graphs. Manageable but not free.- Dynamic shape in
KVCache.add_compressed—self.n_compgrows. Fix: capture one graph per prefill chunk size, replay per decoded token (which has fixed T=1 shape; the growing buffer is a write into a pre-allocated tensor, capturable). - Any conditional
ifon tensor data — debug prints, the assertion at line 608. Strip from the capture path with a flag.
Effort: L. Impact: Huge (the biggest remaining single win). Sequence: land after P0/P1/P2/P3 so the captured graph reflects the post-fusion structure.
PRIORITY ORDER (updated 2026-06-02)
| # | Item | Effort | Win | Status |
|---|---|---|---|---|
| P0 | Call set_fused_swiglu(True) on all MoEs |
XS | ~240 launches/token | ✅ Done |
| P1 | Same for shared expert | S | ~120 launches/token | ✅ Done |
| P2 | Drop per-call fill_() in Nvfp4Linear |
S | ~244 launches/token | ✅ Done |
| P3 | CUDA RoPE kernel (1 launch vs 5-6) | S | ~732 launches/token | ✅ Done |
| KV-1 | FP4 storage for CSA main compressed KV | M | Huge at long context | Next |
| KV-2 | FP4 storage for HCA compressed KV | M | Same pattern as KV-1 | After KV-1 |
| KV-3 | FP4 storage for indexer keys (pair with E7) | M | Throughput + paper compliance | After KV-2 |
| P4 | RMSNorm fused into next quantize | S | 122 launches/token | ✅ Done |
| P5 | mHC pre_block + RMSNorm fused | S | ~120 launches/token | ✅ Done (kernel, pending integration) |
| P6 | CUDA graph capture | L | 2–3× total | Next |
DOCTRINE
-
DSL wall → raw CUDA C++, not Python. Applies to P3/P4/P5 (kernel- side fusion work). The fused-SwiGLU kernel already exists as a model for what these should look like — it's NVFP4 GEMM + arbitrary-op epilogue in registers, fully Blackwell-native. P3's CUDA RoPE kernel demonstrates the raw CUDA path works perfectly.
-
Raw CUDA ≠ scalar math. Applies to KV-1/KV-2/KV-3. The FP4 storage path on the read side uses
tcgen05.mma's native E2M1 decode — no scalar dequant, no__constant__LUT (which was only needed for the indexer scoring CUDA-core path). -
Print, don't guess. Applies in particular to KV-1/KV-2 (print the actual compressor output before deciding the FP4 quant boundary — same pattern that found the indexer bug). Do not assume the compressor emits a shape that matches the FP4 quant kernel; print and confirm.
-
Integration over exploration. Do not write
Nvfp4MoE_v2. Do not writeKVCache_fp4_v2. Edit the existing classes. KV-1/KV-2 are 2-tensor type changes plus the kernel-side read path. -
Falsifiable gates. Already listed per priority. Meta-gate: after P0–P5 land, decode latency at 8K context should be single-digit ms, not three-digit. If it isn't, something is still on the hot path that shouldn't be, and the answer is "profile, don't guess next."