Add detailed comment: why compressed KV uses FP8 not NVFP4

We tried NVFP4 (Blackwell native FP4→MMA) three different ways. A quantize/dequant round-trip of cos=0.995 looks fine in isolation, but the error from 4.5 effective bits compounds fatally across 61 layers of mHC. FP8_E4M3's 5.3 effective bits gives cos=0.9997; that gap of roughly 0.5% in cosine similarity is the margin between working and broken. The kernels exist and the quantize/dequant path is proven, but the precision isn't there.
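
For anyone who wants to reproduce the isolated round-trip comparison on a CPU, here is a minimal NumPy sketch. It is an assumption-laden software model, not the kernels in kv_quantize.cu: the helper names (round_to_e4m3, round_to_e2m1, fp8_e4m3_roundtrip, nvfp4_roundtrip) are hypothetical, the input is Gaussian random data rather than real compressed KV, rounding ties are handled more simply than hardware would, and the FP8 path uses one scale per 448-dim entry to mirror the storage layout described in the docstring.

```python
import numpy as np

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
E2M1_MAX = 6.0    # largest E2M1 magnitude


def round_to_e4m3(x):
    """Round to the nearest E4M3 value (3 mantissa bits, min normal 2**-6)."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mag = np.abs(x)
    # Exponent of each value; clamping at -6 gives the subnormal step 2**-9.
    exp = np.floor(np.log2(np.maximum(mag, 2.0 ** -6)))
    step = 2.0 ** (exp - 3)
    return np.sign(x) * np.round(mag / step) * step


def round_to_e2m1(x):
    """Round to the nearest E2M1 value (ties go to the smaller magnitude)."""
    mag = np.minimum(np.abs(x), E2M1_MAX)
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]


def fp8_e4m3_roundtrip(x):
    """Quantize/dequantize with one FP32 scale per row (rows must be nonzero)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / E4M3_MAX
    return round_to_e4m3(x / scale) * scale


def nvfp4_roundtrip(x, block=16):
    """Simplified NVFP4 model: E2M1 values, E4M3 block scales, FP32 global scale."""
    rows, dim = x.shape
    blocks = x.reshape(rows, dim // block, block)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    # Global scale maps the largest block scale onto the top of the E4M3 range.
    global_scale = amax.max() / (E2M1_MAX * E4M3_MAX)
    block_scale = round_to_e4m3(amax / E2M1_MAX / global_scale) * global_scale
    block_scale = np.where(block_scale == 0.0, 1.0, block_scale)  # guard empty blocks
    return (round_to_e2m1(blocks / block_scale) * block_scale).reshape(rows, dim)


def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
x = rng.standard_normal((256, 448))  # 448 non-RoPE dims per entry

cos_fp8 = cosine(x, fp8_e4m3_roundtrip(x))
cos_fp4 = cosine(x, nvfp4_roundtrip(x))
print(f"FP8_E4M3 round-trip cos = {cos_fp8:.4f}")
print(f"NVFP4    round-trip cos = {cos_fp4:.4f}")
```

This only measures the single-step round-trip. It says nothing about how the error accumulates through 61 layers, which is where the NVFP4 path actually failed.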
2026-06-02 10:19:54 +00:00
parent edc8e7ee8d
commit 1f69f61363
1 changed file with 33 additions and 0 deletions
--- a/single_shot_inference.py
+++ b/single_shot_inference.py
@@ -455,6 +455,39 @@ class KVCache:
    This matches the DeepSeek V4 paper: "BF16 for RoPE dims, FP8 for remaining dims.
    This hybrid representation reduces the KV cache size by nearly half."

+    WHY NOT NVFP4 (native Blackwell FP4)?
+    ─────────────────────────────────────
+    We *really* wanted to use NVFP4 (E2M1 + E4M3 block scales + FP32 global scale)
+    for compressed KV storage. Blackwell's native FP4→MMA path would have given us
+    3.5× memory savings and direct tensor-core consumption — the dream pipeline.
+
+    We tried. Hard. Three separate approaches:
+      1. Fused compressor_reduce_quant.cu — single-kernel compress→NVFP4. Bugs in
+         cross-warp block amax reduction and shared memory corruption (s_scratch
+         stomping adjacent variables). Best cos=0.703. Dead.
+      2. Proven two-kernel path (amax_gsa → quantize_from_buffer) using kv_quantize.cu's
+         compute_amax_gsa_fp32 + quantize_nvfp4_from_fp32. cos=0.995 on random data,
+         but that's the *quantize/dequant* round-trip in isolation. In the full pipeline,
+         the 4-bit precision on 448 non-RoPE dimensions accumulated error across 61 layers
+         of mHC — residual |X| already grows to 300-500, and NVFP4's 16-element block
+         quantization (4.5 bits effective) added ~0.5% per layer on top of that.
+      3. FP32 RoPE kernel (rope_fp32 in kv_quantize.cu) to avoid BF16 RoPE intermediate.
+         Had an indexing bug (cos=0.977 for M>1). Fixed but the real issue was NVFP4,
+         not RoPE.
+
+    The verdict: NVFP4's 4.5 effective bits per element is simply too coarse for
+    compressed KV values that are weighted by the attention softmax and summed.
+    FP8_E4M3's 5.3 effective bits gives cos=0.9997 round-trip (vs NVFP4's 0.995), and
+    that ~0.5% gap in cosine similarity compounds fatally across 61 layers.
+
+    So we settled on FP8_E4M3 for non-RoPE + BF16 for RoPE — exactly what DeepSeek V4
+    ships in production. Not because we couldn't build the NVFP4 path (we did, it compiled
+    and ran), but because the math didn't hold up. Sometimes 4 bits isn't enough.
+
+    If Blackwell adds a finer-grained FP4 variant (8-element blocks, 6 effective bits),
+    revisit this. The kernels exist. The quantize/dequant path is proven. The precision
+    just isn't there yet for attention-sensitive KV values.
+
    Storage per compressed entry at hd=512:
      nope (448) × FP8 = 448 bytes + 4 bytes (scale) = 452
      rope (64) × BF16 = 128 bytes