Add detailed comment: why compressed KV uses FP8 not NVFP4

We tried NVFP4 (Blackwell native FP4→MMA) three different ways.
A round-trip cos=0.995 looks fine in isolation, but 4.5 effective bits
compound fatally across 61 layers of mHC. FP8_E4M3's 5.3 effective
bits give cos=0.9997; that ~0.5% gap (0.9997 vs 0.995) is the margin
between working and broken. The kernels exist and the path is proven,
but the precision isn't there.
2026-06-02 10:19:54 +00:00
parent edc8e7ee8d
commit 1f69f61363

@@ -455,6 +455,39 @@ class KVCache:
This matches the DeepSeek V4 paper: "BF16 for RoPE dims, FP8 for remaining dims.
This hybrid representation reduces the KV cache size by nearly half."
WHY NOT NVFP4 (native Blackwell FP4)?
─────────────────────────────────────
We *really* wanted to use NVFP4 (E2M1 + E4M3 block scales + FP32 global scale)
for compressed KV storage. Blackwell's native FP4→MMA path would have given us
3.5× memory savings and direct tensor-core consumption — the dream pipeline.
We tried. Hard. Three separate approaches:
1. Fused compressor_reduce_quant.cu: single-kernel compress→NVFP4. It had bugs in
   the cross-warp block amax reduction and shared-memory corruption (s_scratch
   stomping adjacent variables). Best cos=0.703. Dead.
2. Proven two-kernel path (amax_gsa → quantize_from_buffer) using kv_quantize.cu's
   compute_amax_gsa_fp32 + quantize_nvfp4_from_fp32. cos=0.995 on random data,
   but that is only the *quantize/dequant* round-trip in isolation. In the full
   pipeline, 4-bit precision on the 448 non-RoPE dimensions accumulated error across
   61 layers of mHC: the residual |X| already grows to 300-500, and NVFP4's
   16-element block quantization (4.5 bits effective) added ~0.5% per layer on top.
3. FP32 RoPE kernel (rope_fp32 in kv_quantize.cu) to avoid the BF16 RoPE intermediate.
   It had an indexing bug (cos=0.977 for M>1). We fixed it, but the real issue was
   NVFP4, not RoPE.
The verdict: NVFP4's 4.5 effective bits per element is simply too coarse for
compressed KV values that feed attention's softmax-weighted sums. FP8_E4M3's
5.3 effective bits gives cos=0.9997 round-trip (vs NVFP4's 0.995), and that ~0.5%
difference compounds fatally across 61 layers.
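A back-of-envelope sketch of where the 4.5-bit and 3.5× figures come from, and of
how a small per-layer error can grow over 61 layers. The helper name is
hypothetical, the 16-bit baseline is an assumption about what the saving was
measured against, and multiplicative compounding is an assumed model rather than
something measured here:

```python
# Illustrative arithmetic only; nothing here comes from kv_quantize.cu.

def block_scaled_bits(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Storage bits per element when each block shares one scale value."""
    return element_bits + scale_bits / block_size

# NVFP4: 4-bit E2M1 elements plus one 8-bit E4M3 scale per 16-element block.
# The per-tensor FP32 global scale is amortized to roughly zero and ignored.
nvfp4_bits = block_scaled_bits(element_bits=4, scale_bits=8, block_size=16)

# Assumption: the memory saving is quoted against a 16-bit (BF16) baseline.
savings_vs_bf16 = 16 / nvfp4_bits

# Assumption: a ~0.5% relative error per layer compounds multiplicatively.
per_layer_error = 0.005
layers = 61
compounded_growth = (1 + per_layer_error) ** layers

print(f"NVFP4 storage bits/element: {nvfp4_bits}")
print(f"saving vs 16-bit baseline: {savings_vs_bf16:.2f}x")
print(f"error growth over {layers} layers: {compounded_growth:.3f}x")
```

Under these assumptions the per-layer error does not stay at half a percent; it
grows by a noticeable factor by the last layer, which is the effect described above.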
So we settled on FP8_E4M3 for non-RoPE + BF16 for RoPE — exactly what DeepSeek V4
ships in production. Not because we couldn't build the NVFP4 path (we did, it compiled
and ran), but because the math didn't hold up. Sometimes 4 bits isn't enough.
If Blackwell adds a finer-grained FP4 variant (8-element blocks, 6 effective bits),
revisit this. The kernels exist. The quantize/dequant path is proven. The precision
just isn't there yet for attention-sensitive KV values.
Storage per compressed entry at hd=512:
  nope (448) × FP8  = 448 bytes + 4 bytes (scale) = 452 bytes
  rope (64)  × BF16 = 128 bytes
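The per-entry arithmetic above can be checked with a short sketch. The constant
names are illustrative, and the comparison against an all-BF16 entry is an
assumption about the baseline behind the paper's "nearly half" claim:

```python
# Per-entry storage at hd=512, using the split described above.
HEAD_DIM = 512
ROPE_DIMS = 64
NOPE_DIMS = HEAD_DIM - ROPE_DIMS   # 448 non-RoPE dims

FP8_BYTES = 1     # FP8_E4M3 element
BF16_BYTES = 2    # BF16 element
SCALE_BYTES = 4   # the 4-byte scale stored alongside the FP8 part

nope_bytes = NOPE_DIMS * FP8_BYTES + SCALE_BYTES
rope_bytes = ROPE_DIMS * BF16_BYTES
total_bytes = nope_bytes + rope_bytes

# Assumed baseline: the same entry stored entirely in BF16.
all_bf16_bytes = HEAD_DIM * BF16_BYTES
ratio = total_bytes / all_bf16_bytes

print(f"nope={nope_bytes} B, rope={rope_bytes} B, total={total_bytes} B")
print(f"hybrid / all-BF16 = {ratio:.3f}")
```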