Add detailed comment: why compressed KV uses FP8 not NVFP4
We tried NVFP4 (Blackwell native FP4→MMA) three different ways. A round-trip cos=0.995 looks fine in isolation, but the error from 4.5 effective bits compounds fatally across 61 layers of mHC. FP8_E4M3's 5.3 effective bits give a round-trip cos=0.9997; that 0.4% difference is the margin between working and broken. The kernels exist and the path is proven, but the precision isn't there.
@@ -455,6 +455,39 @@ class KVCache:
This matches the DeepSeek V4 paper: "BF16 for RoPE dims, FP8 for remaining dims.
This hybrid representation reduces the KV cache size by nearly half."

WHY NOT NVFP4 (native Blackwell FP4)?
─────────────────────────────────────
We *really* wanted to use NVFP4 (E2M1 + E4M3 block scales + FP32 global scale)
for compressed KV storage. Blackwell's native FP4→MMA path would have given us
3.5× memory savings relative to BF16 and direct tensor-core consumption: the
dream pipeline.

We tried. Hard. Three separate approaches:
1. Fused compressor_reduce_quant.cu: single-kernel compress→NVFP4. Bugs in the
   cross-warp block amax reduction and shared-memory corruption (s_scratch
   stomping adjacent variables). Best cos=0.703. Dead.
2. Proven two-kernel path (amax_gsa → quantize_from_buffer) using kv_quantize.cu's
   compute_amax_gsa_fp32 + quantize_nvfp4_from_fp32. cos=0.995 on random data,
   but that is only the *quantize/dequant* round-trip in isolation. In the full
   pipeline, 4-bit precision on the 448 non-RoPE dimensions accumulated error
   across 61 layers of mHC: the residual |X| already grows to 300-500, and NVFP4's
   16-element block quantization (4.5 bits effective) added ~0.5% per layer on top.
3. FP32 RoPE kernel (rope_fp32 in kv_quantize.cu) to avoid the BF16 RoPE
   intermediate. Had an indexing bug (cos=0.977 for M>1). Fixed, but the real
   issue was NVFP4, not RoPE.
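To make the "round-trip in isolation" measurement concrete, here is a minimal pure-Python sketch of a block-scaled E2M1 quantize/dequantize followed by a cosine check on Gaussian random data. It is an illustration, not the code in kv_quantize.cu: the helper names (`fake_nvfp4_roundtrip`, `cosine`) are hypothetical, the block scales are kept as Python floats instead of being rounded to E4M3, and the FP32 global scale is omitted.

```python
import math
import random

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # NVFP4 shares one scale across 16 elements


def fake_nvfp4_roundtrip(values):
    """Quantize then dequantize with per-block amax scaling onto the E2M1 grid.

    Simplifications (assumptions of this sketch): scales stay in full
    precision rather than E4M3, and there is no global scale.
    """
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        # Map the block's largest magnitude onto the largest grid value (6.0).
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for v in block:
            # Round the scaled magnitude to the nearest grid point.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(math.copysign(mag * scale, v))
    return out


def cosine(a, b):
    """Cosine similarity between two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
# 64 fake entries of 448 non-RoPE dims each, drawn from a unit Gaussian.
x = [rng.gauss(0.0, 1.0) for _ in range(448 * 64)]
cos_rt = cosine(x, fake_nvfp4_roundtrip(x))
print(f"simulated NVFP4 round-trip cos = {cos_rt:.4f}")
```

A check like this only says how well a single quantize/dequant pass preserves direction on synthetic data. It says nothing about how the error behaves once it feeds through the full 61-layer pipeline, which is where approach 2 fell over.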

The verdict: NVFP4's 4.5 effective bits per element are simply too coarse for
compressed KV values that get summed in attention softmax. FP8_E4M3's 5.3 effective
bits give cos=0.9997 round-trip (vs NVFP4's 0.995), and that 0.4% difference
compounds fatally across 61 layers.
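A small worked calculation shows why a 0.4% gap in cosine is larger than it looks. If the quantization error e is assumed to be orthogonal to the signal x, then cos(x, x+e) = 1/sqrt(1 + r²) with r = |e|/|x|, so the two round-trip cosines correspond to relative errors of roughly 10% (NVFP4) versus roughly 2.5% (FP8_E4M3), about a 4× difference. The sketch below computes this, plus a toy compounding model in which 61 independent errors of equal relative size add in quadrature. Both the orthogonality assumption and the toy model are assumptions of this sketch, not measurements from the pipeline, and the function names are hypothetical.

```python
import math


def rel_error_from_cos(cos_sim):
    """Relative error |e|/|x| implied by cos(x, x+e), assuming e is orthogonal to x."""
    return math.sqrt(1.0 / cos_sim ** 2 - 1.0)


def cos_after_layers(cos_sim, layers):
    """Toy model: `layers` independent errors of equal relative size add in quadrature."""
    r_squared = rel_error_from_cos(cos_sim) ** 2
    return 1.0 / math.sqrt(1.0 + layers * r_squared)


for name, c in (("NVFP4", 0.995), ("FP8_E4M3", 0.9997)):
    print(
        f"{name}: round-trip cos={c}, "
        f"relative error={rel_error_from_cos(c):.4f}, "
        f"toy cos after 61 layers={cos_after_layers(c, 61):.4f}"
    )
```

Under this toy model alone, the NVFP4 figure degrades to a cosine of about 0.79 after 61 layers while the FP8_E4M3 figure stays near 0.98. The real pipeline is not this simple (the residual grows and errors are not independent), so treat the numbers as intuition for the direction of the effect only.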

So we settled on FP8_E4M3 for non-RoPE + BF16 for RoPE, exactly what DeepSeek V4
ships in production. Not because we couldn't build the NVFP4 path (we did; it compiled
and ran), but because the math didn't hold up. Sometimes 4 bits isn't enough.

If Blackwell adds a finer-grained FP4 variant (8-element blocks, 6 effective bits),
revisit this. The kernels exist. The quantize/dequant path is proven. The precision
just isn't there yet for attention-sensitive KV values.

Storage per compressed entry at hd=512:
  nope (448) × FP8  = 448 bytes + 4 bytes (scale) = 452 bytes
  rope (64)  × BF16 = 128 bytes
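The byte counts above can be checked with a few lines of arithmetic. The sketch below totals the FP8 + BF16 layout, compares it with an all-BF16 entry, and estimates what an NVFP4 layout would have cost. The NVFP4 figure is hypothetical: it assumes a 4-bit payload, one 1-byte E4M3 scale per 16-element block, and one 4-byte global scale per entry, which may not match how the abandoned kernels laid out their scales.

```python
HEAD_DIM = 512
ROPE_DIMS = 64
NOPE_DIMS = HEAD_DIM - ROPE_DIMS  # 448 non-RoPE dims


def fp8_bf16_bytes():
    """Layout from the comment: FP8 nope dims + one 4-byte scale, BF16 rope dims."""
    nope = NOPE_DIMS * 1 + 4
    rope = ROPE_DIMS * 2
    return nope + rope


def bf16_bytes():
    """Uncompressed baseline: every dim stored as BF16."""
    return HEAD_DIM * 2


def nvfp4_bf16_bytes(block=16):
    """Hypothetical NVFP4 layout (assumption of this sketch, see lead-in)."""
    payload = NOPE_DIMS // 2          # two 4-bit values per byte
    block_scales = NOPE_DIMS // block  # one E4M3 byte per block
    global_scale = 4                   # one FP32 scale per entry (assumed)
    rope = ROPE_DIMS * 2
    return payload + block_scales + global_scale + rope


print("FP8 + BF16 :", fp8_bf16_bytes(), "bytes")
print("all BF16   :", bf16_bytes(), "bytes")
print("NVFP4+BF16 :", nvfp4_bf16_bytes(), "bytes (hypothetical layout)")
print(f"FP8 + BF16 is {fp8_bf16_bytes() / bf16_bytes():.1%} of all-BF16")
```

The FP8 + BF16 entry comes to 580 bytes against 1024 bytes for all-BF16, which is consistent with the "nearly half" reduction quoted from the paper.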