diff --git a/single_shot_inference.py b/single_shot_inference.py index c8474ecb..f4f3d19b 100644 --- a/single_shot_inference.py +++ b/single_shot_inference.py @@ -455,6 +455,39 @@ class KVCache: This matches the DeepSeek V4 paper: "BF16 for RoPE dims, FP8 for remaining dims. This hybrid representation reduces the KV cache size by nearly half." + WHY NOT NVFP4 (native Blackwell FP4)? + ───────────────────────────────────── + We *really* wanted to use NVFP4 (E2M1 + E4M3 block scales + FP32 global scale) + for compressed KV storage. Blackwell's native FP4→MMA path would have given us + 3.5× memory savings and direct tensor-core consumption — the dream pipeline. + + We tried. Hard. Three separate approaches: + 1. Fused compressor_reduce_quant.cu — single-kernel compress→NVFP4. Bugs in + cross-warp block amax reduction and shared memory corruption (s_scratch + stomping adjacent variables). Best cos=0.703. Dead. + 2. Proven two-kernel path (amax_gsa → quantize_from_buffer) using kv_quantize.cu's + compute_amax_gsa_fp32 + quantize_nvfp4_from_fp32. cos=0.995 on random data, + but that's the *quantize/dequant* round-trip in isolation. In the full pipeline, + the 4-bit precision on 448 non-RoPE dimensions accumulated error across 61 layers + of mHC — residual |X| already grows to 300-500, and NVFP4's 16-element block + quantization (4.5 bits effective) added ~0.5% per layer on top of that. + 3. FP32 RoPE kernel (rope_fp32 in kv_quantize.cu) to avoid BF16 RoPE intermediate. + Had an indexing bug (cos=0.977 for M>1). Fixed but the real issue was NVFP4, + not RoPE. + + The verdict: NVFP4's 4.5 effective bits per element is simply too coarse for + compressed KV values that get summed in attention softmax. FP8_E4M3's 5.3 effective + bits gives cos=0.9997 round-trip (vs NVFP4's 0.995) — that 0.4% difference compounds + fatally across 61 layers. + + So we settled on FP8_E4M3 for non-RoPE + BF16 for RoPE — exactly what DeepSeek V4 + ships in production. Not because we couldn't build the NVFP4 path (we did, it compiled + and ran), but because the math didn't hold up. Sometimes 4 bits isn't enough. + + If Blackwell adds a finer-grained FP4 variant (8-element blocks, 6 effective bits), + revisit this. The kernels exist. The quantize/dequant path is proven. The precision + just isn't there yet for attention-sensitive KV values. + Storage per compressed entry at hd=512: nope (448) × FP8 = 448 bytes + 4 bytes (scale) = 452 rope (64) × BF16 = 128 bytes