diff --git a/single_shot_inference.py b/single_shot_inference.py
index c8474ecb..f4f3d19b 100644
--- a/single_shot_inference.py
+++ b/single_shot_inference.py
@@ -455,6 +455,39 @@ class KVCache:
     This matches the DeepSeek V4 paper: "BF16 for RoPE dims, FP8 for remaining dims.
     This hybrid representation reduces the KV cache size by nearly half."
 
+    WHY NOT NVFP4 (native Blackwell FP4)?
+    ─────────────────────────────────────
+    We *really* wanted to use NVFP4 (E2M1 + E4M3 block scales + FP32 global scale)
+    for compressed KV storage. Blackwell's native FP4→MMA path would have given us
+    ~3.5× memory savings over BF16 and direct tensor-core consumption: the dream pipeline.
+
+    We tried. Hard. Three separate approaches:
+      1. Fused compressor_reduce_quant.cu: single-kernel compress→NVFP4. Bugs in the
+         cross-warp block amax reduction, plus shared-memory corruption (s_scratch
+         overwriting adjacent variables). Best cosine similarity: cos=0.703. Dead.
+      2. Proven two-kernel path (amax_gsa → quantize_from_buffer) using kv_quantize.cu's
+         compute_amax_gsa_fp32 + quantize_nvfp4_from_fp32. cos=0.995 on random data,
+         but that's the *quantize/dequant* round-trip in isolation. In the full pipeline,
+         the 4-bit error on the 448 non-RoPE dimensions accumulated across 61 layers
+         of mHC: residual |X| already grows to 300-500, and NVFP4's 16-element block
+         quantization (4.5 bits effective) added ~0.5% error per layer on top of that.
+      3. FP32 RoPE kernel (rope_fp32 in kv_quantize.cu) to avoid the BF16 RoPE intermediate.
+         Had an indexing bug (cos=0.977 for M>1). Fixed, but the real issue was NVFP4,
+         not RoPE.
+
+    The verdict: NVFP4's 4.5 effective bits per element is simply too coarse for
+    compressed KV entries that are weighted and summed by the attention softmax.
+    FP8_E4M3's 5.3 effective bits gives cos=0.9997 round-trip (vs NVFP4's 0.995), and
+    that ~0.5% gap (0.9997 - 0.995 = 0.0047) compounds fatally across 61 layers.
+
+    So we settled on FP8_E4M3 for non-RoPE + BF16 for RoPE, the same hybrid the DeepSeek V4
+    paper describes. Not because we couldn't build the NVFP4 path (we did; it compiled
+    and ran), but because the math didn't hold up. Sometimes 4 bits isn't enough.
+
+    If Blackwell (or a successor) adds a finer-grained FP4 variant (e.g. 8-element blocks,
+    ~6 effective bits), revisit this. The kernels exist. The quantize/dequant round-trip
+    is proven in isolation. The precision just isn't there yet for attention-sensitive KV values.
+
     Storage per compressed entry at hd=512:
       nope (448) × FP8 = 448 bytes + 4 bytes (scale) = 452
       rope (64) × BF16 = 128 bytes