nvfp4-megamoe-kernel/tests/unit/test_kv_quantize.py at master

Files

biondizzle edc8e7ee8d KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format)

Architecture matches paper: 'BF16 for RoPE dims, FP8 for remaining dims'
- Non-RoPE dims (448 of 512): FP8_E4M3 storage → dequant to BF16 for FMHA
- RoPE dims (64 of 512): BF16 storage (RoPE applied directly, no conversion)
- Indexer keys: FP8_E4M3 (ihd=128, no RoPE)
- SWA: BF16 (unchanged)

Pipeline:
  Compressor → FP32 → split → [nope: FP32→FP8] + [rope: FP32→BF16→RoPE]
  Gather: [nope: FP8→BF16] + [rope: BF16] → concat → FMHA

No BF16 intermediate for non-RoPE data.
No FP32 intermediate after BF16 RoPE.
BF16 is the final format consumed by FMHA (no further conversion).

KVCache rewritten:
- comp_nope_fp8/scale: FP8 storage for non-RoPE
- comp_rope_bf16: BF16 storage for RoPE
- comp_nope_selective/all: FP8→BF16 dequant
- comp_rope_selective/all: BF16 gather
- set_compressed_mixed: write mixed format
- set_indexer_keys_fp8: write FP8 indexer keys

2026-06-02 10:08:43 +00:00

5.3 KiB

Raw Permalink Blame History

View Raw

5.3 KiB Raw Permalink Blame History

5.3 KiB

Raw Permalink Blame History