biondizzle
7ef6402936
KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys
Architecture:
- Compressed KV: stored as NVFP4 (E2M1 + E4M3 + FP32 gsa)
  - Write path: compress→FP32 → FP32 RoPE → quantize FP32→NVFP4
  - Read path: dequant_nvfp4/dequant_nvfp4_selective → BF16 for FMHA
  - No BF16 intermediate in the write path
- Indexer keys: stored as FP8_E4M3 (1 byte + per-row scale)
  - Write path: compress→FP32 → quantize FP32→FP8_E4M3
  - Read path: dequant_fp8_e4m3 → BF16 for scoring
- SWA: remains BF16 (8 MB total, fits in L2)
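Reference sketch of the NVFP4 layout named above, in plain Python rather than CUDA. It assumes one common convention: E2M1 element codes, one E4M3 scale factor per block of 16 elements, and a per-row FP32 global scale (taken here to be what `gsa` denotes). The helper names, the block size, and the `amax / (6 * 448)` global-scale rule are assumptions for illustration, not taken from kv_quantize.cu.

```python
import math

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in E2M1
E4M3_MAX = 448.0                                      # largest finite E4M3 value
BLOCK = 16                                            # assumed elements per E4M3 scale factor


def round_e4m3(x):
    """Round a non-negative float to the nearest E4M3 value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    _, ex = math.frexp(x)          # x = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)            # clamp to the subnormal exponent
    step = 2.0 ** (e - 3)          # 3 mantissa bits
    return min(round(x / step) * step, E4M3_MAX)


def round_e2m1(x):
    """Round to the nearest E2M1 value; magnitudes above 6 clamp to 6."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(mag, x)


def quantize_row(row):
    """Return (codes, block_scales, global_scale) for one FP32 row."""
    amax = max(abs(v) for v in row)
    gscale = amax / (6.0 * E4M3_MAX) if amax > 0.0 else 1.0
    codes, scales = [], []
    for i in range(0, len(row), BLOCK):
        blk = row[i:i + BLOCK]
        bmax = max(abs(v) for v in blk)
        sf = round_e4m3(bmax / (6.0 * gscale))   # block scale, stored as E4M3
        scales.append(sf)
        denom = sf * gscale
        codes.extend(round_e2m1(v / denom) if denom > 0.0 else 0.0 for v in blk)
    return codes, scales, gscale


def dequantize_row(codes, scales, gscale):
    """Inverse of quantize_row: code * block scale * global scale."""
    return [c * scales[i // BLOCK] * gscale for i, c in enumerate(codes)]
```

Python floats are doubles, so this only models the rounding grids, not FP32 arithmetic. Values already on the E2M1 grid after scaling round-trip to within floating-point noise; other values carry an error of at most about one E2M1 step of their block.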
New kernels in kv_quantize.cu:
- compute_amax_gsa_fp32: per-row gsa from FP32 input
- quantize_nvfp4_from_fp32: FP32→NVFP4 with GPU gsa buffer
- quantize_fp8_e4m3_from_fp32: FP32→FP8_E4M3 for indexer keys
- dequant_fp8_e4m3 / dequant_fp8_e4m3_selective: FP8→BF16
- rope_fp32: FP32 GPT-J interleaved RoPE (no BF16)
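For reference, the GPT-J interleaved RoPE layout that rope_fp32 is described as using rotates adjacent pairs (x[2i], x[2i+1]), as opposed to the NeoX layout that pairs x[i] with x[i + d/2]. A minimal Python sketch follows; the function name and the base of 10000 are assumptions, and Python doubles stand in for FP32.

```python
import math


def rope_interleaved(x, pos, base=10000.0):
    """Apply GPT-J style (interleaved) RoPE to one vector at position `pos`."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)   # rotation angle for pair i
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]          # adjacent elements form a pair
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Each pair undergoes a plain 2D rotation, so the vector norm is unchanged and position 0 is the identity.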
Proven two-kernel pattern (same as quantize_nvfp4_gpu_fused):
  Kernel 1: amax_gsa (GPU-only)
  Kernel 2: quantize from buffer (GPU gsa)
No shared memory bugs. No cross-CTA race conditions.
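The two-pass structure can be modeled on the host as below, using the FP8_E4M3 per-row case because it is the simplest. Pass 1 reduces each row to a scale and writes it to a buffer; pass 2 quantizes using only that buffer. On the GPU the kernel boundary plays the role of the global synchronization between the reduction and its consumers. Function names and the `amax / 448` scale rule are assumptions for illustration, not the code in kv_quantize.cu.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value


def round_e4m3(x):
    """Round a float to the nearest E4M3 value, saturating at +/-448."""
    a = min(abs(x), E4M3_MAX)
    if a == 0.0:
        return 0.0
    _, ex = math.frexp(a)                      # a = m * 2**ex with 0.5 <= m < 1
    step = 2.0 ** (max(ex - 1, -6) - 3)        # 3 mantissa bits, subnormals below 2**-6
    return math.copysign(min(round(a / step) * step, E4M3_MAX), x)


def pass1_row_scales(rows):
    """Pass 1: reduce each row to one scale and store it in a buffer."""
    scales = []
    for row in rows:
        amax = max(abs(v) for v in row)
        scales.append(amax / E4M3_MAX if amax > 0.0 else 1.0)
    return scales


def pass2_quantize(rows, scales):
    """Pass 2: quantize every element using only the scale buffer from pass 1."""
    return [[round_e4m3(v / s) for v in row] for row, s in zip(rows, scales)]


def dequantize(q_rows, scales):
    """Read path: multiply each stored value by its row scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Usage: `scales = pass1_row_scales(rows)` followed by `q = pass2_quantize(rows, scales)`. Because pass 2 never recomputes the row maximum, no element is quantized before the reduction for its row has finished.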
KVCache updated:
- comp_kv_fp4/sf/gsa: NVFP4 storage (3.5× smaller than BF16)
- comp_idx_fp8/scale: FP8_E4M3 storage (1.9× smaller than BF16)
- comp_kv property: dequant NVFP4→BF16 on demand
- comp_kv_selective: dequant only top-k entries (bandwidth savings)
- comp_idx_kv property: dequant FP8→BF16 on demand
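The 3.5× and 1.9× figures can be sanity-checked with per-row byte counts. The sketch below assumes 16-element NVFP4 blocks, a 4-byte FP32 scale per row for both formats, and hypothetical row widths of 512 (compressed KV) and 128 (indexer keys); none of these widths come from the commit.

```python
def bf16_row_bytes(d):
    """BF16 baseline: 2 bytes per element."""
    return 2 * d


def nvfp4_row_bytes(d, block=16, row_scale_bytes=4):
    """4-bit codes + one E4M3 byte per block + one FP32 scale per row."""
    return d // 2 + d // block + row_scale_bytes


def fp8_row_bytes(d, row_scale_bytes=4):
    """1 byte per element + one scale per row (FP32 assumed)."""
    return d + row_scale_bytes


kv_dim, idx_dim = 512, 128   # hypothetical row widths
kv_ratio = bf16_row_bytes(kv_dim) / nvfp4_row_bytes(kv_dim)
idx_ratio = bf16_row_bytes(idx_dim) / fp8_row_bytes(idx_dim)
print(f"NVFP4 vs BF16: {kv_ratio:.2f}x, FP8 vs BF16: {idx_ratio:.2f}x")
```

Under these assumptions the ratios come out just above 3.5 and just above 1.9. The per-row scale is why they fall short of the ideal 32/9 and 2.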
Removed: compressor_reduce_quant.cu (buggy single-kernel approach)
2026-06-02 10:00:50 +00:00