biondizzle
f23320b5b2
KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant
- compressor_reduce_quant.cu: Single-kernel CSA/HCA compress + RMSNorm + NVFP4 quantize.
No intermediate BF16: FP32 → E2M1 values + E4M3 block scales + FP32 gsa in one kernel.
Shared memory: ~2.5KB per CTA (FP32 staging + nibble buffer).
- dequant_nvfp4.cu: NVFP4 → BF16 dequantization kernels.
Full dequant (HCA dense gather) and selective dequant (CSA top-k gather).
Single kernel launch per gather operation.
- production_compress.py: Added csa_compress_production_nvfp4() and
hca_compress_production_nvfp4() — production path for KV-1/KV-2.
- loader.py: Preload dequant_nvfp4 and compressor_reduce_quant modules.
- test_kv_compress_quant.py: Unit tests verifying cosine similarity >= 0.999
between the BF16 reference and the NVFP4 round-trip path.
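
For readers unfamiliar with the format, the round trip these tests exercise can be sketched in plain Python. This is a host-side illustration only, not the code in compressor_reduce_quant.cu or dequant_nvfp4.cu. It assumes 16-element blocks, a global scale of 448 * 6 / amax, and low-nibble-first packing; all function names are hypothetical. The similarity it reaches on random data depends on the input distribution and is not the 0.999 figure from the unit tests.

```python
import math
import random

E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes for 3-bit codes 0..7
BLOCK = 16         # elements sharing one E4M3 scale (assumed block size)
E2M1_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite E4M3 value


def round_e4m3(x):
    """Round a non-negative float to the nearest E4M3 value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    _, e = math.frexp(x)           # x = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)           # below 2**-6 the format is subnormal
    step = 2.0 ** (exp - 3)        # 3 mantissa bits per binade
    return min(round(x / step) * step, E4M3_MAX)


def encode_e2m1(v):
    """Return the 4-bit code (sign bit plus magnitude index) nearest to v.

    Ties go to the smaller magnitude here; hardware rounding may differ.
    """
    mag = min(abs(v), E2M1_MAX)
    idx = min(range(8), key=lambda i: abs(E2M1_LEVELS[i] - mag))
    return (8 if v < 0 else 0) | idx


def decode_e2m1(code):
    mag = E2M1_LEVELS[code & 7]
    return -mag if code & 8 else mag


def quantize_nvfp4(x):
    """Quantize floats (length divisible by 16) to packed nibbles,
    per-block E4M3 scales, and one FP32 global scale."""
    amax = max(abs(v) for v in x)
    gscale = (E4M3_MAX * E2M1_MAX) / amax if amax > 0 else 1.0  # assumed convention
    packed, scales = bytearray(), []
    for start in range(0, len(x), BLOCK):
        block = x[start:start + BLOCK]
        s = round_e4m3(max(abs(v) for v in block) / E2M1_MAX * gscale)
        scales.append(s)
        decode_scale = s / gscale
        codes = [encode_e2m1(v / decode_scale) if decode_scale > 0 else 0
                 for v in block]
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))   # low nibble first (assumption)
    return bytes(packed), scales, gscale


def dequantize_nvfp4(packed, scales, gscale):
    """Invert quantize_nvfp4: unpack nibbles and apply block scale / global scale."""
    out = []
    for i, byte in enumerate(packed):
        decode_scale = scales[(2 * i) // BLOCK] / gscale
        out.append(decode_e2m1(byte & 0xF) * decode_scale)
        out.append(decode_e2m1(byte >> 4) * decode_scale)
    return out


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(1024)]
round_trip = dequantize_nvfp4(*quantize_nvfp4(reference))
print("cosine similarity:", cosine(reference, round_trip))
```

The fused kernel described above would produce the same three outputs (packed nibbles, block scales, global scale) directly from FP32 accumulators, which is why no BF16 intermediate is needed.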
2026-06-02 09:37:53 +00:00