nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Fork 0

Commit Graph

Author	SHA1	Message	Date
biondizzle	edc8e7ee8d	KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format) Architecture matches paper: 'BF16 for RoPE dims, FP8 for remaining dims' - Non-RoPE dims (448 of 512): FP8_E4M3 storage → dequant to BF16 for FMHA - RoPE dims (64 of 512): BF16 storage (RoPE applied directly, no conversion) - Indexer keys: FP8_E4M3 (ihd=128, no RoPE) - SWA: BF16 (unchanged) Pipeline: Compressor → FP32 → split → [nope: FP32→FP8] + [rope: FP32→BF16→RoPE] Gather: [nope: FP8→BF16] + [rope: BF16] → concat → FMHA No BF16 intermediate for non-RoPE data. No FP32 intermediate after BF16 RoPE. BF16 is the final format consumed by FMHA (no further conversion). KVCache rewritten: - comp_nope_fp8/scale: FP8 storage for non-RoPE - comp_rope_bf16: BF16 storage for RoPE - comp_nope_selective/all: FP8→BF16 dequant - comp_rope_selective/all: BF16 gather - set_compressed_mixed: write mixed format - set_indexer_keys_fp8: write FP8 indexer keys	2026-06-02 10:08:43 +00:00
biondizzle	12b6365b42	Fix RoPE test: use proper cos/sin cache	2026-06-02 10:04:01 +00:00
biondizzle	bdb25ee5cd	Add production-value unit tests for kv_quantize kernels	2026-06-02 10:01:07 +00:00

Author

SHA1

Message

Date

biondizzle

edc8e7ee8d

KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format)

Architecture matches paper: 'BF16 for RoPE dims, FP8 for remaining dims'
- Non-RoPE dims (448 of 512): FP8_E4M3 storage → dequant to BF16 for FMHA
- RoPE dims (64 of 512): BF16 storage (RoPE applied directly, no conversion)
- Indexer keys: FP8_E4M3 (ihd=128, no RoPE)
- SWA: BF16 (unchanged)

Pipeline:
  Compressor → FP32 → split → [nope: FP32→FP8] + [rope: FP32→BF16→RoPE]
  Gather: [nope: FP8→BF16] + [rope: BF16] → concat → FMHA

No BF16 intermediate for non-RoPE data.
No FP32 intermediate after BF16 RoPE.
BF16 is the final format consumed by FMHA (no further conversion).

KVCache rewritten:
- comp_nope_fp8/scale: FP8 storage for non-RoPE
- comp_rope_bf16: BF16 storage for RoPE
- comp_nope_selective/all: FP8→BF16 dequant
- comp_rope_selective/all: BF16 gather
- set_compressed_mixed: write mixed format
- set_indexer_keys_fp8: write FP8 indexer keys

2026-06-02 10:08:43 +00:00

biondizzle

12b6365b42

Fix RoPE test: use proper cos/sin cache

2026-06-02 10:04:01 +00:00

biondizzle

bdb25ee5cd

Add production-value unit tests for kv_quantize kernels

2026-06-02 10:01:07 +00:00

3 Commits