nvfp4-megamoe-kernel

Files

biondizzle 7ef6402936 KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys

Architecture:
- Compressed KV: stored as NVFP4 (E2M1 + E4M3 + FP32 gsa)
  - Write path: compress→FP32 → FP32 RoPE → quantize FP32→NVFP4
  - Read path: dequant_nvfp4/dequant_nvfp4_selective → BF16 for FMHA
  - No BF16 intermediate in the write path
- Indexer keys: stored as FP8_E4M3 (1 byte + per-row scale)
  - Write path: compress→FP32 → quantize FP32→FP8_E4M3
  - Read path: dequant_fp8_e4m3 → BF16 for scoring
- SWA: remains BF16 (8MB total, fits in L2)

New kernels in kv_quantize.cu:
- compute_amax_gsa_fp32: per-row gsa from FP32 input
- quantize_nvfp4_from_fp32: FP32→NVFP4 with GPU gsa buffer
- quantize_fp8_e4m3_from_fp32: FP32→FP8_E4M3 for indexer keys
- dequant_fp8_e4m3 / dequant_fp8_e4m3_selective: FP8→BF16
- rope_fp32: FP32 GPT-J interleaved RoPE (no BF16)

Proven two-kernel pattern (same as quantize_nvfp4_gpu_fused):
  Kernel 1: amax_gsa (GPU-only)
  Kernel 2: quantize from buffer (GPU gsa)
No shared memory bugs. No cross-CTA race conditions.

KVCache updated:
- comp_kv_fp4/sf/gsa: NVFP4 storage (3.5× smaller than BF16)
- comp_idx_fp8/scale: FP8_E4M3 storage (1.9× smaller than BF16)
- comp_kv property: dequant NVFP4→BF16 on demand
- comp_kv_selective: dequant only top-k entries (bandwidth savings)
- comp_idx_kv property: dequant FP8→BF16 on demand

Removed: compressor_reduce_quant.cu (buggy single-kernel approach)

2026-06-02 10:00:50 +00:00

cache

E1: Wire LayerCacheHandle gather methods + CUDA gather kernels

2026-05-30 21:09:21 +00:00

kernels

KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys

2026-06-02 10:00:50 +00:00

layers

wip: SE fused SwiGLU deinterleave fix

2026-06-02 08:41:00 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering