nvfp4-megamoe-kernel

Files

biondizzle 29f836d711 P4: Fix fused RMSNorm kernel — match quantize_nvfp4.cu encoding

- Use half_step_to_e2m1 for E2M1 FP4 quantization (not LUT search)
- Use __nv_fp8_e4m3 + memcpy for block scale (not reinterpret_cast)
- Pack nibbles as (nibbles[2*i+1] << 4) | nibbles[2*i] (same as prod)
- Output uint8 buffers, then .view() to FP4/FP8 dtypes
- Handle near-zero block scale same as quantize_nvfp4.cu
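
The encoding steps listed in this commit can be sketched in plain Python. This is an illustration only, not the code in quantize_nvfp4.cu: the names `quantize_e2m1`, `pack_nibbles`, and `quantize_block` are hypothetical, the tie-breaking rule (ties go to the even code) is an assumption, the block scale is kept as a Python float instead of being encoded to FP8 E4M3, and the zero-scale branch is a guess at what "near-zero" handling might look like. Only the nibble-packing expression is taken directly from the commit message.

```python
# E2M1 magnitudes indexed by their 3-bit code; bit 3 of the nibble is the sign.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x: float) -> int:
    """Return the 4-bit E2M1 code nearest to x.

    Assumption: ties are resolved toward the even code, and magnitudes
    above 6.0 saturate to 6.0.
    """
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)
    # Sort key: distance first, then code parity so an exact tie picks the even code.
    code = min(range(8), key=lambda c: (abs(E2M1_VALUES[c] - mag), c & 1))
    return sign | code


def pack_nibbles(nibbles):
    """Pack pairs of 4-bit codes into bytes, low nibble first.

    Uses the layout from the commit message:
    (nibbles[2*i+1] << 4) | nibbles[2*i]
    """
    assert len(nibbles) % 2 == 0, "need an even number of nibbles"
    return bytes(
        (nibbles[2 * i + 1] << 4) | nibbles[2 * i]
        for i in range(len(nibbles) // 2)
    )


def quantize_block(block):
    """Quantize one block to (scale, packed uint8 bytes).

    Assumption: scale = amax / 6.0, so the largest magnitude maps to the
    largest E2M1 value. A zero scale yields all-zero codes (hypothetical
    stand-in for the near-zero handling in the real kernel).
    """
    amax = max(abs(v) for v in block)
    scale = amax / 6.0
    if scale == 0.0:
        return scale, bytes(len(block) // 2)
    return scale, pack_nibbles([quantize_e2m1(v / scale) for v in block])
```

For example, `quantize_block([6.0, -3.0, 1.5, 0.0])` produces a scale of 1.0 and the codes 0x7, 0xD, 0x3, 0x0, which pack to the two bytes 0xD7 and 0x03. In the kernel described above, such uint8 output would then be reinterpreted with .view() as the FP4 dtype.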

2026-06-02 16:28:44 +00:00

attention

perf: skip MQA GQA expansion in FMHA (stride=0, no 128x K/V copy)

2026-06-02 03:54:03 +00:00

cache

fix: correct gather.py kernel_dir path

2026-05-30 21:12:09 +00:00

compressor

KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys

2026-06-02 10:00:50 +00:00

cuda

P4: Fix fused RMSNorm kernel — match quantize_nvfp4.cu encoding