Files
nvfp4-megamoe-kernel/dsv4/kernels
biondizzle 29f836d711 P4: Fix fused RMSNorm kernel — match quantize_nvfp4.cu encoding
- Use half_step_to_e2m1 for E2M1 FP4 quantization (not LUT search)
- Use __nv_fp8_e4m3 + memcpy for block scale (not reinterpret_cast)
- Pack nibbles as (nibbles[2*i+1] << 4) | nibbles[2*i] (same as prod)
- Output uint8 buffers, then .view() to FP4/FP8 dtypes
- Handle near-zero block scale same as quantize_nvfp4.cu
2026-06-02 16:28:44 +00:00
..