biondizzle
80b6b79f9e
NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels
- fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid,
  NaN guard for exp=15/mant=7, mantissa overflow handling)
- fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32
- half_step_to_e2m1_idx: E2M1 step mapping (half-step 0-12 → index 0-7)
- quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack
- Verified against Python reference: 0 failures in 500 trials
- Key fixes discovered during validation:
  1. FP8 E4M3 exponent bias is 7, NOT 8
  2. Exponent field range is 0-15 (only exp=15/mant=7 is NaN; all other encodings are valid)
  3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024)
  4. Round-to-nearest-even (not round-half-up) for half_step and mantissa
  5. Mantissa overflow (rounds up to 8) must reset mantissa to 0 and increment exponent
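
The rules listed above can be illustrated with a small pure-Python sketch. This is not the CuTeDSL kernel code or the Python reference used for validation. The names e4m3_decode_ref, e4m3_encode_ref and e2m1_nibble_ref are hypothetical, and saturating out-of-range inputs to the largest finite value (448 for E4M3, 6 for E2M1) is an assumption that the commit does not state.

```python
import math

def e4m3_decode_ref(bits: int) -> float:
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    if exp == 15 and mant == 7:        # the only NaN pattern per sign
        return math.nan
    if exp == 0:                       # subnormal: m * 2^-9 = m / 512
        return sign * mant * 2.0 ** -9
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

def e4m3_encode_ref(x: float) -> int:
    """Encode a float as FP8 E4M3 with round-to-nearest-even."""
    if math.isnan(x):
        return 0x7F
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    a = min(abs(x), 448.0)             # assumption: saturate to max finite
    if a < 2.0 ** -6:                  # subnormal range, step 2^-9
        # Python's round() is round-half-even; a result of 8 lands on
        # exp=1/mant=0, which is exactly the smallest normal value.
        return sign | round(a * 512.0)
    frac, e2 = math.frexp(a)           # a = frac * 2**e2, 0.5 <= frac < 1
    exp = e2 - 1                       # a = (2*frac) * 2**exp, 1 <= 2*frac < 2
    mant = round((2.0 * frac - 1.0) * 8.0)
    if mant == 8:                      # mantissa overflow: carry into exponent
        mant, exp = 0, exp + 1
    return sign | ((exp + 7) << 3) | mant

# Magnitudes representable in E2M1, indexed by the 3-bit code 0..7.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def e2m1_nibble_ref(x: float) -> int:
    """Quantize a finite float to a 4-bit E2M1 code (sign in bit 3)."""
    sign = 0x8 if x < 0 else 0x0
    a = min(abs(x), 6.0)               # assumption: saturate to max magnitude
    # Nearest value; on a tie prefer the code whose mantissa bit is 0 (even).
    idx = min(range(8), key=lambda i: (abs(E2M1_VALUES[i] - a), i & 1))
    return sign | idx
```

For example, e4m3_encode_ref(1.0625) lies halfway between 1.0 and 1.125 and rounds to the even mantissa (0x38), while e4m3_encode_ref(1.96875) rounds the mantissa up to 8 and carries into the exponent (0x40, i.e. 2.0).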
2026-05-28 03:39:55 +00:00