nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	1828a71cde	NVFP4-1.1: test kernel uses Float32 input (avoids BF16 scalar load issue)	2026-05-28 04:32:08 +00:00
biondizzle	accc66741d	NVFP4-1.1: update test kernel with threshold rounding API	2026-05-28 04:27:29 +00:00
biondizzle	dc35d29811	NVFP4-1.1: fix cute.arch.store signature - store(ptr, val) not store(ptr, val, dtype)	2026-05-28 04:01:38 +00:00
biondizzle	6f94925491	NVFP4-1.1: fix cute.math.fmax -> cute.arch.fmax (correct CuTeDSL API)	2026-05-28 03:48:51 +00:00
biondizzle	60790564f0	NVFP4-1.1: fix test - two-pass kernel, cute.arch.store confirmed on B200	2026-05-28 03:46:45 +00:00
biondizzle	a41de129cb	NVFP4-1.1: fix test kernel - use cute.copy instead of cute.arch.store	2026-05-28 03:42:24 +00:00
biondizzle	3a78bdf570	NVFP4-1.1: add CuTeDSL kernel test for FP4 quantization	2026-05-28 03:40:54 +00:00
biondizzle	80b6b79f9e	NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels - fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid, NaN guard for exp=15/mant=7, mantissa overflow handling) - fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32 - half_step_to_e2m1_idx: E2M1 step mapping (0-12 → 0-7) - quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack - Verified 0/500 trial failures against Python reference - Key fixes discovered during validation: 1. FP8 E4M3 bias is 7, NOT 8 2. Exponent range is 0-15 (exp=15/mant=7 is NaN; others valid) 3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024) 4. Round-to-nearest-even (not round-half-up) for half_step and mantissa 5. Mantissa overflow (round to 8) must increment exponent	2026-05-28 03:39:55 +00:00

8 Commits
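
The FP8 E4M3 rules listed in commit 80b6b79f9e (bias 7, subnormals equal to m * 2^(-9), exp=15/mant=7 reserved for NaN, round-to-nearest-even, mantissa overflow carried into the exponent) can be checked on the host with a small Python reference. This is a minimal sketch, not the repository's CuTeDSL code: the function names only echo the commit message, and saturating out-of-range inputs to 448 is an assumption that the log does not confirm.

```python
import math

def fp8_e4m3_to_float(bits: int) -> float:
    """Decode an 8-bit E4M3 pattern (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    if exp == 15 and mant == 7:
        return math.nan                      # the only NaN encoding per sign
    if exp == 0:
        return sign * mant * 2.0 ** -9       # subnormal: m / 512
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)  # bias is 7

def fp8_e4m3_from_float(x: float) -> int:
    """Encode a Python float as E4M3 with round-to-nearest-even."""
    if math.isnan(x):
        return 0x7F
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0
    a = abs(x)
    if a >= 448.0:
        return sign | 0x7E                   # ASSUMPTION: saturate to max finite (448)
    if a < 2.0 ** -6:
        # Subnormal range, in units of 2^-9. Python's round() is half-to-even.
        # A result of 8 is the bit pattern 0x08, the smallest normal value.
        return sign | round(a * 512.0)
    frac, e = math.frexp(a)                  # a = frac * 2^e with 0.5 <= frac < 1
    exp = e - 1 + 7                          # biased exponent of the 1.xxx form
    mant = round((frac * 2.0 - 1.0) * 8.0)   # 3-bit mantissa, half-to-even
    if mant == 8:                            # mantissa overflow: carry into exponent
        mant = 0
        exp += 1
    return sign | (exp << 3) | mant
```

Every non-NaN bit pattern should survive a decode/encode round trip, which gives a quick sanity check before comparing against a kernel's output. For example, `fp8_e4m3_from_float(1.0625)` lands on `0x38` (a tie that rounds to the even mantissa), while `fp8_e4m3_from_float(1.96875)` overflows the mantissa and becomes `0x40` (2.0).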