nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	b2d0417a46	NVFP4-1.1: Mark fp4_quant.py as toolchain-blocked, clean up test files CuTeDSL MLIR pipeline cannot lower any float→int op. All approaches fail: arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast. Production path: dsv4/kernels/cuda/quantize_nvfp4.cu (raw CUDA, works). For NVFP4-1.1 fusion, use post-epilogue CUDA kernel approach. Removed dead test files (test_ptx_, test_fp4_isolate, test_minimal_cmp*, test_dtype_store, test_threshold_round).	2026-05-28 04:59:01 +00:00
biondizzle	b3eb46d4ec	NVFP4-1.1: Restore threshold RNE approach — inline PTX blocked by toolchain CuTeDSL MLIR pipeline cannot lower any float→int conversion: arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast — all fail with 'LLVM ERROR: unsupported operation'. The pipeline has no path from Float32 to Int32 MLIR types. Threshold RNE is the mathematically correct software implementation: - Float32 comparisons select Int32 constants (no arith.fptosi) - > vs >= at .5 boundaries implements round-to-nearest-even - Equivalent to PTX cvt.rni.s32.f32 for bounded ranges	2026-05-28 04:54:27 +00:00
biondizzle	e33c48e44c	NVFP4-1.1: Use nvvm.inline_ptx instead of llvm.inline_asm for f32→i32 llvm.inline_asm fails with 'LLVM ERROR: unsupported operation' in CuTeDSL lowering pipeline. Switch to nvvm.inline_ptx which is native to the NVVM dialect and lowers correctly. - f32_to_i32_rni: cvt.rni.s32.f32 via nvvm.inline_ptx - f32_to_i32_rz: cvt.rzi.s32.f32 via nvvm.inline_ptx - f32_to_i32_rmi: cvt.rmi.s32.f32 via nvvm.inline_ptx	2026-05-28 04:42:33 +00:00
biondizzle	1cbb3cf752	NVFP4-1.1: Replace threshold rounding with inline PTX cvt.rni/rz/rmi - Add f32_to_i32_rni (cvt.rni.s32.f32) for round-to-nearest-even - Add f32_to_i32_rz (cvt.rzi.s32.f32) for round-toward-zero - Add f32_to_i32_rmi (cvt.rmi.s32.f32) for round-to-minus-infinity - Replace round_rne_u0_8 and abs_scaled_to_e2m1_idx threshold hacks with proper PTX hardware rounding in fp8_e4m3_from_float32 - quantize_e2m1_nibble now uses f32_to_i32_rni + LUT logic for half_step - Add test_ptx_convert.py for inline PTX conversion verification - This is the CORRECT approach per NVFP4-1.1_INLINE_PTX_APPROACH.md option 1	2026-05-28 04:40:17 +00:00
biondizzle	d2aa93aad7	NVFP4-1.1: fix Int32 clamping — use comparisons instead of fmin/fmax (float-only ops)	2026-05-28 04:30:06 +00:00
biondizzle	dabcc415a8	NVFP4-1.1: threshold rounding for float-to-int — avoids CuTeDSL limitation All float-to-int conversions replaced with threshold comparisons: - round_rne_u0_8: mantissa rounding via Float32 comparisons → Int32 constants - abs_scaled_to_e2m1_idx: direct |scaled| → E2M1 index (no half_step needed) - Verified 0/500 trial failures against Python reference Key thresholds (RNE boundaries): - 0.25, 0.75, 1.25, 1.75, 2.75, 3.75, 5.25 with > vs >= for RNE tie-breaking - Fixed: 2.75 must use >= (not >) to match round(5.5)=6 RNE	2026-05-28 04:26:40 +00:00
biondizzle	e565ebce91	NVFP4-1.1: replace cute.math.fmin with cute.arch.fmin (correct API)	2026-05-28 03:55:54 +00:00
biondizzle	20d5ddfa3d	NVFP4-1.1: fix indentation for @cute.jit decorators	2026-05-28 03:52:46 +00:00
biondizzle	f6f59d34cb	NVFP4-1.1: add @cute.jit decorator to fp4_quant functions for CuTeDSL if-block support	2026-05-28 03:50:11 +00:00
biondizzle	6f94925491	NVFP4-1.1: fix cute.math.fmax -> cute.arch.fmax (correct CuTeDSL API)	2026-05-28 03:48:51 +00:00
biondizzle	80b6b79f9e	NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels - fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid, NaN guard for exp=15/mant=7, mantissa overflow handling) - fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32 - half_step_to_e2m1_idx: E2M1 step mapping (0-12 → 0-7) - quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack - Verified 0/500 trial failures against Python reference - Key fixes discovered during validation: 1. FP8 E4M3 bias is 7, NOT 8 2. Exponent range is 0-15 (exp=15/mant=7 is NaN; others valid) 3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024) 4. Round-to-nearest-even (not round-half-up) for half_step and mantissa 5. Mantissa overflow (round to 8) must increment exponent	2026-05-28 03:39:55 +00:00
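
The FP8 E4M3 facts listed under "Key fixes" in commit 80b6b79f9e (bias 7, exponent range 0-15, NaN only at exp=15/mant=7, subnormals equal to m * 2^(-9)) can be illustrated with a small plain-Python decoder. This is a sketch under those stated rules, not the repository's `fp8_e4m3_to_float32`; the name `decode_e4m3` is hypothetical.

```python
import math


def decode_e4m3(bits: int) -> float:
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits).

    Hypothetical reference helper, written only from the rules in the
    commit message; it is not code taken from fp4_quant.py.
    """
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7

    # Only exp=15 with mant=7 is NaN; every other exp=15 pattern is a finite value.
    if exp == 15 and mant == 7:
        return math.nan

    # Subnormal: (m / 8) * 2^(1 - bias) = m * 2^(-9) = m / 512 with bias 7.
    if exp == 0:
        return sign * mant * 2.0 ** -9

    # Normal: implicit leading 1, exponent bias 7.
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)
```

For example, `decode_e4m3(0x7E)` gives the largest finite value (1.75 * 2^8) and `decode_e4m3(0x01)` gives the smallest subnormal (1/512), which is the case the "NOT m/1024" note guards against.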

11 Commits
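
The threshold RNE approach described in commits dabcc415a8 and b3eb46d4ec can be sketched in plain Python. Float comparisons select integer constants, so no float-to-int conversion is needed, and the choice of `>` versus `>=` at each boundary reproduces round-to-nearest-even on the half-step value 2 * |scaled|. The threshold values come from the commit message; the half-step to index table and all function names below are assumptions inferred from those thresholds, not code from `fp4_quant.py`.

```python
# Half-step (0..12) -> E2M1 index (0..7). ASSUMPTION: inferred from the
# thresholds in the commit message; the real table may differ.
HALF_STEP_TO_IDX = [0, 1, 2, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7]


def e2m1_idx_reference(abs_scaled: float) -> int:
    """Reference path: round 2*|x| to nearest even, then look up the index."""
    half_step = round(min(abs_scaled, 6.0) * 2.0)  # Python round() is RNE
    return HALF_STEP_TO_IDX[half_step]


def e2m1_idx_threshold(abs_scaled: float) -> int:
    """Threshold path: comparisons only, no float-to-int conversion.

    A tie that rounds up to an even half-step uses >=; a tie that rounds
    down to an even half-step uses >.
    """
    idx = 0
    if abs_scaled > 0.25:    # 2x = 0.5 ties to 0, stays below
        idx = 1
    if abs_scaled >= 0.75:   # 2x = 1.5 ties to 2, goes up
        idx = 2
    if abs_scaled > 1.25:    # 2x = 2.5 ties to 2, stays below
        idx = 3
    if abs_scaled >= 1.75:   # 2x = 3.5 ties to 4, goes up
        idx = 4
    if abs_scaled >= 2.75:   # 2x = 5.5 ties to 6, goes up (the fix in dabcc415a8)
        idx = 5
    if abs_scaled >= 3.75:   # 2x = 7.5 ties to 8, goes up
        idx = 6
    if abs_scaled > 5.25:    # 2x = 10.5 ties to 10, stays below
        idx = 7
    return idx


def count_mismatches() -> int:
    """Compare both paths on an exact binary grid k/64 covering 0..8."""
    grid = [k / 64.0 for k in range(0, 8 * 64 + 1)]
    return sum(e2m1_idx_threshold(x) != e2m1_idx_reference(x) for x in grid)
```

Calling `count_mismatches()` compares the two paths on every multiple of 1/64 up to 8.0, which includes all seven tie points. In a CuTeDSL kernel the same comparison chain would operate on Float32 values and Int32 constants, which is what lets it avoid the unsupported lowering.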