nvfp4-megamoe-kernel/dsv4/kernels/gemm/fp4_quant.py
biondizzle b2d0417a46 NVFP4-1.1: Mark fp4_quant.py as toolchain-blocked, clean up test files
CuTeDSL MLIR pipeline cannot lower any float→int op. All approaches fail:
arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast.

Production path: dsv4/kernels/cuda/quantize_nvfp4.cu (raw CUDA, works).
For NVFP4-1.1 fusion, use post-epilogue CUDA kernel approach.

Removed dead test files (test_ptx_*, test_fp4_isolate*, test_minimal_cmp*,
test_dtype_store, test_threshold_round).
2026-05-28 04:59:01 +00:00


"""
NVFP4 quantization primitives: TOOLCHAIN BLOCKED in CuTeDSL.

CuTeDSL's MLIR lowering pipeline cannot lower any of the float→int
operations attempted here:
- arith.fptosi → LLVM ERROR: unsupported operation
- llvm.inline_asm with cvt.rni.s32.f32 → LLVM ERROR: unsupported operation
- nvvm.inline_ptx with cvt.rni.s32.f32 → LLVM ERROR: unsupported operation
- llvm.bitcast Float32→Int32 → LLVM ERROR: unsupported operation

No tested route gets from Float32 MLIR types to Int32 MLIR types.
This is a toolchain limitation, not an implementation issue.

Production path: use dsv4/kernels/cuda/quantize_nvfp4.cu instead.
That kernel uses __float2int_rn() and other raw CUDA intrinsics, and it works.

For NVFP4-1.1 (fusing FP4 quantization into the MoE SwiGLU epilogue), the
plan is a post-epilogue CUDA kernel that reads BF16 from GMEM and quantizes
it to FP4. See ROADMAP.md Priority 3.

This file is kept to document the toolchain limitation. If CuTeDSL gains
float→int support in the future, these primitives can be reimplemented here.
"""
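
For reference, the PTX rounding modifiers named in this file differ only in which integer they pick: `.rni` rounds to the nearest integer with ties going to the even one (the same rule as `__float2int_rn()`), `.rzi` truncates toward zero, and `.rmi` rounds toward negative infinity. The host-side Python sketch below mirrors those rules for small in-range values. The helper names are hypothetical, this is not CuTeDSL code, and it does not model saturation or NaN handling.

```python
import math


def cvt_rni(x: float) -> int:
    """Nearest integer, ties to even (the cvt.rni / __float2int_rn rule)."""
    # Python's built-in round() on a float uses round-half-to-even.
    return round(x)


def cvt_rzi(x: float) -> int:
    """Round toward zero (the cvt.rzi rule)."""
    return math.trunc(x)


def cvt_rmi(x: float) -> int:
    """Round toward negative infinity (the cvt.rmi rule)."""
    return math.floor(x)


if __name__ == "__main__":
    # Print each input next to its three roundings for comparison.
    for x in (-2.7, -2.5, -0.5, 0.5, 1.5, 2.5, 2.7):
        print(x, cvt_rni(x), cvt_rzi(x), cvt_rmi(x))
```

The three modes agree on most inputs and only diverge on negative values and exact halves, which is where a quantizer's choice of mode becomes visible.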
# All functions removed. Use dsv4/kernels/cuda/quantize_nvfp4.cu instead.
#
# Attempted approaches (all failed with "LLVM ERROR: unsupported operation"):
# 1. arith.fptosi (cutlass.Int32(float_val))
# 2. llvm.inline_asm with cvt.rni.s32.f32, cvt.rzi.s32.f32, cvt.rmi.s32.f32
# 3. nvvm.inline_ptx with cvt.rni.s32.f32
# 4. llvm.bitcast Float32 → Int32
# 5. Threshold rounding (Float32 comparisons selecting Int32 constants).
#    This should work, since it never generates fptosi, but as of 2026-05-28
#    even a trivial Int32 GMEM store kernel fails on the B200, possibly
#    because prior LLVM ERROR crashes left the GPU in a corrupted state.
# TODO: Re-test the threshold approach once the B200 GPU state is clean.
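
The threshold idea in item 5 can be illustrated on the host in plain Python. This is a minimal sketch, not the removed CuTeDSL code: it assumes the FP4 element format is E2M1 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), that out-of-range inputs saturate to 6, and that the first element of a pair goes in the low nibble. The function names are hypothetical, and block scale factors are left out entirely.

```python
# E2M1 magnitudes indexed by their 3-bit code (2 exponent bits, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def encode_e2m1(x: float) -> int:
    """Return a 4-bit code (sign bit plus 3-bit magnitude) for x.

    Only float comparisons are applied to the input. Each comparison
    contributes an integer constant, so no float-to-int conversion of
    the value itself is needed.
    """
    a = abs(x)
    # Thresholds are the midpoints between adjacent magnitudes. Alternating
    # > and >= sends exact ties to the even code (round-to-nearest-even).
    code = (
        (1 if a > 0.25 else 0)
        + (1 if a >= 0.75 else 0)
        + (1 if a > 1.25 else 0)
        + (1 if a >= 1.75 else 0)
        + (1 if a > 2.5 else 0)
        + (1 if a >= 3.5 else 0)
        + (1 if a > 5.0 else 0)  # anything above 5.0 saturates to 6.0
    )
    sign = 8 if x < 0.0 else 0
    return sign | code


def decode_e2m1(code: int) -> float:
    """Map a 4-bit code back to its float value."""
    mag = E2M1_VALUES[code & 7]
    return -mag if code & 8 else mag


def pack_pair(lo: float, hi: float) -> int:
    """Pack two codes into one byte, `lo` in the low nibble (assumed order)."""
    return encode_e2m1(lo) | (encode_e2m1(hi) << 4)


if __name__ == "__main__":
    for x in (0.25, 0.3, 0.75, 2.5, -1.5, 5.0, 100.0):
        c = encode_e2m1(x)
        print(x, c, decode_e2m1(c))
```

A NaN input fails every comparison and therefore encodes as zero in this sketch; a real kernel would need to decide that case explicitly.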