CuTeDSL MLIR pipeline cannot lower any float→int op. All approaches fail: arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast. Production path: dsv4/kernels/cuda/quantize_nvfp4.cu (raw CUDA, works). For NVFP4-1.1 fusion, use post-epilogue CUDA kernel approach. Removed dead test files (test_ptx_*, test_fp4_isolate*, test_minimal_cmp*, test_dtype_store, test_threshold_round).
37 lines
1.7 KiB
Python
"""
NVFP4 quantization primitives: TOOLCHAIN BLOCKED in CuTeDSL.

CuTeDSL's MLIR lowering pipeline CANNOT lower any float→int operation:
- arith.fptosi → LLVM ERROR: unsupported operation
- llvm.inline_asm with cvt.rni.s32.f32 → LLVM ERROR: unsupported operation
- nvvm.inline_ptx with cvt.rni.s32.f32 → LLVM ERROR: unsupported operation
- llvm.bitcast Float32→Int32 → LLVM ERROR: unsupported operation

The pipeline has no path from Float32 MLIR types to Int32 MLIR types.
This is a fundamental toolchain limitation, not an implementation issue.

Production path: Use dsv4/kernels/cuda/quantize_nvfp4.cu instead.
That kernel uses __float2int_rn() and raw CUDA intrinsics, and it works.

For NVFP4-1.1 (fusing FP4 quant into MoE SwiGLU epilogue), the approach
will be a post-epilogue CUDA kernel that reads BF16 from GMEM and quantizes
to FP4. See ROADMAP.md Priority 3.

This file is kept for documentation of the toolchain limitation.
If CuTeDSL gains float→int support in the future, these primitives can be
reimplemented here.
"""
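For host-side checks against the CUDA path, the rounding performed by `__float2int_rn()` (round to nearest, ties to even) can be mirrored in plain Python. The sketch below is a reference model only, not code taken from quantize_nvfp4.cu. The helper name `float2int_rn_reference` is hypothetical, and the NaN-to-0 and saturating behavior are assumptions modeled on PTX `cvt.rni.s32.f32` semantics that should be confirmed against the kernel before relying on them.

```python
import math

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def float2int_rn_reference(x: float) -> int:
    """Host-side model of round-to-nearest-even float -> int32 conversion.

    Assumptions (not confirmed by the source): NaN maps to 0 and
    out-of-range inputs saturate to the int32 limits.
    """
    if math.isnan(x):
        return 0
    if math.isinf(x):
        return INT32_MAX if x > 0 else INT32_MIN
    # Python's built-in round() rounds exact halves to the nearest even integer.
    return max(INT32_MIN, min(INT32_MAX, round(x)))
```

A model like this is only useful for comparing expected integer outputs with what the CUDA kernel writes back; it does not exercise the CuTeDSL lowering path at all.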
# All functions removed. Use dsv4/kernels/cuda/quantize_nvfp4.cu instead.
#
# Attempted approaches (all failed with "LLVM ERROR: unsupported operation"):
#   1. arith.fptosi (cutlass.Int32(float_val))
#   2. llvm.inline_asm with cvt.rni.s32.f32, cvt.rzi.s32.f32, cvt.rmi.s32.f32
#   3. nvvm.inline_ptx with cvt.rni.s32.f32
#   4. llvm.bitcast Float32 → Int32
#   5. Threshold rounding (Float32 comparisons selecting Int32 constants).
#      This SHOULD work since it never generates fptosi, but even a trivial
#      Int32 GMEM store kernel fails on the B200 as of 2026-05-28.
#      Possible GPU state corruption from prior LLVM ERROR crashes.
#      TODO: Re-test threshold approach after B200 GPU state is clean.
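The threshold-rounding idea from item 5 can be illustrated in plain Python: every step is a float comparison that selects or accumulates an integer constant, so no float→int conversion is ever needed. This is a host-side sketch, not CuTeDSL code and not the removed implementation. The names `E2M1_VALUES`, `quantize_e2m1`, and `decode_e2m1` are hypothetical, and the E2M1 magnitude table, the sign-bit position, and the ties-to-even threshold choices are assumptions about the FP4 encoding that should be checked against the CUDA kernel.

```python
import math

# Assumed E2M1 magnitudes, indexed by the 3-bit magnitude code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# Midpoints between neighbouring magnitudes. The flag says whether a value
# exactly on the midpoint rounds up (True) so that ties land on even codes.
_THRESHOLDS = (
    (0.25, False), (0.75, True), (1.25, False), (1.75, True),
    (2.5, False), (3.5, True), (5.0, False),
)


def quantize_e2m1(x: float) -> int:
    """Return a 4-bit code (sign bit 3, magnitude bits 0-2) using comparisons only."""
    sign = 0b1000 if math.copysign(1.0, x) < 0 else 0
    mag = abs(x)
    code = 0
    for threshold, round_up_on_tie in _THRESHOLDS:
        passed = mag >= threshold if round_up_on_tie else mag > threshold
        code += 1 if passed else 0  # select an integer constant, no float cast
    # NaN fails every comparison and maps to magnitude code 0;
    # large magnitudes saturate at code 7 (6.0).
    return sign | code


def decode_e2m1(code: int) -> float:
    """Map a 4-bit code back to its float value."""
    magnitude = E2M1_VALUES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude
```

For example, `decode_e2m1(quantize_e2m1(2.9))` gives `3.0`. Block scaling, which a full NVFP4 path also needs, is deliberately left out of this sketch.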