CUTLASS headers transitively include cuda_bf16.h, which has an
in_place_from bug under CUDA 13.2. Write tcgen05 PTX directly via inline
asm instead. No dependency on CUTLASS C++: pure PTX + CUDA runtime.
CuTeDSL MLIR pipeline cannot lower any float→int op. All approaches fail:
arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast.
Production path: dsv4/kernels/cuda/quantize_nvfp4.cu (raw CUDA, works).
For NVFP4-1.1 fusion, use the post-epilogue CUDA kernel approach.
Removed dead test files (test_ptx_*, test_fp4_isolate*, test_minimal_cmp*,
test_dtype_store, test_threshold_round).
CuTeDSL MLIR pipeline cannot lower any float→int conversion:
arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, and llvm.bitcast all
fail with 'LLVM ERROR: unsupported operation'. The pipeline has no
path from Float32 to Int32 MLIR types.
Threshold-based RNE is the mathematically correct software implementation:
- Float32 comparisons select Int32 *constants* (no arith.fptosi)
- Using > or >= at each .5 boundary implements round-to-nearest-even ties
- Equivalent to PTX cvt.rni.s32.f32 over the bounded input range
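
A minimal Python sketch of the threshold idea described above, for
illustration only. The function name rne_by_thresholds and the [0, 8]
range are assumptions (the range is suggested by the name round_rne_u0_8);
this is not the CuTeDSL code itself. The result is always one of a fixed
set of integer constants chosen by float comparisons, so no float-to-int
cast is ever needed.

```python
def rne_by_thresholds(x: float, lo: int = 0, hi: int = 8) -> int:
    """Round x to the nearest integer in [lo, hi], ties to even.

    Hypothetical helper: uses only float comparisons that select
    integer constants, never a float->int conversion of x.
    Inputs outside [lo, hi] clamp to the nearest end of the range.
    """
    result = lo
    for k in range(lo, hi):
        boundary = k + 0.5  # exactly representable in binary floating point
        # A tie at k + 0.5 must go to the even neighbour:
        #   k even -> stay at k,   so step up only when x >  boundary
        #   k odd  -> go to k + 1, so step up when      x >= boundary
        if k % 2 == 0:
            step_up = x > boundary
        else:
            step_up = x >= boundary
        if step_up:
            result = k + 1  # an integer constant, not derived from x
    return result
```

Python's built-in round() also rounds ties to even, so it can serve as a
reference when checking a table like this one.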
llvm.inline_asm fails with 'LLVM ERROR: unsupported operation' in the
CuTeDSL lowering pipeline. Switch to nvvm.inline_ptx, which is native to
the NVVM dialect and lowers correctly.
- f32_to_i32_rni: cvt.rni.s32.f32 via nvvm.inline_ptx
- f32_to_i32_rz: cvt.rzi.s32.f32 via nvvm.inline_ptx
- f32_to_i32_rmi: cvt.rmi.s32.f32 via nvvm.inline_ptx
- Add f32_to_i32_rni (cvt.rni.s32.f32) for round-to-nearest-even
- Add f32_to_i32_rz (cvt.rzi.s32.f32) for round-toward-zero
- Add f32_to_i32_rmi (cvt.rmi.s32.f32) for round-to-minus-infinity
- Replace round_rne_u0_8 and abs_scaled_to_e2m1_idx threshold hacks
  with proper PTX hardware rounding in fp8_e4m3_from_float32
- quantize_e2m1_nibble now uses f32_to_i32_rni + LUT logic for half_step
- Add test_ptx_convert.py for inline PTX conversion verification
- This is the CORRECT approach per NVFP4-1.1_INLINE_PTX_APPROACH.md option 1
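
For reference, the three rounding modes named above can be modelled in
plain Python as follows. This is a hedged sketch: the ref_* names are
hypothetical, Python floats are 64-bit rather than Float32, and only
finite inputs inside the Int32 range are covered (NaN and out-of-range
behaviour of the PTX instruction are not modelled).

```python
import math


def ref_rni(x: float) -> int:
    """Model of cvt.rni: nearest integer, ties to even."""
    return round(x)  # Python's round() on a float uses ties-to-even


def ref_rz(x: float) -> int:
    """Model of cvt.rzi: round toward zero (truncate)."""
    return math.trunc(x)


def ref_rmi(x: float) -> int:
    """Model of cvt.rmi: round toward minus infinity (floor)."""
    return math.floor(x)
```

The modes only differ on non-integer inputs: for example, -1.5 gives -2
under rni and rmi but -1 under rz.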
All float-to-int conversions replaced with threshold comparisons:
- round_rne_u0_8: mantissa rounding via Float32 comparisons → Int32 constants
- abs_scaled_to_e2m1_idx: direct |scaled| → E2M1 index (no half_step needed)
- Verified: 0 failures in 500 trials against the Python reference
Key thresholds (RNE boundaries):
- 0.25, 0.75, 1.25, 1.75, 2.75, 3.75, 5.25, each using > or >= for RNE tie-breaking
- Fixed: 2.75 must use >= (not >) so the tie matches RNE round(5.5)=6
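
The 2.75 fix can be checked mechanically. The sketch below assumes each
listed threshold t is a tie of round(2 * |scaled|), an inference from
the round(5.5)=6 remark and not something the commit states. The helper
name tie_operator is hypothetical. It derives which comparison RNE
requires at each threshold.

```python
import math

# Thresholds as listed in the commit message.
THRESHOLDS = [0.25, 0.75, 1.25, 1.75, 2.75, 3.75, 5.25]


def tie_operator(threshold: float) -> str:
    """Return the comparison RNE needs at this threshold.

    Assumption: the value being rounded is 2 * threshold, which lands
    exactly on a .5 tie for every entry in THRESHOLDS.
    """
    doubled = 2.0 * threshold    # e.g. 2.75 -> 5.5, exact in binary
    lower = math.floor(doubled)  # the integer just below the tie
    # RNE sends the tie to the even neighbour. If the lower integer is
    # odd, the tie rounds up, so the threshold itself must be included.
    return ">=" if lower % 2 == 1 else ">"


# Comparison operator per threshold under the stated assumption.
OPERATORS = {t: tie_operator(t) for t in THRESHOLDS}
```

Under that assumption, 2.75 maps to >= because 5.5 rounds up to the even
value 6, while 0.25 maps to > because 0.5 rounds down to 0.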