Key fixes for fmha_epilogue_sm100.cuh hang:
- tcgen05.ld/st are WARP-COLLECTIVE: ALL 32 lanes must execute
- Old code guarded TMEM ops with if(tid==0) = warp divergence = HANG
- tmem_dealloc now uses tmem_base (value from alloc), not SMEM pointer
- Compute attention in SMEM, then do one-way TMEM pipeline:
SMEM → TMEM (warp-collective store) → regs (warp-collective load)
→ normalize in regs → BF16 cast → GMEM
- This proves the MoE-style one-way correction epilogue on FMHA
Also: enable TMEM kernel test + hd=128 in standalone test
ROOT CAUSE of TMET hang: tcgen05.fence.cta_group::1.sync.aligned is
NOT a valid PTX instruction. The correct TMEM ordering primitives are:
- tcgen05.wait::st.sync.aligned (wait for TMEM stores to complete)
- tcgen05.wait::ld.sync.aligned (wait for TMEM loads to complete)
Found in cutlass/arch/barrier.h fence_view_async_tmem_store/load.
New file: fmha_epilogue_sm100.cuh
- TMEM alloc/dealloc/load/store via tcgen05 PTX
- One-way correction epilogue: TMEM→regs→normalize→BF16→GMEM
- D1.5 fix: O rescale in REGISTERS (TMEM→regs→multiply→TMEM)
- Same pattern as MoE epilogue but with normalize instead of SwiGLU
- Unblocks D2 multi-CTA and NVFP4-1.2 (register slot for FP4 pack)
Test: hd=64 + hd=128, reference vs TMEM kernels
CuTeDSL MLIR pipeline cannot lower any float→int op. All approaches fail:
arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast.
Production path: dsv4/kernels/cuda/quantize_nvfp4.cu (raw CUDA, works).
For NVFP4-1.1 fusion, use post-epilogue CUDA kernel approach.
Removed dead test files (test_ptx_*, test_fp4_isolate*, test_minimal_cmp*,
test_dtype_store, test_threshold_round).
- Add f32_to_i32_rni (cvt.rni.s32.f32) for round-to-nearest-even
- Add f32_to_i32_rz (cvt.rzi.s32.f32) for round-toward-zero
- Add f32_to_i32_rmi (cvt.rmi.s32.f32) for round-to-minus-infinity
- Replace round_rne_u0_8 and abs_scaled_to_e2m1_idx threshold hacks
with proper PTX hardware rounding in fp8_e4m3_from_float32
- quantize_e2m1_nibble now uses f32_to_i32_rni + LUT logic for half_step
- Add test_ptx_convert.py for inline PTX conversion verification
- This is the CORRECT approach per NVFP4-1.1_INLINE_PTX_APPROACH.md option 1