New file: fmha_epilogue_sm100.cuh
- TMEM alloc/dealloc/load/store via tcgen05 PTX
- One-way correction epilogue: TMEM→regs→normalize→BF16→GMEM
- D1.5 fix: O rescale in REGISTERS (TMEM→regs→multiply→TMEM)
- Same pattern as MoE epilogue but with normalize instead of SwiGLU
- Unblocks D2 multi-CTA and NVFP4-1.2 (register slot for FP4 pack)
Test: hd=64 + hd=128, reference vs TMEM kernels
Use thread 0 for all computation (slow but correct).
SMEM for Q and O sharing across threads.
Online softmax with O rescale — correct D1.5 approach.
D3 SWA mask implemented.
Target: cos ~0.999998 then parallelize.
Simpler approach first: scalar Q@K^T, softmax, P@V in registers.
No TMEM/MMA yet — verify correctness first, then replace with tcgen05.
- 192-thread CTA, all threads cooperate on one (batch, head)
- Online softmax with O rescale (correct D1.5 approach)
- D3 SWA mask, D4 causal (TODO), D5c sink (TODO)
- KV loaded in blocks of 128 for SMEM efficiency
- Correctness target: cos ~0.999998 against PyTorch reference
- tcgen05.mma.cta_group::1.kind::f16 [tmem_c], desc_a, desc_b, idescE_hi, scaleC, {mask0..3}, pred
- idescE is upper 32 bits of the E descriptor
- scaleC is a float (1.0 for accumulate)
- mask is 4 uint32 values (0xFFFFFFFF for no masking)
CUTLASS headers transitively include cuda_bf16.h which has a CUDA 13.2
in_place_from bug. Writing tcgen05 PTX directly via inline asm instead.
No dependencies on CUTLASS C++ — pure PTX + CUDA runtime.
CuTeDSL MLIR pipeline cannot lower any float→int op. All approaches fail:
arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast.
Production path: dsv4/kernels/cuda/quantize_nvfp4.cu (raw CUDA, works).
For NVFP4-1.1 fusion, use post-epilogue CUDA kernel approach.
Removed dead test files (test_ptx_*, test_fp4_isolate*, test_minimal_cmp*,
test_dtype_store, test_threshold_round).
CuTeDSL MLIR pipeline cannot lower any float→int conversion:
arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast — all
fail with 'LLVM ERROR: unsupported operation'. The pipeline has no
path from Float32 to Int32 MLIR types.
Threshold RNE is the mathematically correct software implementation:
- Float32 comparisons select Int32 *constants* (no arith.fptosi)
- > vs >= at .5 boundaries implements round-to-nearest-even
- Equivalent to PTX cvt.rni.s32.f32 for bounded ranges