cp.async.bulk.tensor store (SMEM→GMEM) is NOT available on SM100.
The CUTLASS SM100 epilogue uses st.global directly.
The one-way epilogue pipeline is now:
1. TMEM → regs (tcgen05.ld, warp-collective)
2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE)
3. regs → SMEM (row-major, sO_epi)
4. SMEM → GMEM (st.global, direct write; no TMA store, no store mbarrier)
This is the same pattern as the MoE kernel but with st.global instead
of TMA store. Multi-CTA (D2) will use st.global with flat_divide coords.
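
Steps 2-4 above can be sketched as follows. This is a minimal illustration, not the kernel in this change: the tile shape (kRows, kCols), the kernel name epilogue_sketch, and the host helper run_epilogue_sketch are hypothetical, and step 1 (tcgen05.ld from TMEM) is stood in for by an ordinary global load so the sketch builds on any architecture. Only sO_epi and the step order come from the description above. The per-row SMEM writes are not bank-conflict-free; a real epilogue would swizzle or vectorize them.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical tile shape, for illustration only.
constexpr int kRows = 32;  // output rows in the tile (one thread per row)
constexpr int kCols = 64;  // head dimension

__global__ void epilogue_sketch(const float* acc, const float* row_sum, float* out)
{
    // Row-major staging buffer, named after sO_epi in the description.
    __shared__ float sO_epi[kRows * kCols];

    const int row = threadIdx.x;

    // Step 2: epilogue op in registers (normalize by the softmax row sum).
    const float inv = 1.0f / row_sum[row];

    // Step 3: regs -> SMEM, row-major. The global load here stands in for
    // the accumulator fragment that step 1 would have read from TMEM.
    for (int c = 0; c < kCols; ++c) {
        const float v = acc[row * kCols + c] * inv;
        sO_epi[row * kCols + c] = v;
    }
    __syncthreads();  // all rows staged before any thread reads other rows

    // Step 4: SMEM -> GMEM with plain global stores. Consecutive threads
    // write consecutive addresses, so the stores coalesce.
    for (int i = threadIdx.x; i < kRows * kCols; i += blockDim.x) {
        out[i] = sO_epi[i];
    }
}

// Host helper (hypothetical): runs one tile and checks the normalized output.
bool run_epilogue_sketch()
{
    const int n = kRows * kCols;
    std::vector<float> h_acc(n), h_sum(kRows), h_out(n, -1.0f);
    for (int r = 0; r < kRows; ++r) {
        h_sum[r] = float(r + 1);
        for (int c = 0; c < kCols; ++c) {
            h_acc[r * kCols + c] = float(r + 1) * float(c);  // expect out == c
        }
    }

    float *d_acc = nullptr, *d_sum = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_acc, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_sum, kRows * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d_acc, h_acc.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sum, h_sum.data(), kRows * sizeof(float), cudaMemcpyHostToDevice);

    epilogue_sketch<<<1, kRows>>>(d_acc, d_sum, d_out);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_acc); cudaFree(d_sum); cudaFree(d_out);

    for (int r = 0; ok && r < kRows; ++r) {
        for (int c = 0; c < kCols; ++c) {
            if (std::fabs(h_out[r * kCols + c] - float(c)) > 1e-3f) { ok = false; break; }
        }
    }
    return ok;
}
```

The only synchronization in this sketch is the single __syncthreads() between staging and the global write; the plain-store path needs no mbarrier, which is consistent with sMbarStore being dropped from SMEM.
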
Removed: tma_o from FmhaParams, fmha_multihead_decode_tma_launch,
sMbarStore from SMEM, broken TMA store PTX from fmha_tma.cuh.