Files
nvfp4-megamoe-kernel/dsv4
biondizzle c0379a0f86 P6: Remove broken TMA store — use direct GMEM write from SMEM
cp.async.bulk.tensor store (SMEM→GMEM) is NOT available on SM100.
The CUTLASS SM100 epilogue uses st.global directly.

The one-way epilogue pipeline is now:
  1. TMEM → regs (tcgen05.ld, warp-collective)
  2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE)
  3. regs → SMEM (row-major, sO_epi)
  4. SMEM → GMEM (direct write)

This is the same pattern as the MoE kernel but with st.global instead
of TMA store. Multi-CTA (D2) will use st.global with flat_divide coords.

Removed: tma_o from FmhaParams, fmha_multihead_decode_tma_launch,
sMbarStore from SMEM, broken TMA store PTX from fmha_tma.cuh.
2026-05-30 17:11:17 +00:00
..