One-way trip: TMEM -> registers (normalize) -> SMEM -> GMEM.
Replicates the epilogue_tma_store logic with a normalize step added.
Uses CUTLASS helpers for correct layout handling.
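The data path described above can be modeled on the host as a rough sketch. This is plain Python, not GPU code, and it does not use CUTLASS. The names `normalize_rows` and `epilogue_store` are hypothetical. The meaning of "normalize" is not stated here, so the sketch assumes each accumulator row is scaled by 1 / row_sum.

```python
# Host-side model of the one-way epilogue data path; not GPU code.
# All names are hypothetical, and "normalize" is ASSUMED to mean
# scaling each accumulator row by 1 / row_sum.

def normalize_rows(acc_tile, row_sums):
    """Register stage: scale each row of the tile by 1 / row_sum."""
    return [[v / s for v in row] for row, s in zip(acc_tile, row_sums)]

def epilogue_store(tmem_acc, row_sums, gmem_out, row0, col0):
    """Model TMEM -> registers (normalize) -> SMEM -> GMEM for one tile."""
    regs = [row[:] for row in tmem_acc]       # TMEM -> registers (copy)
    regs = normalize_rows(regs, row_sums)     # normalize while in registers
    smem = [row[:] for row in regs]           # registers -> SMEM staging buffer
    for i, row in enumerate(smem):            # SMEM -> GMEM (stand-in for a TMA store)
        gmem_out[row0 + i][col0:col0 + len(row)] = row
    return gmem_out

acc = [[1.0, 3.0], [2.0, 6.0]]                # accumulator tile
sums = [4.0, 8.0]                             # per-row normalization factors
out = [[0.0] * 4 for _ in range(2)]           # destination buffer
epilogue_store(acc, sums, out, 0, 2)          # write the tile at column offset 2
```

Each stage only hands data forward, which is what makes the trip one-way: the accumulator is read once and never written back, so `acc` is unchanged after the call.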