Key fixes for fmha_epilogue_sm100.cuh hang:
- tcgen05.ld/st are WARP-COLLECTIVE: ALL 32 lanes must execute them
- Old code guarded TMEM ops with if (tid == 0), so the other 31 lanes
  never issued the collective op = warp divergence = HANG
- tmem_dealloc now uses tmem_base (value from alloc), not SMEM pointer
- Compute attention in SMEM, then run a one-way TMEM pipeline:
    SMEM → TMEM (warp-collective store) → regs (warp-collective load)
    → normalize in regs → BF16 cast → GMEM
- This validates the MoE-style one-way correction epilogue on FMHA
Also: enable TMEM kernel test + hd=128 in standalone test
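
Illustration of the warp-collective rule and the dealloc fix described above.
This is a minimal sketch, not the code in fmha_epilogue_sm100.cuh. It assumes
sm_100a, a launch with exactly one warp (32 threads), and the tcgen05 PTX
syntax as documented in the PTX ISA. The kernel name, the parameters (in,
row_sum, out), and the 32-column allocation are hypothetical.

```cuda
#include <cstdint>
#include <cuda_bf16.h>

// Hypothetical single-warp kernel: one float per lane takes the one-way trip
// SMEM -> TMEM -> regs -> normalize -> BF16 -> GMEM.
__global__ void tmem_roundtrip_sketch(const float* in, const float* row_sum,
                                      __nv_bfloat16* out) {
    constexpr uint32_t kTmemCols = 32;   // assumed; power of two, >= 32
    __shared__ uint32_t tmem_base_smem;  // alloc writes the TMEM address here
    __shared__ float tile[32];
    const uint32_t lane = threadIdx.x;   // assumes blockDim.x == 32

    tile[lane] = in[lane];               // stand-in for "attention in SMEM"

    // Warp-collective alloc: every lane executes it, no if (tid == 0) guard.
    const uint32_t smem_dst =
        static_cast<uint32_t>(__cvta_generic_to_shared(&tmem_base_smem));
    asm volatile(
        "tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;"
        :: "r"(smem_dst), "r"(kTmemCols) : "memory");
    __syncthreads();
    const uint32_t tmem_base = tmem_base_smem;  // the value, not the pointer

    // SMEM -> TMEM: warp-collective store, all 32 lanes participate.
    const uint32_t bits = __float_as_uint(tile[lane]);
    asm volatile("tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};"
                 :: "r"(tmem_base), "r"(bits) : "memory");
    asm volatile("tcgen05.wait::st.sync.aligned;" ::: "memory");

    // TMEM -> regs: warp-collective load, all 32 lanes participate.
    uint32_t r;
    asm volatile("tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];"
                 : "=r"(r) : "r"(tmem_base) : "memory");
    asm volatile("tcgen05.wait::ld.sync.aligned;" ::: "memory");

    // Normalize in registers, cast to BF16, write to GMEM.
    const float v = __uint_as_float(r) / row_sum[lane];
    out[lane] = __float2bfloat16(v);

    // Dealloc takes the TMEM address returned by alloc, not smem_dst.
    __syncwarp();
    asm volatile(
        "tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;"
        ::: "memory");
    asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;"
                 :: "r"(tmem_base), "r"(kTmemCols) : "memory");
}
```

Per-lane work (the divide and the GMEM store) may still be predicated; only
the tcgen05 instructions need the full warp.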