nvfp4-megamoe-kernel

biondizzle 2eb44a00bf fix(tmem): warp-collective TMEM ops + one-way correction epilogue

Key fixes for fmha_epilogue_sm100.cuh hang:
- tcgen05.ld/st are WARP-COLLECTIVE: ALL 32 lanes must execute
- Old code guarded TMEM ops with if (tid == 0), so only one lane
  issued the collective op: warp divergence = HANG
- tmem_dealloc now takes tmem_base (the TMEM address value produced by
  alloc), not the SMEM pointer
- Compute attention in SMEM, then do one-way TMEM pipeline:
  SMEM → TMEM (warp-collective store) → regs (warp-collective load)
  → normalize in regs → BF16 cast → GMEM
- This shows the MoE-style one-way correction epilogue also works for FMHA
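
The last two stages of the pipeline above (normalize in regs → BF16 cast)
can be sketched as follows. This is a minimal illustration, not the code in
fmha_epilogue_sm100.cuh. It assumes "normalize" means scaling by the
reciprocal of the softmax row sum, which the commit does not state. The
helper names float_to_bf16_bits and normalize_and_cast are hypothetical. A
real kernel would more likely use __float2bfloat16_rn from cuda_bf16.h for
the cast.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Compiles as plain C++ or as CUDA; the qualifier is only added under nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: FP32 -> BF16 bit pattern, round-to-nearest-even.
// NaN inputs are not handled specially in this sketch.
HOST_DEVICE inline uint16_t float_to_bf16_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));
    // Add 0x7FFF plus the lowest kept bit so exact ties round to even.
    uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

// Hypothetical helper: normalize one lane's accumulator values that were
// already loaded into registers, then cast each one to BF16.
// ASSUMPTION: normalization is a multiply by 1 / row_sum.
HOST_DEVICE inline void normalize_and_cast(const float* acc, float row_sum,
                                           uint16_t* out, int n) {
    assert(row_sum > 0.0f);
    const float inv_sum = 1.0f / row_sum;
    for (int i = 0; i < n; ++i) {
        out[i] = float_to_bf16_bits(acc[i] * inv_sum);
    }
}
```

In this sketch every lane would call normalize_and_cast on its own register
values after the warp-collective load, so no per-thread guard is needed around
the TMEM access itself.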

Also: enable TMEM kernel test + hd=128 in standalone test

2026-05-28 07:27:25 +00:00