Files
nvfp4-megamoe-kernel/dsv4
biondizzle 33cedbee0a fix(tmem): TMEM ld/st are warp-collective — ALL 32 lanes must call them
Root cause of TMEM epilogue hang: tmem_store/tmem_load are
warp-collective operations requiring ALL 32 lanes to participate.

The loop 'for (col = lane; col < TMEM_O_COLS; col += WARP)' with
TMEM_O_COLS=16 and WARP=32 means only lanes 0-15 execute the op.
Lanes 16-31 skip it = warp divergence on collective = HANG.

Fix: loop over TMEM_N (>= 32, power of 2) so all 32 lanes
participate. Columns beyond TMEM_O_COLS write don't-care data
to allocated-but-unused TMEM columns.
2026-05-28 07:41:16 +00:00
..