Files
nvfp4-megamoe-kernel/dsv4
biondizzle cc7b17fdaa Fix B2 indexer: use 2-warps for TMEM read (P7 row-slice model)
ROOT CAUSE: The TMEM read for rows 32-63 was wrong. The 32x32b.x8
instruction reads 32 rows per warp. Per P7 docs, warp 0 sees rows 0-31
and warp 1 sees rows 32-63 from the SAME TMEM address. There is no TMEM
offset for different row groups — the row-to-lane mapping depends on
the warp ID.

Fix: warp 0 reads heads 0-31, warp 1 reads heads 32-63 from tb + col_base.
Cross-warp reduce via SMEM to compute full 64-head weighted ReLU scores.
2026-06-03 00:55:27 +00:00
..