Files
nvfp4-megamoe-kernel/tests
biondizzle 7189165a67 WIP: TMEM vector bridge not working (same cosine 0.513)
row_sum is PROVEN correct (29.25 vs 29.22 for row 0, ratio 1.001).
The ONLY bug is QK→PV row mapping in C9 normalization.

Tried: composition(tStS,(128,1)) for write, composition(tOtO,(128,1)) for read.
Same result — the composition preserves the fragments internal thread-to-address
mapping, so the same thread writes and reads the same TMEM address regardless
of which fragment layout is used for the composition.

Need: absolute row-coordinate indexed TMEM vector. Each QK thread writes
inv_row_sum to vec[QK_row_id], each PV thread reads from vec[PV_row_id].
The row_id comes from the identity tensor coordinate.

Alternative: implement FMHA correction_epilog pattern with dedicated
correction warp group that reads row metadata from the vector.
2026-05-21 19:26:15 +00:00
..