nvfp4-megamoe-kernel

biondizzle c09c660110 WIP: scalar C9 normalization - confirmed inv_row_sum is wrong

The C9 TMEM round-trip IS modifying O (confirmed by epilogue * 2.0 test).
However, inv_row_sum is wrong: each thread computes row_sum via .reduce(MAX) and
a packed f32x2 reduction, yet the result appears to be identical across all threads.

Next: need to dump the QK C-fragment coordinate tensor to understand
which rows each thread actually owns in the TMEM load partition.
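
Host-side reference for comparing against dumped values (a minimal sketch, not
kernel code). It assumes inv_row_sum is meant to be the usual softmax
normalizer, 1 / sum(exp(s - row_max)) per row; the helper names and the sample
scores are hypothetical. If the kernel's per-thread values pass
all_rows_identical while the reference values do not, the reduction is likely
not per-row.

```python
import math

def reference_inv_row_sum(scores):
    """Per-row 1 / sum(exp(s - row_max)); assumed definition of inv_row_sum."""
    out = []
    for row in scores:
        row_max = max(row)  # subtract the row max for numerical stability
        row_sum = sum(math.exp(s - row_max) for s in row)
        out.append(1.0 / row_sum)
    return out

def all_rows_identical(values, tol=1e-6):
    """True if every value matches the first one within tol."""
    return all(abs(v - values[0]) <= tol for v in values)

# Hypothetical QK scores: three rows with clearly different distributions.
scores = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 3.0],
    [5.0, -5.0, 0.0, 0.0],
]
ref = reference_inv_row_sum(scores)
```

Rows with different score distributions should give different normalizers, so
identical values across threads that own different rows point at the reduction
or the row ownership, not at the TMEM round-trip.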

2026-05-21 19:09:32 +00:00