Files
nvfp4-megamoe-kernel/tests/unit
biondizzle ce91aa26e4 WIP: QK-partitioned C9 normalization (does not work)
The QK composition(tStS, (128,64)) view of O TMEM region does not align
with the actual PV C-fragment layout. Cannot read O with QK partition.

Need to use TMEM vector approach:
1. Store inv_row_sum via QK partition (composition(tStS, (128,1)))
2. Read inv_row_sum via PV partition (need PV-partitioned view of vector)
3. Apply normalization in PV-partitioned O TMEM access

The key challenge: creating a PV-partitioned read of the vector TMEM region
that was written with QK partition. This is what CUTLASS FMHA does with
its correction warp group.
2026-05-21 18:59:21 +00:00
..