Files
nvfp4-megamoe-kernel/tests
biondizzle d7aa4da686 BREAKTHROUGH: unnormalized P@V cosine 0.999998 for n=128!
The softmax math (exp2, P store, PV) is correct for single-tile.
The bug is ONLY in C6/C9 normalization: applying inv_row_sum
using PV partition instead of QK partition.

n=128 (single tile): cosine 0.999998 PASS
n=256/384 (multi-tile): C6 O-rescale using wrong partition = FAIL

Fix: normalize O using QK row coordinates, not PV row coordinates.
Can use TMEM vector to bridge QK partition to PV partition.
2026-05-21 18:55:00 +00:00
..