The softmax math (exp2, P store, PV) is correct for a single tile. The bug is ONLY in the C6/C9 normalization: inv_row_sum is applied using the PV partition's row mapping instead of the QK partition's.

n=128 (single tile): cosine 0.999998, PASS
n=256/384 (multi-tile): C6 O-rescale uses the wrong partition, FAIL

Fix: normalize O using QK row coordinates, not PV row coordinates. A TMEM vector can be used to bridge the QK partition to the PV partition.
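To make the failure mode concrete, here is a minimal pure-Python sketch of a tiled base-2 online softmax. It is a toy model, not the kernel: the names `attention_tiled`, `pv_row_of_lane`, and `lookup_by_row`, the 4-row data, and the lane-to-row permutation are all hypothetical. It assumes the per-row statistics are indexed by QK row while the O accumulator is indexed by a PV lane that owns a different row, and it models only the C6 rescale (the final division is done by row coordinate in both modes). The `alpha` list stands in for a shared vector that lets the PV side look up a value by QK row.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(Q, K, V):
    """Untiled base-2 softmax(Q K^T) V, used as ground truth."""
    out = []
    for q in Q:
        s = [dot(q, k) for k in K]
        m = max(s)
        p = [2.0 ** (x - m) for x in s]
        l = sum(p)
        out.append([sum(p[j] * V[j][c] for j in range(len(V))) / l
                    for c in range(len(V[0]))])
    return out

def attention_tiled(Q, K, V, tile, pv_row_of_lane, lookup_by_row):
    """Online softmax over KV tiles; O is stored per PV lane (hypothetical layout)."""
    rows, d = len(Q), len(V[0])
    m = [-math.inf] * rows                 # running row max, indexed by QK row
    l = [0.0] * rows                       # running row sum, indexed by QK row
    O = [[0.0] * d for _ in range(rows)]   # accumulator, indexed by PV lane
    for start in range(0, len(K), tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        alpha, P = [0.0] * rows, []
        for r in range(rows):              # QK side: statistics per row
            s = [dot(Q[r], k) for k in Kt]
            m_new = max(m[r], max(s))
            alpha[r] = 2.0 ** (m[r] - m_new)   # 0.0 on the first tile (m = -inf)
            p = [2.0 ** (x - m_new) for x in s]
            l[r] = l[r] * alpha[r] + sum(p)
            m[r] = m_new
            P.append(p)
        for lane in range(rows):           # PV side: each lane owns some other row
            r = pv_row_of_lane[lane]
            # Correct: look up the factor by row coordinate.
            # Buggy: reuse the lane index as if both partitions matched.
            scale = alpha[r] if lookup_by_row else alpha[lane]
            for c in range(d):
                pv = sum(P[r][j] * Vt[j][c] for j in range(len(Vt)))
                O[lane][c] = O[lane][c] * scale + pv
    out = [None] * rows
    for lane in range(rows):               # final 1/row_sum, by row coordinate
        r = pv_row_of_lane[lane]
        out[r] = [x / l[r] for x in O[lane]]
    return out

def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

# Hypothetical toy data: 4 query rows, 4 keys, head dim 1, value dim 2.
Q = [[1.0], [-1.0], [2.0], [-2.0]]
K = [[0.5], [1.0], [2.0], [3.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
PV_ROW_OF_LANE = [1, 0, 3, 2]              # lane t holds row t ^ 1
REF = attention_reference(Q, K, V)

for tile, by_row in [(4, False), (2, True), (2, False)]:
    got = attention_tiled(Q, K, V, tile, PV_ROW_OF_LANE, by_row)
    print(f"tile={tile} lookup_by_row={by_row} max_abs_diff={max_abs_diff(got, REF):.3e}")
```

In this model the single-tile run is unaffected by the wrong lookup because the accumulator is still zero when the first rescale is applied, while the two-tile run diverges from the reference unless the factor is fetched by row coordinate. Whether the real kernel passes at n=128 for the same reason is not established by this sketch.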