Previous O rescale attempt broke n=128 (0.464773). Revert to the known-good softmax code and apply only the TMA fix:

    tBgK[(None,None,0,0)] → tBgK[(None,0,None,0)]

Expected:
- n=128: cos 0.999998 (same as the working version)
- n=256: cos 0.71 (the TMA fix loads 2 tiles, but there is no O rescale)
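
Why n=256 is still expected to be wrong without the O rescale can be illustrated with a small NumPy sketch. This is not the kernel code: it assumes a tile size of 128 keys, a standard online-softmax update, and random inputs, and every name in it (tiled_attention, rescale_o, compare) is hypothetical. It will not reproduce the exact cosine values quoted above, only the qualitative behaviour: one tile is unaffected, two tiles diverge unless O is rescaled when the running row max changes.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two arrays, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def reference_attention(q, k, v):
    """Plain softmax(QK^T / sqrt(d)) V computed in one pass."""
    s = q @ k.T / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def tiled_attention(q, k, v, tile=128, rescale_o=True):
    """Online softmax over K/V tiles (tile size is an assumption).

    With rescale_o=False the running max and row sum are still updated
    correctly, but the accumulator O is never rescaled.
    """
    rows = q.shape[0]
    m = np.full((rows, 1), -np.inf)   # running row max
    l = np.zeros((rows, 1))           # running row sum
    o = np.zeros((rows, v.shape[1]))  # unnormalised output accumulator
    for start in range(0, k.shape[0], tile):
        s = q @ k[start:start + tile].T / np.sqrt(q.shape[1])
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        alpha = np.exp(m - m_new)     # 0 on the first tile, since m = -inf
        p = np.exp(s - m_new)
        l = alpha * l + p.sum(axis=1, keepdims=True)
        if rescale_o:
            o = alpha * o             # the step that is missing here
        o = o + p @ v[start:start + tile]
        m = m_new
    return o / l

def compare(n, d=64, seed=0):
    """Return (cos with O rescale, cos without) for n keys, 128 queries."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((128, d))
    k = rng.standard_normal((n, d))
    v = rng.standard_normal((n, d))
    ref = reference_attention(q, k, v)
    return (cosine(tiled_attention(q, k, v, rescale_o=True), ref),
            cosine(tiled_attention(q, k, v, rescale_o=False), ref))

if __name__ == "__main__":
    for n in (128, 256):
        print(n, compare(n))
```

With a single tile there is no earlier O to rescale, so both variants match the reference. With two tiles, any row whose max increases in the second tile keeps an over-weighted first-tile contribution unless O is multiplied by exp(m_old - m_new).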