The softmax warps store P at TMEM column offset tmem_p0_offset=32, so the PV MMA must read its P operand from that same offset. tOrP0 was constructed without the offset, so PV read from TMEM column 0 (where S lives) instead of column 32 (where P lives). This was the root cause of the NaN/zero outputs in the D1 tests.
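
The mismatch can be illustrated with a toy host-side model. This is only a sketch under stated assumptions, not the kernel code: TMEM is modeled as a flat Python list of columns, and the names store, load, NUM_COLS, and TILE_COLS are hypothetical and exist only for this illustration. Only the two offsets (0 for S, 32 for P) come from the description above.

```python
# Toy model of the offset mismatch. Not the real kernel or any library API.
TMEM_S_OFFSET = 0     # column where S lives (from the description above)
TMEM_P0_OFFSET = 32   # column where the softmax warps store P
NUM_COLS = 64         # hypothetical TMEM width for this sketch
TILE_COLS = 4         # hypothetical tile width for this sketch


def store(tmem, offset, values):
    """Write a tile into the modeled TMEM starting at the given column."""
    tmem[offset:offset + len(values)] = values


def load(tmem, offset, width):
    """Read a tile of the given width starting at the given column."""
    return tmem[offset:offset + width]


tmem = [0.0] * NUM_COLS
s_tile = [10.0, -3.0, 7.5, 2.0]    # made-up raw scores (S)
p_tile = [0.7, 0.1, 0.15, 0.05]    # made-up softmax probabilities (P)

store(tmem, TMEM_S_OFFSET, s_tile)
store(tmem, TMEM_P0_OFFSET, p_tile)

# Before the fix: the PV operand view has no offset, so it aliases S.
buggy_operand = load(tmem, 0, TILE_COLS)

# After the fix: the PV operand view uses the offset the softmax warps wrote to.
fixed_operand = load(tmem, TMEM_P0_OFFSET, TILE_COLS)
```

In this model the buggy read simply returns the S values. In the actual kernel, S and P may also differ in data type, so the misread bits could plausibly be interpreted as arbitrary values, which would be consistent with the NaN/zero symptoms.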