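tmem_p0_offset is measured in FP32 (32-bit) columns, but tOrP is indexed in BF16 (16-bit) elements. Each FP32 column holds two BF16 elements, so the offset in tOrP's units is tmem_p0_offset * (32 / 16) = tmem_p0_offset * 2.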
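
The unit conversion above can be sketched in a few lines of plain Python. This only illustrates the arithmetic and is not the kernel's actual code: the helper name `fp32_columns_to_bf16_elements` and the sample offset of 32 are hypothetical, chosen only to show the factor of 2.

```python
# Bit widths of the two element types involved in the conversion.
FP32_BITS = 32
BF16_BITS = 16


def fp32_columns_to_bf16_elements(offset_in_fp32_columns: int) -> int:
    """Convert an offset counted in FP32 columns to one counted in BF16 elements.

    Hypothetical helper for illustration only; it is not part of any library.
    """
    # One 32-bit column holds 32 / 16 = 2 BF16 elements.
    elements_per_column = FP32_BITS // BF16_BITS
    return offset_in_fp32_columns * elements_per_column


# Hypothetical sample value standing in for tmem_p0_offset.
sample_p0_offset = 32
print(fp32_columns_to_bf16_elements(sample_p0_offset))  # prints 64
```

Deriving the factor from the two bit widths, rather than hard-coding 2, keeps the conversion correct if either data type changes, for example if the narrower type becomes 8 bits wide.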