Each softmax thread writes its P values to sP using the (m,k) coordinates from tTMEM_LOADcS. The k coordinate is decomposed into (k0,k1,k2) to match the K modes of sP's ((128,16),1,(4,2)) layout. CuTeDSL tensor indexing applies the swizzle automatically, so no make_tiled_copy is needed.
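
The k decomposition can be sketched in plain Python. This is only an illustration under stated assumptions: k0 is taken to span the 16-wide extent of the first mode, (k1,k2) the (4,2) mode, and the split is assumed to follow CuTe's usual leftmost-fastest (colexicographic) ordering. The names decompose_k and sP_coord are hypothetical and do not come from the kernel.

```python
# Hypothetical sketch: split a flat k index in [0, 128) into (k0, k1, k2)
# for an sP shape of ((128,16),1,(4,2)). Extents are read off that shape;
# the leftmost-fastest ordering is an assumption, not taken from the source.
K0, K1, K2 = 16, 4, 2


def decompose_k(k: int) -> tuple[int, int, int]:
    """Return (k0, k1, k2) such that k == k0 + K0 * (k1 + K1 * k2)."""
    if not 0 <= k < K0 * K1 * K2:
        raise ValueError(f"k={k} is outside [0, {K0 * K1 * K2})")
    k0 = k % K0            # position inside the 16-wide inner extent
    k1 = (k // K0) % K1    # which of the 4 chunks
    k2 = k // (K0 * K1)    # which of the 2 groups of chunks
    return k0, k1, k2


def sP_coord(m: int, k: int):
    """Build a nested coordinate shaped like ((m, k0), 0, (k1, k2))."""
    k0, k1, k2 = decompose_k(k)
    return ((m, k0), 0, (k1, k2))
```

Under these assumptions, a thread holding the value at (m,k) would index sP with the nested coordinate returned by sP_coord(m, k), and the swizzle would be applied by sP's layout rather than by the caller.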