write_smem_canonical used NTHREADS=192 as the stride, but in the TMA kernel only the load warp (32 threads) calls it. With threadIdx.x in [160,191] and stride 192, only 32 out of 2048 elements got written. Fix: template STRIDE parameter, default 192, TMA kernel uses 32.