nvfp4-megamoe-kernel/dsv4/kernels
biondizzle 8e09fae3a1 fix: warp-stride for TMA canonical writes — only load warp calls them
write_smem_canonical used NTHREADS=192 as the stride, but in the TMA
kernel only the load warp (32 threads) calls it. With threadIdx.x in
[160,191] and stride 192, only 32 out of 2048 elements got written.
Fix: add a template STRIDE parameter (default 192); the TMA kernel passes 32.
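
The idea behind the fix can be sketched on the host in plain C++ rather than CUDA. This is only a model under stated assumptions: `write_canonical_model` and `covered_elements` are hypothetical names, not the repository's `write_smem_canonical`, and the calling threads are assumed to be numbered 0..callers-1 within their group. The real function's indexing may differ, so the counts here need not match the figure quoted above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a strided shared-memory write.
// STRIDE defaults to 192, the full block size named in the commit.
template <int STRIDE = 192>
void write_canonical_model(std::vector<int>& smem, int lane) {
    // Each caller starts at its own index and steps by STRIDE.
    for (std::size_t i = static_cast<std::size_t>(lane); i < smem.size();
         i += static_cast<std::size_t>(STRIDE)) {
        smem[i] = 1;  // mark this element as written
    }
}

// Simulates `callers` threads (assumed to be numbered 0..callers-1)
// and returns how many of the n elements were written at least once.
template <int STRIDE>
std::size_t covered_elements(std::size_t n, int callers) {
    std::vector<int> smem(n, 0);
    for (int lane = 0; lane < callers; ++lane) {
        write_canonical_model<STRIDE>(smem, lane);
    }
    std::size_t written = 0;
    for (int v : smem) {
        written += static_cast<std::size_t>(v);
    }
    return written;
}
```

In this model, `covered_elements<32>(2048, 32)` covers the whole buffer, while `covered_elements<192>(2048, 32)` leaves gaps because the stride is larger than the number of threads that actually call the function.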
2026-05-29 18:25:47 +00:00