DeepGEMM/csrc/jit_kernels
biondizzle c71fb97687 fix: L1 output TMA smem_inner_dim was block_n/4, should be block_n/2
Packed E2M1 output has 2 elements per byte, so block_n elements = block_n/2 bytes.
block_n/4 was under-sizing the TMA SMEM row by 2x → OOB write → LAUNCH_FAILED.
2026-05-12 14:58:11 +00:00
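The byte arithmetic behind this fix can be sketched as follows. This is a minimal illustration, not DeepGEMM code: `packed_e2m1_row_bytes`, `row_fits`, and the `block_n = 128` value are hypothetical names and numbers chosen for the example. The only fact taken from the commit message is that packed E2M1 stores 2 elements per byte.

```cpp
#include <cassert>
#include <cstddef>

// Assumption for illustration: E2M1 is a 4-bit format, so two elements share one byte.
constexpr std::size_t kE2M1ElementsPerByte = 2;

// Hypothetical helper: bytes needed for one row of `num_elements` packed E2M1 values.
constexpr std::size_t packed_e2m1_row_bytes(std::size_t num_elements) {
    return num_elements / kE2M1ElementsPerByte;
}

// Hypothetical helper: does a row of `provisioned_bytes` hold `num_elements` packed values?
constexpr bool row_fits(std::size_t provisioned_bytes, std::size_t num_elements) {
    return provisioned_bytes >= packed_e2m1_row_bytes(num_elements);
}

// Runs the example with an illustrative tile width of 128 elements.
inline void check_examples() {
    constexpr std::size_t block_n = 128;     // illustrative value only
    assert(row_fits(block_n / 2, block_n));  // corrected size: 64 bytes, fits
    assert(!row_fits(block_n / 4, block_n)); // old size: 32 bytes, half of what is needed
}
```

With these illustrative numbers, the old `block_n / 4` expression provisions 32 bytes for a row that needs 64, which matches the 2x under-sizing the commit message describes.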