DeepGEMM/csrc
biondizzle 48b5b2b702 fix: TMA dimensions for packed FP4 must be in individual FP4 values (not bytes)
CUDA docs: 'Dimension for the packed data types must reflect the number
of individual U# values.' For 16U4_ALIGN8B, gmem/smem inner dims must be
FP4 value counts, not byte counts. Callers pass byte-oriented dimensions,
so double them (each byte packs two FP4 values). gmem_outer_stride stays in bytes.
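
The unit conversion described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the `TmaDims` struct, the `to_packed_fp4_dims` helper, and its parameter names are hypothetical, and leaving the outer (row) dimensions unchanged is an assumption based on the message only mentioning inner dims. It shows the inner dimensions being doubled while the outer stride is passed through in bytes.

```cpp
#include <cassert>
#include <cstdint>

// Two FP4 values are packed into each byte.
constexpr uint64_t kFp4PerByte = 2;

// Hypothetical container for the values handed to the tensor-map encoder.
struct TmaDims {
    uint64_t gmem_inner;         // inner global-memory extent, in FP4 values
    uint64_t gmem_outer;         // outer extent (rows), assumed unchanged
    uint32_t smem_inner;         // inner shared-memory box extent, in FP4 values
    uint32_t smem_outer;         // outer box extent (rows), assumed unchanged
    uint64_t gmem_outer_stride;  // row stride, still in bytes
};

// Hypothetical helper: callers supply byte-oriented inner dimensions,
// which are converted to FP4 value counts. The stride is not converted.
constexpr TmaDims to_packed_fp4_dims(uint64_t gmem_inner_bytes,
                                     uint64_t gmem_outer,
                                     uint32_t smem_inner_bytes,
                                     uint32_t smem_outer,
                                     uint64_t gmem_outer_stride_bytes) {
    return TmaDims{
        gmem_inner_bytes * kFp4PerByte,
        gmem_outer,
        static_cast<uint32_t>(smem_inner_bytes * kFp4PerByte),
        smem_outer,
        gmem_outer_stride_bytes,
    };
}
```

For example, a row of 64 bytes holds 128 FP4 values, so the inner global dimension becomes 128 while the row stride remains 64 bytes.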
2026-05-12 17:39:07 +00:00