The CUDA docs state: 'Dimension for the packed data types must reflect the number of individual U# values.' For 16U4_ALIGN8B, the gmem and smem inner dimensions must therefore be FP4 value counts, not byte counts. Each byte packs two FP4 values, so double the byte-oriented inner dimensions passed by callers. gmem_outer_stride stays in bytes.
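
A minimal host-side sketch of the conversion described above, assuming callers pass byte-oriented inner dimensions. The struct `Fp4TmaDims` and the helper `to_fp4_dims` are hypothetical names used only for illustration and are not part of the CUDA API or of this codebase. Only the two inner dimensions are rescaled; the outer stride is passed through unchanged.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the values later handed to the tensor-map encoder.
struct Fp4TmaDims {
    std::uint64_t gmem_inner_elems;         // inner gmem dimension, in FP4 values
    std::uint32_t smem_inner_elems;         // inner smem (box) dimension, in FP4 values
    std::uint64_t gmem_outer_stride_bytes;  // outer stride, still in bytes
};

// Two 4-bit values are packed into each byte.
constexpr std::uint32_t kFp4PerByte = 2;

// Hypothetical helper: converts byte-oriented inner dimensions to FP4 value
// counts and leaves the outer stride in bytes.
inline Fp4TmaDims to_fp4_dims(std::uint64_t gmem_inner_bytes,
                              std::uint32_t smem_inner_bytes,
                              std::uint64_t gmem_outer_stride_bytes) {
    assert(gmem_inner_bytes > 0 && smem_inner_bytes > 0);
    Fp4TmaDims dims;
    dims.gmem_inner_elems = gmem_inner_bytes * kFp4PerByte;
    dims.smem_inner_elems = smem_inner_bytes * kFp4PerByte;
    dims.gmem_outer_stride_bytes = gmem_outer_stride_bytes;  // not doubled
    return dims;
}
```

For example, a caller that passes a 128-byte gmem row, a 64-byte smem box, and a 256-byte outer stride would end up with inner dimensions of 256 and 128 FP4 values, while the stride remains 256 bytes.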