CUDA docs: 'Dimension for the packed data types must reflect the number
of individual U# values.' For 16U4_ALIGN8B, the gmem/smem inner dims must
therefore be FP4 value counts, not byte counts. Callers pass byte-oriented
dimensions, so double them (two FP4 values per byte). gmem_outer_stride
stays in bytes.
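
A minimal sketch of the conversion, assuming two FP4 values per byte. The
helper name, the struct, and its field names are hypothetical and are not
taken from the patch:

```cpp
#include <cstdint>

// Assumption: 16U4_ALIGN8B packs 16 four-bit values into 8 bytes,
// i.e. two FP4 values per byte.
constexpr int kFP4ValuesPerByte = 2;

// Hypothetical container for the values handed to the tensor-map encoder.
struct Fp4TmaDims {
    uint64_t gmem_inner;              // in FP4 values
    uint32_t smem_inner;              // in FP4 values
    uint64_t gmem_outer_stride_bytes; // still in bytes
};

// Converts byte-oriented inner dims (as callers pass them) into FP4 value
// counts. The outer stride is passed through unchanged.
constexpr Fp4TmaDims to_fp4_value_dims(uint64_t gmem_inner_bytes,
                                       uint32_t smem_inner_bytes,
                                       uint64_t gmem_outer_stride_bytes) {
    return {gmem_inner_bytes * kFP4ValuesPerByte,
            smem_inner_bytes * kFP4ValuesPerByte,
            gmem_outer_stride_bytes};
}
```

For example, a 64-byte gmem row with a 32-byte smem box would be described
as 128 and 64 FP4 values, while the 64-byte outer stride is left as is.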
kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate of the
kPackedFP4 case. The real fix is in mega_nvfp4.hpp: change the tensors
from kUInt8 to kInt8 so they match the existing kPackedFP4 path in the
TMA switch.
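
The aliasing can be illustrated with stand-in types. This is only a sketch:
the enums below are placeholders for torch::ScalarType and
CUtensorMapDataType, the function name is hypothetical, and mapping the
packed path to 16U4_ALIGN8B is an assumption based on the note above:

```cpp
#include <cstdint>

// Stand-ins so the sketch compiles without PyTorch or CUDA driver headers.
enum class ScalarType : uint8_t { UInt8, Int8, BFloat16 };
enum class TmaDataType : uint8_t { BFLOAT16, U4x16_ALIGN8B, UNSUPPORTED };

// Mirrors the alias described above: packed FP4 tensors are tagged as Int8.
constexpr ScalarType kPackedFP4 = ScalarType::Int8;

inline TmaDataType to_tma_dtype(ScalarType dtype) {
    switch (dtype) {
        case ScalarType::BFloat16: return TmaDataType::BFLOAT16;
        case kPackedFP4:           return TmaDataType::U4x16_ALIGN8B;
        // A further `case ScalarType::Int8:` would not compile here, because
        // it has the same value as kPackedFP4 (duplicate case label).
        default:                   return TmaDataType::UNSUPPORTED;
    }
}
```

In this sketch an Int8 tensor takes the packed FP4 path, while a UInt8
tensor falls through to the unsupported branch.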
- runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping
- mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8
  (same byte layout, but kInt8 is recognized by the TMA dtype switch)