fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe
- transform_sf_into_required_layout: add a gran_k=16 branch for NVFP4
UE4M3 scales (4 per int32, group_size=16). Previously only gran_k=32
and gran_k=128 were handled.
- get_arch: always return '100a' for SM100, never '100f'. The sm_100f
family variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, which
causes 'scale_vec::4X not supported on sm_100f' errors.
- transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into the
block scales, pack UE4M3 scales into int32, transpose to MN-major, then
call transform_sf_into_required_layout with gran_k=16.
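
Illustration of the "4 per int32, group_size=16" packing arithmetic. This
is a minimal sketch, not the code in this change: the helper names
num_packed_words and pack_scales_to_int32 are hypothetical, scales are
treated as raw 8-bit UE4M3 codes, and little-endian byte order within each
word is an assumption.

```python
import struct


def num_packed_words(k: int, gran_k: int = 16, scales_per_word: int = 4) -> int:
    """Number of int32 words needed to hold the block scales of one row of length k."""
    if k % (gran_k * scales_per_word) != 0:
        raise ValueError("k must be a multiple of gran_k * scales_per_word")
    num_scales = k // gran_k  # one scale per group of gran_k elements
    return num_scales // scales_per_word


def pack_scales_to_int32(scale_codes: bytes) -> list[int]:
    """Pack consecutive groups of four 8-bit scale codes into signed int32 words.

    Assumption: the first code of each group lands in the lowest byte
    (little-endian). The real layout may differ.
    """
    if len(scale_codes) % 4 != 0:
        raise ValueError("need a multiple of 4 scale codes")
    return [word for (word,) in struct.iter_unpack("<i", scale_codes)]


if __name__ == "__main__":
    # K=128 with gran_k=16 gives 8 scales, i.e. 2 int32 words per row.
    print(num_packed_words(128))
    print([hex(w) for w in pack_scales_to_int32(bytes([1, 2, 3, 4]))])
```

With gran_k=16 each int32 word covers 64 elements along K, which is why
the packed scale tensor is a quarter as wide as the unpacked one.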