- CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing
- _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle
- All buffers pre-allocated for cudagraph compatibility
- Updated test to use dedicated runner instead of MoE runner hack
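
A rough illustration of the padding and pre-allocation points, not the code from this change: the sketch below copies a scale matrix into a buffer allocated once up front and zero-fills the rows up to the next multiple of 128. It uses NumPy so it runs without a GPU. The helper names (`round_up`, `pad_scales_into`), the buffer shape, and the reading of "pad to 128 rows" as "pad to a multiple of 128" are all assumptions. The Blackwell swizzle step is deliberately left out.

```python
import numpy as np

ROW_ALIGNMENT = 128  # row multiple named in the commit message


def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple."""
    return (value + multiple - 1) // multiple * multiple


def pad_scales_into(scales: np.ndarray, out: np.ndarray) -> np.ndarray:
    """Hypothetical helper: copy scales into a pre-allocated buffer.

    Returns a view of `out` whose row count is a multiple of 128.
    No new array is allocated, mirroring the "all buffers
    pre-allocated" constraint (assumed here to mean fixed storage
    that is reused on every call).
    """
    rows, cols = scales.shape
    padded_rows = round_up(rows, ROW_ALIGNMENT)
    if out.shape[0] < padded_rows or out.shape[1] != cols:
        raise ValueError("pre-allocated buffer is too small or has wrong width")
    view = out[:padded_rows]   # slicing yields a view, not a copy
    view[:rows] = scales       # real scale values
    view[rows:] = 0            # zero the padding rows
    return view


# Allocate once for the largest expected input (256 rows is an arbitrary choice).
max_rows = 256
buf = np.empty((round_up(max_rows, ROW_ALIGNMENT), 4), dtype=np.float32)

scales = np.ones((130, 4), dtype=np.float32)
padded = pad_scales_into(scales, buf)
```

Writing into a fixed buffer rather than returning a fresh array keeps storage addresses stable between calls, which is the property graph capture generally relies on.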