float4_e2m1fn_x2 packs 2 values per byte along K, not N.
The GEMM output N dimension is the logical N from mat_b.shape[2],
not 2x that value. The previous n_dim*2 was wrong; it accidentally
worked in the test because intermediate_size*2 == 2*intermediate_size.
A real model with N=9216 exposed the bug.
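
A minimal sketch of the shape bookkeeping described above. The
(E, K/2, N) storage layout, the helper names, and the example values
num_experts=8 and K=4096 are assumptions for illustration; only
N=9216 and the use of mat_b.shape[2] come from this change.

```python
# Illustrative only: layout (E, K/2, N) and all helper names are assumptions.

def packed_shape(num_experts, k, n):
    """Storage shape when two 4-bit values share one byte along K."""
    assert k % 2 == 0, "K must be even to pack two values per byte"
    return (num_experts, k // 2, n)


def gemm_output_n(mat_b_shape):
    """N of the GEMM output is read straight from dim 2, with no factor of 2."""
    return mat_b_shape[2]


def logical_k(mat_b_shape):
    """The x2 packing factor applies to K only."""
    return mat_b_shape[1] * 2


shape = packed_shape(num_experts=8, k=4096, n=9216)
print(shape, gemm_output_n(shape), logical_k(shape))
```

Under these assumptions, multiplying dim 2 by two would size the
output for N=18432 instead of 9216.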
- assemble_activation_scales_gpu: builds padded+swizzled scale tensor
without .item() or .tolist() CPU syncs. Uses GPU index arange + cat
+ single scatter instead of per-expert Python slicing.
- Still has a for e in range(num_experts) loop, but num_experts is a
compile-time constant, so torch.compile unrolls it.
- Added tests/cudagraph_test.py: attempts CUDA graph capture on the
MoE runner, diagnoses sync violations with patched torch functions.
- Removed the if total_slots == 0 early return (Python control flow
on GPU data, which forces a CPU sync).
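
For readers unfamiliar with the pattern, here is a hedged sketch of
sync-free padded assembly. It is not the code in
assemble_activation_scales_gpu: it swaps the per-expert arange + cat
for torch.searchsorted, omits the swizzle, and assumes rows are
already sorted by expert and that tokens_per_expert sums to the row
count. The function name, the block parameter, and the static upper
bound on output rows are all assumptions.

```python
import torch


def assemble_padded_scales(scales, tokens_per_expert, block=128):
    """Copy per-expert row groups into a block-padded buffer with no host sync.

    Hypothetical helper: assumes rows of `scales` are sorted by expert and
    tokens_per_expert.sum() == scales.shape[0].
    """
    counts = tokens_per_expert.to(torch.int64)
    num_experts = counts.shape[0]  # Python int taken from the shape, no sync
    total = scales.shape[0]        # Python int taken from the shape, no sync

    # Round each expert's row count up to a multiple of `block`.
    padded = (counts + block - 1) // block * block
    src_ends = torch.cumsum(counts, 0)
    src_starts = src_ends - counts
    dst_starts = torch.cumsum(padded, 0) - padded

    # Expert id of every source row, computed on-device.
    row = torch.arange(total, device=scales.device)
    expert = torch.searchsorted(src_ends, row, right=True)
    dst_row = dst_starts[expert] + (row - src_starts[expert])

    # Static upper bound on padded rows, so the output shape never
    # depends on GPU data: each expert adds at most block - 1 padding rows.
    max_rows = (total + num_experts * (block - 1) + block - 1) // block * block
    out = scales.new_zeros((max_rows, scales.shape[1]))
    out.index_copy_(0, dst_row, scales)  # one copy, no Python slicing
    return out


scales = torch.arange(10, dtype=torch.float32).reshape(5, 2)
counts = torch.tensor([3, 0, 2])
out = assemble_padded_scales(scales, counts, block=4)
```

Because the output size is a static bound rather than the exact
padded total, nothing in the function reads a GPU value back into
Python, which is the property CUDA graph capture needs.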