- assemble_activation_scales_gpu: builds the padded and swizzled scale
tensor without .item() or .tolist() calls, which force a CPU sync.
Row indices are built on the GPU with arange + cat and written with a
single scatter, replacing per-expert Python slicing.
- It still has a "for e in range(num_experts)" loop, but num_experts is
a compile-time constant, so torch.compile unrolls it.
- Added tests/cudagraph_test.py: attempts CUDA graph capture on the
MoE runner and diagnoses sync violations using patched torch functions.
- Removed the "if total_slots == 0" early return: it was Python control
flow on GPU data, which forces a sync.
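
For reviewers, a minimal sketch of the sync-free idea behind the first
and last items. It is not the code in this change: the function name,
arguments, and the searchsorted-based expert lookup are assumptions made
for illustration (the actual function uses arange + cat per expert), and
swizzling is omitted. Every index is computed with tensor ops, the output
height is a static bound passed by the caller, and the empty-input case
needs no early return because the scatter becomes a no-op.

```python
import torch

def assemble_padded_scales(scales, counts, pad_to, max_padded_rows):
    """Hypothetical sketch: place each expert's rows at a pad_to-aligned offset.

    scales: [total_tokens, K], rows already grouped by expert.
    counts: [num_experts] int64 tensor of rows per expert (same device).
    max_padded_rows: static Python int upper bound on the output height.
    """
    total_tokens, k = scales.shape  # static shape info, not GPU data
    # Round each expert's row count up to a multiple of pad_to.
    padded = (counts + pad_to - 1) // pad_to * pad_to
    src_ends = torch.cumsum(counts, 0)
    src_starts = src_ends - counts
    dst_starts = torch.cumsum(padded, 0) - padded

    # Map every source row to its expert without reading counts on the CPU.
    token = torch.arange(total_tokens, device=scales.device)
    expert = torch.searchsorted(src_ends, token, right=True)
    dst_row = dst_starts[expert] + (token - src_starts[expert])

    # One scatter writes all rows; padding rows stay zero.
    out = scales.new_zeros(max_padded_rows, k)
    out.scatter_(0, dst_row[:, None].expand(-1, k), scales)
    return out

# Three experts with 2, 0, and 3 rows, padded to multiples of 4.
counts = torch.tensor([2, 0, 3])
scales = torch.arange(10, dtype=torch.float32).reshape(5, 2) + 1
out = assemble_padded_scales(scales, counts, pad_to=4, max_padded_rows=8)
```

Here expert 0 lands in rows 0-1 and expert 2 in rows 4-6; the remaining
rows are padding. Passing a zero-row scales tensor returns an all-zero
output through the same code path.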