The outer-loop tqdm bar now covers the full finalize_weights + warmup step for each MoE layer. CuTeDSL caches compiled kernels by (M,N,K), so every layer shape gets compiled during warmup, which avoids RPC timeouts caused by JIT compilation during inference. Example progress output:

(JIT compile)NVFP4 MoE layers: 50%|██████████░░░░░░░░░░| 31/61
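
For illustration, here is a minimal Python sketch of the pattern: one progress bar wrapping both steps per layer, with a cache keyed by (M,N,K). It is not the actual code. `get_kernel`, `warmup`, `prepare_layers`, the dict-based layers, and the `_kernel_cache` dict are hypothetical stand-ins; only the name `finalize_weights` and the bar label come from the message above, and the real CuTeDSL cache is not modeled here.

```python
from typing import Callable, Dict, List, Tuple

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm installed.
    def tqdm(iterable, **kwargs):
        return iterable

Shape = Tuple[int, int, int]  # (M, N, K)

# Hypothetical stand-in for a JIT cache keyed by GEMM shape.
_kernel_cache: Dict[Shape, Callable[[], Shape]] = {}
compiled: List[Shape] = []  # records each simulated compile, in order


def get_kernel(shape: Shape) -> Callable[[], Shape]:
    """Return the cached kernel for `shape`, 'compiling' it on first use."""
    if shape not in _kernel_cache:
        compiled.append(shape)  # simulates the expensive JIT compile
        _kernel_cache[shape] = lambda s=shape: s
    return _kernel_cache[shape]


def finalize_weights(layer: dict) -> None:
    """Placeholder for the per-layer weight finalization step."""
    layer["finalized"] = True


def warmup(layer: dict) -> None:
    """Run each shape the layer uses once so its kernel lands in the cache."""
    for shape in layer["shapes"]:
        get_kernel(shape)()


def prepare_layers(layers: List[dict]) -> None:
    """Single outer progress bar covering finalize + warmup for every layer."""
    for layer in tqdm(layers, desc="(JIT compile)NVFP4 MoE layers"):
        finalize_weights(layer)
        warmup(layer)
```

Calling `prepare_layers` on layers that share shapes compiles each distinct (M,N,K) only once. Later `get_kernel` calls with an already-seen shape are cache hits, which is why the compile cost is paid during warmup rather than on the first inference request.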