1. scatter_add_ requires int64 indices: ensure sorted_ids is .long()
2. Fixed the SECOND torch.bincount call (line 590): same scatter_add_ pattern
3. Both code paths now use pre-allocated _tokens_per_expert_buf