Root cause of CUDA_ERROR_ASSERT index out of bounds:
- topk_ids contains GLOBAL expert IDs (0-255), but the runner treated
them as local IDs (0-31 with EP=8). Tokens routed to non-local experts
were assigned to the wrong experts, which produced out-of-bounds
scatter indices in _assemble_scales_cudagraph_safe.
Fixes:
1. Add experts_start_idx param to CuTeDSLMoERunner
2. In run(), remap global→local IDs and zero weights for non-local experts
3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync)
4. Add _fill_token_indices() and _needs_token_refill to work around
GPU memory corruption from the CuTeDSL JIT (_token_indices is refilled
after the first GEMM call)
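
Illustration of fix 2 (a minimal pure-Python sketch, not the code in
this change): the helper name remap_to_local, the num_local_experts
parameter, and the choice of local ID 0 as the placeholder for
non-local slots are assumptions made for this example. The real run()
presumably does the same thing with tensor ops on the GPU.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Map global expert IDs to local IDs and zero weights of non-local slots.

    Hypothetical helper for illustration only.
    """
    local_ids, local_weights = [], []
    for row_ids, row_weights in zip(topk_ids, topk_weights):
        ids_out, weights_out = [], []
        for global_id, weight in zip(row_ids, row_weights):
            local_id = global_id - experts_start_idx
            if 0 <= local_id < num_local_experts:
                # Expert lives on this rank: keep its weight.
                ids_out.append(local_id)
                weights_out.append(weight)
            else:
                # Expert lives on another rank. Use an in-range placeholder ID
                # (assumption: 0) and a zero weight so the slot contributes
                # nothing and never indexes out of bounds.
                ids_out.append(0)
                weights_out.append(0.0)
        local_ids.append(ids_out)
        local_weights.append(weights_out)
    return local_ids, local_weights


# With 256 experts and EP=8, rank 1 owns global IDs 32..63.
ids, weights = remap_to_local(
    topk_ids=[[5, 40], [63, 200]],
    topk_weights=[[0.6, 0.4], [0.7, 0.3]],
    experts_start_idx=32,
    num_local_experts=32,
)
print(ids)      # [[0, 8], [31, 0]]
print(weights)  # [[0.0, 0.4], [0.7, 0.0]]
```

Without the remap, a global ID such as 200 would be used directly as an
index into structures sized for 32 local experts, which matches the
out-of-bounds failure described above.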