nvfp4-megamoe-kernel

Files

biondizzle ca3cba5bbd Fix global→local expert ID remapping for EP and remove .cpu() sync

Root cause of CUDA_ERROR_ASSERT index out of bounds:
- topk_ids contains GLOBAL expert IDs (0-255), but the runner treated
  them as local IDs (0-31 with EP=8). Tokens routed to non-local
  experts got wrong expert assignments, which produced out-of-bounds
  scatter indices in _assemble_scales_cudagraph_safe.
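
The ID ranges above follow from simple arithmetic. As a minimal sketch, assuming experts are split contiguously and evenly across EP ranks (the commit does not state the partitioning scheme, and `local_expert_range` is a hypothetical helper, not part of the runner):

```python
def local_expert_range(rank: int, num_experts: int, ep_size: int) -> range:
    """Global expert IDs owned by `rank`, assuming a contiguous, even split."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    start = rank * per_rank
    return range(start, start + per_rank)


# 256 experts over EP=8 gives 32 local experts per rank.
owned = local_expert_range(rank=3, num_experts=256, ep_size=8)

# A global ID used directly as a local index falls outside the
# 32-entry local tables whenever it is 32 or larger.
global_id = 200
is_valid_local_index = 0 <= global_id < len(owned)
```

Under this assumption rank 3 owns global IDs 96 through 127, and global ID 200 is not a valid index into any 32-entry per-rank table, which is consistent with the out-of-bounds scatter described above.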

Fixes:
1. Add experts_start_idx param to CuTeDSLMoERunner
2. In run(), remap global→local IDs and zero weights for non-local experts
3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync)
4. Add _fill_token_indices() and _needs_token_refill to work around
   CuTeDSL JIT GPU memory corruption (refill after the first GEMM call)
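
Fixes 1 and 2 can be sketched as follows. This is an illustration only, not the code in this commit: it uses plain Python lists instead of GPU tensors so that it runs without dependencies, `remap_to_local` and `num_local_experts` are hypothetical names, and redirecting non-local slots to local expert 0 is an assumption (any in-range ID works once the weight is zero). Only `experts_start_idx`, `topk_ids`, and the zeroed weights come from the commit message.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Convert global expert IDs to local ones and mask non-local slots.

    Slots whose expert lives on another rank get local ID 0 (assumed
    placeholder) and weight 0.0, so they index in bounds but contribute
    nothing to the output.
    """
    local_ids, local_weights = [], []
    for ids_row, weights_row in zip(topk_ids, topk_weights):
        row_ids, row_weights = [], []
        for global_id, weight in zip(ids_row, weights_row):
            local_id = global_id - experts_start_idx
            if 0 <= local_id < num_local_experts:
                row_ids.append(local_id)
                row_weights.append(weight)
            else:
                row_ids.append(0)        # in-range placeholder
                row_weights.append(0.0)  # zero weight: no contribution
        local_ids.append(row_ids)
        local_weights.append(row_weights)
    return local_ids, local_weights


# Rank 1 of EP=8 with 256 experts would own global IDs 32..63.
ids, weights = remap_to_local(
    topk_ids=[[33, 200], [5, 63]],
    topk_weights=[[0.75, 0.25], [0.5, 0.5]],
    experts_start_idx=32,
    num_local_experts=32,
)
```

Here `ids` becomes `[[1, 0], [0, 31]]` and `weights` becomes `[[0.75, 0.0], [0.0, 0.5]]`. A tensor version would do the same with a subtraction, a range mask, and a masked select, which keeps the whole step on the GPU and avoids a host sync.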

2026-05-17 08:58:43 +00:00

patches

Fix global→local expert ID remapping for EP and remove .cpu() sync

2026-05-17 08:58:43 +00:00

nvfp4_cutedsl.py

Fix global→local expert ID remapping for EP and remove .cpu() sync

2026-05-17 08:58:43 +00:00