nvfp4-megamoe-kernel

Files

biondizzle 518a1d3f95 CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount

1. scatter_add_ requires int64 indices, so sorted_ids is now cast with .long()
2. Fixed the second torch.bincount call (line 590) using the same scatter_add_ pattern
3. Both code paths now use the pre-allocated _tokens_per_expert_buf

2026-06-03 17:53:40 +00:00
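
The pattern named in the commit message above can be sketched as follows. This is a minimal illustration, not the code in moe.py: the function name count_tokens_per_expert and the buffers counts_buf and ones_buf are hypothetical stand-ins, and the example runs on CPU for simplicity. The idea is that counting into a fixed-size buffer with scatter_add_ keeps the output shape static, and that scatter_add_ only accepts int64 indices.

```python
import torch


def count_tokens_per_expert(expert_ids, counts_buf, ones_buf):
    """Count how many tokens are routed to each expert, writing into counts_buf.

    expert_ids: 1-D integer tensor of expert indices (any integer dtype).
    counts_buf: pre-allocated int64 tensor of shape (num_experts,).
    ones_buf:   pre-allocated int64 tensor of ones, at least as long as expert_ids.
    """
    n = expert_ids.numel()
    # Reset the reused buffer in place instead of allocating a fresh one.
    counts_buf.zero_()
    # scatter_add_ requires int64 indices; .long() is a no-op if already int64.
    counts_buf.scatter_add_(0, expert_ids.long(), ones_buf[:n])
    return counts_buf


# Hypothetical sizes, chosen only for the example.
num_experts = 4
counts_buf = torch.zeros(num_experts, dtype=torch.int64)
ones_buf = torch.ones(16, dtype=torch.int64)

# int32 indices would be rejected by scatter_add_ without the .long() cast.
expert_ids = torch.tensor([0, 2, 2, 3, 0, 2], dtype=torch.int32)
counts = count_tokens_per_expert(expert_ids, counts_buf, ones_buf)
print(counts.tolist())
```

The result matches torch.bincount(expert_ids.long(), minlength=num_experts), but the output is always written into the same buffer with a fixed shape.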

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

grouped_linear.py

CUDA graph: Fix per-call allocations in grouped_linear and quantize

2026-06-03 17:39:20 +00:00

linear.py

CUDA graph: Fix _assemble_scales_single_group swizzle size

2026-06-03 17:02:34 +00:00

mhc.py

CUDA graph: Fix per-step allocations in decode loop

2026-06-03 16:38:35 +00:00

moe.py

CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount

2026-06-03 17:53:40 +00:00

router.py

CRITICAL FIX: runtime activation global scale to prevent E4M3 overflow

2026-06-01 14:21:16 +00:00

shared_expert.py

CUDA graph: Fix _assemble_scales_single_group swizzle size

2026-06-03 17:02:34 +00:00