nvfp4-megamoe-kernel

Files

biondizzle bab748763e Rewrite NVFP4 fused router kernel: MoE-style epilogue replaces broken SMEM merge

CRITICAL REWRITE of nvfp4_fused_router_kernel.py:
- REMOVED: Raw pointer SMEM merge (storage.merge_scores.data_ptr()[idx] = val)
  This crashed the CuTeDSL MLIR optimizer. Never use raw pointer indexing
  inside CuTeDSL kernels.
- REMOVED: Per-thread top-k accumulation + 128-thread SMEM merge. Too complex
  for MLIR, caused SIGABRT during compilation.
- ADDED: MoE-style epilogue (TMEM→regs→activation→SMEM→TMA store→GMEM)
  using paired copy atoms from CUTLASS (epilogue_tmem_copy_and_partition +
  epilogue_smem_copy_and_partition). Structurally identical to the proven
  FusedSwiGLUScaledGroupedGemmKernel epilogue. This SHOULD compile.
- Activation: sqrt(softplus(logit)) in registers (replaces SwiGLU)
- Output: FP32 activated scores written to GMEM via TMA store
- Top-k handled by activation_topk CUDA kernel in Python wrapper

Other changes:
- _activation_topk.py: Added run_fused_activation_topk_pre_activated() for
  top-k + renorm on pre-activated scores (PyTorch reference, not CUDA kernel)
- dense_router_dispatch_nvfp4_fused: Updated to match new kernel API
- Kernel now uses standard _compute_stages() for SMEM budget calculation
- Kernel now uses compute_epilogue_tile_shape() for epi_tile (not hardcoded)
- C pipeline (PipelineTmaStore) added for SMEM→GMEM overlap

2026-06-01 09:59:34 +00:00

attention

FMHA sink: don't double-scale sink bias

2026-05-31 23:12:20 +00:00

cache

fix: correct gather.py kernel_dir path

2026-05-30 21:12:09 +00:00

compressor

fix: import torch.utils.cpp_extension explicitly in production_compress

2026-06-01 05:20:44 +00:00

cuda

fix: move compressor position_bias into CUDA kernel (was Python loop)