nvfp4-megamoe-kernel

Files

biondizzle d01b4b02de Complete NVFP4 fused router kernel: full MMA + router epilogue

- TMA warp: persistent tile scheduling + TMA loads for A/B/SFA/SFB
- MMA warp: blockscaled GEMM (tcgen05.mma.block_scale) with S2T copy
  for SFA/SFB, proper pipeline synchronization (AB + Acc pipelines)
- Epilogue warps: TMEM->register via epilogue_tmem_copy_and_partition,
  sqrt(softplus) + e_bias + min-heap top-k + renormalization
- Python wrapper: run_nvfp4_fused_router() with proper CuTe tensor
  creation via from_dlpack + mark_layout_dynamic
- Single-kernel path, no BF16 fallback, no intermediate GMEM buffer
- Follows the exact patterns of the MoE fused_swiglu.py kernel
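
For reference, the router epilogue math named above (sqrt(softplus), e_bias, min-heap top-k, renormalization) can be sketched on the host in plain Python. This is not the kernel's CuTe code. It rests on assumptions the commit message does not confirm: that e_bias is added only to choose experts, that the returned weights are the unbiased scores renormalized over the selected experts, and that ties are broken by expert index. The name route_reference is hypothetical.

```python
import heapq
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route_reference(logits, e_bias, k):
    """Hypothetical host-side reference for one token's routing.

    Assumption: e_bias influences expert selection only; the output
    weights are unbiased scores renormalized to sum to 1.
    """
    scores = [math.sqrt(softplus(x)) for x in logits]

    # Min-heap of (biased_score, expert_index) holding the best k so far.
    heap = []
    for idx, (score, bias) in enumerate(zip(scores, e_bias)):
        item = (score + bias, idx)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # New candidate beats the current k-th best: swap it in.
            heapq.heapreplace(heap, item)

    # Order the selected experts by descending biased score.
    indices = [idx for _, idx in sorted(heap, reverse=True)]

    # Renormalize the unbiased scores over the selected experts.
    total = sum(scores[i] for i in indices)
    weights = [scores[i] / total for i in indices]
    return indices, weights


if __name__ == "__main__":
    idx, w = route_reference([0.0, 2.0, -1.0, 1.0], [0.0, 0.0, 5.0, 0.0], k=2)
    print(idx, w)
```

A reference like this is mainly useful for checking the fused kernel's indices and weights on small inputs. If the kernel applies e_bias differently, the selection and weighting steps need to change to match.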

2026-06-01 08:37:10 +00:00

attention

FMHA sink: don't double-scale sink bias

2026-05-31 23:12:20 +00:00

cache

fix: correct gather.py kernel_dir path

2026-05-30 21:12:09 +00:00

compressor

fix: import torch.utils.cpp_extension explicitly in production_compress

2026-06-01 05:20:44 +00:00

cuda

fix: move compressor position_bias into CUDA kernel (was Python loop)