nvfp4-megamoe-kernel

Files

biondizzle 90b2581dfe feat: NVFP4 fused router CuTeDSL kernel (WIP)

Single-kernel NVFP4 block-scaled GEMM + fused sqrt(softplus) + top-k
epilogue. Avoids materializing intermediate FP32 logits to GMEM.
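
For readers unfamiliar with block-scaled FP4, the following pure-Python sketch shows the arithmetic behind one output element of such a GEMM. It is a reference illustration, not the kernel's code. It assumes the usual NVFP4 convention of 4-bit E2M1 values sharing one scale factor per 16-element block along K, and it holds the scale factors as plain Python floats rather than FP8. The names `decode_fp4` and `block_scaled_dot` are hypothetical.

```python
# E2M1 magnitudes indexed by the low 3 bits of the 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed number of K elements that share one scale factor


def decode_fp4(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def block_scaled_dot(a_codes, a_scales, b_codes, b_scales):
    """Dot product of two FP4 vectors with one scale per BLOCK elements.

    a_scales / b_scales play the role of SFA / SFB for a single row of A
    and a single column of B (held here as floats for simplicity).
    """
    assert len(a_codes) == len(b_codes) and len(a_codes) % BLOCK == 0
    acc = 0.0  # stands in for the FP32 accumulator
    for blk in range(len(a_codes) // BLOCK):
        partial = 0.0
        for i in range(blk * BLOCK, (blk + 1) * BLOCK):
            partial += decode_fp4(a_codes[i]) * decode_fp4(b_codes[i])
        # Both operands' block scales are applied to the block's partial sum.
        acc += a_scales[blk] * b_scales[blk] * partial
    return acc
```

In the fused kernel this accumulation is done by the tensor cores; the sketch only fixes what "block-scaled" means numerically.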

Architecture: 6-warp specialization
- Warp 5 (TMA): Load A, B, SFA, SFB from GMEM → SMEM
- Warp 4 (MMA): NVFP4 block-scaled GEMM → FP32 accumulator in TMEM
- Warps 0-3 (EPI): TMEM → registers → sqrt(softplus) + bias + top-k → GMEM
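
The warp split above can be summarized as a role lookup on the thread index. This is a plain-Python illustration of the mapping only, assuming 32 threads per warp and a 192-thread CTA (6 warps); `warp_role` is a hypothetical helper, not a function in the kernel.

```python
THREADS_PER_WARP = 32
TMA_WARP = 5               # loads A, B, SFA, SFB from GMEM to SMEM
MMA_WARP = 4               # issues the block-scaled MMA into TMEM
EPILOGUE_WARPS = range(4)  # warps 0-3: TMEM -> registers -> top-k -> GMEM


def warp_role(thread_idx):
    """Return the role of the warp that owns thread_idx within the CTA."""
    warp = thread_idx // THREADS_PER_WARP
    if warp == TMA_WARP:
        return "tma"
    if warp == MMA_WARP:
        return "mma"
    if warp in EPILOGUE_WARPS:
        return "epilogue"
    raise ValueError(f"thread {thread_idx} is outside the 6-warp CTA")
```

Four epilogue warps of 32 threads give the 128 epilogue threads whose heaps are merged in the final step.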

The epilogue maintains a per-thread min-heap across the N subtiles,
then merges all 128 threads' heaps in SMEM for the final top-k selection.
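
A host-side reference for the two-stage selection can help when validating the kernel output. The sketch below is pure Python and makes several assumptions that the commit does not confirm: candidates are dealt to threads in a strided pattern, the bias is added after sqrt(softplus) in the order the summary line gives, and ties are broken by lower expert index. `router_topk` and the other names are hypothetical.

```python
import heapq
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def thread_topk(candidates, k):
    """Keep the k largest (score, index) pairs using a size-k min-heap."""
    heap = []
    for score, idx in candidates:
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            # The heap root is the smallest kept score; replace it.
            heapq.heapreplace(heap, (score, idx))
    return heap


def router_topk(logits, bias, k, num_threads=128):
    """Two-stage top-k over sqrt(softplus(logit)) + bias for one token."""
    scored = [(math.sqrt(softplus(x)) + b, i)
              for i, (x, b) in enumerate(zip(logits, bias))]
    # Stage 1: each "thread" scans its own strided slice of the candidates.
    heaps = [thread_topk(scored[t::num_threads], k) for t in range(num_threads)]
    # Stage 2: merge every per-thread heap and take the global top-k.
    merged = [item for heap in heaps for item in heap]
    return sorted(merged, key=lambda item: (-item[0], item[1]))[:k]
```

Because each per-thread heap already holds that thread's k best candidates, the merge only has to examine at most `num_threads * k` entries, which is what makes an SMEM merge practical.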

Mirrors Sm100BlockScaledPersistentDenseGemmKernel structure for
TMA/MMA/SFA/SFB handling, with custom top-k epilogue replacing
the standard SwiGLU + TMA store path.

NOTE: This is WIP and still needs compilation testing on B200.
Several API details (tiled_mma_sfb, cluster_layout_sfb_vmnk) are
not yet passed through the kernel parameters properly.

2026-06-01 06:40:21 +00:00

cache

E1: Wire LayerCacheHandle gather methods + CUDA gather kernels

2026-05-30 21:09:21 +00:00

kernels

feat: NVFP4 fused router CuTeDSL kernel (WIP)

2026-06-01 06:40:21 +00:00

layers

feat: NVFP4 gate projection for router (replaces BF16 cuBLAS)

2026-06-01 05:58:56 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering