Complete NVFP4 fused router kernel: full MMA + router epilogue

- TMA warp: persistent tile scheduling + TMA loads for A/B/SFA/SFB
- MMA warp: blockscaled GEMM (tcgen05.mma.block_scale) with S2T copy
  for SFA/SFB, proper pipeline synchronization (AB + Acc pipelines)
- Epilogue warps: TMEM->register via epilogue_tmem_copy_and_partition,
  sqrt(softplus) + e_bias + min-heap top-k + renormalization
- Python wrapper: run_nvfp4_fused_router() with proper CuTe tensor
  creation via from_dlpack + mark_layout_dynamic
- Single-kernel path, no BF16 fallback, no intermediate GMEM buffer
- Follows the patterns of the MoE fused_swiglu.py kernel
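
For orientation, the router epilogue listed above (sqrt(softplus), e_bias, min-heap top-k, renormalization) can be sketched on the host in plain Python. This is a minimal reference under stated assumptions, not the kernel's code: the function name `route_reference` is hypothetical, and it assumes that `e_bias` affects expert selection only while the returned weights come from the unbiased scores. The commit message does not confirm either detail.

```python
import heapq
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route_reference(logits, e_bias, top_k):
    """Hypothetical host-side reference for one token's routing."""
    # Score activation: sqrt(softplus(logit)).
    scores = [math.sqrt(softplus(z)) for z in logits]

    # Min-heap of (biased_score, expert_index) holding at most top_k items.
    # The heap root is always the weakest candidate kept so far.
    heap = []
    for idx, s in enumerate(scores):
        item = (s + e_bias[idx], idx)  # assumption: bias is used for selection only
        if len(heap) < top_k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)

    # Order the selected experts from strongest to weakest biased score.
    ids = [idx for _, idx in sorted(heap, reverse=True)]

    # Renormalize the unbiased scores of the selected experts to sum to 1.
    total = sum(scores[i] for i in ids)
    weights = [scores[i] / total for i in ids]
    return ids, weights
```

A reference like this is mainly useful for checking the fused kernel's expert ids and weights on small inputs. Ties are broken here by tuple comparison (the higher expert index wins), which may differ from the kernel's behavior.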

biondizzle

2026-06-01 08:37:10 +00:00

parent 25b9a5f32d

commit d01b4b02de

1 changed file with 841 additions and 192 deletions

dsv4/kernels/router/nvfp4_fused_router_kernel.py
