biondizzle
d01b4b02de
Complete NVFP4 fused router kernel: full MMA + router epilogue
- TMA warp: persistent tile scheduling + TMA loads for A/B/SFA/SFB
- MMA warp: blockscaled GEMM (tcgen05.mma.block_scale) with S2T copy
  for SFA/SFB, proper pipeline synchronization (AB + Acc pipelines)
- Epilogue warps: TMEM->register via epilogue_tmem_copy_and_partition,
  sqrt(softplus) + e_bias + min-heap top-k + renormalization
- Python wrapper: run_nvfp4_fused_router() with proper CuTe tensor
  creation via from_dlpack + mark_layout_dynamic
- Single-kernel path, no BF16 fallback, no intermediate GMEM buffer
- Follows the exact patterns used in the MoE fused_swiglu.py kernel
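
For reference, the router epilogue math listed above (sqrt(softplus) + e_bias + min-heap top-k + renormalization) can be sketched on the host in plain Python. This is not the kernel code. The names `route_row`, `logits`, and `top_k` are hypothetical, and the sketch assumes that e_bias only influences which experts are selected while the renormalized weights come from the unbiased scores. The commit message does not state this, so treat it as an assumption.

```python
import heapq
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route_row(logits, e_bias, top_k):
    """Host-side sketch of the router epilogue for one token (hypothetical helper)."""
    # Activation: sqrt(softplus(logit)) per expert.
    scores = [math.sqrt(softplus(z)) for z in logits]

    # Min-heap top-k: the heap root is the weakest candidate kept so far.
    heap = []  # entries are (biased_score, expert_index)
    for idx, (s, b) in enumerate(zip(scores, e_bias)):
        item = (s + b, idx)  # assumption: bias is used for selection only
        if len(heap) < top_k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)

    # Order the selected experts from strongest to weakest biased score.
    experts = [idx for _, idx in sorted(heap, reverse=True)]

    # Renormalize the unbiased scores of the selected experts to sum to 1.
    total = sum(scores[i] for i in experts)
    weights = [scores[i] / total for i in experts]
    return experts, weights
```

A sketch like this is mainly useful as a small correctness oracle when comparing the fused kernel's expert indices and weights against a slow reference on a few rows.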
2026-06-01 08:37:10 +00:00