WIP: NVFP4 fused router kernel in raw CUDA C++ using DeepGEMM primitives
- nvfp4_fused_router_kernel.cuh: 1-CTA NVFP4 GEMM + sqrt(softplus) + top-k epilogue
- Uses DeepGEMM SM100 primitives: SM100_MMA_MXF4_SS, UTCCP, UMMA descriptors
- 4 warp roles: TMA load, UTCCP transpose, MMA issue, epilogue
- nvfp4_fused_router_cuda.py: Python wrapper (TMA descriptor setup TBD)
NOT YET COMPILING - needs:
1. SMEM layout fix (single extern __shared__)
2. TMA descriptor creation (cuTensorMapEncodeTiled)
3. Top-k cross-warp merge completion
4. FP4 tensor format alignment with DeepGEMM