nvfp4-megamoe-kernel

Files

biondizzle 2e2caadf7d WIP: NVFP4 fused router kernel in raw CUDA C++ using DeepGEMM primitives

- nvfp4_fused_router_kernel.cuh: 1-CTA NVFP4 GEMM + sqrt(softplus) + top-k epilogue
- Uses DeepGEMM SM100 primitives: SM100_MMA_MXF4_SS, UTCCP, UMMA descriptors
- 4 warp roles: TMA load, UTCCP transpose, MMA issue, epilogue
- nvfp4_fused_router_cuda.py: Python wrapper (TMA descriptor setup TBD)

NOT YET COMPILING - still needs:
1. SMEM layout fix (carve all buffers from a single extern __shared__ array)
2. TMA descriptor creation (cuTensorMapEncodeTiled)
3. Completion of the cross-warp top-k merge
4. FP4 tensor format aligned with what DeepGEMM expects

2026-06-01 07:41:42 +00:00
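
For reference, the math the fused epilogue is described as computing (GEMM logits, then sqrt(softplus), then top-k) can be sketched on the host in plain Python. This is only an illustration under assumptions: the names `route`, `route_chunked`, and `merge_topk` are hypothetical and do not come from the repository, NVFP4 quantization is not modeled (ordinary floats are used), and any bias or normalization the real router applies after selection is left out. The chunked variant only mimics the idea of a cross-warp merge; how the kernel actually partitions experts across warps is not stated in the commit.

```python
import heapq
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def expert_scores(hidden, gate_weight):
    # One dot product per expert row (the GEMM, for a single token),
    # followed by the sqrt(softplus) activation.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in gate_weight]
    return [math.sqrt(softplus(z)) for z in logits]


def route(hidden, gate_weight, k):
    # Direct top-k over all experts; returns (score, expert_id) pairs,
    # highest score first.
    scores = expert_scores(hidden, gate_weight)
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


def merge_topk(partials, k):
    # Merge several partial top-k lists into one global top-k list.
    return heapq.nlargest(k, (pair for part in partials for pair in part))


def route_chunked(hidden, gate_weight, k, num_chunks=4):
    # Hypothetical partitioning: each chunk of experts (standing in for
    # one warp's share) finds its local top-k, then the lists are merged.
    scores = expert_scores(hidden, gate_weight)
    step = -(-len(scores) // num_chunks)  # ceiling division
    partials = []
    for start in range(0, len(scores), step):
        chunk = ((s, i) for i, s in enumerate(scores[start:start + step], start))
        partials.append(heapq.nlargest(k, chunk))
    return merge_topk(partials, k)


if __name__ == "__main__":
    hidden = [1.0, 0.5]
    gate_weight = [[0.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [0.5, 1.0]]
    print(route(hidden, gate_weight, k=2))
    print(route_chunked(hidden, gate_weight, k=2))
```

Because sqrt and softplus are both monotonically increasing, the selected experts are the same as a top-k over the raw logits; only the returned scores differ. The chunked path returns the same result as the direct path whenever each chunk keeps at least k candidates.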

attention

FMHA sink: don't double-scale sink bias

2026-05-31 23:12:20 +00:00

cache

fix: correct gather.py kernel_dir path

2026-05-30 21:12:09 +00:00

compressor

fix: import torch.utils.cpp_extension explicitly in production_compress

2026-06-01 05:20:44 +00:00

cuda

fix: move compressor position_bias into CUDA kernel (was Python loop)