Files
nvfp4-megamoe-kernel/src/nvfp4_megamoe_kernel/__pycache__/symm_buffer.cpython-312.pyc
biondizzle bf13665dbe Implement TileLang NVFP4 mega_moe L1/L2 kernels
- nvfp4_mega_moe_l1: L1 GEMM (gate_up_proj) with FP4 dequant → BF16 GEMM
- nvfp4_mega_moe_l2: L2 GEMM (down_proj) with FP4 dequant → BF16 GEMM
- nvfp4_dequant.py: E2M1 packed → BF16 with UE4M3 block16 scales
- tilelang_kernels.py: Grouped expert GEMM with TileLang-compiled BF16 GEMM
- Full pipeline: L1 GEMM → SiLU+Mul → re-quantize → L2 GEMM → output
- MEGA_MOE_STATIC=1 bypass still works for pipeline testing

Current approach: dequantize FP4→BF16 then run BF16 GEMM via TileLang T.gemm
(auto-lowers to tcgen05 on Blackwell). Will be upgraded to native FP4
block-scaled MMA (tcgen05.mma kind::mxf8f6f4.block_scale) once TileLang
adds E2M1+UE4M3 support.
2026-05-13 22:36:58 +00:00

3.4 KiB