nvfp4-megamoe-kernel/reference/grouped_blockscaled_gemm.py at 50e9b5da81477e000e4770e94de9d175204f5ff0


biondizzle a2ea836c74 docs: add CuTeDSL rewrite plan + reference files

The C++ CUTLASS kernel is fundamentally broken (cosine similarity of
0.05 with real data). Switching to NVIDIA's CuTeDSL approach, based on
their official MoE scaled grouped GEMM example.
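
For context on the 0.05 figure, here is a minimal sketch of a cosine
similarity check between a kernel output and a trusted reference. It is
an illustration only, not the check used in this repository: the function
name cosine_similarity and the sample vectors are hypothetical, and plain
Python lists stand in for flattened output tensors. A value near 1.0
means the two outputs point the same way; a value near 0 means they are
essentially unrelated.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up sample data: a reference output and two candidate outputs.
reference = [1.0, 2.0, 3.0, 4.0]
close_output = [1.01, 1.98, 3.02, 3.97]   # small numerical error only
unrelated_output = [4.0, -3.0, 2.0, -1.0]  # orthogonal to the reference

print(cosine_similarity(reference, close_output))
print(cosine_similarity(reference, unrelated_output))
```

A low-precision kernel that is working correctly would normally be
expected to land close to 1.0 on such a check, so a score of 0.05 points
to a correctness problem rather than ordinary rounding error.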

Reference files copied:
- moe_torch_scaled_grouped_mm.py (3900 lines; our new kernel)
- moe_utils.py, moe_persistent_scheduler.py, moe_sched_extension.py
- grouped_blockscaled_gemm.py, dense_blockscaled_gemm_persistent.py
- blockscaled_layout.py

2026-05-16 02:41:51 +00:00

125 KiB
