nvfp4-megamoe-kernel/reference/blockscaled_layout.py at dfd9c10ae90a3d74eeb8db0701b541f004a9dbbc

biondizzle a2ea836c74 docs: add CuTeDSL rewrite plan + reference files

The C++ CUTLASS kernel is fundamentally broken (cosine similarity of
0.05 with real data). Switching to NVIDIA's CuTeDSL approach, based on
their official MoE scaled grouped GEMM example.

Reference files copied:
- moe_torch_scaled_grouped_mm.py (3900 lines — our new kernel)
- moe_utils.py, moe_persistent_scheduler.py, moe_sched_extension.py
- grouped_blockscaled_gemm.py, dense_blockscaled_gemm_persistent.py
- blockscaled_layout.py

2026-05-16 02:41:51 +00:00

20 KiB
