nvfp4-megamoe-kernel

Files

biondizzle a2cac7a7fe fix: remove CuTeDSL warmup — OOM with 175GB model loaded

The warmup allocated 1GB of dummy tensors but the model already
uses 175.7GB of the 178.35GB per GPU. No room.

With FULL_AND_PIECEWISE CUDA graph mode, the kernel compiles during
the graph capture phase (which manages memory properly). The warmup
was a band-aid for eager mode and is now redundant.

2026-05-16 07:32:17 +00:00
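
The memory figures in the commit message above can be checked with a few lines of arithmetic. This is only a sketch, not code from the repository: the helper names and the `reserve_mb` margin are hypothetical, and the figures are assumed to be decimal gigabytes. The nominal headroom (178.35 GB minus 175.7 GB) comes to 2.65 GB, so the reported OOM suggests the memory actually free at warmup time was lower than that figure, for reasons the commit does not spell out.

```python
# Work in whole megabytes to avoid floating-point rounding.
# Assumption: the commit's "GB" values are decimal (1 GB = 1000 MB).
TOTAL_MB = 178_350   # per-GPU capacity quoted in the commit
MODEL_MB = 175_700   # memory used by the loaded model
WARMUP_MB = 1_000    # dummy tensors allocated by the removed warmup


def headroom_mb(total_mb: int, used_mb: int) -> int:
    """Memory nominally left on the device after the model is loaded."""
    return total_mb - used_mb


def warmup_fits(free_mb: int, warmup_mb: int, reserve_mb: int = 0) -> bool:
    """True if the warmup plus a safety reserve fits in the free memory.

    reserve_mb is a hypothetical margin for anything else that needs
    device memory; the commit does not give such a number.
    """
    return warmup_mb + reserve_mb <= free_mb


nominal = headroom_mb(TOTAL_MB, MODEL_MB)
print(f"nominal headroom: {nominal} MB")
print("fits with no reserve:", warmup_fits(nominal, WARMUP_MB))
print("fits with 2000 MB reserve:", warmup_fits(nominal, WARMUP_MB, reserve_mb=2_000))
```

In a live PyTorch process, the free figure could be read at runtime with `torch.cuda.mem_get_info()` instead of being derived from the quoted totals, which is the safer input for a guard like `warmup_fits`.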

patches

fix: remove CuTeDSL warmup — OOM with 175GB model loaded

2026-05-16 07:32:17 +00:00

nvfp4_cutedsl.py

fix: cast expert_offsets to int32 for CuTeDSL kernel

2026-05-16 07:15:57 +00:00