nvfp4-megamoe-kernel/vllm
biondizzle a2cac7a7fe fix: remove CuTeDSL warmup — OOM with 175GB model loaded
The warmup allocated 1GB of dummy tensors but the model already
uses 175.7GB of the 178.35GB per GPU. No room.

With FULL_AND_PIECEWISE CUDA graph mode, the kernel compiles during
the graph capture phase (which manages memory properly). The warmup
was a band-aid for eager mode and is now redundant.
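
A rough sketch of the headroom arithmetic behind this change, using the
figures quoted above (178.35GB per GPU, 175.7GB used, 1GB warmup). The
helper names and the `reserve_gb` parameter are hypothetical and not part
of the kernel or vLLM; `reserve_gb` stands in for memory that is assumed
to be unavailable to the warmup (CUDA context, workspaces, allocator
fragmentation). The commit does not state how large that is.

```python
def headroom_gb(total_gb: float, used_gb: float) -> float:
    """Raw free memory per GPU, before any other consumers are counted."""
    return total_gb - used_gb


def warmup_fits(total_gb: float, used_gb: float,
                warmup_gb: float, reserve_gb: float) -> bool:
    """True if the warmup allocation fits after setting aside reserve_gb.

    reserve_gb is an assumed placeholder for memory the warmup cannot use.
    """
    return headroom_gb(total_gb, used_gb) - reserve_gb >= warmup_gb


if __name__ == "__main__":
    total, used, warmup = 178.35, 175.7, 1.0  # figures from the commit message
    print(f"raw headroom: {headroom_gb(total, used):.2f} GB")
    # With nothing reserved, 1GB fits on paper.
    print(warmup_fits(total, used, warmup, reserve_gb=0.0))
    # With an assumed 2GB held by other consumers, it does not.
    print(warmup_fits(total, used, warmup, reserve_gb=2.0))
```

On paper the raw headroom is about 2.65GB, so the observed OOM suggests
that a good part of it was already claimed elsewhere.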
2026-05-16 07:32:17 +00:00