The warmup allocated 1GB of dummy tensors but the model already uses 175.7GB of the 178.35GB per GPU. No room. With FULL_AND_PIEWISE CUDA graph mode, the kernel compiles during the graph capture phase (which manages memory properly). The warmup was a band-aid for eager mode and is now redundant.