First call: cute.compile() with real tensors (warmup). Subsequent calls: just invoke compiled() with new CuTe views. No cute.compile() in the forward path = cudagraph-safe.
First call: cute.compile() with real tensors (warmup). Subsequent calls: just invoke compiled() with new CuTe views. No cute.compile() in the forward path = cudagraph-safe.