torch.tensor() and new_tensor() both trigger CPU->CUDA copies during cudagraph capture. Pre-cache the LUT on first use per device.
torch.tensor() and new_tensor() both trigger CPU->CUDA copies during cudagraph capture. Pre-cache the LUT on first use per device.