buf[:gpu_scalar, :] triggers cudaErrorStreamCaptureInvalidated. Always use the full pre-allocated buffer; extra rows are zeros.
buf[:gpu_scalar, :] triggers cudaErrorStreamCaptureInvalidated. Always use the full pre-allocated buffer; extra rows are zeros.