docs/models/extensions/instanttensor.md

# Loading Model Weights with InstantTensor

InstantTensor accelerates loading Safetensors weights on CUDA devices through distributed loading, pipelined prefetching, and direct I/O. InstantTensor also supports GDS (GPUDirect Storage) when available.
For more details, see the [InstantTensor GitHub repository](https://github.com/scitix/InstantTensor).

## Installation

```bash
pip install instanttensor
```

## Use InstantTensor in vLLM

Add `--load-format instanttensor` as a command-line argument.

For example:

```bash
vllm serve Qwen/Qwen2.5-0.5B --load-format instanttensor
```

## Benchmarks

| Model | GPU | Backend | Load Time (s) | Throughput (GB/s) | Speedup |
| --- | ---: | --- | ---: | ---: | --- |
| Qwen3-30B-A3B | 1*H200 | Safetensors | 57.4 | 1.1 | 1x |
| Qwen3-30B-A3B | 1*H200 | InstantTensor | 1.77 | 35 | <span style="color: green">**32.4x**</span> |
| DeepSeek-R1 | 8*H200 | Safetensors | 160 | 4.3 | 1x |
| DeepSeek-R1 | 8*H200 | InstantTensor | 15.3 | 45 | <span style="color: green">**10.5x**</span> |

For the full benchmark results, see <https://github.com/scitix/InstantTensor/blob/main/docs/benchmark.md>.
[Feature] Add InstantTensor weight loader (#36139) 2026-03-15 01:05:23 +08:00			`# Loading Model Weights with InstantTensor`

			`InstantTensor accelerates loading Safetensors weights on CUDA devices through distributed loading, pipelined prefetching, and direct I/O. InstantTensor also supports GDS (GPUDirect Storage) when available.`
			`For more details, see the [InstantTensor GitHub repository](https://github.com/scitix/InstantTensor).`

			`## Installation`

			```bash
			`pip install instanttensor`
			```

			`## Use InstantTensor in vLLM`

			Add `--load-format instanttensor` as a command-line argument.

			`For example:`

			```bash
			`vllm serve Qwen/Qwen2.5-0.5B --load-format instanttensor`
			```

			`## Benchmarks`

			`\| Model \| GPU \| Backend \| Load Time (s) \| Throughput (GB/s) \| Speedup \|`
			`\| --- \| ---: \| --- \| ---: \| ---: \| --- \|`
			`\| Qwen3-30B-A3B \| 1*H200 \| Safetensors \| 57.4 \| 1.1 \| 1x \|`
			`\| Qwen3-30B-A3B \| 1H200 \| InstantTensor \| 1.77 \| 35 \| <span style="color: green">32.4x*</span> \|`
			`\| DeepSeek-R1 \| 8*H200 \| Safetensors \| 160 \| 4.3 \| 1x \|`
			`\| DeepSeek-R1 \| 8H200 \| InstantTensor \| 15.3 \| 45 \| <span style="color: green">10.5x*</span> \|`

			`For the full benchmark results, see <https://github.com/scitix/InstantTensor/blob/main/docs/benchmark.md>.`