# Sleep Mode

vLLM's Sleep Mode allows you to temporarily release most GPU memory used by a model, including model weights and KV cache, without stopping the server or unloading the Docker container. This is especially useful for RLHF, training, or cost-saving scenarios where GPU resources need to be freed between inference workloads.

Key benefits:

- **Frees GPU memory**: Offloads model weights to CPU RAM and discards KV cache, releasing up to 90%+ of GPU memory for other tasks.
- **Fast resume**: Quickly wake up the engine and resume inference without full model reload.
- **API endpoints**: Control sleep/wake_up state via HTTP endpoints or Python API.
- **Supports distributed workloads**: Works with tensor parallelism, pipeline parallelism, etc.
- **Fine-grained control**: Optionally wake up only model weights or KV cache to avoid OOM during weight updates.

!!! note
    This feature is supported on the CUDA and ROCm platforms.

!!! note
    For more information, see this [Blog Post](https://blog.vllm.ai/2025/10/26/sleep-mode.html).

## Sleep levels

- **Level 1**: Offloads the model weights to CPU memory and discards the KV cache; the contents of the KV cache are lost. Use level 1 to sleep and later wake up the engine to run the same model again. Make sure there is enough CPU memory to hold the model weights.
- **Level 2**: Discards both the model weights and the KV cache; the contents of both are lost. The model's buffers, such as rope scaling tensors, are kept in CPU memory. Use level 2 to sleep and later wake up the engine to run a different model or to update the model, where the previous weights are no longer needed (e.g., an RLHF weight update).
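
Because level 1 keeps a copy of the weights in CPU memory, a rough size estimate can help when checking that the host has enough RAM. The sketch below is a back-of-the-envelope calculation, not a vLLM API: the function name, the example parameter count, and the 2-byte element size (bf16/fp16) are assumptions for illustration.

```python
def estimate_weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Approximate CPU RAM needed to hold offloaded model weights.

    Assumes every parameter uses the same element size (2 bytes for
    bf16/fp16). Buffers and allocator overhead are ignored.
    """
    if num_params < 0 or bytes_per_param <= 0:
        raise ValueError("num_params must be >= 0 and bytes_per_param > 0")
    return num_params * bytes_per_param


# Hypothetical model with 600 million parameters stored in 16-bit precision.
needed = estimate_weight_bytes(600_000_000)
print(f"Approximate CPU RAM for weights: {needed / 1024**3:.2f} GiB")
```

Treat the result as a lower bound; the actual footprint depends on the model's dtype and any extra tensors that are offloaded with it.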

## Usage

### Offline inference

Enable sleep mode by passing `enable_sleep_mode=True` to the `LLM` class.

```python
from vllm import LLM
llm = LLM("Qwen/Qwen3-0.6B", enable_sleep_mode=True)
```

#### Python API

```python
# Sleep level 1
# Put the engine to sleep (level=1: offload weights to CPU RAM, discard KV cache)
llm.sleep(level=1)

# Wake up the engine (restore weights)
llm.wake_up()
```
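
If other GPU work should run while the engine is asleep, the sleep and wake-up calls can be paired in a context manager so the engine is woken up even when that work raises. This is a minimal sketch, not part of vLLM: `asleep_level1` is a hypothetical helper, and it only assumes that `llm` exposes the `sleep(level=...)` and `wake_up()` methods shown above.

```python
from contextlib import contextmanager


@contextmanager
def asleep_level1(llm):
    """Keep `llm` in level 1 sleep for the duration of the `with` block.

    `llm` is any object with `sleep(level=...)` and `wake_up()` methods,
    such as an `LLM` created with `enable_sleep_mode=True`.
    """
    # Offload weights to CPU RAM and discard the KV cache.
    llm.sleep(level=1)
    try:
        yield llm
    finally:
        # Restore the engine even if the body of the `with` block raised.
        llm.wake_up()
```

Usage would look like `with asleep_level1(llm): ...`, with the freed GPU memory available to the code inside the block.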

```python
# Sleep level 2
# Put the engine to sleep (level=2: discard both weights and KV cache)
llm.sleep(level=2)

# Reallocate weights memory only
llm.wake_up(tags=["weights"])

# Load weights in-place
llm.collective_rpc("reload_weights")

# Reallocate KV cache
llm.wake_up(tags=["kv_cache"])
```

#### RLHF weight updates

During RLHF training, vLLM allows you to selectively wake up only the model weights or the KV cache using the `tags` argument of `wake_up()`. This fine-grained control is especially useful when updating model weights: by waking up just the weights (e.g., `llm.wake_up(tags=["weights"])`), you avoid allocating memory for the KV cache until after the weight update is complete. This approach helps prevent GPU out-of-memory (OOM) errors, particularly with large models, by minimizing peak memory usage during weight synchronization and update operations.

Use `tags=["weights"]` or `tags=["kv_cache"]` to control which resources are restored. **Note**: `is_sleeping` reports `true` until all components are awake.

```python
# Put engine to deep sleep (level=2)
llm.sleep(level=2)
# ... Get the new weights
# Wake up only weights to avoid OOM
llm.wake_up(tags=["weights"])
# ... Update the weights
# Wake up the KV cache after the weights are updated
llm.wake_up(tags=["kv_cache"])
```
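
The ordering above matters, so it can be convenient to wrap it in one function that takes the weight update as a callback. The following is a sketch under stated assumptions: `update_weights_while_asleep` and `apply_update` are hypothetical names, and the function relies only on the `sleep` and `wake_up` calls already shown.

```python
from typing import Callable


def update_weights_while_asleep(llm, apply_update: Callable[[], None]) -> None:
    """Run a level 2 sleep, weight update, and wake-up in the documented order.

    `apply_update` is a caller-supplied (hypothetical) callback that writes
    the new weights into the reallocated weight memory.
    """
    # Discard weights and KV cache to free GPU memory.
    llm.sleep(level=2)
    # Reallocate weight memory only; the KV cache stays unallocated.
    llm.wake_up(tags=["weights"])
    # Write the new weights while the KV cache is not yet allocated.
    apply_update()
    # Reallocate the KV cache once the update is complete.
    llm.wake_up(tags=["kv_cache"])
```

For example, passing `lambda: llm.collective_rpc("reload_weights")` as `apply_update` reproduces the level 2 sequence from the Python API section.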

### Online Serving

To enable sleep mode in a vLLM server, set the environment variable `VLLM_SERVER_DEV_MODE=1` and pass `--enable-sleep-mode` when starting the server.

#### Server in development mode

Setting `VLLM_SERVER_DEV_MODE=1` enables development endpoints. Do not expose these endpoints to users.

```bash
VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-0.6B \
  --enable-sleep-mode \
  --port 8000
```

The following example puts a model to sleep at level 1 and wakes it up.

```bash
curl -X POST 'http://localhost:8000/sleep?level=1'
curl -X POST 'http://localhost:8000/wake_up'
```

The following example puts a model to sleep at level 2 and wakes it up in stages.

```bash
curl -X POST 'http://localhost:8000/sleep?level=2'
# Reallocate weights memory only
curl -X POST 'http://localhost:8000/wake_up?tags=weights'
# Load weights in-place
curl -X POST 'http://localhost:8000/collective_rpc' -H 'Content-Type: application/json' -d '{"method":"reload_weights"}'
# Reallocate KV cache
curl -X POST 'http://localhost:8000/wake_up?tags=kv_cache'
```

#### HTTP endpoints

- `POST /sleep?level=1` — Put the model to sleep (`level=1`).
- `POST /wake_up` — Wake up the model. Supports optional `tags` query parameters for partial wake-up (e.g., `?tags=weights`).
- `POST /collective_rpc` — Perform a collective remote procedure call (RPC).
- `GET /is_sleeping` — Check if the model is sleeping.

!!! note
    These endpoints are only available when the server is started with `VLLM_SERVER_DEV_MODE=1`.
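
The same endpoints can be called from Python using only the standard library. The sketch below builds `urllib.request.Request` objects that mirror the curl commands above; the helper names and the `BASE_URL` value are assumptions for illustration, and the response bodies are not parsed because their format is not described here.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import urlencode

# Assumed server address, matching the curl examples above.
BASE_URL = "http://localhost:8000"


def sleep_request(level: int, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build `POST /sleep?level=<level>`."""
    url = f"{base_url}/sleep?{urlencode({'level': level})}"
    return urllib.request.Request(url, method="POST")


def wake_up_request(tag: Optional[str] = None,
                    base_url: str = BASE_URL) -> urllib.request.Request:
    """Build `POST /wake_up`, optionally with a single `tags` value."""
    url = f"{base_url}/wake_up"
    if tag is not None:
        url += "?" + urlencode({"tags": tag})
    return urllib.request.Request(url, method="POST")


def collective_rpc_request(method: str,
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Build `POST /collective_rpc` with a JSON body naming the method."""
    body = json.dumps({"method": method}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/collective_rpc",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_sleeping_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build `GET /is_sleeping`."""
    return urllib.request.Request(f"{base_url}/is_sleeping", method="GET")
```

To send a request, pass it to `urllib.request.urlopen`, for example `urllib.request.urlopen(sleep_request(1))`. This requires a running server started in development mode.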

## Limitation

On ROCm, virtual memory is allocated in chunks. You can control the chunk size with `VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE` (in MB); the default is 256 MB. Larger chunks give faster performance, but a chunk size that is too large can cause OOM. If you encounter OOM when using sleep mode, try reducing the chunk size. It is recommended to set the chunk size to a power of 2.
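
As an illustration, the chunk size can be set from Python before the engine is created. This is a sketch under assumptions: the helper names are hypothetical, the power-of-2 check is stricter than the recommendation above, and it assumes the variable is read when the engine starts, so it is set before any vLLM object is constructed.

```python
import os

ENV_VAR = "VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE"
DEFAULT_CHUNK_MB = 256  # default value stated in this section


def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of 2."""
    return n > 0 and (n & (n - 1)) == 0


def set_chunk_size_mb(chunk_mb: int) -> None:
    """Set the ROCm sleep-mode chunk size (in MB) for this process."""
    if not is_power_of_two(chunk_mb):
        raise ValueError(f"chunk size should be a power of 2, got {chunk_mb}")
    os.environ[ENV_VAR] = str(chunk_mb)


# After hitting OOM with the default, try half the chunk size.
set_chunk_size_mb(DEFAULT_CHUNK_MB // 2)
```

Halving keeps the value a power of 2; repeat the step if OOM persists, at the cost of slower sleep and wake-up.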