[Core][AMD] Migrate fully transparent sleep mode to ROCm platform (#12695)

Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>
2025-11-13 01:24:12 +02:00
parent 10f01d5a3a
commit 4ca5cd5740
11 changed files with 582 additions and 31 deletions
--- a/docs/features/sleep_mode.md
+++ b/docs/features/sleep_mode.md
@@ -11,7 +11,7 @@ Key benefits:
 - **Fine-grained control**: Optionally wake up only model weights or KV cache to avoid OOM during weight updates.

 !!! note
-    This feature is only supported on CUDA platform.
+    This feature is now supported on CUDA and ROCm platform.

 !!! note
    For more information, see this [Blog Post](https://blog.vllm.ai/2025/10/26/sleep-mode.html).
@@ -116,3 +116,7 @@ curl -X POST 'http://localhost:8000/wake_up?tags=kv_cache'

 !!! note
    These endpoints are only available when passing `VLLM_SERVER_DEV_MODE=1`.
+
+## Limitation
+
+On ROCm, the virtual memory allocation on ROCm is done through chunked memory allocation. You can control the chunk size through `VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE` (in MB). The default value is set at 256MB. The larger the chunk size the faster the performance. However, setting it too large will cause OOM. So if you encounter OOM when using sleep mode. Try reducing the chunk size. It is recommended to define the chunk size as a power of 2.