Allow users to specify kv cache memory size (#21489)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -110,6 +110,14 @@ class LLM:
             values will increase the KV cache size and thus improve the model's
             throughput. However, if the value is too high, it may cause out-of-
             memory (OOM) errors.
+        kv_cache_memory_bytes: Size of KV Cache per GPU in bytes. By default,
+            this is set to None and vllm can automatically infer the kv cache
+            size based on gpu_memory_utilization. However, users may want to
+            manually specify the kv cache memory size. kv_cache_memory_bytes
+            allows more fine-grained control of how much memory gets used when
+            compared with using gpu_memory_utilization. Note that
+            kv_cache_memory_bytes (when not-None) ignores
+            gpu_memory_utilization.
         swap_space: The size (GiB) of CPU memory per GPU to use as swap space.
             This can be used for temporarily storing the states of the requests
             when their `best_of` sampling parameters are larger than 1. If all
@@ -184,6 +192,7 @@ class LLM:
         hf_overrides: Optional[HfOverrides] = None,
         mm_processor_kwargs: Optional[dict[str, Any]] = None,
         override_pooler_config: Optional[PoolerConfig] = None,
+        kv_cache_memory_bytes: Optional[int] = None,
         compilation_config: Optional[Union[int, dict[str, Any],
                                            CompilationConfig]] = None,
         logits_processors: Optional[list[Union[str,
@@ -251,6 +260,7 @@ class LLM:
             tokenizer_revision=tokenizer_revision,
             seed=seed,
             gpu_memory_utilization=gpu_memory_utilization,
+            kv_cache_memory_bytes=kv_cache_memory_bytes,
             swap_space=swap_space,
             cpu_offload_gb=cpu_offload_gb,
             enforce_eager=enforce_eager,
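
The precedence rule stated in the new docstring (an explicit kv_cache_memory_bytes overrides gpu_memory_utilization) can be illustrated with a small sketch. The helper `resolve_kv_cache_bytes` and its parameters `total_gpu_bytes` and `non_kv_bytes` are hypothetical names introduced here for illustration; they are not part of this commit. The fallback branch is a simplifying assumption about how the size might be inferred, not vLLM's actual profiling logic.

```python
from typing import Optional

GiB = 1024**3  # bytes per gibibyte


def resolve_kv_cache_bytes(
    total_gpu_bytes: int,
    non_kv_bytes: int,
    gpu_memory_utilization: float,
    kv_cache_memory_bytes: Optional[int] = None,
) -> int:
    """Hypothetical helper: choose the per-GPU KV cache budget in bytes."""
    if kv_cache_memory_bytes is not None:
        # An explicit size takes precedence; gpu_memory_utilization is ignored.
        if kv_cache_memory_bytes <= 0:
            raise ValueError("kv_cache_memory_bytes must be positive")
        return kv_cache_memory_bytes

    # Assumed fallback: a fraction of total GPU memory, minus whatever
    # weights and activations (non_kv_bytes) already occupy.
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    budget = int(total_gpu_bytes * gpu_memory_utilization) - non_kv_bytes
    if budget <= 0:
        raise ValueError("no memory left for the KV cache")
    return budget


# Inferred from the utilization fraction when no explicit size is given.
inferred = resolve_kv_cache_bytes(80 * GiB, 20 * GiB, 0.5)

# Explicit size: the utilization value no longer affects the result.
explicit = resolve_kv_cache_bytes(80 * GiB, 20 * GiB, 0.5,
                                  kv_cache_memory_bytes=4 * GiB)
```

Going by the constructor signature in the diff above, the explicit form would be passed to the real API as a keyword argument in bytes, for example `LLM(model=..., kv_cache_memory_bytes=4 * 1024**3)`.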