[Docs] Switch to better markdown linting pre-commit hook (#21851)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -361,7 +361,7 @@ instances in Prometheus.
We use this concept for the `vllm:cache_config_info` metric:

```
```text
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
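For readers unfamiliar with the info-metric pattern shown above (a gauge permanently set to 1 whose labels carry the configuration), here is a minimal sketch using the `prometheus_client` library; the metric and label names mirror the sample output, but the snippet is illustrative rather than vLLM's actual code.

```python
# Minimal sketch of an "info metric": a gauge fixed at 1 whose labels
# carry static configuration values (illustrative, not vLLM's code).
from prometheus_client import Gauge, start_http_server

cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information of the LLMEngine CacheConfig",
    labelnames=["block_size", "cache_dtype", "gpu_memory_utilization"],
)

# The value is always 1; the interesting data lives in the labels.
cache_config_info.labels(
    block_size="16", cache_dtype="auto", gpu_memory_utilization="0.9"
).set(1)

start_http_server(8000)  # then scrape http://localhost:8000/metrics
```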
@@ -686,7 +686,7 @@ documentation for this option states:
The metrics were added by <gh-pr:7089> and show up in an OpenTelemetry trace as:

```
```text
-> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117)
-> gen_ai.latency.time_in_model_forward: Double(3.151565277099609)
-> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676)
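For orientation, the attributes above are ordinary span attributes. A rough sketch of how they could be attached with the OpenTelemetry Python API follows; the values are placeholders, and a TracerProvider/exporter still has to be configured for the span to be exported anywhere.

```python
# Sketch: recording gen_ai.latency.* values as OpenTelemetry span attributes.
# Placeholder values; a TracerProvider/exporter must be set up separately.
from opentelemetry import trace

tracer = trace.get_tracer("vllm.example")

with tracer.start_as_current_span("llm_request") as span:
    span.set_attribute("gen_ai.latency.time_in_scheduler", 0.0175)
    span.set_attribute("gen_ai.latency.time_in_model_forward", 3.1516)
    span.set_attribute("gen_ai.latency.time_in_model_execute", 3.6468)
```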
@@ -5,6 +5,7 @@ An implementation of xPyD with dynamic scaling based on point-to-point communica
## Detailed Design

### Overall Process

As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow:

1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.
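Step 1 of the flow can be exercised with a plain HTTP request; the proxy address, port, and model name below are placeholders for your deployment, not values mandated by this guide.

```python
# Step 1 sketch: the client posts a completion request to the Proxy/Router.
# Address, port, and model name are deployment-specific placeholders.
import requests

resp = requests.post(
    "http://10.0.1.1:10001/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.json())
```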
@@ -23,7 +24,7 @@ A simple HTTP service acts as the entry point for client requests and starts a b
The Proxy/Router is responsible for selecting 1P1D based on the characteristics of the client request, such as the prompt, and generating a corresponding `request_id`, for example:

```
```text
cmpl-___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_93923d63113b4b338973f24d19d4bf11-0
```
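A `request_id` in the format above can be assembled as follows; `make_request_id` is a hypothetical helper written for illustration, not a function from the vLLM codebase.

```python
# Hypothetical helper that builds a request_id in the format shown above:
# cmpl-___prefill_addr_<P>___decode_addr_<D>_<uuid>-<index>
import uuid


def make_request_id(prefill_addr: str, decode_addr: str, index: int = 0) -> str:
    return (
        f"cmpl-___prefill_addr_{prefill_addr}"
        f"___decode_addr_{decode_addr}_{uuid.uuid4().hex}-{index}"
    )


print(make_request_id("10.0.1.2:21001", "10.0.1.3:22001"))
```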
@@ -70,6 +71,7 @@ pip install "vllm>=0.9.2"
## Run xPyD

### Instructions

- The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model.
- Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput.
- For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance.
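As a rough aid for the 10% rule of thumb above, a candidate value can be derived from the GPU's total memory as sketched below; the snippet only suggests a number to plug in wherever `kv_buffer_size` is configured for your connector.

```python
# Rule-of-thumb sketch: suggest a kv_buffer_size of ~10% of GPU memory, in bytes.
# Assumes PyTorch with CUDA available and uses device index 0.
import torch

total_bytes = torch.cuda.get_device_properties(0).total_memory
suggested = int(total_bytes * 0.10)
print(f"suggested kv_buffer_size = {suggested} bytes (~{suggested / 1e9:.1f} GB)")
```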
@@ -18,10 +18,12 @@ In the example above, the KV cache in the first block can be uniquely identified
* Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collisions.
* Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments.

> **Note 1:** We only cache full blocks.
!!! note "Note 1"
    We only cache full blocks.

> **Note 2:** The above hash key structure is not 100% collision free. Theoretically, it’s still possible for different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise using SHA256** as the hash function instead of the default builtin hash.
SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context).
!!! note "Note 2"
    The above hash key structure is not 100% collision free. Theoretically, it’s still possible for different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise using SHA256** as the hash function instead of the default builtin hash.
    SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context).
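To make the key structure above concrete, here is a conceptual sketch of hashing a block from its parent hash, its tokens, and any extra values, with an optional SHA-256 variant as discussed in Note 2. It mirrors the structure described here but is not vLLM's actual implementation.

```python
# Conceptual sketch of the block hash key: (parent hash, block tokens, extras).
# Illustrative only; vLLM's real implementation differs in detail.
import hashlib
import pickle
from typing import Any, Optional


def hash_block(parent_hash: Optional[int],
               block_tokens: tuple,
               extra: Any = None,
               use_sha256: bool = False) -> int:
    key = (parent_hash, block_tokens, extra)
    if use_sha256:
        # Collision-resistant option (Note 2): SHA-256 over a serialized key.
        digest = hashlib.sha256(pickle.dumps(key)).digest()
        return int.from_bytes(digest[:8], "big")
    return hash(key)  # default builtin hash


first = hash_block(None, (1, 2, 3, 4))
second = hash_block(first, (5, 6, 7, 8))  # chained to the parent block's hash
```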
**A hashing example with multi-modality inputs**
In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages:
@@ -92,7 +94,8 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache
With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.

> **Note:** Cache isolation is not supported in engine V0.
!!! note
    Cache isolation is not supported in engine V0.
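As a usage sketch, a client can opt into a shared cache namespace by sending the salt this section describes with each request. The field name follows the surrounding documentation; the endpoint, model, and salt value are placeholders, and `extra_body` is simply how the OpenAI Python client forwards non-standard fields.

```python
# Sketch: requests sharing the same cache_salt may share prefix cache entries;
# requests with a different salt stay isolated. Endpoint/model/salt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize this shared document ..."}],
    extra_body={"cache_salt": "trust-group-A-salt"},
)
```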
## Data Structure
@@ -8,7 +8,7 @@ Throughout the example, we will run a common Llama model using v1, and turn on d
In the very verbose logs, we can see:

```
```console
INFO 03-07 03:06:55 [backends.py:409] Using cache directory: ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0 for vLLM's torch.compile
```
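If you want to confirm what ended up in that cache directory, listing it is enough; the numeric hash component (`1517964802`) is specific to this run and will differ for other models and configurations.

```python
# Sketch: list the torch.compile cache directory reported in the log above.
# The numeric hash component differs per model/config; adjust the path.
import os

cache_dir = os.path.expanduser(
    "~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0")
for root, _dirs, files in os.walk(cache_dir):
    for name in files:
        print(os.path.join(root, name))
```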
@@ -75,7 +75,7 @@ Every submodule can be identified by its index, and will be processed individual
In the very verbose logs, we can also see:

```
```console
DEBUG 03-07 03:52:37 [backends.py:134] store the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py')
DEBUG 03-07 03:52:39 [backends.py:134] store the 1-th graph for shape None from inductor via handle ('f7fmlodmf3h3by5iiu2c4zarwoxbg4eytwr3ujdd2jphl4pospfd', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/ly/clyfzxldfsj7ehaluis2mca2omqka4r7mgcedlf6xfjh645nw6k2.py')
...
@@ -93,7 +93,7 @@ One more detail: you can see that the 1-th graph and the 15-th graph have the sa
If we already have the cache directory (e.g. run the same code for the second time), we will see the following logs:

```
```console
DEBUG 03-07 04:00:45 [backends.py:86] Directly load the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py')
```
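The two log excerpts above (store on the first run, directly load on the second) amount to a keyed artifact cache. The sketch below is a generic illustration of that store/load-by-handle pattern under assumed names; it is not vLLM's or Inductor's actual cache code.

```python
# Generic sketch of the store / directly-load pattern seen in the logs:
# the first run stores each compiled artifact under a content-hash handle,
# later runs load the artifact from the handle instead of recompiling.
import hashlib
import os

CACHE_ROOT = os.path.expanduser("~/.cache/example_compile_cache")


def store(index: int, artifact: bytes) -> tuple:
    key = hashlib.sha256(artifact).hexdigest()
    path = os.path.join(CACHE_ROOT, f"{key}.bin")
    os.makedirs(CACHE_ROOT, exist_ok=True)
    with open(path, "wb") as f:
        f.write(artifact)
    print(f"store the {index}-th graph via handle ({key!r}, {path!r})")
    return key, path


def load(index: int, handle: tuple) -> bytes:
    key, path = handle
    print(f"Directly load the {index}-th graph via handle ({key!r}, {path!r})")
    with open(path, "rb") as f:
        return f.read()


handle = store(0, b"compiled artifact placeholder")
artifact = load(0, handle)
```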