[CI/Build] Add markdown linter (#11857)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
This commit is contained in:
@@ -6,7 +6,7 @@ The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is
|
||||
|
||||
To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
|
||||
|
||||
```
|
||||
```text
|
||||
Block 1 Block 2 Block 3
|
||||
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]
|
||||
Block 1: |<--- block tokens ---->|
|
||||
@@ -14,19 +14,16 @@ Block 2: |<------- prefix ------>| |<--- block tokens --->|
|
||||
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
|
||||
```
|
||||
|
||||
|
||||
In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:
|
||||
|
||||
```
|
||||
```text
|
||||
hash(prefix tokens + block tokens) <--> KV Block
|
||||
```
|
||||
|
||||
With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from their logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.
|
||||
|
||||
|
||||
This design achieves automatic prefix caching without the need of maintaining a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed by itself, which enables us to manages the KV cache as ordinary caches in operating system.
|
||||
|
||||
|
||||
## Generalized Caching Policy
|
||||
|
||||
Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
|
||||
@@ -41,5 +38,5 @@ Note that this eviction policy effectively implements the exact policy as in [Ra
|
||||
|
||||
However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above:
|
||||
|
||||
- Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
|
||||
- Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
|
||||
* Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
|
||||
* Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
|
||||
|
||||
Reference in New Issue
Block a user