[v1] Hybrid Memory Allocator (#17996)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Chen Zhang
2025-06-06 11:47:09 +08:00
committed by GitHub
parent 3465b87ef8
commit f8a1a2d108
21 changed files with 1605 additions and 440 deletions

@@ -157,11 +157,10 @@ class SlidingWindowSpec(AttentionSpec):
 @dataclass
 class KVCacheTensor:
     """
-    A dataclass for specifying how the workers should initialize the KV cache
-    for a layer. Only contains the size of KV cache for that layer for now. Will
-    be extended to support multiple layers sharing the same memory pool.
+    A class for specifying how the workers should initialize the KV cache.
     """
-    size: int  # The size of KV cache Tensor in bytes
+    size: int  # size of the KV cache tensor in bytes
+    shared_by: list[str]  # layer names that share the same KV cache tensor


 @dataclass
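
The new shared_by field is what enables the hybrid allocator: several layers can point at one backing buffer. Below is a minimal sketch, not part of this commit, with layer names and size invented for illustration, of one tensor shared by two layers:

from dataclasses import dataclass


# Stand-in mirroring the fields in the diff above.
@dataclass
class KVCacheTensor:
    size: int  # size of the KV cache tensor in bytes
    shared_by: list[str]  # layer names sharing this allocation


# One 2 GiB buffer (illustrative) backing the KV cache of a full-attention
# layer and a sliding-window layer; both names refer to the same memory.
tensor = KVCacheTensor(
    size=2 * 1024**3,
    shared_by=["model.layers.0.attn", "model.layers.1.attn"],
)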
@@ -183,27 +182,13 @@ class KVCacheConfig:
"""
"""The number of KV cache blocks"""
num_blocks: int
"""layer_name -> how to initialize KV cache for that layer"""
tensors: dict[str, KVCacheTensor]
"""How should model runner initialize the KV cache tensors for each layer"""
kv_cache_tensors: list[KVCacheTensor]
"""
The kv cache groups of the model.
The layers in the models are repeated with some patterns, e.g., a model
with 10 full attention layers and 20 sliding window attention layers can be
regarded as repeating the pattern (1 * full, 2 * sw) 10 times.
The KVCacheManager allocates different block tables for each of the 3 layers
in the pattern, and repeats each of them 10 times to generate the
block_table for the 30 layers in the model.
Therefore, we can group the layers in the model into 3 groups, each of which
contains 10 layers in the model.
The KVCacheManager allocates the block_table for each group based on its
kv_cache spec, and the model runner applies the block table to each layer
in the group.
For example:
1. A model only uses full attention. The pattern is
(num_hidden_layers * full), so there is only one group and the block table
is shared by all layers.
2. (WIP) A model with 10 full attention layers and 20 sliding window
attention layers. There are 3 layers in the pattern (1 * full, 2 * sw), so
there are 3 groups, each of which represents 10 layers in the model.
For models with only one type of attention, there is only one group that
contains all layers.
For models with multiple types of attention, there will be multiple groups,
see `_get_kv_cache_config_uniform_page_size` for more details.
"""
kv_cache_groups: list[KVCacheGroupSpec]
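
To make the grouping concrete, here is a hedged sketch reusing the KVCacheTensor stand-in above. The KVCacheGroupSpec fields are abbreviated (the real class in vllm/v1/kv_cache_interface.py carries a full KV cache spec rather than a plain string), and all values are invented for illustration:

# Minimal stand-ins for the classes touched by this diff.
@dataclass
class KVCacheGroupSpec:
    layer_names: list[str]  # layers that share one block table
    kv_cache_spec: str      # stands in for a full/sliding-window spec


@dataclass
class KVCacheConfig:
    num_blocks: int
    kv_cache_tensors: list[KVCacheTensor]
    kv_cache_groups: list[KVCacheGroupSpec]


# Two attention types -> two groups: the KVCacheManager allocates one
# block table per group, and a single tensor shared by both layers keeps
# their KV caches in the same memory pool.
config = KVCacheConfig(
    num_blocks=8192,  # illustrative
    kv_cache_tensors=[
        KVCacheTensor(
            size=2 * 1024**3,
            shared_by=["model.layers.0.attn", "model.layers.1.attn"],
        ),
    ],
    kv_cache_groups=[
        KVCacheGroupSpec(["model.layers.0.attn"], "full_attention"),
        KVCacheGroupSpec(["model.layers.1.attn"], "sliding_window"),
    ],
)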