[Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removing block manager v1. This is the initial piece of the prefix-caching-centric design: to get there, we need to simplify the code path so that only the v2 block manager (which has much higher prefix-caching performance) is used.
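As a minimal sketch of what this change means for users, the speculative-decoding configuration from the docs below no longer needs the flag once v2 is the default (the `model` name and sampling settings here are illustrative assumptions, not part of this diff):

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Assumed example: any model/draft-model pair from the speculative
# decoding docs works the same way.
llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    # use_v2_block_manager=True,  # no longer needed: v2 is now the default
)
outputs = llm.generate(prompts, sampling_params)
```

Passing `use_v2_block_manager=True` explicitly should remain harmless during the deprecation window; the diff below simply drops it from the documented examples.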
@@ -30,7 +30,6 @@ The following code configures vLLM in an offline mode to use speculative decodin
     tensor_parallel_size=1,
     speculative_model="facebook/opt-125m",
     num_speculative_tokens=5,
-    use_v2_block_manager=True,
 )
 outputs = llm.generate(prompts, sampling_params)

@@ -104,7 +103,6 @@ matching n-grams in the prompt. For more information read `this thread. <https:/
     speculative_model="[ngram]",
     num_speculative_tokens=5,
     ngram_prompt_lookup_max=4,
-    use_v2_block_manager=True,
 )
 outputs = llm.generate(prompts, sampling_params)

@@ -135,7 +133,6 @@ For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-
     tensor_parallel_size=4,
     speculative_model="ibm-fms/llama3-70b-accelerator",
     speculative_draft_tensor_parallel_size=1,
-    use_v2_block_manager=True,
 )
 outputs = llm.generate(prompts, sampling_params)
