[V1][Usage] Refactor speculative decoding configuration and tests (#14434)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
@@ -30,8 +30,10 @@ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 llm = LLM(
     model="facebook/opt-6.7b",
     tensor_parallel_size=1,
-    speculative_model="facebook/opt-125m",
-    num_speculative_tokens=5,
+    speculative_config={
+        "model": "facebook/opt-125m",
+        "num_speculative_tokens": 5,
+    },
 )
 outputs = llm.generate(prompts, sampling_params)
 
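The hunks in this commit all follow the same pattern: top-level speculative decoding arguments move into a single `speculative_config` dictionary. As a rough illustration of that renaming, here is a minimal sketch of a migration helper. The function name `migrate_speculative_kwargs` and the `LEGACY_TO_NEW` table are hypothetical and not part of vLLM; the table covers only the keys that appear in this diff, so other legacy options would need their own entries.

```python
# Hypothetical helper, not part of vLLM: rewrites the legacy keyword
# arguments seen in this diff into the nested `speculative_config` form.
LEGACY_TO_NEW = {
    "speculative_model": "model",
    "num_speculative_tokens": "num_speculative_tokens",
    "ngram_prompt_lookup_max": "prompt_lookup_max",
    "speculative_draft_tensor_parallel_size": "draft_tensor_parallel_size",
}


def migrate_speculative_kwargs(legacy_kwargs):
    """Return a new kwargs dict with legacy speculative options nested."""
    remaining = {}
    config = {}
    for key, value in legacy_kwargs.items():
        if key == "speculative_model" and value == "[ngram]":
            # The diff replaces the "[ngram]" pseudo-model with a method key.
            config["method"] = "ngram"
        elif key in LEGACY_TO_NEW:
            config[LEGACY_TO_NEW[key]] = value
        else:
            # Unrelated arguments (model, tensor_parallel_size, ...) pass through.
            remaining[key] = value
    if config:
        remaining["speculative_config"] = config
    return remaining


old_kwargs = {
    "model": "facebook/opt-6.7b",
    "tensor_parallel_size": 1,
    "speculative_model": "facebook/opt-125m",
    "num_speculative_tokens": 5,
}
new_kwargs = migrate_speculative_kwargs(old_kwargs)
print(new_kwargs)
```

The resulting dictionary could then be unpacked into the constructor, for example `LLM(**new_kwargs)`, assuming a vLLM version that accepts `speculative_config`.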
@@ -45,10 +47,14 @@ To perform the same with an online mode launch the server:
 
 ```bash
 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
-    --seed 42 -tp 1 --speculative_model facebook/opt-125m \
-    --num_speculative_tokens 5 --gpu_memory_utilization 0.8
+    --seed 42 -tp 1 --gpu_memory_utilization 0.8 \
+    --speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
 ```
 
+:::{warning}
+Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately will be deprecated in the next release.
+:::
+
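Because `--speculative_config` takes a JSON string, quoting it by hand on the command line is easy to get wrong. A minimal sketch of generating the argument with the Python standard library follows. The helper name `speculative_config_arg` is an assumption for illustration and is not provided by vLLM.

```python
import json
import shlex


def speculative_config_arg(config):
    """Render a dict as a shell-safe `--speculative_config` argument."""
    # json.dumps produces the JSON payload; shlex.quote wraps it so that a
    # POSIX shell passes it through as a single argument.
    return "--speculative_config " + shlex.quote(json.dumps(config))


arg = speculative_config_arg(
    {"model": "facebook/opt-125m", "num_speculative_tokens": 5}
)
print(arg)
```

Splitting the result with `shlex.split` and parsing the second element with `json.loads` recovers the original dictionary, which is a quick way to check the quoting before launching the server.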
 Then use a client:
 
 ```python
@@ -101,9 +107,11 @@ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 llm = LLM(
     model="facebook/opt-6.7b",
     tensor_parallel_size=1,
-    speculative_model="[ngram]",
-    num_speculative_tokens=5,
-    ngram_prompt_lookup_max=4,
+    speculative_config={
+        "method": "ngram",
+        "num_speculative_tokens": 5,
+        "prompt_lookup_max": 4,
+    },
 )
 outputs = llm.generate(prompts, sampling_params)
 
@@ -131,8 +139,10 @@ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 llm = LLM(
     model="meta-llama/Meta-Llama-3.1-70B-Instruct",
     tensor_parallel_size=4,
-    speculative_model="ibm-ai-platform/llama3-70b-accelerator",
-    speculative_draft_tensor_parallel_size=1,
+    speculative_config={
+        "model": "ibm-ai-platform/llama3-70b-accelerator",
+        "draft_tensor_parallel_size": 1,
+    },
 )
 outputs = llm.generate(prompts, sampling_params)
 
@@ -175,8 +185,10 @@ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 llm = LLM(
     model="meta-llama/Meta-Llama-3-8B-Instruct",
     tensor_parallel_size=4,
-    speculative_model="yuhuili/EAGLE-LLaMA3-Instruct-8B",
-    speculative_draft_tensor_parallel_size=1,
+    speculative_config={
+        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
+        "draft_tensor_parallel_size": 1,
+    },
 )
 
 outputs = llm.generate(prompts, sampling_params)
@@ -194,11 +206,10 @@ A few important things to consider when using the EAGLE based draft models:
    be able to be loaded and used directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
    If you are using vllm version before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), please use the
    [script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
-   and specify `speculative_model="path/to/modified/eagle/model"`. If weight-loading problems still occur when using
-   the latest version of vLLM, please leave a comment or raise an issue.
+   and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue.
 
 2. The EAGLE based draft models need to be run without tensor parallelism
-   (i.e. speculative_draft_tensor_parallel_size is set to 1), although
+   (i.e. draft_tensor_parallel_size is set to 1 in `speculative_config`), although
    it is possible to run the main model using tensor parallelism (see example above).
 
 3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is