# MLP Draft Models

The following code configures vLLM to use speculative decoding with an MLP draft model (an "MLP speculator"), which conditions each draft prediction on both context vectors and previously sampled tokens. For more information, see [The Hitchhiker's Guide to Speculative Decoding](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) and [IBM Research's Technical Report](https://arxiv.org/abs/2404.19124).
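
To make the idea of conditioning on both a context vector and the last sampled token concrete, here is a toy sketch in plain Python. It is not the IBM or vLLM implementation: the class name `ToyMLPDrafter`, the random weights, the `tanh` nonlinearity, and the greedy token choice are all assumptions made for illustration, and a real drafter uses trained weights.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class ToyMLPDrafter:
    """Hypothetical drafter with one small MLP stage per speculated token."""

    def __init__(self, hidden, vocab, n_predict, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        # One embedding table, state projection, and output head per stage.
        self.emb = [rand(vocab, hidden) for _ in range(n_predict)]
        self.proj = [rand(hidden, hidden) for _ in range(n_predict)]
        self.head = [rand(vocab, hidden) for _ in range(n_predict)]

    def propose(self, context, last_token):
        """Draft n_predict tokens from a context vector and the last sampled token."""
        state, token, draft = list(context), last_token, []
        for emb, proj, head in zip(self.emb, self.proj, self.head):
            # Mix the running state with the embedding of the previous token.
            mixed = [p + e for p, e in zip(matvec(proj, state), emb[token])]
            state = [math.tanh(x) for x in mixed]
            logits = matvec(head, state)
            # Greedy choice; the drafted token feeds the next stage.
            token = max(range(len(logits)), key=logits.__getitem__)
            draft.append(token)
        return draft


drafter = ToyMLPDrafter(hidden=4, vocab=10, n_predict=3)
print(drafter.propose([0.1, -0.2, 0.3, 0.4], last_token=5))
```

In speculative decoding, the target model then verifies the drafted tokens and keeps only the ones it accepts, so the drafter only has to be cheap and often right.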

## MLP Drafter Example

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    # Target model, sharded with 4-way tensor parallelism.
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        # MLP speculator draft model, run without tensor parallelism.
        "model": "ibm-ai-platform/llama3-70b-accelerator",
        "draft_tensor_parallel_size": 1,
        "method": "mlp_speculator",
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

## Pre-Trained MLP Drafter Models

Several pre-trained MLP drafter models are available on the Hugging Face Hub:

- [llama-13b-accelerator](https://huggingface.co/ibm-ai-platform/llama-13b-accelerator)
- [llama3-8b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-8b-accelerator)
- [codellama-34b-accelerator](https://huggingface.co/ibm-ai-platform/codellama-34b-accelerator)
- [llama2-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama2-70b-accelerator)
- [llama3-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-70b-accelerator)
- [granite-3b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator)
- [granite-8b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator)
- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
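
As a convenience, the repository names above can be turned into a `speculative_config` dictionary with a small helper. This is only a sketch: `ACCELERATORS` and `build_speculative_config` are hypothetical names, not part of vLLM, and the dictionary keys simply mirror the ones used in the example on this page. Pairing each drafter with a compatible target model is still up to you.

```python
# Hub organization for each drafter, taken from the links listed above.
ACCELERATORS = {
    "llama-13b-accelerator": "ibm-ai-platform",
    "llama3-8b-accelerator": "ibm-ai-platform",
    "codellama-34b-accelerator": "ibm-ai-platform",
    "llama2-70b-accelerator": "ibm-ai-platform",
    "llama3-70b-accelerator": "ibm-ai-platform",
    "granite-3b-code-instruct-accelerator": "ibm-granite",
    "granite-8b-code-instruct-accelerator": "ibm-granite",
    "granite-7b-instruct-accelerator": "ibm-granite",
    "granite-20b-code-instruct-accelerator": "ibm-granite",
}


def build_speculative_config(accelerator: str, draft_tp: int = 1) -> dict:
    """Build a speculative_config dict for one of the listed drafters (hypothetical helper)."""
    try:
        org = ACCELERATORS[accelerator]
    except KeyError:
        raise ValueError(f"Unknown accelerator: {accelerator!r}") from None
    return {
        "model": f"{org}/{accelerator}",
        "draft_tensor_parallel_size": draft_tp,
        "method": "mlp_speculator",
    }


print(build_speculative_config("granite-8b-code-instruct-accelerator"))
```

The returned dictionary can be passed as the `speculative_config` argument of `LLM(...)`, in the same way as the literal dictionary in the example above.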