# N-Gram Speculation

The following code configures vLLM to use speculative decoding in which draft tokens are proposed by matching n-grams in the prompt. For more information, read [this thread](https://x.com/joao_gante/status/1747322413006643259).

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen3-8B",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",  # propose draft tokens by n-gram lookup
        "num_speculative_tokens": 5,  # draft tokens proposed per step
        "prompt_lookup_max": 4,  # longest n-gram to try matching
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
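
To make the idea of "matching n-grams in the prompt" concrete, here is a minimal, self-contained sketch of how a proposer of this kind could work. It is an illustration only and not vLLM's implementation: the function `propose_ngram` is hypothetical, its parameter names simply echo the config keys above (`prompt_lookup_min` is an assumption added for the sketch), and preferring the most recent earlier match is an arbitrary choice.

```python
from typing import List, Sequence


def propose_ngram(
    token_ids: Sequence[int],
    num_speculative_tokens: int = 5,
    prompt_lookup_max: int = 4,
    prompt_lookup_min: int = 1,
) -> List[int]:
    """Hypothetical helper: draft tokens by matching the trailing n-gram
    of the context against an earlier occurrence in the same context."""
    total = len(token_ids)
    low = max(prompt_lookup_min, 1)  # an empty n-gram would match everywhere
    # Try the longest n-gram first, since longer matches are more specific.
    for n in range(min(prompt_lookup_max, total - 1), low - 1, -1):
        suffix = list(token_ids[total - n:])
        # Scan earlier windows from most recent to oldest (assumed policy).
        for start in range(total - n - 1, -1, -1):
            if list(token_ids[start:start + n]) == suffix:
                follow = start + n
                # Propose the tokens that followed the earlier occurrence.
                return list(token_ids[follow:follow + num_speculative_tokens])
    return []  # no match, so nothing is proposed for this step


# The context ends with [1, 2, 3], which also appears at the start,
# so the tokens that followed it there become the draft.
context = [1, 2, 3, 4, 5, 1, 2, 3]
print(propose_ngram(context, num_speculative_tokens=2))  # [4, 5]
```

In speculative decoding the target model then verifies the drafted tokens and keeps only the ones it agrees with, so a bad guess costs some wasted work but does not change the output. This is why the approach tends to help most when the output repeats spans of the input, such as in summarization or code editing.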