# vLLM Kimi-K2.5-Thinking Eagle3 Drafter
A convenience Docker image that bundles the [Eagle3 drafter model](https://huggingface.co/nvidia/Kimi-K2.5-Thinking-Eagle3) into the vLLM container, so you can deploy speculative decoding without a separate model download step.
## What's Inside
- **Base image:** `vllm/vllm-openai:v0.19.0`
- **Drafter model:** `nvidia/Kimi-K2.5-Thinking-Eagle3` (Eagle3 speculator layers) extracted to `/opt/`
> **Note:** This drafter only works with `nvidia/Kimi-K2-Thinking-NVFP4`, the text-generation model. It is **not** compatible with the multimodal Kimi K2.5.
## Pull
```bash
docker pull atl.vultrcr.com/vllm/vllm-kimi25-eagle:v0.19.0
```
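After pulling, it can help to confirm where the drafter snapshot actually sits, because the `model` path used in the Usage section embeds a snapshot hash. The following is a minimal sketch, not part of the image: it assumes the drafter is stored in the Hugging Face cache layout (`models--<org>--<name>/snapshots/<revision>`) that the Usage path suggests, and that it runs inside the container. The function name `find_snapshots` and the root directory are illustrative assumptions.

```python
from pathlib import Path


def find_snapshots(root: str) -> list[Path]:
    """List snapshot directories under root that follow the Hugging Face cache layout."""
    # Assumed layout: <root>/models--<org>--<name>/snapshots/<revision>
    pattern = "models--*/snapshots/*"
    return sorted(p for p in Path(root).glob(pattern) if p.is_dir())


if __name__ == "__main__":
    # Assumed root, taken from the drafter path shown in the Usage section.
    for snapshot in find_snapshots("/opt/nvidia-Kimi-K2.5-Thinking-Eagle3"):
        has_config = (snapshot / "config.json").is_file()
        print(f"{snapshot} (config.json present: {has_config})")
```

Each printed directory is a candidate value for the `model` field of the speculative config. An empty result suggests the root or the layout differs from what this sketch assumes.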
## Usage
Add the speculative decoding config to your vLLM launch args. Here's a known-working Kubernetes deployment snippet:
```yaml
- "--tensor-parallel-size=8"
- "--trust-remote-code"
- "--gpu-memory-utilization=0.92"
- "--enable-auto-tool-choice"
- "--tool-call-parser=kimi_k2"
- "--reasoning-parser=kimi_k2"
- "--speculative_config"
- '{"model": "/opt/nvidia-Kimi-K2.5-Thinking-Eagle3/models--nvidia--Kimi-K2.5-Thinking-Eagle3/snapshots/13dab2a34d650a93196d37f2af91f74b8b855bab", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, "method": "eagle3"}'
```
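The `--speculative_config` value is a JSON string, and hand-editing it inside YAML quotes is an easy place to introduce a quoting mistake. One way to reduce that risk is to generate the string from a dictionary. This is a minimal sketch under stated assumptions: the helper name `build_speculative_config` is hypothetical, and the four keys and their defaults are copied from the snippet above rather than taken from the vLLM API reference.

```python
import json


def build_speculative_config(
    model_path: str,
    num_speculative_tokens: int = 3,
    draft_tensor_parallel_size: int = 1,
    method: str = "eagle3",
) -> str:
    """Return the JSON string to pass as the --speculative_config value."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    config = {
        "model": model_path,
        "draft_tensor_parallel_size": draft_tensor_parallel_size,
        "num_speculative_tokens": num_speculative_tokens,
        "method": method,
    }
    # json.dumps handles quoting and escaping, so the result is valid JSON.
    return json.dumps(config)


# Drafter snapshot path copied from the deployment snippet above.
DRAFTER_PATH = (
    "/opt/nvidia-Kimi-K2.5-Thinking-Eagle3/"
    "models--nvidia--Kimi-K2.5-Thinking-Eagle3/"
    "snapshots/13dab2a34d650a93196d37f2af91f74b8b855bab"
)

if __name__ == "__main__":
    print(build_speculative_config(DRAFTER_PATH))
```

The printed string can be pasted into the deployment args inside single quotes, as in the snippet above.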
### Speculative Config Breakdown
| Parameter | Value | Notes |
|---|---|---|
| `model` | `/opt/nvidia-Kimi-K2.5-Thinking-Eagle3/...` | Path to the drafter inside the container |
| `draft_tensor_parallel_size` | `1` | TP size for the drafter |
| `num_speculative_tokens` | `3` | Number of tokens to speculate per step |
| `method` | `eagle3` | Speculative decoding method |
## Building
The Jenkins pipeline builds and pushes this image. Trigger a build with a specific tag:
```bash
curl -X POST "https://jenkins.sweetapi.com/job/vllm-kimi25-eagle/buildWithParameters" \
  -u "$JENKINS_USER:$JENKINS_PASS" \
  -d "TAG=v0.19.0"
```
To build locally:
```bash
docker build -t atl.vultrcr.com/vllm/vllm-kimi25-eagle:v0.19.0 .
```