vLLM Kimi-K2.5-Thinking Eagle3 Drafter

A convenience Docker image that bundles the Eagle3 drafter model into the vLLM container, so you can deploy speculative decoding without a separate model download step.

What's Inside

  • Base image: vllm/vllm-openai:v0.19.0
  • Drafter model: nvidia/Kimi-K2.5-Thinking-Eagle3 (Eagle3 speculator layers) extracted to /opt/

Note: This drafter only works with nvidia/Kimi-K2-Thinking-NVFP4, the text-generation model. It is not compatible with the multimodal Kimi 2.5.

Pull

docker pull atl.vultrcr.com/vllm/vllm-kimi25-eagle:v0.19.0

Usage

Add the speculative decoding config to your vLLM launch args. Here's a known-working Kubernetes deployment snippet:

- "--tensor-parallel-size=8"
- "--trust-remote-code"
- "--gpu-memory-utilization=0.92"
- "--enable-auto-tool-choice"
- "--tool-call-parser=kimi_k2"
- "--reasoning-parser=kimi_k2"
- "--speculative_config"
- '{"model": "/opt/nvidia-Kimi-K2.5-Thinking-Eagle3/models--nvidia--Kimi-K2.5-Thinking-Eagle3/snapshots/13dab2a34d650a93196d37f2af91f74b8b855bab", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, "method": "eagle3"}'

Speculative Config Breakdown

  • model: /opt/nvidia-Kimi-K2.5-Thinking-Eagle3/... (path to the drafter snapshot inside the container)
  • draft_tensor_parallel_size: 1 (tensor-parallel size for the drafter)
  • num_speculative_tokens: 3 (number of tokens to speculate per step)
  • method: eagle3 (speculative decoding method)
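Because the --speculative_config value is a raw JSON string wrapped in YAML quoting, a typo only surfaces at server startup. A small sketch for sanity-checking it beforehand (the field names and values come from the snippet above; the round-trip check itself is my addition):

```python
import json

# The speculative decoding config from the deployment snippet above.
spec_config = {
    "model": ("/opt/nvidia-Kimi-K2.5-Thinking-Eagle3/"
              "models--nvidia--Kimi-K2.5-Thinking-Eagle3/"
              "snapshots/13dab2a34d650a93196d37f2af91f74b8b855bab"),
    "draft_tensor_parallel_size": 1,
    "num_speculative_tokens": 3,
    "method": "eagle3",
}

# Serialize to the single-line JSON string used as the CLI argument value.
arg_value = json.dumps(spec_config)

# Round-trip to confirm the string is valid JSON before deploying.
parsed = json.loads(arg_value)
assert parsed["method"] == "eagle3"
assert parsed["num_speculative_tokens"] == 3
print(arg_value)
```

Paste the printed string as the argument following "--speculative_config".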

Building

The Jenkins pipeline builds and pushes this image. Trigger a build with a specific tag:

curl -X POST "https://jenkins.sweetapi.com/job/vllm-kimi25-eagle/buildWithParameters" \
  -u "$JENKINS_USER:$JENKINS_PASS" \
  -d "TAG=v0.19.0"

To build locally:

docker build -t atl.vultrcr.com/vllm/vllm-kimi25-eagle:v0.19.0 .
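After a local build, it can be worth confirming the drafter snapshot actually landed in the image before pushing; a sketch, assuming Docker is available and using the drafter path from the config above:

```shell
# List the drafter directory baked into the image;
# --entrypoint ls bypasses the vLLM server startup.
docker run --rm --entrypoint ls \
  atl.vultrcr.com/vllm/vllm-kimi25-eagle:v0.19.0 \
  -R /opt/nvidia-Kimi-K2.5-Thinking-Eagle3
```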