Files
vllm-deepseek-v32-mtp/README.md

5.8 KiB
Raw Permalink Blame History

vLLM DeepSeek-V3.2 MTP Tool Parser

A robust tool call parser for DeepSeek-V3.2 DSML format, designed to handle multi-token deltas from MTP (Multi-Token Prediction) and EAGLE speculative decoding.

Overview

This project provides a drop-in replacement for the standard vLLM tool parser that is resilient to multi-token streaming. Instead of maintaining incremental state, it re-parses the entire current text on every call, finds all tool call regions, builds JSON arguments, and emits only the newly-added characters. This makes it robust against variable token arrival rates.

Features

  • Re-parse-and-diff approach: Re-parses the entire text on every streaming call for correctness
  • Multi-token delta support: Handles any number of tokens arriving per step
  • Complete and partial tool call handling: Streams both complete and in-progress tool calls
  • JSON argument construction: Builds proper JSON arguments from parameter tags
  • Schema-aware type conversion: Converts parameter values according to tool schema
  • Content extraction: Properly extracts non-tool-call text without swallowing or duplicating content

Installation

Prerequisites

  • Docker
  • Access to a vLLM-compatible environment
  • Python 3.12+

Building the Docker Image

# Build the image
docker build -t vllm-deepseek-v32-mtp:v0.19.0 .

# Or use the provided Jenkins pipeline (see below)

Usage

As a Drop-in Replacement

The parser implements the same interface as the standard vLLM tool parser:

from vllm.tool_parsers.deepseekv32_tool_parser import DeepSeekV32ToolParser

parser = DeepSeekV32ToolParser(tokenizer, tools)

In Streaming Mode

The parser automatically handles streaming by:

  1. Re-scanning current text for content outside tool-call regions
  2. Finding all <DSMLinvoke> regions (complete + partial)
  3. Building JSON args for each and diffing against previous state
  4. Emitting only new content

Tool Call Format

The parser expects the DeepSeek-V3.2 DSML format:

<DSMLfunction_calls>
<DSMLinvoke name="get_weather">
<DSMLparameter name="location" string="true">杭州</DSMLparameter>
<DSMLparameter name="date" string="true">2024-01-16</DSMLparameter>
</DSMLinvoke>
</DSMLfunction_calls>

Jenkins Pipeline

The project includes a Jenkinsfile for CI/CD. The pipeline:

  1. Checks out the repository
  2. Builds the Docker image
  3. Pushes to the specified registry

Pipeline Parameters

  • IMAGE_TAG: Docker image tag (default: v0.19.0)
  • GIT_REPO: Git repository URL (optional, uses workspace if empty)
  • GIT_BRANCH: Git branch to build (default: master)

Environment Variables

  • REGISTRY: atl.vultrcr.com/vllm
  • IMAGE_NAME: vllm-deepseek-v32-mtp

Credentials

The pipeline requires Docker registry credentials stored in Jenkins as ATL_VCR_VLLM.

Configuration

Jenkins Setup

  1. Create a new pipeline job named vllm-deepseek-v32-mtp
  2. Configure it to pull from: https://sweetapi.com/biondizzle/vllm-deepseek-v32-mtp.git
  3. Set up the ATL_VCR_VLLM credentials in Jenkins
  4. Run the pipeline

Manual Build

# Set your registry credentials
export DOCKER_REGISTRY_USER=your_user
export DOCKER_REGISTRY_PASS=your_pass

# Build and push
docker build -t atl.vultrcr.com/vllm/vllm-deepseek-v32-mtp:v0.19.0 .
docker push atl.vultrcr.com/vllm/vllm-deepseek-v32-mtp:v0.19.0

Development

Testing

The parser includes comprehensive unit tests for:

  • Content extraction with partial tag overlaps
  • Invoke region detection (complete and incomplete)
  • JSON argument construction
  • Type conversion according to schema
  • Streaming delta computation

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Implement your changes
  4. Add tests
  5. Submit a pull request

License

Apache 2.0 - See LICENSE for details.

Architecture

Key Components

  • _extract_content(): Extracts non-tool-call text while handling partial tag overlaps
  • _extract_invoke_regions(): Finds both complete and incomplete invoke blocks
  • _build_args_json_so_far(): Constructs JSON arguments incrementally
  • _compute_args_diff(): Computes and emits only newly-added characters
  • extract_tool_calls_streaming(): Main entry point that orchestrates the re-parse-and-diff process

State Management

The parser maintains minimal state between calls:

  • _sent_content_idx: Position tracker for content extraction
  • _tool_call_ids: Generated IDs for each tool call
  • streamed_args_for_tool: Previously sent arguments for diffing
  • prev_tool_call_arr: Previous tool call state

Troubleshooting

Common Issues

Tool calls not detected:

  • Ensure the DSML tags are correctly formatted
  • Verify skip_special_tokens=False in the request
  • Check that the tool call format matches the expected pattern

Streaming hangs:

  • Verify the closing tags are present in the model output
  • Check for partial tag overlaps that might be causing the parser to wait

Type conversion errors:

  • Ensure your tool schema defines the correct parameter types
  • Verify that string parameters are marked with string="true"

Support

For issues and questions, please use the project's issue tracker.

  • vLLM: The main vLLM project
  • DeepSeek: DeepSeek AI models
  • MTP: Multi-Token Prediction implementation

Changelog

v0.19.0

  • Initial release with re-parse-and-diff architecture
  • Full support for DeepSeek-V3.2 DSML format
  • Jenkins pipeline integration
  • Docker build and deployment support

Roadmap

  • Performance optimizations for very long tool calls
  • Additional validation and error handling
  • Support for more parameter types
  • Integration with additional vLLM features