vllm-deepseek-v32-mtp/README.md

# vLLM DeepSeek-V3.2 MTP Tool Parser

A robust tool call parser for DeepSeek-V3.2 DSML format, designed to handle multi-token deltas from MTP (Multi-Token Prediction) and EAGLE speculative decoding.

## Overview

This project provides a drop-in replacement for the standard vLLM tool parser that is resilient to multi-token streaming. Instead of maintaining incremental state, it re-parses the entire current text on every call, finds all tool call regions, builds JSON arguments, and emits only the newly-added characters. This makes it robust against variable token arrival rates.

## Features

- **Re-parse-and-diff approach**: Re-parses the entire text on every streaming call for correctness
- **Multi-token delta support**: Handles any number of tokens arriving per step
- **Complete and partial tool call handling**: Streams both complete and in-progress tool calls
- **JSON argument construction**: Builds proper JSON arguments from parameter tags
- **Schema-aware type conversion**: Converts parameter values according to tool schema
- **Content extraction**: Properly extracts non-tool-call text without swallowing or duplicating content

## Installation

### Prerequisites
- Docker
- Access to a vLLM-compatible environment
- Python 3.12+

### Building the Docker Image

```bash
# Build the image
docker build -t vllm-deepseek-v32-mtp:v0.19.0 .

# Or use the provided Jenkins pipeline (see below)
```

## Usage

### As a Drop-in Replacement

The parser implements the same interface as the standard vLLM tool parser:

```python
from vllm.tool_parsers.deepseekv32_tool_parser import DeepSeekV32ToolParser

parser = DeepSeekV32ToolParser(tokenizer, tools)
```

### In Streaming Mode

The parser automatically handles streaming by:
1. Re-scanning current text for content outside tool-call regions
2. Finding all `<｜DSML｜invoke>` regions (complete + partial)
3. Building JSON args for each and diffing against previous state
4. Emitting only new content

## Tool Call Format

The parser expects the DeepSeek-V3.2 DSML format:

```
<｜DSML｜function_calls>
<｜DSML｜invoke name="get_weather">
<｜DSML｜parameter name="location" string="true">杭州</｜DSML｜parameter>
<｜DSML｜parameter name="date" string="true">2024-01-16</｜DSML｜parameter>
</｜DSML｜invoke>
</｜DSML｜function_calls>
```

## Jenkins Pipeline

The project includes a Jenkinsfile for CI/CD. The pipeline:

1. Checks out the repository
2. Builds the Docker image
3. Pushes to the specified registry

### Pipeline Parameters

- `IMAGE_TAG`: Docker image tag (default: `v0.19.0`)
- `GIT_REPO`: Git repository URL (optional, uses workspace if empty)
- `GIT_BRANCH`: Git branch to build (default: `master`)

### Environment Variables

- `REGISTRY`: `atl.vultrcr.com/vllm`
- `IMAGE_NAME`: `vllm-deepseek-v32-mtp`

### Credentials

The pipeline requires Docker registry credentials stored in Jenkins as `ATL_VCR_VLLM`.

## Configuration

### Jenkins Setup

1. Create a new pipeline job named `vllm-deepseek-v32-mtp`
2. Configure it to pull from: `https://sweetapi.com/biondizzle/vllm-deepseek-v32-mtp.git`
3. Set up the `ATL_VCR_VLLM` credentials in Jenkins
4. Run the pipeline

### Manual Build

```bash
# Set your registry credentials
export DOCKER_REGISTRY_USER=your_user
export DOCKER_REGISTRY_PASS=your_pass

# Build and push
docker build -t atl.vultrcr.com/vllm/vllm-deepseek-v32-mtp:v0.19.0 .
docker push atl.vultrcr.com/vllm/vllm-deepseek-v32-mtp:v0.19.0
```

## Development

### Testing

The parser includes comprehensive unit tests for:
- Content extraction with partial tag overlaps
- Invoke region detection (complete and incomplete)
- JSON argument construction
- Type conversion according to schema
- Streaming delta computation

### Contributing

1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Add tests
5. Submit a pull request

## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

## Architecture

### Key Components

- **`_extract_content()`**: Extracts non-tool-call text while handling partial tag overlaps
- **`_extract_invoke_regions()`**: Finds both complete and incomplete invoke blocks
- **`_build_args_json_so_far()`**: Constructs JSON arguments incrementally
- **`_compute_args_diff()`**: Computes and emits only newly-added characters
- **`extract_tool_calls_streaming()`**: Main entry point that orchestrates the re-parse-and-diff process

### State Management

The parser maintains minimal state between calls:
- `_sent_content_idx`: Position tracker for content extraction
- `_tool_call_ids`: Generated IDs for each tool call
- `streamed_args_for_tool`: Previously sent arguments for diffing
- `prev_tool_call_arr`: Previous tool call state

## Troubleshooting

### Common Issues

**Tool calls not detected**:
- Ensure the DSML tags are correctly formatted
- Verify `skip_special_tokens=False` in the request
- Check that the tool call format matches the expected pattern

**Streaming hangs**:
- Verify the closing tags are present in the model output
- Check for partial tag overlaps that might be causing the parser to wait

**Type conversion errors**:
- Ensure your tool schema defines the correct parameter types
- Verify that string parameters are marked with `string="true"`

## Support

For issues and questions, please use the project's issue tracker.

## Related Projects

- [vLLM](https://github.com/vllm-project/vllm): The main vLLM project
- [DeepSeek](https://github.com/deepseek-ai): DeepSeek AI models
- [MTP](https://github.com/vllm-project/vllm): Multi-Token Prediction implementation

## Changelog

### v0.19.0
- Initial release with re-parse-and-diff architecture
- Full support for DeepSeek-V3.2 DSML format
- Jenkins pipeline integration
- Docker build and deployment support

## Roadmap

- Performance optimizations for very long tool calls
- Additional validation and error handling
- Support for more parameter types
- Integration with additional vLLM features