
biondizzle ca7c309463 Add reference/ dir: vLLM tokenizers, reasoning parsers, tool parsers, official inference

- reference/vllm/tokenizers/: official DSV4 tokenizer + encoding (read-only)
- reference/vllm/reasoning/: thinking-mode parsers (DeepSeekR1 style)
- reference/vllm/tool_parsers/: DSML tool call parsers (V3.2 base, V4 variant)
- reference/official_inference/: generate.py, model.py, kernel.py from the original weights release
- reference/README.md documents the layout and which files matter for our pipeline
- These are read-only references for cross-checking, not imported by production code

2026-06-03 10:25:23 +00:00


Reference Implementations

This directory contains read-only reference implementations from official sources. Do not modify these files — they exist to cross-check our production pipeline.

Directory Layout

reference/
├── vllm/                        # vLLM project reference (Apache-2.0)
│   ├── tokenizers/
│   │   ├── deepseek_v4.py           # Tokenizer wrapper — apply_chat_template for DSV4
│   │   └── deepseek_v4_encoding.py  # Official prompt encoder (canonical source)
│   ├── reasoning/
│   │   ├── deepseek_v3_reasoning_parser.py   # Thinking-mode dispatcher
│   │   └── deepseek_r1_reasoning_parser.py   # <think>/</think> reasoning token parser
│   └── tool_parsers/
│       ├── deepseekv4_tool_parser.py         # DSML tool call parser (V4)
│       └── deepseekv32_tool_parser.py        # DSML tool call parser (V3.2 base)
│
└── official_inference/          # Reference inference code from the original weights release
    ├── generate.py                  # Official generate loop + encode_messages usage
    ├── model.py                     # BF16/FP8 model implementation
    ├── kernel.py                    # Reference CUDA kernels
    ├── convert.py                   # Weight conversion
    └── config.json                  # Model config (small variant)
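
A small sketch for confirming that a checkout still matches the tree above. The file list is copied from this README; the names `EXPECTED_FILES` and `missing_reference_files` are hypothetical and not part of the repo.

```python
from pathlib import Path

# Relative paths copied from the directory layout in this README.
EXPECTED_FILES = (
    "vllm/tokenizers/deepseek_v4.py",
    "vllm/tokenizers/deepseek_v4_encoding.py",
    "vllm/reasoning/deepseek_v3_reasoning_parser.py",
    "vllm/reasoning/deepseek_r1_reasoning_parser.py",
    "vllm/tool_parsers/deepseekv4_tool_parser.py",
    "vllm/tool_parsers/deepseekv32_tool_parser.py",
    "official_inference/generate.py",
    "official_inference/model.py",
    "official_inference/kernel.py",
    "official_inference/convert.py",
    "official_inference/config.json",
)


def missing_reference_files(reference_root: Path) -> list[str]:
    """Return the expected relative paths that are absent under reference_root."""
    return [rel for rel in EXPECTED_FILES if not (reference_root / rel).is_file()]


# Hypothetical usage, run from the repo root:
#   missing = missing_reference_files(Path("reference"))
#   an empty list means every file in the tree is present
```

The check only looks at presence, not content, so it complements rather than replaces a content diff against upstream.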

Key Files for Our Pipeline

- vllm/tokenizers/deepseek_v4_encoding.py: Canonical prompt encoder. Already copied to encoding/deepseek_v4_encoding.py in the repo root (our live import). If vLLM updates this file, diff the two copies and sync ours.
- vllm/tokenizers/deepseek_v4.py: Shows how vLLM wraps the tokenizer to add apply_chat_template support. Key insight: it calls encode_messages(messages, thinking_mode=..., ...) and then tokenizer.encode(prompt_str, add_special_tokens=False), which is exactly what our single_shot does.
- official_inference/generate.py: The inference entry point from the original weights release. Uses tokenizer.encode(encode_messages(messages, thinking_mode="chat")), leaving add_special_tokens at its default of True (unlike the vLLM wrapper), and parse_message_from_completion_text() for output parsing.
- vllm/reasoning/: How vLLM detects thinking-mode boundaries (<think> start, </think> end). Useful when we integrate streaming.
- vllm/tool_parsers/: DSML tool call parsing for future tool-use support.
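
The "diff and sync" step for the canonical encoder could be automated along these lines. This is a minimal sketch using Python's standard `difflib`; the helper name `reference_drift` is hypothetical, and the paths in the usage comment are the ones named in this README, assumed to be relative to the repo root.

```python
import difflib
from pathlib import Path


def reference_drift(reference: Path, live: Path) -> list[str]:
    """Return unified-diff lines between the reference file and the live copy.

    An empty list means the two files are textually identical.
    """
    ref_lines = reference.read_text(encoding="utf-8").splitlines(keepends=True)
    live_lines = live.read_text(encoding="utf-8").splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            ref_lines,
            live_lines,
            fromfile=str(reference),
            tofile=str(live),
        )
    )


# Hypothetical usage, run from the repo root:
#   diff = reference_drift(
#       Path("reference/vllm/tokenizers/deepseek_v4_encoding.py"),
#       Path("encoding/deepseek_v4_encoding.py"),
#   )
#   print("".join(diff) or "live encoder matches the reference copy")
```

Returning the diff lines instead of printing them keeps the helper usable from a CI check, where a non-empty result could fail the job.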
