- reference/vllm/tokenizers/: official DSV4 tokenizer + encoding (read-only)
- reference/vllm/reasoning/: thinking mode parsers (DeepSeekR1 style)
- reference/vllm/tool_parsers/: DSML tool call parsers (V3.2 base, V4 variant)
- reference/official_inference/: original weights' generate.py, model.py, kernel.py
- reference/README.md documents the layout and which files matter for our pipeline
- These are read-only references for cross-checking, not imported by production code
# Reference Implementations

This directory contains **read-only** reference implementations from official sources.
Do not modify these files; they exist to cross-check our production pipeline.

## Directory Layout

```
reference/
├── vllm/                                      # vLLM project reference (Apache-2.0)
│   ├── tokenizers/
│   │   ├── deepseek_v4.py                     # Tokenizer wrapper: apply_chat_template for DSV4
│   │   └── deepseek_v4_encoding.py            # Official prompt encoder (canonical source)
│   ├── reasoning/
│   │   ├── deepseek_v3_reasoning_parser.py    # Thinking-mode dispatcher
│   │   └── deepseek_r1_reasoning_parser.py    # <think>/</think> reasoning token parser
│   └── tool_parsers/
│       ├── deepseekv4_tool_parser.py          # DSML tool call parser (V4)
│       └── deepseekv32_tool_parser.py         # DSML tool call parser (V3.2 base)
│
└── official_inference/                        # Original weights' reference inference code
    ├── generate.py                            # Official generate loop + encode_messages usage
    ├── model.py                               # BF16/FP8 model implementation
    ├── kernel.py                              # Reference CUDA kernels
    ├── convert.py                             # Weight conversion
    └── config.json                            # Model config (small variant)
```
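Since these files are only useful for cross-checking if they are all present, a quick presence check can help after a fresh checkout. The following is a minimal sketch, not part of the repository: `EXPECTED_FILES` and `missing_reference_files` are hypothetical names, and the paths are simply copied from the layout above.

```python
from pathlib import Path

# Paths relative to the reference/ directory, copied from the layout above.
EXPECTED_FILES = [
    "vllm/tokenizers/deepseek_v4.py",
    "vllm/tokenizers/deepseek_v4_encoding.py",
    "vllm/reasoning/deepseek_v3_reasoning_parser.py",
    "vllm/reasoning/deepseek_r1_reasoning_parser.py",
    "vllm/tool_parsers/deepseekv4_tool_parser.py",
    "vllm/tool_parsers/deepseekv32_tool_parser.py",
    "official_inference/generate.py",
    "official_inference/model.py",
    "official_inference/kernel.py",
    "official_inference/convert.py",
    "official_inference/config.json",
]


def missing_reference_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_reference_files("reference")` from the repo root returns an empty list when the layout is complete, and the missing relative paths otherwise.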

## Key Files for Our Pipeline

1. **`vllm/tokenizers/deepseek_v4_encoding.py`**: Canonical prompt encoder.
   Already copied to `encoding/deepseek_v4_encoding.py` in the repo root (our live import).
   If vLLM updates this file, diff and sync.

2. **`vllm/tokenizers/deepseek_v4.py`**: Shows how vLLM wraps the tokenizer
   to add `apply_chat_template` support. Key insight: it calls
   `encode_messages(messages, thinking_mode=..., ...)` then
   `tokenizer.encode(prompt_str, add_special_tokens=False)`.
   This is exactly what our `single_shot` does.

3. **`official_inference/generate.py`**: The inference entry point shipped with the original weights.
   Uses `tokenizer.encode(encode_messages(messages, thinking_mode="chat"))`
   (default `add_special_tokens=True`, unlike the vLLM wrapper in item 2) and
   `parse_message_from_completion_text()` for output parsing.

4. **`vllm/reasoning/`**: How vLLM detects thinking-mode boundaries
   (`<think>` start, `</think>` end). Useful when we integrate streaming.

5. **`vllm/tool_parsers/`**: DSML tool call parsing for future tool-use support.
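The "diff and sync" step in item 1 could be automated with a small drift check between the reference encoder and the live copy. This is a minimal sketch using only the standard library; `encoder_drift` is a hypothetical helper, not something that exists in the repo, and it compares file text only.

```python
import difflib
from pathlib import Path


def encoder_drift(reference_path, live_path):
    """Return unified-diff lines between the reference file and the live copy.

    An empty list means both files contain identical text.
    """
    ref_lines = Path(reference_path).read_text(encoding="utf-8").splitlines(keepends=True)
    live_lines = Path(live_path).read_text(encoding="utf-8").splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            ref_lines,
            live_lines,
            fromfile=str(reference_path),
            tofile=str(live_path),
        )
    )


# Example call from the repo root (paths taken from item 1):
# drift = encoder_drift(
#     "reference/vllm/tokenizers/deepseek_v4_encoding.py",
#     "encoding/deepseek_v4_encoding.py",
# )
# print("".join(drift) or "in sync")
```

Returning the diff lines rather than a boolean keeps the helper usable both as a CI gate (fail when the list is non-empty) and as a review aid when syncing.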
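To make the thinking-mode boundaries from item 4 concrete, here is a minimal non-streaming sketch of splitting a completion into reasoning and answer. It is not vLLM's parser: the function name is hypothetical, and it assumes the DeepSeek-R1-style delimiters `<think>` and `</think>`. How an unterminated reasoning block is handled is a choice made for this sketch only.

```python
THINK_START = "<think>"   # assumed start delimiter (DeepSeek-R1 style)
THINK_END = "</think>"    # assumed end delimiter


def split_reasoning(completion):
    """Split completion text into a (reasoning, answer) tuple."""
    text = completion
    # The start tag may already be part of the prompt, so it is optional here.
    had_start = text.startswith(THINK_START)
    if had_start:
        text = text[len(THINK_START):]

    reasoning, sep, answer = text.partition(THINK_END)
    if sep:
        return reasoning.strip(), answer.strip()
    if had_start:
        # Reasoning was opened but never closed (for example, truncated output).
        return text.strip(), ""
    # No delimiters at all: treat the whole text as the answer.
    return "", text.strip()
```

A streaming version would additionally need to buffer partial delimiters that arrive split across chunks, which this sketch does not attempt.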