# Reference Implementations

This directory contains **read-only** reference implementations from official sources.
Do not modify these files; they exist to cross-check our production pipeline.

## Directory Layout

```
reference/
├── vllm/                                    # vLLM project reference (Apache-2.0)
│   ├── tokenizers/
│   │   ├── deepseek_v4.py                   # Tokenizer wrapper: apply_chat_template for DSV4
│   │   └── deepseek_v4_encoding.py          # Official prompt encoder (canonical source)
│   ├── reasoning/
│   │   ├── deepseek_v3_reasoning_parser.py  # Thinking-mode dispatcher
│   │   └── deepseek_r1_reasoning_parser.py  # <think>/</think> reasoning token parser
│   └── tool_parsers/
│       ├── deepseekv4_tool_parser.py        # DSML tool call parser (V4)
│       └── deepseekv32_tool_parser.py       # DSML tool call parser (V3.2 base)
│
└── official_inference/                      # Original weights' reference inference code
    ├── generate.py                          # Official generate loop + encode_messages usage
    ├── model.py                             # BF16/FP8 model implementation
    ├── kernel.py                            # Reference CUDA kernels
    ├── convert.py                           # Weight conversion
    └── config.json                          # Model config (small variant)
```

## Key Files for Our Pipeline

1. **`vllm/tokenizers/deepseek_v4_encoding.py`**: Canonical prompt encoder.
   Already copied to `encoding/deepseek_v4_encoding.py` in the repo root (our live import).
   If vLLM updates this file, diff and sync.

2. **`vllm/tokenizers/deepseek_v4.py`**: Shows how vLLM wraps the tokenizer
   to add `apply_chat_template` support. Key insight: it calls
   `encode_messages(messages, thinking_mode=..., ...)` then
   `tokenizer.encode(prompt_str, add_special_tokens=False)`.
   This is exactly what our `single_shot` does.

3. **`official_inference/generate.py`**: The original weights' inference entry point.
   Uses `tokenizer.encode(encode_messages(messages, thinking_mode="chat"))`
   (leaving `add_special_tokens` at its default of `True`) and
   `parse_message_from_completion_text()` for output parsing.

4. **`vllm/reasoning/`**: How vLLM detects thinking-mode boundaries
   (`<think>` start, `</think>` end). Useful when we integrate streaming.

5. **`vllm/tool_parsers/`**: DSML tool call parsing for future tool-use support.
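
The "diff and sync" step in item 1 can be sketched as a small drift check. This is a minimal sketch rather than anything that ships in the repo: the helper name `reference_drift` is hypothetical, and the two default paths are the ones named in this README, assuming the script runs from the repo root.

```python
import difflib
from pathlib import Path

# Paths as described in this README; adjust if the repo layout differs.
REFERENCE = Path("reference/vllm/tokenizers/deepseek_v4_encoding.py")
LIVE = Path("encoding/deepseek_v4_encoding.py")


def reference_drift(reference: Path = REFERENCE, live: Path = LIVE) -> list[str]:
    """Return unified-diff lines between the reference and live copies.

    An empty list means the two files contain identical text.
    """
    ref_lines = reference.read_text(encoding="utf-8").splitlines(keepends=True)
    live_lines = live.read_text(encoding="utf-8").splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            ref_lines,
            live_lines,
            fromfile=str(reference),
            tofile=str(live),
        )
    )
```

Calling `reference_drift()` after refreshing the vLLM reference copy and printing `"".join(...)` of the result shows what changed upstream before the live import is overwritten.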
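
Items 2 and 3 differ in one detail: the vLLM wrapper passes `add_special_tokens=False`, while `generate.py` leaves the default. Whether that changes the resulting token ids depends on the tokenizer's configuration, which this README does not state. The sketch below shows one way to check, assuming only that the tokenizer object exposes `encode(text, add_special_tokens=...)`. The names `special_token_delta` and `ToyTokenizer` are hypothetical, and the toy tokenizer is a stand-in so the example runs without model files.

```python
from typing import Protocol


class Encoder(Protocol):
    """Anything with a Hugging Face style `encode` method."""

    def encode(self, text: str, add_special_tokens: bool = True) -> list[int]: ...


def special_token_delta(tokenizer: Encoder, prompt: str) -> tuple[list[int], list[int]]:
    """Encode `prompt` both ways and return (without_specials, with_specials)."""
    without = list(tokenizer.encode(prompt, add_special_tokens=False))
    with_ = list(tokenizer.encode(prompt, add_special_tokens=True))
    return without, with_


class ToyTokenizer:
    """Stand-in tokenizer: one id per character, BOS id 0 when specials are on.

    This is NOT the DeepSeek tokenizer; it only makes the sketch runnable.
    """

    BOS_ID = 0

    def encode(self, text: str, add_special_tokens: bool = True) -> list[int]:
        ids = [ord(ch) for ch in text]
        return [self.BOS_ID, *ids] if add_special_tokens else ids


without, with_ = special_token_delta(ToyTokenizer(), "hi")
print(without)  # [104, 105]
print(with_)    # [0, 104, 105]
```

With the real tokenizer, pass the string returned by `encode_messages(...)` as `prompt`. If the two lists differ, the two reference paths would not feed identical input ids to the model for the same messages, which is worth knowing before cross-checking outputs.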