biondizzle ca7c309463 Add reference/ dir: vLLM tokenizers, reasoning parsers, tool parsers, official inference
- reference/vllm/tokenizers/ — official DSV4 tokenizer + encoding (read-only)
- reference/vllm/reasoning/ — thinking mode parsers (DeepSeekR1 style)
- reference/vllm/tool_parsers/ — DSML tool call parsers (V3.2 base, V4 variant)
- reference/official_inference/ — original weight's generate.py, model.py, kernel.py
- reference/README.md documents the layout and which files matter for our pipeline
- These are read-only references for cross-checking, not imported by production code
2026-06-03 10:25:23 +00:00

# Reference Implementations
This directory contains **read-only** reference implementations from official sources.
Do not modify these files — they exist to cross-check our production pipeline.
## Directory Layout
```
reference/
├── vllm/                                       # vLLM project reference (Apache-2.0)
│   ├── tokenizers/
│   │   ├── deepseek_v4.py                      # Tokenizer wrapper: apply_chat_template for DSV4
│   │   └── deepseek_v4_encoding.py             # Official prompt encoder (canonical source)
│   ├── reasoning/
│   │   ├── deepseek_v3_reasoning_parser.py     # Thinking-mode dispatcher
│   │   └── deepseek_r1_reasoning_parser.py     # <think>/</think> reasoning token parser
│   └── tool_parsers/
│       ├── deepseekv4_tool_parser.py           # DSML tool call parser (V4)
│       └── deepseekv32_tool_parser.py          # DSML tool call parser (V3.2 base)
└── official_inference/                         # Reference inference code shipped with the original weights
    ├── generate.py                             # Official generate loop + encode_messages usage
    ├── model.py                                # BF16/FP8 model implementation
    ├── kernel.py                               # Reference CUDA kernels
    ├── convert.py                              # Weight conversion
    └── config.json                             # Model config (small variant)
```
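
Since these files exist for cross-checking, a small drift check between the reference encoder and the live copy at `encoding/deepseek_v4_encoding.py` may be handy. The following is a minimal sketch, not part of the pipeline: `encoder_drift` is a hypothetical helper name, and it only assumes the two paths named in this README.

```python
import difflib
from pathlib import Path


def encoder_drift(reference: Path, live: Path) -> list[str]:
    """Return unified-diff lines between two files (empty list if identical)."""
    ref_lines = reference.read_text(encoding="utf-8").splitlines(keepends=True)
    live_lines = live.read_text(encoding="utf-8").splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            ref_lines,
            live_lines,
            fromfile=str(reference),
            tofile=str(live),
        )
    )


# Example (run from the repo root; paths taken from this README):
#   diff = encoder_drift(
#       Path("reference/vllm/tokenizers/deepseek_v4_encoding.py"),
#       Path("encoding/deepseek_v4_encoding.py"),
#   )
#   print("".join(diff) or "encoder copies are identical")
```

An empty result means the live import still matches the reference; any output is the diff to review before syncing.
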
## Key Files for Our Pipeline
1. **`vllm/tokenizers/deepseek_v4_encoding.py`** — Canonical prompt encoder.
Already copied to `encoding/deepseek_v4_encoding.py` in the repo root (our live import).
If vLLM updates this file, diff and sync.
2. **`vllm/tokenizers/deepseek_v4.py`** — Shows how vLLM wraps the tokenizer
to add `apply_chat_template` support. Key insight: it calls
`encode_messages(messages, thinking_mode=..., ...)` then
`tokenizer.encode(prompt_str, add_special_tokens=False)`.
This is exactly what our `single_shot` does.
3. **`official_inference/generate.py`** — The original weight's inference entry point.
Uses `tokenizer.encode(encode_messages(messages, thinking_mode="chat"))`
with the default `add_special_tokens=True` (unlike the vLLM wrapper above, which passes `False`),
and `parse_message_from_completion_text()` for output parsing.
4. **`vllm/reasoning/`** — How vLLM detects thinking mode boundaries
(`<think>` start, `</think>` end). Useful when we integrate streaming.
5. **`vllm/tool_parsers/`** — DSML tool call parsing for future tool-use support.
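
To make the boundary detection in item 4 concrete, here is a rough, non-streaming sketch of splitting a finished completion into reasoning and final content. It is not a copy of the vLLM parsers: `split_reasoning` and `ParsedCompletion` are hypothetical names, and the marker strings are assumed to be the DeepSeek-R1 style `<think>` and `</think>`, which should be confirmed against the files in `vllm/reasoning/`.

```python
from typing import NamedTuple, Optional

# Assumed marker strings (DeepSeek-R1 style); verify against the reference parsers.
THINK_START = "<think>"
THINK_END = "</think>"


class ParsedCompletion(NamedTuple):
    reasoning: Optional[str]  # None when no end marker was found
    content: str


def split_reasoning(text: str) -> ParsedCompletion:
    """Split a complete (non-streamed) completion at the end-of-thinking marker."""
    head, sep, tail = text.partition(THINK_END)
    if not sep:
        # No end marker: this sketch treats the whole text as content.
        # The reference parsers may handle this case differently.
        return ParsedCompletion(None, text)
    # The start marker may live in the prompt rather than the completion,
    # so it is stripped only if present.
    return ParsedCompletion(head.removeprefix(THINK_START), tail)
```

A streaming integration would need to buffer partial markers across chunks, which this sketch deliberately leaves out.
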