- reference/vllm/tokenizers/: official DSV4 tokenizer + encoding (read-only)
- reference/vllm/reasoning/: thinking mode parsers (DeepSeek-R1 style)
- reference/vllm/tool_parsers/: DSML tool call parsers (V3.2 base, V4 variant)
- reference/official_inference/: the original weights' generate.py, model.py, kernel.py
- reference/README.md documents the layout and which files matter for our pipeline
- These are read-only references for cross-checking, not imported by production code
Reference Implementations
This directory contains read-only reference implementations from official sources. Do not modify these files — they exist to cross-check our production pipeline.
Directory Layout
reference/
├── vllm/  # vLLM project reference (Apache-2.0)
│   ├── tokenizers/
│   │   ├── deepseek_v4.py  # Tokenizer wrapper: apply_chat_template for DSV4
│   │   └── deepseek_v4_encoding.py  # Official prompt encoder (canonical source)
│   ├── reasoning/
│   │   ├── deepseek_v3_reasoning_parser.py  # Thinking-mode dispatcher
│   │   └── deepseek_r1_reasoning_parser.py  # <think>/</think> reasoning token parser
│   └── tool_parsers/
│       ├── deepseekv4_tool_parser.py  # DSML tool call parser (V4)
│       └── deepseekv32_tool_parser.py  # DSML tool call parser (V3.2 base)
│
└── official_inference/  # Reference inference code shipped with the original weights
    ├── generate.py  # Official generate loop + encode_messages usage
    ├── model.py  # BF16/FP8 model implementation
    ├── kernel.py  # Reference CUDA kernels
    ├── convert.py  # Weight conversion
    └── config.json  # Model config (small variant)
Key Files for Our Pipeline
- vllm/tokenizers/deepseek_v4_encoding.py: Canonical prompt encoder. Already copied to encoding/deepseek_v4_encoding.py in the repo root (our live import). If vLLM updates this file, diff the two copies and sync.
- vllm/tokenizers/deepseek_v4.py: Shows how vLLM wraps the tokenizer to add apply_chat_template support. Key insight: it calls encode_messages(messages, thinking_mode=..., ...) and then tokenizer.encode(prompt_str, add_special_tokens=False). This is exactly what our single_shot does.
- official_inference/generate.py: The inference entry point shipped with the original weights. Uses tokenizer.encode(encode_messages(messages, thinking_mode="chat")) (default add_special_tokens=True, unlike the vLLM wrapper) and parse_message_from_completion_text() for output parsing.
- vllm/reasoning/: How vLLM detects thinking-mode boundaries (<think> start, </think> end). Useful when we integrate streaming.
- vllm/tool_parsers/: DSML tool call parsing for future tool-use support.
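The "diff and sync" step for the canonical encoder can be sketched as a small drift check. This is a minimal sketch, not part of the pipeline: the helper name reference_drift is hypothetical, and the two paths are the ones named in this README, assumed to be relative to the repo root.

```python
import difflib
from pathlib import Path
from typing import List


def reference_drift(reference: Path, live: Path) -> List[str]:
    """Return unified-diff lines between the reference file and the live copy.

    An empty list means the two files contain identical text.
    (Hypothetical helper, named for illustration only.)
    """
    ref_lines = reference.read_text(encoding="utf-8").splitlines(keepends=True)
    live_lines = live.read_text(encoding="utf-8").splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            ref_lines, live_lines, fromfile=str(reference), tofile=str(live)
        )
    )


if __name__ == "__main__":
    # Paths as listed in this README; assumed to be run from the repo root.
    ref = Path("reference/vllm/tokenizers/deepseek_v4_encoding.py")
    live = Path("encoding/deepseek_v4_encoding.py")
    if ref.exists() and live.exists():
        diff = reference_drift(ref, live)
        print("".join(diff) if diff else "in sync")
    else:
        print("one or both files are missing; run from the repo root")
```

An empty result means the live import still matches the reference copy. Any output is the set of changes to review before syncing.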
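The thinking-mode boundary detection mentioned above can be illustrated on a complete (non-streaming) completion string. This is a simplified sketch, not vLLM's parser: split_reasoning is a hypothetical name, and the literal markers <think> and </think> are assumed to be the boundary tokens.

```python
from typing import Optional, Tuple

# Assumed boundary markers; check the reference parsers for the real tokens.
THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Split a finished completion into (reasoning, content).

    Returns (None, text) when no closing marker is present, i.e. the text
    holds no completed reasoning block. (Hypothetical helper for illustration.)
    """
    if THINK_END not in text:
        return None, text
    # Everything before the first end marker is reasoning, the rest is content.
    reasoning, _, content = text.partition(THINK_END)
    # The start marker may be absent if the prompt already opened the block.
    _, start_found, after_start = reasoning.partition(THINK_START)
    if start_found:
        reasoning = after_start
    return reasoning, content


if __name__ == "__main__":
    print(split_reasoning("<think>a+b</think>42"))
```

A streaming integration would need to handle markers that arrive split across chunks, which this sketch does not attempt.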