Files
nvfp4-megamoe-kernel/reference

Reference Implementations

This directory contains read-only reference implementations from official sources. Do not modify these files — they exist to cross-check our production pipeline.

Directory Layout

reference/
├── vllm/                        # vLLM project reference (Apache-2.0)
│   ├── tokenizers/
│   │   ├── deepseek_v4.py           # Tokenizer wrapper — apply_chat_template for DSV4
│   │   └── deepseek_v4_encoding.py  # Official prompt encoder (canonical source)
│   ├── reasoning/
│   │   ├── deepseek_v3_reasoning_parser.py   # Thinking-mode dispatcher
│   │   └── deepseek_r1_reasoning_parser.py   # / reasoning token parser
│   └── tool_parsers/
│       ├── deepseekv4_tool_parser.py         # DSML tool call parser (V4)
│       └── deepseekv32_tool_parser.py        # DSML tool call parser (V3.2 base)
│
└── official_inference/          # Original weight's reference inference code
    ├── generate.py                  # Official generate loop + encode_messages usage
    ├── model.py                     # BF16/FP8 model implementation
    ├── kernel.py                    # Reference CUDA kernels
    ├── convert.py                   # Weight conversion
    └── config.json                  # Model config (small variant)

Key Files for Our Pipeline

  1. vllm/tokenizers/deepseek_v4_encoding.py — Canonical prompt encoder. Already copied to encoding/deepseek_v4_encoding.py in the repo root (our live import). If vLLM updates this file, diff and sync.

  2. vllm/tokenizers/deepseek_v4.py — Shows how vLLM wraps the tokenizer to add apply_chat_template support. Key insight: it calls encode_messages(messages, thinking_mode=..., ...) then tokenizer.encode(prompt_str, add_special_tokens=False). This is exactly what our single_shot does.

  3. official_inference/generate.py — The original weight's inference entry point. Uses tokenizer.encode(encode_messages(messages, thinking_mode="chat")) (default add_special_tokens=True) and parse_message_from_completion_text() for output parsing.

  4. vllm/reasoning/ — How vLLM detects thinking mode boundaries ()、 start, / end). Useful when we integrate streaming.

  5. vllm/tool_parsers/ — DSML tool call parsing for future tool-use support.