
# Reference Implementations

This directory contains **read-only** reference implementations from official sources.
Do not modify these files — they exist to cross-check our production pipeline.

## Directory Layout

```
reference/
├── vllm/                        # vLLM project reference (Apache-2.0)
│   ├── tokenizers/
│   │   ├── deepseek_v4.py           # Tokenizer wrapper — apply_chat_template for DSV4
│   │   └── deepseek_v4_encoding.py  # Official prompt encoder (canonical source)
│   ├── reasoning/
│   │   ├── deepseek_v3_reasoning_parser.py   # Thinking-mode dispatcher
│   │   └── deepseek_r1_reasoning_parser.py   # <think>/</think> reasoning token parser
│   └── tool_parsers/
│       ├── deepseekv4_tool_parser.py         # DSML tool call parser (V4)
│       └── deepseekv32_tool_parser.py        # DSML tool call parser (V3.2 base)
│
└── official_inference/          # Reference inference code released with the original weights
    ├── generate.py                  # Official generate loop + encode_messages usage
    ├── model.py                     # BF16/FP8 model implementation
    ├── kernel.py                    # Reference CUDA kernels
    ├── convert.py                   # Weight conversion
    └── config.json                  # Model config (small variant)
```

## Key Files for Our Pipeline

1. **`vllm/tokenizers/deepseek_v4_encoding.py`** — Canonical prompt encoder.
   Already copied to `encoding/deepseek_v4_encoding.py` in the repo root (our live import).
   If vLLM updates this file, diff and sync.

2. **`vllm/tokenizers/deepseek_v4.py`** — Shows how vLLM wraps the tokenizer
   to add `apply_chat_template` support. Key insight: it calls
   `encode_messages(messages, thinking_mode=..., ...)` then
   `tokenizer.encode(prompt_str, add_special_tokens=False)`.
   This is exactly what our `single_shot` does.

3. **`official_inference/generate.py`** — The original weight's inference entry point.
   Uses `tokenizer.encode(encode_messages(messages, thinking_mode="chat"))`
   with the default `add_special_tokens=True` (the vLLM wrapper in item 2 passes `False`),
   and `parse_message_from_completion_text()` for output parsing.

4. **`vllm/reasoning/`** — How vLLM detects thinking mode boundaries
   (`<think>` start, `</think>` end). Useful when we integrate streaming.

5. **`vllm/tool_parsers/`** — DSML tool call parsing for future tool-use support.
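
The "diff and sync" step from item 1 can be sketched as a small drift check. This is a minimal illustration, not part of the repo: the helper name `reference_drift` is hypothetical, and it only reports differences between the reference copy and the live copy, leaving the actual sync to a human.

```python
import difflib
from pathlib import Path


def reference_drift(reference_path, live_path):
    """Return unified-diff lines between a reference file and its live copy.

    Hypothetical helper for illustration. An empty list means the two
    files are byte-for-byte identical as UTF-8 text.
    """
    ref_lines = Path(reference_path).read_text(encoding="utf-8").splitlines(keepends=True)
    live_lines = Path(live_path).read_text(encoding="utf-8").splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            ref_lines,
            live_lines,
            fromfile=str(reference_path),
            tofile=str(live_path),
        )
    )
```

Assuming it is run from the repo root, calling `reference_drift("reference/vllm/tokenizers/deepseek_v4_encoding.py", "encoding/deepseek_v4_encoding.py")` returns an empty list while the two copies match, and the diff lines to review otherwise.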