# SmolLM3-3B Tool Call Fix — Notes

## Status: SOLVED ✅

All three template bugs fixed, reasoning parser working, tool calling functional.

## What Was Fixed

### Bug 1: Tool responses rendered as plain user messages

Tool responses showed up as `<|im_start|>user\n...`, so the model couldn't distinguish them from new user turns and kept re-calling tools. Fixed by wrapping tool responses with the model's dedicated `tool_response_start`/`tool_response_end` tokens (128013/128014).

### Bug 2: Assistant tool_calls not rendered in history

When an assistant message had `tool_calls`, the template only rendered `content` and dropped the tool call array, so the model never saw its own prior invocations. Fixed by rendering tool calls using the `tool_call_start`/`tool_call_end` tokens (128015/128016).

### Bug 3: Thinking mode direction swapped

`/think` mode produced a bare assistant prompt (no think tags), while `/no_think` wrapped the prompt in think tags. Completely backwards. Fixed: `/think` opens `...` tags, `/no_think` is plain text.

## Special Tokens

| Token ID | Text | Purpose |
|----------|------|---------|
| 128002 | `...` | Tool call start |
| 128016 | `...` | Tool call end |

## Patched Files (in model-files/)

### `chat_template.jinja` — Fixed template

Three fixes applied:

1. Tool responses wrapped in `tool_response_start`/`tool_response_end` tokens
2. Assistant tool_calls rendered in `tool_call_start`/`tool_call_end` format
3. Thinking mode direction corrected

Uses the Jinja2 `~` operator (not `+`) to avoid type errors when `message.content` is None.

### `gen_template.py` — Template generator

Regenerates `chat_template.jinja` inside the container, where the tokenizer is available. Required because the special tokens are Unicode private-use-area characters that can't be typed in editors.

### `smol_tool_parser.py` — Tool call parser

This is just the unchanged `hermes_tool_parser.py`, kept in case we need to change it. The stock vLLM Hermes parser works as-is for parsing `...` blocks. No patches needed.
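The first two template fixes can be pictured in plain Python. This is only a sketch of the rendering logic, not the actual `chat_template.jinja`: the function name `render_message`, the bracketed placeholder delimiters (standing in for the real special-token text, which is not reproduced here), the `<|im_end|>` turn terminator, and the choice to keep tool results inside a user turn are all assumptions made for illustration.

```python
import json

# Placeholder delimiters (assumption): the real special-token text is not shown here.
TOOL_RESPONSE_START = "[TOOL_RESPONSE_START]"
TOOL_RESPONSE_END = "[TOOL_RESPONSE_END]"
TOOL_CALL_START = "[TOOL_CALL_START]"
TOOL_CALL_END = "[TOOL_CALL_END]"


def render_message(message: dict) -> str:
    """Render one chat message, keeping tool results and tool calls visible."""
    role = message["role"]
    # content may be None on assistant messages that only carry tool_calls
    content = message.get("content") or ""

    if role == "tool":
        # Fix 1: wrap the tool result so it is not mistaken for a new user turn
        body = TOOL_RESPONSE_START + "\n" + content + "\n" + TOOL_RESPONSE_END
        return "<|im_start|>user\n" + body + "<|im_end|>\n"

    if role == "assistant":
        parts = [content] if content else []
        # Fix 2: render every prior tool call instead of dropping the array
        for call in message.get("tool_calls") or []:
            fn = call["function"]
            args = fn["arguments"]
            if isinstance(args, str):  # OpenAI-style APIs send arguments as a JSON string
                args = json.loads(args)
            payload = json.dumps({"name": fn["name"], "arguments": args})
            parts.append(TOOL_CALL_START + "\n" + payload + "\n" + TOOL_CALL_END)
        return "<|im_start|>assistant\n" + "\n".join(parts) + "<|im_end|>\n"

    return "<|im_start|>" + role + "\n" + content + "<|im_end|>\n"
```

Treating a missing `content` as an empty string plays the same role here as the `~` operator does in the template: concatenation never sees `None`.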

## Reasoning Parser — NOT PATCHED

The built-in `deepseek_r1` reasoning parser in vLLM works with SmolLM3 out of the box — they share the same `...` tokens. Verified by diffing the container's copy against the vllm source: identical, no patches needed.

## Deploying

1. Generate template inside the container:

   ```bash
   docker cp model-files/gen_template.py smol-vllm-1:/tmp/
   docker exec smol-vllm-1 python3 /tmp/gen_template.py
   ```

2. Copy to mounted volume and restart:

   ```bash
   docker cp smol-vllm-1:/root/chat_template.jinja /root/smol/chat_template.jinja
   cd /root/smol && docker compose restart
   ```

3. Required vLLM flags:

   ```
   --chat-template=/root/chat_template.jinja
   --enable-auto-tool-choice
   --tool-call-parser=hermes
   --reasoning-parser=deepseek_r1
   --chat-template-content-format=string
   ```

## Test Results

- ✅ Tool response tests: All PASS (streaming + non-streaming)
- ✅ Streaming tool calls: Incremental, 325+ chunks
- ✅ Reasoning parser: Correctly splits thinking/content
- ✅ Multi-turn tool use: Model reads results, answers properly
- ⚠️ 3B model doesn't reliably choose tools over free-text for complex tasks (writes code as content instead of calling write_file). This is a model capability gap, not a parsing issue. Planned LoRA to address.

## Known Limitation: Model Doesn't Emit Native Tool-Call Tokens

**Verified via raw token inspection (chat-template-debugger):** SmolLM3-3B does **not** natively emit structured tool-call tokens for any tool-use prompt. When asked to use `write_file` or `save_config`, the model writes Python code that *calls* the tool as a function (`save_config(config)`) instead of emitting the `startPos`/`endPos` token sequences that vLLM's parser expects.
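
The distinction described above, between a delimited tool call and code that merely calls the tool as a function, can be checked mechanically on raw output. The following is a minimal sketch and not part of chat-template-debugger: `classify_output` and its category labels are hypothetical, and the delimiter strings are passed in by the caller rather than hard-coded, since the real token text is not reproduced in these notes.

```python
import re


def classify_output(text: str, start: str, end: str, tool_names: list[str]) -> str:
    """Label raw model output as 'delimited', 'code_call', or 'plain_text'."""
    # A real tool call is a payload enclosed by the start/end delimiters
    delimited = re.escape(start) + r"(.*?)" + re.escape(end)
    if re.search(delimited, text, flags=re.DOTALL):
        return "delimited"

    # Otherwise, look for the tool name used like a Python function: name(...)
    for name in tool_names:
        if re.search(r"\b" + re.escape(name) + r"\s*\(", text):
            return "code_call"

    return "plain_text"
```

Running this over a batch of raw completions gives a rough count of how often the model falls back to code instead of emitting the delimiters.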

### What's happening under the hood

| Prompt | Raw `llm.generate()` | Via vLLM API (chat template) |
|--------|---------------------|------------------------------|
| write_file (short) | ❌ Code-dumps `def write_file(...)` in a loop | ❌ Fails — parser can't extract tool call from code |
| save_config (nested JSON) | ❌ Writes `from tools import save_config` + prose | ✅ "Passes" — but the parser is reconstructing the call from text |
| save_config (streaming) | ❌ Same as above | ✅ Streams correctly — parser extracts JSON from prose/code |

The save_config "pass" is **not** the model emitting tool-call tokens. The chat template + Hermes parser is doing salvage work — it sees the model describing the tool call in text/code and restructures it into the `tool_calls` field. This works for structured JSON output (save_config) but breaks for longer code output (write_file) because the parser can't reliably extract a clean function call from a full Python implementation.

### Root cause

The model was trained on code and general instruction following, not on tool-calling token sequences. It *understands* what tools are conceptually (it names them, describes them, writes code that calls them) but it was never trained to emit the `startPos`/`endPos` token delimiters that signal a real tool invocation to the parser.

### Planned fix

**LoRA fine-tuning** to teach the model to emit native tool-call tokens. The training data in `smollora` already converts all tool calls to the correct `startPos`/`endPos` format. Once the model learns these token sequences, it should emit them directly instead of falling back to code-dumping. This will fix both the write_file and save_config cases at the model level, eliminating the parser's salvage work.

See: `/home/openclaw/dev/smollora/README.md` for LoRA training details.

See: `/home/openclaw/dev/chat-template-debugger/` for the raw token inspector that proved this.
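
To illustrate why JSON-shaped output (save_config) is easier to salvage than a full Python implementation (write_file), here is a small stdlib-only sketch. It is not vLLM's Hermes parser and makes no claim about how that parser works internally; `find_json_objects` is a hypothetical helper that simply scans text for decodable JSON objects.

```python
import json


def find_json_objects(text: str) -> list[dict]:
    """Return every JSON object that can be decoded somewhere inside text."""
    decoder = json.JSONDecoder()
    found = []
    i = 0
    while i < len(text):
        if text[i] != "{":
            i += 1
            continue
        try:
            # raw_decode parses one JSON value starting at index i and
            # returns the value plus the index just past it
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            i += 1
            continue
        found.append(obj)
        i = end
    return found


prose = 'I will call save_config with {"name": "save_config", "arguments": {"debug": true}} now.'
code = 'def write_file(path, content):\n    with open(path, "w") as f:\n        f.write(content)\n'

print(find_json_objects(prose))  # one recoverable call description
print(find_json_objects(code))   # nothing JSON-shaped to recover
```

Prose that embeds a JSON object leaves something structured to pull out, whereas a function definition contains no object to recover, which matches the pass/fail pattern in the table above.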