SmolLM3-3B Tool Call Fix — Notes
Problem
The SmolLM3-3B model's chat template has three bugs that break multi-turn tool calling in vLLM.
Bugs Found
Bug 1: Tool responses rendered as plain user messages
Location: chat_template.jinja, main loop, message.role == "tool" branch
Original:
{%- elif message.role == "tool" -%}
{{ "<|im_start|>" + "user\n" + content + "<|im_end|>\n" }}
Tool responses show up as <|im_start|>user\n...<|im_end|> — the model cannot distinguish a tool result from a new user turn. When it sees weather data in a user message, it re-invokes the tool instead of answering.
Fix: Use the model's dedicated tool_response_start/tool_response_end tokens (128013/128014) to wrap tool responses so the model can distinguish them from user messages.
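The difference between the original and fixed rendering can be sketched in plain Python. This is only an illustration of the string layout, not the template itself: the bracketed marker strings are placeholders for the real special tokens (IDs 128013/128014), and the exact newline placement is an assumption.

```python
# Placeholder strings standing in for the real special tokens (IDs 128013/128014).
# The actual token text must come from the tokenizer.
TOOL_RESPONSE_START = "[tool_response_start]"
TOOL_RESPONSE_END = "[tool_response_end]"


def render_tool_message_buggy(content: str) -> str:
    """Original behaviour: the tool result looks exactly like a user turn."""
    return "<|im_start|>user\n" + content + "<|im_end|>\n"


def render_tool_message(content: str) -> str:
    """Fixed behaviour: the tool result is wrapped in tool-response markers."""
    return (
        "<|im_start|>user\n"
        + TOOL_RESPONSE_START + "\n"
        + content + "\n"
        + TOOL_RESPONSE_END
        + "<|im_end|>\n"
    )
```

With the fixed form, the model can tell a tool result from a fresh user request because the markers never appear in ordinary user turns.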
Bug 2: Assistant tool_calls not rendered in history
Location: chat_template.jinja, main loop, message.role == "assistant" branch
When the assistant message has tool_calls, the template only renders content (often empty/None) and drops the entire tool_calls array. The model never sees its own prior tool invocations.
Fix: Render tool calls using the model's native tool_call_start/tool_call_end tokens (128015/128016) with proper JSON format.
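As a rough sketch of what the fixed assistant branch produces, the following Python mirrors the rendering logic. The bracketed markers are placeholders for the real tokens (IDs 128015/128016), the function name is hypothetical, and decoding string-valued arguments is an assumption about how OpenAI-style clients send them.

```python
import json

# Placeholder strings standing in for the real special tokens (IDs 128015/128016).
TOOL_CALL_START = "[tool_call_start]"
TOOL_CALL_END = "[tool_call_end]"


def render_assistant_message(message: dict) -> str:
    """Render an assistant turn, including any tool calls it made."""
    # content is often None when the turn consists only of tool calls
    content = message.get("content") or ""
    parts = ["<|im_start|>assistant\n", content]
    for tc in message.get("tool_calls") or []:
        fn = tc["function"]
        args = fn["arguments"]
        # Assumption: arguments may arrive as a JSON string; decode it so the
        # rendered call contains a nested object rather than an escaped string.
        if isinstance(args, str):
            args = json.loads(args)
        payload = json.dumps({"name": fn["name"], "arguments": args})
        parts.append(TOOL_CALL_START + "\n" + payload + "\n" + TOOL_CALL_END)
    parts.append("<|im_end|>\n")
    return "".join(parts)
```

A message without tool_calls falls through to plain content rendering, so ordinary assistant turns are unaffected.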
Bug 3: Thinking mode inverted
Location: chat_template.jinja, main loop and generation prompt
When reasoning_mode == "/think", the template does NOT wrap content in think tags. When reasoning_mode == "/no_think", it DOES wrap content in think tags. The two branches are swapped.
Fix: /think mode wraps content in think tags. /no_think renders plain text.
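The corrected mapping is small enough to state as a Python sketch. The marker strings are placeholders for the real think tokens, and the function name and newline layout are assumptions made for illustration.

```python
# Placeholder strings standing in for the real think start/end tokens.
THINK_START = "[think_start]"
THINK_END = "[think_end]"


def render_assistant_content(content: str, reasoning_mode: str) -> str:
    """Wrap content in think markers only when reasoning is enabled."""
    if reasoning_mode == "/think":
        return THINK_START + "\n" + content + "\n" + THINK_END
    # "/no_think" (or anything else): plain text, no markers
    return content
```

The original template had these two branches the other way round.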
Special Tokens
The model has these tool-related tokens in its tokenizer (added_tokens_decoder):
| Token ID | Text | Purpose |
|---|---|---|
| 128002 | ... | Think end |
| 128013 | ... | Tool response start |
| 128014 | ... | Tool response end |
| 128015 | ... | Tool call start |
| 128016 | ... | Tool call end |
How the Fix Works
Template Changes
- Tool responses now render as follows instead of as a bare user message:
  <|im_start|>user
  [tool_response_start]
  {tool result content}
  [tool_response_end]<|im_end|>
- Assistant tool calls now render as follows instead of being dropped entirely:
  <|im_start|>assistant
  [tool_call_start]
  {"name": "func_name", "arguments": {...}}
  [tool_call_end]<|im_end|>
- Thinking mode is now correctly mapped: /think → think tags, /no_think → plain text.
Key Technical Details
- The template uses Jinja2's ~ operator instead of + for string concatenation. This avoids type errors when message.content is None (Jinja2's ~ coerces its operands to strings, + does not).
- The tool_call_start/tool_call_end tokens are Unicode private-use-area characters that can't be typed in a text editor. The template must be generated programmatically using gen_template.py.
- The tc.function.name and tc.function.arguments dot notation works on plain dicts because Jinja2 falls back to item lookup (dict["key"]) when attribute lookup on dict.key fails.
- The {% generation %} tag is a chat-template extension, not standard Jinja2, and marks the assistant output region. It must be preserved.
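One way to keep untypeable characters out of the hand-edited source is to write the template with ASCII placeholders and substitute the real token text in a script. The sketch below illustrates that idea only; it is not the contents of gen_template.py. The placeholder names, the call_json variable, and the U+E000/U+E001 code points are all hypothetical, and in practice the token text would be read from the tokenizer.

```python
import unicodedata

# Hypothetical ASCII placeholders used in the human-editable template source.
PLACEHOLDERS = ("@@TOOL_CALL_START@@", "@@TOOL_CALL_END@@")

# Human-editable Jinja2 source. `call_json` is a hypothetical variable
# holding the serialized tool call.
TEMPLATE_SOURCE = (
    '{{ "<|im_start|>assistant\\n" ~ "@@TOOL_CALL_START@@" ~ call_json'
    ' ~ "@@TOOL_CALL_END@@" ~ "<|im_end|>\\n" }}'
)


def is_private_use(ch: str) -> bool:
    """True if ch is a single private-use code point (Unicode category Co)."""
    return len(ch) == 1 and unicodedata.category(ch) == "Co"


def fill_tokens(source: str, start_token: str, end_token: str) -> str:
    """Substitute the real token text for the ASCII placeholders."""
    for placeholder, token in zip(PLACEHOLDERS, (start_token, end_token)):
        source = source.replace(placeholder, token)
    return source


# Illustrative code points only; the real characters come from the tokenizer.
template = fill_tokens(TEMPLATE_SOURCE, "\ue000", "\ue001")
```

The result would then be written out with an explicit encoding, for example open(path, "w", encoding="utf-8"), so the private-use characters survive on disk.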
Files
- model-files/chat_template.jinja: The fixed template (generated, contains Unicode PUA characters)
- model-files/gen_template.py: Script to regenerate the template inside the container, where the tokenizer is available
- model-files/hermes_tool_parser.py: vLLM Hermes tool parser (unchanged, works as-is for parsing the ... format)
Deploying
- Run gen_template.py inside the vLLM container:
  docker cp model-files/gen_template.py smol-vllm-1:/tmp/
  docker exec smol-vllm-1 python3 /tmp/gen_template.py
- Copy the generated template to the mounted volume:
  docker cp smol-vllm-1:/root/chat_template.jinja /root/smol/chat_template.jinja
- Restart the container:
  cd /root/smol && docker compose restart
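After a restart, a multi-turn request that exercises all three fixed branches is a quick smoke test. The sketch below builds such a request for the OpenAI-compatible chat endpoint. The URL, port, served model name, and the get_weather tool are assumptions chosen for illustration and should be adjusted to the actual deployment.

```python
import json
import urllib.request

# Assumptions: vLLM's OpenAI-compatible server on its default port, and the
# model served under this name. Adjust both to match the deployment.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "HuggingFaceTB/SmolLM3-3B"


def build_multi_turn_payload() -> dict:
    """Build a conversation containing a prior tool call and its result."""
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]
    messages = [
        {"role": "user", "content": "What's the weather in Paris?"},
        {   # the assistant's earlier tool invocation (Bug 2 dropped this)
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "Paris"}),
                },
            }],
        },
        {   # the tool result (Bug 1 rendered this as a plain user turn)
            "role": "tool",
            "tool_call_id": "call_0",
            "content": json.dumps({"temp_c": 21}),
        },
    ]
    return {"model": MODEL, "messages": messages, "tools": tools}


def send(payload: dict) -> dict:
    """POST the payload to the server and return the decoded JSON reply."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

Calling send(build_multi_turn_payload()) against the running container and inspecting choices[0].message shows whether the model answers in text or issues another get_weather call.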
Remaining Issues
- The model sometimes re-invokes tools in a loop instead of providing a final text answer. This is likely a training issue with the /no_think mode: the model outputs reasoning as content text but still generates tool calls.
- The Hermes tool parser works for parsing ... blocks, but the streaming parser may buffer long argument strings. This is a vLLM-level issue, not a template issue.