biondizzle/smollm3-3b-vllm

biondizzle bcdbe08037 Document model's inability to emit native tool-call tokens

2026-04-10 16:51:42 +00:00

SmolLM3-3B Tool Call Fix — Notes

Status: SOLVED ✅

All three template bugs are fixed, the reasoning parser works, and tool calling is functional. One model-level limitation remains (see "Known Limitation" below).

What Was Fixed

Bug 1: Tool responses rendered as plain user messages

Tool responses were rendered as <|im_start|>user\n..., so the model could not distinguish them from new user turns and kept re-calling tools. Fixed by wrapping tool responses in the model's dedicated tool_response_start/tool_response_end tokens (128013/128014).
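To make the before/after concrete, here is a minimal Python sketch of the two renderings (the real fix lives in the Jinja template). The delimiter strings, the `<|im_end|>` terminator, the choice to keep the `user` role, and both function names are assumptions for illustration; the actual template takes the delimiters from token IDs 128013/128014.

```python
# Assumed decoded text of token IDs 128013 / 128014 (illustrative only).
TOOL_RESPONSE_START = "<tool_response>"
TOOL_RESPONSE_END = "</tool_response>"


def render_tool_message_buggy(content: str) -> str:
    """Old behaviour: the tool result looks exactly like a new user turn."""
    return f"<|im_start|>user\n{content}<|im_end|>\n"


def render_tool_message_fixed(content: str) -> str:
    """Fixed behaviour: the tool result is wrapped in dedicated delimiters."""
    return (
        "<|im_start|>user\n"
        f"{TOOL_RESPONSE_START}\n{content}\n{TOOL_RESPONSE_END}"
        "<|im_end|>\n"
    )
```

With the fixed rendering, a result such as `{"temp": 21}` is visibly a tool response rather than something the user typed, which is what stops the re-calling loop described above.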

Bug 2: Assistant tool_calls not rendered in history

When an assistant message had tool_calls, the template rendered only content and dropped the tool call array, so the model never saw its own prior invocations. Fixed by rendering tool calls with the tool_call_start/tool_call_end tokens (128015/128016).
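The same idea in plain Python, as a sketch of what the template now does for assistant turns. The delimiter strings, the Hermes-style `{"name": ..., "arguments": ...}` payload, and the function name are assumptions; the message shape follows the OpenAI-style chat format.

```python
import json

# Assumed decoded text of token IDs 128015 / 128016 (illustrative only).
TOOL_CALL_START = "<tool_call>"
TOOL_CALL_END = "</tool_call>"


def render_assistant_turn(message: dict) -> str:
    """Render content plus any tool_calls so prior invocations stay visible."""
    parts = [message.get("content") or ""]
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn["arguments"]
        # OpenAI-style clients send arguments as a JSON string; accept dicts too.
        if isinstance(args, str):
            args = json.loads(args)
        payload = json.dumps({"name": fn["name"], "arguments": args})
        parts.append(f"{TOOL_CALL_START}\n{payload}\n{TOOL_CALL_END}")
    body = "\n".join(part for part in parts if part)
    return f"<|im_start|>assistant\n{body}<|im_end|>\n"
```

A message whose content is None but which carries one tool call renders as an assistant turn containing just the delimited JSON payload, instead of an empty turn.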

Bug 3: Thinking mode direction swapped

/think mode produced a bare assistant prompt (no think tags), while /no_think wrapped the prompt in think tags: exactly backwards. Fixed: /think now opens the think tags and /no_think is plain text.
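The corrected direction can be sketched as a tiny function. The `<think>` string and the function name are assumptions for illustration; the real logic is in the Jinja template's generation prompt.

```python
THINK_OPEN = "<think>"  # assumed text of the opening think tag


def generation_prompt(thinking: bool) -> str:
    """Return the text appended after the last message to start generation."""
    prompt = "<|im_start|>assistant\n"
    if thinking:
        # /think: open the think block so the model starts by reasoning.
        prompt += THINK_OPEN + "\n"
    # /no_think: leave the prompt bare so the model answers directly.
    return prompt
```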

Special Tokens

Token ID	Text	Purpose
128015	`<tool_call>`	Tool call start
128016	`</tool_call>`	Tool call end

Patched Files (in model-files/)

`chat_template.jinja` — Fixed template

Three fixes applied:

Tool responses wrapped in tool_response_start/tool_response_end tokens
Assistant tool_calls rendered in tool_call_start/tool_call_end format
Thinking mode direction corrected

Uses Jinja2 ~ operator (not +) to avoid type errors when message.content is None.
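A small check of the `~` versus `+` behaviour, assuming the `jinja2` package is installed and using a plain `Environment` (vLLM's own template environment may be configured differently). The template strings are illustrative, not lines from chat_template.jinja.

```python
from jinja2 import Environment

env = Environment()
message = {"role": "assistant", "content": None}

# `~` converts both operands to strings before joining, so None is tolerated.
tilde = env.from_string("{{ message.content ~ '<|im_end|>' }}")
tilde_result = tilde.render(message=message)

# `+` uses Python's own operator, and None + str raises TypeError.
plus = env.from_string("{{ message.content + '<|im_end|>' }}")
try:
    plus.render(message=message)
    plus_failed = False
except TypeError:
    plus_failed = True

# An explicit fallback keeps the literal text "None" out of the prompt.
guarded = env.from_string("{{ (message.content or '') ~ '<|im_end|>' }}")
guarded_result = guarded.render(message=message)
```

One thing worth knowing: `~` avoids the exception but stringifies None as the text `None`, so a template that must not leak that word into the prompt would typically also use a fallback such as `or ''`.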

`gen_template.py` — Template generator

Regenerates chat_template.jinja inside the container where the tokenizer is available. Required because the special tokens are Unicode private-use-area characters that can't be typed in editors.
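The general approach can be sketched as follows. This is not the contents of gen_template.py: the `@@NAME@@` placeholder convention and both function names are hypothetical, and `tokenizer` is assumed to be any object exposing `convert_ids_to_tokens`, such as a Hugging Face tokenizer. The token IDs are the ones listed in these notes.

```python
# Token IDs as given in the bug descriptions above.
TOKEN_IDS = {
    "TOOL_RESPONSE_START": 128013,
    "TOOL_RESPONSE_END": 128014,
    "TOOL_CALL_START": 128015,
    "TOOL_CALL_END": 128016,
}


def fill_template(skeleton: str, tokenizer) -> str:
    """Replace @@NAME@@ placeholders with the tokenizer's own token text."""
    filled = skeleton
    for name, token_id in TOKEN_IDS.items():
        token_text = tokenizer.convert_ids_to_tokens(token_id)
        filled = filled.replace(f"@@{name}@@", token_text)
    return filled


def write_template(skeleton: str, tokenizer, path: str) -> None:
    """Write the filled template as UTF-8 so special characters survive."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(fill_template(skeleton, tokenizer))
```

Looking the token text up by ID means the template source never has to contain the special characters directly, which is the reason the file is generated rather than hand-edited.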

`smol_tool_parser.py` — Tool call parser

An unchanged copy of hermes_tool_parser.py, kept in case it needs changes later.

The stock vLLM Hermes parser works as-is for parsing `<tool_call>...</tool_call>` blocks. No patches needed.
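For orientation, Hermes-style extraction boils down to finding delimited JSON objects in the generated text. The sketch below is a deliberately simplified stand-in, not vLLM's implementation: it ignores streaming and partial JSON, and the function name is hypothetical.

```python
import json
import re

# Hermes-style blocks: a JSON object between <tool_call> and </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Return [{'name': ..., 'arguments': ...}] for each well-formed block."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            obj = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # skip blocks whose body is not valid JSON
        if isinstance(obj, dict) and "name" in obj:
            calls.append(
                {"name": obj["name"], "arguments": obj.get("arguments", {})}
            )
    return calls
```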

Reasoning Parser — NOT PATCHED

The built-in deepseek_r1 reasoning parser in vLLM works with SmolLM3 out of the box, because they share the same `<think>`/`</think>` tokens. Verified by diffing the container's copy against the vLLM source: identical, no patches needed.
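What "splitting thinking from content" means can be sketched in a few lines. This is a simplified illustration, not vLLM's parser: it handles only complete, non-streamed output, and it treats text with no closing tag as plain content, which may differ from vLLM's behaviour. The function name is hypothetical.

```python
from typing import Optional, Tuple

THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Split generated text into (reasoning, content)."""
    before, end_sep, after = text.partition(THINK_END)
    if not end_sep:
        # No closing tag: this sketch treats everything as content.
        return None, text
    # The opening tag may be absent if the prompt already opened the block.
    _, start_sep, inner = before.partition(THINK_START)
    reasoning = inner if start_sep else before
    return reasoning.strip(), after.strip()
```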

Deploying

Generate template inside the container:

docker cp model-files/gen_template.py smol-vllm-1:/tmp/
docker exec smol-vllm-1 python3 /tmp/gen_template.py

Copy to mounted volume and restart:

docker cp smol-vllm-1:/root/chat_template.jinja /root/smol/chat_template.jinja
cd /root/smol && docker compose restart

Required vLLM flags:

--chat-template=/root/chat_template.jinja
--enable-auto-tool-choice
--tool-call-parser=hermes
--reasoning-parser=deepseek_r1
--chat-template-content-format=string
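With those flags set, a client can exercise the setup through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below uses only the standard library; the base URL, the model name passed in, and the `save_config` schema are placeholders rather than values from this deployment.

```python
import json
import urllib.request


def build_chat_request(model: str, user_text: str) -> dict:
    """Build an OpenAI-style chat payload that offers one tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "save_config",  # placeholder tool schema
                    "description": "Persist a configuration object.",
                    "parameters": {
                        "type": "object",
                        "properties": {"config": {"type": "object"}},
                        "required": ["config"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

In a successful tool-calling response, the parsed calls appear under `choices[0]["message"]["tool_calls"]` rather than in the message content.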

Test Results

✅ Tool response tests: All PASS (streaming + non-streaming)
✅ Streaming tool calls: Incremental, 325+ chunks
✅ Reasoning parser: Correctly splits thinking/content
✅ Multi-turn tool use: Model reads results, answers properly
⚠️ The 3B model doesn't reliably choose tools over free text for complex tasks (it writes code as content instead of calling write_file). This is a model capability gap, not a parsing issue. A LoRA fine-tune is planned to address it.

Known Limitation: Model Doesn't Emit Native Tool-Call Tokens

Verified via raw token inspection (chat-template-debugger): SmolLM3-3B does not natively emit structured tool-call tokens for any tool-use prompt. When asked to use write_file or save_config, the model writes Python code that calls the tool as a function (save_config(config)) instead of emitting the startPos/endPos token sequences that vLLM's parser expects.
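The kind of check involved can be sketched on raw token IDs. This is not the chat-template-debugger's code; the function name is hypothetical, and the IDs are the tool_call_start/tool_call_end values given under Bug 2.

```python
from typing import Sequence

TOOL_CALL_START_ID = 128015  # tool_call_start, per the Bug 2 notes
TOOL_CALL_END_ID = 128016    # tool_call_end, per the Bug 2 notes


def has_native_tool_call(token_ids: Sequence[int]) -> bool:
    """True if a start delimiter is followed later by an end delimiter."""
    ids = list(token_ids)
    if TOOL_CALL_START_ID not in ids:
        return False
    start = ids.index(TOOL_CALL_START_ID)
    return TOOL_CALL_END_ID in ids[start + 1:]
```

With vLLM's offline API, the IDs to inspect would come from something like `outputs[0].outputs[0].token_ids` after `llm.generate()`. A code-dump such as `save_config(config)` tokenizes to ordinary text tokens, so this check returns False for it.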

What's happening under the hood

Prompt	Raw `llm.generate()`	Via vLLM API (chat template)
write_file (short)	❌ Code-dumps `def write_file(...)` in a loop	❌ Fails — parser can't extract tool call from code
save_config (nested JSON)	❌ Writes `from tools import save_config` + prose	✅ "Passes" — but the parser is reconstructing the call from text
save_config (streaming)	❌ Same as above	✅ Streams correctly — parser extracts JSON from prose/code

The save_config "pass" is not the model emitting tool-call tokens. The chat template and Hermes parser are doing salvage work: they see the model describing the tool call in text or code and restructure it into the tool_calls field. This works for structured JSON output (save_config) but breaks for longer code output (write_file), because the parser can't reliably extract a clean function call from a full Python implementation.

Root cause

The model was trained on code and general instruction following, not on tool-calling token sequences. It understands what tools are conceptually (it names them, describes them, writes code that calls them) but it was never trained to emit the startPos/endPos token delimiters that signal a real tool invocation to the parser.

Planned fix

LoRA fine-tuning to teach the model to emit native tool-call tokens. The training data in smollora already converts all tool calls to the correct startPos/endPos format. Once the model learns these token sequences, it should emit them directly instead of falling back to code-dumping. This will fix both the write_file and save_config cases at the model level, eliminating the parser's salvage work.

See: /home/openclaw/dev/smollora/README.md for LoRA training details.
See: /home/openclaw/dev/chat-template-debugger/ for the raw token inspector that proved this.
