# SmolLM3-3B Tool Call Fix — Notes

## Status: SOLVED ✅

All three template bugs fixed, reasoning parser working, tool calling functional.

## What Was Fixed

### Bug 1: Tool responses rendered as plain user messages
Tool responses were rendered as `<|im_start|>user\n...`, so the model couldn't distinguish them from new user turns and kept re-calling tools. Fixed by wrapping tool responses in the model's dedicated `tool_response_start`/`tool_response_end` tokens (128013/128014).

### Bug 2: Assistant tool_calls not rendered in history
When an assistant message had `tool_calls`, the template rendered only `content` and dropped the tool call array, so the model never saw its own prior invocations. Fixed by rendering tool calls between the `tool_call_start`/`tool_call_end` tokens (128015/128016).
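
The two fixes above can be sketched in Python instead of Jinja. This is a minimal illustration, not the patched template itself. The marker strings `<tool_call>` and `<tool_response>` are assumed stand-ins for the four special tokens, the `<|im_end|>` terminator is assumed ChatML, and placing the tool result inside a `user` turn is also an assumption.

```python
import json

# Placeholder marker strings; assumed stand-ins for tokens 128013-128016.
TOOL_RESPONSE_START, TOOL_RESPONSE_END = "<tool_response>", "</tool_response>"
TOOL_CALL_START, TOOL_CALL_END = "<tool_call>", "</tool_call>"


def render_message(message: dict) -> str:
    """Render one chat message as a ChatML-style turn."""
    role = message["role"]
    content = message.get("content") or ""  # content may be None on tool-call turns

    if role == "tool":
        # Bug 1: wrap tool output so it is distinguishable from a real user turn.
        body = f"{TOOL_RESPONSE_START}\n{content}\n{TOOL_RESPONSE_END}"
        return f"<|im_start|>user\n{body}<|im_end|>\n"

    if role == "assistant" and message.get("tool_calls"):
        # Bug 2: keep the assistant's earlier tool calls in the history.
        parts = [content] if content else []
        for call in message["tool_calls"]:
            fn = call["function"]
            args = fn["arguments"]
            if isinstance(args, str):  # OpenAI-style clients send a JSON string
                args = json.loads(args)
            payload = json.dumps({"name": fn["name"], "arguments": args})
            parts.append(f"{TOOL_CALL_START}\n{payload}\n{TOOL_CALL_END}")
        return "<|im_start|>assistant\n" + "\n".join(parts) + "<|im_end|>\n"

    return f"<|im_start|>{role}\n{content}<|im_end|>\n"


def render_history(messages: list) -> str:
    """Concatenate all rendered turns in order."""
    return "".join(render_message(m) for m in messages)
```

With this shape, a `tool` message no longer looks like a fresh user request, and an assistant turn with `content` set to None still renders its tool calls.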

### Bug 3: Thinking mode direction swapped
In `/think` mode the template produced a bare assistant prompt (no think tags), while `/no_think` wrapped the output in think tags, which is exactly backwards. Fixed: `/think` opens `<think>...</think>` tags, `/no_think` is plain text.
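
The corrected direction can be sketched as a small Python function. This is an illustration only: the literal `<think>` tag and the newline after it are assumptions about what the template emits, and `generation_prompt` is a hypothetical name.

```python
def generation_prompt(enable_thinking: bool) -> str:
    """Return the assistant prefix appended after the last message."""
    prompt = "<|im_start|>assistant\n"
    if enable_thinking:
        # /think: open the reasoning block so the model starts by thinking.
        prompt += "<think>\n"
    # /no_think: bare assistant prompt, the answer follows directly.
    return prompt
```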

## Special Tokens

| Token ID | Text | Purpose |
|----------|------|---------|
| 128013 | `...` | Tool response start |
| 128014 | `...` | Tool response end |
| 128015 | `...` | Tool call start |
| 128016 | `...` | Tool call end |

## Patched Files (in model-files/)

### `chat_template.jinja` — Fixed template
Three fixes applied:
1. Tool responses wrapped in `tool_response_start`/`tool_response_end` tokens
2. Assistant tool_calls rendered in `tool_call_start`/`tool_call_end` format
3. Thinking mode direction corrected

Uses the Jinja2 `~` operator (not `+`) for concatenation. `~` converts both operands to strings, so it doesn't raise a type error when `message.content` is None.

### `gen_template.py` — Template generator
Regenerates `chat_template.jinja` inside the container where the tokenizer is available. Required because the special tokens are Unicode private-use-area characters that can't be typed in editors.
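
`gen_template.py` itself is not reproduced here. As a rough sketch of the idea (look the token strings up by ID instead of typing them), the following reads them from a Hugging Face `tokenizer.json`, whose `added_tokens` list carries `id` and `content` fields. The function name is hypothetical, and the real script may use the loaded tokenizer instead.

```python
import json
from pathlib import Path

# IDs named in this README: tool_response start/end, tool_call start/end.
WANTED_IDS = (128013, 128014, 128015, 128016)


def load_special_tokens(tokenizer_json: str, ids=WANTED_IDS) -> dict:
    """Map token ID -> token string using the tokenizer's added_tokens list."""
    data = json.loads(Path(tokenizer_json).read_text(encoding="utf-8"))
    by_id = {entry["id"]: entry["content"] for entry in data.get("added_tokens", [])}
    missing = [i for i in ids if i not in by_id]
    if missing:
        # Fail loudly rather than writing a template with holes in it.
        raise KeyError(f"token IDs not found in added_tokens: {missing}")
    return {i: by_id[i] for i in ids}
```

The returned strings could then be interpolated into the template text before writing `chat_template.jinja`, so the file never has to be edited by hand.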

### `smol_tool_parser.py` — Tool call parser
An unchanged copy of vLLM's `hermes_tool_parser.py`, kept in case it needs changes later. The stock Hermes parser works as-is for parsing `<tool_call>...</tool_call>` blocks, so no patches were needed.
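
For orientation, Hermes-style parsing boils down to pulling a JSON object out from between `<tool_call>` and `</tool_call>`. The sketch below is a simplified, non-streaming illustration and not vLLM's implementation. `extract_tool_calls` is a hypothetical helper name.

```python
import json
import re

# Hermes-style tool calls: a JSON object between <tool_call> and </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> tuple:
    """Return (remaining content, list of parsed tool calls) from raw model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip malformed JSON; a production parser would need a policy here.
            continue
    # Everything outside the tool-call blocks is treated as normal content.
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls
```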

## Reasoning Parser — NOT PATCHED

The built-in `deepseek_r1` reasoning parser in vLLM works with SmolLM3 out of the box because both use the same `<think>...</think>` tokens. Verified by diffing the container's copy against the vLLM source: identical, so no patches are needed.
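
What "splitting thinking/content" means can be shown with a few lines of Python. This is a simplified sketch of the idea, not the `deepseek_r1` parser's code. In particular, treating output without a closing tag as plain content is a choice made here for illustration and may differ from vLLM's behavior.

```python
from typing import Optional, Tuple


def split_reasoning(text: str, start: str = "<think>", end: str = "</think>") -> Tuple[Optional[str], str]:
    """Split raw model output into (reasoning, content)."""
    if end not in text:
        # No closing tag: treat everything as regular content in this sketch.
        return None, text.strip()
    reasoning, _, content = text.partition(end)
    # The opening tag may be absent when the prompt already ended with it.
    reasoning = reasoning.replace(start, "", 1).strip()
    return reasoning or None, content.strip()
```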

## Deploying

1. Generate template inside the container:
   ```bash
   docker cp model-files/gen_template.py smol-vllm-1:/tmp/
   docker exec smol-vllm-1 python3 /tmp/gen_template.py
   ```

2. Copy to mounted volume and restart:
   ```bash
   docker cp smol-vllm-1:/root/chat_template.jinja /root/smol/chat_template.jinja
   cd /root/smol && docker compose restart
   ```

3. Required vLLM flags:
   ```
   --chat-template=/root/chat_template.jinja
   --enable-auto-tool-choice
   --tool-call-parser=hermes
   --reasoning-parser=deepseek_r1
   --chat-template-content-format=string
   ```
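
Once the server is up with these flags, a quick smoke test against the OpenAI-compatible endpoint might look like the sketch below. The base URL, the served model name, and the `get_weather` tool are all assumptions for illustration. Adjust them to the actual deployment.

```python
import json
import urllib.request

# Assumed values: adjust to the actual deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "HuggingFaceTB/SmolLM3-3B"


def build_tool_request(prompt: str) -> dict:
    """Build a chat completion payload that offers one illustrative tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for the smoke test
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }


def send(payload: dict) -> dict:
    """POST the payload to the chat completions endpoint and return the JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `send(build_tool_request("What's the weather in Paris?"))` and inspecting `choices[0].message.tool_calls` in the reply is one way to confirm that the parser flags took effect.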

## Test Results

- ✅ Tool response tests: All PASS (streaming + non-streaming)
- ✅ Streaming tool calls: Incremental, 325+ chunks
- ✅ Reasoning parser: Correctly splits thinking/content
- ✅ Multi-turn tool use: Model reads results, answers properly
- ⚠️ The 3B model doesn't reliably choose tools over free text for complex tasks (it writes code as content instead of calling `write_file`). This is a model capability gap, not a parsing issue. A LoRA fine-tune is planned to address it.

## Next Steps

- **LoRA training** to make tool calling more reliable (especially forced tool use scenarios)
- Candidate dataset: `interstellarninja/tool-calls-multiturn`
- Also worth considering: `NousResearch/Hermes-Function-Calling-V1`, `Salesforce/xLAM-function-calling-60k`

init commit 2026-04-10 13:55:43 +00:00