From 0745cca33965ed6fedcda0e39f923fc76a1aa4cc Mon Sep 17 00:00:00 2001
From: Jinx
Date: Sun, 12 Apr 2026 22:36:55 +0000
Subject: [PATCH] fix git ignore

---
 .gitignore |   2 +-
 NOTES.md   | 103 -----------------------------------------------------
 2 files changed, 1 insertion(+), 104 deletions(-)
 delete mode 100644 NOTES.md

diff --git a/.gitignore b/.gitignore
index bca6a3b..44aad24 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,3 @@
 .env
-models.env
+/models.env
 __pycache__/
diff --git a/NOTES.md b/NOTES.md
deleted file mode 100644
index 8242035..0000000
--- a/NOTES.md
+++ /dev/null
@@ -1,103 +0,0 @@
-# SmolLM3-3B Tool Call Fix — Notes
-
-## Problem
-
-The SmolLM3-3B model's chat template has three bugs that break multi-turn tool calling in vLLM.
-
-## Bugs Found
-
-### Bug 1: Tool responses rendered as plain user messages
-**Location:** `chat_template.jinja`, main loop, `message.role == "tool"` branch
-
-**Original:**
-```jinja2
-{%- elif message.role == "tool" -%}
-{{ "<|im_start|>" + "user\n" + content + "<|im_end|>\n" }}
-```
-
-Tool responses show up as `<|im_start|>user\n...<|im_end|>` — the model cannot distinguish a tool result from a new user turn. When it sees weather data in a user message, it re-invokes the tool instead of answering.
-
-**Fix:** Use the model's dedicated `tool_response_start`/`tool_response_end` tokens (128013/128014) to wrap tool responses so the model can distinguish them from user messages.
-
-### Bug 2: Assistant tool_calls not rendered in history
-**Location:** `chat_template.jinja`, main loop, `message.role == "assistant"` branch
-
-When the assistant message has `tool_calls`, the template only renders `content` (often empty/None) and drops the entire `tool_calls` array. The model never sees its own prior tool invocations.
-
-**Fix:** Render tool calls using the model's native `tool_call_start`/`tool_call_end` tokens (128015/128016) with proper JSON format.
-
-### Bug 3: Thinking mode inverted
-**Location:** `chat_template.jinja`, main loop and generation prompt
-
-When `reasoning_mode == "/think"`, the template does NOT wrap content in think tags. When `reasoning_mode == "/no_think"`, it DOES wrap in `...` tags. Completely backwards.
-
-**Fix:** `/think` mode wraps content in `...` tags. `/no_think` renders plain text.
-
-## Special Tokens
-
-The model has these tool-related tokens in its tokenizer (added_tokens_decoder):
-
-| Token ID | Text | Purpose |
-|----------|------|---------|
-| 128002 | `...` | Think end |
-| 128013 | `...` | Tool call start |
-| 128016 | `...` | Tool call end |
-
-## How the Fix Works
-
-### Template Changes
-
-1. **Tool responses** now render as:
-   ```
-   <|im_start|>user
-   [tool_response_start]
-   {tool result content}
-   [tool_response_end]<|im_end|>
-   ```
-   Instead of a bare user message.
-
-2. **Assistant tool calls** now render as:
-   ```
-   <|im_start|>assistant
-   {"name": "func_name", "arguments": {...}}
-   [tool_call_end]<|im_end|>
-   ```
-   Instead of being dropped entirely.
-
-3. **Thinking mode** is now correctly mapped: `/think` → think tags, `/no_think` → plain text.
-
-### Key Technical Details
-
-- The template uses Jinja2's `~` operator instead of `+` for string concatenation. This avoids type errors when `message.content` is `None` (Jinja2's `~` coerces to string, `+` does not).
-- The `tool_call_start`/`tool_call_end` tokens are Unicode private-use-area characters that can't be typed in a text editor. The template must be generated programmatically using `gen_template.py`.
-- The `tc.function.name` and `tc.function.arguments` Jinja2 dot notation works correctly because Jinja2 resolves `dict.key` as `dict["key"]`.
-- The `{% generation %}` tag is vLLM-specific and marks the assistant output region. It must be preserved.
-
-## Files
-
-- `model-files/chat_template.jinja` — The fixed template (generated, contains Unicode PUA characters)
-- `model-files/gen_template.py` — Script to regenerate the template inside the container where the tokenizer is available
-- `model-files/hermes_tool_parser.py` — vLLM Hermes tool parser (unchanged, works as-is for parsing `...` format)
-
-## Deploying
-
-1. Run `gen_template.py` inside the vLLM container:
-   ```bash
-   docker cp model-files/gen_template.py smol-vllm-1:/tmp/
-   docker exec smol-vllm-1 python3 /tmp/gen_template.py
-   ```
-
-2. Copy the generated template to the mounted volume:
-   ```bash
-   docker cp smol-vllm-1:/root/chat_template.jinja /root/smol/chat_template.jinja
-   ```
-
-3. Restart the container:
-   ```bash
-   cd /root/smol && docker compose restart
-   ```
-
-## Remaining Issues
-
-- The model sometimes re-invokes tools in a loop instead of providing a final text answer. This is likely a training issue with the `/no_think` mode — the model outputs reasoning as content text but still generates tool calls.
-- The Hermes tool parser works for parsing `...` blocks but the streaming parser may buffer long argument strings. This is a vLLM-level issue, not a template issue.