fix git ignore
.gitignore
@@ -1,3 +1,3 @@
 .env
-models.env
+/models.env
 __pycache__/

NOTES.md
@@ -1,103 +0,0 @@
# SmolLM3-3B Tool Call Fix — Notes

## Problem

The SmolLM3-3B model's chat template has three bugs that break multi-turn tool calling in vLLM.

## Bugs Found

### Bug 1: Tool responses rendered as plain user messages

**Location:** `chat_template.jinja`, main loop, `message.role == "tool"` branch

**Original:**

```jinja2
{%- elif message.role == "tool" -%}
{{ "<|im_start|>" + "user\n" + content + "<|im_end|>\n" }}
```

Tool responses show up as `<|im_start|>user\n...<|im_end|>`, so the model cannot distinguish a tool result from a new user turn. When it sees weather data in a user message, it re-invokes the tool instead of answering.

**Fix:** Wrap tool responses in the model's dedicated `tool_response_start`/`tool_response_end` tokens (128013/128014) so the model can tell them apart from user messages.
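
A reference rendering in plain Python is handy for checking what the fixed branch should produce. This is a minimal sketch, assuming the layout shown under "How the Fix Works" (response markers on their own lines inside a `user` turn). `render_tool_message` is a hypothetical helper, and the marker strings are passed in because the real token texts are not reproduced in these notes.

```python
def render_tool_message(content, response_start, response_end):
    """Render a tool result the way the fixed template is intended to.

    Hypothetical reference helper: response_start/response_end stand in for
    the tokenizer's tool-response token texts (IDs 128013/128014 per these notes).
    """
    body = "" if content is None else str(content)  # never emit the text "None"
    return (
        "<|im_start|>user\n"
        + response_start + "\n"
        + body + "\n"
        + response_end + "<|im_end|>\n"
    )


if __name__ == "__main__":
    print(render_tool_message('{"temp_c": 21}', "[tool_response_start]", "[tool_response_end]"))
```

Diffing this string against the template's output for the same message is a quick regression check.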

### Bug 2: Assistant tool_calls not rendered in history

**Location:** `chat_template.jinja`, main loop, `message.role == "assistant"` branch

When an assistant message has `tool_calls`, the template renders only `content` (often empty or `None`) and drops the entire `tool_calls` array. The model never sees its own prior tool invocations.

**Fix:** Render each tool call as JSON wrapped in the model's native `tool_call_start`/`tool_call_end` tokens (128015/128016).
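
A matching sketch for the assistant branch serializes each entry in `tool_calls` as `{"name": ..., "arguments": ...}` JSON between the call markers. Assumptions: OpenAI-style message dicts, `arguments` arriving either as a JSON string or an already-parsed dict, one call per marker pair, and multiple calls separated by newlines. `render_assistant_tool_calls` is hypothetical, not part of the template or vLLM.

```python
import json


def render_assistant_tool_calls(message, call_start, call_end):
    """Render an assistant turn that carries tool_calls (hypothetical helper).

    call_start/call_end stand in for the tool-call token texts
    (IDs 128015/128016 per these notes).
    """
    content = message.get("content") or ""
    prefix = content + "\n" if content else ""
    calls = []
    for tc in message.get("tool_calls") or []:
        fn = tc["function"]
        args = fn["arguments"]
        if isinstance(args, str):
            # Assumption: string arguments are JSON; parse so they render as an object
            args = json.loads(args)
        payload = json.dumps({"name": fn["name"], "arguments": args})
        calls.append(call_start + "\n" + payload + "\n" + call_end)
    return "<|im_start|>assistant\n" + prefix + "\n".join(calls) + "<|im_end|>\n"
```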

### Bug 3: Thinking mode inverted

**Location:** `chat_template.jinja`, main loop and generation prompt

When `reasoning_mode == "/think"`, the template does NOT wrap content in think tags; when `reasoning_mode == "/no_think"`, it DOES wrap content in `...` tags. The mapping is exactly backwards.

**Fix:** `/think` mode wraps content in `...` tags; `/no_think` renders plain text.
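
The corrected mapping is small enough to state as a reference function. This is a sketch, not the template itself: `think_open`/`think_close` stand in for the think-tag token texts (elided in these notes), whether whitespace surrounds the markers is an assumption, and rejecting unknown modes is a design choice so that another inversion fails loudly instead of silently.

```python
def render_assistant_content(content, reasoning_mode, think_open, think_close):
    """Wrap content in think markers only in /think mode (hypothetical helper)."""
    body = content or ""
    if reasoning_mode == "/think":
        return think_open + body + think_close
    if reasoning_mode == "/no_think":
        return body
    raise ValueError(f"unknown reasoning_mode: {reasoning_mode!r}")
```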

## Special Tokens

The model has these reasoning- and tool-related tokens in its tokenizer (`added_tokens_decoder`):

| Token ID | Text | Purpose |
|----------|------|---------|
| 128002 | `...` | Think end |
| 128013 | `...` | Tool response start |
| 128014 | `...` | Tool response end |
| 128015 | `...` | Tool call start |
| 128016 | `...` | Tool call end |
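
These IDs are worth confirming against the model's own files, since token text is easy to lose when copying. A minimal stdlib sketch reads the `added_tokens_decoder` map from a Hugging Face `tokenizer_config.json`, where entries are keyed by the ID as a string. The file location depends on the model snapshot, and `find_token_ids` / `print_token_ids` are hypothetical helpers.

```python
import json


def find_token_ids(tokenizer_config, wanted_ids):
    """Map each wanted token ID to its text, skipping IDs that are absent."""
    decoder = tokenizer_config.get("added_tokens_decoder", {})
    return {
        tid: decoder[str(tid)]["content"]
        for tid in wanted_ids
        if str(tid) in decoder
    }


def print_token_ids(path, wanted_ids):
    """Load tokenizer_config.json from `path` and print ID/text pairs."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    for tid, text in find_token_ids(cfg, wanted_ids).items():
        # repr() escapes non-printable characters such as private-use-area code points
        print(tid, repr(text))
```

For example: `print_token_ids("tokenizer_config.json", [128002, 128013, 128014, 128015, 128016])`.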

## How the Fix Works

### Template Changes

1. **Tool responses** now render as:

   ```
   <|im_start|>user
   [tool_response_start]
   {tool result content}
   [tool_response_end]<|im_end|>
   ```

   This replaces the bare user message.

2. **Assistant tool calls** now render as:

   ```
   <|im_start|>assistant
   [tool_call_start]
   {"name": "func_name", "arguments": {...}}
   [tool_call_end]<|im_end|>
   ```

   Previously they were dropped entirely.

3. **Thinking mode** is now correctly mapped: `/think` → think tags, `/no_think` → plain text.

### Key Technical Details

- The template uses Jinja2's `~` operator instead of `+` for string concatenation. `~` converts both operands to strings, while `+` raises a type error when `message.content` is `None`. Note that `~` renders `None` as the literal text `None`, so content that may be missing should still be defaulted (for example `message.content or ""`).
- The `tool_call_start`/`tool_call_end` tokens are Unicode private-use-area characters that are impractical to type in a text editor, so the template must be generated programmatically with `gen_template.py`.
- The dot notation `tc.function.name` and `tc.function.arguments` works on plain dicts because Jinja2 resolves `obj.key` by trying attribute lookup first and then falling back to `obj["key"]`.
- The `{% generation %}` tag marks the assistant output region. It is a Hugging Face transformers chat-template extension (used to compute assistant-token masks), not standard Jinja2, and it must be preserved.
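
Generating the template programmatically can be as simple as substituting the real token texts into a placeholder skeleton and writing the result as UTF-8. This is a hypothetical sketch, not the actual `gen_template.py`: the `[tool_call_start]`-style placeholder names, `fill_placeholders`, and `write_template` are assumptions.

```python
import re

# Placeholder names assumed for this sketch; the real gen_template.py may differ.
_PLACEHOLDER = re.compile(
    r"\[(tool_call_start|tool_call_end|tool_response_start|tool_response_end)\]"
)


def fill_placeholders(skeleton, markers):
    """Swap [name] placeholders for real token text.

    A placeholder with no entry in `markers` raises KeyError, so a missing
    token cannot silently survive into the generated template.
    """
    return _PLACEHOLDER.sub(lambda m: markers[m.group(1)], skeleton)


def write_template(path, skeleton, markers):
    """Render the skeleton and write it out; PUA characters need UTF-8."""
    text = fill_placeholders(skeleton, markers)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

Inside the container, `markers` would be filled from the tokenizer (for instance the `content` of the matching `added_tokens_decoder` entries), and `write_template("/root/chat_template.jinja", skeleton, markers)` would produce the file that the deploy steps copy out.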

## Files

- `model-files/chat_template.jinja`: the fixed template (generated; contains Unicode PUA characters)
- `model-files/gen_template.py`: script that regenerates the template inside the container, where the tokenizer is available
- `model-files/hermes_tool_parser.py`: vLLM Hermes tool parser (unchanged; works as-is for parsing the `...` format)

## Deploying

1. Run `gen_template.py` inside the vLLM container:

   ```bash
   docker cp model-files/gen_template.py smol-vllm-1:/tmp/
   docker exec smol-vllm-1 python3 /tmp/gen_template.py
   ```

2. Copy the generated template to the mounted volume:

   ```bash
   docker cp smol-vllm-1:/root/chat_template.jinja /root/smol/chat_template.jinja
   ```

3. Restart the container:

   ```bash
   cd /root/smol && docker compose restart
   ```
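
After the restart, a request that replays a prior tool call and its result is a quick end-to-end check. A minimal stdlib sketch, assuming vLLM's OpenAI-compatible server is reachable at `http://localhost:8000/v1` and serves the model as `HuggingFaceTB/SmolLM3-3B` (both assumptions; adjust to the compose setup). The `get_weather` tool is hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM port
MODEL = "HuggingFaceTB/SmolLM3-3B"     # assumption: served model name

WEATHER_TOOL = {  # hypothetical tool used only for this check
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_payload():
    """Conversation where the tool already ran; the model should now answer in text."""
    return {
        "model": MODEL,
        "tools": [WEATHER_TOOL],
        "messages": [
            {"role": "user", "content": "What's the weather in Paris?"},
            {
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": "call_1",
                    "type": "function",
                    "function": {"name": "get_weather",
                                 "arguments": json.dumps({"city": "Paris"})},
                }],
            },
            {"role": "tool", "tool_call_id": "call_1",
             "content": json.dumps({"temp_c": 18, "sky": "cloudy"})},
        ],
    }


def send(payload):
    """POST the payload to /chat/completions and return the decoded JSON reply."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

If the fix took effect, `send(build_payload())["choices"][0]["message"]` is expected to carry text content rather than another `get_weather` call; a repeated call suggests the template was not picked up, or the looping behavior noted below.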

## Remaining Issues

- The model sometimes re-invokes tools in a loop instead of giving a final text answer. This is likely a training issue with `/no_think` mode: the model writes its reasoning as content text but still emits tool calls.
- The Hermes tool parser correctly parses `...` blocks, but its streaming parser may buffer long argument strings. This is a vLLM-level issue, not a template issue.