From af497eb16cfbf576f4d996e665fc641a29c181f1 Mon Sep 17 00:00:00 2001
From: Jinx
Date: Fri, 10 Apr 2026 17:07:28 +0000
Subject: [PATCH] Add training plan: teach model to emit native tool-call tokens

---
 TRAINING_PLAN.md | 178 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 178 insertions(+)
 create mode 100644 TRAINING_PLAN.md

diff --git a/TRAINING_PLAN.md b/TRAINING_PLAN.md
new file mode 100644
index 0000000..0bba121
--- /dev/null
+++ b/TRAINING_PLAN.md
@@ -0,0 +1,178 @@
+# LoRA Training Plan — Teaching SmolLM3-3B to Emit Tool-Call Tokens
+
+## The Problem
+
+SmolLM3-3B does not emit native tool-call tokens. When asked to use tools, it writes
+Python code that *calls* the tool as a function instead of emitting the structured
+token sequences (IDs 128015/128016) that vLLM's Hermes parser expects.
+
+This was verified with a raw token inspector that bypasses all middleware:
+
+| Token ID | Decoded | Purpose |
+|----------|---------|---------|
+| 128015 | ` startPos` | Tool call start delimiter |
+| 128016 | ` endPos` | Tool call end delimiter |
+| 128013 | ` eni` | Tool response start delimiter |
+| 128014 | ` eni_result` | Tool response end delimiter |
+
+The model knows these tokens exist (they're in its vocabulary and the chat template
+references them), but it was never trained to *produce* them. It falls back to the
+only behavior it knows: writing code.
+
+## Training Strategy
+
+### Core Principle: Train on Raw Tokens, Not Parsed Output
+
+The training data must contain the **actual token IDs** 128015/128016 in the assistant
+response. We cannot use any parser or template that "corrects" the output — that would
+mask the problem instead of fixing it.
+
+The `apply_chat_template()` call in `train_lora.py` handles this correctly: when the
+messages contain an assistant turn with a tool call, the template renders it using the
+special delimiters. The loss mask ensures that only the assistant's tokens contribute
+to the loss.
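A minimal self-contained sketch of that masking, using dummy token IDs and a hypothetical helper (not the actual `train_lora.py` code):

```python
# Illustrative sketch of assistant-only loss masking. Token IDs and the
# build_labels helper are dummies, not the real training pipeline.
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(prompt_ids, assistant_ids):
    """Concatenate turns; supervise only the assistant's tokens."""
    input_ids = prompt_ids + assistant_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(assistant_ids)
    return input_ids, labels

# The tool-call delimiters sit inside the assistant turn, so they stay
# unmasked and the model is directly trained to produce them.
prompt = [1, 2, 3]                    # system + user tokens (dummy)
assistant = [128015, 42, 43, 128016]  # start delimiter, JSON tokens, end delimiter
input_ids, labels = build_labels(prompt, assistant)
print(labels)  # [-100, -100, -100, 128015, 42, 43, 128016]
```

The point of the toy example: 128015 and 128016 land in the supervised region, so gradient signal flows to exactly the tokens the base model never learned to emit.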
+So the model learns: "when tools are available and the user asks me to use one, I emit
+the start token, then JSON, then the end token."
+
+### What the Data Must Look Like
+
+Every training sample must have the tool-call delimiters wrapping JSON in the
+assistant turn. The key is that after tokenization, token IDs 128015 and 128016
+appear in the `input_ids` array within the assistant's labeled region.
+
+**What we need** (tokenized `input_ids` contains `[128015, ...json..., 128016]`):
+- Assistant turn with: start delimiter + JSON tool call + end delimiter
+
+**What the base model currently produces** (prose/code tokens, no special tokens):
+- Assistant turn with: prose explaining the tool + Python code calling it
+
+The training data must have the first pattern. The loss function only trains on
+assistant turns, so the model will learn to emit the special tokens.
+
+### Dataset Choice
+
+**Primary: NousResearch/Hermes-Function-Calling-V1**
+
+Why this one:
+1. **Already has tool_call tags in the text** — the assistant responses contain
+   the standard Hermes tool call format
+2. **Large and diverse** — 20k+ tool-calling conversations covering many function types
+3. **Multi-turn** — includes tool responses and follow-up turns, so the model also
+   learns to read the response delimiters and respond to tool results
+4. **Clean format** — ShareGPT, easy to convert
+
+The conversion in `prepare_data.py` already does the right thing: it replaces
+the Hermes tags with SmolLM3's native token strings. After conversion, the
+training data has the actual delimiter strings in the text. When `apply_chat_template()`
+tokenizes this, tokens 128015 and 128016 end up in the `input_ids`, and since they're
+in the assistant turn, the loss mask includes them.
+
+**Drop: interstellarninja/tool-calls-multiturn** for now — it's noisier and has more
+formatting inconsistency. We can add it later if we need more volume, but starting
+clean is better for this focused training run.
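The tag rewrite that `prepare_data.py` performs can be sketched roughly as follows. The Hermes `<tool_call>` tags are the standard ones; the SmolLM3 delimiter strings below are placeholders, since the real ones are whatever tokens 128015/128016 decode to:

```python
# Rough sketch of the Hermes -> SmolLM3 tag rewrite. SMOL_OPEN / SMOL_CLOSE
# are placeholder strings, NOT the real SmolLM3 delimiters.
HERMES_OPEN, HERMES_CLOSE = "<tool_call>", "</tool_call>"
SMOL_OPEN, SMOL_CLOSE = "<SMOL_TOOL_START>", "<SMOL_TOOL_END>"

def convert_turn(text: str) -> str:
    """Replace Hermes tool-call tags with the native delimiter strings."""
    return text.replace(HERMES_OPEN, SMOL_OPEN).replace(HERMES_CLOSE, SMOL_CLOSE)

hermes = '<tool_call>{"name": "write_file", "arguments": {"path": "a.txt"}}</tool_call>'
print(convert_turn(hermes))  # placeholder delimiters wrap the same JSON payload
```

After this rewrite, the tokenizer maps the delimiter strings to single special-token IDs, which is what puts 128015/128016 into `input_ids`.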
+
+**Skip: Salesforce/xLAM-function-calling-60k** — too large, lots of low-quality
+samples. We want quality over quantity for a LoRA.
+
+### Data Mix: Don't Forget Non-Tool Turns
+
+If we train on *only* tool-calling samples, the model may learn to always emit
+the start delimiter even when it shouldn't. We need a mix:
+
+- **70% tool-calling samples** (Hermes V1, filtered to only tool-call conversations)
+- **30% general instruction-following samples** (from SmolLM3's original training data
+  or a clean instruct dataset like `HuggingFaceTB/smoltalk`)
+
+This teaches the model: "emit the tool-call tokens when tools are available AND the
+user's request needs a tool call; respond with normal text otherwise."
+
+### Critical Fix: Verify Token IDs in Training Data
+
+Before training, add a verification step that confirms the processed data actually
+contains token IDs 128015 and 128016. If `apply_chat_template()` doesn't render
+them correctly, the model won't learn them.
+
+```python
+# Verification: tokenize a sample and check for tool-call token IDs
+sample = train_data[0]
+text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
+ids = tokenizer.encode(text, add_special_tokens=False)  # template already added them
+assert 128015 in ids, "Tool call start token (128015) not found in training data!"
+assert 128016 in ids, "Tool call end token (128016) not found in training data!"
+```
+
+If this fails, the `prepare_data.py` conversion isn't working or the tokenizer isn't
+recognizing the special tokens. Fix before training.
+
+### Training Parameters
+
+The current defaults are reasonable. Key considerations:
+
+| Param | Value | Rationale |
+|-------|-------|-----------|
+| lora_r | 16 | Standard for a 3B model. Could go to 32 if loss plateaus. |
+| lora_alpha | 32 | 2x rank, standard. |
+| lr | 2e-4 | Good for LoRA. If loss spikes, drop to 1e-4. |
+| epochs | 3 | Start here. Check val loss — if still dropping at epoch 3, go to 5. |
+| max_length | 4096 | Enough for tool calls + code content. |
+| target_modules | q/k/v/o + gate/up/down + embed_tokens | Full coverage. embed_tokens critical — see below. |
+
+### The embed_tokens Question
+
+Token IDs 128015/128016 are in the vocabulary but the model has never used them in
+context. Two options:
+
+1. **Include `embed_tokens` in LoRA target modules** — lets the adapter adjust the
+   embeddings for these tokens so the model can route to them during generation.
+   **Recommended for this run.** Add `"embed_tokens"` to `target_modules`.
+
+2. **Don't include it** — the base embeddings stay frozen; the model can still produce
+   these tokens (they're in its vocab) but may not have a strong enough signal to route
+   to them. Risky.
+
+Add to `target_modules` in `train_lora.py`:
+```python
+target_modules=[
+    "q_proj", "k_proj", "v_proj", "o_proj",
+    "gate_proj", "up_proj", "down_proj",
+    "embed_tokens",  # Critical: lets LoRA adjust tool-call token embeddings
+],
+```
+
+Note: PEFT handles `embed_tokens` with LoRA correctly — it applies the low-rank
+adaptation to the embedding matrix without issues.
+
+### Validation: Post-Training Token Emission Test
+
+After training, before deploying, run the trained model through the
+`chat-template-debugger` (stage 1) to verify the model now emits 128015/128016:
+
+1. Merge the LoRA adapter into the base model
+2. Copy the merged model to the GPU server's `chat-template-debugger/models/`
+3. Run `stage1_debug.py` with the write_file and save_config prompts
+4. **Pass criteria:** token ID 128015 appears in the output, followed by
+   valid JSON, followed by 128016. No Python code-dumping.
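The pass criteria can be checked mechanically. A minimal sketch, where `decode` stands in for `tokenizer.decode` and the stage-1 script's actual checks may differ:

```python
# Sketch of an automated pass check on raw generated token IDs: the start
# delimiter must appear, then parseable JSON, then the end delimiter.
import json

TOOL_START, TOOL_END = 128015, 128016

def emits_tool_call(generated_ids, decode):
    """decode maps a list of token IDs to text (stand-in for tokenizer.decode)."""
    if TOOL_START not in generated_ids or TOOL_END not in generated_ids:
        return False
    start = generated_ids.index(TOOL_START)
    end = generated_ids.index(TOOL_END)
    if start >= end:
        return False  # delimiters out of order
    try:
        json.loads(decode(generated_ids[start + 1:end]))
        return True
    except ValueError:
        return False  # delimiters present but the payload isn't valid JSON

# Toy demo with a fake three-entry vocabulary instead of a real tokenizer
vocab = {0: '{"name":', 1: ' "write_file",', 2: ' "arguments": {}}'}
fake_decode = lambda ids: "".join(vocab[i] for i in ids)
print(emits_tool_call([128015, 0, 1, 2, 128016], fake_decode))  # True
print(emits_tool_call([0, 1, 2], fake_decode))                  # False
```

Operating on raw IDs rather than parsed output keeps the check honest: no middleware can "repair" a code-dumping model into a passing one.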
+ +### Failure Modes & Mitigations + +| Failure | Symptom | Fix | +|---------|---------|-----| +| Model still code-dumps | No 128015/128016 in output | Check verification step; increase epochs; add embed_tokens to targets | +| Model emits tokens but broken JSON | 128015 present but invalid JSON follows | Add more diverse tool-call samples; increase max_length | +| Model over-fits to tool calls | Emits start delimiter for non-tool queries | Add 30% non-tool instruction data | +| Loss doesn't decrease | Val loss flat or increasing | Check label masking (not all -100); verify data quality; lower LR | +| LoRA can't adjust embeddings | embed_tokens not in target | PEFT supports this — make sure the module name matches exactly | + +## Summary + +The training data is the key. The existing `prepare_data.py` already converts +Hermes-format tool calls to SmolLM3's native tokens. The chat template renders +them as the actual special token IDs. The loss mask trains only on assistant turns. +So the model will learn: "when I see tools available and the user wants me to use one, +I emit token 128015, then the JSON function call, then token 128016." + +The two critical changes from the current setup: +1. **Add `embed_tokens` to LoRA target modules** so the adapter can shape the + embeddings for the tool-call tokens +2. **Add 30% non-tool instruction data** to prevent overfitting to tool calls + +After training, validate with the raw token debugger before deploying to vLLM.
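The 70/30 mix called for above can be sketched as a simple weighted combination. This is a hypothetical helper operating on in-memory sample lists; the real pipeline would work with dataset objects:

```python
# Sketch of building the 70/30 tool / non-tool mix from two sample lists.
import random

def build_mix(tool_samples, general_samples, tool_frac=0.7, seed=0):
    """Pad the tool-calling data with enough general data to hit tool_frac."""
    rng = random.Random(seed)
    n_general = round(len(tool_samples) * (1 - tool_frac) / tool_frac)
    mixed = list(tool_samples) + rng.sample(general_samples,
                                            min(n_general, len(general_samples)))
    rng.shuffle(mixed)  # interleave so batches see both kinds of sample
    return mixed

tools = [{"kind": "tool"}] * 70
general = [{"kind": "general"}] * 200
mix = build_mix(tools, general)
print(len(mix), sum(s["kind"] == "tool" for s in mix) / len(mix))  # 100 0.7
```

Fixing the seed keeps the mix reproducible across runs, which matters when comparing adapters trained on supposedly identical data.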