Update runbook for tool-call token training run
# SmolLM3-3B LoRA Training — Deployment Runbook

## Objective

Train a LoRA adapter that teaches SmolLM3-3B to emit native tool-call tokens
(IDs 128015/128016) instead of code-dumping. See `TRAINING_PLAN.md` for the
full strategy.

## Prerequisites

- [ ] GPU server accessible via SSH (`root@107.191.43.158`)
- [ ] Docker + NVIDIA Container Toolkit installed
- [ ] This repo cloned at `/root/smollora` on the GPU server

## Step 1: Sync the Code

From the OpenClaw workspace, push any changes:

```bash
cd /home/openclaw/dev/smollora
git add -A && git commit -m "updates" && git push
```

On the GPU server, pull the latest:

```bash
ssh root@107.191.43.158
cd /root/smollora && git pull
```

> **Rule:** Always mutate code on the OpenClaw side, push, then pull on the GPU server.
> Never edit files directly on the server — changes won't propagate back.

## Step 2: Verify the Data Prep Will Produce Correct Tokens

Before training, confirm the processed data will contain token IDs 128015/128016.
After data prep runs (or as a dry run), check:

```python
from transformers import AutoTokenizer
import json

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B", trust_remote_code=True)

with open("/data/processed/train.jsonl") as f:
    sample = json.loads(f.readline())

text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
ids = tokenizer.encode(text)

assert 128015 in ids, "Tool call start token (128015) missing — data prep is broken!"
assert 128016 in ids, "Tool call end token (128016) missing — data prep is broken!"
print(f"✓ Token IDs verified. Sample has {len(ids)} tokens, tool-call tokens present.")
```

If this fails, do NOT proceed to training. Fix `prepare_data.py` or the tokenizer first.
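
Checking only the first line can miss partially broken prep. A slightly stronger variant, a sketch under the same assumptions (container path `/data/processed/train.jsonl`, `messages`-style records), scans the whole file and reports how many samples carry both tool-call tokens:

```python
from transformers import AutoTokenizer
import json

TOOL_START, TOOL_END = 128015, 128016

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B", trust_remote_code=True)

total = with_tokens = 0
with open("/data/processed/train.jsonl") as f:
    for line in f:
        total += 1
        sample = json.loads(line)
        text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
        ids = set(tokenizer.encode(text))
        if TOOL_START in ids and TOOL_END in ids:
            with_tokens += 1

# Not every sample has to contain a tool call, but a near-zero ratio
# means prepare_data.py is broken.
print(f"{with_tokens}/{total} samples contain both tool-call tokens")
```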

## Step 3: Build & Run

```bash
cd /root/smollora
docker compose build
docker compose up -d
```

Confirm the container is up:

```bash
docker compose ps
docker compose logs --tail=20
```

## Step 4: Exec In & Run Training

```bash
docker compose exec smollora bash
```

Inside the container:

```bash
# Full pipeline (data prep + train)
/app/run.sh
```

Watch for:
- ✅ Datasets downloading successfully
- ✅ Samples counted (should be thousands)
- ✅ Model loading without OOM
- ✅ First few training steps completing (check loss is decreasing)
- ✅ No CUDA OOM errors in first 50 steps

If data prep already ran and you just want to re-train:

```bash
SKIP_PREP=1 /app/run.sh
```

### Key Training Parameters for This Run

These should be set in `run.sh` or passed as env vars:

| Param | Value | Notes |
|-------|-------|-------|
| `MODEL` | `HuggingFaceTB/SmolLM3-3B` | Base model |
| `EPOCHS` | `3` | Increase to 5 if val loss still dropping |
| `LR` | `2e-4` | Drop to 1e-4 if loss spikes |
| `LORA_R` | `16` | Bump to 32 if loss plateaus |
| `BATCH_SIZE` | `4` | Reduce to 2 if OOM |
| `MAX_LENGTH` | `4096` | Enough for tool calls + code |

**Critical:** `embed_tokens` MUST be in the LoRA target modules. Verify in
`train_lora.py` that `target_modules` includes `"embed_tokens"`. Without it,
the adapter can't adjust the tool-call token embeddings and the model won't
learn to emit them.
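
For reference, a minimal sketch of what that config could look like in `train_lora.py`. The attention/MLP module names are assumptions based on the usual Llama-style layout, not confirmed from the actual script; only the `embed_tokens` entry is the hard requirement from this runbook:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # LORA_R from the table above
    lora_alpha=32,           # assumed; commonly set to 2*r
    lora_dropout=0.05,       # assumed
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections (assumed names)
        "gate_proj", "up_proj", "down_proj",     # MLP projections (assumed names)
        "embed_tokens",                          # REQUIRED so the adapter can move tool-call embeddings
    ],
)
```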

## Step 5: Monitor

From the host:

```bash
# Follow logs
docker compose logs -f

# GPU utilization
watch -n5 nvidia-smi
```

Watch for:
- ✅ Training loss decreasing steadily
- ✅ Val loss decreasing (not diverging from train loss)
- ✅ No CUDA OOM
- ❌ Val loss increasing while train loss decreases = overfitting → reduce epochs or add more data

Expected timeline on a single A100:
- Data prep: ~10-20 min
- Training (3 epochs, ~15k samples): ~1-3 hours

## Step 6: Validate — Raw Token Emission Test

```bash
# Check the LoRA adapter was saved
ls -la /srv/smollora/output/final/
# Should see: adapter_config.json, adapter_model.safetensors, tokenizer files
```

**Do not deploy until this passes.**

1. Merge the LoRA adapter into the base model:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "/data/lora-output/final")
merged = model.merge_and_unload()
merged.save_pretrained("/data/merged-model")
tokenizer.save_pretrained("/data/merged-model")
```
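
Before copying the merged model on, a quick sanity check (a sketch reusing the output path above) that the tool-call token IDs still resolve in the saved tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/data/merged-model", trust_remote_code=True)
# Expect two real token strings, not None:
print(tok.convert_ids_to_tokens([128015, 128016]))
```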

2. Copy the merged model to the chat-template-debugger:

```bash
cp -r /data/merged-model /root/chat-template-debugger/models/SmolLM3-3B-toolcall
```

3. Run the raw token debugger (stage 1):

```bash
docker exec -e MODEL_PATH=/workspace/models/SmolLM3-3B-toolcall \
  -e PROMPT_FILE=/workspace/prompts/smol_write_file.txt \
  ct-debug-run python3 /workspace/scripts/stage1_debug.py
```

4. **Pass criteria:**
   - Token IDs **128015** and **128016** appear in the output
   - Valid JSON follows token 128015
   - No Python code-dumping

5. Also test with the `smol_save_config.txt` prompt — same criteria.

If the model still code-dumps, the training didn't work. Check:
- Were tokens 128015/128016 in the training data? (Step 2)
- Is `embed_tokens` in the LoRA targets?
- Was there enough data / enough epochs?
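
To poke at the model without the debugger, a minimal raw-generation check is sketched below. It assumes the merged model from Step 6 and a bare user prompt; the real test prompts (with tool definitions) live in `/workspace/prompts/`, so treat this only as a rough signal:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/data/merged-model"  # produced in Step 6
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Write 'hello' to /tmp/out.txt"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128)

# Inspect only the newly generated tokens for the tool-call IDs
new_ids = out[0][inputs.shape[-1]:].tolist()
print("128015 emitted:", 128015 in new_ids)
print("128016 emitted:", 128016 in new_ids)
print(tok.decode(new_ids))
```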

## Step 7: Deploy to vLLM

Once validation passes:

1. Copy the merged model to the vLLM model directory
2. Update the vLLM docker-compose to point at the merged model
3. Restart vLLM
4. Run the streaming tool call tests from `/home/openclaw/dev/model-tool-tests`

## Troubleshooting

| Problem | Fix |
|---------|-----|
| CUDA OOM | Reduce `BATCH_SIZE` to 2, increase `GRAD_ACCUM` to 8, or reduce `MAX_LENGTH` to 2048 |
| Dataset download fails | Check internet; can pre-download and mount into `/data` |
| Docker can't see GPU | `sudo apt-get install -y nvidia-container-toolkit && sudo systemctl restart docker` |
| Training loss not decreasing | Check LR; verify labels aren't all -100 (sketch below); verify token IDs in data (Step 2) |
| Model still code-dumps after training | Verify `embed_tokens` in targets; try more epochs; try `lora_r=32` |
| Model emits tokens but broken JSON | Need more diverse tool-call samples; increase `max_length` |
| Model emits tool tokens for everything | Overfit — add 30% non-tool instruction data to training mix |
| Disk full | Clean up `/srv/smollora/hf-cache` after model loads |
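
For the "labels aren't all -100" row above, a minimal sketch of the check (assumes the standard Hugging Face convention where label -100 marks positions excluded from the loss):

```python
def label_stats(labels):
    """labels: list[int]; -100 marks positions that don't contribute to the loss."""
    trained = sum(1 for l in labels if l != -100)
    print(f"{trained}/{len(labels)} positions contribute to the loss")
    assert trained > 0, "All labels are -100; the model learns nothing from this sample"

# Hypothetical tokenized sample: only the assistant span (incl. tool tokens) is supervised.
label_stats([-100, -100, 128015, 42, 128016, -100])
```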

## Rollback

If everything goes sideways:

```bash
docker compose down
rm -rf /srv/smollora/output/*
```