90% of models break on streaming tool calls. Is it the model generating garbage, or is something in the middleware stack mangling the output? This debugger lets us answer that definitively.

Plan of Attack

1. Build & Run the Container

docker build -t ct-debug .
docker run --gpus all -v "$(pwd)/scripts:/workspace/scripts" -v "$(pwd)/models:/workspace/models" -it ct-debug

2. Stage 0 — Download Weights (if not mounted)

# Inside the container:
python /workspace/scripts/stage0_download.py

This downloads HuggingFaceTB/SmolLM3-3B to /workspace/models/SmolLM3-3B if it doesn't already exist.
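
The script itself is not reproduced here, so the following is only a minimal sketch of the "download if missing" behavior described above. It assumes the download goes through `huggingface_hub.snapshot_download`; the `ensure_model` helper and its `download` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path

MODEL_ID = "HuggingFaceTB/SmolLM3-3B"
MODEL_DIR = Path("/workspace/models/SmolLM3-3B")


def ensure_model(model_id, target_dir, download=None):
    """Download model_id into target_dir unless it already holds files.

    Returns True if a download was triggered, False if it was skipped.
    `download` is a hypothetical hook; it defaults to snapshot_download.
    """
    target_dir = Path(target_dir)
    # A non-empty directory is treated as "already downloaded or mounted".
    if target_dir.is_dir() and any(target_dir.iterdir()):
        return False
    if download is None:
        # Imported lazily so the skip path works without huggingface_hub.
        from huggingface_hub import snapshot_download as download
    target_dir.mkdir(parents=True, exist_ok=True)
    download(repo_id=model_id, local_dir=str(target_dir))
    return True
```

Calling `ensure_model(MODEL_ID, MODEL_DIR)` at the bottom of the script would then be a no-op whenever the weights are already mounted. Note that this check only looks for a non-empty directory, so a partially downloaded model would also be skipped.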

3. Stage 1 — Run the Debugger

Edit scripts/stage1_debug.py to point at the model path and your test prompt. Then:

# Inside the container:
python /workspace/scripts/stage1_debug.py

This runs the model on a raw prompt: vLLM's serving layer applies no chat template, so you control the prompt string directly. It dumps:

The raw generated text
The actual token IDs
A per-token decode so you can see exactly what the model emitted
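
As a rough illustration of how those three views could be produced, here is a sketch that assumes vLLM's offline `LLM.generate` API. The `run_raw` and `per_token_rows` helpers are hypothetical names and are not taken from stage1_debug.py.

```python
def per_token_rows(token_ids, decode):
    """Pair each token ID with the text it decodes to on its own."""
    return [(tid, decode([tid])) for tid in token_ids]


def run_raw(model_path, prompt, max_tokens=256):
    """Generate from a raw prompt string and print the three views."""
    # Imported lazily: vLLM is heavy and only needed for real runs.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(
        temperature=0.0,            # greedy decoding, so runs are comparable
        max_tokens=max_tokens,
        skip_special_tokens=False,  # keep special tokens visible in the text
    )
    # generate() takes the string as-is; no chat template is applied here.
    completion = llm.generate([prompt], params)[0].outputs[0]
    tokenizer = llm.get_tokenizer()

    print("RAW TEXT:", repr(completion.text))
    print("TOKEN IDS:", list(completion.token_ids))
    for tid, piece in per_token_rows(completion.token_ids, tokenizer.decode):
        print(f"{tid:>8}  {piece!r}")
```

Setting `skip_special_tokens=False` matters in this sketch: if tool-call markers are special tokens in the model's vocabulary, the default detokenization would drop them from the raw text even though they appear in the token IDs.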

4. Analyze

If the model emits correct tool-call tokens → parser/template problem
If the model emits garbage or broken tokens → model problem, go fix the LoRA/chat template
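
The decision rule above can be roughed out in code. This sketch assumes the model wraps a JSON payload in `<tool_call>` and `</tool_call>` tags; that format is an assumption, so pass the delimiters your model actually uses. The `classify_output` helper is a hypothetical name.

```python
import json


def classify_output(text, open_tag="<tool_call>", close_tag="</tool_call>"):
    """Guess which side of the stack to suspect from the raw generated text.

    The default tags are an assumed format, not a confirmed one.
    """
    start = text.find(open_tag)
    if start == -1:
        return "model: no tool-call opening tag"
    body_start = start + len(open_tag)
    end = text.find(close_tag, body_start)
    if end == -1:
        # Could also mean generation hit max_tokens before closing.
        return "model: tool call never closed"
    try:
        json.loads(text[body_start:end])
    except json.JSONDecodeError:
        return "model: payload is not valid JSON"
    return "parser/template: raw output looks well-formed"
```

A result starting with `parser/template` means the raw output looked fine at this level, so the next place to look is the middleware. A result starting with `model` points back at the weights or the prompt.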

Directory Layout

chat-template-debugger/
├── Dockerfile
├── README.md
├── models/              # Downloaded weights (gitignored)
├── scripts/
│   ├── stage0_download.py
│   └── stage1_debug.py
└── prompts/
    └── smol_tool_call.txt

Swapping Models

Change MODEL_ID in stage0_download.py and MODEL_PATH in stage1_debug.py. Works with any Hugging Face model whose architecture vLLM supports.

Swapping Prompts

Drop a .txt file in prompts/ and update the path in stage1_debug.py. The prompt is passed as a raw string with no chat template applied by vLLM, so you control the full context.
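
Because the prompt is used verbatim, stray whitespace in the file becomes part of the context. The sketch below shows one way to read a prompt file without altering it; `load_prompt` is a hypothetical helper and not necessarily how stage1_debug.py reads the file.

```python
from pathlib import Path


def load_prompt(path):
    """Read a prompt file verbatim: no stripping, no newline translation."""
    # Decoding the raw bytes keeps every character exactly as written.
    return Path(path).read_bytes().decode("utf-8")
```

Printing `repr(load_prompt(...)[-40:])` before a run is a quick way to check whether the file ends with a newline your editor added, which can change how the final prompt tokens are split.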