# Model Tool Tests
Universal test suite for validating LLM endpoints against OpenAI-compatible APIs. Tests chat, tool calls, multi-turn tool response flows, streaming chunking, and schema compatibility.
## Quick Start

### 1. Create your models config

Create a `models.env` file (gitignored) with one line per model:

```
API_BASE | API_KEY | MODEL_ID
```

Example:

```
https://api.vultrinference.com/v1 | your-api-key-here | mistralai/Devstral-2-123B-Instruct-2512
https://api.vultrinference.com/v1 | your-api-key-here | MiniMaxAI/MiniMax-M2.7
```

A template `models.env.example` is provided; copy it to get started:

```bash
cp models.env.example models.env
# Edit models.env with your API key and models
```
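The pipe-delimited format is simple enough to parse in a few lines. A minimal sketch of a parser, assuming blank lines and `#` comments are skipped (the function and field names here are illustrative, not the suite's internal API):

```python
def parse_models_env(text: str) -> list[dict]:
    """Parse pipe-delimited model lines: API_BASE | API_KEY | MODEL_ID."""
    models = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        api_base, api_key, model_id = (part.strip() for part in line.split("|"))
        models.append({"api_base": api_base, "api_key": api_key, "model_id": model_id})
    return models

models = parse_models_env(
    "# my models\n"
    "https://api.vultrinference.com/v1 | key-123 | mistralai/Devstral-2-123B-Instruct-2512\n"
)
```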
### 2. Install dependencies

```bash
pip install -r requirements.txt
```
### 3. Run the tests

```bash
# Test all models from models.env
python3 run_suite.py --all

# Test a specific model by index (1-based)
python3 run_suite.py --model 1

# Test models matching a substring
python3 run_suite.py --filter Devstral

# Test a single model via env vars (no models.env needed)
TOOLTEST_API_BASE=https://api.vultrinference.com/v1 \
TOOLTEST_API_KEY=your-key \
TOOLTEST_MODEL=mistralai/Devstral-2-123B-Instruct-2512 \
python3 run_suite.py

# Or use the shell wrapper
./run_tests.sh --all
./run_tests.sh --filter Qwen
```
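The env-var fallback amounts to checking that all three `TOOLTEST_*` variables are set. A hedged sketch of that resolution step, using the variable names from the commands above (the function name and return shape are assumptions, not the suite's code):

```python
import os

def resolve_single_model():
    """Return a single-model config from TOOLTEST_* env vars, or None if unset."""
    base = os.environ.get("TOOLTEST_API_BASE")
    key = os.environ.get("TOOLTEST_API_KEY")
    model = os.environ.get("TOOLTEST_MODEL")
    if base and key and model:
        return {"api_base": base, "api_key": key, "model_id": model}
    return None  # caller falls back to models.env
```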
## What It Tests

| # | Test | What it checks |
|---|------|----------------|
| 1 | Basic non-streaming chat | Model responds to a simple prompt |
| 2 | Basic streaming chat | SSE streaming works; content arrives in chunks |
| 3 | Tool call (non-streaming) | Model calls a tool when given a tool-use prompt |
| 4 | Tool call (streaming) | Tool calls work over SSE |
| 5 | Tool response flow | Full multi-turn: call → tool result → model uses the data |
| 6 | Tool response flow (stream) | Same as #5, but the first step is streamed |
| 7 | Bad tool schema (`properties: []`) | Endpoint accepts or fixes invalid `properties: []` in tool schemas |
| 8 | Nested bad schema (`items.properties: []`) | Endpoint handles deeply nested invalid schemas (the "Tool 21" bug) |
| 9 | Streaming tool chunking | Tool-call arguments are actually streamed in multiple chunks, not buffered into one |
| 10 | Parameter sweep | Which vLLM-specific params (`chat_template_kwargs`, `logprobs`, etc.) the endpoint accepts |
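Tests 7 and 8 probe a specific schema bug: JSON Schema requires `properties` to be an object, but some toolchains emit an empty list instead. A minimal sketch of the invalid shape and a recursive normalizer (the helper is illustrative; the suite only checks whether the endpoint tolerates the bad schema):

```python
def fix_empty_properties(schema):
    """Recursively replace the invalid `properties: []` with the valid empty object {}."""
    if isinstance(schema, dict):
        if schema.get("properties") == []:
            schema["properties"] = {}
        for value in schema.values():
            fix_empty_properties(value)
    elif isinstance(schema, list):
        for item in schema:
            fix_empty_properties(item)
    return schema

# Deeply nested invalid schema, like the "Tool 21" bug: items.properties == []
bad_tool = {
    "type": "function",
    "function": {
        "name": "list_files",
        "parameters": {
            "type": "object",
            "properties": {
                "paths": {
                    "type": "array",
                    "items": {"type": "object", "properties": []},
                },
            },
        },
    },
}
fixed = fix_empty_properties(bad_tool)
```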
## Reasoning Models

The suite handles reasoning models (Qwen3.5, MiniMax, etc.) that return responses in a `reasoning` field with `content: null`. These are reported as "Reasoning-only" in test output and still count as passing.
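Extracting usable text from such responses reduces to a null check on `content` with a fallback to `reasoning`. A minimal sketch, assuming the field names above (other APIs use different names, e.g. `reasoning_content`):

```python
def extract_text(message: dict):
    """Return (text, is_reasoning_only) for an assistant message dict."""
    content = message.get("content")
    if content:
        return content, False
    reasoning = message.get("reasoning")
    if reasoning:
        return reasoning, True  # reasoning-only: counts as a pass, flagged in output
    return "", False

text, reasoning_only = extract_text(
    {"role": "assistant", "content": None, "reasoning": "The user wants..."}
)
```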
## Output

Each model gets a per-test pass/fail with timing, then a cross-model comparison table:

```
Test                          MiniMax-M2.7   Devstral-2-123B-In   gemma-4-31B-it-int
────────────────────────────────────────────────────────────────────────────────────
basic non-stream              ✓              ✓                    ✓
tool call stream              ✓              ✓                    ✓
tool response flow (stream)   ✓              ✓                    ✓
streaming tool chunking       ✓              ✓                    ✓
```
The exit code is 1 if any test fails and 0 if all pass, so the suite can gate a CI pipeline directly.
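Each verdict in the table reduces to a simple predicate. For example, test 9's "actually streamed" check can be expressed as more than one non-empty tool-call arguments delta across stream chunks. A hedged sketch over already-parsed deltas (the chunk shape assumed here follows the OpenAI streaming format; the function is illustrative):

```python
def count_argument_chunks(deltas: list) -> int:
    """Count stream deltas that carry a non-empty tool-call arguments fragment."""
    count = 0
    for delta in deltas:
        for call in delta.get("tool_calls", []):
            if call.get("function", {}).get("arguments"):
                count += 1
    return count

# Three deltas: the name arrives first, then the arguments in two fragments
deltas = [
    {"tool_calls": [{"function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"function": {"arguments": '{"city": '}}]},
    {"tool_calls": [{"function": {"arguments": '"Paris"}'}}]},
]
streamed_properly = count_argument_chunks(deltas) > 1  # a buffering endpoint emits just one
```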
## Files

| File | Purpose |
|------|---------|
| `run_suite.py` | The test suite; single entry point for all models |
| `models.env` | Model configs (pipe-delimited, gitignored) |
| `models.env.example` | Template for `models.env` |
| `run_tests.sh` | Thin shell wrapper around `run_suite.py` |
| `requirements.txt` | Python dependencies (just `httpx`) |
| `NOTES.md` | Historical notes from SmolLM3 debugging |
| `model-files/` | Reference chat template and parser files from prior debugging |
## Adding a New Model

- Add a line to `models.env`: `api_base | api_key | model_id`
- Run `python3 run_suite.py --all`
- Check the cross-model comparison for failures

That's it. No code changes needed.