Model Tool Tests

Universal test suite for validating LLM endpoints against OpenAI-compatible APIs. Tests chat, tool calls, multi-turn tool response flows, streaming chunking, and schema compatibility.

Quick Start

1. Create your models config

Create a models.env file (gitignored) with one line per model:

API_BASE | API_KEY | MODEL_ID

Example:

https://api.vultrinference.com/v1 | your-api-key-here | mistralai/Devstral-2-123B-Instruct-2512
https://api.vultrinference.com/v1 | your-api-key-here | MiniMaxAI/MiniMax-M2.7

A template models.env.example is provided — copy it to get started:

cp models.env.example models.env
# Edit models.env with your API key and models
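For reference, the pipe-delimited format is simple enough to parse in a few lines. A minimal sketch (illustrative only, not the actual run_suite.py parser; the function name parse_models_env is made up here):

```python
def parse_models_env(text: str) -> list[dict]:
    """Parse 'API_BASE | API_KEY | MODEL_ID' lines, skipping blanks and comments."""
    models = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore empty lines and comment lines
        api_base, api_key, model_id = (part.strip() for part in line.split("|", 2))
        models.append({"api_base": api_base, "api_key": api_key, "model_id": model_id})
    return models
```

Whitespace around the pipes is stripped, so the columns can be aligned for readability without affecting parsing.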

2. Install dependencies

pip install -r requirements.txt

3. Run the tests

# Test all models from models.env
python3 run_suite.py --all

# Test a specific model by index (1-based)
python3 run_suite.py --model 1

# Test models matching a substring
python3 run_suite.py --filter Devstral

# Test a single model via env vars (no models.env needed)
TOOLTEST_API_BASE=https://api.vultrinference.com/v1 \
TOOLTEST_API_KEY=your-key \
TOOLTEST_MODEL=mistralai/Devstral-2-123B-Instruct-2512 \
python3 run_suite.py

# Using the shell wrapper
./run_tests.sh --all
./run_tests.sh --filter Qwen

What It Tests

#    Test                                       What it checks
1    Basic non-streaming chat                   Model responds to a simple prompt
2    Basic streaming chat                       SSE streaming works, content arrives in chunks
3    Tool call (non-streaming)                  Model calls a tool when given a tool-use prompt
4    Tool call (streaming)                      Tool calls work over SSE
5    Tool response flow                         Full multi-turn: call → tool result → model uses the data
6    Tool response flow (stream)                Same as #5 but step 1 is streaming
7    Bad tool schema (properties=[])            Endpoint accepts or fixes invalid properties: [] in tool schemas
8    Nested bad schema (items.properties=[])    Endpoint handles deeply nested invalid schemas (the "Tool 21" bug)
9    Streaming tool chunking                    Tool call arguments are actually streamed in multiple chunks, not buffered into one
10   Parameter sweep                            Which vLLM-specific params (chat_template_kwargs, logprobs, etc.) the endpoint accepts
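To make tests 7 and 8 concrete, here are payloads of the shape they exercise (illustrative only; the tool names noop and batch are invented here, and the exact schemas the suite sends may differ). In valid JSON Schema, properties must be an object, so an empty list is malformed, and some chat templates or parsers choke on it:

```python
# Test 7 shape: top-level properties is a list instead of an object
bad_tool = {
    "type": "function",
    "function": {
        "name": "noop",
        "description": "Tool with an invalid empty-list properties field",
        "parameters": {"type": "object", "properties": []},  # should be {}
    },
}

# Test 8 shape (the "Tool 21" bug): the same invalid properties,
# but buried inside an array's items schema
nested_bad_tool = {
    "type": "function",
    "function": {
        "name": "batch",
        "parameters": {
            "type": "object",
            "properties": {
                "entries": {
                    "type": "array",
                    "items": {"type": "object", "properties": []},  # should be {}
                }
            },
        },
    },
}
```

A robust endpoint either rejects these cleanly with a 4xx or normalizes the empty list to an empty object before rendering the chat template.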

Reasoning Models

The suite handles reasoning models (Qwen3.5, MiniMax, etc.) that return responses in a reasoning field with content: null. These are reported as "Reasoning-only" in test output and still count as passing.
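The fallback this implies can be sketched as follows (the field names content and reasoning follow the description above; the helper extract_text is hypothetical, and actual response shapes vary by endpoint):

```python
def extract_text(message: dict) -> tuple[str, bool]:
    """Return (text, is_reasoning_only) for a chat completion message."""
    content = message.get("content")
    if content:
        return content, False  # normal model: answer is in content
    # Reasoning model: content is null, fall back to the reasoning field
    reasoning = message.get("reasoning") or ""
    return reasoning, bool(reasoning)
```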

Output

Each model gets a per-test pass/fail with timing, then a cross-model comparison table:

Test                          MiniMax-M2.7   Devstral-2-123B-In   gemma-4-31B-it-int
─────────────────────────────────────────────────────────────────────────────────────
basic non-stream                   ✓                 ✓                    ✓
tool call stream                   ✓                 ✓                    ✓
tool response flow (stream)        ✓                 ✓                    ✓
streaming tool chunking            ✓                 ✓                    ✓

Exit code is 1 if any test fails, 0 if all pass — works in CI.
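That contract can be expressed in one line (a sketch of the behavior, not run_suite.py's actual code):

```python
def exit_code(results: list[bool]) -> int:
    """CI-friendly exit status: 0 only when every test passed."""
    return 0 if all(results) else 1
```

Passing this to sys.exit() after collecting per-test results is what lets a CI job gate on the suite.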

Files

File                 Purpose
run_suite.py         The test suite — single entry point for all models
models.env           Model configs (pipe-delimited, gitignored)
models.env.example   Template for models.env
run_tests.sh         Thin shell wrapper around run_suite.py
requirements.txt     Python dependencies (just httpx)
NOTES.md             Historical notes from SmolLM3 debugging
model-files/         Reference chat template and parser files from prior debugging

Adding a New Model

  1. Add a line to models.env: api_base | api_key | model_id
  2. Run python3 run_suite.py --all
  3. Check the cross-model comparison for failures

That's it. No code changes needed.
