# Model Tool Tests

Universal test suite for validating LLM endpoints against OpenAI-compatible APIs. Tests chat, tool calls, multi-turn tool response flows, streaming chunking, and schema compatibility.

## Quick Start

### 1. Create your models config

Create a `models.env` file (gitignored) with one line per model:

```
API_BASE | API_KEY | MODEL_ID
```

Example:

```
https://api.vultrinference.com/v1 | your-api-key-here | mistralai/Devstral-2-123B-Instruct-2512
https://api.vultrinference.com/v1 | your-api-key-here | MiniMaxAI/MiniMax-M2.7
```

A template `models.env.example` is provided; copy it to get started:

```bash
cp models.env.example models.env
# Edit models.env with your API key and models
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. Run the tests

```bash
# Test all models from models.env
python3 run_suite.py --all

# Test a specific model by index (1-based)
python3 run_suite.py --model 1

# Test models matching a substring
python3 run_suite.py --filter Devstral

# Test a single model via env vars (no models.env needed)
TOOLTEST_API_BASE=https://api.vultrinference.com/v1 \
TOOLTEST_API_KEY=your-key \
TOOLTEST_MODEL=mistralai/Devstral-2-123B-Instruct-2512 \
python3 run_suite.py

# Using the shell wrapper
./run_tests.sh --all
./run_tests.sh --filter Qwen
```

## What It Tests

| # | Test | What it checks |
|---|------|----------------|
| 1 | Basic non-streaming chat | Model responds to a simple prompt |
| 2 | Basic streaming chat | SSE streaming works, content arrives in chunks |
| 3 | Tool call (non-streaming) | Model calls a tool when given a tool-use prompt |
| 4 | Tool call (streaming) | Tool calls work over SSE |
| 5 | Tool response flow | Full multi-turn: call → tool result → model uses the data |
| 6 | Tool response flow (stream) | Same as #5 but step 1 is streaming |
| 7 | Bad tool schema (`properties=[]`) | Endpoint accepts or fixes invalid `properties: []` in tool schemas |
| 8 | Nested bad schema (`items.properties=[]`) | Endpoint handles deeply nested invalid schemas (the "Tool 21" bug) |
| 9 | Streaming tool chunking | Tool call arguments are actually streamed in multiple chunks, not buffered into one |
| 10 | Parameter sweep | Which vLLM-specific params (`chat_template_kwargs`, `logprobs`, etc.) the endpoint accepts |

## Reasoning Models

The suite handles reasoning models (Qwen3.5, MiniMax, etc.) that return responses in a `reasoning` field with `content: null`. These are reported as "Reasoning-only" in test output and still count as passing.

## Output

Each model gets a per-test pass/fail with timing, then a cross-model comparison table:

```
Test                           MiniMax-M2.7    Devstral-2-123B-In    gemma-4-31B-it-int
────────────────────────────────────────────────────────────────────────────────────
basic non-stream               ✓               ✓                     ✓
tool call stream               ✓               ✓                     ✓
tool response flow (stream)    ✓               ✓                     ✓
streaming tool chunking        ✓               ✓                     ✓
```

Exit code is 1 if any test fails, 0 if all pass, so the suite works in CI.

## Files

| File | Purpose |
|------|---------|
| `run_suite.py` | The test suite: single entry point for all models |
| `models.env` | Model configs (pipe-delimited, gitignored) |
| `models.env.example` | Template for models.env |
| `run_tests.sh` | Thin shell wrapper around run_suite.py |
| `requirements.txt` | Python dependencies (just `httpx`) |
| `NOTES.md` | Historical notes from SmolLM3 debugging |
| `model-files/` | Reference chat template and parser files from prior debugging |

## Adding a New Model

1. Add a line to `models.env`: `api_base | api_key | model_id`
2. Run `python3 run_suite.py --all`
3. Check the cross-model comparison for failures

That's it. No code changes needed.
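If you want to script around the suite yourself, the pipe-delimited `models.env` format is straightforward to parse. A minimal sketch in Python — the function name `load_models`, the returned dict keys, and the `#`-comment handling are illustrative assumptions here, not guarantees about how `run_suite.py` parses the file:

```python
# Hypothetical parser for the models.env format described above:
#   API_BASE | API_KEY | MODEL_ID
# Blank lines and #-comments are skipped (an assumption; check
# models.env.example for the authoritative format).
def load_models(text: str) -> list[dict]:
    models = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        # Split on the first two pipes so MODEL_ID may itself contain "|"-free slashes etc.
        api_base, api_key, model_id = (part.strip() for part in line.split("|", 2))
        models.append({"api_base": api_base, "api_key": api_key, "model": model_id})
    return models


if __name__ == "__main__":
    sample = (
        "# my models\n"
        "https://api.vultrinference.com/v1 | your-api-key-here | MiniMaxAI/MiniMax-M2.7\n"
    )
    for entry in load_models(sample):
        print(entry["model"])
```

Each parsed entry maps directly onto the `TOOLTEST_API_BASE` / `TOOLTEST_API_KEY` / `TOOLTEST_MODEL` environment variables the suite accepts, so the same records can drive either invocation style.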