diff --git a/README.md b/README.md
new file mode 100644
index 0000000..5cc7830
--- /dev/null
+++ b/README.md
@@ -0,0 +1,110 @@
+# Model Tool Tests
+
+Universal test suite for validating LLM endpoints against OpenAI-compatible APIs. Tests chat, tool calls, multi-turn tool response flows, streaming chunking, and schema compatibility.
+
+## Quick Start
+
+### 1. Create your models config
+
+Create a `models.env` file (gitignored) with one line per model:
+
+```
+API_BASE | API_KEY | MODEL_ID
+```
+
+Example:
+
+```
+https://api.vultrinference.com/v1 | your-api-key-here | mistralai/Devstral-2-123B-Instruct-2512
+https://api.vultrinference.com/v1 | your-api-key-here | MiniMaxAI/MiniMax-M2.7
+```
+
+A template `models.env.example` is provided — copy it to get started:
+
+```bash
+cp models.env.example models.env
+# Edit models.env with your API key and models
+```
+
+### 2. Install dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### 3. Run the tests
+
+```bash
+# Test all models from models.env
+python3 run_suite.py --all
+
+# Test a specific model by index (1-based)
+python3 run_suite.py --model 1
+
+# Test models matching a substring
+python3 run_suite.py --filter Devstral
+
+# Test a single model via env vars (no models.env needed)
+TOOLTEST_API_BASE=https://api.vultrinference.com/v1 \
+TOOLTEST_API_KEY=your-key \
+TOOLTEST_MODEL=mistralai/Devstral-2-123B-Instruct-2512 \
+python3 run_suite.py
+
+# Using the shell wrapper
+./run_tests.sh --all
+./run_tests.sh --filter Qwen
+```
+
+## What It Tests
+
+| # | Test | What it checks |
+|---|------|---------------|
+| 1 | Basic non-streaming chat | Model responds to a simple prompt |
+| 2 | Basic streaming chat | SSE streaming works, content arrives in chunks |
+| 3 | Tool call (non-streaming) | Model calls a tool when given a tool-use prompt |
+| 4 | Tool call (streaming) | Tool calls work over SSE |
+| 5 | Tool response flow | Full multi-turn: call → tool result → model uses the data |
+| 6 | Tool response flow (stream) | Same as #5 but step 1 is streaming |
+| 7 | Bad tool schema (`properties=[]`) | Endpoint accepts or fixes invalid `properties: []` in tool schemas |
+| 8 | Nested bad schema (`items.properties=[]`) | Endpoint handles deeply nested invalid schemas (the "Tool 21" bug) |
+| 9 | Streaming tool chunking | Tool call arguments are actually streamed in multiple chunks, not buffered into one |
+| 10 | Parameter sweep | Which vLLM-specific params (`chat_template_kwargs`, `logprobs`, etc.) the endpoint accepts |
+
+## Reasoning Models
+
+The suite handles reasoning models (Qwen3.5, MiniMax, etc.) that return responses in a `reasoning` field with `content: null`. These are reported as "Reasoning-only" in test output and still count as passing.
+
+## Output
+
+Each model gets a per-test pass/fail with timing, then a cross-model comparison table:
+
+```
+Test                          MiniMax-M2.7        Devstral-2-123B-In  gemma-4-31B-it-int
+────────────────────────────────────────────────────────────────────────────────────
+basic non-stream              ✓                   ✓                   ✓
+tool call stream              ✓                   ✓                   ✓
+tool response flow (stream)   ✓                   ✓                   ✓
+streaming tool chunking       ✓                   ✓                   ✓
+```
+
+Exit code is 1 if any test fails, 0 if all pass — works in CI.
+
+## Files
+
+| File | Purpose |
+|------|---------|
+| `run_suite.py` | The test suite — single entry point for all models |
+| `models.env` | Model configs (pipe-delimited, gitignored) |
+| `models.env.example` | Template for models.env |
+| `run_tests.sh` | Thin shell wrapper around run_suite.py |
+| `requirements.txt` | Python dependencies (just `httpx`) |
+| `NOTES.md` | Historical notes from SmolLM3 debugging |
+| `model-files/` | Reference chat template and parser files from prior debugging |
+
+## Adding a New Model
+
+1. Add a line to `models.env`: `api_base | api_key | model_id`
+2. Run `python3 run_suite.py --all`
+3. Check the cross-model comparison for failures
+
+That's it. No code changes needed.
diff --git a/models.env.example b/models.env.example
new file mode 100644
index 0000000..06daac1
--- /dev/null
+++ b/models.env.example
@@ -0,0 +1,7 @@
+# Model config template
+# Copy this file to models.env and fill in your API key and models
+# Format: API_BASE | API_KEY | MODEL_ID
+# Lines starting with # are comments
+
+# https://api.vultrinference.com/v1 | YOUR_API_KEY | mistralai/Devstral-2-123B-Instruct-2512
+# https://api.vultrinference.com/v1 | YOUR_API_KEY | MiniMaxAI/MiniMax-M2.7
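
Review note: the `API_BASE | API_KEY | MODEL_ID` format documented in this template, with `#` comment lines, could be parsed roughly as follows. This is a minimal sketch under stated assumptions and not the code in `run_suite.py`, which this diff does not show. The names `ModelConfig` and `parse_models_env` are hypothetical, and rejecting malformed lines with an error (rather than skipping them) is an assumption about the desired behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """One entry from models.env (hypothetical type, for illustration only)."""
    api_base: str
    api_key: str
    model_id: str


def parse_models_env(text: str) -> list[ModelConfig]:
    """Parse pipe-delimited lines, skipping blank lines and # comments."""
    configs = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the pipe delimiter and trim the padding around each field.
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3 or not all(parts):
            # Assumption: a malformed line is an error, not something to skip.
            raise ValueError(
                f"line {lineno}: expected 'API_BASE | API_KEY | MODEL_ID'"
            )
        configs.append(ModelConfig(*parts))
    return configs
```

With a parser shaped like this, the 1-based `--model N` selection described in the README would correspond to `configs[N - 1]`, and `--filter` to a substring match on `model_id`. Both mappings are inferred from the README text, not confirmed by the suite's source.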