add README, models.env.example, add Qwen3.5-27B to models.env
110
README.md
Normal file
@@ -0,0 +1,110 @@
# Model Tool Tests

Universal test suite for validating LLM endpoints against OpenAI-compatible APIs. Tests chat, tool calls, multi-turn tool response flows, streaming chunking, and schema compatibility.

## Quick Start

### 1. Create your models config

Create a `models.env` file (gitignored) with one line per model:

```
API_BASE | API_KEY | MODEL_ID
```

Example:

```
https://api.vultrinference.com/v1 | your-api-key-here | mistralai/Devstral-2-123B-Instruct-2512
https://api.vultrinference.com/v1 | your-api-key-here | MiniMaxAI/MiniMax-M2.7
```
A template `models.env.example` is provided — copy it to get started:

```bash
cp models.env.example models.env
# Edit models.env with your API key and models
```
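The pipe-delimited format is simple enough to parse in a few lines. For illustration, here is a minimal sketch of a loader; the function name and returned field names are assumptions, and the suite's actual loader may differ:

```python
# Hypothetical sketch of a models.env parser. Skips blank lines and
# "#" comments, splits each remaining line on "|" and strips whitespace.
def parse_models_env(text):
    models = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        api_base, api_key, model_id = (part.strip() for part in line.split("|"))
        models.append({"api_base": api_base, "api_key": api_key, "model_id": model_id})
    return models
```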
### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. Run the tests

```bash
# Test all models from models.env
python3 run_suite.py --all

# Test a specific model by index (1-based)
python3 run_suite.py --model 1

# Test models matching a substring
python3 run_suite.py --filter Devstral

# Test a single model via env vars (no models.env needed)
TOOLTEST_API_BASE=https://api.vultrinference.com/v1 \
TOOLTEST_API_KEY=your-key \
TOOLTEST_MODEL=mistralai/Devstral-2-123B-Instruct-2512 \
python3 run_suite.py

# Using the shell wrapper
./run_tests.sh --all
./run_tests.sh --filter Qwen
```
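The `TOOLTEST_*` variables imply a simple override: when all three are set, they define the model to test without consulting `models.env`. A minimal sketch of that precedence check (hypothetical helper; the suite's actual resolution logic may differ):

```python
import os

# Hypothetical sketch: if TOOLTEST_API_BASE, TOOLTEST_API_KEY and
# TOOLTEST_MODEL are all set, use them instead of models.env.
def model_from_env():
    keys = ("TOOLTEST_API_BASE", "TOOLTEST_API_KEY", "TOOLTEST_MODEL")
    values = [os.environ.get(k) for k in keys]
    if all(values):
        return {"api_base": values[0], "api_key": values[1], "model_id": values[2]}
    return None  # fall back to models.env
```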
## What It Tests

| # | Test | What it checks |
|---|------|---------------|
| 1 | Basic non-streaming chat | Model responds to a simple prompt |
| 2 | Basic streaming chat | SSE streaming works, content arrives in chunks |
| 3 | Tool call (non-streaming) | Model calls a tool when given a tool-use prompt |
| 4 | Tool call (streaming) | Tool calls work over SSE |
| 5 | Tool response flow | Full multi-turn: call → tool result → model uses the data |
| 6 | Tool response flow (stream) | Same as #5 but step 1 is streaming |
| 7 | Bad tool schema (`properties=[]`) | Endpoint accepts or fixes invalid `properties: []` in tool schemas |
| 8 | Nested bad schema (`items.properties=[]`) | Endpoint handles deeply nested invalid schemas (the "Tool 21" bug) |
| 9 | Streaming tool chunking | Tool call arguments are actually streamed in multiple chunks, not buffered into one |
| 10 | Parameter sweep | Which vLLM-specific params (`chat_template_kwargs`, `logprobs`, etc.) the endpoint accepts |
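Tests 7 and 8 exercise the `properties: []` shape: JSON Schema requires `properties` to be an object, but some clients emit an empty list, including nested inside `items`. To make the failure mode concrete, here is a sketch of such a payload and the kind of recursive normalization an endpoint might apply; the fixer function is hypothetical, not the suite's or any endpoint's actual code:

```python
# Hypothetical recursive fixer: rewrite properties=[] to {} anywhere
# in a tool schema, including nested under "items" (the test 8 case).
def fix_schema(node):
    if isinstance(node, dict):
        if node.get("properties") == []:
            node["properties"] = {}
        for value in node.values():
            fix_schema(value)
    elif isinstance(node, list):
        for item in node:
            fix_schema(item)
    return node

# Example payload with the nested invalid shape from test 8.
bad_tool = {
    "type": "function",
    "function": {
        "name": "get_time",
        "parameters": {
            "type": "object",
            "properties": {
                "zones": {
                    "type": "array",
                    # invalid: "properties" must be an object, not a list
                    "items": {"type": "object", "properties": []},
                },
            },
        },
    },
}
fixed_tool = fix_schema(bad_tool)
```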
## Reasoning Models

The suite handles reasoning models (Qwen3.5, MiniMax, etc.) that return responses in a `reasoning` field with `content: null`. These are reported as "Reasoning-only" in test output and still count as passing.
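The fallback this implies, when reading a completion message, can be sketched as follows; the helper and its labels are illustrative, not the suite's actual implementation:

```python
# Hypothetical sketch: prefer "content", fall back to "reasoning"
# when content is null, and label reasoning-only replies.
def extract_text(message):
    content = message.get("content")
    if content:
        return content, "content"
    reasoning = message.get("reasoning")
    if reasoning:
        return reasoning, "reasoning-only"
    return "", "empty"
```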
## Output

Each model gets a per-test pass/fail with timing, then a cross-model comparison table:

```
Test                           MiniMax-M2.7   Devstral-2-123B-In   gemma-4-31B-it-int
────────────────────────────────────────────────────────────────────────────────────
basic non-stream               ✓              ✓                    ✓
tool call stream               ✓              ✓                    ✓
tool response flow (stream)    ✓              ✓                    ✓
streaming tool chunking        ✓              ✓                    ✓
```

Exit code is 1 if any test fails, 0 if all pass — works in CI.
## Files

| File | Purpose |
|------|---------|
| `run_suite.py` | The test suite — single entry point for all models |
| `models.env` | Model configs (pipe-delimited, gitignored) |
| `models.env.example` | Template for models.env |
| `run_tests.sh` | Thin shell wrapper around run_suite.py |
| `requirements.txt` | Python dependencies (just `httpx`) |
| `NOTES.md` | Historical notes from SmolLM3 debugging |
| `model-files/` | Reference chat template and parser files from prior debugging |

## Adding a New Model

1. Add a line to `models.env`: `api_base | api_key | model_id`
2. Run `python3 run_suite.py --all`
3. Check the cross-model comparison for failures

That's it. No code changes needed.
7
models.env.example
Normal file
@@ -0,0 +1,7 @@
# Model config template
# Copy this file to models.env and fill in your API key and models
# Format: API_BASE | API_KEY | MODEL_ID
# Lines starting with # are comments

# https://api.vultrinference.com/v1 | YOUR_API_KEY | mistralai/Devstral-2-123B-Instruct-2512
# https://api.vultrinference.com/v1 | YOUR_API_KEY | MiniMaxAI/MiniMax-M2.7