# Context Paging

**Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.**

---

## The Problem

Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.

## The Solution

Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.

## The Analogy

This is virtual memory. The context window is RAM. The message store is the page table. The original messages are disk. A tool call requesting an MD5 hash is a page fault.

---

## Architecture: Three Nested Loops

```
USER sends message
              │
              ▼
┌─────────────────────────────────┐
│  LOOP 2 — Context Fitting       │
│  Compress history until it fits │
└─────────────┬───────────────────┘
              │ fitted context
              ▼
┌─────────────────────────────────┐
│  LOOP 3 — Dereference           │
│  LLM may request full msgs      │
│  via MD5 → inject & re-run      │
└─────────────┬───────────────────┘
              │ final response
              ▼
USER receives response
```

### Loop 2 — Fit

`ContextPaging::fit()` compresses messages until they fit within the context window:

1. Count total tokens in all messages
2. If under budget → done
3. Take the oldest non-summarized message
4. Compute its MD5 hash, store the original in the message store
5. Replace it with a summary + hash pointer: `[md5:a3f8c1e9...] User asked about Q3 revenue...`
6. Repeat until under budget

**Rule:** The last message (the current user request) is **never** summarized.
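
The fit loop above can be sketched in a few lines of PHP. This is an illustrative sketch, not the library's actual internals: `countTokens()` (a crude character-count estimate) and `summarize()` (a truncating placeholder) stand in for the real `TokenCounter` and `LLMSummarizer`.

```php
<?php
// Sketch of the fit loop — helper names and the token estimate are
// illustrative assumptions, not the actual ContextPaging internals.

/** Very rough token estimate: ~4 characters per token. */
function countTokens(string $text): int {
    return (int) ceil(strlen($text) / 4);
}

/** Placeholder summarizer: real code would call an LLM. */
function summarize(string $text): string {
    return substr($text, 0, 40) . '...';
}

function fit(array $messages, int $budget, array &$store): array {
    $total = fn(array $msgs) => array_sum(
        array_map(fn($m) => countTokens($m['content']), $msgs)
    );

    // Walk oldest-first; the last message is never summarized.
    for ($i = 0; $i < count($messages) - 1 && $total($messages) > $budget; $i++) {
        if (!empty($messages[$i]['summarized'])) {
            continue; // already a summary
        }
        $original = $messages[$i]['content'];
        $hash = md5($original);
        $store[$hash] = $original; // keep the full message for dereferencing
        $messages[$i]['content'] = "[md5:$hash] " . summarize($original);
        $messages[$i]['summarized'] = true;
    }
    return $messages;
}

$store = [];
$messages = [
    ['role' => 'user', 'content' => str_repeat('Q3 revenue details. ', 50)],
    ['role' => 'assistant', 'content' => str_repeat('Revenue was up. ', 50)],
    ['role' => 'user', 'content' => 'What changed most?'],
];
$fitted = fit($messages, 60, $store);
```

Note that the last message passes through untouched regardless of the budget, matching the rule above.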

### Loop 3 — Execute

`ContextPaging::execute()` runs the LLM and handles dereference requests:

1. Send fitted context to LLM
2. If response contains a `fetch_message` tool call with an MD5 → continue
3. Look up the original message, inject it into the context
4. Re-send to LLM
5. If response is normal text (no tool calls) → done, return to user
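
The steps above can be sketched as a loop over "page faults". The `$llm` callable and `extractFetchRequest()` helper are hypothetical stand-ins for the real client and tool-call parser; a fault cap is added so a misbehaving model cannot loop forever.

```php
<?php
// Sketch of the dereference loop — $llm and extractFetchRequest() are
// hypothetical stand-ins, not the library's real API.

/** Pull an MD5 out of a fetch_message tool call, if present. */
function extractFetchRequest(string $response): ?string {
    if (preg_match('/"md5"\s*:\s*"([0-9a-f]{32})"/', $response, $m)) {
        return $m[1];
    }
    return null;
}

function execute(array $messages, array $store, callable $llm, int $maxFaults = 3): string {
    $response = '';
    for ($fault = 0; $fault <= $maxFaults; $fault++) {
        $response = $llm($messages);
        $hash = extractFetchRequest($response);
        if ($hash === null) {
            return $response; // normal text — done
        }
        // "Page fault": inject the original message and re-run.
        $messages[] = [
            'role' => 'tool',
            'content' => $store[$hash] ?? "(no message stored for $hash)",
        ];
    }
    return $response;
}

// A fake LLM that "page faults" once, then answers.
$store = [md5('the full Q3 revenue breakdown') => 'the full Q3 revenue breakdown'];
$calls = 0;
$llm = function (array $messages) use (&$calls, $store): string {
    $calls++;
    if ($calls === 1) {
        $hash = array_key_first($store);
        return '<tool_call>{"name": "fetch_message", "arguments": {"md5": "' . $hash . '"}}</tool_call>';
    }
    return 'Final answer';
};

$answer = execute([['role' => 'user', 'content' => 'Summarize Q3.']], $store, $llm);
```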

---

## Project Structure

```
context-paging/
├── src/
│   ├── ContextPaging.php              # Main class — fit() + execute()
│   ├── TokenCounter.php               # Shells out to Rust binary
│   ├── ContextRequest.php             # Extended ServerRequest
│   ├── OpenAICompatibleClient.php     # Guzzle-based LLM client
│   ├── CompletionsClientInterface.php
│   ├── LLMSummarizer.php              # LLM-backed summarizer
│   ├── SummarizerInterface.php
│   ├── CacheInterface.php             # Cache abstraction
│   ├── InMemoryCache.php              # In-memory implementation
│   ├── RedisCache.php                 # Redis implementation
│   ├── ToolCallParser.php             # Parse tool calls from responses
│   ├── ToolFormatter.php              # Format tools for requests
│   └── ToolCallMode.php               # NATIVE/RAW/AUTO enum
├── tests/
│   ├── ContextPagingTest.php          # Core functionality tests
│   ├── OpenAICompatibleClientTest.php # LLM client tests
│   ├── SummarizerTest.php             # Summarization tests
│   ├── RedisCacheTest.php             # Redis persistence tests
│   ├── ToolCallParserTest.php
│   ├── ToolFormatterTest.php
│   └── fluff.md                       # Test article for summarization
├── token-counter                      # Rust binary (tiktoken)
├── index.php                          # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md
```

---

## Quick Start

### Prerequisites

- PHP 8.5+
- Composer
- Rust binary at `./token-counter` (or rebuild from `~/dev/token-counter/`)

### Install

```bash
composer install
```

This installs:

- `guzzlehttp/guzzle` — HTTP client for LLM API calls
- `guzzlehttp/psr7` — PSR-7 message implementations
- `predis/predis` — Redis client (optional, only needed for RedisCache)

### Run Tests

```bash
./vendor/bin/phpunit

# With testdox output
./vendor/bin/phpunit --testdox

# Run a specific test file
./vendor/bin/phpunit tests/SummarizerTest.php
```

### CLI Usage

```bash
# Pipe a JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php

# Or pass it as an argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'
```

---

## API

### ContextPaging

```php
use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;

// Create summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
    baseUrl: 'http://your-llm-endpoint/v1',
    apiKey: null,   // optional for local endpoints
    timeout: 120
);

$summarizer = new LLMSummarizer(
    client: $summarizerClient,
    model: 'HuggingFaceTB/SmolLM3-3B',
    maxTokens: 200,
    temperature: 0.3
);

// Create main instance
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    summarizer: $summarizer
);

// Configure for your model
$contextPaging
    ->setMaxContextTokens(128000)
    ->setResponseReserve(4096);

// Set tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);

// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);

// LOOP 3: Execute with dereference handling
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
    return $client->chat($messages, $options);
});
```

### TokenCounter

```php
use ContextPaging\TokenCounter;

$counter = new TokenCounter();

// Count tokens in a string
$tokens = $counter->count("Hello, world!");
// Returns: 4

// Count with a different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");

// Count context size for chat messages
$tokens = $counter->contextSize([
    ['role' => 'user', 'content' => 'Hello!'],
    ['role' => 'assistant', 'content' => 'Hi there!'],
]);
```

### OpenAICompatibleClient

```php
use ContextPaging\OpenAICompatibleClient;

$client = new OpenAICompatibleClient(
    baseUrl: 'http://95.179.247.150/v1',
    apiKey: null,
    timeout: 120,
    verifySsl: false
);

// Chat completion
$response = $client->chat([
    ['role' => 'user', 'content' => 'Hello!']
], [
    'model' => 'HuggingFaceTB/SmolLM3-3B',
    'max_tokens' => 100
]);

// List models
$models = $client->listModels();
```

### LLMSummarizer

```php
use ContextPaging\LLMSummarizer;

$summarizer = new LLMSummarizer(
    client: $client,
    model: 'HuggingFaceTB/SmolLM3-3B',
    systemPrompt: 'Summarize concisely, preserving key information.',
    maxTokens: 200,
    temperature: 0.3
);

$summary = $summarizer->summarize($longText);
```

---

## Tool Call Modes

The system supports three tool call modes for the dereference operation:

### NATIVE Mode

For models with working tool call parsers (GPT-4, Claude, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
```

- Tools are sent as a `tools` array in the request payload
- Tool calls are returned in the `tool_calls` array of the response

### RAW Mode

For models with broken or missing tool parsers (SmolLM3, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::RAW);
```

- Tools are injected into the system prompt in an XML-style format
- The model outputs tool calls as markers: `<tool_call>{"name": "fetch_message", "arguments": {"md5": "..."}}</tool_call>`
- Markers are parsed out of the response content

### AUTO Mode

Detects the mode from the first response:

```php
$contextPaging->setToolCallMode(ToolCallMode::AUTO);
```
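
Extracting a RAW-mode marker from response text takes only a small regex pass. This sketch assumes the `<tool_call>` marker format shown above; the project's actual `ToolCallParser` may handle more cases.

```php
<?php
// Sketch of RAW-mode parsing — assumes the <tool_call> marker format
// shown above; the real ToolCallParser may differ.

/** @return array<int, array{name: string, arguments: array}> */
function parseRawToolCalls(string $content): array {
    $calls = [];
    if (preg_match_all('/<tool_call>(.*?)<\/tool_call>/s', $content, $matches)) {
        foreach ($matches[1] as $json) {
            $decoded = json_decode(trim($json), true);
            if (is_array($decoded) && isset($decoded['name'])) {
                $calls[] = $decoded; // keep only well-formed calls
            }
        }
    }
    return $calls;
}

$response = 'Let me check. <tool_call>{"name": "fetch_message", '
    . '"arguments": {"md5": "a3f8c1e9a3f8c1e9a3f8c1e9a3f8c1e9"}}</tool_call>';
$calls = parseRawToolCalls($response);
```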

---

## Implementation Status

| Component | Status | Notes |
|-----------|--------|-------|
| Token counting | ✅ Done | Rust binary via `tiktoken-rs` |
| Fit loop (Loop 2) | ✅ Done | Summarization via LLM |
| Message store | ✅ Redis or in-memory | Persistent cache support |
| Summary cache | ✅ Redis or in-memory | Persistent cache support |
| Dereference loop (Loop 3) | ✅ Done | Tool call parsing implemented |
| Tool call parser | ✅ Done | NATIVE and RAW modes |
| Tool formatter | ✅ Done | NATIVE and RAW modes |
| LLM client | ✅ Done | OpenAI-compatible via Guzzle |
| LLMSummarizer | ✅ Done | Uses configured model |
| RedisCache | ✅ Done | Persistent storage via Predis |
| Tests | ✅ 36 passing | Unit + integration tests |

---

## Caching

### In-Memory Cache (Default)

By default, ContextPaging uses in-memory caches that exist for the duration of a single request:

```php
$contextPaging = new ContextPaging();
// Uses InMemoryCache internally
```

### Redis Cache (Persistent)

For persistent storage across requests, use Redis:

```php
use ContextPaging\RedisCache;

// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_msg:',   // Key prefix for namespacing
    defaultTtl: null      // No expiry (or set a TTL in seconds)
);

$summaryCache = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_sum:'
);

// Inject into ContextPaging
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    messageStore: $messageStore,
    summaryCache: $summaryCache
);
```

**Benefits of Redis:**
- Summaries persist between requests (no re-summarization)
- Message store survives process restarts
- Context can be shared across multiple workers/servers

**Key Namespacing:**
- Message store uses keys: `prefix:msg:{md5}`
- Summary cache uses keys: `prefix:summary:{md5}`
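
The no-re-summarization benefit follows from keying summaries by content hash: the same text always maps to the same key, so a second request finds the cached summary instead of calling the LLM again. A minimal sketch, using a toy array cache and a truncating placeholder summarizer (the real `CacheInterface` may differ):

```php
<?php
// Sketch of summary-cache reuse keyed by content hash. The cache methods
// mirror a typical get/set interface; the actual CacheInterface may differ.

class ArrayCache {
    private array $data = [];
    public function get(string $key): ?string { return $this->data[$key] ?? null; }
    public function set(string $key, string $value): void { $this->data[$key] = $value; }
}

$summaryCache = new ArrayCache();
$llmCalls = 0;

// Summarize with cache: identical content is only summarized once.
$summarizeOnce = function (string $text) use ($summaryCache, &$llmCalls): string {
    $key = 'summary:' . md5($text);
    $cached = $summaryCache->get($key);
    if ($cached !== null) {
        return $cached; // cache hit — no LLM call
    }
    $llmCalls++; // stand-in for the real LLM summarization call
    $summary = substr($text, 0, 30) . '...';
    $summaryCache->set($key, $summary);
    return $summary;
};

$a = $summarizeOnce('A very long message about Q3 revenue and forecasts.');
$b = $summarizeOnce('A very long message about Q3 revenue and forecasts.');
```

With a Redis-backed cache in place of `ArrayCache`, the hit survives process restarts and is shared across workers.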

---

## Testing

### Run All Tests

```bash
./vendor/bin/phpunit --testdox
```

### Test Categories

**ContextPagingTest** (6 tests)
- Small payloads pass through unchanged
- Large payloads trigger summarization
- Last message is never summarized
- Original messages stored for dereferencing
- Error when the last message is too large

**OpenAICompatibleClientTest** (8 tests)
- Basic chat completion
- Usage stats returned
- Multi-turn conversation context retention
- List models endpoint
- RAW tool formatting
- Tool call parser detection

**SummarizerTest** (4 tests)
- Summarization reduces token count (typically 75–85%)
- Key information preserved
- Multi-article summarization
- Usage stats accuracy

**ToolCallParserTest** (5 tests)
- Extract native OpenAI tool calls
- Extract raw XML-style tool calls
- Auto-detect mode from response

**ToolFormatterTest** (5 tests)
- Format for native API
- Format for raw system prompt injection

**RedisCacheTest** (9 tests)
- Set and get operations
- Key existence checks
- Delete operations
- TTL expiration
- ContextPaging with Redis cache
- Summary persistence between requests
- In-memory vs Redis parity
- Message store persistence across instances

### Integration Test Requirements

Some tests require a running LLM endpoint. The default configuration uses:

- **URL:** `http://95.179.247.150/v1`
- **Model:** `HuggingFaceTB/SmolLM3-3B`

To use a different endpoint, modify `setUp()` in the test files.

---

## Token Counter Binary

The `token-counter` binary is a Rust CLI tool using `tiktoken-rs`:

```bash
# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4

# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
```

Source: `~/dev/token-counter/`
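
`TokenCounter.php` shells out to this binary. The call can be sketched with `proc_open`, assuming the binary reads the text on stdin and prints a single integer; the `$binary` parameter and argument passing are illustrative, not the real class's API.

```php
<?php
// Sketch of shelling out to ./token-counter — assumes the binary reads
// text on stdin and prints a token count. The $binary parameter is an
// illustrative extension, not the real TokenCounter API.

function countTokensViaBinary(string $text, string $binary = './token-counter', string ...$args): int {
    $proc = proc_open(
        array_merge([$binary], $args),          // e.g. ['./token-counter', 'o200k_base']
        [0 => ['pipe', 'r'], 1 => ['pipe', 'w']], // stdin, stdout
        $pipes
    );
    if (!is_resource($proc)) {
        throw new RuntimeException('failed to start ' . $binary);
    }
    fwrite($pipes[0], $text); // feed the text on stdin
    fclose($pipes[0]);
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);
    return (int) trim($output);
}
```

Passing the command as an array (PHP 7.4+) avoids shell-quoting issues with the input encoding name.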

---

## Open Design Decisions

### Dereference Overage

When a message gets dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:

1. Allow a temporary overage for one turn
2. Drop other messages flagged as irrelevant
3. Re-summarize something else
4. Tighten summary quality to reduce dereferences

**Recommendation:** Instrument from day one. Log every dereference, token cost, and final count. Let real-world data drive the decision.
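
One lightweight way to follow that recommendation, sketched with an in-memory logger; the field names and `overageRate()` metric are illustrative assumptions:

```php
<?php
// Sketch of dereference instrumentation — log each "page fault" so real
// usage data can drive the overage policy. Field names are illustrative.

class DereferenceLog {
    /** @var array<int, array> */
    private array $events = [];

    public function record(string $md5, int $injectedTokens, int $contextTokens, int $budget): void {
        $this->events[] = [
            'md5' => $md5,
            'injected_tokens' => $injectedTokens,
            'context_tokens' => $contextTokens,
            'over_budget' => max(0, $contextTokens - $budget),
            'at' => time(),
        ];
    }

    /** Fraction of dereferences that pushed the context past the budget. */
    public function overageRate(): float {
        if ($this->events === []) {
            return 0.0;
        }
        $over = count(array_filter($this->events, fn($e) => $e['over_budget'] > 0));
        return $over / count($this->events);
    }
}

$log = new DereferenceLog();
$log->record(md5('a'), 500, 127500, 128000); // within budget
$log->record(md5('b'), 900, 128600, 128000); // 600 tokens over
```

If the measured overage rate stays near zero, option 1 (allow temporary overage) is likely the cheapest policy; a high rate argues for options 3 or 4.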

---

## The Theory

Full design doc: see the original `Context Paging` spec.

The key insight: **full messages are never discarded**. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."