# Context Paging

**Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.**

---

## The Problem

Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.

## The Solution

Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.

## The Analogy

This is virtual memory. The context window is RAM. The message store is the page table. The original messages are disk. A tool call requesting an MD5 hash is a page fault.

---

## Architecture: Three Nested Loops

```
USER sends message
        │
        ▼
┌─────────────────────────────────┐
│ LOOP 2 — Context Fitting        │
│ Compress history until it fits  │
└─────────────┬───────────────────┘
              │ fitted context
              ▼
┌─────────────────────────────────┐
│ LOOP 3 — Dereference            │
│ LLM may request full msgs       │
│ via MD5 → inject & re-run       │
└─────────────┬───────────────────┘
              │ final response
              ▼
USER receives response
```

### Loop 2 — Fit

`ContextPaging::fit()` compresses messages until they fit within the context window:

1. Count total tokens in all messages
2. If under budget → done
3. Take oldest non-summarized message
4. Compute MD5 hash, store original in message store
5. Replace with summary + hash pointer: `[md5:a3f8c1e9...] User asked about Q3 revenue...`
6. Repeat until under budget

**Rule:** The last message (current user request) is **never** summarized.
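
A minimal sketch of this loop, assuming `CacheInterface` exposes a `set()` method (the real `fit()` operates on a `ContextRequest` and may differ in detail):

```php
<?php
// Sketch of Loop 2 — illustrative, not the library's exact internals.

use ContextPaging\TokenCounter;
use ContextPaging\SummarizerInterface;
use ContextPaging\CacheInterface;

function fitMessages(
    array $messages,
    int $budget,
    TokenCounter $counter,
    SummarizerInterface $summarizer,
    CacheInterface $messageStore,  // set() is assumed here
): array {
    $i = 0;
    // The last message (current user request) is never summarized.
    while ($counter->contextSize($messages) > $budget && $i < count($messages) - 1) {
        $original = $messages[$i]['content'];
        if (!str_starts_with($original, '[md5:')) {        // skip already-summarized entries
            $hash = md5($original);
            $messageStore->set('msg:' . $hash, $original); // page the original out to the store
            $summary = $summarizer->summarize($original);
            $messages[$i]['content'] = "[md5:$hash] $summary"; // pointer + summary
        }
        $i++;
    }
    return $messages;
}
```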

### Loop 3 — Execute

`ContextPaging::execute()` runs the LLM and handles dereference requests:

1. Send fitted context to LLM
2. If response contains `fetch_message` tool call with MD5 → continue
3. Look up original message, inject into context
4. Re-send to LLM
5. If response is normal text (no tool calls) → done, return to user
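
A sketch of that dereference loop. The callable matches the `execute()` example in the API section below; the parser call and the injected tool-message shape are assumptions, not confirmed internals:

```php
<?php
// Sketch of Loop 3 — illustrative, not ContextPaging::execute()'s exact code.

function executeSketch(array $messages, array $options, callable $llm, $parser, $messageStore): array
{
    while (true) {
        $response = $llm($messages, $options);
        $calls = $parser->parse($response);   // extract fetch_message tool calls, if any
        if ($calls === []) {
            return $response;                 // plain text — done, return to user
        }
        foreach ($calls as $call) {
            $md5 = $call['arguments']['md5'];
            $original = $messageStore->get('msg:' . $md5); // page fault: load the original
            $messages[] = [
                'role'    => 'tool',
                'content' => $original ?? "No message found for md5 $md5",
            ];
        }
        // Loop: re-send with the dereferenced message(s) injected.
    }
}
```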

---

## Project Structure

```
context-paging/
├── src/
│   ├── ContextPaging.php              # Main class — fit() + execute()
│   ├── TokenCounter.php               # Shells out to Rust binary
│   ├── ContextRequest.php             # Extended ServerRequest
│   ├── OpenAICompatibleClient.php     # Guzzle-based LLM client
│   ├── CompletionsClientInterface.php
│   ├── LLMSummarizer.php              # LLM-backed summarizer
│   ├── SummarizerInterface.php
│   ├── CacheInterface.php             # Cache abstraction
│   ├── InMemoryCache.php              # In-memory implementation
│   ├── RedisCache.php                 # Redis implementation
│   ├── ToolCallParser.php             # Parse tool calls from responses
│   ├── ToolFormatter.php              # Format tools for requests
│   └── ToolCallMode.php               # NATIVE/RAW/AUTO enum
├── tests/
│   ├── ContextPagingTest.php          # Core functionality tests
│   ├── OpenAICompatibleClientTest.php # LLM client tests
│   ├── SummarizerTest.php             # Summarization tests
│   ├── RedisCacheTest.php             # Redis persistence tests
│   ├── ToolCallParserTest.php
│   ├── ToolFormatterTest.php
│   └── fluff.md                       # Test article for summarization
├── token-counter                      # Rust binary (tiktoken)
├── index.php                          # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md
```

---

## Quick Start

### Prerequisites

- PHP 8.5+
- Composer
- Rust binary at `./token-counter` (or rebuild from `~/dev/token-counter/`)

### Install

```bash
composer install
```

This installs:

- `guzzlehttp/guzzle` — HTTP client for LLM API calls
- `guzzlehttp/psr7` — PSR-7 message implementations
- `predis/predis` — Redis client (optional, only if using RedisCache)

### Run Tests

```bash
./vendor/bin/phpunit

# With testdox output
./vendor/bin/phpunit --testdox

# Run specific test file
./vendor/bin/phpunit tests/SummarizerTest.php
```

### CLI Usage

```bash
# Pipe JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php

# Or pass as argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'
```

---

## API

### ContextPaging

```php
use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;

// Create summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
    baseUrl: 'http://your-llm-endpoint/v1',
    apiKey: null, // optional for local endpoints
    timeout: 120
);

$summarizer = new LLMSummarizer(
    client: $summarizerClient,
    model: 'HuggingFaceTB/SmolLM3-3B',
    maxTokens: 200,
    temperature: 0.3
);

// Create main instance
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    summarizer: $summarizer
);

// Configure for your model
$contextPaging
    ->setMaxContextTokens(128000)
    ->setResponseReserve(4096);

// Set tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);

// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);

// LOOP 3: Execute with dereference handling
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
    return $client->chat($messages, $options);
});
```

### TokenCounter

```php
use ContextPaging\TokenCounter;

$counter = new TokenCounter();

// Count tokens in a string
$tokens = $counter->count("Hello, world!");
// Returns: 4

// Count with different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");

// Count context size for chat messages
$tokens = $counter->contextSize([
    ['role' => 'user', 'content' => 'Hello!'],
    ['role' => 'assistant', 'content' => 'Hi there!'],
]);
```

### OpenAICompatibleClient

```php
use ContextPaging\OpenAICompatibleClient;

$client = new OpenAICompatibleClient(
    baseUrl: 'http://95.179.247.150/v1',
    apiKey: null,
    timeout: 120,
    verifySsl: false
);

// Chat completion
$response = $client->chat([
    ['role' => 'user', 'content' => 'Hello!']
], [
    'model' => 'HuggingFaceTB/SmolLM3-3B',
    'max_tokens' => 100
]);

// List models
$models = $client->listModels();
```

### LLMSummarizer

```php
use ContextPaging\LLMSummarizer;

$summarizer = new LLMSummarizer(
    client: $client,
    model: 'HuggingFaceTB/SmolLM3-3B',
    systemPrompt: 'Summarize concisely, preserving key information.',
    maxTokens: 200,
    temperature: 0.3
);

$summary = $summarizer->summarize($longText);
```

---

## Tool Call Modes

The system supports two tool call formats for the dereference operation, plus an AUTO mode that detects which one to use:

### NATIVE Mode

For models with working tool call parsers (GPT-4, Claude, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
```

- Tools sent as `tools` array in request payload
- Tool calls returned in `tool_calls` array in response
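
For reference, the `fetch_message` tool in standard OpenAI function-calling shape might look like this — the exact schema the library emits is an assumption:

```php
// Hypothetical shape of the NATIVE-mode tools payload.
$tools = [[
    'type' => 'function',
    'function' => [
        'name'        => 'fetch_message',
        'description' => 'Fetch the full original message behind an [md5:...] pointer.',
        'parameters'  => [
            'type'       => 'object',
            'properties' => [
                'md5' => ['type' => 'string', 'description' => 'MD5 hash of the original message'],
            ],
            'required'   => ['md5'],
        ],
    ],
]];
```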

### RAW Mode

For models with broken/missing tool parsers (SmolLM3, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::RAW);
```

- Tools injected into system prompt with XML-style format
- Model outputs tool calls as markers: `<tool_call>{"name": "fetch_message", "arguments": {"md5": "..."}}</tool_call>`
- Parsed from response content
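
Extracting these markers is a small regex job. A sketch of the idea (the actual `ToolCallParser` implementation may differ):

```php
// Sketch of RAW-mode parsing — illustrative, not ToolCallParser's exact code.
function parseRawToolCalls(string $content): array
{
    preg_match_all('/<tool_call>(.*?)<\/tool_call>/s', $content, $matches);

    $calls = [];
    foreach ($matches[1] as $json) {
        $decoded = json_decode(trim($json), true);
        if (is_array($decoded) && isset($decoded['name'])) {
            // e.g. ['name' => 'fetch_message', 'arguments' => ['md5' => '...']]
            $calls[] = $decoded;
        }
    }
    return $calls;
}
```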

### AUTO Mode

Detects mode from first response:

```php
$contextPaging->setToolCallMode(ToolCallMode::AUTO);
```
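
The detection can be as simple as checking where the first response puts its tool calls. A sketch of one plausible heuristic — this is an assumption, not the library's confirmed logic:

```php
use ContextPaging\ToolCallMode;

// Hypothetical AUTO detection heuristic.
function detectMode(array $response): ToolCallMode
{
    $message = $response['choices'][0]['message'] ?? [];

    if (!empty($message['tool_calls'])) {
        return ToolCallMode::NATIVE; // model uses the structured tool_calls array
    }
    if (str_contains($message['content'] ?? '', '<tool_call>')) {
        return ToolCallMode::RAW;    // model emits inline markers instead
    }
    return ToolCallMode::NATIVE;     // no tool call present — default to native
}
```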

---

## Implementation Status

| Component | Status | Notes |
|-----------|--------|-------|
| Token counting | ✅ Done | Rust binary via `tiktoken-rs` |
| Fit loop (Loop 2) | ✅ Done | Summarization via LLM |
| Message store | ✅ Done | Redis or in-memory; persistent cache support |
| Summary cache | ✅ Done | Redis or in-memory; persistent cache support |
| Dereference loop (Loop 3) | ✅ Done | Tool call parsing implemented |
| Tool call parser | ✅ Done | NATIVE and RAW modes |
| Tool formatter | ✅ Done | NATIVE and RAW modes |
| LLM client | ✅ Done | OpenAI-compatible via Guzzle |
| LLMSummarizer | ✅ Done | Uses configured model |
| RedisCache | ✅ Done | Persistent storage via Predis |
| Tests | ✅ 36 passing | Unit + integration tests |

---

## Caching

### In-Memory Cache (Default)

By default, ContextPaging uses in-memory caches that exist for the duration of a single request:

```php
$contextPaging = new ContextPaging();
// Uses InMemoryCache internally
```

### Redis Cache (Persistent)

For persistent storage across requests, use Redis:

```php
use ContextPaging\RedisCache;

// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_msg:',  // Key prefix for namespacing
    defaultTtl: null     // No expiry (or set TTL in seconds)
);

$summaryCache = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_sum:'
);

// Inject into ContextPaging
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    messageStore: $messageStore,
    summaryCache: $summaryCache
);
```

**Benefits of Redis:**

- Summaries persist between requests (no re-summarization)
- Message store survives process restarts
- Share context across multiple workers/servers

**Key Namespacing:**

- Message store uses keys: `prefix:msg:{md5}`
- Summary cache uses keys: `prefix:summary:{md5}`

---

## Testing

### Run All Tests

```bash
./vendor/bin/phpunit --testdox
```

### Test Categories

**ContextPagingTest** (6 tests)
- Small payloads pass through unchanged
- Large payloads trigger summarization
- Last message is never summarized
- Original messages stored for dereferencing
- Error when last message is too large

**OpenAICompatibleClientTest** (8 tests)
- Basic chat completion
- Usage stats returned
- Multi-turn conversation context retention
- List models endpoint
- RAW tool formatting
- Tool call parser detection

**SummarizerTest** (4 tests)
- Summarization reduces token count (typically 75-85%)
- Key information preserved
- Multi-article summarization
- Usage stats accuracy

**ToolCallParserTest** (5 tests)
- Extract native OpenAI tool calls
- Extract raw XML-style tool calls
- Auto-detect mode from response

**ToolFormatterTest** (5 tests)
- Format for native API
- Format for raw system prompt injection

**RedisCacheTest** (9 tests)
- Set and get operations
- Key existence checks
- Delete operations
- TTL expiration
- ContextPaging with Redis cache
- Summary persistence between requests
- In-memory vs Redis parity
- Message store persistence across instances

### Integration Test Requirements

Some tests require a running LLM endpoint. The default configuration uses:

- **URL:** `http://95.179.247.150/v1`
- **Model:** `HuggingFaceTB/SmolLM3-3B`

To use a different endpoint, modify `setUp()` in the test files.

---

## Token Counter Binary

The `token-counter` binary is a Rust CLI tool using `tiktoken-rs`:

```bash
# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4

# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
```

Source: `~/dev/token-counter/`
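
`TokenCounter.php` wraps this binary by shelling out. A minimal sketch of that bridge — the real class may differ in error handling and binary discovery:

```php
// Sketch of how TokenCounter might shell out to the binary — illustrative only.
function countTokens(string $text, string $encoding = 'cl100k_base'): int
{
    $cmd  = './token-counter ' . escapeshellarg($encoding);
    $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];

    $proc = proc_open($cmd, $spec, $pipes);
    if (!is_resource($proc)) {
        throw new RuntimeException('token-counter binary not found');
    }

    fwrite($pipes[0], $text); // send text on stdin, like `echo ... | ./token-counter`
    fclose($pipes[0]);

    $out = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    proc_close($proc);

    return (int) trim($out);  // the binary prints the token count
}
```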

---

## Open Design Decisions

### Dereference Overage

When a message gets dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:

1. Allow temporary overage for one turn
2. Drop other messages flagged as irrelevant
3. Re-summarize something else
4. Tighten summary quality to reduce dereferences

**Recommendation:** Instrument from day one. Log every dereference, its token cost, and the final context size. Let real-world data drive the decision.

---

## The Theory

Full design doc: See the original `Context Paging` spec.

The key insight: **full messages are never discarded**. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."