# Context Paging
**Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.**
---
## The Problem
Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.
## The Solution
Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.
## The Analogy
This is virtual memory. The context window is RAM. The hash pointers embedded in summaries are the page table. The message store holding the originals is disk. A tool call requesting an MD5 hash is a page fault.
---
## Architecture: Three Nested Loops
```
USER sends message
┌─────────────────────────────────┐
│ LOOP 2 — Context Fitting │
│ Compress history until it fits │
└─────────────┬───────────────────┘
│ fitted context
┌─────────────────────────────────┐
│ LOOP 3 — Dereference │
│ LLM may request full msgs │
│ via MD5 → inject & re-run │
└─────────────┬───────────────────┘
│ final response
USER receives response
```
(Loop 1 is the outer conversation loop itself: the user send/receive cycle shown at the diagram's edges.)
### Loop 2 — Fit
`ContextPaging::fit()` compresses messages until they fit within the context window:
1. Count total tokens in all messages
2. If under budget → done
3. Take oldest non-summarized message
4. Compute MD5 hash, store original in message store
5. Replace with summary + hash pointer: `[md5:a3f8c1e9...] User asked about Q3 revenue...`
6. Repeat until under budget
**Rule:** The last message (current user request) is **never** summarized.
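The steps above can be sketched in PHP. This is a simplified illustration, not the actual `ContextPaging::fit()`: the token counter and summarizer are stand-in callables, and storing the original in the message store is only noted in a comment.

```php
<?php
// Sketch of Loop 2: replace oldest non-summarized messages with
// pointer-tagged summaries until the conversation fits the budget.
// $countTokens and $summarize are stand-ins for the real components.
function fitContext(array $messages, int $budget, callable $countTokens, callable $summarize): array
{
    $fitted = $messages;
    $i = 0;
    // Never touch the last message (the current user request).
    while ($countTokens($fitted) > $budget && $i < count($fitted) - 1) {
        $msg = $fitted[$i];
        // Skip messages that are already summaries (pointer prefix present).
        if (!str_starts_with($msg['content'], '[md5:')) {
            $hash = md5($msg['content']);
            // The real system stores the original in the message store here.
            $fitted[$i]['content'] = "[md5:$hash] " . $summarize($msg['content']);
        }
        $i++;
    }
    if ($countTokens($fitted) > $budget) {
        throw new RuntimeException('Context cannot be fitted: last message too large');
    }
    return $fitted;
}
```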
### Loop 3 — Execute
`ContextPaging::execute()` runs the LLM and handles dereference requests:
1. Send fitted context to LLM
2. If response contains `fetch_message` tool call with MD5 → continue
3. Look up original message, inject into context
4. Re-send to LLM
5. If response is normal text (no tool calls) → done, return to user
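A minimal sketch of this loop. The response shape and tool-call structure here are simplified assumptions; the real `execute()` works with full response objects and the tool call parser.

```php
<?php
// Sketch of Loop 3: call the LLM, and whenever the response is a
// fetch_message tool call, inject the original message and re-run.
// A round cap guards against the model dereferencing forever.
function executeWithDereference(array $messages, array $store, callable $llm, int $maxRounds = 5): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        $response = $llm($messages);
        // Normal text response: done, return to user.
        if (!isset($response['tool_call'])) {
            return $response['content'];
        }
        $md5 = $response['tool_call']['arguments']['md5'];
        $original = $store[$md5] ?? '(message not found)';
        // Inject the dereferenced original as a tool result and loop again.
        $messages[] = ['role' => 'tool', 'content' => $original];
    }
    throw new RuntimeException('Too many dereference rounds');
}
```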
---
## Project Structure
```
context-paging/
├── src/
│ ├── ContextPaging.php # Main class — fit() + execute()
│ ├── TokenCounter.php # Shells out to Rust binary
│ ├── ContextRequest.php # Extended ServerRequest
│ ├── OpenAICompatibleClient.php # Guzzle-based LLM client
│ ├── CompletionsClientInterface.php
│ ├── LLMSummarizer.php # LLM-backed summarizer
│ ├── SummarizerInterface.php
│ ├── CacheInterface.php # Cache abstraction
│ ├── InMemoryCache.php # In-memory implementation
│ ├── RedisCache.php # Redis implementation
│ ├── ToolCallParser.php # Parse tool calls from responses
│ ├── ToolFormatter.php # Format tools for requests
│ └── ToolCallMode.php # NATIVE/RAW/AUTO enum
├── tests/
│ ├── ContextPagingTest.php # Core functionality tests
│ ├── OpenAICompatibleClientTest.php # LLM client tests
│ ├── SummarizerTest.php # Summarization tests
│ ├── RedisCacheTest.php # Redis persistence tests
│ ├── ToolCallParserTest.php
│ ├── ToolFormatterTest.php
│ └── fluff.md # Test article for summarization
├── token-counter # Rust binary (tiktoken)
├── index.php # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md
```
---
## Quick Start
### Prerequisites
- PHP 8.5+
- Composer
- Rust binary at `./token-counter` (or rebuild from `~/dev/token-counter/`)
### Install
```bash
composer install
```
This installs:
- `guzzlehttp/guzzle` — HTTP client for LLM API calls
- `guzzlehttp/psr7` — PSR-7 message implementations
- `predis/predis` — Redis client (optional, only if using RedisCache)
### Run Tests
```bash
./vendor/bin/phpunit
# With testdox output
./vendor/bin/phpunit --testdox
# Run specific test file
./vendor/bin/phpunit tests/SummarizerTest.php
```
### CLI Usage
```bash
# Pipe JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php
# Or pass as argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'
```
---
## API
### ContextPaging
```php
use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;
// Create summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
baseUrl: 'http://your-llm-endpoint/v1',
apiKey: null, // optional for local endpoints
timeout: 120
);
$summarizer = new LLMSummarizer(
client: $summarizerClient,
model: 'HuggingFaceTB/SmolLM3-3B',
maxTokens: 200,
temperature: 0.3
);
// Create main instance
$contextPaging = new ContextPaging(
tokenCounter: new TokenCounter(),
summarizer: $summarizer
);
// Configure for your model
$contextPaging
->setMaxContextTokens(128000)
->setResponseReserve(4096);
// Set tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);
// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);
// LOOP 3: Execute with dereference handling
// ($client is your main LLM client — e.g. another OpenAICompatibleClient)
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
return $client->chat($messages, $options);
});
```
### TokenCounter
```php
use ContextPaging\TokenCounter;
$counter = new TokenCounter();
// Count tokens in a string
$tokens = $counter->count("Hello, world!");
// Returns: 4
// Count with different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");
// Count context size for chat messages
$tokens = $counter->contextSize([
['role' => 'user', 'content' => 'Hello!'],
['role' => 'assistant', 'content' => 'Hi there!'],
]);
```
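The project structure notes that `TokenCounter` shells out to the Rust binary. A hedged sketch of what that plumbing might look like; `countViaCli` is a hypothetical helper, and the real class may differ.

```php
<?php
// Hypothetical sketch of invoking a CLI counter: write the text to the
// child's stdin and read the count from stdout.
function countViaCli(string $text, string $command): int
{
    $proc = proc_open($command, [0 => ['pipe', 'r'], 1 => ['pipe', 'w']], $pipes);
    if (!is_resource($proc)) {
        throw new RuntimeException("Failed to start: $command");
    }
    fwrite($pipes[0], $text);
    fclose($pipes[0]); // EOF lets the child finish
    $out = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);
    return (int) trim($out);
}

// e.g. countViaCli($text, './token-counter cl100k_base');
```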
### OpenAICompatibleClient
```php
use ContextPaging\OpenAICompatibleClient;
$client = new OpenAICompatibleClient(
baseUrl: 'http://95.179.247.150/v1',
apiKey: null,
timeout: 120,
verifySsl: false
);
// Chat completion
$response = $client->chat([
['role' => 'user', 'content' => 'Hello!']
], [
'model' => 'HuggingFaceTB/SmolLM3-3B',
'max_tokens' => 100
]);
// List models
$models = $client->listModels();
```
### LLMSummarizer
```php
use ContextPaging\LLMSummarizer;
$summarizer = new LLMSummarizer(
client: $client,
model: 'HuggingFaceTB/SmolLM3-3B',
systemPrompt: 'Summarize concisely, preserving key information.',
maxTokens: 200,
temperature: 0.3
);
$summary = $summarizer->summarize($longText);
```
---
## Tool Call Modes
The system supports two tool call modes for the dereference operation, plus an AUTO mode that picks between them:
### NATIVE Mode
For models with working tool call parsers (GPT-4, Claude, etc.):
```php
$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
```
- Tools sent as `tools` array in request payload
- Tool calls returned in `tool_calls` array in response
### RAW Mode
For models with broken/missing tool parsers (SmolLM3, etc.):
```php
$contextPaging->setToolCallMode(ToolCallMode::RAW);
```
- Tools injected into system prompt with XML-style format
- Model outputs tool calls as markers: `<tool_call>{"name": "fetch_message", "arguments": {"md5": "..."}}</tool_call>`
- Parsed from response content
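Extracting those markers can be sketched with a regex. This is a simplified stand-in for the real `ToolCallParser`, which may handle more edge cases (malformed JSON, streaming, etc.).

```php
<?php
// Sketch of RAW-mode parsing: the model emits
// <tool_call>{...json...}</tool_call> markers inside plain text content.
function parseRawToolCalls(string $content): array
{
    preg_match_all('/<tool_call>(.*?)<\/tool_call>/s', $content, $m);
    $calls = [];
    foreach ($m[1] as $json) {
        $decoded = json_decode(trim($json), true);
        // Keep only well-formed calls that name a tool.
        if (is_array($decoded) && isset($decoded['name'])) {
            $calls[] = $decoded;
        }
    }
    return $calls;
}
```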
### AUTO Mode
Detects mode from first response:
```php
$contextPaging->setToolCallMode(ToolCallMode::AUTO);
```
---
## Implementation Status
| Component | Status | Notes |
|-----------|--------|-------|
| Token counting | ✅ Done | Rust binary via `tiktoken-rs` |
| Fit loop (Loop 2) | ✅ Done | Summarization via LLM |
| Message store | ✅ Redis or in-memory | Persistent cache support |
| Summary cache | ✅ Redis or in-memory | Persistent cache support |
| Dereference loop (Loop 3) | ✅ Done | Tool call parsing implemented |
| Tool call parser | ✅ Done | NATIVE and RAW modes |
| Tool formatter | ✅ Done | NATIVE and RAW modes |
| LLM client | ✅ Done | OpenAI-compatible via Guzzle |
| LLMSummarizer | ✅ Done | Uses configured model |
| RedisCache | ✅ Done | Persistent storage via Predis |
| Tests | ✅ 36 passing | Unit + integration tests |
---
## Caching
### In-Memory Cache (Default)
By default, ContextPaging uses in-memory caches that exist for the duration of a single request:
```php
$contextPaging = new ContextPaging();
// Uses InMemoryCache internally
```
### Redis Cache (Persistent)
For persistent storage across requests, use Redis:
```php
use ContextPaging\RedisCache;
// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
'rediss://user:password@host:port',
prefix: 'ctx_msg:', // Key prefix for namespacing
defaultTtl: null // No expiry (or set TTL in seconds)
);
$summaryCache = RedisCache::fromUrl(
'rediss://user:password@host:port',
prefix: 'ctx_sum:'
);
// Inject into ContextPaging
$contextPaging = new ContextPaging(
tokenCounter: new TokenCounter(),
messageStore: $messageStore,
summaryCache: $summaryCache
);
```
**Benefits of Redis:**
- Summaries persist between requests (no re-summarization)
- Message store survives process restarts
- Share context across multiple workers/servers
**Key Namespacing:**
- Message store uses keys: `prefix:msg:{md5}`
- Summary cache uses keys: `prefix:summary:{md5}`
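For illustration, the key construction might look like this. The helper names are hypothetical; only the `prefix + purpose + md5` shape comes from the scheme above.

```php
<?php
// Hypothetical helpers showing how namespaced Redis keys compose:
// configured prefix, purpose segment, then the message MD5.
function messageKey(string $prefix, string $md5): string
{
    return $prefix . 'msg:' . $md5;
}

function summaryKey(string $prefix, string $md5): string
{
    return $prefix . 'summary:' . $md5;
}
```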
---
## Testing
### Run All Tests
```bash
./vendor/bin/phpunit --testdox
```
### Test Categories
**ContextPagingTest** (6 tests)
- Small payloads pass through unchanged
- Large payloads trigger summarization
- Last message is never summarized
- Original messages stored for dereferencing
- Error when last message is too large
**OpenAICompatibleClientTest** (8 tests)
- Basic chat completion
- Usage stats returned
- Multi-turn conversation context retention
- List models endpoint
- RAW tool formatting
- Tool call parser detection
**SummarizerTest** (4 tests)
- Summarization reduces token count (typically 75-85%)
- Key information preserved
- Multi-article summarization
- Usage stats accuracy
**ToolCallParserTest** (5 tests)
- Extract native OpenAI tool calls
- Extract raw XML-style tool calls
- Auto-detect mode from response
**ToolFormatterTest** (5 tests)
- Format for native API
- Format for raw system prompt injection
**RedisCacheTest** (9 tests)
- Set and get operations
- Key existence checks
- Delete operations
- TTL expiration
- ContextPaging with Redis cache
- Summary persistence between requests
- In-memory vs Redis parity
- Message store persistence across instances
### Integration Test Requirements
Some tests require a running LLM endpoint. The default configuration uses:
- **URL:** `http://95.179.247.150/v1`
- **Model:** `HuggingFaceTB/SmolLM3-3B`
To use a different endpoint, modify `setUp()` in the test files.
---
## Token Counter Binary
The `token-counter` binary is a Rust CLI tool using `tiktoken-rs`:
```bash
# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4
# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
```
Source: `~/dev/token-counter/`
---
## Open Design Decisions
### Dereference Overage
When a message is dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:
1. Allow temporary overage for one turn
2. Drop other messages flagged as irrelevant
3. Re-summarize something else
4. Tighten summary quality to reduce dereferences
**Recommendation:** Instrument from day one. Log every dereference, token cost, and final count. Let real-world data drive the decision.
---
## The Theory
Full design doc: See the original `Context Paging` spec.
The key insight: **full messages are never discarded**. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."