Context Paging
Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.
The Problem
Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.
The Solution
Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.
The Analogy
This is virtual memory. The context window is RAM. The message store is the page table. The original messages are disk. A tool call requesting an MD5 hash is a page fault.
Architecture: Three Nested Loops
Loop 1 is the outer conversation itself (user sends a message, user receives a response); Loops 2 and 3 run inside each turn:
USER sends message
│
▼
┌─────────────────────────────────┐
│ LOOP 2 — Context Fitting │
│ Compress history until it fits │
└─────────────┬───────────────────┘
│ fitted context
▼
┌─────────────────────────────────┐
│ LOOP 3 — Dereference │
│ LLM may request full msgs │
│ via MD5 → inject & re-run │
└─────────────┬───────────────────┘
│ final response
▼
USER receives response
Loop 2 — Fit
ContextPaging::fit() compresses messages until they fit within the context window:
- Count total tokens in all messages
- If under budget → done
- Take oldest non-summarized message
- Compute MD5 hash, store original in message store
- Replace it with a summary plus hash pointer:
  `[md5:a3f8c1e9...] User asked about Q3 revenue...`
- Repeat until under budget
Rule: The last message (current user request) is never summarized.
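The steps above can be sketched in a few lines. This is a simplified illustration, not the actual `ContextPaging::fit()` internals — the function name `fitMessages`, the callback signatures, and the string-based store are all assumptions:

```php
<?php
// Simplified sketch of the fit loop; names and signatures are illustrative.
function fitMessages(
    array $messages,
    int $budget,
    callable $countTokens,   // array of messages -> int
    callable $summarize,     // string -> string
    array &$store            // md5 => original content
): array {
    // Walk from the oldest message; the last message is never summarized.
    for ($i = 0; $i < count($messages) - 1; $i++) {
        if ($countTokens($messages) <= $budget) {
            break; // under budget -> done
        }
        $original = $messages[$i]['content'];
        if (str_starts_with($original, '[md5:')) {
            continue; // already summarized
        }
        $hash = md5($original);
        $store[$hash] = $original; // "page out" the full message
        $messages[$i]['content'] = "[md5:$hash] " . $summarize($original);
    }
    return $messages;
}
```

Note that messages are paged out oldest-first, so recent context survives intact the longest.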
Loop 3 — Execute
ContextPaging::execute() runs the LLM and handles dereference requests:
- Send fitted context to LLM
- If the response contains a `fetch_message` tool call with an MD5 → continue
- Look up the original message and inject it into the context
- Re-send to LLM
- If response is normal text (no tool calls) → done, return to user
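The dereference loop can be sketched as follows. This is a simplified illustration under assumptions: `$callLlm` returns raw response text, tool calls arrive as RAW-mode `<tool_call>` markers, and the round cap is arbitrary — the real `execute()` differs in its details:

```php
<?php
// Simplified sketch of the dereference loop (Loop 3); names are illustrative.
function executeWithDereference(array $messages, array $store, callable $callLlm, int $maxRounds = 5): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        $response = $callLlm($messages);
        // Look for a fetch_message tool call carrying an MD5 pointer.
        if (preg_match('/<tool_call>(\{.*?\})<\/tool_call>/s', $response, $m)) {
            $call = json_decode($m[1], true);
            if (($call['name'] ?? null) === 'fetch_message') {
                $md5  = $call['arguments']['md5'] ?? '';
                $full = $store[$md5] ?? '(message not found)';
                // Inject the dereferenced message and re-run — the "page fault" path.
                $messages[] = ['role' => 'tool', 'content' => $full];
                continue;
            }
        }
        return $response; // plain text, no tool call -> final answer
    }
    return '(dereference limit reached)';
}
```

The round cap guards against a model that keeps requesting pointers forever.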
Project Structure
context-paging/
├── src/
│ ├── ContextPaging.php # Main class — fit() + execute()
│ ├── TokenCounter.php # Shells out to Rust binary
│ ├── ContextRequest.php # Extended ServerRequest
│ ├── OpenAICompatibleClient.php # Guzzle-based LLM client
│ ├── CompletionsClientInterface.php
│ ├── LLMSummarizer.php # LLM-backed summarizer
│ ├── SummarizerInterface.php
│ ├── CacheInterface.php # Cache abstraction
│ ├── InMemoryCache.php # In-memory implementation
│ ├── RedisCache.php # Redis implementation
│ ├── ToolCallParser.php # Parse tool calls from responses
│ ├── ToolFormatter.php # Format tools for requests
│ └── ToolCallMode.php # NATIVE/RAW/AUTO enum
├── tests/
│ ├── ContextPagingTest.php # Core functionality tests
│ ├── OpenAICompatibleClientTest.php # LLM client tests
│ ├── SummarizerTest.php # Summarization tests
│ ├── RedisCacheTest.php # Redis persistence tests
│ ├── ToolCallParserTest.php
│ ├── ToolFormatterTest.php
│ └── fluff.md # Test article for summarization
├── token-counter # Rust binary (tiktoken)
├── index.php # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md
Quick Start
Prerequisites
- PHP 8.5+
- Composer
- Rust binary at `./token-counter` (or rebuild from `~/dev/token-counter/`)
Install
composer install
This installs:
- `guzzlehttp/guzzle` — HTTP client for LLM API calls
- `guzzlehttp/psr7` — PSR-7 message implementation
- `predis/predis` — Redis client (optional, only needed when using RedisCache)
Run Tests
./vendor/bin/phpunit
# With testdox output
./vendor/bin/phpunit --testdox
# Run specific test file
./vendor/bin/phpunit tests/SummarizerTest.php
CLI Usage
# Pipe JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php
# Or pass as argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'
API
ContextPaging
use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;
// Create summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
baseUrl: 'http://your-llm-endpoint/v1',
apiKey: null, // optional for local endpoints
timeout: 120
);
$summarizer = new LLMSummarizer(
client: $summarizerClient,
model: 'HuggingFaceTB/SmolLM3-3B',
maxTokens: 200,
temperature: 0.3
);
// Create main instance
$contextPaging = new ContextPaging(
tokenCounter: new TokenCounter(),
summarizer: $summarizer
);
// Configure for your model
$contextPaging
->setMaxContextTokens(128000)
->setResponseReserve(4096);
// Set tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);
// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);
// LOOP 3: Execute with dereference handling
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
return $client->chat($messages, $options);
});
TokenCounter
use ContextPaging\TokenCounter;
$counter = new TokenCounter();
// Count tokens in a string
$tokens = $counter->count("Hello, world!");
// Returns: 4
// Count with different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");
// Count context size for chat messages
$tokens = $counter->contextSize([
['role' => 'user', 'content' => 'Hello!'],
['role' => 'assistant', 'content' => 'Hi there!'],
]);
OpenAICompatibleClient
use ContextPaging\OpenAICompatibleClient;
$client = new OpenAICompatibleClient(
baseUrl: 'http://95.179.247.150/v1',
apiKey: null,
timeout: 120,
verifySsl: false
);
// Chat completion
$response = $client->chat([
['role' => 'user', 'content' => 'Hello!']
], [
'model' => 'HuggingFaceTB/SmolLM3-3B',
'max_tokens' => 100
]);
// List models
$models = $client->listModels();
LLMSummarizer
use ContextPaging\LLMSummarizer;
$summarizer = new LLMSummarizer(
client: $client,
model: 'HuggingFaceTB/SmolLM3-3B',
systemPrompt: 'Summarize concisely, preserving key information.',
maxTokens: 200,
temperature: 0.3
);
$summary = $summarizer->summarize($longText);
Tool Call Modes
The system supports two tool call modes for the dereference operation:
NATIVE Mode
For models with working tool call parsers (GPT-4, Claude, etc.):
$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
- Tools sent as a `tools` array in the request payload
- Tool calls returned in the `tool_calls` array in the response
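In NATIVE mode, the dereference tool might be declared like this. The exact schema emitted by `ToolFormatter` is not shown in this README, so treat this as a hypothetical definition in the standard OpenAI function-calling shape:

```php
<?php
// Hypothetical fetch_message tool definition in OpenAI function-calling
// format; the schema ToolFormatter actually emits may differ.
$tools = [[
    'type' => 'function',
    'function' => [
        'name' => 'fetch_message',
        'description' => 'Retrieve the full original message behind a summary pointer.',
        'parameters' => [
            'type' => 'object',
            'properties' => [
                'md5' => [
                    'type' => 'string',
                    'description' => 'MD5 hash taken from a [md5:...] pointer',
                ],
            ],
            'required' => ['md5'],
        ],
    ],
]];
echo json_encode($tools, JSON_PRETTY_PRINT);
```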
RAW Mode
For models with broken/missing tool parsers (SmolLM3, etc.):
$contextPaging->setToolCallMode(ToolCallMode::RAW);
- Tools injected into system prompt with XML-style format
- Model outputs tool calls as inline markers:
  `<tool_call>{"name": "fetch_message", "arguments": {"md5": "..."}}</tool_call>`
- Markers are parsed from the response content
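Extracting RAW-mode markers boils down to a regex plus JSON decoding. A minimal sketch — the real `ToolCallParser` may differ in details such as whitespace handling or malformed-JSON recovery:

```php
<?php
// Sketch of RAW-mode tool call extraction; names are illustrative.
function parseRawToolCalls(string $content): array
{
    // Capture the JSON body between <tool_call> ... </tool_call> markers.
    preg_match_all('/<tool_call>\s*(\{.*?\})\s*<\/tool_call>/s', $content, $m);
    $calls = [];
    foreach ($m[1] as $json) {
        $decoded = json_decode($json, true);
        if (is_array($decoded) && isset($decoded['name'])) {
            $calls[] = $decoded; // keep only well-formed calls
        }
    }
    return $calls;
}
```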
AUTO Mode
Auto-detects NATIVE or RAW from the first response:
$contextPaging->setToolCallMode(ToolCallMode::AUTO);
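A plausible detection heuristic, assuming an OpenAI-style response shape (the actual AUTO logic is not documented here, so this is a sketch):

```php
<?php
// Sketch of AUTO-mode detection; the real heuristic may differ.
function detectToolCallMode(array $response): string
{
    // Structured tool_calls array present -> the model's parser works.
    if (!empty($response['choices'][0]['message']['tool_calls'])) {
        return 'NATIVE';
    }
    // Marker embedded in plain text -> fall back to RAW parsing.
    $content = $response['choices'][0]['message']['content'] ?? '';
    if (str_contains($content, '<tool_call>')) {
        return 'RAW';
    }
    return 'NONE'; // ordinary text response, no tool call at all
}
```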
Implementation Status
| Component | Status | Notes |
|---|---|---|
| Token counting | ✅ Done | Rust binary via tiktoken-rs |
| Fit loop (Loop 2) | ✅ Done | Summarization via LLM |
| Message store | ✅ Redis or in-memory | Persistent cache support |
| Summary cache | ✅ Redis or in-memory | Persistent cache support |
| Dereference loop (Loop 3) | ✅ Done | Tool call parsing implemented |
| Tool call parser | ✅ Done | NATIVE and RAW modes |
| Tool formatter | ✅ Done | NATIVE and RAW modes |
| LLM client | ✅ Done | OpenAI-compatible via Guzzle |
| LLMSummarizer | ✅ Done | Uses configured model |
| RedisCache | ✅ Done | Persistent storage via Predis |
| Tests | ✅ 36 passing | Unit + integration tests |
Caching
In-Memory Cache (Default)
By default, ContextPaging uses in-memory caches that exist for the duration of a single request:
$contextPaging = new ContextPaging();
// Uses InMemoryCache internally
Redis Cache (Persistent)
For persistent storage across requests, use Redis:
use ContextPaging\RedisCache;
// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
'rediss://user:password@host:port',
prefix: 'ctx_msg:', // Key prefix for namespacing
defaultTtl: null // No expiry (or set TTL in seconds)
);
$summaryCache = RedisCache::fromUrl(
'rediss://user:password@host:port',
prefix: 'ctx_sum:'
);
// Inject into ContextPaging
$contextPaging = new ContextPaging(
tokenCounter: new TokenCounter(),
messageStore: $messageStore,
summaryCache: $summaryCache
);
Benefits of Redis:
- Summaries persist between requests (no re-summarization)
- Message store survives process restarts
- Share context across multiple workers/servers
Key Namespacing:
- Message store uses keys of the form `prefix:msg:{md5}`
- Summary cache uses keys of the form `prefix:summary:{md5}`
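To make the scheme concrete — `buildKey()` below is a hypothetical helper for illustration, not part of RedisCache's public API:

```php
<?php
// Hypothetical helper showing how prefixed cache keys are composed.
function buildKey(string $prefix, string $kind, string $md5): string
{
    return "{$prefix}{$kind}:{$md5}";
}

echo buildKey('ctx_msg:', 'msg', 'a3f8c1e9'), "\n";     // message-store key
echo buildKey('ctx_sum:', 'summary', 'a3f8c1e9'), "\n"; // summary-cache key
```

Distinct prefixes keep the two caches from colliding even when they share one Redis database.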
Testing
Run All Tests
./vendor/bin/phpunit --testdox
Test Categories
ContextPagingTest (6 tests)
- Small payloads pass through unchanged
- Large payloads trigger summarization
- Last message is never summarized
- Original messages stored for dereferencing
- Error when last message is too large
OpenAICompatibleClientTest (8 tests)
- Basic chat completion
- Usage stats returned
- Multi-turn conversation context retention
- List models endpoint
- RAW tool formatting
- Tool call parser detection
SummarizerTest (4 tests)
- Summarization reduces token count (typically 75-85%)
- Key information preserved
- Multi-article summarization
- Usage stats accuracy
ToolCallParserTest (5 tests)
- Extract native OpenAI tool calls
- Extract raw XML-style tool calls
- Auto-detect mode from response
ToolFormatterTest (5 tests)
- Format for native API
- Format for raw system prompt injection
RedisCacheTest (9 tests)
- Set and get operations
- Key existence checks
- Delete operations
- TTL expiration
- ContextPaging with Redis cache
- Summary persistence between requests
- In-memory vs Redis parity
- Message store persistence across instances
Redis tests read `REDIS_URL` from the environment and are skipped if it is not set.
Integration Test Requirements
Some tests require a running LLM endpoint. The default configuration uses:
- URL: `http://95.179.247.150/v1`
- Model: `HuggingFaceTB/SmolLM3-3B`
To use a different endpoint, modify `setUp()` in the test files.
Token Counter Binary
The token-counter binary is a Rust CLI tool using tiktoken-rs:
# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4
# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
Source: ~/dev/token-counter/
Open Design Decisions
Dereference Overage
When a message gets dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:
- Allow temporary overage for one turn
- Drop other messages flagged as irrelevant
- Re-summarize something else
- Tighten summary quality to reduce dereferences
Recommendation: Instrument from day one. Log every dereference, token cost, and final count. Let real-world data drive the decision.
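A minimal instrumentation sketch in that spirit — the record shape and function name are assumptions, not part of the ContextPaging API:

```php
<?php
// Build a structured record for every dereference ("page fault").
// The field names here are illustrative.
function dereferenceRecord(string $md5, int $injectedTokens, int $contextTokensAfter): array
{
    return [
        'event'                => 'dereference',
        'md5'                  => $md5,
        'injected_tokens'      => $injectedTokens,      // cost of the paged-in message
        'context_tokens_after' => $contextTokensAfter,  // total after injection
    ];
}

// In Loop 3, after injecting the full message, one could log:
// error_log(json_encode(dereferenceRecord($md5, $cost, $total)));
```

With these records in hand, questions like "how often do dereferences overflow the budget?" become queries rather than guesses.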
The Theory
Full design doc: See the original Context Paging spec.
The key insight: full messages are never discarded. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."