# Context Paging

**Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.**

---

## The Problem

Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.

## The Solution

Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.

## The Analogy

This is virtual memory. The context window is RAM. The message store is the page table. The original messages are disk. A tool call requesting an MD5 hash is a page fault.

---

## Architecture: Three Nested Loops

```
USER sends message
        │
        ▼
┌─────────────────────────────────┐
│ LOOP 2 — Context Fitting        │
│ Compress history until it fits  │
└─────────────┬───────────────────┘
              │ fitted context
              ▼
┌─────────────────────────────────┐
│ LOOP 3 — Dereference            │
│ LLM may request full msgs       │
│ via MD5 → inject & re-run       │
└─────────────┬───────────────────┘
              │ final response
              ▼
USER receives response
```

### Loop 2 — Fit

`ContextPaging::fit()` compresses messages until they fit within the context window:

1. Count total tokens in all messages
2. If under budget → done
3. Take the oldest non-summarized message
4. Compute its MD5 hash, store the original in the message store
5. Replace it with a summary + hash pointer: `[md5:a3f8c1e9...] User asked about Q3 revenue...`
6. Repeat until under budget

**Rule:** The last message (the current user request) is **never** summarized.

### Loop 3 — Execute

`ContextPaging::execute()` runs the LLM and handles dereference requests:

1. Send the fitted context to the LLM
2. If the response contains a `fetch_message` tool call with an MD5 → continue
3. Look up the original message and inject it into the context
4. Re-send to the LLM
5. If the response is normal text (no tool calls) → done, return to the user
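The fit loop above can be sketched in a few lines of PHP. This is a simplified illustration, not the library's implementation: `fitContext`, `$countTokens`, and `$summarize` are hypothetical stand-ins for `ContextPaging::fit()`, `TokenCounter`, and the configured summarizer.

```php
// Simplified sketch of the fit loop (Loop 2). The real implementation
// lives in ContextPaging::fit(); $countTokens and $summarize here are
// stand-ins for TokenCounter and the configured SummarizerInterface.
function fitContext(array $messages, int $budget, callable $countTokens, callable $summarize): array
{
    $total = fn(array $msgs) => array_sum(array_map($countTokens, array_column($msgs, 'content')));

    // Walk oldest-first, but never touch the last message (current user request).
    for ($i = 0; $i < count($messages) - 1 && $total($messages) > $budget; $i++) {
        $original = $messages[$i]['content'];
        if (str_starts_with($original, '[md5:')) {
            continue; // already summarized
        }
        $hash = md5($original);
        // In the real system the original is stored in the message store here.
        $messages[$i]['content'] = "[md5:{$hash}] " . $summarize($original);
    }
    return $messages;
}
```

With a word-count token counter this replaces the oldest messages with `[md5:...]` pointers until the budget is met, leaving the final message untouched.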
---

## Project Structure

```
context-paging/
├── src/
│   ├── ContextPaging.php               # Main class — fit() + execute()
│   ├── TokenCounter.php                # Shells out to Rust binary
│   ├── ContextRequest.php              # Extended ServerRequest
│   ├── OpenAICompatibleClient.php      # Guzzle-based LLM client
│   ├── CompletionsClientInterface.php
│   ├── LLMSummarizer.php               # LLM-backed summarizer
│   ├── SummarizerInterface.php
│   ├── CacheInterface.php              # Cache abstraction
│   ├── InMemoryCache.php               # In-memory implementation
│   ├── RedisCache.php                  # Redis implementation
│   ├── ToolCallParser.php              # Parse tool calls from responses
│   ├── ToolFormatter.php               # Format tools for requests
│   └── ToolCallMode.php                # NATIVE/RAW/AUTO enum
├── tests/
│   ├── ContextPagingTest.php           # Core functionality tests
│   ├── OpenAICompatibleClientTest.php  # LLM client tests
│   ├── SummarizerTest.php              # Summarization tests
│   ├── RedisCacheTest.php              # Redis persistence tests
│   ├── ToolCallParserTest.php
│   ├── ToolFormatterTest.php
│   └── fluff.md                        # Test article for summarization
├── token-counter                       # Rust binary (tiktoken)
├── index.php                           # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md
```

---

## Quick Start

### Prerequisites

- PHP 8.5+
- Composer
- Rust binary at `./token-counter` (or rebuild from `~/dev/token-counter/`)

### Install

```bash
composer install
```

This installs:

- `guzzlehttp/guzzle` — HTTP client for LLM API calls
- `guzzlehttp/psr7` — PSR-7 message implementations
- `predis/predis` — Redis client (optional, only if using RedisCache)

### Run Tests

```bash
./vendor/bin/phpunit

# With testdox output
./vendor/bin/phpunit --testdox

# Run a specific test file
./vendor/bin/phpunit tests/SummarizerTest.php
```

### CLI Usage

```bash
# Pipe a JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php

# Or pass it as an argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'
```

---

## API

### ContextPaging

```php
use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;

// Create a summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
    baseUrl: 'http://your-llm-endpoint/v1',
    apiKey: null,      // optional for local endpoints
    timeout: 120
);

$summarizer = new LLMSummarizer(
    client: $summarizerClient,
    model: 'HuggingFaceTB/SmolLM3-3B',
    maxTokens: 200,
    temperature: 0.3
);

// Create the main instance
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    summarizer: $summarizer
);

// Configure for your model
$contextPaging
    ->setMaxContextTokens(128000)
    ->setResponseReserve(4096);

// Set the tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);

// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);

// LOOP 3: Execute with dereference handling
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
    return $client->chat($messages, $options);
});
```

### TokenCounter

```php
use ContextPaging\TokenCounter;

$counter = new TokenCounter();

// Count tokens in a string
$tokens = $counter->count("Hello, world!"); // Returns: 4

// Count with a different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");

// Count the context size of chat messages
$tokens = $counter->contextSize([
    ['role' => 'user', 'content' => 'Hello!'],
    ['role' => 'assistant', 'content' => 'Hi there!'],
]);
```

### OpenAICompatibleClient

```php
use ContextPaging\OpenAICompatibleClient;

$client = new OpenAICompatibleClient(
    baseUrl: 'http://95.179.247.150/v1',
    apiKey: null,
    timeout: 120,
    verifySsl: false
);

// Chat completion
$response = $client->chat([
    ['role' => 'user', 'content' => 'Hello!']
], [
    'model' => 'HuggingFaceTB/SmolLM3-3B',
    'max_tokens' => 100
]);

// List models
$models = $client->listModels();
```
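For offline work it can help to satisfy the client contract with a stub. A minimal sketch, assuming the client exposes a `chat(array $messages, array $options = []): array` method shaped like the examples above; the `ChatClient` interface and `CannedChatClient` class here are hypothetical, not the project's `CompletionsClientInterface`.

```php
// Hypothetical stub with the same chat() shape as OpenAICompatibleClient,
// for code paths that should not hit a live endpoint. The interface
// signature is an assumption based on the usage examples above.
interface ChatClient
{
    public function chat(array $messages, array $options = []): array;
}

final class CannedChatClient implements ChatClient
{
    public function __construct(private array $replies) {}

    public function chat(array $messages, array $options = []): array
    {
        // Return the next canned reply in OpenAI-compatible response shape.
        $content = array_shift($this->replies) ?? '';
        return [
            'choices' => [
                ['message' => ['role' => 'assistant', 'content' => $content]],
            ],
        ];
    }
}
```

Anything that accepts a `chat()`-shaped callable or client can then be exercised with `new CannedChatClient(['Hi there!'])`.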
### LLMSummarizer

```php
use ContextPaging\LLMSummarizer;

$summarizer = new LLMSummarizer(
    client: $client,
    model: 'HuggingFaceTB/SmolLM3-3B',
    systemPrompt: 'Summarize concisely, preserving key information.',
    maxTokens: 200,
    temperature: 0.3
);

$summary = $summarizer->summarize($longText);
```

---

## Tool Call Modes

The system supports two tool call formats (plus auto-detection) for the dereference operation:

### NATIVE Mode

For models with working tool call parsers (GPT-4, Claude, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
```

- Tools are sent as a `tools` array in the request payload
- Tool calls are returned in the `tool_calls` array of the response

### RAW Mode

For models with broken or missing tool parsers (SmolLM3, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::RAW);
```

- Tools are injected into the system prompt in an XML-style format
- The model outputs tool calls as markers: `{"name": "fetch_message", "arguments": {"md5": "..."}}`
- Markers are parsed out of the response content

### AUTO Mode

Detects the mode from the first response:

```php
$contextPaging->setToolCallMode(ToolCallMode::AUTO);
```

---

## Implementation Status

| Component | Status | Notes |
|-----------|--------|-------|
| Token counting | ✅ Done | Rust binary via `tiktoken-rs` |
| Fit loop (Loop 2) | ✅ Done | Summarization via LLM |
| Message store | ✅ Redis or in-memory | Persistent cache support |
| Summary cache | ✅ Redis or in-memory | Persistent cache support |
| Dereference loop (Loop 3) | ✅ Done | Tool call parsing implemented |
| Tool call parser | ✅ Done | NATIVE and RAW modes |
| Tool formatter | ✅ Done | NATIVE and RAW modes |
| LLM client | ✅ Done | OpenAI-compatible via Guzzle |
| LLMSummarizer | ✅ Done | Uses configured model |
| RedisCache | ✅ Done | Persistent storage via Predis |
| Tests | ✅ 36 passing | Unit + integration tests |

---

## Caching

### In-Memory Cache (Default)

By default, ContextPaging uses in-memory caches that exist for the duration of a single request:

```php
$contextPaging = new ContextPaging(); // Uses InMemoryCache internally
```
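The cache abstraction behind both backends is small. A sketch of a TTL-aware in-memory store follows; the method names (`set`/`get`/`has`/`delete`) are assumptions inferred from the RedisCacheTest descriptions, not the actual `CacheInterface`, and `TtlArrayCache` is a hypothetical class.

```php
// Sketch of a TTL-aware in-memory cache. Method names are assumptions
// inferred from the test descriptions (set/get, existence checks, delete,
// TTL expiration); the real CacheInterface may differ.
final class TtlArrayCache
{
    /** @var array<string, array{value: mixed, expiresAt: ?float}> */
    private array $items = [];

    public function set(string $key, mixed $value, ?int $ttl = null): void
    {
        $this->items[$key] = [
            'value' => $value,
            'expiresAt' => $ttl === null ? null : microtime(true) + $ttl,
        ];
    }

    public function get(string $key): mixed
    {
        return $this->has($key) ? $this->items[$key]['value'] : null;
    }

    public function has(string $key): bool
    {
        if (!isset($this->items[$key])) {
            return false;
        }
        $expiresAt = $this->items[$key]['expiresAt'];
        if ($expiresAt !== null && microtime(true) >= $expiresAt) {
            unset($this->items[$key]); // lazily evict expired entries
            return false;
        }
        return true;
    }

    public function delete(string $key): void
    {
        unset($this->items[$key]);
    }
}
```

Lazy eviction on read keeps the sketch simple; a Redis backend gets the same behavior for free via native key TTLs.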
### Redis Cache (Persistent)

For persistent storage across requests, use Redis:

```php
use ContextPaging\RedisCache;

// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_msg:',  // Key prefix for namespacing
    defaultTtl: null     // No expiry (or set a TTL in seconds)
);

$summaryCache = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_sum:'
);

// Inject into ContextPaging
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    messageStore: $messageStore,
    summaryCache: $summaryCache
);
```

**Benefits of Redis:**

- Summaries persist between requests (no re-summarization)
- The message store survives process restarts
- Context can be shared across multiple workers/servers

**Key Namespacing:**

- Message store keys: `prefix:msg:{md5}`
- Summary cache keys: `prefix:summary:{md5}`

---

## Testing

### Run All Tests

```bash
./vendor/bin/phpunit --testdox
```

### Test Categories

**ContextPagingTest** (6 tests)

- Small payloads pass through unchanged
- Large payloads trigger summarization
- The last message is never summarized
- Original messages are stored for dereferencing
- Error when the last message is too large

**OpenAICompatibleClientTest** (8 tests)

- Basic chat completion
- Usage stats returned
- Multi-turn conversation context retention
- List models endpoint
- RAW tool formatting
- Tool call parser detection

**SummarizerTest** (4 tests)

- Summarization reduces token count (typically 75-85%)
- Key information preserved
- Multi-article summarization
- Usage stats accuracy

**ToolCallParserTest** (5 tests)

- Extract native OpenAI tool calls
- Extract raw XML-style tool calls
- Auto-detect mode from response

**ToolFormatterTest** (5 tests)

- Format for the native API
- Format for raw system prompt injection

**RedisCacheTest** (9 tests)

- Set and get operations
- Key existence checks
- Delete operations
- TTL expiration
- ContextPaging with a Redis cache
- Summary persistence between requests
- In-memory vs. Redis parity
- Message store persistence across instances
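The "extract raw XML-style tool calls" behavior that ToolCallParserTest covers can be sketched as a regex pass over the response content. An illustration only: the marker shape comes from the RAW mode description, `extractFetchMessageMd5` is a hypothetical helper, and the real `ToolCallParser` may work differently.

```php
// Sketch: extract a RAW-mode fetch_message marker from response text.
// The marker shape {"name": "...", "arguments": {...}} comes from the
// RAW mode description; the real ToolCallParser may differ. Note the
// lazy .*? can misfire if the surrounding prose contains stray braces.
function extractFetchMessageMd5(string $content): ?string
{
    // Find a JSON object whose "name" is "fetch_message".
    if (!preg_match('/\{[^{}]*"name"\s*:\s*"fetch_message".*?\}\s*\}/s', $content, $m)) {
        return null;
    }
    $call = json_decode($m[0], true);
    if (!is_array($call)) {
        return null;
    }
    return $call['arguments']['md5'] ?? null;
}
```

A response with no marker yields `null`, which is exactly the "normal text, return to user" branch of Loop 3.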
### Integration Test Requirements

Some tests require a running LLM endpoint. The default configuration uses:

- **URL:** `http://95.179.247.150/v1`
- **Model:** `HuggingFaceTB/SmolLM3-3B`

To use a different endpoint, modify `setUp()` in the test files.

---

## Token Counter Binary

The `token-counter` binary is a Rust CLI tool using `tiktoken-rs`:

```bash
# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4

# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
```

Source: `~/dev/token-counter/`

---

## Open Design Decisions

### Dereference Overage

When a message is dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:

1. Allow a temporary overage for one turn
2. Drop other messages flagged as irrelevant
3. Re-summarize something else
4. Improve summary quality to reduce dereferences

**Recommendation:** Instrument from day one. Log every dereference, its token cost, and the final token count. Let real-world data drive the decision.

---

## The Theory

Full design doc: see the original `Context Paging` spec.

The key insight: **full messages are never discarded**. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."
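The never-discarded guarantee is easiest to see end to end. Below is a minimal sketch of the dereference loop (Loop 3), with closures standing in for the real LLM client and `ToolCallParser`; the function name, message shapes, and the `tool` role used for injection are illustrative, not the library's API.

```php
// Minimal sketch of the dereference loop (Loop 3). $callLlm stands in
// for the real client; $extractMd5 stands in for ToolCallParser.
// Shapes here are illustrative, not the library's API.
function runWithDereference(
    array $messages,
    array $messageStore,   // md5 => original message text (the "disk")
    callable $callLlm,     // fn(array $messages): string
    callable $extractMd5,  // fn(string $response): ?string
    int $maxFaults = 5     // guard against endless page-fault loops
): string {
    $response = '';
    for ($fault = 0; $fault <= $maxFaults; $fault++) {
        $response = $callLlm($messages);
        $md5 = $extractMd5($response);
        if ($md5 === null) {
            return $response; // normal text, no tool call → done
        }
        // "Page fault": inject the full original message and re-run.
        $original = $messageStore[$md5] ?? '(message not found)';
        $messages[] = ['role' => 'tool', 'content' => $original];
    }
    return $response;
}
```

The `$maxFaults` bound is one way to keep a confused model from dereferencing forever; it also interacts with the overage question above, since each fault grows the context.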