# Context Paging

**Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.**

---

## The Problem

Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.

## The Solution

Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.

## The Analogy

This is virtual memory. The context window is RAM. The message store is the page table. The original messages are disk. A tool call requesting an MD5 hash is a page fault.

---

## Architecture: Three Nested Loops

```
USER sends message
        │
        ▼
┌─────────────────────────────────┐
│ LOOP 2 — Context Fitting        │
│ Compress history until it fits  │
└─────────────┬───────────────────┘
              │ fitted context
              ▼
┌─────────────────────────────────┐
│ LOOP 3 — Dereference            │
│ LLM may request full msgs       │
│ via MD5 → inject & re-run       │
└─────────────┬───────────────────┘
              │ final response
              ▼
USER receives response
```

### Loop 2 — Fit

`ContextPaging::fit()` compresses messages until they fit within the context window:

1. Count total tokens in all messages
2. If under budget → done
3. Take oldest non-summarized message
4. Compute MD5 hash, store original in message store
5. Replace with summary + hash pointer: `[md5:a3f8c1e9...] User asked about Q3 revenue...`
6. Repeat until under budget

**Rule:** The last message (current user request) is **never** summarized.
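
A minimal sketch of this loop, assuming `CacheInterface` exposes a `set()` method (the real `fit()` operates on a `ContextRequest` and may differ in detail):

```php
<?php
// Sketch of Loop 2 — illustrative, not the library's exact internals.

use ContextPaging\TokenCounter;
use ContextPaging\SummarizerInterface;
use ContextPaging\CacheInterface;

function fitMessages(
    array $messages,
    int $budget,
    TokenCounter $counter,
    SummarizerInterface $summarizer,
    CacheInterface $messageStore,  // set() is assumed here
): array {
    $i = 0;
    // The last message (current user request) is never summarized.
    while ($counter->contextSize($messages) > $budget && $i < count($messages) - 1) {
        $original = $messages[$i]['content'];
        if (!str_starts_with($original, '[md5:')) {        // skip already-summarized entries
            $hash = md5($original);
            $messageStore->set('msg:' . $hash, $original); // page the original out to the store
            $summary = $summarizer->summarize($original);
            $messages[$i]['content'] = "[md5:$hash] $summary"; // pointer + summary
        }
        $i++;
    }
    return $messages;
}
```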

### Loop 3 — Execute

`ContextPaging::execute()` runs the LLM and handles dereference requests:

1. Send fitted context to LLM
2. If response contains `fetch_message` tool call with MD5 → continue
3. Look up original message, inject into context
4. Re-send to LLM
5. If response is normal text (no tool calls) → done, return to user
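
A sketch of that dereference loop. The callable matches the `execute()` example in the API section below; the parser call and the injected tool-message shape are assumptions, not confirmed internals:

```php
<?php
// Sketch of Loop 3 — illustrative, not ContextPaging::execute()'s exact code.

function executeSketch(array $messages, array $options, callable $llm, $parser, $messageStore): array
{
    while (true) {
        $response = $llm($messages, $options);
        $calls = $parser->parse($response);   // extract fetch_message tool calls, if any
        if ($calls === []) {
            return $response;                 // plain text — done, return to user
        }
        foreach ($calls as $call) {
            $md5 = $call['arguments']['md5'];
            $original = $messageStore->get('msg:' . $md5); // page fault: load the original
            $messages[] = [
                'role'    => 'tool',
                'content' => $original ?? "No message found for md5 $md5",
            ];
        }
        // Loop: re-send with the dereferenced message(s) injected.
    }
}
```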

---

## Project Structure

```
context-paging/
├── src/
│   ├── ContextPaging.php              # Main class — fit() + execute()
│   ├── TokenCounter.php               # Shells out to Rust binary
│   ├── ContextRequest.php             # Extended ServerRequest
│   ├── OpenAICompatibleClient.php     # Guzzle-based LLM client
│   ├── CompletionsClientInterface.php
│   ├── LLMSummarizer.php              # LLM-backed summarizer
│   ├── SummarizerInterface.php
│   ├── CacheInterface.php             # Cache abstraction
│   ├── InMemoryCache.php              # In-memory implementation
│   ├── RedisCache.php                 # Redis implementation
│   ├── ToolCallParser.php             # Parse tool calls from responses
│   ├── ToolFormatter.php              # Format tools for requests
│   └── ToolCallMode.php               # NATIVE/RAW/AUTO enum
├── tests/
│   ├── ContextPagingTest.php          # Core functionality tests
│   ├── OpenAICompatibleClientTest.php # LLM client tests
│   ├── SummarizerTest.php             # Summarization tests
│   ├── RedisCacheTest.php             # Redis persistence tests
│   ├── ToolCallParserTest.php
│   ├── ToolFormatterTest.php
│   └── fluff.md                       # Test article for summarization
├── token-counter                      # Rust binary (tiktoken)
├── index.php                          # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md
```

---

## Quick Start

### Prerequisites

- PHP 8.5+
- Composer
- Rust binary at `./token-counter` (or rebuild from `~/dev/token-counter/`)

### Install

```bash
composer install
```

This installs:

- `guzzlehttp/guzzle` — HTTP client for LLM API calls
- `guzzlehttp/psr7` — PSR-7 message implementations
- `predis/predis` — Redis client (optional, only if using RedisCache)

### Run Tests

```bash
./vendor/bin/phpunit

# With testdox output
./vendor/bin/phpunit --testdox

# Run specific test file
./vendor/bin/phpunit tests/SummarizerTest.php
```

### CLI Usage

```bash
# Pipe JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php

# Or pass as argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'
```

---

## API

### ContextPaging

```php
use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;

// Create summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
    baseUrl: 'http://your-llm-endpoint/v1',
    apiKey: null, // optional for local endpoints
    timeout: 120
);

$summarizer = new LLMSummarizer(
    client: $summarizerClient,
    model: 'HuggingFaceTB/SmolLM3-3B',
    maxTokens: 200,
    temperature: 0.3
);

// Create main instance
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    summarizer: $summarizer
);

// Configure for your model
$contextPaging
    ->setMaxContextTokens(128000)
    ->setResponseReserve(4096);

// Set tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);

// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);

// LOOP 3: Execute with dereference handling
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
    return $client->chat($messages, $options);
});
```

### TokenCounter

```php
use ContextPaging\TokenCounter;

$counter = new TokenCounter();

// Count tokens in a string
$tokens = $counter->count("Hello, world!");
// Returns: 4

// Count with different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");

// Count context size for chat messages
$tokens = $counter->contextSize([
    ['role' => 'user', 'content' => 'Hello!'],
    ['role' => 'assistant', 'content' => 'Hi there!'],
]);
```

### OpenAICompatibleClient

```php
use ContextPaging\OpenAICompatibleClient;

$client = new OpenAICompatibleClient(
    baseUrl: 'http://95.179.247.150/v1',
    apiKey: null,
    timeout: 120,
    verifySsl: false
);

// Chat completion
$response = $client->chat([
    ['role' => 'user', 'content' => 'Hello!']
], [
    'model' => 'HuggingFaceTB/SmolLM3-3B',
    'max_tokens' => 100
]);

// List models
$models = $client->listModels();
```

### LLMSummarizer

```php
use ContextPaging\LLMSummarizer;

$summarizer = new LLMSummarizer(
    client: $client,
    model: 'HuggingFaceTB/SmolLM3-3B',
    systemPrompt: 'Summarize concisely, preserving key information.',
    maxTokens: 200,
    temperature: 0.3
);

$summary = $summarizer->summarize($longText);
```

---

## Tool Call Modes

The system supports two tool call formats for the dereference operation, plus an AUTO mode that detects which one to use:

### NATIVE Mode

For models with working tool call parsers (GPT-4, Claude, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
```

- Tools sent as `tools` array in request payload
- Tool calls returned in `tool_calls` array in response
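
For reference, the `fetch_message` tool in standard OpenAI function-calling shape might look like this — the exact schema the library emits is an assumption:

```php
// Hypothetical shape of the NATIVE-mode tools payload.
$tools = [[
    'type' => 'function',
    'function' => [
        'name'        => 'fetch_message',
        'description' => 'Fetch the full original message behind an [md5:...] pointer.',
        'parameters'  => [
            'type'       => 'object',
            'properties' => [
                'md5' => ['type' => 'string', 'description' => 'MD5 hash of the original message'],
            ],
            'required'   => ['md5'],
        ],
    ],
]];
```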

### RAW Mode

For models with broken/missing tool parsers (SmolLM3, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::RAW);
```

- Tools injected into system prompt with XML-style format
- Model outputs tool calls as markers: `<tool_call>{"name": "fetch_message", "arguments": {"md5": "..."}}</tool_call>`
- Parsed from response content
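
Extracting these markers is a small regex job. A sketch of the idea (the actual `ToolCallParser` implementation may differ):

```php
// Sketch of RAW-mode parsing — illustrative, not ToolCallParser's exact code.
function parseRawToolCalls(string $content): array
{
    preg_match_all('/<tool_call>(.*?)<\/tool_call>/s', $content, $matches);

    $calls = [];
    foreach ($matches[1] as $json) {
        $decoded = json_decode(trim($json), true);
        if (is_array($decoded) && isset($decoded['name'])) {
            // e.g. ['name' => 'fetch_message', 'arguments' => ['md5' => '...']]
            $calls[] = $decoded;
        }
    }
    return $calls;
}
```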

### AUTO Mode

Detects mode from first response:

```php
$contextPaging->setToolCallMode(ToolCallMode::AUTO);
```
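
The detection can be as simple as checking where the first response puts its tool calls. A sketch of one plausible heuristic — this is an assumption, not the library's confirmed logic:

```php
use ContextPaging\ToolCallMode;

// Hypothetical AUTO detection heuristic.
function detectMode(array $response): ToolCallMode
{
    $message = $response['choices'][0]['message'] ?? [];

    if (!empty($message['tool_calls'])) {
        return ToolCallMode::NATIVE; // model uses the structured tool_calls array
    }
    if (str_contains($message['content'] ?? '', '<tool_call>')) {
        return ToolCallMode::RAW;    // model emits inline markers instead
    }
    return ToolCallMode::NATIVE;     // no tool call present — default to native
}
```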

---

## Implementation Status

| Component | Status | Notes |
|-----------|--------|-------|
| Token counting | ✅ Done | Rust binary via `tiktoken-rs` |
| Fit loop (Loop 2) | ✅ Done | Summarization via LLM |
| Message store | ✅ Done | Redis or in-memory; persistent cache support |
| Summary cache | ✅ Done | Redis or in-memory; persistent cache support |
| Dereference loop (Loop 3) | ✅ Done | Tool call parsing implemented |
| Tool call parser | ✅ Done | NATIVE and RAW modes |
| Tool formatter | ✅ Done | NATIVE and RAW modes |
| LLM client | ✅ Done | OpenAI-compatible via Guzzle |
| LLMSummarizer | ✅ Done | Uses configured model |
| RedisCache | ✅ Done | Persistent storage via Predis |
| Tests | ✅ 36 passing | Unit + integration tests |

---

## Caching

### In-Memory Cache (Default)

By default, ContextPaging uses in-memory caches that exist for the duration of a single request:

```php
$contextPaging = new ContextPaging();
// Uses InMemoryCache internally
```

### Redis Cache (Persistent)

For persistent storage across requests, use Redis:

```php
use ContextPaging\RedisCache;

// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_msg:',  // Key prefix for namespacing
    defaultTtl: null     // No expiry (or set TTL in seconds)
);

$summaryCache = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_sum:'
);

// Inject into ContextPaging
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    messageStore: $messageStore,
    summaryCache: $summaryCache
);
```

**Benefits of Redis:**

- Summaries persist between requests (no re-summarization)
- Message store survives process restarts
- Share context across multiple workers/servers

**Key Namespacing:**

- Message store uses keys: `prefix:msg:{md5}`
- Summary cache uses keys: `prefix:summary:{md5}`

---

## Testing

### Run All Tests

```bash
./vendor/bin/phpunit --testdox
```

### Test Categories

**ContextPagingTest** (6 tests)
- Small payloads pass through unchanged
- Large payloads trigger summarization
- Last message is never summarized
- Original messages stored for dereferencing
- Error when last message is too large

**OpenAICompatibleClientTest** (8 tests)
- Basic chat completion
- Usage stats returned
- Multi-turn conversation context retention
- List models endpoint
- RAW tool formatting
- Tool call parser detection

**SummarizerTest** (4 tests)
- Summarization reduces token count (typically 75-85%)
- Key information preserved
- Multi-article summarization
- Usage stats accuracy

**ToolCallParserTest** (5 tests)
- Extract native OpenAI tool calls
- Extract raw XML-style tool calls
- Auto-detect mode from response

**ToolFormatterTest** (5 tests)
- Format for native API
- Format for raw system prompt injection

**RedisCacheTest** (9 tests)
- Set and get operations
- Key existence checks
- Delete operations
- TTL expiration
- ContextPaging with Redis cache
- Summary persistence between requests
- In-memory vs Redis parity
- Message store persistence across instances

### Integration Test Requirements

Some tests require a running LLM endpoint. The default configuration uses:

- **URL:** `http://95.179.247.150/v1`
- **Model:** `HuggingFaceTB/SmolLM3-3B`

To use a different endpoint, modify `setUp()` in the test files.

---

## Token Counter Binary

The `token-counter` binary is a Rust CLI tool using `tiktoken-rs`:

```bash
# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4

# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
```

Source: `~/dev/token-counter/`
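
`TokenCounter.php` wraps this binary by shelling out. A minimal sketch of that bridge — the real class may differ in error handling and binary discovery:

```php
// Sketch of how TokenCounter might shell out to the binary — illustrative only.
function countTokens(string $text, string $encoding = 'cl100k_base'): int
{
    $cmd  = './token-counter ' . escapeshellarg($encoding);
    $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];

    $proc = proc_open($cmd, $spec, $pipes);
    if (!is_resource($proc)) {
        throw new RuntimeException('token-counter binary not found');
    }

    fwrite($pipes[0], $text); // send text on stdin, like `echo ... | ./token-counter`
    fclose($pipes[0]);

    $out = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    proc_close($proc);

    return (int) trim($out);  // the binary prints the token count
}
```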

---

## Open Design Decisions

### Dereference Overage

When a message gets dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:

1. Allow temporary overage for one turn
2. Drop other messages flagged as irrelevant
3. Re-summarize something else
4. Tighten summary quality to reduce dereferences

**Recommendation:** Instrument from day one. Log every dereference, its token cost, and the final context size. Let real-world data drive the decision.

---

## The Theory

Full design doc: See the original `Context Paging` spec.

The key insight: **full messages are never discarded**. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."