# Context Paging

**Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.**

---

## The Problem

Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.

## The Solution

Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.

## The Analogy

This is virtual memory. The context window is RAM. The message store is the page table. The original messages are disk. A tool call requesting an MD5 hash is a page fault.

---

## Architecture: Three Nested Loops

```
USER sends message
        │
        ▼
┌─────────────────────────────────┐
│ LOOP 2 — Context Fitting        │
│ Compress history until it fits  │
└─────────────┬───────────────────┘
              │ fitted context
              ▼
┌─────────────────────────────────┐
│ LOOP 3 — Dereference            │
│ LLM may request full msgs       │
│ via MD5 → inject & re-run       │
└─────────────┬───────────────────┘
              │ final response
              ▼
USER receives response
```

### Loop 2 — Fit

`ContextPaging::fit()` compresses messages until they fit within the context window:

1. Count total tokens in all messages
2. If under budget → done
3. Take the oldest non-summarized message
4. Compute its MD5 hash, store the original in the message store
5. Replace it with a summary + hash pointer: `[md5:a3f8c1e9...] User asked about Q3 revenue...`
6. Repeat until under budget

**Rule:** The last message (the current user request) is **never** summarized.

### Loop 3 — Execute

`ContextPaging::execute()` runs the LLM and handles dereference requests:

1. Send the fitted context to the LLM
2. If the response contains a `fetch_message` tool call with an MD5 → continue
3. Look up the original message and inject it into the context
4. Re-send to the LLM
5. If the response is normal text (no tool calls) → done, return to the user
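The fit loop above can be sketched in a few lines of PHP. This is a simplified illustration, not the library's implementation: `fitContext`, `$countTokens`, and `$summarize` are hypothetical stand-ins for `ContextPaging::fit()`, `TokenCounter`, and the configured summarizer.

```php
// Simplified sketch of the fit loop (Loop 2). The real implementation
// lives in ContextPaging::fit(); $countTokens and $summarize here are
// stand-ins for TokenCounter and the configured SummarizerInterface.
function fitContext(array $messages, int $budget, callable $countTokens, callable $summarize): array
{
    $total = fn(array $msgs) => array_sum(array_map($countTokens, array_column($msgs, 'content')));

    // Walk oldest-first, but never touch the last message (current user request).
    for ($i = 0; $i < count($messages) - 1 && $total($messages) > $budget; $i++) {
        $original = $messages[$i]['content'];
        if (str_starts_with($original, '[md5:')) {
            continue; // already summarized
        }
        $hash = md5($original);
        // In the real system the original is stored in the message store here.
        $messages[$i]['content'] = "[md5:{$hash}] " . $summarize($original);
    }
    return $messages;
}
```

With a word-count token counter this replaces the oldest messages with `[md5:...]` pointers until the budget is met, leaving the final message untouched.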
---

## Project Structure

```
context-paging/
├── src/
│   ├── ContextPaging.php               # Main class — fit() + execute()
│   ├── TokenCounter.php                # Shells out to Rust binary
│   ├── ContextRequest.php              # Extended ServerRequest
│   ├── OpenAICompatibleClient.php      # Guzzle-based LLM client
│   ├── CompletionsClientInterface.php
│   ├── LLMSummarizer.php               # LLM-backed summarizer
│   ├── SummarizerInterface.php
│   ├── CacheInterface.php              # Cache abstraction
│   ├── InMemoryCache.php               # In-memory implementation
│   ├── RedisCache.php                  # Redis implementation
│   ├── ToolCallParser.php              # Parse tool calls from responses
│   ├── ToolFormatter.php               # Format tools for requests
│   └── ToolCallMode.php                # NATIVE/RAW/AUTO enum
├── tests/
│   ├── ContextPagingTest.php           # Core functionality tests
│   ├── OpenAICompatibleClientTest.php  # LLM client tests
│   ├── SummarizerTest.php              # Summarization tests
│   ├── RedisCacheTest.php              # Redis persistence tests
│   ├── ToolCallParserTest.php
│   ├── ToolFormatterTest.php
│   └── fluff.md                        # Test article for summarization
├── token-counter                       # Rust binary (tiktoken)
├── index.php                           # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md
```

---

## Quick Start

### Prerequisites

- PHP 8.5+
- Composer
- Rust binary at `./token-counter` (or rebuild from `~/dev/token-counter/`)

### Install

```bash
composer install
```

This installs:

- `guzzlehttp/guzzle` — HTTP client for LLM API calls
- `guzzlehttp/psr7` — PSR-7 message implementations
- `predis/predis` — Redis client (optional, only if using RedisCache)

### Run Tests

```bash
./vendor/bin/phpunit

# With testdox output
./vendor/bin/phpunit --testdox

# Run a specific test file
./vendor/bin/phpunit tests/SummarizerTest.php
```

### CLI Usage

```bash
# Pipe a JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php

# Or pass it as an argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'
```

---

## API

### ContextPaging

```php
use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;

// Create a summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
    baseUrl: 'http://your-llm-endpoint/v1',
    apiKey: null,      // optional for local endpoints
    timeout: 120
);

$summarizer = new LLMSummarizer(
    client: $summarizerClient,
    model: 'HuggingFaceTB/SmolLM3-3B',
    maxTokens: 200,
    temperature: 0.3
);

// Create the main instance
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    summarizer: $summarizer
);

// Configure for your model
$contextPaging
    ->setMaxContextTokens(128000)
    ->setResponseReserve(4096);

// Set the tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);

// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);

// LOOP 3: Execute with dereference handling
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
    return $client->chat($messages, $options);
});
```

### TokenCounter

```php
use ContextPaging\TokenCounter;

$counter = new TokenCounter();

// Count tokens in a string
$tokens = $counter->count("Hello, world!"); // Returns: 4

// Count with a different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");

// Count the context size of chat messages
$tokens = $counter->contextSize([
    ['role' => 'user', 'content' => 'Hello!'],
    ['role' => 'assistant', 'content' => 'Hi there!'],
]);
```

### OpenAICompatibleClient

```php
use ContextPaging\OpenAICompatibleClient;

$client = new OpenAICompatibleClient(
    baseUrl: 'http://95.179.247.150/v1',
    apiKey: null,
    timeout: 120,
    verifySsl: false
);

// Chat completion
$response = $client->chat([
    ['role' => 'user', 'content' => 'Hello!']
], [
    'model' => 'HuggingFaceTB/SmolLM3-3B',
    'max_tokens' => 100
]);

// List models
$models = $client->listModels();
```
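For offline work it can help to satisfy the client contract with a stub. A minimal sketch, assuming the client exposes a `chat(array $messages, array $options = []): array` method shaped like the examples above; the `ChatClient` interface and `CannedChatClient` class here are hypothetical, not the project's `CompletionsClientInterface`.

```php
// Hypothetical stub with the same chat() shape as OpenAICompatibleClient,
// for code paths that should not hit a live endpoint. The interface
// signature is an assumption based on the usage examples above.
interface ChatClient
{
    public function chat(array $messages, array $options = []): array;
}

final class CannedChatClient implements ChatClient
{
    public function __construct(private array $replies) {}

    public function chat(array $messages, array $options = []): array
    {
        // Return the next canned reply in OpenAI-compatible response shape.
        $content = array_shift($this->replies) ?? '';
        return [
            'choices' => [
                ['message' => ['role' => 'assistant', 'content' => $content]],
            ],
        ];
    }
}
```

Anything that accepts a `chat()`-shaped callable or client can then be exercised with `new CannedChatClient(['Hi there!'])`.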
### LLMSummarizer

```php
use ContextPaging\LLMSummarizer;

$summarizer = new LLMSummarizer(
    client: $client,
    model: 'HuggingFaceTB/SmolLM3-3B',
    systemPrompt: 'Summarize concisely, preserving key information.',
    maxTokens: 200,
    temperature: 0.3
);

$summary = $summarizer->summarize($longText);
```

---

## Tool Call Modes

The system supports two tool call formats (plus auto-detection) for the dereference operation:

### NATIVE Mode

For models with working tool call parsers (GPT-4, Claude, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
```

- Tools are sent as a `tools` array in the request payload
- Tool calls are returned in the `tool_calls` array of the response

### RAW Mode

For models with broken or missing tool parsers (SmolLM3, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::RAW);
```

- Tools are injected into the system prompt in an XML-style format
- The model outputs tool calls as markers: `{"name": "fetch_message", "arguments": {"md5": "..."}}`
- Markers are parsed out of the response content

### AUTO Mode

Detects the mode from the first response:

```php
$contextPaging->setToolCallMode(ToolCallMode::AUTO);
```

---

## Implementation Status

| Component | Status | Notes |
|-----------|--------|-------|
| Token counting | ✅ Done | Rust binary via `tiktoken-rs` |
| Fit loop (Loop 2) | ✅ Done | Summarization via LLM |
| Message store | ✅ Redis or in-memory | Persistent cache support |
| Summary cache | ✅ Redis or in-memory | Persistent cache support |
| Dereference loop (Loop 3) | ✅ Done | Tool call parsing implemented |
| Tool call parser | ✅ Done | NATIVE and RAW modes |
| Tool formatter | ✅ Done | NATIVE and RAW modes |
| LLM client | ✅ Done | OpenAI-compatible via Guzzle |
| LLMSummarizer | ✅ Done | Uses configured model |
| RedisCache | ✅ Done | Persistent storage via Predis |
| Tests | ✅ 36 passing | Unit + integration tests |

---

## Caching

### In-Memory Cache (Default)

By default, ContextPaging uses in-memory caches that exist for the duration of a single request:

```php
$contextPaging = new ContextPaging(); // Uses InMemoryCache internally
```
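The cache abstraction behind both backends is small. A sketch of a TTL-aware in-memory store follows; the method names (`set`/`get`/`has`/`delete`) are assumptions inferred from the RedisCacheTest descriptions, not the actual `CacheInterface`, and `TtlArrayCache` is a hypothetical class.

```php
// Sketch of a TTL-aware in-memory cache. Method names are assumptions
// inferred from the test descriptions (set/get, existence checks, delete,
// TTL expiration); the real CacheInterface may differ.
final class TtlArrayCache
{
    /** @var array<string, array{value: mixed, expiresAt: ?float}> */
    private array $items = [];

    public function set(string $key, mixed $value, ?int $ttl = null): void
    {
        $this->items[$key] = [
            'value' => $value,
            'expiresAt' => $ttl === null ? null : microtime(true) + $ttl,
        ];
    }

    public function get(string $key): mixed
    {
        return $this->has($key) ? $this->items[$key]['value'] : null;
    }

    public function has(string $key): bool
    {
        if (!isset($this->items[$key])) {
            return false;
        }
        $expiresAt = $this->items[$key]['expiresAt'];
        if ($expiresAt !== null && microtime(true) >= $expiresAt) {
            unset($this->items[$key]); // lazily evict expired entries
            return false;
        }
        return true;
    }

    public function delete(string $key): void
    {
        unset($this->items[$key]);
    }
}
```

Lazy eviction on read keeps the sketch simple; a Redis backend gets the same behavior for free via native key TTLs.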
### Redis Cache (Persistent)

For persistent storage across requests, use Redis:

```php
use ContextPaging\RedisCache;

// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_msg:',  // Key prefix for namespacing
    defaultTtl: null     // No expiry (or set a TTL in seconds)
);

$summaryCache = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_sum:'
);

// Inject into ContextPaging
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    messageStore: $messageStore,
    summaryCache: $summaryCache
);
```

**Benefits of Redis:**

- Summaries persist between requests (no re-summarization)
- The message store survives process restarts
- Context can be shared across multiple workers/servers

**Key Namespacing:**

- Message store keys: `prefix:msg:{md5}`
- Summary cache keys: `prefix:summary:{md5}`

---

## Testing

### Run All Tests

```bash
./vendor/bin/phpunit --testdox
```

### Test Categories

**ContextPagingTest** (6 tests)

- Small payloads pass through unchanged
- Large payloads trigger summarization
- The last message is never summarized
- Original messages are stored for dereferencing
- Error when the last message is too large

**OpenAICompatibleClientTest** (8 tests)

- Basic chat completion
- Usage stats returned
- Multi-turn conversation context retention
- List models endpoint
- RAW tool formatting
- Tool call parser detection

**SummarizerTest** (4 tests)

- Summarization reduces token count (typically 75-85%)
- Key information preserved
- Multi-article summarization
- Usage stats accuracy

**ToolCallParserTest** (5 tests)

- Extract native OpenAI tool calls
- Extract raw XML-style tool calls
- Auto-detect mode from response

**ToolFormatterTest** (5 tests)

- Format for the native API
- Format for raw system prompt injection

**RedisCacheTest** (9 tests)

- Set and get operations
- Key existence checks
- Delete operations
- TTL expiration
- ContextPaging with a Redis cache
- Summary persistence between requests
- In-memory vs. Redis parity
- Message store persistence across instances
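The "extract raw XML-style tool calls" behavior that ToolCallParserTest covers can be sketched as a regex pass over the response content. An illustration only: the marker shape comes from the RAW mode description, `extractFetchMessageMd5` is a hypothetical helper, and the real `ToolCallParser` may work differently.

```php
// Sketch: extract a RAW-mode fetch_message marker from response text.
// The marker shape {"name": "...", "arguments": {...}} comes from the
// RAW mode description; the real ToolCallParser may differ. Note the
// lazy .*? can misfire if the surrounding prose contains stray braces.
function extractFetchMessageMd5(string $content): ?string
{
    // Find a JSON object whose "name" is "fetch_message".
    if (!preg_match('/\{[^{}]*"name"\s*:\s*"fetch_message".*?\}\s*\}/s', $content, $m)) {
        return null;
    }
    $call = json_decode($m[0], true);
    if (!is_array($call)) {
        return null;
    }
    return $call['arguments']['md5'] ?? null;
}
```

A response with no marker yields `null`, which is exactly the "normal text, return to user" branch of Loop 3.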
### Integration Test Requirements

Some tests require a running LLM endpoint. The default configuration uses:

- **URL:** `http://95.179.247.150/v1`
- **Model:** `HuggingFaceTB/SmolLM3-3B`

To use a different endpoint, modify `setUp()` in the test files.

---

## Token Counter Binary

The `token-counter` binary is a Rust CLI tool using `tiktoken-rs`:

```bash
# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4

# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
```

Source: `~/dev/token-counter/`

---

## Open Design Decisions

### Dereference Overage

When a message is dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:

1. Allow a temporary overage for one turn
2. Drop other messages flagged as irrelevant
3. Re-summarize something else
4. Improve summary quality to reduce dereferences

**Recommendation:** Instrument from day one. Log every dereference, its token cost, and the final token count. Let real-world data drive the decision.

---

## The Theory

Full design doc: see the original `Context Paging` spec.

The key insight: **full messages are never discarded**. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."
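The never-discarded guarantee is easiest to see end to end. Below is a minimal sketch of the dereference loop (Loop 3), with closures standing in for the real LLM client and `ToolCallParser`; the function name, message shapes, and the `tool` role used for injection are illustrative, not the library's API.

```php
// Minimal sketch of the dereference loop (Loop 3). $callLlm stands in
// for the real client; $extractMd5 stands in for ToolCallParser.
// Shapes here are illustrative, not the library's API.
function runWithDereference(
    array $messages,
    array $messageStore,   // md5 => original message text (the "disk")
    callable $callLlm,     // fn(array $messages): string
    callable $extractMd5,  // fn(string $response): ?string
    int $maxFaults = 5     // guard against endless page-fault loops
): string {
    $response = '';
    for ($fault = 0; $fault <= $maxFaults; $fault++) {
        $response = $callLlm($messages);
        $md5 = $extractMd5($response);
        if ($md5 === null) {
            return $response; // normal text, no tool call → done
        }
        // "Page fault": inject the full original message and re-run.
        $original = $messageStore[$md5] ?? '(message not found)';
        $messages[] = ['role' => 'tool', 'content' => $original];
    }
    return $response;
}
```

The `$maxFaults` bound is one way to keep a confused model from dereferencing forever; it also interacts with the overage question above, since each fault grows the context.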