# Context Paging

**Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.**

---

## The Problem

Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.

## The Solution

Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.

## The Analogy

This is virtual memory. The context window is RAM. The message store is the page table. The original messages are disk. A tool call requesting an MD5 hash is a page fault.

---

## Architecture: Three Nested Loops

```
USER sends message
              │
              ▼
┌─────────────────────────────────┐
│  LOOP 2 — Context Fitting       │
│  Compress history until it fits │
└─────────────┬───────────────────┘
              │ fitted context
              ▼
┌─────────────────────────────────┐
│  LOOP 3 — Dereference           │
│  LLM may request full msgs      │
│  via MD5 → inject & re-run      │
└─────────────┬───────────────────┘
              │ final response
              ▼
USER receives response
```

### Loop 2 — Fit

`ContextPaging::fit()` compresses messages until they fit within the context window:

1. Count total tokens in all messages
2. If under budget → done
3. Take the oldest non-summarized message
4. Compute its MD5 hash, store the original in the message store
5. Replace it with a summary + hash pointer: `[md5:a3f8c1e9...] User asked about Q3 revenue...`
6. Repeat until under budget

**Rule:** The last message (the current user request) is **never** summarized.
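
The fit loop above can be sketched in a few lines of PHP. This is an illustrative sketch, not the library's actual internals: `countTokens()` (a crude character-count estimate) and `summarize()` (a truncating placeholder) stand in for the real `TokenCounter` and `LLMSummarizer`.

```php
<?php
// Sketch of the fit loop — helper names and the token estimate are
// illustrative assumptions, not the actual ContextPaging internals.

/** Very rough token estimate: ~4 characters per token. */
function countTokens(string $text): int {
    return (int) ceil(strlen($text) / 4);
}

/** Placeholder summarizer: real code would call an LLM. */
function summarize(string $text): string {
    return substr($text, 0, 40) . '...';
}

function fit(array $messages, int $budget, array &$store): array {
    $total = fn(array $msgs) => array_sum(
        array_map(fn($m) => countTokens($m['content']), $msgs)
    );

    // Walk oldest-first; the last message is never summarized.
    for ($i = 0; $i < count($messages) - 1 && $total($messages) > $budget; $i++) {
        if (!empty($messages[$i]['summarized'])) {
            continue; // already a summary
        }
        $original = $messages[$i]['content'];
        $hash = md5($original);
        $store[$hash] = $original; // keep the full message for dereferencing
        $messages[$i]['content'] = "[md5:$hash] " . summarize($original);
        $messages[$i]['summarized'] = true;
    }
    return $messages;
}

$store = [];
$messages = [
    ['role' => 'user', 'content' => str_repeat('Q3 revenue details. ', 50)],
    ['role' => 'assistant', 'content' => str_repeat('Revenue was up. ', 50)],
    ['role' => 'user', 'content' => 'What changed most?'],
];
$fitted = fit($messages, 60, $store);
```

Note that the last message passes through untouched regardless of the budget, matching the rule above.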

### Loop 3 — Execute

`ContextPaging::execute()` runs the LLM and handles dereference requests:

1. Send fitted context to LLM
2. If response contains a `fetch_message` tool call with an MD5 → continue
3. Look up the original message, inject it into the context
4. Re-send to LLM
5. If response is normal text (no tool calls) → done, return to user
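
The steps above can be sketched as a loop over "page faults". The `$llm` callable and `extractFetchRequest()` helper are hypothetical stand-ins for the real client and tool-call parser; a fault cap is added so a misbehaving model cannot loop forever.

```php
<?php
// Sketch of the dereference loop — $llm and extractFetchRequest() are
// hypothetical stand-ins, not the library's real API.

/** Pull an MD5 out of a fetch_message tool call, if present. */
function extractFetchRequest(string $response): ?string {
    if (preg_match('/"md5"\s*:\s*"([0-9a-f]{32})"/', $response, $m)) {
        return $m[1];
    }
    return null;
}

function execute(array $messages, array $store, callable $llm, int $maxFaults = 3): string {
    $response = '';
    for ($fault = 0; $fault <= $maxFaults; $fault++) {
        $response = $llm($messages);
        $hash = extractFetchRequest($response);
        if ($hash === null) {
            return $response; // normal text — done
        }
        // "Page fault": inject the original message and re-run.
        $messages[] = [
            'role' => 'tool',
            'content' => $store[$hash] ?? "(no message stored for $hash)",
        ];
    }
    return $response;
}

// A fake LLM that "page faults" once, then answers.
$store = [md5('the full Q3 revenue breakdown') => 'the full Q3 revenue breakdown'];
$calls = 0;
$llm = function (array $messages) use (&$calls, $store): string {
    $calls++;
    if ($calls === 1) {
        $hash = array_key_first($store);
        return '<tool_call>{"name": "fetch_message", "arguments": {"md5": "' . $hash . '"}}</tool_call>';
    }
    return 'Final answer';
};

$answer = execute([['role' => 'user', 'content' => 'Summarize Q3.']], $store, $llm);
```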

---

## Project Structure

```
context-paging/
├── src/
│   ├── ContextPaging.php              # Main class — fit() + execute()
│   ├── TokenCounter.php               # Shells out to Rust binary
│   ├── ContextRequest.php             # Extended ServerRequest
│   ├── OpenAICompatibleClient.php     # Guzzle-based LLM client
│   ├── CompletionsClientInterface.php
│   ├── LLMSummarizer.php              # LLM-backed summarizer
│   ├── SummarizerInterface.php
│   ├── CacheInterface.php             # Cache abstraction
│   ├── InMemoryCache.php              # In-memory implementation
│   ├── RedisCache.php                 # Redis implementation
│   ├── ToolCallParser.php             # Parse tool calls from responses
│   ├── ToolFormatter.php              # Format tools for requests
│   └── ToolCallMode.php               # NATIVE/RAW/AUTO enum
├── tests/
│   ├── ContextPagingTest.php          # Core functionality tests
│   ├── OpenAICompatibleClientTest.php # LLM client tests
│   ├── SummarizerTest.php             # Summarization tests
│   ├── RedisCacheTest.php             # Redis persistence tests
│   ├── ToolCallParserTest.php
│   ├── ToolFormatterTest.php
│   └── fluff.md                       # Test article for summarization
├── token-counter                      # Rust binary (tiktoken)
├── index.php                          # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md
```

---

## Quick Start

### Prerequisites

- PHP 8.5+
- Composer
- Rust binary at `./token-counter` (or rebuild from `~/dev/token-counter/`)

### Install

```bash
composer install
```

This installs:

- `guzzlehttp/guzzle` — HTTP client for LLM API calls
- `guzzlehttp/psr7` — PSR-7 message implementations
- `predis/predis` — Redis client (optional, only needed for RedisCache)

### Run Tests

```bash
./vendor/bin/phpunit

# With testdox output
./vendor/bin/phpunit --testdox

# Run a specific test file
./vendor/bin/phpunit tests/SummarizerTest.php
```

### CLI Usage

```bash
# Pipe a JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php

# Or pass it as an argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'
```

---

## API

### ContextPaging

```php
use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;

// Create summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
    baseUrl: 'http://your-llm-endpoint/v1',
    apiKey: null,   // optional for local endpoints
    timeout: 120
);

$summarizer = new LLMSummarizer(
    client: $summarizerClient,
    model: 'HuggingFaceTB/SmolLM3-3B',
    maxTokens: 200,
    temperature: 0.3
);

// Create main instance
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    summarizer: $summarizer
);

// Configure for your model
$contextPaging
    ->setMaxContextTokens(128000)
    ->setResponseReserve(4096);

// Set tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);

// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);

// LOOP 3: Execute with dereference handling
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
    return $client->chat($messages, $options);
});
```

### TokenCounter

```php
use ContextPaging\TokenCounter;

$counter = new TokenCounter();

// Count tokens in a string
$tokens = $counter->count("Hello, world!");
// Returns: 4

// Count with a different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");

// Count context size for chat messages
$tokens = $counter->contextSize([
    ['role' => 'user', 'content' => 'Hello!'],
    ['role' => 'assistant', 'content' => 'Hi there!'],
]);
```

### OpenAICompatibleClient

```php
use ContextPaging\OpenAICompatibleClient;

$client = new OpenAICompatibleClient(
    baseUrl: 'http://95.179.247.150/v1',
    apiKey: null,
    timeout: 120,
    verifySsl: false
);

// Chat completion
$response = $client->chat([
    ['role' => 'user', 'content' => 'Hello!']
], [
    'model' => 'HuggingFaceTB/SmolLM3-3B',
    'max_tokens' => 100
]);

// List models
$models = $client->listModels();
```

### LLMSummarizer

```php
use ContextPaging\LLMSummarizer;

$summarizer = new LLMSummarizer(
    client: $client,
    model: 'HuggingFaceTB/SmolLM3-3B',
    systemPrompt: 'Summarize concisely, preserving key information.',
    maxTokens: 200,
    temperature: 0.3
);

$summary = $summarizer->summarize($longText);
```

---

## Tool Call Modes

The system supports three tool call modes for the dereference operation:

### NATIVE Mode

For models with working tool call parsers (GPT-4, Claude, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
```

- Tools are sent as a `tools` array in the request payload
- Tool calls are returned in the `tool_calls` array of the response

### RAW Mode

For models with broken or missing tool parsers (SmolLM3, etc.):

```php
$contextPaging->setToolCallMode(ToolCallMode::RAW);
```

- Tools are injected into the system prompt in an XML-style format
- The model outputs tool calls as markers: `<tool_call>{"name": "fetch_message", "arguments": {"md5": "..."}}</tool_call>`
- Markers are parsed out of the response content

### AUTO Mode

Detects the mode from the first response:

```php
$contextPaging->setToolCallMode(ToolCallMode::AUTO);
```
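
Extracting a RAW-mode marker from response text takes only a small regex pass. This sketch assumes the `<tool_call>` marker format shown above; the project's actual `ToolCallParser` may handle more cases.

```php
<?php
// Sketch of RAW-mode parsing — assumes the <tool_call> marker format
// shown above; the real ToolCallParser may differ.

/** @return array<int, array{name: string, arguments: array}> */
function parseRawToolCalls(string $content): array {
    $calls = [];
    if (preg_match_all('/<tool_call>(.*?)<\/tool_call>/s', $content, $matches)) {
        foreach ($matches[1] as $json) {
            $decoded = json_decode(trim($json), true);
            if (is_array($decoded) && isset($decoded['name'])) {
                $calls[] = $decoded; // keep only well-formed calls
            }
        }
    }
    return $calls;
}

$response = 'Let me check. <tool_call>{"name": "fetch_message", '
    . '"arguments": {"md5": "a3f8c1e9a3f8c1e9a3f8c1e9a3f8c1e9"}}</tool_call>';
$calls = parseRawToolCalls($response);
```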

---

## Implementation Status

| Component | Status | Notes |
|-----------|--------|-------|
| Token counting | ✅ Done | Rust binary via `tiktoken-rs` |
| Fit loop (Loop 2) | ✅ Done | Summarization via LLM |
| Message store | ✅ Redis or in-memory | Persistent cache support |
| Summary cache | ✅ Redis or in-memory | Persistent cache support |
| Dereference loop (Loop 3) | ✅ Done | Tool call parsing implemented |
| Tool call parser | ✅ Done | NATIVE and RAW modes |
| Tool formatter | ✅ Done | NATIVE and RAW modes |
| LLM client | ✅ Done | OpenAI-compatible via Guzzle |
| LLMSummarizer | ✅ Done | Uses configured model |
| RedisCache | ✅ Done | Persistent storage via Predis |
| Tests | ✅ 36 passing | Unit + integration tests |

---

## Caching

### In-Memory Cache (Default)

By default, ContextPaging uses in-memory caches that exist for the duration of a single request:

```php
$contextPaging = new ContextPaging();
// Uses InMemoryCache internally
```

### Redis Cache (Persistent)

For persistent storage across requests, use Redis:

```php
use ContextPaging\RedisCache;

// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_msg:',   // Key prefix for namespacing
    defaultTtl: null      // No expiry (or set a TTL in seconds)
);

$summaryCache = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_sum:'
);

// Inject into ContextPaging
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    messageStore: $messageStore,
    summaryCache: $summaryCache
);
```

**Benefits of Redis:**
- Summaries persist between requests (no re-summarization)
- Message store survives process restarts
- Context can be shared across multiple workers/servers

**Key Namespacing:**
- Message store uses keys: `prefix:msg:{md5}`
- Summary cache uses keys: `prefix:summary:{md5}`
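
The no-re-summarization benefit follows from keying summaries by content hash: the same text always maps to the same key, so a second request finds the cached summary instead of calling the LLM again. A minimal sketch, using a toy array cache and a truncating placeholder summarizer (the real `CacheInterface` may differ):

```php
<?php
// Sketch of summary-cache reuse keyed by content hash. The cache methods
// mirror a typical get/set interface; the actual CacheInterface may differ.

class ArrayCache {
    private array $data = [];
    public function get(string $key): ?string { return $this->data[$key] ?? null; }
    public function set(string $key, string $value): void { $this->data[$key] = $value; }
}

$summaryCache = new ArrayCache();
$llmCalls = 0;

// Summarize with cache: identical content is only summarized once.
$summarizeOnce = function (string $text) use ($summaryCache, &$llmCalls): string {
    $key = 'summary:' . md5($text);
    $cached = $summaryCache->get($key);
    if ($cached !== null) {
        return $cached; // cache hit — no LLM call
    }
    $llmCalls++; // stand-in for the real LLM summarization call
    $summary = substr($text, 0, 30) . '...';
    $summaryCache->set($key, $summary);
    return $summary;
};

$a = $summarizeOnce('A very long message about Q3 revenue and forecasts.');
$b = $summarizeOnce('A very long message about Q3 revenue and forecasts.');
```

With a Redis-backed cache in place of `ArrayCache`, the hit survives process restarts and is shared across workers.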

---

## Testing

### Run All Tests

```bash
./vendor/bin/phpunit --testdox
```

### Test Categories

**ContextPagingTest** (6 tests)
- Small payloads pass through unchanged
- Large payloads trigger summarization
- Last message is never summarized
- Original messages stored for dereferencing
- Error when the last message is too large

**OpenAICompatibleClientTest** (8 tests)
- Basic chat completion
- Usage stats returned
- Multi-turn conversation context retention
- List models endpoint
- RAW tool formatting
- Tool call parser detection

**SummarizerTest** (4 tests)
- Summarization reduces token count (typically 75–85%)
- Key information preserved
- Multi-article summarization
- Usage stats accuracy

**ToolCallParserTest** (5 tests)
- Extract native OpenAI tool calls
- Extract raw XML-style tool calls
- Auto-detect mode from response

**ToolFormatterTest** (5 tests)
- Format for native API
- Format for raw system prompt injection

**RedisCacheTest** (9 tests)
- Set and get operations
- Key existence checks
- Delete operations
- TTL expiration
- ContextPaging with Redis cache
- Summary persistence between requests
- In-memory vs Redis parity
- Message store persistence across instances

### Integration Test Requirements

Some tests require a running LLM endpoint. The default configuration uses:

- **URL:** `http://95.179.247.150/v1`
- **Model:** `HuggingFaceTB/SmolLM3-3B`

To use a different endpoint, modify `setUp()` in the test files.

---

## Token Counter Binary

The `token-counter` binary is a Rust CLI tool using `tiktoken-rs`:

```bash
# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4

# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
```

Source: `~/dev/token-counter/`
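
`TokenCounter.php` shells out to this binary. The call can be sketched with `proc_open`, assuming the binary reads the text on stdin and prints a single integer; the `$binary` parameter and argument passing are illustrative, not the real class's API.

```php
<?php
// Sketch of shelling out to ./token-counter — assumes the binary reads
// text on stdin and prints a token count. The $binary parameter is an
// illustrative extension, not the real TokenCounter API.

function countTokensViaBinary(string $text, string $binary = './token-counter', string ...$args): int {
    $proc = proc_open(
        array_merge([$binary], $args),          // e.g. ['./token-counter', 'o200k_base']
        [0 => ['pipe', 'r'], 1 => ['pipe', 'w']], // stdin, stdout
        $pipes
    );
    if (!is_resource($proc)) {
        throw new RuntimeException('failed to start ' . $binary);
    }
    fwrite($pipes[0], $text); // feed the text on stdin
    fclose($pipes[0]);
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);
    return (int) trim($output);
}
```

Passing the command as an array (PHP 7.4+) avoids shell-quoting issues with the input encoding name.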

---

## Open Design Decisions

### Dereference Overage

When a message gets dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:

1. Allow a temporary overage for one turn
2. Drop other messages flagged as irrelevant
3. Re-summarize something else
4. Tighten summary quality to reduce dereferences

**Recommendation:** Instrument from day one. Log every dereference, token cost, and final count. Let real-world data drive the decision.
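
One lightweight way to follow that recommendation, sketched with an in-memory logger; the field names and `overageRate()` metric are illustrative assumptions:

```php
<?php
// Sketch of dereference instrumentation — log each "page fault" so real
// usage data can drive the overage policy. Field names are illustrative.

class DereferenceLog {
    /** @var array<int, array> */
    private array $events = [];

    public function record(string $md5, int $injectedTokens, int $contextTokens, int $budget): void {
        $this->events[] = [
            'md5' => $md5,
            'injected_tokens' => $injectedTokens,
            'context_tokens' => $contextTokens,
            'over_budget' => max(0, $contextTokens - $budget),
            'at' => time(),
        ];
    }

    /** Fraction of dereferences that pushed the context past the budget. */
    public function overageRate(): float {
        if ($this->events === []) {
            return 0.0;
        }
        $over = count(array_filter($this->events, fn($e) => $e['over_budget'] > 0));
        return $over / count($this->events);
    }
}

$log = new DereferenceLog();
$log->record(md5('a'), 500, 127500, 128000); // within budget
$log->record(md5('b'), 900, 128600, 128000); // 600 tokens over
```

If the measured overage rate stays near zero, option 1 (allow temporary overage) is likely the cheapest policy; a high rate argues for options 3 or 4.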

---

## The Theory

Full design doc: see the original `Context Paging` spec.

The key insight: **full messages are never discarded**. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."