Context Paging
Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.
The Problem
Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.
The Solution
Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.
The Analogy
This is virtual memory. The context window is RAM. The message store is the page table. The original messages are disk. A tool call requesting an MD5 hash is a page fault.
Architecture: Three Nested Loops
Loop 1 is the outer conversation itself (user sends a message, user receives a response); Loops 2 and 3 run inside each turn:
USER sends message
│
▼
┌─────────────────────────────────┐
│ LOOP 2 — Context Fitting │
│ Compress history until it fits │
└─────────────┬───────────────────┘
│ fitted context
▼
┌─────────────────────────────────┐
│ LOOP 3 — Dereference │
│ LLM may request full msgs │
│ via MD5 → inject & re-run │
└─────────────┬───────────────────┘
│ final response
▼
USER receives response
Loop 2 — Fit
ContextPaging::fit() compresses messages until they fit within the context window:
- Count total tokens in all messages
- If under budget → done
- Take oldest non-summarized message
- Compute MD5 hash, store original in message store
- Replace it with a summary plus hash pointer:
  `[md5:a3f8c1e9...] User asked about Q3 revenue...`
- Repeat until under budget
Rule: The last message (current user request) is never summarized.
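The steps above can be sketched in a few lines. This is a simplified illustration, not the actual `ContextPaging::fit()` internals — the function name `fitMessages`, the callback signatures, and the string-based store are all assumptions:

```php
<?php
// Simplified sketch of the fit loop; names and signatures are illustrative.
function fitMessages(
    array $messages,
    int $budget,
    callable $countTokens,   // array of messages -> int
    callable $summarize,     // string -> string
    array &$store            // md5 => original content
): array {
    // Walk from the oldest message; the last message is never summarized.
    for ($i = 0; $i < count($messages) - 1; $i++) {
        if ($countTokens($messages) <= $budget) {
            break; // under budget -> done
        }
        $original = $messages[$i]['content'];
        if (str_starts_with($original, '[md5:')) {
            continue; // already summarized
        }
        $hash = md5($original);
        $store[$hash] = $original; // "page out" the full message
        $messages[$i]['content'] = "[md5:$hash] " . $summarize($original);
    }
    return $messages;
}
```

Note that messages are paged out oldest-first, so recent context survives intact the longest.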
Loop 3 — Execute
ContextPaging::execute() runs the LLM and handles dereference requests:
- Send fitted context to LLM
- If the response contains a `fetch_message` tool call with an MD5 → continue
- Look up the original message and inject it into the context
- Re-send to LLM
- If response is normal text (no tool calls) → done, return to user
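The dereference loop can be sketched as follows. This is a simplified illustration under assumptions: `$callLlm` returns raw response text, tool calls arrive as RAW-mode `<tool_call>` markers, and the round cap is arbitrary — the real `execute()` differs in its details:

```php
<?php
// Simplified sketch of the dereference loop (Loop 3); names are illustrative.
function executeWithDereference(array $messages, array $store, callable $callLlm, int $maxRounds = 5): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        $response = $callLlm($messages);
        // Look for a fetch_message tool call carrying an MD5 pointer.
        if (preg_match('/<tool_call>(\{.*?\})<\/tool_call>/s', $response, $m)) {
            $call = json_decode($m[1], true);
            if (($call['name'] ?? null) === 'fetch_message') {
                $md5  = $call['arguments']['md5'] ?? '';
                $full = $store[$md5] ?? '(message not found)';
                // Inject the dereferenced message and re-run — the "page fault" path.
                $messages[] = ['role' => 'tool', 'content' => $full];
                continue;
            }
        }
        return $response; // plain text, no tool call -> final answer
    }
    return '(dereference limit reached)';
}
```

The round cap guards against a model that keeps requesting pointers forever.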
Project Structure
context-paging/
├── src/
│ ├── ContextPaging.php # Main class — fit() + execute()
│ ├── TokenCounter.php # Shells out to Rust binary
│ ├── ContextRequest.php # Extended ServerRequest
│ ├── OpenAICompatibleClient.php # Guzzle-based LLM client
│ ├── CompletionsClientInterface.php
│ ├── LLMSummarizer.php # LLM-backed summarizer
│ ├── SummarizerInterface.php
│ ├── CacheInterface.php # Cache abstraction
│ ├── InMemoryCache.php # In-memory implementation
│ ├── RedisCache.php # Redis implementation
│ ├── ToolCallParser.php # Parse tool calls from responses
│ ├── ToolFormatter.php # Format tools for requests
│ └── ToolCallMode.php # NATIVE/RAW/AUTO enum
├── tests/
│ ├── ContextPagingTest.php # Core functionality tests
│ ├── OpenAICompatibleClientTest.php # LLM client tests
│ ├── SummarizerTest.php # Summarization tests
│ ├── RedisCacheTest.php # Redis persistence tests
│ ├── ToolCallParserTest.php
│ ├── ToolFormatterTest.php
│ └── fluff.md # Test article for summarization
├── token-counter # Rust binary (tiktoken)
├── index.php # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md
Quick Start
Prerequisites
- PHP 8.5+
- Composer
- Rust binary at `./token-counter` (or rebuild from `~/dev/token-counter/`)
Install
composer install
This installs:
- `guzzlehttp/guzzle` — HTTP client for LLM API calls
- `guzzlehttp/psr7` — PSR-7 message implementation
- `predis/predis` — Redis client (optional, only needed when using RedisCache)
Run Tests
./vendor/bin/phpunit
# With testdox output
./vendor/bin/phpunit --testdox
# Run specific test file
./vendor/bin/phpunit tests/SummarizerTest.php
CLI Usage
# Pipe JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php
# Or pass as argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'
API
ContextPaging
use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;
// Create summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
baseUrl: 'http://your-llm-endpoint/v1',
apiKey: null, // optional for local endpoints
timeout: 120
);
$summarizer = new LLMSummarizer(
client: $summarizerClient,
model: 'HuggingFaceTB/SmolLM3-3B',
maxTokens: 200,
temperature: 0.3
);
// Create main instance
$contextPaging = new ContextPaging(
tokenCounter: new TokenCounter(),
summarizer: $summarizer
);
// Configure for your model
$contextPaging
->setMaxContextTokens(128000)
->setResponseReserve(4096);
// Set tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);
// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);
// LOOP 3: Execute with dereference handling
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
return $client->chat($messages, $options);
});
TokenCounter
use ContextPaging\TokenCounter;
$counter = new TokenCounter();
// Count tokens in a string
$tokens = $counter->count("Hello, world!");
// Returns: 4
// Count with different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");
// Count context size for chat messages
$tokens = $counter->contextSize([
['role' => 'user', 'content' => 'Hello!'],
['role' => 'assistant', 'content' => 'Hi there!'],
]);
OpenAICompatibleClient
use ContextPaging\OpenAICompatibleClient;
$client = new OpenAICompatibleClient(
baseUrl: 'http://95.179.247.150/v1',
apiKey: null,
timeout: 120,
verifySsl: false
);
// Chat completion
$response = $client->chat([
['role' => 'user', 'content' => 'Hello!']
], [
'model' => 'HuggingFaceTB/SmolLM3-3B',
'max_tokens' => 100
]);
// List models
$models = $client->listModels();
LLMSummarizer
use ContextPaging\LLMSummarizer;
$summarizer = new LLMSummarizer(
client: $client,
model: 'HuggingFaceTB/SmolLM3-3B',
systemPrompt: 'Summarize concisely, preserving key information.',
maxTokens: 200,
temperature: 0.3
);
$summary = $summarizer->summarize($longText);
Tool Call Modes
The system supports two tool call modes for the dereference operation:
NATIVE Mode
For models with working tool call parsers (GPT-4, Claude, etc.):
$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
- Tools sent as a `tools` array in the request payload
- Tool calls returned in the `tool_calls` array in the response
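In NATIVE mode, the dereference tool might be declared like this. The exact schema emitted by `ToolFormatter` is not shown in this README, so treat this as a hypothetical definition in the standard OpenAI function-calling shape:

```php
<?php
// Hypothetical fetch_message tool definition in OpenAI function-calling
// format; the schema ToolFormatter actually emits may differ.
$tools = [[
    'type' => 'function',
    'function' => [
        'name' => 'fetch_message',
        'description' => 'Retrieve the full original message behind a summary pointer.',
        'parameters' => [
            'type' => 'object',
            'properties' => [
                'md5' => [
                    'type' => 'string',
                    'description' => 'MD5 hash taken from a [md5:...] pointer',
                ],
            ],
            'required' => ['md5'],
        ],
    ],
]];
echo json_encode($tools, JSON_PRETTY_PRINT);
```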
RAW Mode
For models with broken/missing tool parsers (SmolLM3, etc.):
$contextPaging->setToolCallMode(ToolCallMode::RAW);
- Tools injected into system prompt with XML-style format
- Model outputs tool calls as inline markers:
  `<tool_call>{"name": "fetch_message", "arguments": {"md5": "..."}}</tool_call>`
- Markers are parsed from the response content
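Extracting RAW-mode markers boils down to a regex plus JSON decoding. A minimal sketch — the real `ToolCallParser` may differ in details such as whitespace handling or malformed-JSON recovery:

```php
<?php
// Sketch of RAW-mode tool call extraction; names are illustrative.
function parseRawToolCalls(string $content): array
{
    // Capture the JSON body between <tool_call> ... </tool_call> markers.
    preg_match_all('/<tool_call>\s*(\{.*?\})\s*<\/tool_call>/s', $content, $m);
    $calls = [];
    foreach ($m[1] as $json) {
        $decoded = json_decode($json, true);
        if (is_array($decoded) && isset($decoded['name'])) {
            $calls[] = $decoded; // keep only well-formed calls
        }
    }
    return $calls;
}
```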
AUTO Mode
Auto-detects NATIVE or RAW from the first response:
$contextPaging->setToolCallMode(ToolCallMode::AUTO);
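A plausible detection heuristic, assuming an OpenAI-style response shape (the actual AUTO logic is not documented here, so this is a sketch):

```php
<?php
// Sketch of AUTO-mode detection; the real heuristic may differ.
function detectToolCallMode(array $response): string
{
    // Structured tool_calls array present -> the model's parser works.
    if (!empty($response['choices'][0]['message']['tool_calls'])) {
        return 'NATIVE';
    }
    // Marker embedded in plain text -> fall back to RAW parsing.
    $content = $response['choices'][0]['message']['content'] ?? '';
    if (str_contains($content, '<tool_call>')) {
        return 'RAW';
    }
    return 'NONE'; // ordinary text response, no tool call at all
}
```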
Implementation Status
| Component | Status | Notes |
|---|---|---|
| Token counting | ✅ Done | Rust binary via tiktoken-rs |
| Fit loop (Loop 2) | ✅ Done | Summarization via LLM |
| Message store | ✅ Redis or in-memory | Persistent cache support |
| Summary cache | ✅ Redis or in-memory | Persistent cache support |
| Dereference loop (Loop 3) | ✅ Done | Tool call parsing implemented |
| Tool call parser | ✅ Done | NATIVE and RAW modes |
| Tool formatter | ✅ Done | NATIVE and RAW modes |
| LLM client | ✅ Done | OpenAI-compatible via Guzzle |
| LLMSummarizer | ✅ Done | Uses configured model |
| RedisCache | ✅ Done | Persistent storage via Predis |
| Tests | ✅ 36 passing | Unit + integration tests |
Caching
In-Memory Cache (Default)
By default, ContextPaging uses in-memory caches that exist for the duration of a single request:
$contextPaging = new ContextPaging();
// Uses InMemoryCache internally
Redis Cache (Persistent)
For persistent storage across requests, use Redis:
use ContextPaging\RedisCache;
// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
'rediss://user:password@host:port',
prefix: 'ctx_msg:', // Key prefix for namespacing
defaultTtl: null // No expiry (or set TTL in seconds)
);
$summaryCache = RedisCache::fromUrl(
'rediss://user:password@host:port',
prefix: 'ctx_sum:'
);
// Inject into ContextPaging
$contextPaging = new ContextPaging(
tokenCounter: new TokenCounter(),
messageStore: $messageStore,
summaryCache: $summaryCache
);
Benefits of Redis:
- Summaries persist between requests (no re-summarization)
- Message store survives process restarts
- Share context across multiple workers/servers
Key Namespacing:
- Message store uses keys of the form `prefix:msg:{md5}`
- Summary cache uses keys of the form `prefix:summary:{md5}`
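To make the scheme concrete — `buildKey()` below is a hypothetical helper for illustration, not part of RedisCache's public API:

```php
<?php
// Hypothetical helper showing how prefixed cache keys are composed.
function buildKey(string $prefix, string $kind, string $md5): string
{
    return "{$prefix}{$kind}:{$md5}";
}

echo buildKey('ctx_msg:', 'msg', 'a3f8c1e9'), "\n";     // message-store key
echo buildKey('ctx_sum:', 'summary', 'a3f8c1e9'), "\n"; // summary-cache key
```

Distinct prefixes keep the two caches from colliding even when they share one Redis database.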
Testing
Run All Tests
./vendor/bin/phpunit --testdox
Test Categories
ContextPagingTest (6 tests)
- Small payloads pass through unchanged
- Large payloads trigger summarization
- Last message is never summarized
- Original messages stored for dereferencing
- Error when last message is too large
OpenAICompatibleClientTest (8 tests)
- Basic chat completion
- Usage stats returned
- Multi-turn conversation context retention
- List models endpoint
- RAW tool formatting
- Tool call parser detection
SummarizerTest (4 tests)
- Summarization reduces token count (typically 75-85%)
- Key information preserved
- Multi-article summarization
- Usage stats accuracy
ToolCallParserTest (5 tests)
- Extract native OpenAI tool calls
- Extract raw XML-style tool calls
- Auto-detect mode from response
ToolFormatterTest (5 tests)
- Format for native API
- Format for raw system prompt injection
RedisCacheTest (9 tests)
- Set and get operations
- Key existence checks
- Delete operations
- TTL expiration
- ContextPaging with Redis cache
- Summary persistence between requests
- In-memory vs Redis parity
- Message store persistence across instances
Redis tests read `REDIS_URL` from the environment and are skipped if it is not set.
Integration Test Requirements
Some tests require a running LLM endpoint. The default configuration uses:
- URL: `http://95.179.247.150/v1`
- Model: `HuggingFaceTB/SmolLM3-3B`
To use a different endpoint, modify `setUp()` in the test files.
Token Counter Binary
The token-counter binary is a Rust CLI tool using tiktoken-rs:
# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4
# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
Source: ~/dev/token-counter/
Open Design Decisions
Dereference Overage
When a message gets dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:
- Allow temporary overage for one turn
- Drop other messages flagged as irrelevant
- Re-summarize something else
- Tighten summary quality to reduce dereferences
Recommendation: Instrument from day one. Log every dereference, token cost, and final count. Let real-world data drive the decision.
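A minimal instrumentation sketch in that spirit — the record shape and function name are assumptions, not part of the ContextPaging API:

```php
<?php
// Build a structured record for every dereference ("page fault").
// The field names here are illustrative.
function dereferenceRecord(string $md5, int $injectedTokens, int $contextTokensAfter): array
{
    return [
        'event'                => 'dereference',
        'md5'                  => $md5,
        'injected_tokens'      => $injectedTokens,      // cost of the paged-in message
        'context_tokens_after' => $contextTokensAfter,  // total after injection
    ];
}

// In Loop 3, after injecting the full message, one could log:
// error_log(json_encode(dereferenceRecord($md5, $cost, $total)));
```

With these records in hand, questions like "how often do dereferences overflow the budget?" become queries rather than guesses.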
The Theory
Full design doc: See the original Context Paging spec.
The key insight: full messages are never discarded. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."