
Context Paging

Virtual memory for LLM context windows — summarize, pointer-reference, and dereference on demand.


The Problem

Long conversations exceed the model's context limit. Naively truncating messages loses critical information. Sending everything wastes tokens and degrades quality.

The Solution

Replace older messages with compressed summaries that include a pointer (MD5 hash) back to the original. The model can "dereference" any pointer by requesting the full message via tool call.

The Analogy

This is virtual memory. The context window is RAM. The message store is the page table. The original messages are disk. A tool call requesting an MD5 hash is a page fault.


Architecture: Three Nested Loops

USER sends message
 │
 ▼
┌─────────────────────────────────┐
│ LOOP 2 — Context Fitting        │
│ Compress history until it fits  │
└─────────────┬───────────────────┘
 │ fitted context
 ▼
┌─────────────────────────────────┐
│ LOOP 3 — Dereference            │
│ LLM may request full msgs       │
│ via MD5 → inject & re-run       │
└─────────────┬───────────────────┘
 │ final response
 ▼
USER receives response

Loop 2 — Fit

ContextPaging::fit() compresses messages until they fit within the context window:

  1. Count total tokens in all messages
  2. If under budget → done
  3. Take oldest non-summarized message
  4. Compute MD5 hash, store original in message store
  5. Replace with summary + hash pointer: [md5:a3f8c1e9...] User asked about Q3 revenue...
  6. Repeat until under budget

Rule: The last message (current user request) is never summarized.
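The fit loop above can be sketched in a few lines. This is a hypothetical illustration, not the real ContextPaging::fit(): fitMessages(), $countTokens, $summarize, and $store are stand-ins for the library's internals.

```php
<?php
// Sketch of Loop 2: summarize oldest messages until the context fits.
// All names here are placeholders, not the library's actual API.

function fitMessages(array $messages, int $budget, callable $countTokens, callable $summarize, array &$store): array
{
    // Only indexes 0..n-2 are candidates: the last message is never summarized.
    $i = 0;
    while ($countTokens($messages) > $budget && $i < count($messages) - 1) {
        $msg = $messages[$i];
        if (!str_starts_with($msg['content'], '[md5:')) {
            $hash = md5($msg['content']);   // pointer back to the original
            $store[$hash] = $msg;           // "disk": keep the full message
            $messages[$i]['content'] = "[md5:$hash] " . $summarize($msg['content']);
        }
        $i++;
    }
    return $messages;
}
```

Note how the budget is re-checked after each replacement, so compression stops as soon as the context fits.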

Loop 3 — Execute

ContextPaging::execute() runs the LLM and handles dereference requests:

  1. Send fitted context to LLM
  2. If response contains fetch_message tool call with MD5 → continue
  3. Look up original message, inject into context
  4. Re-send to LLM
  5. If response is normal text (no tool calls) → done, return to user
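The steps above amount to a loop that re-runs the LLM until it stops requesting pages. A minimal sketch, with $callLlm and $parseFetchCall as placeholders for the real client and ToolCallParser:

```php
<?php
// Sketch of Loop 3: keep calling the LLM, servicing fetch_message
// "page faults" by injecting the stored original and re-running.
// $callLlm and $parseFetchCall are hypothetical stand-ins.

function executeWithDereference(array $messages, callable $callLlm, array $store, callable $parseFetchCall): string
{
    while (true) {
        $response = $callLlm($messages);

        // A fetch_message tool call carries the MD5 of a summarized message.
        $md5 = $parseFetchCall($response);
        if ($md5 === null) {
            return $response;   // normal text, no tool calls: done
        }

        // Page fault: inject the original message and re-run.
        $original = $store[$md5] ?? ['role' => 'user', 'content' => '(message not found)'];
        $messages[] = ['role' => 'tool', 'content' => $original['content']];
    }
}
```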

Project Structure

context-paging/
├── src/
│   ├── ContextPaging.php            # Main class — fit() + execute()
│   ├── TokenCounter.php             # Shells out to Rust binary
│   ├── ContextRequest.php           # Extended ServerRequest
│   ├── OpenAICompatibleClient.php   # Guzzle-based LLM client
│   ├── CompletionsClientInterface.php
│   ├── LLMSummarizer.php            # LLM-backed summarizer
│   ├── SummarizerInterface.php
│   ├── CacheInterface.php           # Cache abstraction
│   ├── InMemoryCache.php            # In-memory implementation
│   ├── RedisCache.php               # Redis implementation
│   ├── ToolCallParser.php           # Parse tool calls from responses
│   ├── ToolFormatter.php            # Format tools for requests
│   └── ToolCallMode.php             # NATIVE/RAW/AUTO enum
├── tests/
│   ├── ContextPagingTest.php        # Core functionality tests
│   ├── OpenAICompatibleClientTest.php # LLM client tests
│   ├── SummarizerTest.php           # Summarization tests
│   ├── RedisCacheTest.php           # Redis persistence tests
│   ├── ToolCallParserTest.php
│   ├── ToolFormatterTest.php
│   └── fluff.md                     # Test article for summarization
├── token-counter                    # Rust binary (tiktoken)
├── index.php                        # CLI entry point
├── composer.json
├── phpunit.xml
└── README.md

Quick Start

Prerequisites

  • PHP 8.5+
  • Composer
  • Rust binary at ./token-counter (or rebuild from ~/dev/token-counter/)

Install

composer install

This installs:

  • guzzlehttp/guzzle — HTTP client for LLM API calls
  • guzzlehttp/psr7 — PSR-7 message implementations
  • predis/predis — Redis client (optional, only if using RedisCache)

Run Tests

./vendor/bin/phpunit

# With testdox output
./vendor/bin/phpunit --testdox

# Run specific test file
./vendor/bin/phpunit tests/SummarizerTest.php

CLI Usage

# Pipe JSON payload
echo '{"messages":[{"role":"user","content":"Hello!"}]}' | php index.php

# Or pass as argument
php index.php '{"messages":[{"role":"user","content":"Hello!"}]}'

API

ContextPaging

use ContextPaging\ContextPaging;
use ContextPaging\TokenCounter;
use ContextPaging\LLMSummarizer;
use ContextPaging\OpenAICompatibleClient;
use ContextPaging\ToolCallMode;

// Create summarizer (optional — falls back to truncation if not provided)
$summarizerClient = new OpenAICompatibleClient(
    baseUrl: 'http://your-llm-endpoint/v1',
    apiKey: null, // optional for local endpoints
    timeout: 120
);

$summarizer = new LLMSummarizer(
    client: $summarizerClient,
    model: 'HuggingFaceTB/SmolLM3-3B',
    maxTokens: 200,
    temperature: 0.3
);

// Create main instance
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    summarizer: $summarizer
);

// Configure for your model
$contextPaging
    ->setMaxContextTokens(128000)
    ->setResponseReserve(4096);

// Set tool call mode (for models with broken tool parsers)
$contextPaging->setToolCallMode(ToolCallMode::RAW);

// LOOP 2: Fit the context
$fittedRequest = $contextPaging->fit($request);

// LOOP 3: Execute with dereference handling
$response = $contextPaging->execute($fittedRequest, function (array $messages, $options) use ($client) {
    return $client->chat($messages, $options);
});

TokenCounter

use ContextPaging\TokenCounter;

$counter = new TokenCounter();

// Count tokens in a string
$tokens = $counter->count("Hello, world!");
// Returns: 4

// Count with different encoding
$tokens = $counter->count("Hello, world!", "o200k_base");

// Count context size for chat messages
$tokens = $counter->contextSize([
    ['role' => 'user', 'content' => 'Hello!'],
    ['role' => 'assistant', 'content' => 'Hi there!'],
]);

OpenAICompatibleClient

use ContextPaging\OpenAICompatibleClient;

$client = new OpenAICompatibleClient(
    baseUrl: 'http://95.179.247.150/v1',
    apiKey: null,
    timeout: 120,
    verifySsl: false
);

// Chat completion
$response = $client->chat([
    ['role' => 'user', 'content' => 'Hello!']
], [
    'model' => 'HuggingFaceTB/SmolLM3-3B',
    'max_tokens' => 100
]);

// List models
$models = $client->listModels();

LLMSummarizer

use ContextPaging\LLMSummarizer;

$summarizer = new LLMSummarizer(
    client: $client,
    model: 'HuggingFaceTB/SmolLM3-3B',
    systemPrompt: 'Summarize concisely, preserving key information.',
    maxTokens: 200,
    temperature: 0.3
);

$summary = $summarizer->summarize($longText);

Tool Call Modes

The system supports three tool call modes for the dereference operation (AUTO detects which of the first two to use):

NATIVE Mode

For models with working tool call parsers (GPT-4, Claude, etc.):

$contextPaging->setToolCallMode(ToolCallMode::NATIVE);
  • Tools sent as tools array in request payload
  • Tool calls returned in tool_calls array in response

RAW Mode

For models with broken/missing tool parsers (SmolLM3, etc.):

$contextPaging->setToolCallMode(ToolCallMode::RAW);
  • Tools injected into system prompt with XML-style format
  • Model outputs tool calls as markers: <tool_call>{"name": "fetch_message", "arguments": {"md5": "..."}}</tool_call>
  • Parsed from response content
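Extracting a RAW-mode marker can be done with a regex over the response content. This is an illustrative sketch, not the library's actual ToolCallParser:

```php
<?php
// Sketch: pull the first <tool_call>…</tool_call> marker out of a
// RAW-mode response and decode its JSON payload. Hypothetical helper,
// not the library's API.

function parseRawToolCall(string $content): ?array
{
    if (preg_match('/<tool_call>(.*?)<\/tool_call>/s', $content, $m)) {
        $call = json_decode($m[1], true);
        return is_array($call) ? $call : null;
    }
    return null;   // no marker: treat as a normal text response
}
```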

AUTO Mode

Detects mode from first response:

$contextPaging->setToolCallMode(ToolCallMode::AUTO);

Implementation Status

Component                   Status               Notes
Token counting              Done                 Rust binary via tiktoken-rs
Fit loop (Loop 2)           Done                 Summarization via LLM
Message store               Redis or in-memory   Persistent cache support
Summary cache               Redis or in-memory   Persistent cache support
Dereference loop (Loop 3)   Done                 Tool call parsing implemented
Tool call parser            Done                 NATIVE and RAW modes
Tool formatter              Done                 NATIVE and RAW modes
LLM client                  Done                 OpenAI-compatible via Guzzle
LLMSummarizer               Done                 Uses configured model
RedisCache                  Done                 Persistent storage via Predis
Tests                       36 passing           Unit + integration tests

Caching

In-Memory Cache (Default)

By default, ContextPaging uses in-memory caches that exist for the duration of a single request:

$contextPaging = new ContextPaging();
// Uses InMemoryCache internally

Redis Cache (Persistent)

For persistent storage across requests, use Redis:

use ContextPaging\RedisCache;

// Create Redis-backed caches
$messageStore = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_msg:',    // Key prefix for namespacing
    defaultTtl: null       // No expiry (or set TTL in seconds)
);

$summaryCache = RedisCache::fromUrl(
    'rediss://user:password@host:port',
    prefix: 'ctx_sum:'
);

// Inject into ContextPaging
$contextPaging = new ContextPaging(
    tokenCounter: new TokenCounter(),
    messageStore: $messageStore,
    summaryCache: $summaryCache
);

Benefits of Redis:

  • Summaries persist between requests (no re-summarization)
  • Message store survives process restarts
  • Share context across multiple workers/servers

Key Namespacing:

  • Message store uses keys: prefix:msg:{md5}
  • Summary cache uses keys: prefix:summary:{md5}
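The namespacing above composes like this; buildKey() is a hypothetical helper shown only to make the key layout concrete, not part of the library:

```php
<?php
// Sketch: compose a namespaced Redis key from the configured prefix,
// the key kind, and the message MD5. Hypothetical helper.

function buildKey(string $prefix, string $kind, string $md5): string
{
    return $prefix . $kind . ':' . $md5;
}

// e.g. a message-store key with the 'ctx_msg:' prefix from the example above:
// buildKey('ctx_msg:', 'msg', md5($content)) → "ctx_msg:msg:<32 hex chars>"
```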

Testing

Run All Tests

./vendor/bin/phpunit --testdox

Test Categories

ContextPagingTest (6 tests)

  • Small payloads pass through unchanged
  • Large payloads trigger summarization
  • Last message is never summarized
  • Original messages stored for dereferencing
  • Error when last message is too large

OpenAICompatibleClientTest (8 tests)

  • Basic chat completion
  • Usage stats returned
  • Multi-turn conversation context retention
  • List models endpoint
  • RAW tool formatting
  • Tool call parser detection

SummarizerTest (4 tests)

  • Summarization reduces token count (typically 75-85%)
  • Key information preserved
  • Multi-article summarization
  • Usage stats accuracy

ToolCallParserTest (5 tests)

  • Extract native OpenAI tool calls
  • Extract raw XML-style tool calls
  • Auto-detect mode from response

ToolFormatterTest (5 tests)

  • Format for native API
  • Format for raw system prompt injection

RedisCacheTest (9 tests)

  • Set and get operations
  • Key existence checks
  • Delete operations
  • TTL expiration
  • ContextPaging with Redis cache
  • Summary persistence between requests
  • In-memory vs Redis parity
  • Message store persistence across instances

Integration Test Requirements

Some tests require a running LLM endpoint. The default configuration uses:

  • URL: http://95.179.247.150/v1
  • Model: HuggingFaceTB/SmolLM3-3B

To use a different endpoint, modify setUp() in the test files.


Token Counter Binary

The token-counter binary is a Rust CLI tool using tiktoken-rs:

# Default: cl100k_base (GPT-4/3.5)
echo "Hello, world!" | ./token-counter
# 4

# GPT-4o encoding
echo "Hello, world!" | ./token-counter o200k_base
# 4
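A PHP wrapper might shell out to the binary roughly like this. countTokens() here is a sketch, not the real TokenCounter, and the chars-per-token fallback is an assumption for when the binary is missing:

```php
<?php
// Sketch: pipe text to the token-counter binary on stdin and read the
// count from stdout. Hypothetical wrapper, not the library's TokenCounter.

function countTokens(string $text, string $encoding = 'cl100k_base', string $binary = './token-counter'): int
{
    if (!is_executable($binary)) {
        // Rough fallback: assume ~4 characters per token.
        return (int) ceil(strlen($text) / 4);
    }
    $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w']];
    $proc = proc_open([$binary, $encoding], $spec, $pipes);
    fwrite($pipes[0], $text);
    fclose($pipes[0]);
    $out = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);
    return (int) trim($out);
}
```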

Source: ~/dev/token-counter/


Open Design Decisions

Dereference Overage

When a message gets dereferenced in Loop 3, the re-inflated context may exceed the token budget. Options:

  1. Allow temporary overage for one turn
  2. Drop other messages flagged as irrelevant
  3. Re-summarize something else
  4. Tighten summary quality to reduce dereferences

Recommendation: Instrument from day one. Log every dereference, token cost, and final count. Let real-world data drive the decision.
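That instrumentation could start as simply as the sketch below. DereferenceLog is hypothetical, not part of the library; it records each dereference with its token cost so real data can drive the overage policy.

```php
<?php
// Sketch: record every dereference and how far it pushed the context
// past budget. Hypothetical instrumentation, not the library's API.

final class DereferenceLog
{
    private array $events = [];

    public function record(string $md5, int $injectedTokens, int $contextAfter, int $budget): void
    {
        $this->events[] = [
            'md5' => $md5,
            'injected_tokens' => $injectedTokens,
            'context_after' => $contextAfter,
            'overage' => max(0, $contextAfter - $budget),
        ];
    }

    // Fraction of dereferences that pushed the context over budget.
    public function overageRate(): float
    {
        if ($this->events === []) {
            return 0.0;
        }
        $over = count(array_filter($this->events, fn($e) => $e['overage'] > 0));
        return $over / count($this->events);
    }
}
```

If the overage rate stays near zero in practice, option 1 (allow temporary overage) is likely the simplest safe choice.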


The Theory

Full design doc: See the original Context Paging spec.

The key insight: full messages are never discarded. They stay in the original request payload on the server. The LLM just doesn't see them until it asks. This is the "disk" backing the "virtual memory."
