Context Paging: Virtual Memory for LLM Context Windows
A lightweight mechanism for extending conversational context beyond model limits through summarization, pointer-referencing, and on-demand dereferencing.
Abstract
Large language models (LLMs) have fixed context windows that limit the length of conversations. As dialogues grow, they eventually exceed these limits, forcing a choice: truncate old messages (losing information) or fail the request entirely. We present Context Paging, a technique inspired by virtual memory management in operating systems. Old messages are compressed into summaries with cryptographic pointers back to the originals. The model can retrieve the full content of any summarized message on-demand via a tool call—the analogue of a page fault.
We evaluated Context Paging on a 98-turn coding conversation using two models with different context limits. On a 64k-token model, the technique achieved 23% compression (79,263 → 60,985 tokens). On a 32k-token model with half the context, compression increased to 85.4% (184,154 → 26,774 tokens), demonstrating that context paging automatically adapts compression intensity to available space. In both cases, the conversation completed successfully where it would have otherwise overflowed.
1. Introduction
1.1 The Context Window Problem
Modern LLMs process input through a fixed-size context window—typically 4k to 200k tokens depending on the model. In conversational applications, each turn adds to the conversation history, and eventually the accumulated context exceeds the window.
Turn 1: 100 tokens ✓ Fits
Turn 10: 1,000 tokens ✓ Fits
Turn 50: 50,000 tokens ✓ Fits
Turn 100: 80,000 tokens ✗ OVERFLOW
The problem is acute for extended interactions: coding sessions, research discussions, customer support, and agent-based workflows where context accumulates over dozens or hundreds of turns.
1.2 Existing Approaches
Truncation: Drop the oldest messages when approaching the limit. Simple but destructive—important context from early turns is lost.
Sliding Window: Keep only the last N turns. Similar problem—earlier context is discarded.
External Memory + RAG: Store conversation history externally and retrieve relevant portions via semantic search. Effective but requires infrastructure, embeddings, and a retrieval model. The model cannot "know what it doesn't know"—it can only retrieve what the search system deems relevant.
Long-context Models: Use models with larger windows (128k, 200k, 1M+ tokens). Solves the problem at the cost of higher latency and pricing, and is not an option when the deployment is constrained to a model with a smaller window.
1.3 Our Contribution
We propose Context Paging: a lightweight, model-agnostic technique that:
- Preserves all original messages in a backing store
- Compresses old messages into brief summaries with pointers
- Allows on-demand retrieval via a tool call mechanism
- Requires no external infrastructure beyond a simple key-value cache
The key insight is that the model itself decides when it needs more context—it issues a "page fault" by calling a tool, and the system retrieves the full message from the backing store.
2. The Virtual Memory Analogy
Context Paging maps directly to virtual memory concepts:
| Virtual Memory | Context Paging |
|---|---|
| Physical RAM | Context window |
| Disk/Backing Store | Message store (original messages) |
| Page Table | MD5 hash → message mapping |
| Page Fault | Tool call requesting a message |
| Memory Pressure | Context approaching limit |
| Page Eviction | Summarization |
| Page-in | Dereference (retrieve full message) |
When physical memory fills, the OS evicts pages to disk, keeping only a pointer (page table entry) in RAM. When a process accesses evicted memory, a page fault occurs, and the OS loads the page back from disk.
Similarly, when the context window fills, Context Paging "evicts" old messages to the message store, keeping only a summary with an MD5 pointer. When the model needs the full message, it issues a tool call (page fault), and the system injects the original back into context.
3. Architecture
Context Paging operates through two nested loops.
3.1 Loop 1: Fit (Compression)
The Fit loop ensures the conversation fits within the context budget.
┌─────────────────────────────────────────┐
│ FIT ALGORITHM │
├─────────────────────────────────────────┤
│ 1. Count tokens in all messages │
│ 2. If tokens ≤ budget: DONE │
│ 3. Find oldest non-summarized message │
│ 4. Compute MD5 hash of content │
│ 5. Store original in message store │
│ 6. Replace with summary + pointer │
│ 7. Go to step 2 │
└─────────────────────────────────────────┘
The pointer format:
[md5:a3f8c1e9d2b4...] User asked about implementing OAuth2 login...
The MD5 hash serves as a unique identifier for the original message. The summary provides a hint of what was discussed.
Key invariant: The last message (the current user request) is never summarized.
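Steps 4-6 of the Fit loop can be sketched in a few lines of Python (make_pointer is an illustrative helper, not part of any particular library):

```python
import hashlib

def make_pointer(content: str, summary: str) -> str:
    """Replace a message body with a summary prefixed by an MD5 pointer."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    return f"[md5:{digest}] {summary}"

pointer = make_pointer("full original message text", "User asked about X.")
```

The original content would be stored in the message store under the same digest before the replacement happens, so the pointer always resolves.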
3.2 Loop 2: Execute (Dereferencing)
The Execute loop runs the model and handles retrieval requests.
┌─────────────────────────────────────────┐
│ EXECUTE ALGORITHM │
├─────────────────────────────────────────┤
│ 1. Send fitted context to LLM │
│ 2. Parse response: │
│ - If text response: return to user │
│ - If fetch_message tool call: │
│ a. Look up original by MD5 │
│ b. Inject into context │
│ c. Re-run LLM │
│ d. Go to step 2 │
└─────────────────────────────────────────┘
The model has access to a fetch_message tool:
{
"name": "fetch_message",
"description": "Retrieve the full content of a summarized message.",
"parameters": {
"type": "object",
"properties": {
"md5": {
"type": "string",
"description": "The MD5 hash from the [md5:...] pointer"
}
},
"required": ["md5"]
}
}
When the model calls this tool, it signals that the summary was insufficient—it needs the full context to respond properly.
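The page-in step can be sketched as follows, assuming the dict-based message bookkeeping used in the pseudocode of Appendix C (handle_fetch and the _original_md5 field are illustrative names):

```python
def handle_fetch(md5: str, message_store: dict, messages: list) -> bool:
    """Page-in: swap a summarized message back to its stored original.

    Returns True if the pointer resolved, False for an unknown hash.
    """
    original = message_store.get(md5)
    if original is None:
        return False  # dangling pointer: summary exists but original was lost
    for msg in messages:
        if msg.get("_original_md5") == md5:
            msg["content"] = original
            msg["_summarized"] = False
            return True
    return False

# Minimal usage: one summarized message, one entry in the store.
store = {"a" * 32: "the full original message text"}
messages = [{"role": "user",
             "content": f"[md5:{'a' * 32}] brief summary",
             "_summarized": True,
             "_original_md5": "a" * 32}]
found = handle_fetch("a" * 32, store, messages)
```

After a successful page-in, the context is re-sent to the model, which now sees the full message where the summary used to be.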
3.3 Data Flow
USER MESSAGE
│
▼
┌─────────────┐ ┌─────────────┐
│ HISTORY │────▶│ LOOP 1: │
│ (grows) │ │ FIT │
└─────────────┘ └──────┬──────┘
│
┌──────▼──────┐
│ FITTED │
│ CONTEXT │
└──────┬──────┘
│
┌──────▼──────┐
│ LOOP 2: │◀────┐
│ EXECUTE │ │
└──────┬──────┘ │
│ │
┌────────────┼────────────┘
│ │
┌─────▼─────┐ ┌────▼────┐
│ TOOL CALL │ │ TEXT │
│ (fetch) │ │ RESPONSE│
└─────┬─────┘ └────┬────┘
│ │
┌─────▼─────┐ │
│ INJECT │ │
│ ORIGINAL │ │
└─────┬─────┘ │
│ │
└────┬───────┘
│
▼
USER RECEIVES
RESPONSE
4. Implementation Considerations
4.1 Summarization Strategy
The quality of summaries directly impacts the model's ability to work with compressed context. Options include:
- LLM-based summarization: Use a model to generate 2-3 sentence summaries. Preserves semantic content but adds latency and cost.
- Truncation: Simply truncate to N characters. Fast but loses semantic coherence.
- Extractive summarization: Select key sentences. Balances speed and quality.
- Hybrid: Use truncation initially, switch to LLM summarization for messages that have already been referenced once (indicating importance).
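The hybrid strategy can be sketched as below (summarize and its parameters are illustrative; llm_summarize stands in for whatever summarization model is available):

```python
def summarize(content: str, times_referenced: int = 0,
              llm_summarize=None, max_chars: int = 200) -> str:
    """Hybrid strategy: cheap truncation by default, upgrading to an LLM
    summary once a message has been dereferenced (a signal it matters).

    llm_summarize is any callable that returns a short abstract.
    """
    if times_referenced > 0 and llm_summarize is not None:
        return llm_summarize(content)
    if len(content) <= max_chars:
        return content  # short messages pass through unchanged
    return content[:max_chars] + "..."
```

Tracking times_referenced costs one counter per pointer and lets the system spend LLM summarization budget only where it has evidence of payoff.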
4.2 Token Budgeting
The context budget must account for:
- Response tokens: Reserve space for the model's output
- Safety margin: Account for tokenizer discrepancies between counting and inference
- Tool definition overhead: Space for the fetch_message tool schema
budget = max_context - response_reserve - safety_margin - tool_overhead
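Concretely, the budget computation might look like this; the reserve values are illustrative defaults that would be tuned per model and tokenizer, with a small allowance for the tool schema:

```python
def compute_budget(max_context: int,
                   response_reserve: int = 2048,
                   safety_margin: int = 512,
                   tool_overhead: int = 150) -> int:
    """Tokens available for the fitted conversation history.

    All reserve values are illustrative; tune them per model/tokenizer.
    """
    return max_context - response_reserve - safety_margin - tool_overhead

budget_32k = compute_budget(32768)  # 30058 usable tokens on a 32k model
```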
4.3 Dereference Overages
When a message is dereferenced, the context may temporarily exceed the budget. Options:
- Allow temporary overage: Accept that one turn may use more tokens. The next fit() will re-compress.
- Re-summarize other messages: When injecting a full message, summarize something else to maintain budget.
- Multi-message eviction: Summarize multiple messages to create headroom for future dereferences.
4.4 Caching
Two caches are maintained:
- Message store: MD5 → full message (the "disk")
- Summary cache: MD5 → summary (avoid re-summarizing identical content)
Both can be in-memory (for single-request scope) or backed by Redis (for persistence across requests/servers).
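A minimal in-memory sketch of the two caches (MessageStore and its method names are illustrative; a persistent deployment could swap the dicts for Redis hashes):

```python
class MessageStore:
    """Both caches behind one interface: originals are the "disk",
    summaries avoid re-summarizing identical content."""

    def __init__(self):
        self._messages = {}   # md5 -> full original message content
        self._summaries = {}  # md5 -> summary

    def evict(self, md5: str, content: str, summary: str) -> None:
        self._messages[md5] = content
        self._summaries.setdefault(md5, summary)  # keep first summary

    def original(self, md5: str):
        return self._messages.get(md5)

    def cached_summary(self, md5: str):
        return self._summaries.get(md5)
```

Because the key is a content hash, identical messages dedupe for free: evicting the same content twice stores one entry and reuses the cached summary.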
5. Evaluation
5.1 Methodology
We tested Context Paging on a 98-turn coding conversation where a model progressively built a task manager application. The conversation included:
- Initial design and data models
- Feature additions (persistence, CLI, priorities, due dates, tags)
- Backend development (Flask API, authentication, WebSockets)
- Database integration (SQLAlchemy, migrations)
- Testing (unit tests, integration tests, performance tests)
- Frontend development (React, TypeScript, Material-UI)
- Deployment (Docker, Kubernetes, CI/CD)
- Security and monitoring
We ran the same conversation against two models with different context limits:
| Model | Context Window | Tool Calling |
|---|---|---|
| SmolLM3-3B | 64k tokens | RAW (non-native) |
| Hermes-3-Llama-3.2-3B | 32k tokens | NATIVE |
5.2 Results
| Metric | SmolLM3 (64k) | Hermes-3 (32k) |
|---|---|---|
| Total turns | 98 | 98 |
| Raw history tokens | 79,263 | 184,154 |
| Context limit | 65,536 | 32,768 |
| Would overflow at turn | ~85 | ~6 |
| Final request tokens | 60,985 | 26,774 |
| Tokens saved | 18,278 | 157,380 |
| Compression ratio | 23.0% | 85.4% |
Key finding: The 32k model required 3.7x more compression but completed the same conversation. The sawtooth pattern below shows how context paging adapts to available space.
5.3 Token Growth Patterns
SmolLM3 (64k context) — Linear growth, compression kicks in late:
Turn 1: 72 tokens
Turn 10: 9,723 tokens
Turn 25: 24,445 tokens
Turn 50: 42,456 tokens
Turn 75: 58,992 tokens
Turn 98: 60,985 tokens (fit) vs 79,263 (raw)
Hermes-3 (32k context) — Linear growth until ~turn 18, then stable:
Turn 1: 82 tokens
Turn 10: 7,714 tokens
Turn 18: 27,333 tokens ← approaching limit
Turn 19: 27,469 tokens ← compression begins
Turn 50: 25,803 tokens ← stable
Turn 98: 26,774 tokens ← stable
The 32k model shows a classic "sawtooth" pattern after turn 18: context grows, approaches limit, old messages get summarized, context shrinks, repeat.
5.4 Dereference Behavior
Neither model issued fetch_message tool calls during the test. The summaries were sufficient for continuing the conversation. This suggests:
- For structured, incremental work (like coding), summaries preserve enough context
- Dereferences may be more common in conversations with frequent context-switching or surprise callbacks to early details
- Summary quality matters—a better summarizer reduces dereferences
6. Discussion
6.1 When Does This Work Well?
Incremental workflows: Coding, writing, research where each turn builds on the previous. The most relevant context is recent; older turns are less critical.
Structured conversations: When the model is following a plan or checklist. Summaries can capture the gist without full detail.
Budget-conscious applications: When token costs matter more than perfect context retention.
6.2 Limitations
Information loss: Summaries discard detail. If early turns contain critical information referenced much later, the model may need to dereference or may miss it entirely.
Dereference cost: Each dereference adds a model call. A conversation with many dereferences could be slower and more expensive than using a long-context model.
Pointer overhead: MD5 hashes and summary framing add overhead. For very short messages, summarization may not reduce tokens.
No semantic retrieval: Unlike RAG, the model cannot discover relevant old messages—it can only retrieve messages it knows about (via pointer). If a summary is missing, the model has no path to that content.
6.3 Comparison to Alternatives
| Approach | Infrastructure | Cost | Context Retention | Retrieval |
|---|---|---|---|---|
| Truncation | None | Lowest | Poor | None |
| Sliding Window | None | Low | Poor | None |
| Context Paging | Simple cache | Medium | Good (lossy) | On-demand |
| RAG | Embeddings, vector DB | High | Good | Semantic search |
| Long-context Model | None | Highest | Perfect | None |
Context Paging sits in the middle: more retention than truncation, less infrastructure than RAG, lower cost than long-context models.
6.4 Future Directions
- Semantic summarization: Tailor summary content based on the likely future relevance of each message.
- Proactive eviction: Anticipate context growth and summarize earlier to avoid last-minute compression.
- Multi-level paging: Summaries of summaries for very long conversations—like multi-level page tables.
- Integration with RAG: Use Context Paging for recent context, RAG for older messages that don't fit even in summarized form.
- Compression quality metrics: Track how often the model dereferences to evaluate summary effectiveness.
7. Conclusion
Context Paging provides a practical mechanism for extending conversational context beyond model limits. By treating the context window as a cache and the message store as backing memory, we preserve information that would otherwise be lost to truncation—while giving the model agency to retrieve what it needs.
The technique is lightweight, requiring only a key-value store and token counting. It works with any model that supports tool calls (or can parse structured output). In our evaluation, it enabled a 98-turn conversation on a 64k-token model that would have otherwise failed at turn 85.
The virtual memory analogy is not perfect—LLM context is not random access, and "page faults" require model decisions rather than hardware interrupts. But the principle holds: when memory is constrained, move less-used data to secondary storage and retrieve it on demand. For LLMs, that means summarized context with pointer-references to originals, fetched when the model needs more detail.
Appendix A: Message Pointer Format
[md5:<32-character-hash>] <summary text>
Example:
[md5:a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6] User requested OAuth2 implementation. We discussed authorization code flow, PKCE for security, and token refresh handling.
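A pointer in this format can be parsed with a short regular expression (parse_pointer is an illustrative helper):

```python
import re

# 32 lowercase hex chars = one MD5 digest; the rest is the summary.
POINTER_RE = re.compile(r"^\[md5:([0-9a-f]{32})\]\s*(.*)$", re.DOTALL)

def parse_pointer(text: str):
    """Return (md5, summary) for a pointered message, or None otherwise."""
    m = POINTER_RE.match(text)
    return (m.group(1), m.group(2)) if m else None

md5, summary = parse_pointer(
    "[md5:a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6] User requested OAuth2 implementation.")
```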
Appendix B: Tool Call Format
Native mode (OpenAI-compatible):
{
"tool_calls": [{
"id": "call_abc123",
"type": "function",
"function": {
"name": "fetch_message",
"arguments": "{\"md5\": \"a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6\"}"
}
}]
}
Raw mode (for models without native tool support):
<tool_call>{"name": "fetch_message", "arguments": {"md5": "a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6"}}</tool_call>
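Raw-mode output can be handled by extracting and decoding the tagged JSON (parse_raw_fetch is an illustrative helper for this parsing step):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_raw_fetch(output: str):
    """Extract the md5 argument of a raw-mode fetch_message call, or None."""
    m = TOOL_CALL_RE.search(output)
    if not m:
        return None
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None  # malformed tool call: treat the output as plain text
    if call.get("name") != "fetch_message":
        return None
    return call.get("arguments", {}).get("md5")

md5 = parse_raw_fetch(
    '<tool_call>{"name": "fetch_message", '
    '"arguments": {"md5": "a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6"}}</tool_call>')
```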
Appendix C: Pseudocode
def context_paging(messages, budget):
    # Loop 1: Fit
    while token_count(messages) > budget:
        idx, oldest = find_oldest_unsummarized(messages)
        md5 = md5_hash(oldest["content"])   # hashlib.md5, not Python's builtin hash()
        store(md5, oldest)
        summary = summarize(oldest)
        messages[idx] = {
            "role": oldest["role"],
            "content": f"[md5:{md5}] {summary}",
            "_summarized": True,
            "_original_md5": md5
        }

    # Loop 2: Execute
    while True:
        response = llm.chat(messages, tools=[FETCH_MESSAGE_TOOL])
        if not has_tool_call(response):
            return response
        md5 = extract_md5_from_tool_call(response)
        original = retrieve(md5)
        # Inject the original back into context
        for msg in messages:
            if msg.get("_original_md5") == md5:
                msg["content"] = original["content"]
                msg["_summarized"] = False
                break
Context Paging is implemented as an open-source library. For implementation details and source code, see the project repository.