Context Paging: Virtual Memory for LLM Context Windows
A lightweight mechanism for extending conversational context beyond model limits through summarization, pointer-referencing, and on-demand dereferencing.
Abstract
Large language models (LLMs) have fixed context windows that limit the length of conversations. As dialogues grow, they eventually exceed these limits, forcing a choice: truncate old messages (losing information) or fail the request entirely. We present Context Paging, a technique inspired by virtual memory management in operating systems. Old messages are compressed into summaries with cryptographic pointers back to the originals. The model can retrieve the full content of any summarized message on-demand via a tool call—the analogue of a page fault.
We evaluated Context Paging on a 98-turn coding conversation using two models with different context limits. On a 64k-token model, the technique achieved 23% compression (79,263 → 60,985 tokens). On a 32k-token model with half the context, compression increased to 85.4% (184,154 → 26,774 tokens), demonstrating that context paging automatically adapts compression intensity to available space. In both cases, the conversation completed successfully where it would have otherwise overflowed.
1. Introduction
1.1 The Context Window Problem
Modern LLMs process input through a fixed-size context window—typically 4k to 200k tokens depending on the model. In conversational applications, each turn adds to the conversation history, and eventually the accumulated context exceeds the window.
Turn 1: 100 tokens ✓ Fits
Turn 10: 1,000 tokens ✓ Fits
Turn 50: 50,000 tokens ✓ Fits
Turn 100: 80,000 tokens ✗ OVERFLOW
The problem is acute for extended interactions: coding sessions, research discussions, customer support, and agent-based workflows where context accumulates over dozens or hundreds of turns.
1.2 Existing Approaches
Truncation: Drop the oldest messages when approaching the limit. Simple but destructive—important context from early turns is lost.
Sliding Window: Keep only the last N turns. Similar problem—earlier context is discarded.
External Memory + RAG: Store conversation history externally and retrieve relevant portions via semantic search. Effective but requires infrastructure, embeddings, and a retrieval model. The model cannot "know what it doesn't know"—it can only retrieve what the search system deems relevant.
Long-context Models: Use models with larger windows (128k, 200k, 1M+ tokens). Solves the problem at the cost of higher latency and pricing, and is not an option when the deployment is constrained to a model with a smaller window.
1.3 Our Contribution
We propose Context Paging: a lightweight, model-agnostic technique that:
- Preserves all original messages in a backing store
- Compresses old messages into brief summaries with pointers
- Allows on-demand retrieval via a tool call mechanism
- Requires no external infrastructure beyond a simple key-value cache
The key insight is that the model itself decides when it needs more context—it issues a "page fault" by calling a tool, and the system retrieves the full message from the backing store.
2. The Virtual Memory Analogy
Context Paging maps directly to virtual memory concepts:
| Virtual Memory | Context Paging |
|---|---|
| Physical RAM | Context window |
| Disk/Backing Store | Message store (original messages) |
| Page Table | MD5 hash → message mapping |
| Page Fault | Tool call requesting a message |
| Memory Pressure | Context approaching limit |
| Page Eviction | Summarization |
| Page-in | Dereference (retrieve full message) |
When physical memory fills, the OS evicts pages to disk, keeping only a pointer (page table entry) in RAM. When a process accesses evicted memory, a page fault occurs, and the OS loads the page back from disk.
Similarly, when the context window fills, Context Paging "evicts" old messages to the message store, keeping only a summary with an MD5 pointer. When the model needs the full message, it issues a tool call (page fault), and the system injects the original back into context.
3. Architecture
Context Paging operates through two nested loops.
3.1 Loop 1: Fit (Compression)
The Fit loop ensures the conversation fits within the context budget.
┌─────────────────────────────────────────┐
│ FIT ALGORITHM │
├─────────────────────────────────────────┤
│ 1. Count tokens in all messages │
│ 2. If tokens ≤ budget: DONE │
│ 3. Find oldest non-summarized message │
│ 4. Compute MD5 hash of content │
│ 5. Store original in message store │
│ 6. Replace with summary + pointer │
│ 7. Go to step 2 │
└─────────────────────────────────────────┘
The pointer format:
[md5:a3f8c1e9d2b4...] User asked about implementing OAuth2 login...
The MD5 hash serves as a unique identifier for the original message. The summary provides a hint of what was discussed.
Key invariant: The last message (the current user request) is never summarized.
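Steps 4-6 of the Fit loop can be sketched in a few lines of Python (make_pointer is an illustrative helper, not part of any particular library):

```python
import hashlib

def make_pointer(content: str, summary: str) -> str:
    """Replace a message body with a summary prefixed by an MD5 pointer."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    return f"[md5:{digest}] {summary}"

pointer = make_pointer("full original message text", "User asked about X.")
```

The original content would be stored in the message store under the same digest before the replacement happens, so the pointer always resolves.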
3.2 Loop 2: Execute (Dereferencing)
The Execute loop runs the model and handles retrieval requests.
┌─────────────────────────────────────────┐
│ EXECUTE ALGORITHM │
├─────────────────────────────────────────┤
│ 1. Send fitted context to LLM │
│ 2. Parse response: │
│ - If text response: return to user │
│ - If fetch_message tool call: │
│ a. Look up original by MD5 │
│ b. Inject into context │
│ c. Re-run LLM │
│ d. Go to step 2 │
└─────────────────────────────────────────┘
The model has access to a fetch_message tool:
{
"name": "fetch_message",
"description": "Retrieve the full content of a summarized message.",
"parameters": {
"type": "object",
"properties": {
"md5": {
"type": "string",
"description": "The MD5 hash from the [md5:...] pointer"
}
},
"required": ["md5"]
}
}
When the model calls this tool, it signals that the summary was insufficient—it needs the full context to respond properly.
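The page-in step can be sketched as follows, assuming the dict-based message bookkeeping used in the pseudocode of Appendix C (handle_fetch and the _original_md5 field are illustrative names):

```python
def handle_fetch(md5: str, message_store: dict, messages: list) -> bool:
    """Page-in: swap a summarized message back to its stored original.

    Returns True if the pointer resolved, False for an unknown hash.
    """
    original = message_store.get(md5)
    if original is None:
        return False  # dangling pointer: summary exists but original was lost
    for msg in messages:
        if msg.get("_original_md5") == md5:
            msg["content"] = original
            msg["_summarized"] = False
            return True
    return False

# Minimal usage: one summarized message, one entry in the store.
store = {"a" * 32: "the full original message text"}
messages = [{"role": "user",
             "content": f"[md5:{'a' * 32}] brief summary",
             "_summarized": True,
             "_original_md5": "a" * 32}]
found = handle_fetch("a" * 32, store, messages)
```

After a successful page-in, the context is re-sent to the model, which now sees the full message where the summary used to be.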
3.3 Data Flow
USER MESSAGE
│
▼
┌─────────────┐ ┌─────────────┐
│ HISTORY │────▶│ LOOP 1: │
│ (grows) │ │ FIT │
└─────────────┘ └──────┬──────┘
│
┌──────▼──────┐
│ FITTED │
│ CONTEXT │
└──────┬──────┘
│
┌──────▼──────┐
│ LOOP 2: │◀────┐
│ EXECUTE │ │
└──────┬──────┘ │
│ │
┌────────────┼────────────┘
│ │
┌─────▼─────┐ ┌────▼────┐
│ TOOL CALL │ │ TEXT │
│ (fetch) │ │ RESPONSE│
└─────┬─────┘ └────┬────┘
│ │
┌─────▼─────┐ │
│ INJECT │ │
│ ORIGINAL │ │
└─────┬─────┘ │
│ │
└────┬───────┘
│
▼
USER RECEIVES
RESPONSE
4. Implementation Considerations
4.1 Summarization Strategy
The quality of summaries directly impacts the model's ability to work with compressed context. Options include:
- LLM-based summarization: Use a model to generate 2-3 sentence summaries. Preserves semantic content but adds latency and cost.
- Truncation: Simply truncate to N characters. Fast but loses semantic coherence.
- Extractive summarization: Select key sentences. Balances speed and quality.
- Hybrid: Use truncation initially, switch to LLM summarization for messages that have already been referenced once (indicating importance).
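The hybrid strategy can be sketched as below (summarize and its parameters are illustrative; llm_summarize stands in for whatever summarization model is available):

```python
def summarize(content: str, times_referenced: int = 0,
              llm_summarize=None, max_chars: int = 200) -> str:
    """Hybrid strategy: cheap truncation by default, upgrading to an LLM
    summary once a message has been dereferenced (a signal it matters).

    llm_summarize is any callable that returns a short abstract.
    """
    if times_referenced > 0 and llm_summarize is not None:
        return llm_summarize(content)
    if len(content) <= max_chars:
        return content  # short messages pass through unchanged
    return content[:max_chars] + "..."
```

Tracking times_referenced costs one counter per pointer and lets the system spend LLM summarization budget only where it has evidence of payoff.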
4.2 Token Budgeting
The context budget must account for:
- Response tokens: Reserve space for the model's output
- Safety margin: Account for tokenizer discrepancies between counting and inference
- Tool definition overhead: Space for the fetch_message tool schema
budget = max_context - response_reserve - safety_margin - tool_overhead
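Concretely, the budget computation might look like this; the reserve values are illustrative defaults that would be tuned per model and tokenizer, with a small allowance for the tool schema:

```python
def compute_budget(max_context: int,
                   response_reserve: int = 2048,
                   safety_margin: int = 512,
                   tool_overhead: int = 150) -> int:
    """Tokens available for the fitted conversation history.

    All reserve values are illustrative; tune them per model/tokenizer.
    """
    return max_context - response_reserve - safety_margin - tool_overhead

budget_32k = compute_budget(32768)  # 30058 usable tokens on a 32k model
```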
4.3 Dereference Overages
When a message is dereferenced, the context may temporarily exceed the budget. Options:
- Allow temporary overage: Accept that one turn may use more tokens. The next fit() will re-compress.
- Re-summarize other messages: When injecting a full message, summarize something else to maintain budget.
- Multi-message eviction: Summarize multiple messages to create headroom for future dereferences.
4.4 Caching
Two caches are maintained:
- Message store: MD5 → full message (the "disk")
- Summary cache: MD5 → summary (avoid re-summarizing identical content)
Both can be in-memory (for single-request scope) or backed by Redis (for persistence across requests/servers).
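A minimal in-memory sketch of the two caches (MessageStore and its method names are illustrative; a persistent deployment could swap the dicts for Redis hashes):

```python
class MessageStore:
    """Both caches behind one interface: originals are the "disk",
    summaries avoid re-summarizing identical content."""

    def __init__(self):
        self._messages = {}   # md5 -> full original message content
        self._summaries = {}  # md5 -> summary

    def evict(self, md5: str, content: str, summary: str) -> None:
        self._messages[md5] = content
        self._summaries.setdefault(md5, summary)  # keep first summary

    def original(self, md5: str):
        return self._messages.get(md5)

    def cached_summary(self, md5: str):
        return self._summaries.get(md5)
```

Because the key is a content hash, identical messages dedupe for free: evicting the same content twice stores one entry and reuses the cached summary.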
5. Evaluation
5.1 Methodology
We tested Context Paging on a 98-turn coding conversation where a model progressively built a task manager application. The conversation included:
- Initial design and data models
- Feature additions (persistence, CLI, priorities, due dates, tags)
- Backend development (Flask API, authentication, WebSockets)
- Database integration (SQLAlchemy, migrations)
- Testing (unit tests, integration tests, performance tests)
- Frontend development (React, TypeScript, Material-UI)
- Deployment (Docker, Kubernetes, CI/CD)
- Security and monitoring
We ran the same conversation against two models with different context limits:
| Model | Context Window | Tool Calling |
|---|---|---|
| SmolLM3-3B | 64k tokens | RAW (non-native) |
| Hermes-3-Llama-3.2-3B | 32k tokens | NATIVE |
5.2 Results
| Metric | SmolLM3 (64k) | Hermes-3 (32k) |
|---|---|---|
| Total turns | 98 | 98 |
| Raw history tokens | 79,263 | 184,154 |
| Context limit | 65,536 | 32,768 |
| Would overflow at turn | ~85 | ~6 |
| Final request tokens | 60,985 | 26,774 |
| Tokens saved | 18,278 | 157,380 |
| Compression ratio | 23.0% | 85.4% |
Key finding: The 32k model required 3.7x more compression but completed the same conversation. The sawtooth pattern below shows how context paging adapts to available space.
5.3 Token Growth Patterns
SmolLM3 (64k context) — Linear growth, compression kicks in late:
Turn 1: 72 tokens
Turn 10: 9,723 tokens
Turn 25: 24,445 tokens
Turn 50: 42,456 tokens
Turn 75: 58,992 tokens
Turn 98: 60,985 tokens (fit) vs 79,263 (raw)
Hermes-3 (32k context) — Linear growth until ~turn 18, then stable:
Turn 1: 82 tokens
Turn 10: 7,714 tokens
Turn 18: 27,333 tokens ← approaching limit
Turn 19: 27,469 tokens ← compression begins
Turn 50: 25,803 tokens ← stable
Turn 98: 26,774 tokens ← stable
The 32k model shows a classic "sawtooth" pattern after turn 18: context grows, approaches limit, old messages get summarized, context shrinks, repeat.
5.4 Dereference Behavior
Neither model issued fetch_message tool calls during the test. The summaries were sufficient for continuing the conversation. This suggests:
- For structured, incremental work (like coding), summaries preserve enough context
- Dereferences may be more common in conversations with frequent context-switching or surprise callbacks to early details
- Summary quality matters—a better summarizer reduces dereferences
6. Discussion
6.1 When Does This Work Well?
Incremental workflows: Coding, writing, research where each turn builds on the previous. The most relevant context is recent; older turns are less critical.
Structured conversations: When the model is following a plan or checklist. Summaries can capture the gist without full detail.
Budget-conscious applications: When token costs matter more than perfect context retention.
6.2 Limitations
Information loss: Summaries discard detail. If early turns contain critical information referenced much later, the model may need to dereference or may miss it entirely.
Dereference cost: Each dereference adds a model call. A conversation with many dereferences could be slower and more expensive than using a long-context model.
Pointer overhead: MD5 hashes and summary framing add overhead. For very short messages, summarization may not reduce tokens.
No semantic retrieval: Unlike RAG, the model cannot discover relevant old messages—it can only retrieve messages it knows about (via pointer). If a summary is missing, the model has no path to that content.
6.3 Comparison to Alternatives
| Approach | Infrastructure | Cost | Context Retention | Retrieval |
|---|---|---|---|---|
| Truncation | None | Lowest | Poor | None |
| Sliding Window | None | Low | Poor | None |
| Context Paging | Simple cache | Medium | Good (lossy) | On-demand |
| RAG | Embeddings, vector DB | High | Good | Semantic search |
| Long-context Model | None | Highest | Perfect | None |
Context Paging sits in the middle: more retention than truncation, less infrastructure than RAG, lower cost than long-context models.
6.4 Future Directions
- Semantic summarization: Tailor summary content based on the likely future relevance of each message.
- Proactive eviction: Anticipate context growth and summarize earlier to avoid last-minute compression.
- Multi-level paging: Summaries of summaries for very long conversations—like multi-level page tables.
- Integration with RAG: Use Context Paging for recent context, RAG for older messages that don't fit even in summarized form.
- Compression quality metrics: Track how often the model dereferences to evaluate summary effectiveness.
7. Conclusion
Context Paging provides a practical mechanism for extending conversational context beyond model limits. By treating the context window as a cache and the message store as backing memory, we preserve information that would otherwise be lost to truncation—while giving the model agency to retrieve what it needs.
The technique is lightweight, requiring only a key-value store and token counting. It works with any model that supports tool calls (or can parse structured output). In our evaluation, it enabled a 98-turn conversation on a 64k-token model that would have otherwise failed at turn 85.
The virtual memory analogy is not perfect—LLM context is not random access, and "page faults" require model decisions rather than hardware interrupts. But the principle holds: when memory is constrained, move less-used data to secondary storage and retrieve it on demand. For LLMs, that means summarized context with pointer-references to originals, fetched when the model needs more detail.
Appendix A: Message Pointer Format
[md5:<32-character-hash>] <summary text>
Example:
[md5:a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6] User requested OAuth2 implementation. We discussed authorization code flow, PKCE for security, and token refresh handling.
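A pointer in this format can be parsed with a short regular expression (parse_pointer is an illustrative helper):

```python
import re

# 32 lowercase hex chars = one MD5 digest; the rest is the summary.
POINTER_RE = re.compile(r"^\[md5:([0-9a-f]{32})\]\s*(.*)$", re.DOTALL)

def parse_pointer(text: str):
    """Return (md5, summary) for a pointered message, or None otherwise."""
    m = POINTER_RE.match(text)
    return (m.group(1), m.group(2)) if m else None

md5, summary = parse_pointer(
    "[md5:a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6] User requested OAuth2 implementation.")
```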
Appendix B: Tool Call Format
Native mode (OpenAI-compatible):
{
"tool_calls": [{
"id": "call_abc123",
"type": "function",
"function": {
"name": "fetch_message",
"arguments": "{\"md5\": \"a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6\"}"
}
}]
}
Raw mode (for models without native tool support):
<tool_call>{"name": "fetch_message", "arguments": {"md5": "a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6"}}</tool_call>
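Raw-mode output can be handled by extracting and decoding the tagged JSON (parse_raw_fetch is an illustrative helper for this parsing step):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_raw_fetch(output: str):
    """Extract the md5 argument of a raw-mode fetch_message call, or None."""
    m = TOOL_CALL_RE.search(output)
    if not m:
        return None
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None  # malformed tool call: treat the output as plain text
    if call.get("name") != "fetch_message":
        return None
    return call.get("arguments", {}).get("md5")

md5 = parse_raw_fetch(
    '<tool_call>{"name": "fetch_message", '
    '"arguments": {"md5": "a3f8c1e9d2b4f6e8c1d3e5f7a9b2c4d6"}}</tool_call>')
```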
Appendix C: Pseudocode
def context_paging(messages, budget):
    # Loop 1: Fit
    while token_count(messages) > budget:
        idx, oldest = find_oldest_unsummarized(messages)
        md5 = md5_hash(oldest["content"])   # hashlib.md5, not Python's builtin hash()
        store(md5, oldest)
        summary = summarize(oldest)
        messages[idx] = {
            "role": oldest["role"],
            "content": f"[md5:{md5}] {summary}",
            "_summarized": True,
            "_original_md5": md5
        }

    # Loop 2: Execute
    while True:
        response = llm.chat(messages, tools=[FETCH_MESSAGE_TOOL])
        if not has_tool_call(response):
            return response
        md5 = extract_md5_from_tool_call(response)
        original = retrieve(md5)
        # Inject the original back into context
        for msg in messages:
            if msg.get("_original_md5") == md5:
                msg["content"] = original["content"]
                msg["_summarized"] = False
                break
Context Paging is implemented as an open-source library. For implementation details and source code, see the project repository.