Context Window Optimization: Getting More From Every Token

A 200K token context window sounds like infinite space until you’re paying for it. Every token you send costs money, adds latency, and competes for the model’s attention. Most agent systems waste 40-60% of their context window on information that doesn’t contribute to the current task.

Context window optimization isn’t about fitting more information in. It’s about fitting the right information in. The difference between a well-optimized context and a bloated one shows up in three places: cost (10x difference between cache hits and misses), quality (models produce better outputs with focused context), and latency (shorter contexts process faster).

This guide covers the practical techniques that make the biggest difference, ordered by impact and implementation difficulty.


[Infographic: visual overview of context window optimization techniques]

Why context optimization matters more than you think

The context window is the model’s entire world. Everything the model can reference, reason about, and use to generate a response must be in the context window. Anything outside the window doesn’t exist as far as the model is concerned.

This creates a fundamental tension: you want the model to have access to everything relevant, but every irrelevant token dilutes the model’s attention, increases cost, and adds latency. The optimization challenge is maximizing relevance while minimizing waste.

The cost math

Most API pricing charges per input token and per output token. Consider a research agent that makes 10 model calls per task:

| Context management  | Tokens per call | Cost per task (GPT-4) | Cost per task (Claude) |
|---------------------|-----------------|-----------------------|------------------------|
| No optimization     | 50,000          | ~$1.50                | ~$1.50                 |
| Basic truncation    | 20,000          | ~$0.60                | ~$0.60                 |
| Active optimization | 8,000           | ~$0.24                | ~$0.24                 |
| With KV-cache hits  | 8,000 (cached)  | ~$0.024               | ~$0.03                 |

The difference between no optimization and full optimization with caching is roughly 50-60x in cost. For a system handling thousands of tasks daily, this is the difference between a viable product and a money pit.
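As a sanity check on the table, here is a minimal cost model. The $3-per-million-input-token rate and the 90% cache discount are illustrative round numbers; actual rates vary by provider and model:

```python
def cost_per_task(tokens_per_call: float, calls: int = 10,
                  rate_per_mtok: float = 3.00,
                  cached_fraction: float = 0.0,
                  cache_discount: float = 0.90) -> float:
    """Estimate input-token cost for one agent task of `calls` model calls."""
    full_price = tokens_per_call * (1 - cached_fraction)
    discounted = tokens_per_call * cached_fraction * (1 - cache_discount)
    return calls * (full_price + discounted) * rate_per_mtok / 1_000_000

no_opt = cost_per_task(50_000)                       # no optimization
active = cost_per_task(8_000)                        # active optimization
cached = cost_per_task(8_000, cached_fraction=1.0)   # fully cached prefix
```

Plugging in the table's token counts reproduces its cost column, and the roughly 60x gap between `no_opt` and `cached` falls out of the arithmetic directly.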

Technique 1: Token budgeting

Assign a specific token budget to each component of your context window, then enforce it.

How to implement

Divide your context into categories and allocate budgets:

| Component            | Budget (% of total) | Purpose                                  |
|----------------------|---------------------|------------------------------------------|
| System prompt        | 10-15%              | Agent instructions, persona, constraints |
| Conversation history | 20-30%              | Recent messages and context              |
| Retrieved knowledge  | 25-35%              | RAG results, domain knowledge            |
| Tool results         | 15-20%              | Outputs from tool calls                  |
| Reserve              | 10-15%              | Buffer for unexpected content            |

When a category exceeds its budget, apply that category’s compression strategy (truncation, summarization, or filtering) before adding content to the context.
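A minimal enforcement sketch, assuming a word-count stand-in for a real tokenizer and hard truncation as the compression strategy (the budget fractions are midpoints from the table above):

```python
BUDGETS = {  # fraction of the context window per category
    "system": 0.12, "history": 0.25, "knowledge": 0.30,
    "tools": 0.18, "reserve": 0.15,
}

def enforce_budget(category: str, text: str, window_tokens: int,
                   count_tokens, compress) -> str:
    """Compress `text` until it fits the category's token budget."""
    limit = int(BUDGETS[category] * window_tokens)
    while count_tokens(text) > limit:
        text = compress(text, limit)
    return text

# Toy usage: tokens approximated by words, compression = hard truncation.
count = lambda s: len(s.split())
truncate = lambda s, n: " ".join(s.split()[:n])
fitted = enforce_budget("tools", "record " * 500, window_tokens=1000,
                        count_tokens=count, compress=truncate)
```

In a real system, `count_tokens` would be the model's tokenizer and `compress` would dispatch to truncation, summarization, or filtering depending on the category.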

Common mistake

Most developers don’t budget at all. They add content to the context until it’s full, then truncate from the beginning. This crude approach loses important early context (like system instructions or initial user requirements) while keeping less relevant recent messages.

Technique 2: Progressive summarization

Replace old content with increasingly compressed summaries rather than discarding it entirely.

How to implement

When conversation history exceeds its token budget, summarize older messages in stages:

Recent messages (last 5-10): Keep in full. These contain the most immediately relevant context.

Medium-age messages (10-30 messages ago): Summarize to key facts and decisions. A 2,000-token exchange might compress to a 200-token summary: “User requested a competitive analysis of LangChain vs CrewAI. I searched three sources and found that LangGraph handles state better while CrewAI is faster for prototyping.”

Old messages (30+ messages ago): Summarize to a single paragraph capturing key requirements, decisions, and constraints established earlier in the conversation.
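The tiering above can be sketched as a single function; `summarize` is a placeholder for a cheap model call with a structured summarization prompt:

```python
def compress_history(messages: list[str], summarize) -> list[str]:
    """Tier conversation history: old -> one paragraph, medium -> key
    facts, recent -> kept verbatim. `summarize(msgs, style)` stands in
    for a lightweight model call."""
    recent = messages[-10:]     # last 10: keep in full
    medium = messages[-30:-10]  # 10-30 ago: key facts and decisions
    old = messages[:-30]        # 30+ ago: single-paragraph summary
    out = []
    if old:
        out.append(summarize(old, style="one_paragraph"))
    if medium:
        out.append(summarize(medium, style="key_facts"))
    return out + recent

# Toy usage with a fake summarizer that just tags its inputs.
fake = lambda msgs, style: f"<{style}:{len(msgs)}>"
history = compress_history([f"m{i}" for i in range(40)], fake)
```

A 40-message history collapses to 12 entries: one paragraph for the oldest 10, one key-facts summary for the middle 20, and the last 10 verbatim.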

Trade-offs

Each summarization step risks losing important details. The summary model might miss a constraint mentioned in passing (“make sure to include pricing data”) that becomes critical later. Mitigate this by using structured summarization prompts that specifically extract: requirements, constraints, decisions, and unresolved questions.

Summarization costs additional API calls. For long-running agent tasks, balance the cost of summarization against the cost of sending full, uncompressed history.

Technique 3: KV-cache optimization

The KV-cache (key-value cache) stores computed attention patterns for tokens the model has already processed. If the beginning of your prompt is identical across requests, those tokens are served from cache rather than recomputed, at dramatically lower cost.

How to implement

Structure your prompts so that static content appears first and dynamic content appears last:

[System prompt - static, cacheable]
[Domain knowledge - static or slowly changing, cacheable]
[Conversation history - dynamic, not cached]
[Current query - dynamic, not cached]

When the model provider’s API supports prompt caching (both Anthropic and OpenAI offer this), the static prefix is processed once and cached. Subsequent calls with the same prefix reuse the cached computation, reducing cost by approximately 90% for the cached portion.
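A sketch of a request payload ordered for caching. The `cache_control` field follows the shape of Anthropic's prompt-caching API; other providers differ, so treat the exact field names as illustrative:

```python
def build_cacheable_request(system_prompt: str, domain_docs: str,
                            history: list[dict], query: str) -> dict:
    """Static content first and marked cacheable; dynamic content last."""
    return {
        "system": [
            # Stable prefix: processed once, then served from cache.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": domain_docs,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Dynamic suffix: changes every call, never cached.
        "messages": history + [{"role": "user", "content": query}],
    }

req = build_cacheable_request("You are a research agent.", "DOMAIN DOCS...",
                              [], "Compare LangGraph and CrewAI.")
```

The key property is ordering: anything that changes per call sits after the cache markers, so the stable prefix hashes identically across requests.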

Key insight

The order of content in your context window isn’t just about attention. It’s about caching. Moving a 5,000-token system prompt from the end to the beginning of your context can save 90% of the cost of those 5,000 tokens on every subsequent call.

Requirements

  • Your prompt must have a stable, identical prefix across calls
  • The prefix must be long enough to justify caching (typically 1,000+ tokens)
  • The API provider must support prompt caching
  • Minimum cache lifetime varies by provider (Anthropic: 5 minutes, OpenAI: varies)

Technique 4: Selective retrieval

When using RAG (Retrieval-Augmented Generation), retrieve less but retrieve better.

The retrieval-quality paradox

Retrieving more documents increases the chance that the answer is somewhere in the context. But it also dilutes the model’s attention across more content, reducing the quality of the response. The model attends to everything in the context window, and irrelevant-but-retrieved content competes with the actual answer.

How to optimize

Reduce top-K. Most RAG systems retrieve the top 5-10 documents by default. Start with 3 and only increase if the model’s responses indicate missing information. In many cases, the top 3 most relevant results contain everything the model needs.

Rerank before injection. Use a reranker model to score retrieved documents for relevance to the specific query. A lightweight reranker (like Cohere Rerank or a cross-encoder) adds minimal latency but significantly improves the relevance of what enters the context.

Extract, don’t include. Instead of injecting full documents, extract the relevant paragraphs or sentences. A 3,000-word document might contain one paragraph that actually answers the question. Injecting the full document wastes 2,800 words of context on irrelevant content.

Add recency weighting. For domains where information changes over time, boost recently added or recently updated documents. A product specification from last month is more likely to be current than one from last year.
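The reranking, top-K, and recency steps can be combined into one selection function. `score` stands in for a cross-encoder reranker, and the word-overlap scorer in the usage example is a toy:

```python
def select_context(query: str, candidates: list[str], score,
                   top_k: int = 3, recency=None) -> list[str]:
    """Rerank retrieved chunks and keep only the top_k most relevant.
    `score(query, chunk)` stands in for a reranker model; `recency(chunk)`
    optionally boosts newer documents."""
    def ranked(chunk):
        s = score(query, chunk)
        if recency is not None:
            s += recency(chunk)
        return s
    return sorted(candidates, key=ranked, reverse=True)[:top_k]

# Toy usage: relevance scored by word overlap with the query.
overlap = lambda q, c: len(set(q.split()) & set(c.split()))
docs = ["cats purr loudly",
        "LangChain deployment patterns for production",
        "deployment checklist for LangChain"]
top = select_context("LangChain production deployment", docs, overlap, top_k=2)
```

In production, the same shape works with a hosted reranker: replace `overlap` with a call that returns the reranker's relevance score.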

Technique 5: Context-aware tool result compression

Tool results are often the biggest source of context bloat. A web search returns pages of results. A database query returns full records. A file read returns entire documents. Most of that content isn’t relevant to the current step.

How to implement

Summarize tool results before adding to context. Use a lightweight model or extraction logic to pull out only the relevant information from tool results. A web search that returns 5,000 tokens of results might contain 500 tokens of useful information.

Set maximum token limits per tool result. A database query that returns 10,000 tokens of records should be truncated to the most relevant 2,000 tokens. Better yet, modify the query to return only the fields and records the agent actually needs.

Remove tool results from previous steps. Unless a previous tool result is directly relevant to the current step, remove it from the context. The agent already used that information to make a decision. Keeping it in context wastes tokens on information that has served its purpose.
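A minimal guard that applies these rules before a tool result enters the context, assuming word counts as a crude token proxy; `extract` is a placeholder for relevance-extraction logic:

```python
def cap_tool_result(result: str, max_tokens: int, count_tokens,
                    extract=None) -> str:
    """Shrink a tool result before adding it to the context. Prefer an
    `extract` step (pull only relevant passages); fall back to hard
    truncation with an explicit marker so the model knows content was cut."""
    if extract is not None:
        result = extract(result)
    if count_tokens(result) <= max_tokens:
        return result
    words = result.split()  # crude proxy: treat words as tokens
    return " ".join(words[:max_tokens]) + " [...truncated]"

# Toy usage: a 100-"token" database dump capped at 20.
count = lambda s: len(s.split())
capped = cap_tool_result("row " * 100, max_tokens=20, count_tokens=count)
```

The explicit truncation marker matters: without it, the model may treat a cut-off record set as complete and reason from missing data.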

Example

A research agent searches for “LangChain production deployment patterns.” The search returns 8,000 tokens of results from five web pages. After compression:

  • Extract key findings: 1,200 tokens
  • Remove boilerplate (navigation, footers, ads): saves 3,000 tokens
  • Remove redundant information across pages: saves 1,500 tokens
  • Remove irrelevant sections: saves 2,300 tokens

Net context impact: 1,200 tokens instead of 8,000, with all relevant information preserved.

Technique 6: Attention manipulation through structure

How you format information in the context window affects how the model attends to it. Well-structured content gets better attention than walls of text.

Practical techniques

Use headers and sections. Break retrieved knowledge into clearly labeled sections. The model can navigate structured content more effectively than unstructured paragraphs.

Put critical information at the beginning and end. Research consistently shows that models attend most strongly to the beginning and end of the context window. Place the most important instructions, constraints, and current query in these positions.

Use explicit markers for different content types. Label conversation history, retrieved documents, and tool results clearly. The model handles mixed content better when it can identify what each section is.

Repeat critical constraints. If a constraint is essential (for example, “never recommend discontinued products”), mention it in the system prompt AND near the end of the context, before the current query. Repetition reinforces attention without significant token overhead.
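Putting these structural rules together, a hypothetical context assembler might look like this (the section names and layout are illustrative, not a standard):

```python
def assemble_context(system: str, constraints: list[str],
                     knowledge: list[str], history: str, query: str) -> str:
    """Assemble the context with explicit section markers, critical
    constraints repeated near the end, and the current query last."""
    sections = [
        f"## System\n{system}",
        "## Constraints\n" + "\n".join(f"- {c}" for c in constraints),
        "## Retrieved knowledge\n" + "\n\n".join(knowledge),
        f"## Conversation history\n{history}",
        # Repeat constraints just before the query: the beginning and
        # end of the window get the strongest attention.
        "## Reminder\n" + "\n".join(f"- {c}" for c in constraints),
        f"## Current query\n{query}",
    ]
    return "\n\n".join(sections)

ctx = assemble_context("You are a research agent.",
                       ["never recommend discontinued products"],
                       ["LangGraph manages state via checkpoints."],
                       "(summarized history)",
                       "Which framework should we use?")
```

Each constraint appears exactly twice, once at the top and once just before the query, which costs a few dozen tokens per constraint rather than a rewrite of the whole context.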

Measuring optimization effectiveness

Track these metrics to understand whether your optimization efforts are working:

| Metric               | What it tells you                     | Target                    |
|----------------------|---------------------------------------|---------------------------|
| Average context size | How efficiently you're using tokens   | Decreasing over time      |
| Cache hit rate       | How well your caching strategy works  | 70%+ for stable workflows |
| Cost per task        | Direct financial impact               | Decreasing or stable      |
| Output quality score | Whether optimization hurts quality    | Stable or improving       |
| Retrieval precision  | Whether retrieved content is relevant | 80%+                      |

The critical balance: cost reduction must not degrade output quality. If quality drops as you optimize, you’ve cut too aggressively. Measure both metrics simultaneously.
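A minimal tracker for the first three metrics (quality score and retrieval precision need their own evaluation pipelines, so they are omitted here):

```python
from dataclasses import dataclass

@dataclass
class OptimizationMetrics:
    """Running totals for context size, cache hit rate, and cost."""
    calls: int = 0
    total_tokens: int = 0
    cache_hits: int = 0
    total_cost: float = 0.0

    def record(self, tokens: int, cost: float, cache_hit: bool) -> None:
        self.calls += 1
        self.total_tokens += tokens
        self.total_cost += cost
        self.cache_hits += int(cache_hit)

    def summary(self) -> dict:
        return {
            "avg_context_size": self.total_tokens / self.calls,
            "cache_hit_rate": self.cache_hits / self.calls,
            "avg_cost": self.total_cost / self.calls,
        }

m = OptimizationMetrics()
m.record(tokens=8_000, cost=0.024, cache_hit=True)
m.record(tokens=10_000, cost=0.30, cache_hit=False)
stats = m.summary()
```

Log one `record` call per model request and watch `summary()` over time; a falling average context size with a flat quality score is the signal that optimization is working.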

Connecting to the broader context engineering discipline

Context window optimization is one component of context engineering, the discipline of controlling what the model sees and when. Optimization techniques work alongside memory management (see our agent memory patterns guide), retrieval strategy design, and dynamic context assembly to create the full context engineering pipeline.

For the foundational concepts of building reliable agent infrastructure, read our complete introduction to harness engineering.

Frequently asked questions

How do I know if my context window is too large?

Three signals: cost per task is rising without corresponding quality improvement, the model is producing outputs that reference irrelevant information from the context, or latency is increasing as context grows. If any of these apply, you’re sending too much context.

Does prompt caching work with all models?

No. Anthropic Claude and OpenAI GPT both support prompt caching, but implementation details differ (cache lifetime, minimum prefix length, cost savings). Check your provider’s current documentation. Some open-source model deployments support KV-cache optimization through frameworks like vLLM.

Should I optimize for cost or quality first?

Quality first, always. An optimized context that produces wrong answers saves money while destroying user trust. Establish your quality baseline, then optimize cost without dropping below it. In practice, moderate optimization often improves both cost and quality because a focused context helps the model produce better outputs.

How much does context window optimization realistically save?

For most production agent systems, 50-80% cost reduction is achievable through the combination of token budgeting, progressive summarization, KV-cache optimization, and selective retrieval. The highest-impact technique varies by system: retrieval-heavy systems benefit most from selective retrieval, while conversational systems benefit most from progressive summarization.

Subscribe to the newsletter for weekly tutorials on context engineering, cost optimization, and production agent patterns.
