The difference between a mediocre agent and a great one is usually context, not the model. Feed the same model poor context and it hallucinates. Feed it well-engineered context and it produces reliable, grounded outputs. Context engineering is the skill of curating what the model sees to maximize output quality while minimizing cost.
This tutorial walks through five production context engineering techniques with code examples you can adapt for your own agent systems. Each technique solves a specific problem and includes before/after comparisons showing the impact.
If you’re new to context engineering as a discipline, start with our complete introduction first. This tutorial assumes you understand the basics and are ready to implement.
Technique 1: Token budgeting
The problem: Your agent has a 128K context window, but you’re stuffing everything into it with no strategy. Some requests use 90K tokens and cost $0.50. Others use 3K tokens and cost $0.01. There’s no predictability, and costs spike on complex queries.
The technique: Divide your context window into segments with fixed budgets. Each segment has a maximum token allocation, and a compression strategy for when it hits the limit.
class TokenBudget:
    def __init__(self, total_budget=128000):
        self.allocations = {
            "system_prompt": int(total_budget * 0.04),      # ~5K tokens
            "conversation": int(total_budget * 0.25),       # 32K tokens
            "retrieved_context": int(total_budget * 0.45),  # ~58K tokens
            "tool_results": int(total_budget * 0.15),       # ~19K tokens
            "output_reserve": int(total_budget * 0.08),     # ~10K tokens
        }

    def fits(self, segment, content_tokens):
        return content_tokens <= self.allocations[segment]

    def remaining(self, segment, used_tokens):
        return self.allocations[segment] - used_tokens
The key insight: The output reserve is the most commonly forgotten allocation. If you fill 127K of a 128K window, the model has almost no room to generate a response. Always reserve 8-10% for output tokens.
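That reserve rule can be written as a small guard (a sketch; `max_input_tokens` is a hypothetical helper, and 8% is the low end of the recommended reserve):

```python
def max_input_tokens(context_window: int, output_reserve_frac: float = 0.08) -> int:
    """Cap on input tokens after setting aside the output reserve."""
    reserve = int(context_window * output_reserve_frac)
    return context_window - reserve
```

For a 128K window with the default 8% reserve, this caps input at 117,760 tokens; raising the reserve to 10% drops the cap to 115,200.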
Before: Unpredictable token usage, 3x cost variance between similar queries.
After: Consistent token usage within 10% of budget, costs predictable within 20%.
Technique 2: Progressive summarization
The problem: Your agent handles multi-turn conversations that can go 30-50 turns. Storing every turn verbatim exhausts the conversation budget by turn 15. Dropping old turns entirely loses important context from earlier in the conversation.
The technique: Compress older conversation turns at increasing levels of granularity. Recent turns are verbatim. Older turns are summarized. The oldest turns are reduced to bullet points.
def progressive_summarize(turns, budget_tokens, llm_client):
    """Keep recent turns verbatim, summarize older ones.

    count_tokens, truncate_to_budget, and format_turns are helper
    functions assumed to be defined elsewhere.
    """
    # Most recent 5 turns: verbatim
    recent = turns[-5:]
    recent_tokens = count_tokens(recent)

    # If even the recent turns exceed the budget, truncate them and stop
    if recent_tokens >= budget_tokens:
        return truncate_to_budget(recent, budget_tokens)

    remaining_budget = budget_tokens - recent_tokens
    older_turns = turns[:-5]
    if not older_turns:
        return recent

    # Summarize older turns into a compact paragraph,
    # capped by the space actually left in the budget
    summary_prompt = (
        "Summarize this conversation history in 2-3 sentences. "
        "Preserve: user goals, decisions made, key facts mentioned.\n\n"
        + format_turns(older_turns)
    )
    summary = llm_client.complete(summary_prompt, max_tokens=min(200, remaining_budget))
    return [{"role": "system", "content": f"Earlier context: {summary}"}] + recent
When to trigger summarization: Don’t summarize on every turn (that adds LLM call costs). Summarize when the conversation segment exceeds 80% of its budget. This means summarization happens infrequently for short conversations and regularly for long ones.
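The trigger itself is a one-line check (a sketch; `should_summarize` is a hypothetical helper using the 80% threshold suggested above):

```python
def should_summarize(used_tokens: int, segment_budget: int, trigger: float = 0.8) -> bool:
    """Fire summarization only once the conversation segment nears its budget."""
    return used_tokens >= segment_budget * trigger
```

With a 32K conversation budget, summarization fires at 25,600 used tokens, so short conversations never pay the extra LLM call.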
What to preserve: Goals the user stated, decisions that were made, factual claims that were confirmed, and action items that are pending. What to discard: greetings, acknowledgments, repeated questions, and tangential discussion.
Before: Conversations fail at turn 15 because the context window is exhausted.
After: Conversations run to 50+ turns while staying within budget. Quality of early context degrades gracefully rather than disappearing.
Technique 3: Selective retrieval with relevance thresholds
The problem: Your RAG pipeline retrieves the top 10 documents for every query. But some queries match well (scores 0.90+) and some match poorly (scores 0.60-0.70). Low-relevance documents dilute the context and increase hallucination risk because the model tries to use irrelevant information.
The technique: Set a minimum relevance threshold. Only include documents that score above it. If no documents pass the threshold, include none and let the agent acknowledge it doesn’t have the information.
def selective_retrieve(query, vector_store, threshold=0.82, max_docs=5):
    """Retrieve only documents above relevance threshold."""
    results = vector_store.similarity_search(query, k=max_docs * 2)

    # Filter by threshold
    relevant = [
        doc for doc in results
        if doc.similarity_score >= threshold
    ]

    # Take top max_docs from filtered results
    relevant = relevant[:max_docs]

    if not relevant:
        return [], "No relevant documents found. The agent should acknowledge uncertainty."
    return relevant, None
Choosing the threshold: Start at 0.80 and adjust based on your evaluation data. If the agent says “I don’t know” too often, lower the threshold. If the agent hallucinates from irrelevant context, raise it. The right threshold depends on your embedding model, your document corpus, and your quality requirements.
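One way to make that adjustment empirical is to sweep candidate thresholds over an evaluation set and watch how often the agent would have at least one document to work with (a sketch; `answerable_fraction` is a hypothetical helper, and `scored_results` stands in for the per-query similarity scores from your retriever):

```python
def answerable_fraction(scored_results, threshold):
    """Fraction of eval queries with at least one document above the threshold."""
    hits = sum(1 for scores in scored_results if any(s >= threshold for s in scores))
    return hits / len(scored_results)
```

Plotting this fraction across thresholds like 0.75, 0.80, 0.85, and 0.90 shows where "I don't know" responses start to climb, which you can weigh against hallucination rate at each setting.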
The “I don’t know” advantage: An agent that says “I don’t have information about that” is more trustworthy than one that confidently generates an answer from tangentially related documents. Users prefer honest uncertainty over confident hallucination.
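Wiring that honesty into the prompt is straightforward (a sketch; `build_rag_prompt` is a hypothetical helper showing one way to handle the empty-retrieval case from `selective_retrieve`):

```python
def build_rag_prompt(query, relevant_docs):
    """Ground the prompt in retrieved docs, or instruct honest uncertainty."""
    if not relevant_docs:
        return (
            f"Question: {query}\n"
            "No supporting documents were found. If you cannot answer with "
            "high confidence, say you don't know."
        )
    context = "\n\n".join(relevant_docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```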
Before: Agent retrieves 10 documents per query. 3-4 are relevant, 6-7 are noise. Hallucination rate: 15%.
After: Agent retrieves 3-5 highly relevant documents. Hallucination rate: 4%.
Technique 4: Priority-based context assembly
The problem: You have multiple context sources (system prompt, user history, retrieved documents, tool results) and they don’t all fit. You need a principled way to decide what stays and what gets cut.
The technique: Assign priority scores to every context chunk. Fill the context window from highest to lowest priority until the budget is exhausted.
def assemble_context(sources, total_budget):
    """Assemble context from prioritized sources."""
    # Priority levels (higher = more important)
    PRIORITIES = {
        "system_instructions": 100,
        "current_query": 95,
        "active_tool_results": 90,
        "recent_conversation": 80,
        "high_relevance_docs": 70,
        "user_preferences": 60,
        "medium_relevance_docs": 50,
        "older_conversation": 40,
        "low_relevance_docs": 30,
    }

    # Sort all chunks by priority
    all_chunks = []
    for source_type, chunks in sources.items():
        priority = PRIORITIES.get(source_type, 0)
        for chunk in chunks:
            all_chunks.append({
                "content": chunk,
                "tokens": count_tokens(chunk),
                "priority": priority,
            })
    all_chunks.sort(key=lambda x: x["priority"], reverse=True)

    # Fill budget from highest priority
    assembled = []
    used_tokens = 0
    for chunk in all_chunks:
        if used_tokens + chunk["tokens"] <= total_budget:
            assembled.append(chunk["content"])
            used_tokens += chunk["tokens"]
    return assembled, used_tokens
Dynamic re-prioritization: Priorities should change based on the current turn. If the user just asked about billing, billing-related documents get a priority boost. If the user is in a technical troubleshooting flow, technical documentation gets boosted. Re-score priorities on every turn.
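A per-turn boost can be layered on top of the static table (a sketch; `boost_priorities`, the keyword lists, and the +15 bump are all illustrative assumptions, not fixed values):

```python
def boost_priorities(base_priorities, query, topic_boosts, bump=15):
    """Return a per-turn copy of the priority table with topic-matched boosts."""
    q = query.lower()
    boosted = dict(base_priorities)  # leave the static table untouched
    for source, keywords in topic_boosts.items():
        if any(kw in q for kw in keywords):
            boosted[source] = boosted.get(source, 0) + bump
    return boosted
```

For example, a query mentioning an invoice would lift billing documents above same-priority technical documents for that turn only, since the base table is copied rather than mutated.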
Before: Context assembly is ad hoc. Important information gets crowded out by less relevant content.
After: The most important context is always included. Less important context is gracefully excluded when space is limited.
Technique 5: Tool result compression
The problem: Your agent calls external tools (web search, databases, APIs) and the raw results are verbose. A web search returns 10 snippets at 500 tokens each. A database query returns a JSON blob with 50 fields when only 5 are relevant. Raw tool results consume 30-40% of your context budget.
The technique: Compress tool results before injecting them into the context. Extract only the fields and information relevant to the current query.
def compress_tool_result(tool_name, raw_result, query_context):
    """Compress tool results to essential information."""
    if tool_name == "web_search":
        # Extract only title, snippet, and URL from search results
        compressed = []
        for result in raw_result["results"][:5]:
            compressed.append({
                "title": result["title"],
                "snippet": result["snippet"][:200],
                "url": result["url"],
            })
        return compressed
    elif tool_name == "database_query":
        # Keep only fields relevant to the query
        relevant_fields = identify_relevant_fields(raw_result, query_context)
        return {k: v for k, v in raw_result.items() if k in relevant_fields}
    elif tool_name == "api_call":
        # Flatten nested structures, remove metadata
        return extract_payload(raw_result)
    return raw_result  # fallback: return uncompressed
Compression ratios in practice: Web search results compress 60-70% (from ~5,000 tokens to ~1,500). Database results compress 40-80% depending on schema complexity. API responses compress 30-50%. These savings compound when agents make multiple tool calls per interaction.
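Those ratios are worth logging on every tool call so you notice when compression regresses (a sketch; `compression_ratio` is a hypothetical helper, and the 5,000 to 1,500 figures are the web-search example above):

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens saved by compressing a tool result."""
    return 1 - compressed_tokens / original_tokens
```

A web search result compressed from 5,000 tokens to 1,500 yields a ratio of 0.70, i.e. 70% saved.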
Before: Tool results consume 35% of context budget. Agent makes fewer tool calls to stay within budget, reducing capability.
After: Tool results consume 12% of context budget. Agent can make more tool calls while using fewer tokens.
Combining the techniques
These five techniques work together. In a production agent, the context assembly flow looks like this:
- Token budgeting sets the allocation for each segment
- Progressive summarization compresses the conversation history to fit its budget
- Selective retrieval pulls only relevant documents above the threshold
- Priority-based assembly fills the remaining budget from highest to lowest priority
- Tool result compression minimizes the footprint of any tool calls
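The flow above can be sketched end to end in deliberately toy form (every helper, constant, and threshold here is a stand-in for the fuller implementations in Techniques 1-5; the 4-characters-per-token estimate is a crude placeholder for a real tokenizer):

```python
def assemble_turn(query, recent_turns, scored_docs, tool_results, window=128000):
    """Toy pipeline: budget, filter, compress, then fill by priority."""
    budget = window - int(window * 0.08)              # 1. reserve output tokens
    tail = recent_turns[-5:]                          # 2. verbatim tail (summaries elided)
    docs = [d for d, s in scored_docs if s >= 0.82]   # 3. relevance threshold
    tools = [t[:800] for t in tool_results]           # 5. crude compression cap
    chunks = (
        [(95, query)]
        + [(90, t) for t in tools]
        + [(80, t) for t in tail]
        + [(70, d) for d in docs]
    )
    chunks.sort(key=lambda c: c[0], reverse=True)     # 4. fill by priority
    assembled, used = [], 0
    for _, text in chunks:
        tokens = max(1, len(text) // 4)               # crude token estimate
        if used + tokens <= budget:
            assembled.append(text)
            used += tokens
    return assembled
```

In a real agent each numbered step would call the corresponding function from the sections above; the point is the ordering, not the stand-in logic.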
The combined effect is significant. In production systems, applying all five techniques typically reduces token usage by 50-60% while maintaining or improving output quality. The agent sees less noise and more signal, which means fewer hallucinations and more relevant responses.
For a deeper dive into context window optimization math, including KV-cache strategies and attention manipulation, read our optimization guide. For the broader context engineering framework that these techniques fit into, see our complete context engineering guide.
Frequently asked questions
How do I measure whether my context engineering is working?
Track three metrics: hallucination rate (should decrease), token cost per interaction (should decrease), and task completion rate (should stay the same or improve). If all three move in the right direction, your context engineering is working. If task completion drops, you’re being too aggressive with compression or filtering.
Which technique should I implement first?
Start with selective retrieval (Technique 3). It has the highest impact-to-effort ratio. Setting a relevance threshold takes 30 minutes to implement and can cut hallucination rates by 50%+ immediately. Token budgeting (Technique 1) is the second priority because it prevents runaway costs.
Do these techniques work with all LLM providers?
Yes. Context engineering is provider-agnostic because it happens before the LLM call, not during it. The token counting specifics vary between providers (different tokenizers), but the techniques themselves apply to any model with a context window.
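Until you wire up a provider-specific tokenizer, a rough character-based estimate keeps the budgeting logic portable (a sketch; the 4-characters-per-token heuristic is an approximation for English text, not an exact count — swap in the provider's real tokenizer for billing-accurate numbers):

```python
def approx_tokens(text: str, chars_per_token: int = 4) -> int:
    """Provider-agnostic rough token estimate (~4 chars/token for English)."""
    return max(1, len(text) // chars_per_token)
```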
How much does context engineering actually save in production?
Typical savings are 40-60% reduction in token costs, 30-50% reduction in hallucination rates, and 10-20% improvement in response relevance. The exact numbers depend on your baseline. If your current context is already well-curated, the gains will be smaller. If you’re currently stuffing everything into the context window, the gains will be dramatic.