Static context is the beginner approach to agent design. You write a system prompt, maybe add a few retrieved documents, and send everything to the model on every turn. It works for simple agents. It fails for complex ones.
The problem with static context is that it doesn’t adapt. A customer support agent needs different context for a billing question than for a technical issue. A research agent needs different documents for a science question than for a history question. A coding agent needs different files for a frontend bug than for a backend refactor.
Dynamic context injection solves this by building context on the fly, assembling the right information for each specific request at the moment it’s needed. The context window becomes a living document that changes on every turn based on what the agent is doing right now.
This tutorial walks through building a dynamic context injection pipeline from scratch. By the end, you’ll have a system that classifies incoming requests, retrieves context tailored to each request type, and assembles everything within a token budget.
Why static context breaks down
Consider a customer support agent with access to a knowledge base of 500 articles. The static approach loads the system prompt plus the top 10 articles by general relevance. But this means:
- A billing question gets technical documentation in its context
- A technical question gets billing policies in its context
- Every query pays for 10 retrieved documents regardless of whether they’re needed
- The model’s attention is divided across irrelevant context, increasing hallucination risk
Dynamic injection fixes each of these problems. The billing question gets billing articles. The technical question gets technical docs. Simple questions get fewer documents. Complex questions get more.
The quality difference is measurable. In production systems, switching from static to dynamic context injection commonly reduces hallucination rates on the order of 25-40% and cuts token costs by 30-50%, though the exact gains depend on how diverse your query mix is.
Architecture overview
A dynamic context injection pipeline has four stages:
User Query → [1. Classify] → [2. Route] → [3. Retrieve] → [4. Assemble] → LLM
Stage 1: Classify. Determine what the query is about and what it needs.
Stage 2: Route. Based on the classification, decide which context sources to query and with what parameters.
Stage 3: Retrieve. Pull the relevant context from each source.
Stage 4: Assemble. Fit everything into the token budget, prioritized by relevance and importance.
Each stage is a separate, testable component. Let’s build them.
Stage 1: Query classification
The classifier determines what kind of context the query needs. This can be a simple keyword matcher, a lightweight ML model, or even a fast LLM call.
```python
class QueryClassifier:
    def __init__(self):
        self.categories = {
            "billing": ["invoice", "charge", "payment", "refund", "subscription", "price"],
            "technical": ["error", "bug", "crash", "install", "configure", "setup"],
            "account": ["password", "login", "email", "profile", "settings", "delete"],
            "general": [],  # fallback category
        }

    def classify(self, query: str) -> dict:
        query_lower = query.lower()
        scores = {}
        for category, keywords in self.categories.items():
            scores[category] = sum(1 for kw in keywords if kw in query_lower)
        best_category = max(scores, key=scores.get)
        if scores[best_category] == 0:
            best_category = "general"
        return {
            "category": best_category,
            "confidence": scores[best_category] / max(len(self.categories[best_category]), 1),
            "needs_retrieval": best_category != "general" or len(query.split()) > 5,
        }
```
For production systems, replace the keyword matcher with a small classifier model or a single fast LLM call. The classification itself should cost less than 1% of the total interaction cost and add less than 200ms of latency.
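One practical middle ground (also discussed in the FAQ below) is to run the keyword matcher first and fall back to an LLM call only when its confidence is low. The sketch below keeps that hybrid logic testable by injecting the LLM call as a plain callable; `classify_with_fallback` and the `threshold` value are illustrative names, not part of any library:

```python
from typing import Callable


def classify_with_fallback(query: str,
                           keyword_classify: Callable[[str], dict],
                           llm_classify: Callable[[str], str],
                           threshold: float = 0.2) -> dict:
    """Hybrid classification: cheap keyword pass first, LLM fallback
    only when keyword confidence falls below `threshold`.

    `llm_classify` is any callable that returns a category name; in
    practice a single fast model call. Injecting it keeps this logic
    testable without network access.
    """
    result = keyword_classify(query)
    if result["confidence"] < threshold:
        category = llm_classify(query)
        result = {
            "category": category,
            "confidence": 1.0,
            "needs_retrieval": category != "general",
            "source": "llm",
        }
    else:
        result["source"] = "keywords"
    return result
```

This way the expensive path only fires for the ambiguous minority of queries, keeping average classification latency close to the keyword-only figure.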
Stage 2: Context routing
The router decides which context sources to query based on the classification. Different query types need different context.
```python
class ContextRouter:
    def __init__(self):
        self.routes = {
            "billing": {
                "knowledge_base": {"collection": "billing_docs", "max_results": 5},
                "user_context": ["subscription_status", "billing_history"],
                "system_addons": ["refund_policy", "pricing_tiers"],
            },
            "technical": {
                "knowledge_base": {"collection": "tech_docs", "max_results": 8},
                "user_context": ["product_version", "recent_tickets"],
                "system_addons": ["known_issues", "troubleshooting_steps"],
            },
            "account": {
                "knowledge_base": {"collection": "account_docs", "max_results": 3},
                "user_context": ["account_status", "security_settings"],
                "system_addons": ["account_policies"],
            },
            "general": {
                "knowledge_base": {"collection": "general_docs", "max_results": 3},
                "user_context": [],
                "system_addons": [],
            },
        }

    def get_route(self, classification: dict) -> dict:
        category = classification["category"]
        return self.routes.get(category, self.routes["general"])
```
The routing table is where you encode your domain knowledge. A billing query needs fewer knowledge base results but more user-specific context (what plan are they on? what did they pay last month?). A technical query needs more knowledge base results but less user context.
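Because the routing table is plain data, a typo in a collection or field name fails silently at retrieval time. A startup validation pass catches this early; the sketch below assumes you can enumerate your vector store's collections and your user database's fields, and the function name is illustrative:

```python
def validate_routes(routes: dict, known_collections: set,
                    known_fields: set) -> list:
    """Return a list of configuration errors found in a routing table.
    An empty list means every route references real collections/fields."""
    errors = []
    for category, route in routes.items():
        kb = route.get("knowledge_base") or {}
        collection = kb.get("collection")
        if kb and collection not in known_collections:
            errors.append(f"{category}: unknown collection {collection!r}")
        for field in route.get("user_context", []):
            if field not in known_fields:
                errors.append(f"{category}: unknown user field {field!r}")
    return errors
```

Run it once at application startup and fail fast on a non-empty result, the same way you would treat a malformed config file.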
Stage 3: Context retrieval
The retriever pulls context from each source specified by the router. Each source type has its own retrieval logic.
```python
class ContextRetriever:
    def __init__(self, vector_store, user_db):
        self.vector_store = vector_store
        self.user_db = user_db
        self.system_docs = self._load_system_docs()  # cached policy/addon documents

    def retrieve(self, query: str, route: dict, user_id: str) -> dict:
        context = {}

        # Knowledge base retrieval (semantic search)
        kb_config = route.get("knowledge_base", {})
        if kb_config:
            results = self.vector_store.search(
                query=query,
                collection=kb_config["collection"],
                limit=kb_config["max_results"],
                min_score=0.80,
            )
            context["knowledge"] = [r.text for r in results]

        # User context retrieval (database lookup)
        user_fields = route.get("user_context", [])
        if user_fields and user_id:
            context["user"] = self.user_db.get_user_context(user_id, fields=user_fields)

        # System addon retrieval (cached documents)
        addon_keys = route.get("system_addons", [])
        if addon_keys:
            context["system"] = {
                key: self.system_docs[key]
                for key in addon_keys
                if key in self.system_docs
            }
        return context
```
Notice the minimum score threshold on knowledge base retrieval. If no documents score above 0.80, the retriever returns an empty list rather than irrelevant documents. This is critical for reducing hallucination.
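What the model sees when retrieval comes back empty matters just as much as the threshold itself. Rather than silently omitting the knowledge section, it often helps to tell the model explicitly that nothing matched. A minimal sketch; the function name and the exact wording of the notice are illustrative:

```python
def knowledge_section(docs: list) -> str:
    """Format retrieved documents for the prompt, with an explicit
    'nothing matched' notice instead of silent omission."""
    if not docs:
        return ("No reference documents matched this query above the "
                "relevance threshold. Answer from general knowledge, say "
                "so explicitly, and do not invent product-specific details.")
    return "\n\n".join(f"Reference document:\n{doc}" for doc in docs)
```

An explicit "nothing matched" instruction gives the model permission to decline, which is usually safer than leaving it to infer why the context is thin.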
Stage 4: Context assembly
The assembler fits all retrieved context into the token budget, prioritized by importance.
```python
class ContextAssembler:
    def __init__(self, total_budget=100000):
        self.total_budget = total_budget
        # Per-section budgets; their sum should not exceed total_budget.
        self.budgets = {
            "system_prompt": 3000,
            "user_context": 2000,
            "system_addons": 5000,
            "knowledge": 40000,
            "conversation": 30000,
            "output_reserve": 8000,
        }

    def assemble(self, system_prompt: str, context: dict,
                 conversation: list) -> list:
        # `count_tokens` and `format_user_context` are helper functions
        # assumed to be defined elsewhere.
        messages = []
        used_tokens = 0

        # System prompt (always included, highest priority)
        messages.append({"role": "system", "content": system_prompt})
        used_tokens += count_tokens(system_prompt)

        # User context (high priority, usually small)
        if "user" in context:
            user_block = format_user_context(context["user"])
            if count_tokens(user_block) <= self.budgets["user_context"]:
                messages[0]["content"] += f"\n\nUser context:\n{user_block}"
                used_tokens += count_tokens(user_block)

        # System addons (medium-high priority)
        if "system" in context:
            addon_ceiling = (self.budgets["system_prompt"]
                             + self.budgets["user_context"]
                             + self.budgets["system_addons"])
            for key, doc in context["system"].items():
                tokens = count_tokens(doc)
                if used_tokens + tokens <= addon_ceiling:
                    messages[0]["content"] += f"\n\n{key}:\n{doc}"
                    used_tokens += tokens

        # Knowledge base results (medium priority, fill remaining budget)
        if "knowledge" in context:
            kb_budget = self.budgets["knowledge"]
            for doc in context["knowledge"]:
                tokens = count_tokens(doc)
                if tokens <= kb_budget:
                    messages.append({
                        "role": "system",
                        "content": f"Reference document:\n{doc}",
                    })
                    kb_budget -= tokens
                    used_tokens += tokens

        # Conversation history: keep the most recent turns that fit,
        # then append them in chronological order.
        conv_budget = self.budgets["conversation"]
        history = []
        for turn in reversed(conversation):
            tokens = count_tokens(turn["content"])
            if tokens > conv_budget:
                break
            history.insert(0, turn)
            conv_budget -= tokens
        messages.extend(history)

        return messages
```
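The assembler leans on a `count_tokens` helper that the snippets above leave undefined. A reasonable sketch uses `tiktoken` when it is installed and a rough characters-per-token heuristic otherwise; the `cl100k_base` encoding and the 4-characters-per-token ratio are assumptions, not exact for every model:

```python
def count_tokens(text: str) -> int:
    """Estimate token count for budget checks. Exact with tiktoken
    installed, approximate (about 4 characters per token) without it."""
    if not text:
        return 0
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        return max(1, len(text) // 4)
```

For budget enforcement an estimate is fine as long as it errs on the high side; if you use the heuristic path, leave a little extra headroom in each section budget.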
Putting it together
The complete pipeline connects all four stages:
```python
class DynamicContextPipeline:
    def __init__(self, vector_store, user_db):
        self.classifier = QueryClassifier()
        self.router = ContextRouter()
        self.retriever = ContextRetriever(vector_store, user_db)
        self.assembler = ContextAssembler()

    def build_context(self, query: str, user_id: str,
                      conversation: list, system_prompt: str) -> list:
        # 1. Classify the query
        classification = self.classifier.classify(query)
        # 2. Route to context sources
        route = self.router.get_route(classification)
        # 3. Retrieve relevant context
        context = self.retriever.retrieve(query, route, user_id)
        # 4. Assemble within budget
        return self.assembler.assemble(system_prompt, context, conversation)
```
Each request now gets context tailored to its specific needs. A billing question about refunds gets the refund policy, the user’s billing history, and relevant knowledge base articles about refunds. Nothing else.
Testing your pipeline
Dynamic context pipelines need testing at each stage. Test the classifier with examples from each category. Test the router by verifying the right sources are selected. Test the retriever with queries that should and shouldn’t return results. Test the assembler by confirming it stays within budget.
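The assembler's key invariant, never exceeding the budget, is easy to assert mechanically on any output. A minimal checker; the default 4-characters-per-token counter is a stand-in for your real tokenizer:

```python
def within_budget(messages: list, budget: int, count=None) -> bool:
    """True if the total estimated tokens across all message contents
    fit inside `budget`. `count` defaults to a crude 4-chars-per-token
    estimate; swap in your real tokenizer in production tests."""
    count = count or (lambda s: len(s) // 4)
    return sum(count(m["content"]) for m in messages) <= budget
```

Run it over assembled output in your test suite with adversarial inputs: oversized knowledge documents, very long conversations, and every category in your routing table.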
The most important test: compare agent output quality with static context versus dynamic context on your evaluation dataset. If dynamic context doesn’t improve quality scores, the pipeline isn’t adding value and you should simplify.
For more on context engineering fundamentals, read our complete introduction. For token-level optimization techniques, see our context window optimization guide. For hands-on techniques to complement this tutorial, try our 5 techniques tutorial.
Frequently asked questions
How much latency does dynamic context injection add?
Classification adds 10-50ms (keyword matching) or 200-500ms (LLM-based). Retrieval adds 50-200ms for vector search and database lookups. Assembly adds 5-20ms. Total overhead: 65-720ms depending on your classification method. For most applications, the quality improvement justifies the latency.
Should every agent use dynamic context injection?
No. If your agent handles a single, narrow task (code formatting, data extraction, simple classification), static context is simpler and sufficient. Dynamic injection adds value when the agent handles diverse query types that need different context sources.
How do I decide which context sources to add?
Start with the sources that affect output quality most. Typically: the knowledge base (answers factual questions), user-specific data (personalizes responses), and policy documents (ensures compliance). Add sources one at a time and measure the quality impact of each.
Can I use an LLM for classification instead of keyword matching?
Yes, and it’s more accurate for ambiguous queries. The trade-off is latency (200-500ms vs 10-50ms) and cost (a small model call per request). For high-traffic systems, keyword matching with LLM fallback for low-confidence classifications is a good compromise.