Reducing AI Agent Inference Costs: Caching Strategies, Model Routing, and Token Optimization Techniques

You’ve built your first AI agent. It works beautifully in development. Then you deploy it to production — and the invoice arrives.

Welcome to one of the most important engineering challenges in modern AI development: keeping inference costs under control without sacrificing performance. Whether you’re running a customer support agent handling thousands of conversations daily or a research pipeline querying an LLM hundreds of times per task, unchecked inference costs can make even a great product economically unviable.

The good news? Reducing AI agent inference costs is an engineering problem, and engineering problems have solutions. In this guide, we’ll walk through three high-impact strategy areas — prompt caching, model routing, and token optimization — with real-world examples you can apply starting today.


Why Inference Costs Spiral Out of Control

Before we optimize, we need to understand where the money goes.

LLM inference is priced in two dimensions: input tokens (what you send to the model) and output tokens (what the model generates back). Output tokens are typically 3–5x more expensive than input tokens. For an AI agent that runs multi-step reasoning loops, tool calls, and long context windows, these costs compound fast.

Consider a realistic example: an agent that processes customer support tickets. Each request might include:

  • A system prompt: ~500 tokens
  • Conversation history: ~1,000 tokens
  • Retrieval-augmented context (RAG results): ~2,000 tokens
  • The user’s message: ~100 tokens

That’s 3,600 input tokens per call, before the model even responds. If your agent makes 3 tool calls before answering, you’ve multiplied that by 3. At scale — say, 10,000 tickets per day — you’re looking at billions of input tokens monthly. Small inefficiencies become enormous costs.
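A quick back-of-the-envelope calculation makes the compounding concrete. The token counts come from the example above; the per-token prices and output length are illustrative assumptions, not anyone’s current list prices:

```python
# Illustrative cost model. Token counts match the ticket example above;
# prices and output length are assumptions for the sketch only.
INPUT_PRICE_PER_MTOK = 3.00    # $ per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.00  # $ per million output tokens (assumed)

input_tokens_per_call = 500 + 1000 + 2000 + 100   # system + history + RAG + message
calls_per_ticket = 3                               # tool-call loop
tickets_per_day = 10_000
output_tokens_per_call = 300                       # assumed average response

daily_input = input_tokens_per_call * calls_per_ticket * tickets_per_day
daily_output = output_tokens_per_call * calls_per_ticket * tickets_per_day

monthly_cost = 30 * (
    daily_input / 1e6 * INPUT_PRICE_PER_MTOK
    + daily_output / 1e6 * OUTPUT_PRICE_PER_MTOK
)
print(f"~{30 * daily_input / 1e6:.0f}M input tokens/month, ${monthly_cost:,.0f}/month")
```

Even at these modest assumed prices, that’s a five-figure monthly bill, and every multiplier (tool calls, context size, ticket volume) scales it linearly.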

Let’s fix that.


Strategy 1: Prompt Caching

What Is Prompt Caching?

Prompt caching lets you cache portions of your input prompt so that repeated content doesn’t need to be re-processed (and re-charged) on every request. Most major AI providers — including Anthropic, OpenAI, and Google — now offer some form of this capability.

With Anthropic’s Claude API, for instance, you can mark parts of your prompt with a cache_control parameter. When the same cached prefix appears in subsequent requests, you’re charged at a significantly reduced rate (roughly 10% of the normal input token price for cache hits).

How to Structure Your Prompts for Caching

The key insight is: put stable content first, dynamic content last.

Here’s a concrete before/after example:

Before (no caching strategy):

messages = [
    {
        "role": "user",
        "content": f"""
        You are a customer support agent for Acme Corp.
        {FULL_PRODUCT_DOCUMENTATION}  # 2,000 tokens, same every time
        {COMPANY_POLICIES}            # 1,000 tokens, same every time

        Customer message: {user_message}  # 50 tokens, changes every time
        """
    }
]

Every call sends all 3,050 tokens fresh. Cache hit rate: 0%.

After (cache-aware structure):

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": FULL_PRODUCT_DOCUMENTATION + COMPANY_POLICIES,
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": f"Customer message: {user_message}"
            }
        ]
    }
]

Now the first 3,000 tokens are cached. After the first call, every subsequent request pays full price only for the 50-token user message; the cached static portion is billed at roughly 10% of the normal input rate, about a 90% reduction on that portion’s cost.
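Under the ~10% cache-read pricing mentioned earlier, the per-call savings are easy to verify. This sketch assumes a flat 10% cache-read rate and ignores the one-time cache-write surcharge some providers charge:

```python
# Effective input-token cost per call, assuming cache reads are billed
# at 10% of the normal input rate (provider-dependent assumption).
CACHE_READ_RATE = 0.10

static_tokens = 3000   # docs + policies (cached after the first call)
dynamic_tokens = 50    # user message (always full price)

uncached_cost_units = static_tokens + dynamic_tokens                   # 3,050
cached_cost_units = static_tokens * CACHE_READ_RATE + dynamic_tokens   # 350

savings = 1 - cached_cost_units / uncached_cost_units
print(f"{savings:.0%} cheaper per cached call")  # ~89% cheaper overall
```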

What to Cache

Not everything is worth caching. Prioritize:

  • System prompts and personas — rarely change between requests
  • Retrieved documents and RAG context — if the same knowledge base chunks appear frequently
  • Few-shot examples — stable demonstrations you include for consistency
  • Tool definitions — especially if you’re passing long JSON schemas for function calling

Cache TTL and Invalidation

Cached prompts don’t live forever. Anthropic’s ephemeral cache, for example, has a 5-minute TTL by default. Design your agent to maximize within-session reuse. For longer-lived caches, check your provider’s extended caching options — some offer multi-hour or even persistent caching tiers.

Pro tip: Log your cache hit rates as a production metric. If your hit rate drops below 80% for content you expect to be stable, something upstream changed — catch it early before costs spike.
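The hit-rate check from the pro tip is a few lines in your metrics pipeline. A sketch, with made-up counter names — providers typically report cache-read vs. regular input tokens in each API response’s usage block:

```python
def cache_hit_rate(cache_read_tokens: int, cache_miss_tokens: int) -> float:
    """Fraction of cacheable input tokens actually served from cache."""
    total = cache_read_tokens + cache_miss_tokens
    return cache_read_tokens / total if total else 0.0

def hit_rate_alert(rate: float, threshold: float = 0.80) -> bool:
    """True when the hit rate has dropped below the expected floor."""
    return rate < threshold

rate = cache_hit_rate(cache_read_tokens=92_000, cache_miss_tokens=8_000)
print(rate, hit_rate_alert(rate))  # 0.92 False
```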


Strategy 2: Intelligent Model Routing

The “Always Use the Best Model” Trap

It’s tempting to route every request through your most capable (and most expensive) model. But the reality is that most tasks in a production AI agent don’t require frontier-model reasoning. Routing everything through Claude Opus or GPT-4o when a smaller model would do just fine is one of the most common sources of unnecessary cost.

Model routing means dynamically selecting which model handles each request based on the task’s complexity and requirements.

Building a Simple Routing Layer

Here’s a practical tiered routing approach:

Tier 1 — Lightweight tasks (small/fast models, e.g., Claude Haiku, GPT-4o-mini):
– Intent classification
– Simple slot extraction (“what city does the user mention?”)
– Formatting and summarization of structured data
– Yes/no decision gates
– Short factual lookups

Tier 2 — Medium complexity (mid-tier models, e.g., Claude Sonnet):
– Multi-step reasoning with available tools
– Drafting responses that need editing
– Moderate-length document analysis
– Most agentic loops

Tier 3 — High complexity (frontier models, e.g., Claude Opus, GPT-4o):
– Complex reasoning over ambiguous instructions
– Tasks where quality directly impacts revenue (e.g., final customer-facing output)
– Novel problem-solving with no clear precedent

Here’s how that routing logic might look in code:

def route_request(task_type: str, context_length: int, requires_tools: bool) -> str:
    """Return the appropriate model ID for a given task."""

    # Simple classification or short tasks — use the cheapest model
    if task_type in ("intent_classification", "slot_extraction", "format_check"):
        return "claude-haiku-4-5-20251001"

    # Tool-using agents or medium reasoning
    if requires_tools or context_length < 8000:
        return "claude-sonnet-4-6"

    # Complex reasoning or large contexts
    return "claude-opus-4-6"

Cascade Routing: Try Cheap First

A more sophisticated pattern is cascade routing: attempt the task with a cheaper model, evaluate confidence, and escalate only if needed.

async def cascade_complete(prompt: str, threshold: float = 0.85) -> str:
    # First attempt with cheap model
    result = await call_model("claude-haiku-4-5-20251001", prompt)

    # Evaluate confidence (could be self-evaluated or rule-based)
    confidence = evaluate_confidence(result)

    if confidence >= threshold:
        return result.text  # Good enough — done!

    # Escalate to stronger model
    return await call_model("claude-sonnet-4-6", prompt)
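The `evaluate_confidence` helper above is left abstract. A minimal rule-based version, operating on the response text, might look like the sketch below. The heuristics are illustrative assumptions — production systems more often use a self-evaluation prompt or a trained classifier:

```python
def evaluate_confidence(response_text: str) -> float:
    """Crude rule-based confidence score in [0, 1] for a model response.

    Illustrative heuristics only: very short answers and hedging
    language lower the score.
    """
    hedges = ("i'm not sure", "it's unclear", "cannot determine", "might be")
    text = response_text.strip().lower()

    if len(text) < 10:
        return 0.0  # empty or near-empty answers never pass
    score = 1.0 - 0.3 * sum(phrase in text for phrase in hedges)
    return max(score, 0.0)

print(evaluate_confidence("The order ships Tuesday via ground freight."))  # 1.0
```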

In practice, teams often find that 60–80% of tasks can be handled by cheaper models without any user-facing quality degradation. That ratio compounds dramatically at scale.

A/B Test Your Routing

Don’t guess at quality thresholds — measure them. Run shadow mode experiments where you send the same request to both a cheap and expensive model, compare outputs, and let your quality metrics tell you where the routing boundary should be.
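A shadow-mode comparison can start as simply as logging how often the cheap model’s answer matches the expensive one on the same inputs. The sketch below uses naive exact-match agreement as a stand-in; real systems usually substitute an LLM judge or a task-specific quality metric:

```python
def shadow_agreement(paired_outputs: list[tuple[str, str]]) -> float:
    """Fraction of requests where cheap and expensive model outputs agree.

    `paired_outputs` holds (cheap_output, expensive_output) pairs collected
    by sending the same request to both models. Exact match is a naive
    placeholder for a real comparison metric.
    """
    if not paired_outputs:
        return 0.0
    matches = sum(cheap == expensive for cheap, expensive in paired_outputs)
    return matches / len(paired_outputs)

pairs = [("refund", "refund"), ("refund", "replace"), ("cancel", "cancel")]
print(shadow_agreement(pairs))
```

If agreement is high for a task type, route it to the cheap tier; if it’s low, keep it on the expensive tier and investigate why.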


Strategy 3: Token Optimization

Trim Your System Prompts (Without Losing Effectiveness)

System prompts have a way of growing over time. Engineers add edge case handling. Product managers add brand voice guidelines. Support teams add FAQs. Before long, your system prompt is 2,000 tokens of instruction that could be 400 tokens of well-structured guidance.

Conduct a regular system prompt audit:

  1. Print every sentence in your system prompt
  2. Ask: “Would the agent behave differently if I removed this?”
  3. If no, cut it
  4. Test rigorously after trimming

Concrete compression techniques:

  • Replace verbose prose with bullet points
  • Remove redundant instructions (“always be helpful, always be polite, always be professional” → “be professional and helpful”)
  • Move rarely-triggered edge cases to tool-retrieved context rather than the base prompt
  • Use structured formatting (JSON or YAML-style instruction blocks) which LLMs parse efficiently

Compress Retrieved Context (RAG Optimization)

If your agent uses retrieval-augmented generation, the retrieved chunks are often the biggest token consumer. Two techniques help enormously:

1. Rerank before you send. Retrieve more chunks than you need (say, top 20), then use a fast reranker model to select only the top 5 most relevant. Send only those 5 to the main model.

2. Summarize chunks before injection. Instead of sending a 500-token raw document excerpt, pre-summarize it to 100 tokens using a cheap model. You lose some detail but preserve the signal that matters.

async def get_compressed_context(query: str) -> str:
    raw_chunks = await retrieve_top_k(query, k=10)

    compressed = []
    for chunk in raw_chunks:
        summary = await call_model(
            "claude-haiku-4-5-20251001",
            f"Summarize this in 2 sentences, preserving key facts:\n{chunk}"
        )
        compressed.append(summary.text)

    return "\n\n".join(compressed)
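Technique 1 — rerank before you send — can be sketched the same way. The word-overlap scoring below is a deliberately naive stand-in; in practice you’d call a dedicated reranker model or API:

```python
def rerank_and_trim(query: str, chunks: list[str], keep: int = 5) -> list[str]:
    """Score retrieved chunks against the query and keep only the best few.

    Word-overlap scoring is a placeholder for a real reranker model.
    """
    def score(chunk: str) -> int:
        query_words = set(query.lower().split())
        return len(query_words & set(chunk.lower().split()))

    ranked = sorted(chunks, key=score, reverse=True)
    return ranked[:keep]

chunks = [
    "Refund policy: refunds within 30 days of purchase.",
    "Office locations and opening hours.",
    "How to request a refund for a damaged item.",
]
print(rerank_and_trim("refund for damaged item", chunks, keep=2))
```

Retrieving wide (top 20) and sending narrow (top 5) lets you keep recall high while paying main-model prices only for the chunks that matter.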

Control Output Length Explicitly

Output tokens are expensive. Many agents generate far more output than they need because the model isn’t given clear length guidance.

Be explicit:

# Instead of:
"Analyze the customer's issue and provide a response."

# Write:
"Analyze the customer's issue and respond in 2-3 sentences. 
Be concise. Do not restate the question."

For structured outputs, use JSON mode or constrained generation rather than asking the model to explain its reasoning in prose. If you need the reasoning for debugging, log it separately rather than including it in every production response.

Truncate Conversation History Strategically

Most agents maintain a conversation history that grows without bound. This is a cost trap. Instead:

  • Sliding window: Keep only the last N turns (e.g., last 5 exchanges)
  • Summarization: Periodically summarize older turns into a compact memory block using a cheap model
  • Structured memory: Extract key facts from conversation into a structured store and inject only relevant facts rather than full history

For example, a sliding window bounded by token count:

def manage_conversation_history(history: list, max_tokens: int = 2000) -> list:
    """Return a token-bounded conversation history."""
    total = 0
    trimmed = []

    for message in reversed(history):
        msg_tokens = count_tokens(message["content"])
        if total + msg_tokens > max_tokens:
            break
        trimmed.insert(0, message)
        total += msg_tokens

    return trimmed
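The summarization approach from the list above pairs naturally with the sliding window: turns that fall out of the window get folded into one compact memory message. A sketch — the `summarize` callable is an assumption, standing in for a call to a cheap model:

```python
from typing import Callable

def fold_history(history: list[dict], keep_last: int,
                 summarize: Callable[[str], str]) -> list[dict]:
    """Summarize older turns into one memory message; keep recent turns verbatim."""
    if len(history) <= keep_last:
        return history

    older, recent = history[:-keep_last], history[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    memory = {"role": "user",
              "content": f"Summary of earlier conversation: {summarize(transcript)}"}
    return [memory] + recent

# Usage with a stub summarizer standing in for a cheap-model call:
history = [{"role": "user", "content": f"turn {i}"} for i in range(6)]
compact = fold_history(history, keep_last=2, summarize=lambda t: "six short turns")
print(len(compact))  # 3
```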

Putting It All Together: A Cost-Optimized Agent Architecture

Here’s how a well-optimized production agent combines all three strategies:

User Request
     |
     v
[Model Router] — Classify task complexity (cheap model)
     |
     +— Simple task ——> [Haiku / Mini] with cached system prompt
     |
     +— Medium task ——> [Sonnet] with cached system prompt + compressed RAG
     |
     +— Complex task ——> [Opus] with full context

All paths:
  - Static prompts marked for caching
  - RAG results reranked and compressed
  - Output length constrained
  - History window managed

Teams that implement all three layers commonly report 50–80% reductions in monthly inference spend without measurable quality loss on production metrics.


Measuring What Matters

You can’t optimize what you don’t measure. Instrument your agent with:

  • Cost per conversation (not just per call)
  • Cache hit rate (by prompt section)
  • Model distribution (% of calls hitting each tier)
  • Output token to input token ratio (a rising ratio signals verbose outputs)
  • Task completion rate by model tier (to catch quality regressions from over-routing to cheap models)

Set alerts when cost-per-conversation rises more than 20% week-over-week. Spikes almost always trace back to a prompt change that broke caching or a routing rule that stopped firing.
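Once cost-per-conversation is tracked, the week-over-week alert is a one-line check. A sketch, with the 20% threshold matching the rule of thumb above:

```python
def cost_spike_alert(last_week: float, this_week: float,
                     max_increase: float = 0.20) -> bool:
    """True when cost-per-conversation rose more than `max_increase` week-over-week."""
    if last_week <= 0:
        return False  # no baseline yet
    return (this_week - last_week) / last_week > max_increase

print(cost_spike_alert(0.050, 0.065))  # True  (+30%)
print(cost_spike_alert(0.050, 0.055))  # False (+10%)
```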


Your Next Steps

Reducing AI agent inference costs is a skill, not a one-time fix. The best AI engineers treat it as an ongoing practice — measuring regularly, auditing prompts, and refining routing logic as their agent evolves.

Here’s your action plan:

  1. This week: Audit your system prompt. Strip 20% of tokens without changing behavior.
  2. Next week: Add cache_control to your static prompt sections and measure the hit rate.
  3. This month: Implement a two-tier model router. Route simple classification tasks to a cheaper model.
  4. Ongoing: Add cost-per-conversation to your production dashboard and review it weekly.

Ready to go deeper? Our AI Agent Engineering Certification covers cost optimization, evaluation frameworks, and production deployment patterns in detail. Join thousands of engineers building economically sustainable AI systems.

Have questions about your specific architecture? Drop them in our community forum — our instructors and fellow learners are there to help.

Happy building — and may your cache hit rates be ever in your favor.


Written by Jamie Park, Educator and Career Coach at Harness Engineering Academy. Jamie helps engineers at all levels build practical, production-ready AI agent skills.
