If you’ve built even a simple AI agent that calls external tools — a weather API, a web scraper, a database query — you’ve almost certainly seen it fail. The API times out. The search engine returns nothing useful. The database is temporarily unavailable. Your agent, lacking any resilience strategy, crashes or halts mid-task with a frustrating error.
Welcome to one of the most important — and often overlooked — topics in AI agent engineering: building for failure.
In this guide, you’ll learn how to make your AI agents genuinely resilient. We’ll cover retry logic with exponential backoff, fallback patterns that keep your agent moving forward, and graceful degradation strategies that let your agent deliver partial value even when things go wrong. By the end, you’ll have a practical toolkit for building agents that handle the real, messy world.
Why Resilience Matters in AI Agent Design
AI agents are fundamentally different from traditional software pipelines. They operate in open-ended environments, calling diverse tools they don’t control — third-party APIs, search engines, code interpreters, databases, and more. Each of those tools can fail independently, unpredictably, and temporarily.
Consider a research agent tasked with summarizing the latest news on a topic. It might:
- Call a news API → 503 Service Unavailable
- Fall back to a web search → rate limit exceeded
- Attempt to scrape a webpage → timeout
Without resilience strategies, this agent is dead in the water at step one. With them, it navigates each failure gracefully and still delivers value.
Resilience engineering in AI agents covers three core pillars:
- Retry Logic — Automatically retrying failed operations with intelligent timing
- Fallback Patterns — Substituting alternative tools or strategies when the primary fails
- Graceful Degradation — Delivering reduced-but-useful output rather than complete failure
Let’s dig into each one.
Part 1: Retry Logic with Exponential Backoff
What Is Retry Logic?
Retry logic is the practice of automatically re-attempting a failed operation after a short delay. It’s the most fundamental resilience technique and the right first line of defense against transient failures — those temporary hiccups that would succeed if you just tried again a moment later.
Not every failure is worth retrying. You need to distinguish:
- Transient failures (worth retrying): network timeouts, rate limits (429), temporary server errors (503), connection resets
- Permanent failures (don’t retry): authentication errors (401/403), bad requests (400), resource not found (404)
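The split above can be captured in a small helper. This is only a sketch: the name `is_retryable_status` and the two status sets are illustrative assumptions, and treating every other 5xx response as transient is a choice you may want to tune per API.

```python
# Hypothetical helper: the name and the exact status sets are assumptions.
RETRYABLE_STATUSES = {429, 503}             # rate limit, temporary server error
PERMANENT_STATUSES = {400, 401, 403, 404}   # bad request, auth errors, not found


def is_retryable_status(status_code: int) -> bool:
    """Return True if a response with this HTTP status is worth retrying."""
    if status_code in PERMANENT_STATUSES:
        return False
    # Treat any other 5xx as transient too (an assumption; tune per API).
    return status_code in RETRYABLE_STATUSES or 500 <= status_code < 600


for code in (429, 503, 401, 404):
    print(code, is_retryable_status(code))
```

Keeping the classification in one place means every tool wrapper makes the same retry decision.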
Exponential Backoff: The Right Way to Retry
Naively retrying immediately in a tight loop makes things worse — you’ll hammer an already-struggling server and get rate-limited faster. The standard approach is exponential backoff with jitter: wait progressively longer between retries, with a small random offset to avoid thundering-herd problems when many agents retry simultaneously.
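To make the timing concrete, the sketch below computes the delay schedule on its own, separate from any retry loop. The function names and the 10% jitter fraction are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based), without jitter."""
    return min(base_delay * (2 ** attempt), max_delay)


def with_jitter(delay: float, fraction: float = 0.1) -> float:
    """Add a random offset of up to `fraction` of the delay."""
    return delay + random.uniform(0, delay * fraction)


schedule = [backoff_delay(n) for n in range(8)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
print(with_jitter(schedule[2]))  # somewhere between 4.0 and 4.4
```

The delay doubles on each attempt until it reaches the cap, and the jitter keeps many agents from retrying at the same instant.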
Here’s a reusable retry decorator in Python:
import time
import random
import functools
from typing import Callable, Tuple, Type


def retry_with_backoff(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retryable_exceptions: Tuple[Type[Exception], ...] = (Exception,),
):
    """
    Decorator that retries a function with exponential backoff.

    Args:
        max_retries: Maximum number of retry attempts
        base_delay: Initial delay in seconds
        max_delay: Maximum delay cap in seconds
        retryable_exceptions: Only retry on these exception types
    """
    def decorator(func: Callable):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except retryable_exceptions as e:
                    last_exception = e
                    if attempt == max_retries:
                        break
                    # Exponential backoff plus a small random jitter (up to 10% of the delay)
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    jitter = random.uniform(0, delay * 0.1)
                    wait_time = delay + jitter
                    print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait_time:.2f}s...")
                    time.sleep(wait_time)
            # All attempts failed: surface the last error to the caller
            raise last_exception
        return wrapper
    return decorator
Applying Retry Logic to an Agent Tool Call
Here’s how you’d apply this to a real agent tool — say, a news search function:
import requests


class TransientAPIError(Exception):
    pass


@retry_with_backoff(
    max_retries=3,
    base_delay=2.0,
    retryable_exceptions=(TransientAPIError, requests.Timeout, requests.ConnectionError),
)
def search_news(query: str) -> list[dict]:
    response = requests.get(
        "https://api.newsservice.com/search",
        params={"q": query},
        timeout=10,
    )
    if response.status_code == 429:
        raise TransientAPIError("Rate limited")
    if response.status_code >= 500:
        raise TransientAPIError(f"Server error: {response.status_code}")
    if response.status_code == 401:
        raise ValueError("Invalid API key")  # Not retryable
    response.raise_for_status()
    return response.json()["articles"]
Notice how we only raise TransientAPIError for errors worth retrying (rate limits, server errors), while authentication errors raise ValueError, which the retry decorator won’t catch.
Retry Logic Inside an Agent Loop
When building agents with frameworks like LangChain, LlamaIndex, or the Anthropic Agent SDK, you can wrap your tool implementations with this decorator before registering them. The agent loop sees a clean, reliable tool interface — all the retry complexity is hidden inside.
Part 2: Fallback Patterns
Retry logic handles temporary failures of a single tool. But what happens when a tool fails definitively — the API is down for hours, the key is invalid, or the service is deprecated? That’s where fallback patterns come in.
A fallback pattern substitutes an alternative strategy when the primary one fails. Think of it like a decision tree: try Plan A, and if it fails, automatically try Plan B.
The Tool Chain Fallback
The simplest pattern is a sequential chain of tools, tried in priority order:
from typing import Optional


def search_with_fallback(query: str) -> Optional[list[dict]]:
    """
    Try premium search API first, fall back to free tier,
    then fall back to cached results.
    """
    # Plan A: Premium search
    try:
        return premium_search_api(query)
    except Exception as e:
        print(f"Premium search failed: {e}. Trying fallback...")

    # Plan B: Free search tier
    try:
        return free_search_api(query)
    except Exception as e:
        print(f"Free search failed: {e}. Trying cache...")

    # Plan C: Cached results (stale but better than nothing)
    try:
        return get_cached_results(query)
    except Exception as e:
        print(f"Cache lookup failed: {e}.")

    # All options exhausted
    return None
The Capability Downgrade Fallback
Sometimes your fallback isn’t a different source of the same data — it’s a less capable version of the same operation. This is common when a specialized tool fails and you fall back to a general-purpose one.
For example, a code execution agent might try a sandboxed executor first (fast, safe), then fall back to a more limited static analysis tool:
def analyze_code(code: str) -> dict:
    try:
        # Preferred: full execution with real output
        return sandboxed_executor.run(code)
    except ExecutorUnavailableError:
        # Fallback: static analysis only (no runtime output)
        print("Executor unavailable. Running static analysis only.")
        return static_analyzer.analyze(code)
Your agent’s downstream reasoning should be aware of which mode it’s in — the output metadata should indicate whether results came from full execution or static analysis.
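One way to carry that metadata is to return a small result object that records which path produced the output. The sketch below is an assumption-laden illustration: `AnalysisResult`, `analyze_with_mode`, and the two toy callables are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AnalysisResult:
    mode: str      # "execution" or "static_analysis"
    output: str


class ExecutorUnavailableError(Exception):
    """Raised when the sandboxed executor cannot be reached."""


def analyze_with_mode(
    code: str,
    run: Callable[[str], str],
    analyze: Callable[[str], str],
) -> AnalysisResult:
    """Prefer real execution; record the mode so downstream reasoning can see it."""
    try:
        return AnalysisResult(mode="execution", output=run(code))
    except ExecutorUnavailableError:
        return AnalysisResult(mode="static_analysis", output=analyze(code))


def broken_executor(code: str) -> str:
    raise ExecutorUnavailableError("sandbox offline")


def toy_static_analyzer(code: str) -> str:
    return f"{len(code.splitlines())} line(s), not executed"


result = analyze_with_mode("print('hi')", broken_executor, toy_static_analyzer)
print(result.mode, "-", result.output)  # static_analysis - 1 line(s), not executed
```

The agent can then phrase its answer differently when `mode` is `"static_analysis"`, for example by stating that the code was not actually run.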
Model-Level Fallbacks
For agents using LLMs as part of their tool calls (e.g., using a smaller model for initial filtering), you can apply the same pattern at the model level:
async def classify_document(text: str) -> str:
    try:
        # Preferred: fast, cheap model
        return await call_model("claude-haiku-4-5", text)
    except ModelUnavailableError:
        # Fallback: primary model
        return await call_model("claude-sonnet-4-6", text)
Building a Fallback Registry
For larger agents with many tools, a fallback registry keeps things organized and declarative:
FALLBACK_REGISTRY = {
    "web_search": ["premium_search", "free_search", "cached_search"],
    "news_fetch": ["news_api", "rss_feed", "news_cache"],
    "code_run": ["sandboxed_exec", "static_analysis"],
    "database_query": ["primary_db", "replica_db", "cache_layer"],
}


def execute_with_fallback(tool_name: str, *args, **kwargs):
    # TOOL_MAP (defined elsewhere) maps each tool name to its callable.
    # Tools without a registry entry are tried on their own, with no fallback.
    tool_chain = FALLBACK_REGISTRY.get(tool_name, [tool_name])
    for tool in tool_chain:
        try:
            return TOOL_MAP[tool](*args, **kwargs)
        except Exception as e:
            print(f"Tool '{tool}' failed: {e}. Trying next fallback...")
    raise RuntimeError(f"All fallbacks exhausted for tool: {tool_name}")
This pattern is especially powerful when combined with retry logic — each tool in the chain gets its own retries before the fallback kicks in.
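The combination might look like the following sketch, in which every tool in the chain gets its own retry budget before the next fallback is tried. All names here (`call_with_retries`, `run_chain`, the two toy search functions) are hypothetical, and the delays are kept tiny so the example runs quickly.

```python
import time
from typing import Callable, List


def call_with_retries(tool: Callable, query: str, max_retries: int = 2,
                      base_delay: float = 0.01) -> str:
    """Retry one tool with exponential backoff; re-raise after the last attempt."""
    for attempt in range(max_retries + 1):
        try:
            return tool(query)
        except ConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))


def run_chain(chain: List[Callable], query: str) -> str:
    """Give each tool its own retries before moving to the next fallback."""
    for tool in chain:
        try:
            return call_with_retries(tool, query)
        except Exception as exc:
            print(f"{tool.__name__} exhausted its retries: {exc}")
    raise RuntimeError("All fallbacks exhausted")


attempts = {"premium_search": 0, "cached_search": 0}


def premium_search(query: str) -> str:
    attempts["premium_search"] += 1
    raise ConnectionError("premium API down")


def cached_search(query: str) -> str:
    attempts["cached_search"] += 1
    return f"cached results for {query!r}"


print(run_chain([premium_search, cached_search], "agent resilience"))
print(attempts)  # {'premium_search': 3, 'cached_search': 1}
```

The premium tool is attempted three times (one call plus two retries) before the chain moves on, so worst-case latency grows with both the retry count and the chain length.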
Part 3: Graceful Degradation
Retry logic and fallbacks are about keeping your agent running. Graceful degradation is about making sure partial failure doesn’t destroy the entire result.
Imagine a report-writing agent that needs to gather data from five sources. If one source fails permanently and no fallback is available, should the entire report fail? Usually not. A report with four of five data sources is far more valuable than no report at all.
The Partial Results Pattern
Design your agent to accumulate partial results and continue even when individual steps fail:
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentResult:
    data: dict = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    @property
    def is_complete(self) -> bool:
        return len(self.errors) == 0

    @property
    def is_partial(self) -> bool:
        return len(self.data) > 0 and len(self.errors) > 0


async def research_topic(topic: str) -> AgentResult:
    result = AgentResult()
    sources = ["news", "academic", "social", "patents", "market_data"]
    for source in sources:
        try:
            data = await fetch_from_source(source, topic)
            result.data[source] = data
        except Exception as e:
            error_msg = f"Failed to fetch from {source}: {e}"
            result.errors.append(error_msg)
            print(f"Warning: {error_msg}")
    return result
The calling code (or the agent’s reasoning) can then inspect result.is_partial and adjust its behavior — perhaps adding a note that market data was unavailable, rather than crashing.
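As an illustration of that inspection step, the sketch below shows a caller that changes its output based on the result's state. It repeats a compact copy of `AgentResult` so that it runs on its own; `summarize` is a hypothetical function, not part of the original design.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResult:  # compact copy of the class above so the sketch is self-contained
    data: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)

    @property
    def is_partial(self) -> bool:
        return len(self.data) > 0 and len(self.errors) > 0


def summarize(result: AgentResult) -> str:
    """Hypothetical caller that adapts its output to the result's state."""
    if not result.data:
        return "No sources were reachable; nothing to report."
    summary = f"Report based on: {', '.join(sorted(result.data))}."
    if result.is_partial:
        summary += f" Note: {len(result.errors)} source(s) unavailable."
    return summary


partial = AgentResult(
    data={"news": ["..."], "academic": ["..."]},
    errors=["Failed to fetch from market_data: timeout"],
)
print(summarize(partial))
# Report based on: academic, news. Note: 1 source(s) unavailable.
```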
Communicating Degraded State to the Agent
When your agent uses an LLM for reasoning, you need to communicate what succeeded and what didn’t so it can reason accurately:
def format_research_for_agent(result: AgentResult) -> str:
    sections = []
    for source, data in result.data.items():
        sections.append(f"## {source.title()} Data\n{data}")
    if result.errors:
        warning_text = "\n".join(f"- {e}" for e in result.errors)
        sections.append(
            f"## Data Availability Notice\n"
            f"The following sources were unavailable:\n{warning_text}\n"
            f"Please note these gaps in your analysis."
        )
    return "\n\n".join(sections)
This gives the LLM the context it needs to produce an honest, accurate response that acknowledges what data was missing — rather than hallucinating information from the failed sources.
Circuit Breakers: Stopping the Retry Cycle
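One advanced pattern worth knowing is the circuit breaker. If a tool has failed five times in a row, retrying it on the sixth call is probably a waste of time and latency. A circuit breaker tracks recent failures and “opens the circuit” once they cross a threshold: it temporarily disables the tool and routes calls straight to fallbacks. After a recovery timeout, it lets a single trial call through (the “half-open” state) to check whether the tool has recovered.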
from collections import deque
from datetime import datetime, timedelta


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: int = 60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = deque(maxlen=failure_threshold)
        self.state = "closed"  # closed = normal, open = blocking
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if datetime.now() > self.opened_at + timedelta(seconds=self.recovery_timeout):
                self.state = "half-open"  # Test one call
            else:
                raise RuntimeError("Circuit open: tool temporarily disabled")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures.append(datetime.now())
            if len(self.failures) >= self.failure_threshold:
                self.state = "open"
                self.opened_at = datetime.now()
                print(f"Circuit breaker opened for {func.__name__}")
            raise
        # Any success closes the circuit and resets the failure streak,
        # so only consecutive failures count toward the threshold
        self.state = "closed"
        self.failures.clear()
        return result
Putting It All Together: A Resilient Agent Tool Layer
Here’s how all three patterns combine in a real agent tool layer:
# Instantiate circuit breakers per tool
news_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=120)
search_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60)


@retry_with_backoff(max_retries=2, base_delay=1.5)
def _fetch_news_raw(query: str) -> list[dict]:
    return news_breaker.call(news_api.search, query)


@retry_with_backoff(max_retries=2, base_delay=1.0)
def _fetch_search_raw(query: str) -> list[dict]:
    return search_breaker.call(search_api.query, query)


def fetch_information(query: str) -> AgentResult:
    result = AgentResult()

    # Try news with retry + circuit breaker
    try:
        result.data["news"] = _fetch_news_raw(query)
    except Exception as e:
        # Try fallback: RSS feed
        try:
            result.data["news"] = rss_feed.fetch(query)
            result.warnings.append("Used RSS fallback for news data")
        except Exception:
            result.errors.append(f"News unavailable: {e}")

    # Try search with retry + circuit breaker
    try:
        result.data["search"] = _fetch_search_raw(query)
    except Exception as e:
        result.errors.append(f"Search unavailable: {e}")

    return result
This stack — retry → circuit breaker → fallback → partial results — gives you defense in depth. Every layer handles a different failure mode.
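One detail to watch when stacking retries on top of a breaker: a retry wrapper that catches every exception will also retry the breaker's own "circuit open" error and sleep for nothing. A possible refinement, sketched below, is a dedicated exception type that the retry loop never retries. `CircuitOpenError`, `MiniBreaker`, and `call_with_retries` are hypothetical names, and the breaker is a trimmed-down stand-in, not the class above.

```python
import time


class CircuitOpenError(Exception):
    """Hypothetical exception type signalling that the breaker is open."""


class MiniBreaker:
    """Trimmed-down breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 2, recovery_timeout: float = 60.0):
        self.threshold = threshold
        self.recovery_timeout = recovery_timeout
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, func, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open")
            self.opened_at = None  # recovery window elapsed: allow a trial call
        try:
            result = func(*args)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.consecutive_failures = 0
        return result


def call_with_retries(breaker, func, *args, max_retries=3, base_delay=0.01):
    for attempt in range(max_retries + 1):
        try:
            return breaker.call(func, *args)
        except CircuitOpenError:
            raise  # never retry an open circuit; go straight to the fallback
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))


calls = []


def flaky_news(query):
    calls.append(query)
    raise ConnectionError("news API down")


breaker = MiniBreaker(threshold=2)
try:
    call_with_retries(breaker, flaky_news, "ai agents")
except CircuitOpenError:
    print(f"Circuit opened after {len(calls)} real calls; using fallback")
```

Here the tool is only hit twice even though the retry budget allows four attempts, because the third attempt fails fast on the open circuit.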
Common Mistakes to Avoid
Retrying non-idempotent operations. If your tool has side effects (sending an email, making a payment), retrying blindly can cause duplicate actions. Add idempotency keys or check for existing results before retrying.
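Setting retry counts too high. Three retries with exponential backoff is usually enough. Ten retries with a 60-second base delay and uncapped doubling add up to about 17 hours of waiting (60 × (2^10 − 1) seconds); even with each delay capped at 60 seconds, your agent could be stuck for 10 minutes on one failed tool call.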
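The worst-case wait is easy to compute before you pick a retry count. The helper below is an illustrative sketch (`total_wait` is a hypothetical name) that sums the backoff delays and ignores jitter.

```python
def total_wait(retries: int, base_delay: float, max_delay: float = float("inf")) -> float:
    """Sum of backoff delays in seconds across `retries` retries, ignoring jitter."""
    return sum(min(base_delay * (2 ** n), max_delay) for n in range(retries))


print(total_wait(3, 1.0))                                   # 7.0 seconds
print(round(total_wait(10, 60.0) / 3600, 2))                # 17.05 hours, uncapped
print(total_wait(10, 60.0, max_delay=60.0) / 60)            # 10.0 minutes, capped at 60s
```

Running this kind of calculation against your agent's overall time budget is a quick sanity check on any retry configuration.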
Swallowing errors silently. Graceful degradation doesn’t mean hiding errors from the agent. Always propagate failure information so the LLM can reason accurately about what data it has.
Not testing failure modes. Write tests that deliberately inject failures. Use a library like pytest with unittest.mock to simulate API errors and verify your retry/fallback logic fires correctly.
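A failure-injection test might look like the sketch below, which can be collected by pytest or called directly. `fetch_with_retry` is a hypothetical minimal retry loop used as the code under test, and the mocked `client.search` stands in for a real API client.

```python
import time
from unittest import mock


def fetch_with_retry(client, query, max_retries=2, base_delay=1.0):
    """Minimal retry loop (hypothetical helper) used as the code under test."""
    for attempt in range(max_retries + 1):
        try:
            return client.search(query)
        except TimeoutError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))


def test_retries_then_succeeds():
    client = mock.Mock()
    # First two calls time out, the third returns data.
    client.search.side_effect = [TimeoutError(), TimeoutError(), ["article"]]
    with mock.patch("time.sleep") as fake_sleep:  # skip real waiting in tests
        assert fetch_with_retry(client, "ai agents") == ["article"]
    assert client.search.call_count == 3
    assert [c.args[0] for c in fake_sleep.call_args_list] == [1.0, 2.0]


def test_gives_up_after_max_retries():
    client = mock.Mock()
    client.search.side_effect = TimeoutError()  # every call fails
    with mock.patch("time.sleep"):
        try:
            fetch_with_retry(client, "ai agents")
        except TimeoutError:
            pass
        else:
            raise AssertionError("expected TimeoutError")
    assert client.search.call_count == 3
```

Patching `time.sleep` keeps the tests fast and also lets you assert on the exact backoff delays that were requested.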
What to Learn Next
Resilience patterns are a core competency for any professional AI agent engineer. Once you’re comfortable with these building blocks, the next steps are:
- Observability — Logging retry attempts, fallback activations, and circuit breaker state changes so you can monitor agent health in production
- Timeout Budgets — Managing total time budgets across nested tool calls to prevent runaway agents
- Testing Chaos — Deliberately injecting failures in your test suite to validate your resilience strategies
Ready to build production-grade AI agents? Explore the Agent Reliability Engineering course on Harness Engineering Academy — where we go deep on observability, timeout management, and real-world deployment patterns for AI agents.
Key Takeaways
- Retry with exponential backoff and jitter for transient failures, but only on idempotent operations and retryable error types.
- Fallback chains let your agent substitute alternative tools or degraded capabilities when a primary tool fails permanently.
- Circuit breakers prevent wasted retries on tools that are clearly down, keeping your agent responsive.
- Partial results + transparent error metadata let your LLM reason honestly about incomplete data rather than failing or hallucinating.
- Defense in depth — stacking these patterns — produces agents that behave reliably in production, not just in demos.
Building for failure isn’t pessimism. It’s the difference between a demo agent and a production agent. Start applying these patterns today, and your agents will keep running long after the tools they depend on start misbehaving.
Written by Jamie Park — Educator and Career Coach at Harness Engineering Academy. Jamie writes beginner-to-advanced tutorials on AI agent engineering, covering everything from first agents to production-scale deployment.