You built an AI agent. It works great on your laptop. You deploy it to production — and three days later, a user reports it gave completely wrong advice. You check the logs. There are no logs. You have no idea what happened.
This scenario plays out constantly in AI engineering teams that ship first and instrument later. The good news: observability is not complicated, and adding it early saves enormous pain down the road.
In this guide, you will learn how to build observable AI agents from the ground up — structured logging, distributed tracing, and production monitoring — using practical code examples you can adapt immediately.
Why Observability Matters More for AI Agents Than for Regular Software
Traditional software is deterministic. Given the same input, you get the same output. Debugging is annoying, but cause and effect are traceable.
AI agents are different in three important ways:
Non-determinism. The same prompt can produce different outputs across runs. Without logging the exact prompt, model, temperature, and response, you cannot reproduce failures.
Multi-step reasoning chains. An agent might retrieve documents, call three tools, reason over results, and generate a final answer — all in one request. If the final answer is wrong, you need to know which step broke down.
External dependencies. Agents call APIs, search databases, and invoke tools. Each hop adds latency and failure modes that only visibility can surface.
Observability for AI agents means being able to answer: What did the agent do, why did it do that, and how long did each step take?
The Three Pillars of AI Agent Observability
Before writing code, understand the three pillars and what each one gives you.
Logs: The What Happened Record
Logs are timestamped records of events. For AI agents, the most important events to log are:
- Agent invocations (input, model, configuration)
- Tool calls (name, arguments, response)
- LLM requests and responses (prompt tokens, completion tokens, latency)
- Errors and exceptions
- Final outputs with metadata
Traces: The Causal Chain
A trace links all the logs from a single agent run into one coherent story. Instead of searching through thousands of log lines, you open one trace and see the entire reasoning chain in order.
Traces use parent-child span relationships. The top-level span represents the overall agent invocation. Child spans represent individual steps: retrieval, tool call, LLM inference. Each span has a start time, end time, and attributes.
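To make the parent-child structure concrete, here is a toy sketch (plain Python, not a real tracing library; all names are invented for illustration) of how spans nest into a tree:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Toy span: a name, a duration, and nested child spans."""
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

def render(span: Span, depth: int = 0) -> list[str]:
    """Render a span tree as indented lines, one line per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines

# One agent run: a root span with three child steps
trace_tree = Span("agent.run", 4200, [
    Span("retrieval", 310),
    Span("tool.web_search", 820),
    Span("llm.inference", 2900),
])
print("\n".join(render(trace_tree)))
```

A real backend renders exactly this kind of indented waterfall, with each child's duration attributed to its parent.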
Metrics: The Aggregate Picture
Metrics answer questions like: What is the p95 latency for our research agent? How many tool calls fail per hour? What is the average token spend per user session?
Metrics alert you when something goes wrong at scale — before individual users start filing support tickets.
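As a quick illustration of what a percentile metric means, here is a stdlib-only sketch computing a nearest-rank p95 over a window of latency samples (production systems derive this from histogram buckets instead, as shown later with Prometheus):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value covering p% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# One slow outlier dominates the tail even when the median looks healthy
latencies_ms = [120, 95, 400, 210, 180, 2500, 150, 130, 170, 160]
print(percentile(latencies_ms, 50))  # 160
print(percentile(latencies_ms, 95))  # 2500
```

This is why p95 and p99 matter more than averages for agents: a handful of runaway runs is invisible in the mean but glaring in the tail.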
Setting Up Your Observability Stack
For this tutorial, you will use:
- Python with the Anthropic SDK for the agent
- OpenTelemetry for vendor-neutral instrumentation
- structlog for structured logging
- Prometheus for metrics collection (or any compatible backend)
Install the dependencies:
```shell
pip install anthropic opentelemetry-sdk opentelemetry-exporter-otlp \
  opentelemetry-instrumentation structlog prometheus-client
```
Step 1: Implement Structured Logging
Plain text logs are hard to query. Structured logs — where every log entry is a JSON object with consistent fields — are searchable, filterable, and ingestible by every major log aggregation platform.
```python
import logging
import sys

import structlog

def configure_logging(service_name: str, log_level: str = "INFO") -> None:
    """Configure structured JSON logging for the agent service."""
    structlog.configure(
        processors=[
            structlog.stdlib.filter_by_level,
            structlog.stdlib.add_logger_name,
            structlog.stdlib.add_log_level,
            structlog.stdlib.PositionalArgumentsFormatter(),
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.UnicodeDecoder(),
            structlog.processors.JSONRenderer(),
        ],
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        wrapper_class=structlog.stdlib.BoundLogger,
        cache_logger_on_first_use=True,
    )
    logging.basicConfig(
        format="%(message)s",
        stream=sys.stdout,
        level=getattr(logging, log_level.upper()),
    )

logger = structlog.get_logger()
```
Every log call now produces JSON like this:
```json
{
  "event": "tool_call_completed",
  "tool_name": "web_search",
  "query": "Claude agent SDK documentation",
  "result_count": 5,
  "latency_ms": 342,
  "timestamp": "2026-04-04T10:23:11.445Z",
  "level": "info"
}
```
This is immediately queryable in Datadog, Grafana Loki, CloudWatch, or any other log platform.
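If pulling in structlog is not an option, the same one-JSON-object-per-line output can be approximated with the standard library alone. This is a hypothetical sketch (the `JsonFormatter` class and its `fields` convention are my own, not part of the stdlib):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
        }
        # Merge structured fields passed via logging's `extra=` mechanism
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("agent")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("tool_call_completed",
         extra={"fields": {"tool_name": "web_search", "latency_ms": 342}})
```

The structlog version is still preferable in practice: it gives you bound context (like a `session_id` attached once and emitted on every subsequent call) for free.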
Step 2: Implement Distributed Tracing with OpenTelemetry
OpenTelemetry is the industry standard for distributed tracing. It works with every major backend: Jaeger, Honeycomb, Datadog, New Relic, and more.
```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def configure_tracing(
    service_name: str,
    otlp_endpoint: str = "http://localhost:4317",
) -> trace.Tracer:
    """Configure OpenTelemetry tracing with an OTLP exporter."""
    resource = Resource.create({
        "service.name": service_name,
        "service.version": "1.0.0",
    })
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)

tracer = configure_tracing("research-agent")
```
Step 3: Build an Instrumented Agent
Now put it together. Here is a fully instrumented research agent using the Anthropic SDK:
```python
import time
import uuid
from typing import Any, Optional

import anthropic
import structlog
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

logger = structlog.get_logger()


class ObservableAgent:
    def __init__(self, model: str = "claude-opus-4-6"):
        self.client = anthropic.Anthropic()
        self.model = model
        self.tracer = trace.get_tracer("observable-agent")

    def run(self, user_input: str, session_id: Optional[str] = None) -> str:
        """Execute the agent with full observability."""
        if session_id is None:
            session_id = str(uuid.uuid4())
        bound_logger = logger.bind(session_id=session_id, model=self.model)

        with self.tracer.start_as_current_span("agent.run") as root_span:
            root_span.set_attribute("session.id", session_id)
            root_span.set_attribute("agent.model", self.model)
            root_span.set_attribute("input.length", len(user_input))
            bound_logger.info("agent_invoked", input_preview=user_input[:100])
            start_time = time.time()

            try:
                tools = self._get_tools()
                messages = [{"role": "user", "content": user_input}]
                final_response = None
                tool_call_count = 0

                # Agentic loop: call the model, execute any requested tools,
                # feed the results back, and repeat until the turn ends.
                while True:
                    with self.tracer.start_as_current_span("llm.inference") as llm_span:
                        llm_start = time.time()
                        response = self.client.messages.create(
                            model=self.model,
                            max_tokens=4096,
                            tools=tools,
                            messages=messages,
                        )
                        llm_latency = (time.time() - llm_start) * 1000
                        llm_span.set_attribute("llm.input_tokens", response.usage.input_tokens)
                        llm_span.set_attribute("llm.output_tokens", response.usage.output_tokens)
                        llm_span.set_attribute("llm.latency_ms", llm_latency)
                        llm_span.set_attribute("llm.stop_reason", response.stop_reason)
                        bound_logger.info(
                            "llm_inference_completed",
                            input_tokens=response.usage.input_tokens,
                            output_tokens=response.usage.output_tokens,
                            latency_ms=round(llm_latency, 2),
                            stop_reason=response.stop_reason,
                        )

                    if response.stop_reason == "end_turn":
                        final_response = next(
                            (block.text for block in response.content if hasattr(block, "text")),
                            "",
                        )
                        break

                    if response.stop_reason == "tool_use":
                        tool_results = []
                        for block in response.content:
                            if block.type == "tool_use":
                                tool_call_count += 1
                                result = self._execute_tool(
                                    block.name, block.input, session_id, bound_logger
                                )
                                tool_results.append({
                                    "type": "tool_result",
                                    "tool_use_id": block.id,
                                    "content": result,
                                })
                        messages.append({"role": "assistant", "content": response.content})
                        messages.append({"role": "user", "content": tool_results})
                        continue

                    # Any other stop reason (e.g. max_tokens) would otherwise
                    # loop forever; fail fast so the error span records it.
                    raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")

                total_latency = (time.time() - start_time) * 1000
                root_span.set_attribute("agent.tool_call_count", tool_call_count)
                root_span.set_attribute("agent.total_latency_ms", total_latency)
                root_span.set_status(Status(StatusCode.OK))
                bound_logger.info(
                    "agent_completed",
                    tool_call_count=tool_call_count,
                    total_latency_ms=round(total_latency, 2),
                    output_length=len(final_response),
                )
                return final_response

            except Exception as e:
                root_span.set_status(Status(StatusCode.ERROR, str(e)))
                root_span.record_exception(e)
                bound_logger.error("agent_failed", error=str(e), exc_info=True)
                raise

    def _execute_tool(
        self,
        tool_name: str,
        tool_input: dict,
        session_id: str,
        bound_logger: Any,
    ) -> str:
        """Execute a tool inside its own trace span."""
        with self.tracer.start_as_current_span(f"tool.{tool_name}") as tool_span:
            tool_span.set_attribute("tool.name", tool_name)
            tool_span.set_attribute("tool.input", str(tool_input))
            bound_logger.info("tool_call_started", tool=tool_name, input=tool_input)
            start = time.time()
            try:
                # Route to tool implementations
                if tool_name == "web_search":
                    result = self._web_search(tool_input["query"])
                elif tool_name == "calculator":
                    result = self._calculator(tool_input["expression"])
                else:
                    result = f"Unknown tool: {tool_name}"
                latency_ms = (time.time() - start) * 1000
                tool_span.set_attribute("tool.latency_ms", latency_ms)
                tool_span.set_attribute("tool.result_length", len(str(result)))
                tool_span.set_status(Status(StatusCode.OK))
                bound_logger.info(
                    "tool_call_completed",
                    tool=tool_name,
                    latency_ms=round(latency_ms, 2),
                    result_preview=str(result)[:100],
                )
                return str(result)
            except Exception as e:
                tool_span.set_status(Status(StatusCode.ERROR, str(e)))
                tool_span.record_exception(e)
                bound_logger.error("tool_call_failed", tool=tool_name, error=str(e))
                return f"Error executing {tool_name}: {e}"

    def _get_tools(self) -> list:
        return [
            {
                "name": "web_search",
                "description": "Search the web for current information",
                "input_schema": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
            {
                "name": "calculator",
                "description": "Evaluate a mathematical expression",
                "input_schema": {
                    "type": "object",
                    "properties": {"expression": {"type": "string"}},
                    "required": ["expression"],
                },
            },
        ]

    def _web_search(self, query: str) -> str:
        # Placeholder: wire this to your search API
        return f"Search results for: {query}"

    def _calculator(self, expression: str) -> str:
        # NOTE: eval is unsafe on untrusted input even with empty builtins;
        # prefer an AST-based evaluator in production.
        try:
            result = eval(expression, {"__builtins__": {}}, {})
            return str(result)
        except Exception as e:
            return f"Calculation error: {e}"
```
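One caveat worth fixing before production: the `_calculator` above uses `eval`, which is risky even with builtins stripped. A safer drop-in, sketched here with the stdlib `ast` module (`safe_eval` and `_OPS` are names I am introducing, not part of any SDK), walks the parsed expression and permits only arithmetic:

```python
import ast
import operator

# Allowed binary and unary operations; everything else is rejected
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expression: str) -> float:
    """Evaluate an arithmetic expression without eval()."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"Disallowed expression node: {type(node).__name__}")
    return walk(ast.parse(expression, mode="eval").body)

print(safe_eval("1200 * 0.075 / 12"))  # 7.5
```

Function calls, attribute access, and names all fail the type checks and raise `ValueError`, so expressions like `__import__('os')` are rejected outright.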
Step 4: Add Metrics for Aggregate Monitoring
Logs and traces tell you what happened in individual runs. Metrics tell you what is happening across all runs, right now.
```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
agent_invocations_total = Counter(
    "agent_invocations_total",
    "Total agent invocations",
    ["model", "status"],
)
agent_latency_seconds = Histogram(
    "agent_latency_seconds",
    "Agent end-to-end latency",
    ["model"],
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0],
)
tool_calls_total = Counter(
    "tool_calls_total",
    "Total tool calls",
    ["tool_name", "status"],
)
llm_tokens_total = Counter(
    "llm_tokens_total",
    "Total LLM tokens consumed",
    ["model", "token_type"],
)
active_sessions = Gauge(
    "active_sessions",
    "Currently active agent sessions",
)

# Start the metrics server on port 8000
start_http_server(8000)
```
Add metric recording to your agent’s run method:
```python
# Assumes the original run() body has been refactored into _run_internal()
start = time.time()
active_sessions.inc()
try:
    result = self._run_internal(user_input, session_id)
    agent_invocations_total.labels(model=self.model, status="success").inc()
    agent_latency_seconds.labels(model=self.model).observe(time.time() - start)
    return result
except Exception:
    agent_invocations_total.labels(model=self.model, status="error").inc()
    raise
finally:
    active_sessions.dec()
```
Step 5: Set Up Alerting Rules
Metrics only protect you if you actually alert on them. Here are the four alerts every production AI agent needs:
High error rate: Alert if more than 5% of agent invocations fail in a 5-minute window.
Latency degradation: Alert if p95 latency exceeds your SLA (commonly 30 seconds for complex agents).
Token spend spike: Alert if hourly token consumption exceeds 150% of the 7-day average. This catches prompt injection loops and runaway agents.
Tool failure rate: Alert if any individual tool fails more than 10% of the time — this usually means an external API is degrading.
As Prometheus alerting rules, the first two look like this:
```yaml
groups:
  - name: ai_agent_alerts
    rules:
      - alert: AgentHighErrorRate
        expr: |
          rate(agent_invocations_total{status="error"}[5m]) /
          rate(agent_invocations_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Agent error rate above 5%"
      - alert: AgentHighLatency
        expr: |
          histogram_quantile(0.95, rate(agent_latency_seconds_bucket[10m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent p95 latency exceeds 30 seconds"
```
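The token-spike rule can also run as an application-level check. Here is a hypothetical sketch (the `token_spike` function and its inputs are my own invention) that flags an hour whose token count exceeds 150% of the trailing 7-day hourly average:

```python
def token_spike(hourly_counts: list[int], current_hour: int,
                threshold: float = 1.5) -> bool:
    """True if the current hour exceeds threshold x the trailing average.

    hourly_counts: token totals for the previous 7 days (168 hours).
    """
    if not hourly_counts:
        return False
    baseline = sum(hourly_counts) / len(hourly_counts)
    return current_hour > threshold * baseline

# A steady 10k-token/hour baseline, then a 40k spike
history = [10_000] * 168
print(token_spike(history, 40_000))  # True
print(token_spike(history, 12_000))  # False
```

Keeping this logic in Prometheus is usually better (it survives application restarts), but an in-process check like this can short-circuit a runaway loop before the next scrape interval.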
Real-World Example: Debugging a Failure with Traces
Imagine your agent returns an incorrect financial calculation. A user reports it. Without observability, you are guessing. With traces, here is what you do:
1. Find the `session_id` from the user's report (you logged it and returned it in your API response).
2. Open your trace backend (Jaeger, Honeycomb, etc.) and search for that session ID.
3. The trace shows: LLM inference called the `calculator` tool with the expression `"1200 * 0.075 / 12"`. The tool returned `"7.5"`, which is correct. But the LLM's final text said "your monthly payment is $75": it dropped the decimal point.
The bug is in the LLM’s reasoning, not the tool. You now have the exact prompt, the exact model response, and the exact tool call that led to the error. You can write a regression test against this case and improve your prompt to include units explicitly.
Without observability, this debugging session takes days. With it, minutes.
Production Checklist for AI Agent Observability
Before shipping any AI agent to production, verify these are in place:
- [ ] Every agent invocation has a unique `session_id` returned to the caller
- [ ] Structured JSON logs are shipping to a centralized log platform
- [ ] Traces cover: agent root span, each LLM call, each tool call
- [ ] Token usage is recorded per invocation (for cost tracking)
- [ ] Metrics are exported and dashboards exist for error rate, latency, and token spend
- [ ] Alerts are configured for error rate, latency, token spike, and tool failures
- [ ] Sensitive data (PII, API keys) is scrubbed before logging
- [ ] Log retention policy is set (90 days is common for compliance)
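For the sensitive-data item, a minimal redaction pass run before log emission might look like this. These regexes are crude illustrations only; real deployments need a vetted scrubbing library and far broader coverage:

```python
import re

# Illustrative patterns: email addresses, API keys with an assumed
# "sk-" prefix, and US SSNs. Extend for your own data classes.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"sk-[A-Za-z0-9]{16,}"), "[API_KEY]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def scrub(text: str) -> str:
    """Replace sensitive substrings before the text reaches a log sink."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(scrub("Contact alice@example.com, key sk-abcdef1234567890XYZa"))
```

With structlog, a scrubber like this slots in naturally as a custom processor in the `processors` list, so every field is cleaned centrally rather than at each call site.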
What to Learn Next
Observability is the foundation that makes everything else in production AI engineering possible. Once your agents are instrumented, the next layer is evaluation — systematically measuring output quality, not just infrastructure health.
Ready to go deeper? Explore our guides on:
- LLM Evaluation Frameworks — measuring accuracy, hallucination rates, and relevance scores automatically
- Cost Optimization for Production Agents — using token metrics to reduce spend by 40%+ without degrading quality
- Multi-Agent Observability — tracing across agent handoffs when one agent calls another
Start Building Observable Agents Today
Observability is not a “nice to have” for production AI systems — it is a prerequisite for operating them responsibly. The code in this guide gives you a working foundation you can deploy in a day.
Want structured guidance as you build? Harness Engineering Academy offers a full curriculum on production AI engineering, from your first agent to enterprise-scale deployment. Our certification program includes hands-on labs for observability, evaluation, and reliability engineering.
Explore the AI Agent Engineer Certification Path →
Download the Production Observability Checklist →
Have questions or feedback on this tutorial? Jamie reads every comment. Drop your question below or join the discussion in our community Discord.
Jamie Park is an educator and career coach at Harness Engineering Academy. Jamie specializes in helping engineers transition into AI systems roles, with a focus on production reliability and evaluation.