Building Evaluation Datasets for AI Agents: A Step-by-Step Guide

Your agent works in the demo. It handles the five examples you tested during development. Then it hits production and a user asks something slightly different, and the whole thing falls apart.

The gap between “works in demos” and “works in production” is almost always a dataset problem. Not a model problem. Not an architecture problem. A dataset problem. You tested against five examples when you needed five hundred. You tested happy paths when you needed adversarial ones. You built a dataset that confirmed your assumptions instead of challenging them.

Building good evaluation datasets is the single highest-leverage activity in agent development. A well-constructed eval dataset catches failures before users do, guides prompt iteration with data instead of intuition, and gives you confidence that changes improve quality rather than just shifting which things break.

This guide walks through the complete process of building evaluation datasets for AI agents, from defining what to measure through scaling your dataset for production coverage.

Why agent eval datasets are different from traditional test data

Traditional software tests compare expected output to actual output. The function returns 42 or it doesn’t. Agent evaluation is harder because the same input can produce multiple correct outputs.

Ask an agent to summarize a document, and there are hundreds of valid summaries. Ask it to plan a trip, and there are thousands of reasonable itineraries. Ask it to debug code, and the fix might be correct but stylistically different from your reference answer.

This means agent eval datasets need three things traditional test data doesn’t:

Grading criteria instead of exact answers. Instead of “the correct answer is X,” you define what makes an answer acceptable. Does the summary capture the three main points? Does the itinerary include the required cities? Does the code fix resolve the bug without introducing new ones?

Multiple evaluation methods. Some outputs can be checked programmatically (did the agent call the right tool?). Others need model-based grading (is this summary faithful to the source?). Some need human judgment (is this response helpful?). Your dataset needs to support all three.

Coverage across failure modes. Agents fail in predictable ways: they hallucinate facts, call wrong tools, loop infinitely, exceed cost budgets, or produce outputs that are technically correct but unhelpful. Your dataset needs examples that test each failure mode specifically.

Step 1: Define your evaluation dimensions

Before collecting a single example, decide what you’re measuring. Most agent systems need evaluation across four dimensions.

Task completion. Did the agent accomplish what the user asked? This is the most basic dimension but also the most important. For a coding agent, did the code run? For a research agent, did the answer address the question? For a booking agent, was the reservation made correctly?

Output quality. Beyond completing the task, is the output good? Quality covers accuracy, completeness, coherence, and helpfulness. A research agent might answer the question (task completion) but miss critical context (quality failure).

Safety and constraints. Did the agent stay within its boundaries? This includes not hallucinating facts, not revealing system prompts, not performing actions outside its scope, and not exceeding cost or latency budgets. Safety failures are often more damaging than quality failures.

Process correctness. Did the agent take a reasonable path to the answer? An agent that produces the right output through a fragile chain of lucky guesses will fail on the next similar input. Evaluating the trajectory, not just the final output, catches these brittle successes.

For each dimension, define a scoring rubric before you start collecting data. A rubric for task completion might be: 0 = task not attempted, 1 = task attempted but wrong, 2 = task partially completed, 3 = task fully completed. Write these rubrics down. You will reference them constantly.
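Writing the rubric down as data, not just prose, makes it directly usable from grading code later. A minimal sketch, where the task-completion levels mirror the 0-3 scale above and the safety dimension is an illustrative placeholder:

```python
# Rubrics as plain data: one dict per evaluation dimension.
# task_completion mirrors the 0-3 scale described above;
# safety is an illustrative assumption, not a standard.
RUBRICS = {
    "task_completion": {
        0: "task not attempted",
        1: "task attempted but wrong",
        2: "task partially completed",
        3: "task fully completed",
    },
    "safety": {
        0: "violated a constraint (hallucination, scope, budget)",
        1: "stayed within all constraints",
    },
}

def describe_score(dimension: str, score: int) -> str:
    """Translate a numeric score into its rubric definition."""
    return f"{dimension}={score}: {RUBRICS[dimension][score]}"
```

Keeping the rubric in one place means graders, dashboards, and annotation instructions all reference the same definitions.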

Step 2: Seed your dataset from real interactions

The best eval examples come from real usage, not from your imagination. Synthetic examples reflect your assumptions. Real examples reflect what users actually do.

Start with production logs. If your agent is already running, pull interaction logs. Focus on three categories: interactions where the agent clearly succeeded (positive examples), interactions where the agent clearly failed (negative examples), and interactions where the outcome was ambiguous (these are the most valuable because they reveal edge cases your rubric doesn’t cover).

If you don’t have production data, simulate it. Run your agent through realistic scenarios and record the inputs, outputs, and intermediate steps. Get five people who aren’t on the development team to interact with the agent for 30 minutes each. Their inputs will be different from yours in ways that matter.

Collect the full trace, not just input/output. For each example in your dataset, store the user input, the agent’s final output, every intermediate step (tool calls, reasoning, sub-agent delegations), the context the agent had access to, and any metadata (latency, token count, cost). You will need the intermediate steps for trajectory evaluation and debugging.

A minimal dataset entry looks like this:

{
  "id": "eval-001",
  "input": "Find the quarterly revenue for Acme Corp in Q3 2025",
  "context": {"available_tools": ["web_search", "calculator"], "documents": []},
  "expected_behavior": {
    "task_completion": "Returns specific revenue figure with source",
    "required_tool_calls": ["web_search"],
    "constraints": ["Must cite source", "Must not hallucinate figures"]
  },
  "reference_output": "Acme Corp reported $2.3B in Q3 2025 revenue (source: Q3 earnings call transcript)",
  "tags": ["factual_lookup", "tool_use", "citation_required"],
  "difficulty": "medium"
}
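Entries in this shape are easy to validate before they enter the dataset. A sketch of a loader-side check; the field names follow the example entry above, but which fields you treat as required is an assumption about your own pipeline:

```python
# Fields assumed mandatory for every dataset entry (an assumption,
# matching the example entry format shown above).
REQUIRED_FIELDS = {"id", "input", "expected_behavior", "tags", "difficulty"}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry is usable."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())]
    if entry.get("difficulty") not in {"easy", "medium", "hard"}:
        problems.append(f"unknown difficulty: {entry.get('difficulty')!r}")
    return problems
```

Running this check at commit time keeps malformed entries from silently skewing your scores.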

Step 3: Build coverage systematically

Random examples aren’t enough. You need coverage across the dimensions that matter.

Coverage by input type. Map the categories of inputs your agent handles. A customer support agent might handle billing questions, technical issues, account changes, and complaints. Your dataset needs examples from each category, weighted by frequency in production.

Coverage by difficulty. Include easy cases (where any reasonable agent succeeds), medium cases (where good agents succeed and weak ones fail), and hard cases (adversarial, ambiguous, or edge-case inputs that stress-test the system). A common split is 40% easy, 40% medium, 20% hard.

Coverage by failure mode. For each known failure mode, include examples designed to trigger it:

Hallucination: questions where the correct answer is “I don’t know.”
Wrong tool selection: inputs where multiple tools seem relevant.
Infinite loops: tasks with circular dependencies.
Cost overruns: complex tasks requiring many LLM calls.
Constraint violations: inputs that tempt the agent to exceed its scope.
Partial completion: multi-step tasks where early steps succeed but later ones fail.

Coverage by context variation. The same input can produce different results depending on context. Test with varying context window sizes, different available tools, missing information, and contradictory context. These variations catch fragile agents that work only under ideal conditions.

Aim for 200-500 examples for initial coverage. This sounds like a lot, but you can build it incrementally. Start with 50 seed examples covering the most critical paths, then add 20-30 examples per week as you discover new failure modes.
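Tracking coverage as the dataset grows can be a few lines of counting. A sketch, assuming entries carry the `tags` and `difficulty` fields from the example format above:

```python
from collections import Counter

def coverage_report(dataset: list[dict]) -> dict:
    """Count examples per tag and per difficulty to spot coverage gaps."""
    by_tag = Counter(tag for entry in dataset for tag in entry["tags"])
    by_difficulty = Counter(entry["difficulty"] for entry in dataset)
    return {"by_tag": dict(by_tag), "by_difficulty": dict(by_difficulty)}
```

Comparing the difficulty counts against your target split (for example 40/40/20) tells you where the next batch of examples should go.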

Step 4: Design your grading pipeline

Each eval example needs a grading method. The three standard approaches, in order of reliability, are:

Deterministic checks. Programmatic assertions that don’t require judgment. Did the agent call the expected tool? Did the output contain required fields? Was the response under the token limit? These are fast, cheap, and perfectly reliable. Use them for everything you can.

def grade_tool_usage(trace, expected_tools):
    """Check if the agent used the required tools."""
    actual_tools = [step.tool_name for step in trace.steps if step.type == "tool_call"]
    missing = set(expected_tools) - set(actual_tools)
    return {"passed": len(missing) == 0, "missing_tools": list(missing)}

Model-based grading. Use an LLM to judge the agent’s output against your rubric. This works well for subjective dimensions like quality and helpfulness. The grading prompt matters enormously. A vague prompt like “is this good?” produces unreliable scores. A specific prompt like “Does this summary contain all three main points from the source document? Score 0 if none, 1 if one or two, 2 if all three” produces consistent, useful scores.
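In practice, a model-based grader is mostly prompt construction plus strict score parsing. A sketch using the summary rubric above, with the LLM call itself left out so any client can be plugged in; the prompt wording and 0-2 score range are taken from the example, everything else is an assumption:

```python
def build_grading_prompt(source: str, summary: str) -> str:
    """Build a specific, rubric-based grading prompt (per the example above)."""
    return (
        "Does this summary contain all three main points from the source "
        "document? Score 0 if none, 1 if one or two, 2 if all three. "
        "Reply with the score only.\n\n"
        f"Source:\n{source}\n\nSummary:\n{summary}"
    )

def parse_score(reply: str, valid: range = range(0, 3)) -> int:
    """Reject anything that is not a bare, in-range integer."""
    score = int(reply.strip())
    if score not in valid:
        raise ValueError(f"score out of range: {score}")
    return score
```

Strict parsing matters: a grader that silently accepts “2, because…” or an out-of-range number will corrupt your aggregate scores.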

Human evaluation. Human graders score outputs that deterministic checks and model-based grading can’t handle reliably. Reserve human evaluation for your hardest cases, for calibrating your model-based graders, and for periodic audits of your automated pipeline. Human evaluation is expensive but irreplaceable for building trust in your other grading methods.

Most production pipelines use all three in a cascade: deterministic checks run first (instant, free), model-based grading runs on examples that pass deterministic checks (seconds, cents), and human evaluation runs on a sample for calibration (minutes, dollars).
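The cascade can be expressed as a short pipeline that stops at the first stage to fail and only escalates passing examples to the more expensive grader. A sketch; the grader signatures are assumptions, chosen to match the `grade_tool_usage` result shape above:

```python
def grade_cascade(example, trace, deterministic, model_based):
    """Run cheap deterministic checks first; call the model grader only on passes."""
    det = deterministic(trace)
    if not det["passed"]:
        # Failed the free check: no point spending tokens on model grading.
        return {"stage": "deterministic", **det}
    return {"stage": "model", **model_based(example, trace)}
```

Human evaluation is not part of this per-example path; it samples the cascade's outputs separately for calibration.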

Step 5: Version and maintain your dataset

Eval datasets are living artifacts. They change as your agent evolves, your user base grows, and you discover new failure modes. Without versioning, you lose the ability to compare results across time.

Version every change. Use git or a dedicated dataset versioning tool. Every addition, removal, or modification to your dataset should be tracked with a reason. “Added 15 examples for multi-step tool chains after discovering loop failures in production” is useful. “Updated dataset” is not.

Tag examples by addition date. This lets you run evaluations against “the dataset as it existed on March 1st” and compare with “the dataset as it exists today.” If your agent’s score drops, you need to know whether the agent got worse or the dataset got harder.

Retire stale examples. If your agent no longer handles a particular use case, or if the grading criteria for an example no longer match your rubric, retire the example. Don’t delete it; move it to an archive. You might need it again.

Run regular dataset audits. Every month, sample 20 examples and re-grade them manually. Check whether your model-based graders still agree with human judgment. Check whether your deterministic checks still reflect the current expected behavior. Dataset drift is as real as model drift, and just as dangerous.
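The monthly audit boils down to drawing a reproducible sample and comparing automated scores with human scores on it. A sketch using exact-match agreement; a tolerance band or Cohen's kappa would also be reasonable choices:

```python
import random

def audit_sample(dataset: list, k: int = 20, seed: int = 0) -> list:
    """Draw a reproducible sample of examples to re-grade manually."""
    return random.Random(seed).sample(dataset, min(k, len(dataset)))

def agreement_rate(auto_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of sampled examples where the model grader matched the human."""
    matches = sum(a == h for a, h in zip(auto_scores, human_scores))
    return matches / len(auto_scores)
```

Fixing the seed means next month's audit can re-check the same examples, or change the seed to rotate through the dataset.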

Track dataset statistics. Maintain a dashboard showing: total examples by category, difficulty distribution, failure mode coverage, last audit date, and grader agreement rates. When coverage drops below your threshold in any category, that is your signal to add more examples.

Step 6: Scale from prototype to production

A 50-example seed dataset gets you started. A production-grade dataset requires more systematic effort.

Mine production failures. Set up a pipeline that flags agent interactions with low confidence scores, user complaints, or unexpected behavior. Review these weekly and convert the most informative failures into eval examples. Production failures are your highest-value source of new examples.

Use LLM-generated variations. Take your seed examples and ask an LLM to generate variations: rephrase the input, change the context, increase the difficulty. Review every generated example before adding it to your dataset. LLM-generated examples are a draft, not a finished product.

Build annotation workflows. If you need human-labeled examples at scale, build a lightweight annotation interface. It doesn’t need to be sophisticated. A spreadsheet where annotators score agent outputs against your rubric works for teams of 1-5. At larger scale, use dedicated annotation tools like Label Studio or Argilla.

Stratified sampling for regression testing. You don’t need to run every example on every evaluation. Create a “smoke test” subset (50 examples covering critical paths) that runs on every change. Run the full dataset nightly or weekly. Run the full dataset plus adversarial extensions before major releases.
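Selecting the smoke-test subset can itself be stratified so every critical tag is represented. A sketch; which tags count as critical is an assumption about your own labeling scheme:

```python
def smoke_subset(dataset: list[dict], per_tag: int, critical_tags: set[str]) -> list[dict]:
    """Take up to per_tag examples for each critical tag, without duplicates."""
    chosen, seen = [], set()
    for tag in sorted(critical_tags):  # sorted for deterministic output
        count = 0
        for entry in dataset:
            if tag in entry["tags"] and entry["id"] not in seen and count < per_tag:
                chosen.append(entry)
                seen.add(entry["id"])
                count += 1
    return chosen
```

Because the subset is derived from tags rather than hand-picked, it stays representative as the full dataset grows.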

Connecting evaluation to your development cycle

A dataset that sits in a folder unused is worthless. The dataset becomes valuable when it’s wired into your development workflow.

Run eval on every prompt change. When someone modifies a system prompt or adds a tool, the eval suite runs automatically and reports whether quality improved, degraded, or stayed flat. This is the agent equivalent of running unit tests on every commit.

Track scores over time. A dashboard showing eval scores across versions tells you whether your agent is improving. If task completion is rising but safety scores are dropping, you have a trade-off to investigate, not a success to celebrate.

Use eval results to prioritize work. If your dataset reveals that 30% of failures come from wrong tool selection, that is where engineering effort should focus next. Eval data turns “the agent feels unreliable” into “the agent fails on tool selection in 30% of multi-step tasks, primarily when the user’s request is ambiguous.”

For a deeper understanding of how evaluation datasets fit into the broader verification process, read our guide to AI agent verification. To understand the design patterns that evaluation datasets should test against, see our agent design patterns guide. For the tools that run evaluations against these datasets, explore the eval tools comparison on agent-harness.ai.

Frequently asked questions

How many examples do I need in my eval dataset?

Start with 50 examples covering your critical paths. Scale to 200-500 for reasonable production coverage. The number matters less than the coverage. Fifty well-chosen examples that cover all failure modes are more valuable than 500 examples that all test the happy path.

Should I use synthetic data or real data for evaluation?

Both. Seed with real production data to capture actual user behavior. Supplement with synthetic variations to improve coverage of edge cases and failure modes. Never rely on synthetic data alone because it reflects your assumptions about how users interact with the agent, not how they actually do.

How often should I update my eval dataset?

Continuously. Add new examples whenever you discover a new failure mode in production. Audit your existing examples monthly. Retire examples that no longer reflect current behavior. A stale dataset gives you false confidence, which is worse than no dataset at all.

Can I use the same dataset for development and production evaluation?

Separate them. Your development dataset is for iterating on prompts and tools during development. Your production eval dataset is a held-out set that measures real quality. If you optimize against your eval dataset during development, your scores will look great but won’t reflect actual performance. This is the same overfitting problem from machine learning, applied to agent development.

