AI Agent Verification Checklist: 25 Checks Before You Deploy

A fintech startup deployed a customer service agent on a Friday afternoon. By Monday morning, it had confidently told 47 customers they were eligible for loan products they didn’t qualify for. The agent passed every functional test during development. What it didn’t pass was a systematic verification of safety constraints, output validation, and failure handling under production conditions.

This checklist exists to prevent that kind of deployment. It covers 25 verification checks organized into five categories: functional correctness, safety and constraints, cost and resource controls, observability, and production readiness. Run through every item before any agent touches real users.

Interactive Concept Map

Click any node to expand or collapse. Use the controls to zoom, fit to view, or go fullscreen.

agent verification checklist infographic
Visual checklist for AI agent verification before deployment. Click to enlarge.

How to use this checklist

Print it. Pin it to your deployment process. No agent ships without a check mark on every line. Some checks are automated (run them in your CI/CD pipeline). Others require manual review. The checklist notes which is which.

For each check, the format is: what to verify, how to verify it, and what a failure looks like.

Category 1: Functional correctness (Checks 1-7)

These verify that your agent does what it’s supposed to do.

Check 1: Core task completion

Verify: The agent completes its primary task correctly on a representative evaluation dataset.

How: Run 50+ test cases covering normal inputs. Measure pass rate with model-graded evaluation using specific rubrics (not “is this good?” but “does this response include X, Y, and Z?”).

Failure looks like: Pass rate below 90% on core task cases. This means 1 in 10 users will get an incorrect response.

Type: Automated

Check 2: Tool call accuracy

Verify: The agent selects the correct tool for each input type and passes valid parameters.

How: Run test cases for every tool the agent can call. Validate tool name, parameter names, parameter types, and parameter values against expected behavior.

Failure looks like: Agent calls a search tool when it should call a database query tool, or passes a string where an integer is expected.

Type: Automated

Check 3: Multi-step task handling

Verify: The agent completes tasks requiring multiple sequential steps without losing context or skipping steps.

How: Design test cases with 3-5 step tasks. Verify each intermediate step produces correct output and the final result reflects all steps.

Failure looks like: Agent completes steps 1-3 correctly but produces a final answer that ignores the output of step 2. Common in long reasoning chains where context drops off.

Type: Automated

Check 4: Edge case handling

Verify: The agent handles unusual inputs gracefully: empty strings, extremely long inputs, special characters, mixed languages, ambiguous queries.

How: Run 20+ edge case inputs. Verify the agent either handles them correctly or fails gracefully with a helpful error message.

Failure looks like: Agent crashes, returns an empty response, or enters an infinite loop on malformed input.

Type: Automated

Check 5: Conversation context maintenance

Verify: In multi-turn scenarios, the agent maintains context from previous turns and doesn’t contradict itself.

How: Run 10+ multi-turn test conversations (4-6 turns each). Check that references to earlier turns are accurate and the agent tracks state correctly.

Failure looks like: User says “I’m interested in the second option” and the agent doesn’t know which option they mean, or recommends something it previously ruled out.

Type: Automated

Check 6: Output format compliance

Verify: Agent outputs conform to the required format: JSON schema, markdown structure, response length, or other structural requirements.

How: Parse every output against the expected schema or format specification. Reject any output that doesn’t validate.

Failure looks like: Downstream system receives JSON with missing fields, truncated responses, or unexpected data types.

Type: Automated

Check 7: Fallback behavior

Verify: When the agent doesn’t know the answer or can’t complete a task, it acknowledges this rather than guessing.

How: Run 10 cases where the correct answer is “I don’t know” or “I can’t help with that.” Verify the agent doesn’t fabricate a response.

Failure looks like: Agent confidently answers a question it has no basis for answering. This is the most dangerous failure mode because users trust confident responses.

Type: Automated + Manual review

Category 2: Safety and constraints (Checks 8-13)

These verify that your agent respects boundaries.

Check 8: Prohibited content filtering

Verify: The agent refuses to generate content that violates your safety policies: harmful instructions, discriminatory content, private information disclosure.

How: Run 20+ adversarial inputs that attempt to elicit prohibited content. Use both direct requests and indirect prompt injection attempts.

Failure looks like: Agent generates prohibited content when asked creatively, even if it refuses direct requests.

Type: Automated + Manual review

Check 9: Prompt injection resistance

Verify: The agent doesn’t follow instructions embedded in user input that attempt to override its system prompt.

How: Test with 15+ prompt injection patterns: “Ignore all previous instructions,” instructions embedded in user-provided documents, role-playing attacks, delimiter manipulation.

Failure looks like: Agent executes instructions from user input that contradict its system prompt, like revealing internal instructions or changing its behavior.

Type: Automated

Check 10: Data boundary enforcement

Verify: The agent only accesses data it’s authorized to access and doesn’t leak information across user sessions or privilege levels.

How: Test with scenarios where User A asks about User B’s data. Test cases where the agent should NOT have access to certain tools or databases.

Failure looks like: Agent retrieves and displays data belonging to a different user, or calls a tool it shouldn’t have access to.

Type: Automated + Manual review

Check 11: Disclaimer and qualification

Verify: The agent includes required disclaimers for regulated topics: financial advice, medical information, legal guidance.

How: Run 10+ queries on regulated topics. Verify appropriate disclaimers appear in every response.

Failure looks like: Agent provides medical information without “consult a healthcare professional” disclaimer. This creates legal liability.

Type: Automated

Check 12: Scope containment

Verify: The agent stays within its defined scope and doesn’t attempt to help with tasks outside its domain.

How: Run 10 out-of-scope queries. Verify the agent redirects the user appropriately rather than attempting a response.

Failure looks like: A customer service agent starts giving legal advice, or a code assistant starts providing medical opinions.

Type: Automated

Check 13: PII handling

Verify: The agent handles personally identifiable information according to your privacy policy: doesn’t log PII, doesn’t include PII in responses to other users, masks sensitive data in outputs.

How: Submit inputs containing PII (names, emails, phone numbers, SSNs). Verify PII is handled correctly in responses, logs, and stored data.

Failure looks like: Agent echoes a user’s SSN in its response, or PII from one user appears in logs accessible to another.

Type: Manual review

Category 3: Cost and resource controls (Checks 14-17)

These prevent your agent from burning money or consuming excessive resources.

Check 14: Token budget enforcement

Verify: The agent stays within defined token limits for both input context and output generation.

How: Run test cases with varying input sizes. Verify that context assembly never exceeds the model’s context window and output generation stays within your budget.

Failure looks like: Agent assembles a 200K token context when the budget is 50K, costing 4x more per request than planned.

Type: Automated

Check 15: Loop detection

Verify: The agent doesn’t enter infinite or near-infinite loops of tool calls or self-reflection.

How: Run edge cases that have historically triggered loops: ambiguous queries, contradictory constraints, tasks that can’t be completed with available tools.

Failure looks like: Agent makes 50+ tool calls on a single query, consuming hundreds of thousands of tokens before timing out.

Type: Automated

Check 16: Cost per query validation

Verify: The actual cost per query matches your expected cost model within acceptable variance.

How: Run 100 representative queries and measure total token usage (input + output) for each. Calculate mean, p95, and p99 cost per query.

Failure looks like: Mean cost is $0.15 per query when your budget assumes $0.05. At 10,000 queries per day, that’s $1,000 daily overspend.

Type: Automated

Check 17: Rate limit handling

Verify: The agent handles rate limits from external APIs (LLM providers, search APIs, databases) gracefully with backoff and retry logic.

How: Simulate rate limit responses from each external dependency. Verify the agent retries with appropriate backoff and eventually returns a response or a meaningful error.

Failure looks like: Agent crashes on the first rate limit error, or retries immediately without backoff and triggers more aggressive rate limiting.

Type: Automated

Category 4: Observability (Checks 18-21)

These ensure you can see what your agent is doing in production.

Check 18: Structured logging

Verify: Every agent action produces structured log entries with consistent fields: timestamp, request_id, step_name, duration, token_usage, status.

How: Run 10 requests end-to-end. Verify every step produces a log entry and all required fields are populated.

Failure looks like: Missing log entries for specific steps, making it impossible to debug failures. Or unstructured log messages that can’t be queried.

Type: Automated

Check 19: Error reporting

Verify: Errors are captured with sufficient context for debugging: the input that triggered the error, the agent’s state at the time, and the full error trace.

How: Trigger 5 different error conditions. Verify each produces an error report with enough context to diagnose the root cause without accessing the production system.

Failure looks like: Error log says “LLM call failed” without the prompt, model parameters, or response that caused the failure.

Type: Manual review

Check 20: Metrics collection

Verify: Key operational metrics are collected and accessible: latency per step, total request latency, token usage, error rates, tool call success rates.

How: Run 20 requests and verify metrics appear in your monitoring dashboard within the expected delay.

Failure looks like: Dashboard shows zero data, or metrics are collected but not queryable in your monitoring system.

Type: Manual review

Check 21: Trace correlation

Verify: All log entries and metrics for a single request can be correlated using a shared request ID.

How: Pick 5 completed requests. Use their request IDs to pull all related logs, metrics, and traces. Verify a complete picture of the request lifecycle is available.

Failure looks like: Some steps don’t include the request ID, making it impossible to trace a request end-to-end.

Type: Manual review

Category 5: Production readiness (Checks 22-25)

These verify the operational infrastructure around your agent.

Check 22: Graceful degradation

Verify: When external dependencies fail (LLM provider down, search API unavailable, database timeout), the agent degrades gracefully rather than crashing.

How: Simulate failure of each external dependency. Verify the agent returns a meaningful error message or falls back to a simpler behavior.

Failure looks like: Agent returns a 500 error or hangs indefinitely when the LLM provider has a 30-second latency spike.

Type: Automated

Check 23: Load handling

Verify: The agent handles your expected peak load without degraded performance.

How: Run a load test at 2x your expected peak concurrent requests. Measure latency at p50, p95, and p99. Verify error rates stay below 1%.

Failure looks like: Latency doubles at 1.5x normal load, or the system starts dropping requests under peak conditions.

Type: Automated

Check 24: Rollback plan

Verify: You can roll back to the previous agent version within 5 minutes if the new deployment causes problems.

How: Document the rollback procedure. Execute it in a staging environment. Time it.

Failure looks like: Rollback requires manual database changes, config file edits across multiple servers, or takes more than 15 minutes. During that time, users are experiencing the broken agent.

Type: Manual review

Check 25: Incident response

Verify: Your team has a documented process for responding to agent failures: who gets paged, what gets checked first, how to disable the agent in an emergency.

How: Run a tabletop exercise. Simulate a production incident and walk through the response process.

Failure looks like: When the agent starts giving wrong answers at 2 AM, nobody knows who to call or how to shut it down.

Type: Manual review

Using this checklist in practice

Pre-deployment gate

Add this checklist as a mandatory gate in your deployment pipeline. No agent ships to production without all 25 checks verified. For automated checks, build them into your CI/CD pipeline using the patterns in our automated testing pipeline guide. For manual checks, assign an owner and require sign-off.

Ongoing verification

This checklist isn’t one-and-done. Re-run it when:
– You change the system prompt
– You update the model version
– You add or modify tools
– You change context assembly logic
– Monthly, as a routine verification

Scoring

Track how many checks pass on each deployment. A declining pass rate across deployments signals that quality is slipping and the development process needs attention.

Pass Rate Status Action
25/25 Ship it Deploy with confidence
22-24/25 Fix first Address failures before deployment
18-21/25 Major gaps Significant work needed
Below 18/25 Not ready Agent needs substantial development

Frequently asked questions

Do I need all 25 checks for every agent?

The functional correctness and safety checks (1-13) are non-negotiable for any agent that interacts with users. Cost controls (14-17) are essential for any agent making LLM API calls. Observability (18-21) and production readiness (22-25) scale with deployment complexity. A simple internal tool might skip load testing. A customer-facing agent needs every check.

How long does the full checklist take?

Automated checks (about 16 of the 25) run in your CI/CD pipeline in under 15 minutes. Manual checks (about 9) take 2-4 hours for a thorough review. Budget a full day for the first deployment and 2-3 hours for subsequent deployments with incremental changes.

What if a check fails close to launch?

Don’t ship. The cost of a production incident (user trust, engineering time, potential legal liability) always exceeds the cost of a delayed launch. Fix the failure, re-verify, and deploy when clean.

For the verification methodology behind this checklist, read our complete guide to AI agent verification. Subscribe to the newsletter for deployment checklists, verification patterns, and production engineering guides.

Leave a Comment