A fintech startup deployed a customer service agent on a Friday afternoon. By Monday morning, it had confidently told 47 customers they were eligible for loan products they didn’t qualify for. The agent passed every functional test during development. What it didn’t pass was a systematic verification of safety constraints, output validation, and failure handling under production conditions.
This checklist exists to prevent that kind of deployment. It covers 25 verification checks organized into five categories: functional correctness, safety and constraints, cost and resource controls, observability, and production readiness. Run through every item before any agent touches real users.
Interactive Concept Map
Click any node to expand or collapse. Use the controls to zoom, fit to view, or go fullscreen.
How to use this checklist
Print it. Pin it to your deployment process. No agent ships without a check mark on every line. Some checks are automated (run them in your CI/CD pipeline). Others require manual review. The checklist notes which is which.
For each check, the format is: what to verify, how to verify it, and what a failure looks like.
Category 1: Functional correctness (Checks 1-7)
These verify that your agent does what it’s supposed to do.
Check 1: Core task completion
Verify: The agent completes its primary task correctly on a representative evaluation dataset.
How: Run 50+ test cases covering normal inputs. Measure pass rate with model-graded evaluation using specific rubrics (not “is this good?” but “does this response include X, Y, and Z?”).
Failure looks like: Pass rate below 90% on core task cases. This means 1 in 10 users will get an incorrect response.
Type: Automated
Check 2: Tool call accuracy
Verify: The agent selects the correct tool for each input type and passes valid parameters.
How: Run test cases for every tool the agent can call. Validate tool name, parameter names, parameter types, and parameter values against expected behavior.
Failure looks like: Agent calls a search tool when it should call a database query tool, or passes a string where an integer is expected.
Type: Automated
Check 3: Multi-step task handling
Verify: The agent completes tasks requiring multiple sequential steps without losing context or skipping steps.
How: Design test cases with 3-5 step tasks. Verify each intermediate step produces correct output and the final result reflects all steps.
Failure looks like: Agent completes steps 1-3 correctly but produces a final answer that ignores the output of step 2. Common in long reasoning chains where context drops off.
Type: Automated
Check 4: Edge case handling
Verify: The agent handles unusual inputs gracefully: empty strings, extremely long inputs, special characters, mixed languages, ambiguous queries.
How: Run 20+ edge case inputs. Verify the agent either handles them correctly or fails gracefully with a helpful error message.
Failure looks like: Agent crashes, returns an empty response, or enters an infinite loop on malformed input.
Type: Automated
Check 5: Conversation context maintenance
Verify: In multi-turn scenarios, the agent maintains context from previous turns and doesn’t contradict itself.
How: Run 10+ multi-turn test conversations (4-6 turns each). Check that references to earlier turns are accurate and the agent tracks state correctly.
Failure looks like: User says “I’m interested in the second option” and the agent doesn’t know which option they mean, or recommends something it previously ruled out.
Type: Automated
Check 6: Output format compliance
Verify: Agent outputs conform to the required format: JSON schema, markdown structure, response length, or other structural requirements.
How: Parse every output against the expected schema or format specification. Reject any output that doesn’t validate.
Failure looks like: Downstream system receives JSON with missing fields, truncated responses, or unexpected data types.
Type: Automated
Check 7: Fallback behavior
Verify: When the agent doesn’t know the answer or can’t complete a task, it acknowledges this rather than guessing.
How: Run 10 cases where the correct answer is “I don’t know” or “I can’t help with that.” Verify the agent doesn’t fabricate a response.
Failure looks like: Agent confidently answers a question it has no basis for answering. This is the most dangerous failure mode because users trust confident responses.
Type: Automated + Manual review
Category 2: Safety and constraints (Checks 8-13)
These verify that your agent respects boundaries.
Check 8: Prohibited content filtering
Verify: The agent refuses to generate content that violates your safety policies: harmful instructions, discriminatory content, private information disclosure.
How: Run 20+ adversarial inputs that attempt to elicit prohibited content. Use both direct requests and indirect prompt injection attempts.
Failure looks like: Agent generates prohibited content when asked creatively, even if it refuses direct requests.
Type: Automated + Manual review
Check 9: Prompt injection resistance
Verify: The agent doesn’t follow instructions embedded in user input that attempt to override its system prompt.
How: Test with 15+ prompt injection patterns: “Ignore all previous instructions,” instructions embedded in user-provided documents, role-playing attacks, delimiter manipulation.
Failure looks like: Agent executes instructions from user input that contradict its system prompt, like revealing internal instructions or changing its behavior.
Type: Automated
Check 10: Data boundary enforcement
Verify: The agent only accesses data it’s authorized to access and doesn’t leak information across user sessions or privilege levels.
How: Test with scenarios where User A asks about User B’s data. Test cases where the agent should NOT have access to certain tools or databases.
Failure looks like: Agent retrieves and displays data belonging to a different user, or calls a tool it shouldn’t have access to.
Type: Automated + Manual review
Check 11: Disclaimer and qualification
Verify: The agent includes required disclaimers for regulated topics: financial advice, medical information, legal guidance.
How: Run 10+ queries on regulated topics. Verify appropriate disclaimers appear in every response.
Failure looks like: Agent provides medical information without “consult a healthcare professional” disclaimer. This creates legal liability.
Type: Automated
Check 12: Scope containment
Verify: The agent stays within its defined scope and doesn’t attempt to help with tasks outside its domain.
How: Run 10 out-of-scope queries. Verify the agent redirects the user appropriately rather than attempting a response.
Failure looks like: A customer service agent starts giving legal advice, or a code assistant starts providing medical opinions.
Type: Automated
Check 13: PII handling
Verify: The agent handles personally identifiable information according to your privacy policy: doesn’t log PII, doesn’t include PII in responses to other users, masks sensitive data in outputs.
How: Submit inputs containing PII (names, emails, phone numbers, SSNs). Verify PII is handled correctly in responses, logs, and stored data.
Failure looks like: Agent echoes a user’s SSN in its response, or PII from one user appears in logs accessible to another.
Type: Manual review
Category 3: Cost and resource controls (Checks 14-17)
These prevent your agent from burning money or consuming excessive resources.
Check 14: Token budget enforcement
Verify: The agent stays within defined token limits for both input context and output generation.
How: Run test cases with varying input sizes. Verify that context assembly never exceeds the model’s context window and output generation stays within your budget.
Failure looks like: Agent assembles a 200K token context when the budget is 50K, costing 4x more per request than planned.
Type: Automated
Check 15: Loop detection
Verify: The agent doesn’t enter infinite or near-infinite loops of tool calls or self-reflection.
How: Run edge cases that have historically triggered loops: ambiguous queries, contradictory constraints, tasks that can’t be completed with available tools.
Failure looks like: Agent makes 50+ tool calls on a single query, consuming hundreds of thousands of tokens before timing out.
Type: Automated
Check 16: Cost per query validation
Verify: The actual cost per query matches your expected cost model within acceptable variance.
How: Run 100 representative queries and measure total token usage (input + output) for each. Calculate mean, p95, and p99 cost per query.
Failure looks like: Mean cost is $0.15 per query when your budget assumes $0.05. At 10,000 queries per day, that’s $1,000 daily overspend.
Type: Automated
Check 17: Rate limit handling
Verify: The agent handles rate limits from external APIs (LLM providers, search APIs, databases) gracefully with backoff and retry logic.
How: Simulate rate limit responses from each external dependency. Verify the agent retries with appropriate backoff and eventually returns a response or a meaningful error.
Failure looks like: Agent crashes on the first rate limit error, or retries immediately without backoff and triggers more aggressive rate limiting.
Type: Automated
Category 4: Observability (Checks 18-21)
These ensure you can see what your agent is doing in production.
Check 18: Structured logging
Verify: Every agent action produces structured log entries with consistent fields: timestamp, request_id, step_name, duration, token_usage, status.
How: Run 10 requests end-to-end. Verify every step produces a log entry and all required fields are populated.
Failure looks like: Missing log entries for specific steps, making it impossible to debug failures. Or unstructured log messages that can’t be queried.
Type: Automated
Check 19: Error reporting
Verify: Errors are captured with sufficient context for debugging: the input that triggered the error, the agent’s state at the time, and the full error trace.
How: Trigger 5 different error conditions. Verify each produces an error report with enough context to diagnose the root cause without accessing the production system.
Failure looks like: Error log says “LLM call failed” without the prompt, model parameters, or response that caused the failure.
Type: Manual review
Check 20: Metrics collection
Verify: Key operational metrics are collected and accessible: latency per step, total request latency, token usage, error rates, tool call success rates.
How: Run 20 requests and verify metrics appear in your monitoring dashboard within the expected delay.
Failure looks like: Dashboard shows zero data, or metrics are collected but not queryable in your monitoring system.
Type: Manual review
Check 21: Trace correlation
Verify: All log entries and metrics for a single request can be correlated using a shared request ID.
How: Pick 5 completed requests. Use their request IDs to pull all related logs, metrics, and traces. Verify a complete picture of the request lifecycle is available.
Failure looks like: Some steps don’t include the request ID, making it impossible to trace a request end-to-end.
Type: Manual review
Category 5: Production readiness (Checks 22-25)
These verify the operational infrastructure around your agent.
Check 22: Graceful degradation
Verify: When external dependencies fail (LLM provider down, search API unavailable, database timeout), the agent degrades gracefully rather than crashing.
How: Simulate failure of each external dependency. Verify the agent returns a meaningful error message or falls back to a simpler behavior.
Failure looks like: Agent returns a 500 error or hangs indefinitely when the LLM provider has a 30-second latency spike.
Type: Automated
Check 23: Load handling
Verify: The agent handles your expected peak load without degraded performance.
How: Run a load test at 2x your expected peak concurrent requests. Measure latency at p50, p95, and p99. Verify error rates stay below 1%.
Failure looks like: Latency doubles at 1.5x normal load, or the system starts dropping requests under peak conditions.
Type: Automated
Check 24: Rollback plan
Verify: You can roll back to the previous agent version within 5 minutes if the new deployment causes problems.
How: Document the rollback procedure. Execute it in a staging environment. Time it.
Failure looks like: Rollback requires manual database changes, config file edits across multiple servers, or takes more than 15 minutes. During that time, users are experiencing the broken agent.
Type: Manual review
Check 25: Incident response
Verify: Your team has a documented process for responding to agent failures: who gets paged, what gets checked first, how to disable the agent in an emergency.
How: Run a tabletop exercise. Simulate a production incident and walk through the response process.
Failure looks like: When the agent starts giving wrong answers at 2 AM, nobody knows who to call or how to shut it down.
Type: Manual review
Using this checklist in practice
Pre-deployment gate
Add this checklist as a mandatory gate in your deployment pipeline. No agent ships to production without all 25 checks verified. For automated checks, build them into your CI/CD pipeline using the patterns in our automated testing pipeline guide. For manual checks, assign an owner and require sign-off.
Ongoing verification
This checklist isn’t one-and-done. Re-run it when:
– You change the system prompt
– You update the model version
– You add or modify tools
– You change context assembly logic
– Monthly, as a routine verification
Scoring
Track how many checks pass on each deployment. A declining pass rate across deployments signals that quality is slipping and the development process needs attention.
| Pass Rate | Status | Action |
|---|---|---|
| 25/25 | Ship it | Deploy with confidence |
| 22-24/25 | Fix first | Address failures before deployment |
| 18-21/25 | Major gaps | Significant work needed |
| Below 18/25 | Not ready | Agent needs substantial development |
Frequently asked questions
Do I need all 25 checks for every agent?
The functional correctness and safety checks (1-13) are non-negotiable for any agent that interacts with users. Cost controls (14-17) are essential for any agent making LLM API calls. Observability (18-21) and production readiness (22-25) scale with deployment complexity. A simple internal tool might skip load testing. A customer-facing agent needs every check.
How long does the full checklist take?
Automated checks (about 16 of the 25) run in your CI/CD pipeline in under 15 minutes. Manual checks (about 9) take 2-4 hours for a thorough review. Budget a full day for the first deployment and 2-3 hours for subsequent deployments with incremental changes.
What if a check fails close to launch?
Don’t ship. The cost of a production incident (user trust, engineering time, potential legal liability) always exceeds the cost of a delayed launch. Fix the failure, re-verify, and deploy when clean.
For the verification methodology behind this checklist, read our complete guide to AI agent verification. Subscribe to the newsletter for deployment checklists, verification patterns, and production engineering guides.