Use this to define what "good" means before you build. If you can't write this, you're not ready to build.
Upstream: the task definition and quality bar come from the AI PRD. Downstream: eval results feed into the launch gate checklist and the weekly post-launch review.
Task definition
Eval scope
| Level | What you measure | Example |
|---|---|---|
| Node | Individual step or tool call accuracy | Did the retrieval return relevant docs? Did the classifier pick the right category? |
| Session | End-to-end task completion | Did the agent complete the full workflow? Did it recover from a failed step? |
| System | Latency, cost, token efficiency across runs | p95 latency under 8s, cost per task under $0.05 |
Evaluation dataset
- Source:
- Size: internal iteration: 5-10; prototype: 20-50; pilot: 50-100; production: 200+; larger for regulated/high-stakes domains
- Selection method: random sample, stratified by difficulty, adversarial, etc.
- Labeler: who created the ground truth? what were their instructions?
- Trace source: prototype traces, production traces, synthetic traces, support tickets, user research, etc.
- Traces reviewed: count and date range
- Human-labeled failures: count and who labeled them
- Eval-human agreement target: e.g., judge agrees with human labels >= 85% before use
- Stage: internal iteration / prototype validation / pilot readiness / production readiness
Golden examples
Happy path
Input: typical, well-formed input
Expected output: what good looks like
Why this matters: what it demonstrates about the AI's core capability
Edge case
Input: unusual but valid input the AI must handle
Expected output: acceptable behavior
Why this matters: what breaks if this fails
Unacceptable output
Input: input that could produce a bad result
Unacceptable output: the output you're testing against
Why this is unacceptable: user impact, trust damage, compliance issue
Safety boundary
Input: input that tests a safety constraint
Expected behavior: refusal, escalation, or safe fallback
Why this matters: what goes wrong if the boundary isn't held
Robustness and consistency
Robustness: accuracy across input variations
| Input variant | Example | Accuracy target |
|---|---|---|
| e.g., clean PDF | ||
| e.g., scanned image | ||
| e.g., CSV with different column order |
Consistency: same entity, same result
| Entity type | Variations to test | Target consistency |
|---|---|---|
| e.g., merchant names | e.g., "SQ", "Singapore Airlines", "Singapore Air" |
Quality rubric
| Criterion | Pass | Fail |
|---|---|---|
Grading tiers
| Tier | Method | When to use | Example |
|---|---|---|---|
| 1 | Code-based (deterministic) | Structured outputs, format checks, exact match | Schema validation, required fields present, no fabricated citations |
| 2 | LLM-as-judge | Subjective quality, tone, relevance, reasoning | Pass/fail judge for handoff failure, ignored intent, unsupported claim |
| 3 | Human review | Calibration, expert domains, disputed cases | Medical accuracy review, legal compliance check |
Error analysis
- Trace sample: e.g., 50 production traces, stratified by channel and user segment
- Reviewer: PM or domain expert who reviewed the traces
- Review date:
| Error category | Example trace or input | Frequency in sample | Severity | Fix path | Automate? |
|---|---|---|---|---|---|
| e.g., human handoff failure | low/med/high | prompt/retrieval/tool/product/policy | yes/no/later |
Trace-derived eval cases
| Trace or source | Failure observed | Eval case added | Human label | Owner |
|---|---|---|---|---|
LLM judge calibration
| Judge | Failure detected | Human-labeled sample size | Agreement | True positive rate | True negative rate | Approved for use? |
|---|---|---|---|---|---|---|
Non-deterministic eval strategy
- Runs per eval: e.g., 3-5 runs per test case, aggregate scores
- Aggregation method: e.g., median score, majority vote on pass/fail
- Variance threshold: e.g., if pass/fail disagrees across runs, flag for human review
Automated checks
- Output schema validation
- Required fields present
- No fabricated references or citations
- add task-specific checks
Regression plan
- Regression suite size:
- Run frequency: on every deploy, daily, weekly
- Alert if: specific threshold, e.g., "accuracy drops > 2% vs. baseline"
Online metrics
| Metric | Definition | Target | Alert threshold |
|---|---|---|---|
Launch threshold
Data flywheel
- Failure capture: e.g., user rejections, flagged outputs, support escalations logged
- Triage cadence: e.g., weekly review of new failures
- Eval set update: e.g., 5-10 new cases added per month from production
Review cadence
- Pre-launch: e.g., on every model or prompt change
- Post-launch: e.g., weekly for first month, then bi-weekly