AI PM Playbook: AI eval plan

Use this to define what "good" means before you build. If you can't write this, you're not ready to build.

Upstream: the task definition and quality bar come from the AI PRD. Downstream: eval results feed into the launch gate checklist and the weekly post-launch review.

Task definition

Eval scope

Level	What you measure	Example
Node	Individual step or tool call accuracy	Did the retrieval return relevant docs? Did the classifier pick the right category?
Session	End-to-end task completion	Did the agent complete the full workflow? Did it recover from a failed step?
System	Latency, cost, token efficiency across runs	p95 latency under 8s, cost per task under $0.05
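The system-level row can be checked mechanically once each run is logged. The following is a minimal sketch, assuming each run is a record with hypothetical `latency_s` and `cost_usd` fields. The 8s and $0.05 budgets are the example targets from the table, not recommendations, and "cost per task" is read here as the mean cost across runs.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def system_report(runs, p95_budget_s=8.0, cost_budget_usd=0.05):
    """Summarize latency and cost across runs and compare them with the budgets."""
    latencies = [r["latency_s"] for r in runs]
    costs = [r["cost_usd"] for r in runs]
    p95 = percentile(latencies, 95)
    mean_cost = sum(costs) / len(costs)
    return {
        "p95_latency_s": p95,
        "mean_cost_usd": mean_cost,
        "latency_ok": p95 < p95_budget_s,
        "cost_ok": mean_cost < cost_budget_usd,
    }

# Hypothetical run log; field names are assumptions for illustration.
runs = [
    {"latency_s": 2.0, "cost_usd": 0.03},
    {"latency_s": 3.0, "cost_usd": 0.04},
    {"latency_s": 4.0, "cost_usd": 0.02},
    {"latency_s": 9.0, "cost_usd": 0.06},
    {"latency_s": 5.0, "cost_usd": 0.05},
]
report = system_report(runs)
```

With only five runs, the nearest-rank p95 is simply the slowest run, so one outlier fails the latency budget. That is a reason to collect more runs before trusting a tail-latency number.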

Evaluation dataset

Source:
Size (number of cases, by stage): internal iteration 5-10; prototype 20-50; pilot 50-100; production 200+; more for regulated or high-stakes domains
Selection method: random sample, stratified by difficulty, adversarial, etc.
Labeler: who created the ground truth? what were their instructions?
Trace source: prototype traces, production traces, synthetic traces, support tickets, user research, etc.
Traces reviewed: count and date range
Human-labeled failures: count and who labeled them
Eval-human agreement target: e.g., judge agrees with human labels >= 85% before use

Stage: internal iteration / prototype validation / pilot readiness / production readiness
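One way to implement the "stratified by difficulty" selection method is to bucket candidate cases by a label and draw a fixed number from each bucket. This is a sketch only: the `difficulty` key, the per-stratum count, and the fixed seed are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(cases, key, per_stratum, seed=0):
    """Draw up to per_stratum cases from each stratum, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the eval set can be regenerated
    buckets = defaultdict(list)
    for case in cases:
        buckets[case[key]].append(case)
    sample = []
    for label in sorted(buckets):
        bucket = buckets[label]
        # Small strata (e.g., rare adversarial cases) are taken whole.
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample

# Hypothetical candidate pool.
cases = (
    [{"id": f"easy-{i}", "difficulty": "easy"} for i in range(10)]
    + [{"id": f"hard-{i}", "difficulty": "hard"} for i in range(3)]
    + [{"id": "adv-0", "difficulty": "adversarial"}]
)
eval_set = stratified_sample(cases, key="difficulty", per_stratum=3)
```

Recording the seed and the selection rule next to "Selection method" makes the dataset auditable later.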

Golden examples

Happy path

Input: typical, well-formed input

Expected output: what good looks like

Why this matters: what it demonstrates about the AI's core capability

Edge case

Input: unusual but valid input the AI must handle

Expected output: acceptable behavior

Why this matters: what breaks if this fails

Unacceptable output

Input: input that could produce a bad result

Unacceptable output: the output you're testing against

Why this is unacceptable: user impact, trust damage, compliance issue

Safety boundary

Input: input that tests a safety constraint

Expected behavior: refusal, escalation, or safe fallback

Why this matters: what goes wrong if the boundary isn't held

Robustness and consistency

Robustness: accuracy across input variations

Input variant	Example	Accuracy target
e.g., clean PDF
e.g., scanned image
e.g., CSV with different column order
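The robustness table can be filled in from graded results grouped by input variant. The sketch below assumes results arrive as hypothetical `(variant, passed)` pairs; the target values are placeholders, not suggested thresholds.

```python
from collections import defaultdict

def accuracy_by_variant(results):
    """Return the pass rate for each input variant."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for variant, passed in results:
        totals[variant] += 1
        correct[variant] += int(passed)
    return {variant: correct[variant] / totals[variant] for variant in totals}

def below_target(accuracy, targets):
    """List variants whose accuracy misses its target; untested variants count as misses."""
    return sorted(v for v, target in targets.items() if accuracy.get(v, 0.0) < target)

# Hypothetical graded results and targets.
results = [
    ("clean PDF", True), ("clean PDF", True),
    ("scanned image", True), ("scanned image", False),
]
targets = {"clean PDF": 0.95, "scanned image": 0.85}
accuracy = accuracy_by_variant(results)
misses = below_target(accuracy, targets)
```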

Consistency: same entity, same result

Entity type	Variations to test	Target consistency
e.g., merchant names	e.g., "SQ", "Singapore Airlines", "Singapore Air"
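Consistency can be scored as the share of variants that produce the most common output for an entity. The following is a rough sketch; `toy_normalizer` is a stand-in for the system under test and is purely hypothetical.

```python
from collections import Counter

def consistency_rate(outputs):
    """Share of outputs equal to the most common output (1.0 means fully consistent)."""
    if not outputs:
        raise ValueError("outputs must be non-empty")
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)

def consistency_report(system, variant_groups):
    """Run the system on every variant of each entity and score agreement."""
    return {
        entity: consistency_rate([system(v) for v in variants])
        for entity, variants in variant_groups.items()
    }

def toy_normalizer(raw_name):
    # Hypothetical system under test: a lookup that misses one variant.
    known = {"SQ": "Singapore Airlines", "Singapore Airlines": "Singapore Airlines"}
    return known.get(raw_name, "UNKNOWN")

variant_groups = {"Singapore Airlines": ["SQ", "Singapore Airlines", "Singapore Air"]}
report = consistency_report(toy_normalizer, variant_groups)
```

Note that this measures agreement among outputs, not correctness: a system that is consistently wrong still scores 1.0, so pair it with an accuracy check.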

Quality rubric

Criterion	Pass	Fail

Grading tiers

Tier	Method	When to use	Example
1	Code-based (deterministic)	Structured outputs, format checks, exact match	Schema validation, required fields present, no fabricated citations
2	LLM-as-judge	Subjective quality, tone, relevance, reasoning	Pass/fail judge for handoff failure, ignored intent, unsupported claim
3	Human review	Calibration, expert domains, disputed cases	Medical accuracy review, legal compliance check

Error analysis

Trace sample: e.g., 50 production traces, stratified by channel and user segment
Reviewer: PM or domain expert who reviewed the traces
Review date:

Error category	Example trace or input	Frequency in sample	Severity	Fix path	Automate?
e.g., human handoff failure			low/med/high	prompt/retrieval/tool/product/policy	yes/no/later
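Once a reviewer has labeled the failed traces, the "Frequency in sample" column is a simple tally. A minimal sketch, assuming one hypothetical category label per failed trace and a known sample size:

```python
from collections import Counter

def rank_categories(failure_labels, sample_size):
    """Return (category, count, share of sample) tuples, most frequent first."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    counts = Counter(failure_labels)
    return [(category, n, n / sample_size) for category, n in counts.most_common()]

# Hypothetical labels from a review of 50 traces.
failure_labels = [
    "human handoff failure",
    "unsupported claim",
    "human handoff failure",
    "human handoff failure",
]
ranking = rank_categories(failure_labels, sample_size=50)
```

Frequency alone should not set priority; a rare high-severity category can outrank a common low-severity one.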

Trace-derived eval cases

Trace or source	Failure observed	Eval case added	Human label	Owner

LLM judge calibration

Judge	Failure detected	Human-labeled sample size	Agreement	True positive rate	True negative rate	Approved for use?
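The Agreement, True positive rate, and True negative rate columns can be computed by comparing judge verdicts with human labels on the same sample. The sketch below assumes boolean labels where True means "failure present"; the 0.85 default mirrors the example agreement target given under Evaluation dataset and is not a universal threshold.

```python
def calibrate_judge(human_labels, judge_labels, min_agreement=0.85):
    """Compare judge verdicts against human labels (True = failure present)."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)
    tn = sum(1 for h, j in pairs if not h and not j)
    fp = sum(1 for h, j in pairs if not h and j)
    fn = sum(1 for h, j in pairs if h and not j)
    agreement = (tp + tn) / len(pairs)
    return {
        "agreement": agreement,
        # Rates are None when the sample has no positives or no negatives.
        "true_positive_rate": tp / (tp + fn) if tp + fn else None,
        "true_negative_rate": tn / (tn + fp) if tn + fp else None,
        "approved": agreement >= min_agreement,
    }

# Hypothetical 10-trace calibration sample.
human = [True, True, True, False, False, False, False, False, True, False]
judge = [True, True, False, False, False, False, True, False, True, False]
result = calibrate_judge(human, judge)
```

Reporting the two rates separately matters when failures are rare: a judge that always says "pass" can reach high agreement while catching nothing.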

Non-deterministic eval strategy

Runs per eval: e.g., 3-5 runs per test case, aggregate scores
Aggregation method: e.g., median score, majority vote on pass/fail
Variance threshold: e.g., if pass/fail disagrees across runs, flag for human review
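The three settings above can be combined into one aggregation step per test case. This is a sketch under stated assumptions: a strict majority is required to pass, ties count as a fail, and any disagreement across runs flags the case for human review.

```python
from statistics import median

def aggregate_case(run_passed, run_scores=None):
    """Aggregate repeated runs of one test case."""
    if not run_passed:
        raise ValueError("at least one run is required")
    passes = sum(run_passed)  # True counts as 1
    result = {
        # Strict majority vote; a tie is treated as a fail (an assumption).
        "passed": passes * 2 > len(run_passed),
        # Flag the case when runs disagree on pass/fail.
        "needs_human_review": 0 < passes < len(run_passed),
    }
    if run_scores is not None:
        result["median_score"] = median(run_scores)
    return result

# Hypothetical case: three runs, one of which failed.
summary = aggregate_case([True, True, False], run_scores=[4, 5, 2])
```

Using an odd number of runs avoids ties altogether.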

Automated checks

Output schema validation
Required fields present
No fabricated references or citations
Add task-specific checks
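The first three checks are deterministic and can run on every output. The following sketch assumes a hypothetical output shape with `answer` and `citations` fields, and treats any citation that was not among the retrieved document IDs as fabricated.

```python
# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}

def run_checks(output, retrieved_ids):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in output:
            failures.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            failures.append(f"wrong type for {field}")
    citations = output.get("citations")
    if isinstance(citations, list):
        for citation in citations:
            # A citation is only valid if it points at a document that was retrieved.
            if citation not in retrieved_ids:
                failures.append(f"fabricated citation: {citation}")
    return failures

# Hypothetical model output and retrieval result.
output = {"answer": "Refunds take 5 days.", "citations": ["doc-1", "doc-9"]}
failures = run_checks(output, retrieved_ids={"doc-1", "doc-2"})
```

Returning messages rather than a single boolean keeps the failures usable as input to error analysis.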

Regression plan

Regression suite size:
Run frequency: on every deploy, daily, weekly
Alert if: specific threshold, e.g., "accuracy drops > 2% vs. baseline"
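The example alert rule can be expressed as a small comparison against a stored baseline. This sketch reads "drops > 2%" as an absolute drop of more than 2 percentage points, which is an assumption; a relative drop would need a different formula.

```python
def regression_alert(baseline_accuracy, current_accuracy, max_drop=0.02):
    """Return True when accuracy fell by more than max_drop (absolute) vs. baseline."""
    # Round to avoid floating-point noise turning an exact 2-point drop into an alert.
    drop = round(baseline_accuracy - current_accuracy, 6)
    return drop > max_drop

# Hypothetical baseline and latest regression run.
should_alert = regression_alert(baseline_accuracy=0.90, current_accuracy=0.87)
```

On a small regression suite, a single flipped case can exceed the threshold, so the threshold should be set with the suite size in mind.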

Online metrics

Metric	Definition	Target	Alert threshold

Launch threshold

Data flywheel

Failure capture: e.g., user rejections, flagged outputs, support escalations logged
Triage cadence: e.g., weekly review of new failures
Eval set update: e.g., 5-10 new cases added per month from production

Review cadence

Pre-launch: e.g., on every model or prompt change
Post-launch: e.g., weekly for first month, then bi-weekly
