PMAI PM Playbook

AI eval plan

Core template

Use this to define what "good" means before you build. If you can't write this, you're not ready to build.

Upstream: the task definition and quality bar come from the AI PRD. Downstream: eval results feed into the launch gate checklist and the weekly post-launch review.

Task definition

Eval scope

Level | What you measure | Example
Node | Individual step or tool call accuracy | Did the retrieval return relevant docs? Did the classifier pick the right category?
Session | End-to-end task completion | Did the agent complete the full workflow? Did it recover from a failed step?
System | Latency, cost, token efficiency across runs | p95 latency under 8s, cost per task under $0.05
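
A minimal sketch of how the example system-level targets (p95 latency under 8s, cost per task under $0.05) could be checked. The function names are hypothetical, the nearest-rank percentile is one of several valid definitions, and reading "cost per task" as the mean cost across runs is an assumption.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def system_check(latencies_s, costs_usd, max_p95_s=8.0, max_cost_usd=0.05):
    """Compare observed runs with the example system-level targets."""
    latency = p95(latencies_s)
    mean_cost = sum(costs_usd) / len(costs_usd)  # assumption: mean cost per task
    return {
        "p95_latency_s": latency,
        "mean_cost_usd": mean_cost,
        "latency_ok": latency < max_p95_s,
        "cost_ok": mean_cost < max_cost_usd,
    }
```

Feed it one latency and one cost value per eval run; replace the defaults with the targets you set in the table.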

Evaluation dataset

  • Source:
  • Size: internal iteration: 5-10; prototype: 20-50; pilot: 50-100; production: 200+; larger for regulated/high-stakes domains
  • Selection method: random sample, stratified by difficulty, adversarial, etc.
  • Labeler: who created the ground truth? what were their instructions?
  • Trace source: prototype traces, production traces, synthetic traces, support tickets, user research, etc.
  • Traces reviewed: count and date range
  • Human-labeled failures: count and who labeled them
  • Eval-human agreement target: e.g., judge agrees with human labels >= 85% before use
  • Stage: internal iteration / prototype validation / pilot readiness / production readiness
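
One way to implement the "stratified by difficulty" selection method, as a sketch. It assumes each candidate case is a dict with a string stratum label; the `difficulty` key and the function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(cases, key, per_stratum, seed=0):
    """Draw up to `per_stratum` cases from each stratum, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the eval set can be rebuilt
    groups = defaultdict(list)
    for case in cases:
        groups[case[key]].append(case)
    sample = []
    for name in sorted(groups):  # sorted for a stable order across runs
        members = groups[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

Example: `stratified_sample(cases, key="difficulty", per_stratum=20)` takes at most 20 cases per difficulty level, so rare hard cases are not drowned out by easy ones.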

Golden examples

Happy path

Input: typical, well-formed input

Expected output: what good looks like

Why this matters: what it demonstrates about the AI's core capability

Edge case

Input: unusual but valid input the AI must handle

Expected output: acceptable behavior

Why this matters: what breaks if this fails

Unacceptable output

Input: input that could produce a bad result

Unacceptable output: the output you're testing against

Why this is unacceptable: user impact, trust damage, compliance issue

Safety boundary

Input: input that tests a safety constraint

Expected behavior: refusal, escalation, or safe fallback

Why this matters: what goes wrong if the boundary isn't held

Robustness and consistency

Robustness: accuracy across input variations

Input variant | Example | Accuracy target
e.g., clean PDF | |
e.g., scanned image | |
e.g., CSV with different column order | |
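
A sketch of the robustness check: compute accuracy per input variant and list the variants that miss their target. It assumes eval results arrive as (variant, passed) pairs; the function names are hypothetical, and the targets in the usage note are placeholders, not recommendations.

```python
from collections import defaultdict


def accuracy_by_variant(results):
    """results: iterable of (variant, passed) pairs -> {variant: accuracy}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for variant, passed in results:
        totals[variant] += 1
        passes[variant] += int(passed)
    return {variant: passes[variant] / totals[variant] for variant in totals}


def variants_below_target(results, targets):
    """Return variants whose accuracy is under target (untested counts as 0)."""
    accuracy = accuracy_by_variant(results)
    return sorted(v for v, target in targets.items() if accuracy.get(v, 0.0) < target)
```

Example: `variants_below_target(results, {"clean PDF": 0.9, "scanned image": 0.8})` returns the variants that need work before launch.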

Consistency: same entity, same result

Entity type | Variations to test | Target consistency
e.g., merchant names | e.g., "SQ", "Singapore Airlines", "Singapore Air" |
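
A sketch of one possible consistency metric: the share of variations that resolve to the most common output. `resolve` stands in for the system under test and is hypothetical, as is the function name.

```python
from collections import Counter


def consistency_rate(variants, resolve):
    """Share of variants whose output matches the most common output."""
    outputs = [resolve(variant) for variant in variants]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


# Toy stand-in for the real system: it knows only two spellings.
KNOWN = {"SQ": "Singapore Airlines", "Singapore Airlines": "Singapore Airlines"}


def toy_resolve(name):
    return KNOWN.get(name, name)  # unknown names pass through unchanged


rate = consistency_rate(["SQ", "Singapore Airlines", "Singapore Air"], toy_resolve)
```

Here the toy resolver maps two of the three variations to the same entity, so `rate` is 2/3. This measures agreement only; whether the shared output is correct belongs in the quality rubric.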

Quality rubric

Criterion | Pass | Fail

Grading tiers

Tier | Method | When to use | Example
1 | Code-based (deterministic) | Structured outputs, format checks, exact match | Schema validation, required fields present, no fabricated citations
2 | LLM-as-judge | Subjective quality, tone, relevance, reasoning | Pass/fail judge for handoff failure, ignored intent, unsupported claim
3 | Human review | Calibration, expert domains, disputed cases | Medical accuracy review, legal compliance check

Error analysis

  • Trace sample: e.g., 50 production traces, stratified by channel and user segment
  • Reviewer: PM or domain expert who reviewed the traces
  • Review date:
Error category | Example trace or input | Frequency in sample | Severity | Fix path | Automate?
e.g., human handoff failure | | | low/med/high | prompt/retrieval/tool/product/policy | yes/no/later
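
A sketch for filling in the "Frequency in sample" column. It assumes the reviewer assigned one error-category label per trace, with `None` for traces that showed no failure; the function name is hypothetical.

```python
from collections import Counter


def error_frequencies(labels):
    """labels: one category per reviewed trace, None when no error was found.

    Returns (category, count, share_of_sample) rows, most frequent first.
    """
    total = len(labels)
    counts = Counter(label for label in labels if label is not None)
    return [(category, n, n / total) for category, n in counts.most_common()]
```

Sorting by frequency puts the categories most worth fixing or automating at the top; weigh them against severity before choosing a fix path.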

Trace-derived eval cases

Trace or source | Failure observed | Eval case added | Human label | Owner

LLM judge calibration

Judge | Failure detected | Human-labeled sample size | Agreement | True positive rate | True negative rate | Approved for use?
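
A sketch of how the agreement, true positive rate, and true negative rate columns could be computed from a human-labeled sample. It assumes both label lists are booleans where True means "failure present"; the function name is hypothetical.

```python
def judge_calibration(human, judge):
    """Compare judge verdicts with human labels on the same traces."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    tp = sum(h and j for h, j in zip(human, judge))          # both say failure
    tn = sum(not h and not j for h, j in zip(human, judge))  # both say no failure
    positives = sum(human)
    negatives = len(human) - positives
    return {
        "agreement": (tp + tn) / len(human),
        # A rate is undefined when the sample has no cases of that class.
        "true_positive_rate": tp / positives if positives else None,
        "true_negative_rate": tn / negatives if negatives else None,
    }
```

Agreement alone can hide a judge that misses rare failures, which is why the table tracks the two rates separately. Compare `agreement` with your eval-human agreement target (e.g., >= 0.85) before approving the judge.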

Non-deterministic eval strategy

  • Runs per eval: e.g., 3-5 runs per test case, aggregate scores
  • Aggregation method: e.g., median score, majority vote on pass/fail
  • Variance threshold: e.g., if pass/fail disagrees across runs, flag for human review
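
A sketch of the aggregation options above: majority vote on pass/fail with a review flag when runs disagree, and median for numeric scores. Function names are hypothetical; treating a tie as a fail is an assumption.

```python
from statistics import median


def aggregate_pass_fail(passes):
    """passes: booleans from repeated runs of one test case."""
    if not passes:
        raise ValueError("need at least one run")
    yes = sum(passes)
    return {
        "verdict": yes * 2 > len(passes),              # strict majority; tie = fail
        "needs_human_review": 0 < yes < len(passes),   # runs disagreed
    }


def aggregate_scores(scores):
    """Median is less sensitive to a single outlier run than the mean."""
    return median(scores)
```

Example: `aggregate_pass_fail([True, True, False])` passes by majority but is flagged for human review because the runs disagree.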

Automated checks

  • Output schema validation
  • Required fields present
  • No fabricated references or citations
  • Add task-specific checks
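
A sketch of the first three checks using only the standard library. The schema (`answer`, `citations`) and the function name are hypothetical. The citation check only confirms that each cited ID exists in a set of known sources; it does not verify that the source supports the claim.

```python
# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}


def run_automated_checks(output, known_source_ids):
    """Return a list of failure messages; an empty list means all checks pass."""
    failures = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in output:
            failures.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            failures.append(f"wrong type: {field}")
    citations = output.get("citations")
    if isinstance(citations, list):
        for cited in citations:
            if cited not in known_source_ids:  # possible fabricated reference
                failures.append(f"unknown citation: {cited}")
    return failures
```

Returning messages instead of a single boolean makes it easier to tally failures by type in error analysis.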

Regression plan

  • Regression suite size:
  • Run frequency: on every deploy, daily, weekly
  • Alert if: specific threshold, e.g., "accuracy drops > 2% vs. baseline"
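
A sketch of the example alert rule. It assumes "drops > 2%" means percentage points of accuracy on the same regression suite (not a relative change) and works from pass counts; the function name is hypothetical.

```python
def regression_alert(baseline_passed, current_passed, suite_size, max_drop=0.02):
    """True when accuracy fell by more than `max_drop` versus the baseline."""
    if suite_size <= 0:
        raise ValueError("suite_size must be positive")
    # Computing the drop from pass counts keeps the boundary case exact.
    drop = (baseline_passed - current_passed) / suite_size
    return drop > max_drop
```

On a 200-case suite, going from 180 to 176 passes is a drop of exactly 2 points and does not alert; 180 to 175 does.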

Online metrics

Metric | Definition | Target | Alert threshold

Launch threshold

Data flywheel

  • Failure capture: e.g., user rejections, flagged outputs, support escalations logged
  • Triage cadence: e.g., weekly review of new failures
  • Eval set update: e.g., 5-10 new cases added per month from production

Review cadence

  • Pre-launch: e.g., on every model or prompt change
  • Post-launch: e.g., weekly for first month, then bi-weekly