AI PM Playbook

The failure this prevents: teams build evals at launch, run them once, and never update them. Six months later, the model has drifted, the eval set is stale, and nobody knows whether quality is improving or degrading. This guide is structured around making evals continuous, not ceremonial.

Many generative AI pilots fail to deliver measurable business impact. The most common reason is not bad models or weak prompts. It is missing evals. Teams ship AI features with no definition of "good," no way to detect regression, and no baseline to improve against.

Evals are the single most important artifact an AI PM owns. More important than the PRD. More important than the prompt. If you only do one thing well, make it evals.

What evals actually are

An eval is a repeatable test that measures whether your AI system produces acceptable output for a known input. That is it. Not a demo. Not a vibe check. Not "the CEO tried it and liked it."

An eval has three parts

A set of test inputs (the golden set)
Expected outputs or quality criteria for each input
A scoring method that produces a number

If you cannot describe all three, you do not have evals. You have opinions.
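A minimal sketch of the three parts in Python. Everything here is illustrative: `run_system` is a hypothetical stand-in for your AI feature, the two inputs are made up, and a substring check is only one of many possible quality criteria.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    input_text: str      # part 1: a test input
    must_contain: str    # part 2: the quality criterion for this input


def run_system(input_text: str) -> str:
    """Hypothetical stand-in for the AI feature under test."""
    canned = {
        "How do I reset my password?": "Go to Settings and choose Reset password.",
        "Can I get a refund?": "Please contact support.",
    }
    return canned.get(input_text, "")


def pass_rate(golden_set: list[GoldenExample], system: Callable[[str], str]) -> float:
    """Part 3: a scoring method that produces a number (fraction of examples passing)."""
    passed = sum(
        ex.must_contain.lower() in system(ex.input_text).lower()
        for ex in golden_set
    )
    return passed / len(golden_set)


golden_set = [
    GoldenExample("How do I reset my password?", "reset password"),
    GoldenExample("Can I get a refund?", "refund policy"),
]

score = pass_rate(golden_set, run_system)
print(f"pass rate: {score:.0%}")
```

The point is the shape, not the check itself: known inputs, a stated criterion per input, and one number you can compare across runs.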

Four eval types

Most AI products need four kinds of evals:

Eval type	What it answers	Example
Code-based eval	Did the system satisfy a deterministic rule?	Output is valid JSON, required fields are present, SMS output contains no markdown
Human eval	Would a PM, domain expert, or trained reviewer approve this output?	Support lead labels whether a draft follows policy and sounds on-brand
LLM-as-judge eval	Can a calibrated model label outputs at scale?	Judge flags ignored intent, unsupported claims, or missing human handoff
User eval	How did real users react in the product?	Accept rate, retry rate, thumbs down, complaints, escalation, conversion impact

These are complementary. Code-based checks are cheap and precise. Human evals define the standard. LLM judges scale the standard after calibration. User evals show whether the product works in the real world.
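As an illustration of the code-based row in the table, here is a sketch of two deterministic checks. The required field names are hypothetical, and the markdown pattern is a rough heuristic rather than a full markdown parser.

```python
import json
import re

REQUIRED_FIELDS = {"intent", "reply"}  # hypothetical schema for this example

# Heuristic: bold/underline markers, backticks, heading lines, and inline links.
MARKDOWN_PATTERN = re.compile(
    r"(\*\*|__|`|^#{1,6}\s|\[[^\]]+\]\([^)]+\))", re.MULTILINE
)


def check_json_output(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    missing = REQUIRED_FIELDS - set(data)
    return [f"missing field: {name}" for name in sorted(missing)]


def check_sms_output(text: str) -> list[str]:
    """Flag common markdown syntax that would show up as clutter in an SMS."""
    return ["contains markdown"] if MARKDOWN_PATTERN.search(text) else []
```

Checks like these cost almost nothing to run, so they can run on every output, not just on the golden set.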

User evals are signals, not ground truth. A thumbs down might mean the AI was wrong, or it might mean the AI gave a correct answer the user disliked. If eval scores look good but user metrics decline, inspect traces and user feedback before deciding which signal is right.

The golden eval set

Your golden set is a curated collection of input-output pairs that represent the full range of what your AI feature handles. This includes easy cases, hard cases, edge cases, and adversarial cases.

Start small, then grow as the product gets closer to real users. Early eval sets are for fast iteration; later eval sets are for confidence.

Stage	Useful size	What it is for
Internal prompt iteration	5-10 examples	Find obvious failures quickly without slowing every prompt change
Prototype validation	20-50 examples	Check the main use cases, edge cases, and failure modes before showing users
Pilot readiness	50-100 examples	Build confidence across real usage patterns and user segments
Production readiness	200+ examples	Support launch decisions, regression testing, and stakeholder trust

High-stakes or regulated domains need larger, more carefully labeled sets. The tradeoff is speed versus confidence: fewer examples let the team iterate quickly, but more examples are needed before the result can support a launch decision.

Where golden examples come from:

Real user inputs from production or user research
Known failure cases from QA or customer support
Adversarial inputs designed to expose weaknesses
Edge cases identified by domain experts
Inputs that previous model versions handled differently

Each example needs a human-verified expected output. This is tedious. Do it anyway. The expected output is the ground truth your entire quality system depends on.

The golden set is a living document. As you change models, update prompts, or modify retrieval pipelines, the golden set is what catches regression. Add new examples whenever you find a failure in production. Remove examples that no longer represent real usage. A stale golden set creates false confidence.

Beyond accuracy: robustness and consistency

Accuracy on a golden set is necessary but misleading on its own. Two more metrics matter:

Robustness is accuracy across different input formats, layouts, and conditions. If your product handles PDF uploads, measure accuracy on PDFs, CSVs, images, and different column layouts separately. A system that scores 95% on clean PDFs but 40% on scanned images is not 95% accurate in production. Weight by the actual distribution of input formats your users send.
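The weighting step can be sketched as follows. The per-format accuracies echo the example above, but the traffic shares and the CSV figure are invented for illustration.

```python
# Hypothetical results: format -> (accuracy on that format, share of production traffic)
results = {
    "clean_pdf": (0.95, 0.60),
    "scanned_image": (0.40, 0.30),
    "csv": (0.90, 0.10),
}


def weighted_accuracy(per_format: dict[str, tuple[float, float]]) -> float:
    """Accuracy weighted by how often each input format appears in production."""
    total_share = sum(share for _, share in per_format.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(acc * share for acc, share in per_format.values())


overall = weighted_accuracy(results)
print(f"weighted accuracy: {overall:.1%}")
```

With these made-up shares the weighted figure is 78%, well below the 95% headline number on clean PDFs.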

Consistency is how reliably the system handles the same entity across variations. If your product categorizes merchants in financial data, test whether it correctly identifies "SQ", "Singapore Airlines", and "Singapore Air" as the same airline across multiple runs. Inconsistent entity resolution erodes user trust faster than occasional inaccuracy.

A PM pitfall from practice: a credit card analyzer kept categorizing "Singapore" (a merchant code for Singapore Airlines) as a government payment. The system was 90% accurate overall, but it handled merchant name variations inconsistently, which made the product feel broken.
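One way to put a number on consistency, sketched under assumptions: run the categorizer several times on each name variant of one entity, then measure how often the outputs agree with the most common label. The labels below are invented, and this measures agreement only; a system can be consistently wrong, so track it alongside accuracy.

```python
from collections import Counter


def consistency_rate(labels: list[str]) -> float:
    """Share of outputs that agree with the most common label for one entity."""
    if not labels:
        raise ValueError("need at least one label")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical categorizer outputs for variants of one merchant, three runs each.
runs = {
    "SQ": ["airline", "airline", "airline"],
    "Singapore Airlines": ["airline", "airline", "airline"],
    "Singapore": ["airline", "government", "government"],
}

all_labels = [label for outputs in runs.values() for label in outputs]
entity_consistency = consistency_rate(all_labels)
print(f"consistency: {entity_consistency:.0%}")
```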

Quality bars depend on domain

An accuracy score means different things in different products. A travel photo recognition app that identifies landmarks at 80% accuracy is useful and fun. A fintech product that categorizes transactions at 70% accuracy is unusable.

Before setting your quality bar, answer: what happens when the AI is wrong? If the user loses a few seconds correcting a suggestion, lower accuracy is acceptable. If the user makes a financial decision based on wrong data, you need near-perfect accuracy on the dimensions that matter.

Set different accuracy targets for different dimensions of the same product. A support copilot might tolerate 80% on tone matching but require 99% on policy accuracy.

Quality rubrics

Not every AI output is pass/fail. Most require judgment across multiple dimensions. A quality rubric defines those dimensions and what each score means.

For a customer support response, your rubric might include:

Accuracy: does the response contain correct information? (1-5)
Completeness: does it address all parts of the question? (1-5)
Tone: does it match the brand voice? (1-3)
Safety: does it avoid making promises or sharing restricted info? (pass/fail)

The rubric should reflect what users actually care about. Talk to users. Look at complaint patterns. If users never mention tone but frequently complain about missing information, weight completeness higher than tone.

Hard rule: safety dimensions should be pass/fail, not scored. A response that leaks PII should fail regardless of how accurate it is.
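A sketch of how the example rubric and the hard rule could combine in code. The weights are hypothetical, and folding the scored dimensions into one number is a convenience for dashboards; keep the per-dimension scores as well.

```python
from dataclasses import dataclass

# Hypothetical weights: completeness counts for more than tone in this example.
WEIGHTS = {"accuracy": 0.4, "completeness": 0.4, "tone": 0.2}
SCALE_MAX = {"accuracy": 5, "completeness": 5, "tone": 3}


@dataclass
class Grade:
    accuracy: int       # 1-5
    completeness: int   # 1-5
    tone: int           # 1-3
    safety_pass: bool   # pass/fail, never averaged into the score


def rubric_score(grade: Grade) -> float:
    """Return a 0-1 quality score, or 0.0 if the safety gate fails."""
    if not grade.safety_pass:
        return 0.0  # a safety failure overrides every other dimension
    return sum(
        weight * getattr(grade, dim) / SCALE_MAX[dim]
        for dim, weight in WEIGHTS.items()
    )
```

A response graded `Grade(5, 5, 3, safety_pass=False)` scores 0.0 here, no matter how strong the other dimensions are.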

Automated vs human grading

Human grading is the gold standard but does not scale. Automated grading scales but misses nuance. You need both.

Use automated grading for

Factual accuracy against known answers
Format compliance (JSON structure, required fields)
Safety checks (PII detection, blocked content categories)
Regression testing on every prompt or model change

Use human grading for

Tone and style assessment
Complex correctness that requires domain knowledge
Calibrating automated graders
Evaluating new categories of output

The ratio shifts over time. Early in development, expect roughly 80% of grading to be human. At scale, aim for roughly 80% automated grading with human spot-checks.

Trace-first eval design

Start eval design from traces whenever possible. If you invent evals before seeing real model behavior, you may test the wrong failures.

Use this loop

Review prototype or production traces.
Label where the output, tool call, retrieval step, handoff, or action failed.
Group failures into product-specific categories.
Write evals for the categories that are frequent, severe, or strategically important.
Add representative traces to the golden set.

Trace review helps answer the question that generic evals miss: did the system fail because the model was wrong, retrieval was weak, the tool call failed, the workflow lacked human review, or the eval itself misunderstood what good looks like?

For agentic products, evaluate both the final outcome and the trajectory. A correct final answer can still be unsafe if the agent used the wrong tool, accessed the wrong data, skipped required approval, or retried until cost spiked.

It is acceptable to start with an LLM-generated "vibe eval" as a draft. Do not trust it as a launch gate until a human has reviewed examples where it passes and fails. A useful eval should produce a healthy mix of passes and failures. If everything passes, it is probably too easy. If everything fails, it may be misaligned.

LLM-as-judge

Using one LLM to evaluate another LLM's output is now standard practice. It works surprisingly well for subjective quality dimensions like helpfulness and clarity. It works poorly for factual accuracy unless you give the judge model access to ground truth.

Tradeoffs PMs should understand:

LLM judges tend to favor longer, more verbose outputs. A concise correct answer often scores lower than a wordy one. You can mitigate this with explicit rubric instructions.
LLM judges show position bias. If you present two outputs for comparison, the judge tends to favor one position, often the first. Always randomize order.
LLM judges are inconsistent across runs. The same input can get different scores. Run each judgment 3-5 times and average, or use majority voting.
Cost adds up. If your eval set is 200 examples and you run each through a judge model 3 times, that is 600 LLM calls per eval run.

LLM-as-judge is good enough for development iteration. It is not good enough as your only quality gate before production launch. Pair it with human review of a sample.

For many product decisions, prefer binary judge outputs over 1-5 scores. "Did the assistant fabricate a policy claim?" and "Should this have escalated to a human?" are easier to calibrate as pass/fail than as subjective rating scales. Binary outputs also map cleanly to product decisions: block, ship, escalate, or fix.
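Two of the mitigations above, order randomization and majority voting, can be sketched together. The judge here is any function that takes two outputs and returns "A" or "B"; `longer_wins` is a hypothetical stand-in for a real LLM call, used only so the example runs without a model.

```python
import random
from collections import Counter
from typing import Callable, Optional

# A judge takes (output_a, output_b) and returns "A" or "B".
PairwiseJudge = Callable[[str, str], str]


def compare_with_judge(
    first: str,
    second: str,
    judge: PairwiseJudge,
    runs: int = 5,
    rng: Optional[random.Random] = None,
) -> str:
    """Return "first" or "second" by majority vote over order-randomized runs."""
    rng = rng or random.Random()
    votes: Counter = Counter()
    for _ in range(runs):
        swapped = rng.random() < 0.5  # randomize which output is shown as "A"
        a, b = (second, first) if swapped else (first, second)
        verdict = judge(a, b)
        # Map the positional verdict back to the original outputs.
        picked_first = (verdict == "A") != swapped
        votes["first" if picked_first else "second"] += 1
    return votes.most_common(1)[0][0]


def longer_wins(a: str, b: str) -> str:
    """Hypothetical stand-in for an LLM judge; it simply prefers the longer text."""
    return "A" if len(a) >= len(b) else "B"
```

Use an odd number of runs so a two-way vote cannot tie. The same voting pattern applies to binary pass/fail judges: collect the verdicts and take the majority.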

Calibrating LLM judges

Do not trust an LLM judge just because it returns a score. Calibrate it against human labels.

Start with a small labeled set from error analysis:

Review production traces manually.
Label whether a specific failure exists in each trace.
Write an LLM judge prompt for that failure.
Run the judge on the same traces.
Compare judge output to the human labels.

Track three numbers

Agreement: how often the judge matches the human label overall.
True positive rate: when the failure is present, how often the judge catches it.
True negative rate: when the failure is absent, how often the judge correctly ignores it.

Agreement alone is a trap. If only 5% of traces have a handoff failure, a judge that always says "no failure" will look 95% accurate while catching nothing useful. Look at positives and negatives separately.
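The three numbers, and the trap, can be reproduced in a few lines. This sketch assumes boolean labels where True means the failure is present; the 100-trace dataset is synthetic and mirrors the 5% example above.

```python
def judge_calibration(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge labels to human labels; True means 'failure present'."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)          # failure present, judge caught it
    tn = sum(not h and not j for h, j in pairs)  # failure absent, judge stayed quiet
    positives = sum(human)
    negatives = len(human) - positives
    return {
        "agreement": (tp + tn) / len(human),
        "true_positive_rate": tp / positives if positives else float("nan"),
        "true_negative_rate": tn / negatives if negatives else float("nan"),
    }


# 100 traces, 5 with a real handoff failure, and a judge that always says "no failure".
human_labels = [True] * 5 + [False] * 95
always_no = [False] * 100
report = judge_calibration(human_labels, always_no)
print(report)
```

For this judge, agreement is 0.95 and the true negative rate is 1.0, but the true positive rate is 0.0: it never catches the failure it was built to find.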

Once a judge is calibrated, use it to scan a larger sample of traces and monitor that failure category over time. Recalibrate when the prompt, model, workflow, or failure definition changes.

Regression testing

Every prompt change, model upgrade, or RAG pipeline modification should trigger an eval run against your golden set. No exceptions.

This is regression testing. You are checking whether the change improved the target metric without degrading others. A prompt edit that improves accuracy by 5% but degrades safety compliance by 2% is a net negative.

Set a regression threshold for each dimension. If any dimension drops more than X% from baseline, the change does not ship until reviewed. The value of X depends on the dimension: safety might be 0% (any regression blocks), while tone might be 5%.
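A per-dimension threshold check might look like the sketch below. The thresholds and scores are hypothetical, and the sketch assumes drops are measured in percentage points against the baseline run.

```python
# Hypothetical maximum allowed drop per dimension, in percentage points.
MAX_DROP = {"safety": 0.0, "accuracy": 2.0, "tone": 5.0}


def regressions(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Return the dimensions whose score dropped more than the allowed amount."""
    blocked = []
    for dim, allowed in MAX_DROP.items():
        drop = baseline[dim] - candidate[dim]  # positive means the score got worse
        if drop > allowed:
            blocked.append(dim)
    return blocked


baseline = {"safety": 100.0, "accuracy": 88.0, "tone": 90.0}
candidate = {"safety": 98.0, "accuracy": 93.0, "tone": 87.0}

blocked = regressions(baseline, candidate)
print(blocked or "no blocking regressions")
```

Here the candidate gains 5 points of accuracy and loses 3 points of tone, which is within tolerance, but the 2-point safety drop blocks the change on its own.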

When to run evals

Before launch: full eval suite against golden set, human review of a sample
After every prompt change: automated regression run
After model upgrades: full eval suite (model behavior changes between versions, sometimes dramatically)
Weekly in production: sample production outputs and grade them to detect drift
After incidents: add the failure case to the golden set and re-run
When user metrics disagree with eval results: inspect traces and feedback to understand whether the eval is missing a failure mode or users are reacting to a correct but disappointing outcome

Tools PMs should know about

You do not need to code evals yourself, but you need to understand what tools your team is using and what they measure.

promptfoo is the most PM-accessible eval tool. It uses YAML config files to define test cases and assertions. You can read and edit the configs without writing code. It supports LLM-as-judge, regex matching, and custom scoring functions. Open source.

deepeval provides pre-built evaluation metrics for common use cases: faithfulness, answer relevancy, hallucination detection, toxicity. Useful when you want standard metrics without building custom rubrics from scratch.

ragas focuses on RAG evaluation specifically. If your product retrieves documents and generates answers from them, ragas measures retrieval quality separately from generation quality. This distinction matters because a bad answer might be caused by retrieving the wrong documents, not by the model generating poorly.

How to start

If you have zero evals today

Pick your highest-risk AI feature
Review 30-50 real or realistic traces and write down the failures you see
Group the failures into product-specific error categories
Collect 20 real inputs that represent the full range of usage
Write expected outputs for each one (or define pass/fail criteria)
Run your current system against those 20 inputs
Grade the outputs manually
Record the scores. This is your baseline
Run the same eval after every change

This takes a PM about two days. It will save you months of shipping broken things without knowing it.

Common eval mistakes

Testing only the happy path: your golden set is full of well-formed, reasonable inputs. Real users send typos, ambiguous questions, multi-part requests, and inputs in unexpected languages. Include messy inputs.

Treating eval as a one-time activity: running evals once before launch and never again. Eval is continuous. Models change, user behavior changes, your data changes. An eval set that was comprehensive six months ago has gaps today.

Optimizing for the eval set: if you keep tweaking the prompt to score higher on the same 20 examples, you are overfitting. Periodically add new examples from production that the team has not seen before.

Ignoring disagreement between graders: if two human graders score the same output differently, your rubric is ambiguous. Fix the rubric, do not just average the scores. Disagreement is a signal that your quality definition is unclear.

Scoring everything on the same scale: accuracy and safety should not be averaged into a single number. A response that scores 5/5 on accuracy and 1/5 on safety is not a 3/5 response. It is a safety failure. Report dimensions separately and set hard thresholds on critical dimensions.

No baseline measurement: teams add evals after months of development and have no idea whether quality is improving or degrading. Measure your baseline on day one, even if the numbers are embarrassing. The baseline is what makes progress visible.

Skipping error analysis: teams pick generic metrics before reading real traces. Review traces first, label the failures, then write evals for the errors that actually matter.

Treating user feedback as ground truth: user ratings are valuable, but they are not labels. A rejected answer may be correct, and an accepted answer may be wrong. Use user feedback to trigger investigation, then inspect the trace and compare against the rubric.

Next steps

Use the Eval Plan template to define your golden set, scoring method, and regression cadence.
Use the Error Analysis guide before automating evals, especially when you do not yet know the main failure modes.
Once evals are running, set up production monitoring using the Operating AI Products guide to detect drift between eval runs.

Eval design for PMs