AI PM Playbook

Summary

Product: Customer Support Copilot
Stage: Prototype (pre-pilot)
Assessment date: 2026-02-10
Assessor: Maya Patel, PM
Overall weighted score: 2.86 / 5.0
Recommendation: Pilot candidate after blockers

Readiness assessment

1. Problem fit

Score: 4 / 5

Evidence

4-week time study across 12 agents and 2,400 tickets confirms 38% of handle time spent on drafting
Top 50 intents cover 62% of volume; top 10 cover 31%
Template-based approach tried and failed (adoption dropped from 22% to 11%)
Agent satisfaction surveys consistently cite repetitive drafting as a top frustration

Risks

The 38% figure is an average. Agents who already draft quickly will see less benefit, so the impact will be uneven across the team.
Less useful for complex, low-volume tickets outside the top 50 intents

Next action: segment expected impact by intent tier to set realistic pilot expectations

Owner: PM
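
One way to start the segmentation is to scale the expected reduction by each tier's share of ticket volume. The sketch below uses figures from this assessment (top 10 intents cover 31% of volume, top 50 cover 62%, roughly 20% handle time reduction on supported intents). It assumes, without evidence yet, that average handle time is the same across tiers and that every supported ticket receives a usable draft. The function name is illustrative.

```python
def blended_handle_time_reduction(volume_share: float, reduction_on_supported: float) -> float:
    """Team-wide handle time reduction when only `volume_share` of tickets are supported.

    Assumes equal average handle time for supported and unsupported tickets.
    """
    if not 0.0 <= volume_share <= 1.0:
        raise ValueError("volume_share must be between 0 and 1")
    return volume_share * reduction_on_supported


# Figures from this assessment; the 20% reduction is an estimate, not a measurement.
REDUCTION_ON_SUPPORTED = 0.20
TIERS = {"top 10 intents (v1)": 0.31, "top 50 intents": 0.62}

for tier, share in TIERS.items():
    blended = blended_handle_time_reduction(share, REDUCTION_ON_SUPPORTED)
    print(f"{tier}: {blended:.1%} of total handle time")
```

Under these assumptions the v1 scope moves total handle time by about 6%, which is a more realistic pilot expectation than the 20% figure for supported intents alone.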

2. Workflow fit

Score: 4 / 5

Evidence

Clear human-in-the-loop model: AI drafts, agent reviews, agent sends
Workflow integrates into the existing support platform (draft appears in the reply panel)
Agent can accept, edit, reject, or escalate at every step
No change to ticket routing, categorization, or escalation paths

Risks

If drafts are slow (>4s), agents may get into the habit of ignoring them and drafting manually, which makes the feature effectively invisible
Agents on high-pressure shifts may rubber-stamp drafts without reading them carefully

Next action: add a "draft is loading" indicator and measure time-to-first-interaction with the draft

Owner: Design

3. AI job definition

Score: 4 / 5

Evidence

AI job statement is specific and testable: "drafts support responses using approved KB articles for Tier 1 agents to review and send"
Input contract, output contract, and autonomy level are defined
Clear boundary: AI suggests, human decides and sends
Scope limited to top 10 intents for v1

Risks

Multi-intent tickets (18% of volume) are acknowledged but handling is not fully specified
Confidence threshold of 0.6 is a guess, not empirically validated

Next action: run the prototype against 50 multi-intent tickets and decide on a handling strategy before pilot

Owner: ML Lead
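
Since the 0.6 threshold is unvalidated, one possible check is to sweep candidate thresholds over graded prototype outputs and compare coverage against quality. The sketch below assumes each prototype run yields a confidence score plus a grader's acceptable/unacceptable judgment. The function name, the input format, and the example scores are made up for illustration.

```python
from typing import Iterable


def sweep_thresholds(
    scored: Iterable[tuple[float, bool]],
    thresholds: Iterable[float],
) -> list[tuple[float, float, float]]:
    """Return (threshold, coverage, acceptable_rate) for each candidate threshold.

    `scored` holds (confidence, acceptable) pairs. Coverage is the share of
    tickets that would still get a draft; acceptable_rate is the share of
    shown drafts that graders judged acceptable.
    """
    rows = list(scored)
    if not rows:
        raise ValueError("need at least one scored example")
    results = []
    for t in thresholds:
        shown = [ok for conf, ok in rows if conf >= t]
        coverage = len(shown) / len(rows)
        acceptable_rate = sum(shown) / len(shown) if shown else float("nan")
        results.append((t, coverage, acceptable_rate))
    return results


# Made-up scores for illustration only; real input would come from graded eval runs.
example = [(0.92, True), (0.81, True), (0.66, True), (0.63, False), (0.48, False), (0.35, True)]
for t, coverage, rate in sweep_thresholds(example, [0.5, 0.6, 0.7]):
    print(f"threshold {t:.1f}: coverage {coverage:.0%}, acceptable {rate:.0%}")
```

A higher threshold generally trades coverage for quality, so the team would pick the lowest threshold whose acceptable rate clears the pilot bar.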

4. Data readiness

Score: 3 / 5

Evidence

Knowledge base exists with 400+ articles covering all supported intents
Historical ticket data available for eval set creation (6 months, 100K+ tickets)
Vector embedding pipeline for KB retrieval is functional in staging

Risks

15% of KB articles are outdated based on a spot check. Outdated articles are the most likely source of factual errors.
No automated freshness check or deprecation flag on KB articles
Article quality is inconsistent. Some are clear and structured, others are rambling FAQs with buried answers.

Next action: audit and update KB articles for the top 10 intents before pilot. Flag deprecated articles in the retrieval index.

Owner: Support Lead + Engineering
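
A freshness check and deprecation flag could be as small as a filter applied before articles enter the retrieval index. The sketch below is a minimal illustration: the `KBArticle` fields, the `retrievable` helper, and the 180-day review window are all assumptions, not part of the existing pipeline.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class KBArticle:
    article_id: str
    last_reviewed: date
    deprecated: bool = False


def retrievable(articles: list[KBArticle], today: date, max_age_days: int = 180) -> list[KBArticle]:
    """Keep articles that are not deprecated and were reviewed recently enough.

    The 180-day default is a placeholder, not a figure from the assessment.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [a for a in articles if not a.deprecated and a.last_reviewed >= cutoff]


articles = [
    KBArticle("kb-001", date(2026, 1, 15)),
    KBArticle("kb-002", date(2025, 3, 2)),                   # stale: last review too old
    KBArticle("kb-003", date(2026, 2, 1), deprecated=True),  # explicitly flagged
]
fresh = retrievable(articles, today=date(2026, 2, 10))
print([a.article_id for a in fresh])
```

Only `kb-001` survives the filter here. Excluding stale articles reduces coverage, so the audit and the filter need to run together.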

5. Eval readiness

Score: 2 / 5

Evidence

Quality rubric drafted with 4 dimensions (factual accuracy, citation relevance, tone, completeness)
5 golden examples defined covering happy path, edge case, unacceptable output, safety boundary, and regression
Automated checks specified (citation validity, PII detection, length bounds, confidence score presence)

Risks

No golden dataset built yet. The 5 examples are illustrative, not a scored eval set.
No labeled reference responses from graders
Regression suite has 5 cases. The target is 50 within 3 months, but the starting point is thin.
Inter-rater reliability between graders is untested

Next action: build 100-example golden eval set with labeled reference responses. Run inter-rater reliability test with 3 graders on 20 examples.

Owner: ML Lead + Support Lead
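
For the inter-rater test with 3 graders, Fleiss' kappa is one common agreement statistic. The sketch below is a plain-Python version and assumes each grader assigns a categorical label (for example pass/fail on one rubric dimension) to every example. That labeling scheme and the sample ratings are assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for items that were each rated by the same number of graders.

    `ratings[i]` is the list of labels the graders gave item i.
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        agree = (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))
        per_item_agreement.append(agree)

    observed = sum(per_item_agreement) / n_items
    expected = sum((total / (n_items * n_raters)) ** 2 for total in category_totals.values())
    if expected == 1.0:
        return 1.0  # every grader used the same single label everywhere
    return (observed - expected) / (1 - expected)


# Illustrative ratings: 4 examples, 3 graders each.
sample = [
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
    ["pass", "pass", "pass"],
    ["pass", "fail", "fail"],
]
print(round(fleiss_kappa(sample), 3))
```

On this sample the statistic comes out at about 0.33, which would usually prompt a rubric calibration session before graders label the full eval set.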

6. System behavior

Score: 3 / 5

Evidence

Model selected (Claude Sonnet) with estimated token costs
Retrieval pipeline functional in staging (vector search over KB articles)
Latency targets defined (p50 <2s, p95 <4s, hard timeout 6s)
Timeout behavior specified (show KB articles instead of draft)

Risks

Low-confidence fallback behavior is designed but not implemented. The prototype currently shows a draft regardless of confidence score.
Safety trigger detection (self-harm, threats) is specified but not tested against adversarial inputs
No load testing done. Latency under concurrent usage is unknown.

Next action: implement low-confidence fallback, test safety triggers against 20 adversarial examples, run load test at 2x expected concurrent usage

Owner: Engineering
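
The fallback rules described in this section (show KB articles on timeout or when confidence is below the threshold) can be sketched as a single decision function. The `DraftResult` type and `choose_panel_content` name are hypothetical; the 0.6 threshold and 6-second hard timeout come from this assessment. A real implementation would cancel the request at the timeout rather than inspect latency afterwards.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.6   # from the assessment; flagged there as unvalidated
HARD_TIMEOUT_SECONDS = 6.0   # hard timeout from the latency targets


@dataclass
class DraftResult:
    text: str
    confidence: float
    latency_seconds: float


def choose_panel_content(result: Optional[DraftResult]) -> str:
    """Return "draft" or "kb_articles" for the reply panel.

    `result` is None when generation failed or never returned.
    """
    if result is None or result.latency_seconds > HARD_TIMEOUT_SECONDS:
        return "kb_articles"
    if result.confidence < CONFIDENCE_THRESHOLD:
        return "kb_articles"
    return "draft"


print(choose_panel_content(DraftResult("Hi, thanks for reaching out.", 0.82, 1.7)))
print(choose_panel_content(DraftResult("Hi, thanks for reaching out.", 0.41, 1.7)))
```

Keeping the decision in one place makes it straightforward to log which branch fired, which the pilot needs in order to see how often agents get articles instead of a draft.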

7. Risk and safety

Score: 3 / 5

Evidence

Main risks identified: hallucinated policy claims, outdated KB citations, PII echo, safety-sensitive tickets
Mitigation strategy defined for each: citation grounding, KB freshness audit, PII regex filter, safety trigger escalation
Human review on every response is a strong backstop

Risks

No formal risk register with severity ratings and owners
PII filtering relies on regex patterns, which will miss some formats
Safety trigger detection has not been adversarially tested
No process for handling incidents if a hallucinated claim reaches a customer

Next action: create formal risk register. Define incident response process for hallucination reports. Test PII filter against 50 examples with varied PII formats.

Owner: PM + Engineering
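
To make the regex limitation concrete, the sketch below pairs a deliberately simple filter with a small harness that reports inputs whose PII survives redaction. The patterns, helper names, and test strings are illustrative assumptions, not the prototype's actual filter.

```python
import re

# Deliberately simple patterns; both are illustrative and known to be incomplete.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def missed_cases(cases: list[tuple[str, str]]) -> list[str]:
    """Return inputs whose known PII substring survives redaction.

    Each case is (input_text, pii_substring_that_must_not_appear_in_output).
    """
    return [text for text, pii in cases if pii in redact(text)]


cases = [
    ("Reach me at jo@example.com", "jo@example.com"),
    ("Call 555-123-4567 after 5pm", "555-123-4567"),
    ("Call (555) 123-4567 after 5pm", "123-4567"),   # parenthesized area code
    ("My number is +44 20 7946 0958", "7946 0958"),  # non-US digit grouping
]
print(missed_cases(cases))
```

The last two cases slip through these patterns, which is the kind of gap the 50-example test is meant to surface before the pilot.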

8. Regulatory readiness

Score: 2 / 5

Evidence

Customer-facing AI disclosure requirement identified
PII handling risks listed in PRD and risk notes
Human review required before any customer-facing message is sent

Risks

Prompt and output retention policy is not confirmed
Customer-facing AI disclosure language has not been reviewed by legal
Tenant data handling review is not complete

Next action: confirm disclosure, retention, and tenant data handling requirements before pilot

Owner: Legal + Security

9. Cost and business case

Score: 3 / 5

Evidence

Estimated cost per draft: $0.0094 (based on Claude Sonnet pricing, ~1,500 input tokens, ~300 output tokens, with caching). PRD upper-bound target is $0.03.
Estimated monthly cost for full team (45 agents): $51 baseline for top-10-intent v1, about $103 if coverage expands to the top 50 intents
Expected business value: 1.6 minute reduction in drafting time per supported ticket, roughly 20% handle time reduction on supported intents

Risks

Cost estimate is based on assumed token counts, not measured. Retrieval-augmented prompts with multiple KB articles could push input tokens higher.
Business case relies on handle time reduction, which has not been measured in practice
Expansion beyond the top 10 intents depends on quality and KB coverage, not just cost

Next action: measure actual token usage on 200 prototype runs. Validate supported-intent volume during pilot and model cost at 50-intent and full-coverage scenarios.

Owner: PM + Engineering
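
Once token usage is measured, the per-draft cost can be recomputed and compared with the PRD ceiling. The sketch below takes prices as parameters; the price values in the example are placeholders to be replaced with contract pricing, and the calculation ignores caching discounts, so it will not reproduce the $0.0094 estimate exactly.

```python
PRD_COST_CEILING = 0.03  # upper-bound target per draft, from the PRD


def cost_per_draft(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollar cost of one draft given token counts and per-million-token prices."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Placeholder prices for illustration; substitute contract pricing and measured token counts.
baseline = cost_per_draft(1_500, 300, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
heavier = cost_per_draft(4_500, 300, input_price_per_mtok=3.0, output_price_per_mtok=15.0)

for label, cost in [("assumed token counts", baseline), ("3x input tokens", heavier)]:
    print(f"{label}: ${cost:.4f} per draft, within ceiling: {cost <= PRD_COST_CEILING}")
```

Running the same function over the 200 measured prototype runs would show how far retrieval-heavy prompts push the cost toward the ceiling.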

10. Observability

Score: 2 / 5

Evidence

Latency logging is live in staging
API cost tracking is functional
Metric definitions exist for accept/edit/reject/retry/escalation rates

Risks

Accept/edit/reject tracking is not implemented. The agent review UI exists but does not emit events.
Citation click tracking is not implemented
No dashboard exists. Metrics are defined on paper but not visible to the team.
No alerting configured for any threshold

Next action: implement accept/edit/reject event logging. Build a dashboard with the 8 key online metrics. Configure alerts for reject rate and hallucination reports.

Owner: Engineering
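
The missing instrumentation amounts to emitting one event per agent action and aggregating those events into rates. The sketch below shows one possible shape; the `DraftEvent` fields and the `action_rates` helper are hypothetical, and the action names mirror the metric definitions listed above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

ACTIONS = {"accept", "edit", "reject", "retry", "escalate"}


@dataclass(frozen=True)
class DraftEvent:
    ticket_id: str
    agent_id: str
    action: str
    occurred_at: datetime

    def __post_init__(self) -> None:
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action}")


def action_rates(events: list[DraftEvent]) -> dict[str, float]:
    """Share of events per action; actions with no events report 0.0."""
    counts = Counter(e.action for e in events)
    total = len(events)
    return {a: (counts[a] / total if total else 0.0) for a in sorted(ACTIONS)}


now = datetime(2026, 2, 10, 9, 30)
events = [
    DraftEvent("T-1", "agent-7", "accept", now),
    DraftEvent("T-2", "agent-7", "accept", now),
    DraftEvent("T-3", "agent-2", "edit", now),
    DraftEvent("T-4", "agent-2", "reject", now),
]
print(action_rates(events))
```

Validating the action at construction time keeps misspelled event names out of the dashboard, where they would otherwise silently deflate the rates.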

11. Launch and operations

Score: 2 / 5

Evidence

Pilot group identified (8 agents on day shift)
Pilot duration planned (2 weeks)
Post-launch review cadence mentioned in the eval plan (weekly for month 1)

Risks

No written pilot plan with specific success criteria, rollback triggers, or escalation paths
No agent training plan or materials
No defined process for reviewing pilot results and making the go/no-go decision for production
Post-launch quality review owner is not assigned

Next action: write pilot plan with success criteria, rollback triggers, and go/no-go decision process. Create 30-minute agent training session. Assign post-launch review owner.

Owner: PM + Support Lead

Score summary

Dimension	Score	Weight	Weighted
Problem fit	4	1.1	4.40
Workflow fit	4	1.1	4.40
AI job definition	4	1.2	4.80
Data readiness	3	1.2	3.60
Eval readiness	2	1.5	3.00
System behavior	3	1.0	3.00
Risk and safety	3	1.4	4.20
Regulatory readiness	2	1.3	2.60
Cost and business case	3	1.0	3.00
Observability	2	1.3	2.60
Launch and operations	2	1.2	2.40
Overall		Total weight: 13.3	38.00 / 13.3 = 2.86 weighted avg

Weighted average: 2.86. The score is not rounded up because eval readiness, regulatory readiness, observability, and launch and operations are below the pilot bar.
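
The arithmetic behind the table can be reproduced with a short script. Scores and weights are copied from the score summary; treating a score below 3 as "below the pilot bar" is an assumption inferred from which dimensions are named as blockers, since the bar is not stated explicitly.

```python
# (dimension, score, weight) copied from the score summary table.
SCORES = [
    ("Problem fit", 4, 1.1),
    ("Workflow fit", 4, 1.1),
    ("AI job definition", 4, 1.2),
    ("Data readiness", 3, 1.2),
    ("Eval readiness", 2, 1.5),
    ("System behavior", 3, 1.0),
    ("Risk and safety", 3, 1.4),
    ("Regulatory readiness", 2, 1.3),
    ("Cost and business case", 3, 1.0),
    ("Observability", 2, 1.3),
    ("Launch and operations", 2, 1.2),
]

PILOT_BAR = 3  # assumed: the lowest score that does not block a pilot


def weighted_average(rows: list[tuple[str, int, float]]) -> float:
    """Sum of score * weight divided by the sum of weights."""
    total_weight = sum(weight for _, _, weight in rows)
    weighted_sum = sum(score * weight for _, score, weight in rows)
    return weighted_sum / total_weight


def below_bar(rows: list[tuple[str, int, float]], bar: int = PILOT_BAR) -> list[str]:
    """Names of dimensions scoring under the bar, in table order."""
    return [name for name, score, _ in rows if score < bar]


print(f"weighted average: {weighted_average(SCORES):.2f}")
print("below pilot bar:", ", ".join(below_bar(SCORES)))
```

Reporting the blocking dimensions next to the average keeps a strong overall score from hiding a dimension that is not ready.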

Recommendation: pilot candidate after blockers

This product is a strong pilot candidate, but it is not ready to start the pilot today. The team should close the four pre-pilot blockers below, then run a limited pilot with close monitoring. It is not ready for production launch.

Why not reject or prototype only

The problem is real and well-measured. The workflow fit is strong: AI drafts, human reviews, human sends. The AI job is specific enough to evaluate. These three things together mean the team should prepare a real pilot rather than keep exploring indefinitely. Agent review on every response limits the blast radius of quality issues once the pilot controls are in place.

Dropping to "prototype only" would be overly cautious given the strength of the problem evidence and the clear human review model. The team does not need more internal exploration to know this is worth trying with real agents on a narrow set of supported intents.

Why not pilot today

Four gaps block the pilot:

No golden eval set. The rubric and golden examples exist, but the 100-example scored dataset does not. Without it, there's no baseline to measure quality against and no regression suite to catch degradation. This is the single biggest gap.
No accept/edit/reject tracking. The team cannot measure the most important product metric (did the agent use the draft?) without this instrumentation. Flying blind on adoption makes it impossible to know if the pilot is working.
Low-confidence fallback is not implemented. The design says "show KB articles instead of a draft when confidence is below 0.6." The prototype shows a draft every time. Without the fallback, low-confidence drafts will reach agents, get rejected, and erode trust in the tool.
Regulatory and data handling review is incomplete. The feature handles customer messages and may process PII. Before the pilot, the team needs legal and security sign-off on disclosure language, retention, and tenant data handling.

These four blockers are all closable within 2-3 weeks of focused work. None of them require rethinking the product. They are execution gaps, not strategy gaps.

Blockers before pilot

Build 100-example golden eval set with labeled reference responses (owner: ML Lead, target: 2 weeks)
Implement accept/edit/reject event logging and build a basic dashboard (owner: Engineering, target: 1 week)
Implement low-confidence fallback behavior (owner: Engineering, target: 1 week)
Confirm disclosure, retention, and tenant data handling requirements (owner: Legal + Security, target: 1 week)

Blockers before production

All pilot blockers, plus:

Pilot runs for at least 2 weeks with online metrics meeting targets
No confirmed hallucination reports during pilot
Agent feedback is net positive
Formal risk register with incident response process
250-example eval set passes at pilot thresholds
Load testing completed at 2x expected concurrent usage
Safety triggers tested against adversarial inputs
