AI PM Playbook

Problem

Tier 1 support agents spend 38% of handle time drafting responses from scattered knowledge base articles. Time studies across 12 agents show 3.1 minutes per ticket on drafting alone. The top 50 intents cover 62% of ticket volume, and these intents have stable, well-documented answers in the KB. Agents are doing repetitive lookup-and-rewrite work that scales linearly with ticket volume.

Evidence

4-week time study across 12 agents, 2,400 tickets
Template adoption declined from 22% to 11% over 3 months (templates don't handle variation)
Agent satisfaction surveys cite "repetitive work" as top frustration
Average handle time has not improved in 6 quarters despite KB improvements

Goals

Reduce average drafting time from 3.1 minutes to under 1.5 minutes for supported intents
Achieve 75% draft accept rate (accepted with minor edits or no edits)
Keep reject rate below 5% (agent discards draft entirely and writes from scratch)
Zero hallucinated policy claims in production (any fabricated policy is a severity-1 issue)

Non-goals

Automating ticket routing or categorization (separate project)
Handling Tier 2 or Tier 3 escalations
Sending responses without agent review
Replacing the knowledge base or authoring new KB articles
Supporting intents outside the initial top 10 in v1

Target users

Tier 1 support agents handling inbound customer tickets via the support platform. Team of 45 agents, 3 shifts, handling roughly 800 tickets per day across all intents. The pilot will start with 8 agents on the day shift.

Current workflow

Agent picks up next ticket from the queue
Agent reads the customer message
Agent searches the KB using keywords
Agent reads 1-3 articles to find the relevant policy
Agent drafts a response from scratch, adapting KB content to the customer's situation
Agent reviews and sends the response
Agent categorizes the ticket

Proposed workflow

Agent picks up next ticket from the queue
Agent reads the customer message
System displays an AI-drafted response alongside cited KB articles
Agent reviews the draft: accepts, edits, rejects, or escalates
Agent sends the response (their own or the edited draft)
Agent categorizes the ticket

The AI removes steps 3-5 of the current workflow for supported intents. The agent still reads every response before sending. The system never sends anything without agent action.

AI job statement

The AI drafts support responses using approved knowledge base articles for Tier 1 agents to review and send, subject to human approval before any customer-facing message.

Input contract

Input	Format	Required	Max size	Fallback if missing
Customer ticket text	Plain text (latest message)	Yes	4,000 chars	Show error, agent drafts manually
Conversation history	Array of previous messages	No	Last 10 messages	Draft based on latest message only
Customer account metadata	JSON (plan tier, tenure, region)	No	1 KB	Draft without account-specific details
KB article corpus	Pre-indexed vector store	Yes	N/A (system dependency)	Feature unavailable, agent drafts manually

Output contract

Output field	Type	Always present	Example
Draft response	String	Yes	"Your billing cycle resets on the 1st of each month..."
Cited KB articles	Array of article IDs and titles	Yes	[{"id": "KB-1042", "title": "Billing cycle FAQ"}]
Confidence score	Float 0-1	Yes	0.82
Intent classification	String	Yes	"billing_cycle_question"
Fallback flag	Boolean	Yes	false

Autonomy level

Draft: AI produces output, human reviews before anything happens
Suggest: AI recommends an action, human accepts or rejects
Act: AI takes action, human can undo
Autonomous: AI takes action, no human in the loop

The AI produces a full draft response for the agent to review before anything happens. The agent decides whether to use it, edit it, or ignore it. No customer-facing message is ever sent without the agent pressing send. This is "draft" level, not "suggest," because the AI produces a complete output for review rather than recommending a discrete action. The right level because: (1) support responses go to real customers and errors damage trust, (2) agents are available to review every response since they're already handling the ticket, and (3) the cost of human review is low relative to the cost of a wrong answer.

Human review rules

Every AI-drafted response must be reviewed by the handling agent before sending
Agents can accept the draft as-is, edit it, reject it and write their own, or escalate the ticket
If the AI flags low confidence (below 0.6), the agent sees a clear visual indicator and the top 3 KB articles instead of a draft
Agents cannot bulk-accept drafts or auto-send

Quality bar

75% accept rate: agents accept the draft with no edits or minor edits (punctuation, small phrasing changes)
Less than 5% reject rate: agents discard the draft entirely
Zero hallucinated policy claims: every factual statement in the draft must be traceable to a current KB article
Factual accuracy on golden eval set: 100% pass rate (binary, no partial credit)
Citation relevance on golden eval set: average score above 3.5/5
Tone appropriateness on golden eval set: average score above 3.5/5

Latency target

p50: draft appears within 2 seconds of ticket load
p95: draft appears within 4 seconds
Hard timeout: 6 seconds. If no draft within 6 seconds, show "Draft unavailable" and surface top 3 KB articles

Cost constraint

Under $0.03 per draft at current model pricing
Under $0.10 per agent per day at expected v1 usage (about 5.5 top-10-intent tickets per agent per day)
Monthly cost for full team (45 agents): under $100 for top-10-intent v1; under $500 if coverage expands to the top 50 intents

Cost is estimated using Claude Sonnet with ~1,500 input tokens (ticket + context + KB snippets) and ~300 output tokens (draft response). This needs validation at actual usage patterns.

Failure behavior

On timeout (>6 seconds): show "Draft unavailable, here are possibly relevant articles" with top 3 KB matches. Agent drafts manually.
On low confidence (<0.6): show "I couldn't find a strong match for this question" with top 3 KB articles. No draft shown. Agent drafts manually.
On malformed output (missing required fields): log the error, show nothing to the agent, agent drafts manually. Alert engineering if error rate exceeds 1%.
On safety trigger (self-harm, threats, legal demands): suppress the draft entirely. Show escalation prompt: "This ticket may need immediate escalation." Route to Tier 2 queue.
On KB retrieval failure: show "Knowledge base temporarily unavailable." Agent drafts manually. Page on-call if retrieval is down for more than 5 minutes.

Observability requirements

From day one, log and dashboard:

Accept rate (draft used as-is)
Edit rate (draft used with modifications) and edit distance
Reject rate (draft discarded, agent wrote from scratch)
Retry rate (agent requested a new draft)
Escalation rate (ticket escalated to Tier 2)
Confidence score distribution
Low-confidence fallback rate
Citation click rate (did the agent click through to the cited KB article?)
Latency p50, p95, p99
Cost per draft (actual, not estimated)
Handle time delta (before vs. after, per intent)
Error rate (malformed outputs, timeouts, retrieval failures)
Agent-reported quality issues (thumbs down or free-text feedback)

Launch gates

Before pilot (8 agents, 2 weeks):

80% of the 100-example golden eval set passes all automated checks
Human grading average above 3.5/5 across all rubric dimensions
Zero hallucinated policy claims in eval set
Accept/edit/reject tracking is live and dashboarded
Low-confidence fallback behavior is implemented and tested
Safety escalation triggers are tested against 20 adversarial examples
Agent training completed (30-minute session covering: how drafts work, how to report issues, what to do when drafts are wrong)

Before production (all agents):

Pilot accept rate above 70%
Pilot reject rate below 8%
No severity-1 incidents during pilot
Handle time reduction is measurable and positive
Agent feedback is net positive
Post-launch review cadence is scheduled (weekly for first month)

Open questions

How do we handle tickets that span two intents (e.g., billing question + cancellation request)? Do we draft for the primary intent only, or attempt both?
Should the confidence threshold (0.6) be tunable per intent, or is a single global threshold sufficient for v1?
How do we feed agent edits back into the system? Manual review of high-edit-distance drafts, or automated fine-tuning pipeline?
What is the KB article refresh cadence, and who owns flagging outdated articles that the AI might cite?

AI PRD