GTM Agent Design
A GTM agent is an AI system that performs a specific go-to-market task: researching accounts, writing outbound, scoring leads, handling replies, enriching data, or routing prospects. The agent replaces a repetitive human workflow with an LLM-powered process that runs faster, cheaper, and more consistently.
The design principle: start with the human workflow, not the technology. Map what a person does today step by step. Identify which steps are repetitive, rule-based, and don't require judgment. Automate those. Keep the judgment steps human.
The GTM Agent Landscape
| Agent type | What it does | Replaces | Human-in-the-loop? |
|---|---|---|---|
| Research Agent | Pulls company and contact data from multiple sources, synthesizes into an account brief | Manual account research (45-90 min per account) | Review output before use |
| Email Writer Agent | Generates cold email sequences from an account brief | SDR writing emails (15-30 min per sequence) | Approve before send |
| Personalization Agent | Inserts per-prospect tokens into templated sequences | SDR personalizing templates (5-10 min per email) | Spot-check 10-20% |
| Reply Classifier Agent | Classifies inbound replies (positive, objection, OOO, opt-out) | SDR triaging inbox (ongoing) | Review low-confidence classifications |
| Lead Scorer Agent | Scores inbound leads on ICP fit and intent signals | RevOps manual scoring or static rules | Calibrate model quarterly |
| Enrichment Agent | Fills missing data fields from multiple providers | Ops team running enrichment workflows | Validate match rates |
| Signal Monitor Agent | Watches for buying signals across data sources, alerts when triggered | Manual signal scanning (daily) | Set alert thresholds |
| Routing Agent | Routes leads to the right rep based on territory, segment, and availability | RevOps routing rules in CRM | Audit routing accuracy weekly |
| Meeting Prep Agent | Generates pre-call briefs from CRM data, research, and prior notes | AE/SDR manual prep (15-30 min per meeting) | Read before the call |
| Follow-Up Agent | Generates post-meeting follow-up emails from call notes | AE writing follow-ups (10-20 min per email) | Edit and approve before send |
Agent Design Process
Step 1: Map the human workflow
Before writing any code or prompts, document exactly what a person does today.
Workflow mapping template:
For each step, capture:
| Field | What to document |
|---|---|
| Step name | What the person does ("Find company funding history") |
| Input | What they start with ("Company name") |
| Source | Where they get the data ("Crunchbase, press articles") |
| Action | What they do with it ("Read, extract key facts, summarize") |
| Output | What they produce ("Funding summary: round, amount, date, investors") |
| Time | How long it takes ("5-10 minutes") |
| Judgment required? | Does this step require human judgment or is it rule-based? |
| Error rate | How often do humans get this wrong? |
Example: SDR account research workflow
| Step | Input | Source | Action | Output | Time | Judgment? |
|---|---|---|---|---|---|---|
| 1. Company snapshot | Company name | LinkedIn, website | Read about page, note size/stage/vertical | Snapshot fields | 2 min | No |
| 2. Funding history | Company name | Crunchbase | Search, extract rounds | Funding summary | 3 min | No |
| 3. Recent signals | Company name | LinkedIn, news, job boards | Scan for events in last 90 days | Signal list | 10 min | Low |
| 4. Tech stack | Company name | Job postings, BuiltWith | Extract tool mentions | Stack list | 5 min | Low |
| 5. Committee mapping | Company name | LinkedIn | Search titles, identify roles | Contact list | 10 min | Medium |
| 6. Problem hypothesis | All above | None (synthesis) | Connect signals to pain, write hypothesis | 1 paragraph | 10 min | High |
| 7. Email drafting | Account brief | None (writing) | Craft 3-email sequence | 3 emails | 15 min | High |
Steps 1-4 are low-judgment, high-repetition. Automate these. Step 5 requires moderate judgment. Semi-automate (agent proposes, human validates). Steps 6-7 require high judgment on synthesis, tone, and quality. The agent drafts, a human approves.
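To make the mapping actionable, the judgment column can drive the automation decision directly. A minimal sketch, assuming a simple four-level judgment scale; names and thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass
class WorkflowStep:
    name: str
    minutes: int
    judgment: str  # "none" | "low" | "medium" | "high"

def automation_mode(step: WorkflowStep) -> str:
    """Map the judgment level from the workflow table to an automation decision."""
    if step.judgment in ("none", "low"):
        return "automate"                                        # steps 1-4 in the example
    if step.judgment == "medium":
        return "semi-automate: agent proposes, human validates"  # step 5
    return "agent drafts, human approves"                        # steps 6-7

for s in [WorkflowStep("Company snapshot", 2, "none"),
          WorkflowStep("Committee mapping", 10, "medium"),
          WorkflowStep("Email drafting", 15, "high")]:
    print(f"{s.name}: {automation_mode(s)}")
```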
Step 2: Define the agent's scope
One agent, one job. Draw the boundary tight.
Scoping rules:
- An agent should complete in under 60 seconds for real-time tasks (reply classification, routing) or under 5 minutes for batch tasks (research, email writing)
- An agent should have a single, testable output. "Account brief" is testable. "Help with sales" is not
- If the agent needs more than 5 tools, it's probably too broad. Split it
- If the system prompt is over 2,000 words, it's probably covering multiple jobs. Split it
- If you're writing "if the input is X, do this; if the input is Y, do that" in the prompt, you need a router and two specialist agents, not one agent with branching logic
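A minimal sketch of that last rule: a thin router dispatching to two specialist agents rather than one agent with branching prompt logic. The task shape, `handle_reply`, and `handle_new_lead` are hypothetical stand-ins for separate agents with their own prompts and tools:

```python
def route(task: dict) -> str:
    """Dispatch on a simple, testable input property instead of branching in one prompt."""
    if task.get("type") == "inbound_reply":
        return handle_reply(task)      # specialist 1: own prompt, tools, and evals
    if task.get("type") == "new_lead":
        return handle_new_lead(task)   # specialist 2
    return escalate_to_human(task)     # unknown input: never guess

def handle_reply(task: dict) -> str:
    return f"reply classifier handled {task['id']}"

def handle_new_lead(task: dict) -> str:
    return f"lead scorer handled {task['id']}"

def escalate_to_human(task: dict) -> str:
    return f"task {task['id']} queued for human review"

print(route({"type": "inbound_reply", "id": "t-42"}))
```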
Step 3: Design the system prompt
The system prompt is the agent's operating manual. It determines output quality more than any other design choice.
System prompt structure:
1. Role and purpose (2-3 sentences)
2. Input specification (what the agent receives)
3. Output specification (exact format, schema, required fields)
4. Rules and constraints (hard rules the agent must follow)
5. Examples (2-3 input/output pairs showing ideal behavior)
6. Edge cases (what to do when data is missing or ambiguous)
Prompt design rules:
- Lead with the role. "You are a B2B account research agent that produces structured account briefs from company names" is better than a paragraph of context
- Specify the output format exactly. If you want JSON, show the schema. If you want markdown, show the template. Ambiguous output specs produce inconsistent results
- Hard rules are non-negotiable constraints. "Never use em-dashes. Never exceed 80 words. Never fabricate a signal." These go in a dedicated rules section, not buried in paragraphs
- Examples are the most powerful part of the prompt. Two good examples teach the agent more than 500 words of instructions. Show the input, the ideal output, and annotate why the output is good
- Address missing data explicitly. "If funding data is not found, output 'Funding: Not found (checked Crunchbase, PitchBook)' instead of guessing" prevents hallucination
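A skeleton following this six-part structure, shown as a Python constant. The schema, rules, and example below are illustrative, not a recommended prompt:

```python
# Hypothetical system prompt skeleton; every rule and field is illustrative.
SYSTEM_PROMPT = """\
You are a B2B account research agent that produces structured account
briefs from company names.

INPUT: JSON with "company_name" (required) and "domain" (optional).

OUTPUT: JSON matching this schema exactly:
{"snapshot": str, "funding": str, "signals": [str], "confidence": {str: str}}

RULES:
- Never fabricate a signal. Cite a source for every claim.
- If funding data is not found, output "Funding: Not found (sources checked)".
- Keep "snapshot" under 80 words.

EXAMPLES:
Input: {"company_name": "Acme Robotics"}
Output: {"snapshot": "...", "funding": "$45M Series B (Crunchbase)", ...}

EDGE CASES:
- Stealth company with no public data: return every field as "Not found",
  set every confidence value to "low", and do not guess.
"""
```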
Step 4: Select tools
Tools are the actions an agent can take: search the web, query an API, read a database, call an MCP server.
Common GTM agent tools:
| Tool | What it does | Used by |
|---|---|---|
| Web search | Searches the internet for company information | Research Agent, Signal Monitor |
| LinkedIn API / scraper | Pulls profile and company data from LinkedIn | Research Agent, Committee Mapper |
| CRM read/write | Reads and updates CRM records | Enrichment Agent, Routing Agent, Scorer |
| Enrichment API (Apollo, Clearbit) | Fills missing contact and company data | Enrichment Agent |
| Email sending API (Lemlist, Outreach) | Loads sequences and sends emails | Email Writer (with human approval gate) |
| Calendar API | Books meetings, checks availability | Meeting Booker Agent |
| Slack API | Sends alerts and notifications | Signal Monitor, Routing Agent |
| File read/write | Reads CSVs, writes reports | Batch processing agents |
Tool design rules:
- Every tool that writes to an external system (CRM, email, Slack) should have a confirmation step in development and a human approval gate in production
- Tools should return structured data, not raw HTML or API responses. Parse before returning to the agent
- Limit tool count per agent. 3-5 tools is ideal. Above 7, the agent spends more time deciding which tool to use than doing the work
- Include error information in tool responses. "Search returned 0 results for [query]" is better than an empty response. The agent needs to know when data is missing vs when the tool failed
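A sketch of the first and last rules together: tools return structured results that distinguish "no data found" from "tool failed", and external writes sit behind an approval gate. The `fake_*` backends are hypothetical placeholders:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    ok: bool
    data: dict | None
    error: str | None   # distinguishes "no data found" from "tool failed"

def search_company(query: str) -> ToolResult:
    hits = fake_search(query)   # hypothetical search backend
    if hits is None:
        return ToolResult(ok=False, data=None, error="search backend unreachable")
    if not hits:
        return ToolResult(ok=True, data={"hits": []},
                          error=f"search returned 0 results for {query!r}")
    return ToolResult(ok=True, data={"hits": hits}, error=None)

def update_crm(record_id: str, fields: dict, approved: bool = False) -> ToolResult:
    if not approved:   # human approval gate on all external writes
        return ToolResult(ok=False, data=None,
                          error="write blocked: pending human approval")
    fake_crm_write(record_id, fields)   # hypothetical CRM client call
    return ToolResult(ok=True, data={"updated": record_id}, error=None)

def fake_search(query: str) -> list | None:
    return []

def fake_crm_write(record_id: str, fields: dict) -> None:
    pass
```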
Step 5: Define evaluation criteria
Before building, define how you'll measure whether the agent works.
Evaluation framework:
| Dimension | What to measure | How to measure | Minimum bar |
|---|---|---|---|
| Accuracy | Are the facts correct? | Human review of 50 outputs against ground truth | 95%+ factual accuracy |
| Completeness | Are all required fields populated? | Automated schema check | 90%+ field completion |
| Rule compliance | Does output follow all hard rules? | Automated rule checker (word count, banned phrases, format) | 100% compliance |
| Quality | Is the output good enough to use? | Human rating (1-5 scale) on 50 outputs | Average ≥ 4.0 |
| Latency | How long does it take? | Timer per run | Under threshold (60 s real-time, 5 min batch) |
| Cost | How much does it cost per run? | Token tracking | Under unit economics threshold |
Evaluation rules:
- Define the minimum bar before building. "We'll know it's good enough when..." should be answerable before writing the first prompt
- Accuracy and rule compliance are non-negotiable. Quality and latency can be traded off
- Test on at least 50 inputs before deploying. 5 test cases is a demo, not a test
- Measure cost per unit of output. "$0.15 per account brief" or "$0.03 per email." If the agent costs more than the human time it replaces, the economics don't work
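A minimal sketch of the automated rule checker and per-run cost tracking described above. The banned phrases, word limit, and per-token prices are placeholder assumptions, not real model pricing:

```python
BANNED_PHRASES = ["quick question", "just checking in"]  # illustrative hard rules
MAX_WORDS = 80

def check_rules(email_body: str) -> list[str]:
    """Return rule violations; an empty list means 100% compliance."""
    violations = []
    if len(email_body.split()) > MAX_WORDS:
        violations.append(f"exceeds {MAX_WORDS} words")
    for phrase in BANNED_PHRASES:
        if phrase in email_body.lower():
            violations.append(f"banned phrase: {phrase!r}")
    if "\u2014" in email_body:
        violations.append("contains em-dash")
    return violations

def cost_per_run(input_tokens: int, output_tokens: int,
                 in_price: float = 3e-6, out_price: float = 15e-6) -> float:
    """Dollar cost per run; per-token prices here are placeholders, not real pricing."""
    return input_tokens * in_price + output_tokens * out_price

body = "Saw your Series B announcement. Teams at your stage cut ramp time 40%."
print(check_rules(body), f"${cost_per_run(900, 120):.4f}")  # [] $0.0045
```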
GTM Agent Archetypes
The Research Agent
Purpose: Transform a company name into a structured account brief.
Input: Company name, domain, optional ICP criteria
Output: Structured account brief with: company snapshot, funding history, recent signals, tech stack indicators, 3-5 committee contacts, problem hypothesis
Key design decisions:
- Use web search + LinkedIn as primary tools. Add Crunchbase API if available
- Structure output as JSON or structured markdown. Free-form summaries are harder for downstream agents to parse
- Include a confidence score per field. "Funding: $45M Series B (high confidence, Crunchbase)" vs "ARR: ~$15M (low confidence, estimated from headcount)" (see the schema sketch after this list)
- Handle private/stealth companies explicitly. "Limited public information available. Brief is incomplete" is better than a hallucinated profile
- Time-cap research at 60 seconds per account. Beyond that, diminishing returns
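A possible shape for that output, sketched with Python dataclasses: every field carries a value, a confidence level, and a source. Field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class BriefField:
    value: str
    confidence: str  # "high" | "medium" | "low"
    source: str      # where the fact came from, or "not found"

@dataclass
class AccountBrief:
    company: str
    snapshot: BriefField
    funding: BriefField
    signals: list[BriefField] = field(default_factory=list)
    contacts: list[str] = field(default_factory=list)
    hypothesis: BriefField | None = None

brief = AccountBrief(
    company="Acme Robotics",
    snapshot=BriefField("120-person Series B robotics company", "high", "LinkedIn"),
    funding=BriefField("$45M Series B, 2024", "high", "Crunchbase"),
)
brief.signals.append(BriefField("Hiring 6 SDRs", "medium", "job boards"))
```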
The Email Writer Agent
Purpose: Generate a cold email sequence from an account brief.
Input: Account brief (from Research Agent or human), target contact name/title, product value prop
Output: 3-email sequence with subject lines, bodies, and send timing
Key design decisions:
- Embed all cold-outbound-email-writing rules directly in the system prompt. Word limits, banned phrases, signal requirements, subject line rules
- Include 2-3 examples of ideal output in the prompt. Examples train the model better than rules alone
- Use a QA loop: Writer → Critic → Rewrite if needed. Cap at 3 iterations (sketched after this list)
- Separate "generate" from "personalize." The writer creates the template sequence. A personalization agent inserts per-contact tokens. This separation lets you reuse the same sequence across contacts with different personalization
- Output should include the raw email text plus metadata: word count per email, signal used, proof point used, subject line pattern used
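A sketch of that QA loop, with `draft_sequence` and `critique` as hypothetical stand-ins for the writer and critic model calls:

```python
def generate_with_qa(brief: dict, max_iterations: int = 3) -> dict:
    """Writer drafts, critic reviews, writer rewrites; capped at 3 iterations."""
    draft = draft_sequence(brief)                        # writer agent call
    for _ in range(max_iterations):
        issues = critique(draft)                         # critic agent call
        if not issues:
            return {"sequence": draft, "status": "passed"}
        draft = draft_sequence(brief, feedback=issues)   # rewrite against feedback
    # Cap reached: never loop forever; hand off to a human instead
    return {"sequence": draft, "status": "needs_human_review"}

def draft_sequence(brief: dict, feedback: list[str] | None = None) -> str:
    return "email 1 ... email 2 ... email 3"   # stand-in for the writer model

def critique(draft: str) -> list[str]:
    return []   # stand-in for the critic model; empty = no issues found
```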
The Reply Classifier Agent
Purpose: Classify inbound email replies into actionable categories.
Input: Reply email text, original outbound email text, prospect metadata
Output: Classification (positive, objection, question, OOO, opt-out, irrelevant), confidence score, recommended next action
Key design decisions:
- Use a fast, cheap model (Haiku). Classification doesn't need Opus
- Define 6-8 categories with clear boundaries and 3+ examples per category
- Include a confidence threshold. Below 80% confidence, route to human review
- Output the recommended next action alongside the classification. "Positive reply. Recommended: send meeting booking email with 3 time slots" gives the downstream system or human a clear next step
- Handle multi-intent replies. "I'm interested but I'm OOO until the 15th" is both positive and OOO. The classifier should detect both and route accordingly
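A sketch of the threshold routing and multi-intent handling above; `classify` is a hypothetical stand-in for the cheap-model call:

```python
CONFIDENCE_THRESHOLD = 0.80

def handle_reply(reply_text: str) -> dict:
    result = classify(reply_text)   # stand-in for a fast, cheap model call
    if result["confidence"] < CONFIDENCE_THRESHOLD:
        return {"route": "human_review", **result}
    actions = []
    # Multi-intent: one reply can carry several labels at once
    if "positive" in result["labels"]:
        actions.append("send meeting booking email with 3 time slots")
    if "ooo" in result["labels"]:
        actions.append("delay follow-up until stated return date")
    if "opt_out" in result["labels"]:
        actions = ["suppress contact immediately"]   # opt-out overrides everything
    return {"route": "auto", "actions": actions, **result}

def classify(text: str) -> dict:
    return {"labels": ["positive", "ooo"], "confidence": 0.91}

print(handle_reply("I'm interested but I'm OOO until the 15th"))
```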
The Signal Monitor Agent
Purpose: Continuously watch data sources for buying signals on target accounts.
Input: List of target accounts, signal definitions (what to watch for)
Output: Alert when a signal is detected: account name, signal type, signal details, signal strength, recommended action
Key design decisions:
- Run on a schedule (daily or weekly), not real-time. Most buying signals don't require instant response
- Define signal types with explicit detection criteria. "Funding round" = specific press release or Crunchbase entry, not "they seem to be growing"
- Deduplicate signals. The same funding round shouldn't trigger 5 alerts from 5 sources (see the sketch after this list)
- Include signal strength scoring. A Series B announcement is stronger than a LinkedIn post about growth plans
- Route alerts to the right person. Signal on a Tier 1 ABM account goes to the ABM marketer. Signal on a Tier 3 account goes to the SDR queue
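One way to deduplicate, sketched below: key each alert on the underlying event rather than on the source that reported it. The keying scheme is an assumption:

```python
import hashlib

_seen: set[str] = set()

def signal_key(account: str, signal_type: str, detail: str) -> str:
    """Stable key per underlying event, independent of which source reported it."""
    raw = f"{account}|{signal_type}|{detail}".lower()
    return hashlib.sha256(raw.encode()).hexdigest()

def maybe_alert(account: str, signal_type: str, detail: str, source: str) -> bool:
    key = signal_key(account, signal_type, detail)
    if key in _seen:
        return False   # duplicate: this event already triggered an alert
    _seen.add(key)
    print(f"ALERT: {account} / {signal_type} / {detail} (via {source})")
    return True

maybe_alert("Acme Robotics", "funding", "Series B $45M", "Crunchbase")   # alerts
maybe_alert("Acme Robotics", "funding", "Series B $45M", "TechCrunch")   # suppressed
```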
Production Deployment
Progressive rollout
| Phase | Duration | Volume | Human review | Goal |
|---|---|---|---|---|
| 1. Prototype | Week 1-2 | 10 test inputs | 100% | Does it work at all? |
| 2. Pilot | Week 3-6 | 50-100 real inputs | 100% | Does output quality meet the bar? |
| 3. Controlled launch | Week 7-12 | Full volume | 50% spot-check | Does quality hold at scale? |
| 4. Production | Week 12+ | Full volume | 10-20% spot-check | Ongoing quality assurance |
Rollout rules:
- Never skip phases. A prototype that works on 10 test inputs may fail at 100 real inputs
- Define phase advancement criteria before starting. "Advance from Pilot to Controlled Launch when accuracy ≥ 95% on 50+ human-reviewed outputs"
- Keep a human fallback throughout. If the agent goes down or quality drops, the team can revert to the manual process immediately
- Track unit economics from Phase 2 onward. Cost per output, time saved per output, quality vs human baseline
Monitoring in production
| Metric | Check frequency | Alert threshold |
|---|---|---|
| Output quality score (from human spot-checks) | Weekly | Average drops below 3.5/5 |
| Rule compliance rate | Daily (automated) | Any rule violation |
| Latency per run | Per run | Exceeds 2x baseline |
| Cost per run | Daily | Exceeds budget by 20% |
| Error rate | Per run | Above 5% |
| Human override rate | Weekly | Above 30% (agent outputs being rejected) |
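These thresholds are straightforward to automate. A minimal sketch, with metric names as illustrative assumptions:

```python
def check_metrics(m: dict) -> list[str]:
    """Compare a run's metrics to the alert thresholds in the table above."""
    alerts = []
    if m["quality_score"] < 3.5:
        alerts.append("quality below 3.5/5")
    if m["rule_violations"] > 0:
        alerts.append("rule violation detected")       # any violation alerts
    if m["latency_s"] > 2.0 * m["baseline_latency_s"]:
        alerts.append("latency exceeds 2x baseline")
    if m["cost"] > 1.20 * m["budget"]:
        alerts.append("cost exceeds budget by 20%+")
    if m["error_rate"] > 0.05:
        alerts.append("error rate above 5%")
    if m["override_rate"] > 0.30:
        alerts.append("human override rate above 30%")
    return alerts

print(check_metrics({"quality_score": 4.2, "rule_violations": 0,
                     "latency_s": 30, "baseline_latency_s": 25,
                     "cost": 0.12, "budget": 0.15,
                     "error_rate": 0.01, "override_rate": 0.10}))   # -> []
```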
Anti-Pattern Check
- Starting with the technology instead of the workflow. "Let's build an agent with Claude and MCP" before mapping the human process produces agents that don't fit the actual need. Map the workflow first
- Building one agent to do everything. A "GTM Agent" that researches, writes emails, scores leads, and updates CRM will be mediocre at all four. One agent, one job
- No human review on customer-facing output. An agent sending emails without human approval will eventually send something embarrassing to a Tier 1 account. The cost of that one bad email exceeds the cost of reviewing 1,000 good ones
- Optimizing for speed before quality. A fast agent that produces bad output is worse than no agent. Get quality right in Pilot phase. Optimize speed in Production
- No evaluation framework. "It seems to work" is not evaluation. Define quantitative criteria (accuracy, compliance, cost) before building and measure against them continuously
- Skipping the prototype phase. Going straight to full-volume deployment because "the prompt looks good" leads to expensive failures. Test on 10 inputs first. Always
- Using the most expensive model for every agent. Match model to cognitive demand. Classification = Haiku. Extraction = Sonnet. Generation = Opus or Sonnet. Using Opus for routing is burning money
- No fallback plan. If the agent breaks at 2am, what happens? If there's no answer, you're not ready for production. Maintain the manual process as a fallback until the agent has 30+ days of stable operation
- Treating agent design as a one-time project. Prompts drift. Data sources change. Quality degrades. Agent design is ongoing. Budget for weekly monitoring and monthly iteration