GTM Agent Design
A GTM agent is an AI system that performs a specific go-to-market task: researching accounts, writing outbound, scoring leads, handling replies, enriching data, or routing prospects. The agent replaces a repetitive human workflow with an LLM-powered process that runs faster, cheaper, and more consistently.
The design principle: start with the human workflow, not the technology. Map what a person does today step by step. Identify which steps are repetitive, rule-based, and don't require judgment. Automate those. Keep the judgment steps human.
The GTM Agent Landscape
| Agent type | What it does | Replaces | Human-in-the-loop? |
|---|---|---|---|
| Research Agent | Pulls company and contact data from multiple sources, synthesizes into an account brief | Manual account research (45-90 min per account) | Review output before use |
| Email Writer Agent | Generates cold email sequences from an account brief | SDR writing emails (15-30 min per sequence) | Approve before send |
| Personalization Agent | Inserts per-prospect tokens into templated sequences | SDR personalizing templates (5-10 min per email) | Spot-check 10-20% |
| Reply Classifier Agent | Classifies inbound replies (positive, objection, OOO, opt-out) | SDR triaging inbox (ongoing) | Review low-confidence classifications |
| Lead Scorer Agent | Scores inbound leads on ICP fit and intent signals | RevOps manual scoring or static rules | Calibrate model quarterly |
| Enrichment Agent | Fills missing data fields from multiple providers | Ops team running enrichment workflows | Validate match rates |
| Signal Monitor Agent | Watches for buying signals across data sources, alerts when triggered | Manual signal scanning (daily) | Set alert thresholds |
| Routing Agent | Routes leads to the right rep based on territory, segment, and availability | RevOps routing rules in CRM | Audit routing accuracy weekly |
| Meeting Prep Agent | Generates pre-call briefs from CRM data, research, and prior notes | AE/SDR manual prep (15-30 min per meeting) | Read before the call |
| Follow-Up Agent | Generates post-meeting follow-up emails from call notes | AE writing follow-ups (10-20 min per email) | Edit and approve before send |
Agent Design Process
Step 1: Map the human workflow
Before writing any code or prompts, document exactly what a person does today.
Workflow mapping template:
For each step, capture:
| Field | What to document |
|---|---|
| Step name | What the person does ("Find company funding history") |
| Input | What they start with ("Company name") |
| Source | Where they get the data ("Crunchbase, press articles") |
| Action | What they do with it ("Read, extract key facts, summarize") |
| Output | What they produce ("Funding summary: round, amount, date, investors") |
| Time | How long it takes ("5-10 minutes") |
| Judgment required? | Does this step require human judgment or is it rule-based? |
| Error rate | How often do humans get this wrong? |
Example: SDR account research workflow
| Step | Input | Source | Action | Output | Time | Judgment? |
|---|---|---|---|---|---|---|
| 1. Company snapshot | Company name | LinkedIn, website | Read about page, note size/stage/vertical | Snapshot fields | 2 min | No |
| 2. Funding history | Company name | Crunchbase | Search, extract rounds | Funding summary | 3 min | No |
| 3. Recent signals | Company name | LinkedIn, news, job boards | Scan for events in last 90 days | Signal list | 10 min | Low |
| 4. Tech stack | Company name | Job postings, BuiltWith | Extract tool mentions | Stack list | 5 min | Low |
| 5. Committee mapping | Company name | LinkedIn | Search titles, identify roles | Contact list | 10 min | Medium |
| 6. Problem hypothesis | All above | None (synthesis) | Connect signals to pain, write hypothesis | 1 paragraph | 10 min | High |
| 7. Email drafting | Account brief | None (writing) | Craft 3-email sequence | 3 emails | 15 min | High |
Steps 1-4 are low-judgment, high-repetition. Automate these. Step 5 requires moderate judgment. Semi-automate (agent proposes, human validates). Steps 6-7 require high judgment on synthesis, tone, and quality. The agent drafts, a human approves.
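To make the mapping actionable, the judgment column can drive the automation decision directly. A minimal sketch, assuming a simple four-level judgment scale; names and thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass
class WorkflowStep:
    name: str
    minutes: int
    judgment: str  # "none" | "low" | "medium" | "high"

def automation_mode(step: WorkflowStep) -> str:
    """Map the judgment level from the workflow table to an automation decision."""
    if step.judgment in ("none", "low"):
        return "automate"                                        # steps 1-4 in the example
    if step.judgment == "medium":
        return "semi-automate: agent proposes, human validates"  # step 5
    return "agent drafts, human approves"                        # steps 6-7

for s in [WorkflowStep("Company snapshot", 2, "none"),
          WorkflowStep("Committee mapping", 10, "medium"),
          WorkflowStep("Email drafting", 15, "high")]:
    print(f"{s.name}: {automation_mode(s)}")
```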
Step 2: Define the agent's scope
One agent, one job. Draw the boundary tight.
Scoping rules:
- An agent should complete in under 60 seconds for real-time tasks (reply classification, routing) or under 5 minutes for batch tasks (research, email writing)
- An agent should have a single, testable output. "Account brief" is testable. "Help with sales" is not
- If the agent needs more than 5 tools, it's probably too broad. Split it
- If the system prompt is over 2,000 words, it's probably covering multiple jobs. Split it
- If you're writing "if the input is X, do this; if the input is Y, do that" in the prompt, you need a router and two specialist agents, not one agent with branching logic
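A minimal sketch of that last rule: a thin router dispatching to two specialist agents rather than one agent with branching prompt logic. The task shape, `handle_reply`, and `handle_new_lead` are hypothetical stand-ins for separate agents with their own prompts and tools:

```python
def route(task: dict) -> str:
    """Dispatch on a simple, testable input property instead of branching in one prompt."""
    if task.get("type") == "inbound_reply":
        return handle_reply(task)      # specialist 1: own prompt, tools, and evals
    if task.get("type") == "new_lead":
        return handle_new_lead(task)   # specialist 2
    return escalate_to_human(task)     # unknown input: never guess

def handle_reply(task: dict) -> str:
    return f"reply classifier handled {task['id']}"

def handle_new_lead(task: dict) -> str:
    return f"lead scorer handled {task['id']}"

def escalate_to_human(task: dict) -> str:
    return f"task {task['id']} queued for human review"

print(route({"type": "inbound_reply", "id": "t-42"}))
```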
Step 3: Design the system prompt
The system prompt is the agent's operating manual. It determines output quality more than any other design choice.
System prompt structure:
1. Role and purpose (2-3 sentences)
2. Input specification (what the agent receives)
3. Output specification (exact format, schema, required fields)
4. Rules and constraints (hard rules the agent must follow)
5. Examples (2-3 input/output pairs showing ideal behavior)
6. Edge cases (what to do when data is missing or ambiguous)
Prompt design rules:
- Lead with the role. "You are a B2B account research agent that produces structured account briefs from company names" is better than a paragraph of context
- Specify the output format exactly. If you want JSON, show the schema. If you want markdown, show the template. Ambiguous output specs produce inconsistent results
- Hard rules are non-negotiable constraints. "Never use em-dashes. Never exceed 80 words. Never fabricate a signal." These go in a dedicated rules section, not buried in paragraphs
- Examples are the most powerful part of the prompt. Two good examples teach the agent more than 500 words of instructions. Show the input, the ideal output, and annotate why the output is good
- Address missing data explicitly. "If funding data is not found, output 'Funding: Not found (checked Crunchbase, PitchBook)' instead of guessing" prevents hallucination
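A skeleton following this six-part structure, shown as a Python constant. The schema, rules, and example below are illustrative, not a recommended prompt:

```python
# Hypothetical system prompt skeleton; every rule and field is illustrative.
SYSTEM_PROMPT = """\
You are a B2B account research agent that produces structured account
briefs from company names.

INPUT: JSON with "company_name" (required) and "domain" (optional).

OUTPUT: JSON matching this schema exactly:
{"snapshot": str, "funding": str, "signals": [str], "confidence": {str: str}}

RULES:
- Never fabricate a signal. Cite a source for every claim.
- If funding data is not found, output "Funding: Not found (sources checked)".
- Keep "snapshot" under 80 words.

EXAMPLES:
Input: {"company_name": "Acme Robotics"}
Output: {"snapshot": "...", "funding": "$45M Series B (Crunchbase)", ...}

EDGE CASES:
- Stealth company with no public data: return every field as "Not found",
  set every confidence value to "low", and do not guess.
"""
```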
Step 4: Select tools
Tools are the actions an agent can take: search the web, query an API, read a database, call an MCP server.
Common GTM agent tools:
| Tool | What it does | Used by |
|---|---|---|
| Web search | Searches the internet for company information | Research Agent, Signal Monitor |
| LinkedIn API / scraper | Pulls profile and company data from LinkedIn | Research Agent, Committee Mapper |
| CRM read/write | Reads and updates CRM records | Enrichment Agent, Routing Agent, Scorer |
| Enrichment API (Apollo, Clearbit) | Fills missing contact and company data | Enrichment Agent |
| Email sending API (Lemlist, Outreach) | Loads sequences and sends emails | Email Writer (with human approval gate) |
| Calendar API | Books meetings, checks availability | Meeting Booker Agent |
| Slack API | Sends alerts and notifications | Signal Monitor, Routing Agent |
| File read/write | Reads CSVs, writes reports | Batch processing agents |
Tool design rules:
- Every tool that writes to an external system (CRM, email, Slack) should have a confirmation step in development and a human approval gate in production
- Tools should return structured data, not raw HTML or API responses. Parse before returning to the agent
- Limit tool count per agent. 3-5 tools is ideal. Above 7, the agent spends more time deciding which tool to use than doing the work
- Include error information in tool responses. "Search returned 0 results for [query]" is better than an empty response. The agent needs to know when data is missing vs when the tool failed
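A sketch of the first and last rules together: tools return structured results that distinguish "no data found" from "tool failed", and external writes sit behind an approval gate. The `fake_*` backends are hypothetical placeholders:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    ok: bool
    data: dict | None
    error: str | None   # distinguishes "no data found" from "tool failed"

def search_company(query: str) -> ToolResult:
    hits = fake_search(query)   # hypothetical search backend
    if hits is None:
        return ToolResult(ok=False, data=None, error="search backend unreachable")
    if not hits:
        return ToolResult(ok=True, data={"hits": []},
                          error=f"search returned 0 results for {query!r}")
    return ToolResult(ok=True, data={"hits": hits}, error=None)

def update_crm(record_id: str, fields: dict, approved: bool = False) -> ToolResult:
    if not approved:   # human approval gate on all external writes
        return ToolResult(ok=False, data=None,
                          error="write blocked: pending human approval")
    fake_crm_write(record_id, fields)   # hypothetical CRM client call
    return ToolResult(ok=True, data={"updated": record_id}, error=None)

def fake_search(query: str) -> list | None:
    return []

def fake_crm_write(record_id: str, fields: dict) -> None:
    pass
```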
Step 5: Define evaluation criteria
Before building, define how you'll measure whether the agent works.
Evaluation framework:
| Dimension | What to measure | How to measure | Minimum bar |
|---|---|---|---|
| Accuracy | Are the facts correct? | Human review of 50 outputs against ground truth | 95%+ factual accuracy |
| Completeness | Are all required fields populated? | Automated schema check | 90%+ field completion |
| Rule compliance | Does output follow all hard rules? | Automated rule checker (word count, banned phrases, format) | 100% compliance |
| Quality | Is the output good enough to use? | Human rating (1-5 scale) on 50 outputs | Average ≥ 4.0 |
| Latency | How long does it take? | Timer per run | Under threshold (60 s real-time, 5 min batch) |
| Cost | How much does it cost per run? | Token tracking | Under unit economics threshold |
Evaluation rules:
- Define the minimum bar before building. "We'll know it's good enough when..." should be answerable before writing the first prompt
- Accuracy and rule compliance are non-negotiable. Quality and latency can be traded off
- Test on at least 50 inputs before deploying. 5 test cases is a demo, not a test
- Measure cost per unit of output. "$0.15 per account brief" or "$0.03 per email." If the agent costs more than the human time it replaces, the economics don't work
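A minimal sketch of the automated rule checker and per-run cost tracking described above. The banned phrases, word limit, and per-token prices are placeholder assumptions, not real model pricing:

```python
BANNED_PHRASES = ["quick question", "just checking in"]  # illustrative hard rules
MAX_WORDS = 80

def check_rules(email_body: str) -> list[str]:
    """Return rule violations; an empty list means 100% compliance."""
    violations = []
    if len(email_body.split()) > MAX_WORDS:
        violations.append(f"exceeds {MAX_WORDS} words")
    for phrase in BANNED_PHRASES:
        if phrase in email_body.lower():
            violations.append(f"banned phrase: {phrase!r}")
    if "\u2014" in email_body:
        violations.append("contains em-dash")
    return violations

def cost_per_run(input_tokens: int, output_tokens: int,
                 in_price: float = 3e-6, out_price: float = 15e-6) -> float:
    """Dollar cost per run; per-token prices here are placeholders, not real pricing."""
    return input_tokens * in_price + output_tokens * out_price

body = "Saw your Series B announcement. Teams at your stage cut ramp time 40%."
print(check_rules(body), f"${cost_per_run(900, 120):.4f}")  # [] $0.0045
```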
GTM Agent Archetypes
The Research Agent
Purpose: Transform a company name into a structured account brief.
Input: Company name, domain, optional ICP criteria
Output: Structured account brief with: company snapshot, funding history, recent signals, tech stack indicators, 3-5 committee contacts, problem hypothesis
Key design decisions:
- Use web search + LinkedIn as primary tools. Add Crunchbase API if available
- Structure output as JSON or structured markdown. Free-form summaries are harder for downstream agents to parse
- Include a confidence score per field. "Funding: $45M Series B (high confidence, Crunchbase)" vs "ARR: ~$15M (low confidence, estimated from headcount)" (see the schema sketch after this list)
- Handle private/stealth companies explicitly. "Limited public information available. Brief is incomplete" is better than a hallucinated profile
- Time-cap research at 60 seconds per account. Beyond that, diminishing returns
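A possible shape for that output, sketched with Python dataclasses: every field carries a value, a confidence level, and a source. Field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class BriefField:
    value: str
    confidence: str  # "high" | "medium" | "low"
    source: str      # where the fact came from, or "not found"

@dataclass
class AccountBrief:
    company: str
    snapshot: BriefField
    funding: BriefField
    signals: list[BriefField] = field(default_factory=list)
    contacts: list[str] = field(default_factory=list)
    hypothesis: BriefField | None = None

brief = AccountBrief(
    company="Acme Robotics",
    snapshot=BriefField("120-person Series B robotics company", "high", "LinkedIn"),
    funding=BriefField("$45M Series B, 2024", "high", "Crunchbase"),
)
brief.signals.append(BriefField("Hiring 6 SDRs", "medium", "job boards"))
```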
The Email Writer Agent
Purpose: Generate a cold email sequence from an account brief.
Input: Account brief (from Research Agent or human), target contact name/title, product value prop
Output: 3-email sequence with subject lines, bodies, and send timing
Key design decisions:
- Embed all cold-outbound-email-writing rules directly in the system prompt. Word limits, banned phrases, signal requirements, subject line rules
- Include 2-3 examples of ideal output in the prompt. Examples train the model better than rules alone
- Use a QA loop: Writer → Critic → Rewrite if needed. Cap at 3 iterations (sketched after this list)
- Separate "generate" from "personalize." The writer creates the template sequence. A personalization agent inserts per-contact tokens. This separation lets you reuse the same sequence across contacts with different personalization
- Output should include the raw email text plus metadata: word count per email, signal used, proof point used, subject line pattern used
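A sketch of that QA loop, with `draft_sequence` and `critique` as hypothetical stand-ins for the writer and critic model calls:

```python
def generate_with_qa(brief: dict, max_iterations: int = 3) -> dict:
    """Writer drafts, critic reviews, writer rewrites; capped at 3 iterations."""
    draft = draft_sequence(brief)                        # writer agent call
    for _ in range(max_iterations):
        issues = critique(draft)                         # critic agent call
        if not issues:
            return {"sequence": draft, "status": "passed"}
        draft = draft_sequence(brief, feedback=issues)   # rewrite against feedback
    # Cap reached: never loop forever; hand off to a human instead
    return {"sequence": draft, "status": "needs_human_review"}

def draft_sequence(brief: dict, feedback: list[str] | None = None) -> str:
    return "email 1 ... email 2 ... email 3"   # stand-in for the writer model

def critique(draft: str) -> list[str]:
    return []   # stand-in for the critic model; empty = no issues found
```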
The Reply Classifier Agent
Purpose: Classify inbound email replies into actionable categories.
Input: Reply email text, original outbound email text, prospect metadata
Output: Classification (positive, objection, question, OOO, opt-out, irrelevant), confidence score, recommended next action
Key design decisions:
- Use a fast, cheap model (Haiku). Classification doesn't need Opus
- Define 6-8 categories with clear boundaries and 3+ examples per category
- Include a confidence threshold. Below 80% confidence, route to human review
- Output the recommended next action alongside the classification. "Positive reply. Recommended: send meeting booking email with 3 time slots" gives the downstream system or human a clear next step
- Handle multi-intent replies. "I'm interested but I'm OOO until the 15th" is both positive and OOO. The classifier should detect both and route accordingly
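A sketch of the threshold routing and multi-intent handling above; `classify` is a hypothetical stand-in for the cheap-model call:

```python
CONFIDENCE_THRESHOLD = 0.80

def handle_reply(reply_text: str) -> dict:
    result = classify(reply_text)   # stand-in for a fast, cheap model call
    if result["confidence"] < CONFIDENCE_THRESHOLD:
        return {"route": "human_review", **result}
    actions = []
    # Multi-intent: one reply can carry several labels at once
    if "positive" in result["labels"]:
        actions.append("send meeting booking email with 3 time slots")
    if "ooo" in result["labels"]:
        actions.append("delay follow-up until stated return date")
    if "opt_out" in result["labels"]:
        actions = ["suppress contact immediately"]   # opt-out overrides everything
    return {"route": "auto", "actions": actions, **result}

def classify(text: str) -> dict:
    return {"labels": ["positive", "ooo"], "confidence": 0.91}

print(handle_reply("I'm interested but I'm OOO until the 15th"))
```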
The Signal Monitor Agent
Purpose: Continuously watch data sources for buying signals on target accounts.
Input: List of target accounts, signal definitions (what to watch for)
Output: Alert when a signal is detected: account name, signal type, signal details, signal strength, recommended action
Key design decisions:
- Run on a schedule (daily or weekly), not real-time. Most buying signals don't require instant response
- Define signal types with explicit detection criteria. "Funding round" = specific press release or Crunchbase entry, not "they seem to be growing"
- Deduplicate signals. The same funding round shouldn't trigger 5 alerts from 5 sources (see the sketch after this list)
- Include signal strength scoring. A Series B announcement is stronger than a LinkedIn post about growth plans
- Route alerts to the right person. Signal on a Tier 1 ABM account goes to the ABM marketer. Signal on a Tier 3 account goes to the SDR queue
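One way to deduplicate, sketched below: key each alert on the underlying event rather than on the source that reported it. The keying scheme is an assumption:

```python
import hashlib

_seen: set[str] = set()

def signal_key(account: str, signal_type: str, detail: str) -> str:
    """Stable key per underlying event, independent of which source reported it."""
    raw = f"{account}|{signal_type}|{detail}".lower()
    return hashlib.sha256(raw.encode()).hexdigest()

def maybe_alert(account: str, signal_type: str, detail: str, source: str) -> bool:
    key = signal_key(account, signal_type, detail)
    if key in _seen:
        return False   # duplicate: this event already triggered an alert
    _seen.add(key)
    print(f"ALERT: {account} / {signal_type} / {detail} (via {source})")
    return True

maybe_alert("Acme Robotics", "funding", "Series B $45M", "Crunchbase")   # alerts
maybe_alert("Acme Robotics", "funding", "Series B $45M", "TechCrunch")   # suppressed
```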
Production Deployment
Progressive rollout
| Phase | Duration | Volume | Human review | Goal |
|---|---|---|---|---|
| 1. Prototype | Week 1-2 | 10 test inputs | 100% | Does it work at all? |
| 2. Pilot | Week 3-6 | 50-100 real inputs | 100% | Does output quality meet the bar? |
| 3. Controlled launch | Week 7-12 | Full volume | 50% spot-check | Does quality hold at scale? |
| 4. Production | Week 12+ | Full volume | 10-20% spot-check | Ongoing quality assurance |
Rollout rules:
- Never skip phases. A prototype that works on 10 test inputs may fail at 100 real inputs
- Define phase advancement criteria before starting. "Advance from Pilot to Controlled Launch when accuracy ≥ 95% on 50+ human-reviewed outputs"
- Keep a human fallback throughout. If the agent goes down or quality drops, the team can revert to the manual process immediately
- Track unit economics from Phase 2 onward. Cost per output, time saved per output, quality vs human baseline
Monitoring in production
| Metric | Check frequency | Alert threshold |
|---|---|---|
| Output quality score (from human spot-checks) | Weekly | Average drops below 3.5/5 |
| Rule compliance rate | Daily (automated) | Any rule violation |
| Latency per run | Per run | Exceeds 2x baseline |
| Cost per run | Daily | Exceeds budget by 20% |
| Error rate | Per run | Above 5% |
| Human override rate | Weekly | Above 30% (agent outputs being rejected) |
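These thresholds are straightforward to automate. A minimal sketch, with metric names as illustrative assumptions:

```python
def check_metrics(m: dict) -> list[str]:
    """Compare a run's metrics to the alert thresholds in the table above."""
    alerts = []
    if m["quality_score"] < 3.5:
        alerts.append("quality below 3.5/5")
    if m["rule_violations"] > 0:
        alerts.append("rule violation detected")       # any violation alerts
    if m["latency_s"] > 2.0 * m["baseline_latency_s"]:
        alerts.append("latency exceeds 2x baseline")
    if m["cost"] > 1.20 * m["budget"]:
        alerts.append("cost exceeds budget by 20%+")
    if m["error_rate"] > 0.05:
        alerts.append("error rate above 5%")
    if m["override_rate"] > 0.30:
        alerts.append("human override rate above 30%")
    return alerts

print(check_metrics({"quality_score": 4.2, "rule_violations": 0,
                     "latency_s": 30, "baseline_latency_s": 25,
                     "cost": 0.12, "budget": 0.15,
                     "error_rate": 0.01, "override_rate": 0.10}))   # -> []
```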
Anti-Pattern Check
- Starting with the technology instead of the workflow. "Let's build an agent with Claude and MCP" before mapping the human process produces agents that don't fit the actual need. Map the workflow first
- Building one agent to do everything. A "GTM Agent" that researches, writes emails, scores leads, and updates CRM will be mediocre at all four. One agent, one job
- No human review on customer-facing output. An agent sending emails without human approval will eventually send something embarrassing to a Tier 1 account. The cost of that one bad email exceeds the cost of reviewing 1,000 good ones
- Optimizing for speed before quality. A fast agent that produces bad output is worse than no agent. Get quality right in Pilot phase. Optimize speed in Production
- No evaluation framework. "It seems to work" is not evaluation. Define quantitative criteria (accuracy, compliance, cost) before building and measure against them continuously
- Skipping the prototype phase. Going straight to full-volume deployment because "the prompt looks good" leads to expensive failures. Test on 10 inputs first. Always
- Using the most expensive model for every agent. Match model to cognitive demand. Classification = Haiku. Extraction = Sonnet. Generation = Opus or Sonnet. Using Opus for routing is burning money
- No fallback plan. If the agent breaks at 2am, what happens? If there's no answer, you're not ready for production. Maintain the manual process as a fallback until the agent has 30+ days of stable operation
- Treating agent design as a one-time project. Prompts drift. Data sources change. Quality degrades. Agent design is ongoing. Budget for weekly monitoring and monthly iteration