
agent-orchestration

This skill should be used when the user asks to "orchestrate multiple agents", "coordinate agents", "design a multi-agent system", "build an agent pipeline", "chain agents together", "hand off between agents", "design agent workflows", "build an agent DAG", "route between agents", or any variation of designing and implementing systems where multiple AI agents work together on GTM tasks for B2B SaaS.

Agent Orchestration

Agent orchestration is the design of systems where multiple specialized agents work together to complete a task no single agent could handle well alone. Each agent owns one job. The orchestration layer decides which agent runs, when, and what context it receives.

The principle: one agent, one job. An agent that researches accounts, writes emails, scores leads, and updates the CRM is not an agent. It's a monolith with an LLM inside. It will be mediocre at all four tasks.

When to Use Multi-Agent vs Single Agent

| Scenario | Use single agent | Use multi-agent |
|---|---|---|
| Task is one step with clear input/output | Yes | No |
| Task requires 2+ distinct skill sets | No | Yes |
| Output quality varies by subtask | No | Yes (specialize each) |
| Task has branching logic (if X do Y, else Z) | Maybe | Yes (router + specialists) |
| Task requires human review at intermediate steps | No | Yes (pause between agents) |
| Task operates on different data at different stages | No | Yes (each agent gets relevant context) |

Rule of thumb: If you're writing one system prompt longer than 2,000 words to cover multiple distinct tasks, split into agents.


The 4 Orchestration Patterns

Pattern 1: Pipeline (Sequential)

Agents run in order. Each agent's output becomes the next agent's input.

[Agent A] → [Agent B] → [Agent C] → Output

When to use: The task has clear stages where each stage's output is the next stage's input. Account research → email writing → sequence loading.

Example: Account research → cold email pipeline

| Step | Agent | Input | Output |
|---|---|---|---|
| 1 | Research Agent | Company name + ICP criteria | Account brief (company snapshot, signals, committee, problem hypothesis) |
| 2 | Email Writer Agent | Account brief | 3-email sequence (subject lines, bodies, per cold-outbound-email-writing rules) |
| 3 | Personalization Agent | Email drafts + LinkedIn data | Emails with per-contact personalization tokens inserted |
| 4 | QA Agent | Final emails | Pass/fail on word count, banned phrases, signal presence, subject line rules |

Pipeline rules:

  • Define the contract between agents explicitly. Agent A's output schema must match Agent B's expected input schema. Mismatches cause silent quality degradation
  • Each agent should be independently testable. If Agent B produces bad output, you need to know whether the problem is Agent B's prompt or Agent A's output
  • Include a QA agent at the end of any pipeline that produces customer-facing output. The QA agent checks rules compliance, not creativity
  • Keep pipelines to 4-5 stages max. Each stage introduces latency and potential error propagation. Beyond 5, refactor
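
A minimal sketch of the contract idea in Python, using stdlib dataclasses as the inter-agent schema. The agent functions are hypothetical stubs standing in for real model calls; the point is that Agent A's output type is Agent B's input type, so a mismatch fails at the handoff instead of degrading silently.

```python
from dataclasses import dataclass

# Explicit contract: the Research Agent's output type IS the Email
# Writer's input type, so a mismatch fails at the handoff, not downstream.
@dataclass
class AccountBrief:
    company: str
    snapshot: str
    signals: list[str]
    problem_hypothesis: str

@dataclass
class EmailSequence:
    brief: AccountBrief
    emails: list[str]  # exactly three bodies, per sequence rules

def research_agent(company: str, icp: str) -> AccountBrief:
    # Hypothetical stub: a real version would call a model plus data sources
    return AccountBrief(company, "snapshot...", ["hiring spike"], "hypothesis...")

def email_writer_agent(brief: AccountBrief) -> EmailSequence:
    # Hypothetical stub: consumes only the structured brief, never raw HTML
    return EmailSequence(brief, ["email 1", "email 2", "email 3"])

def run_pipeline(company: str, icp: str) -> EmailSequence:
    brief = research_agent(company, icp)  # stage 1
    return email_writer_agent(brief)      # stage 2: typed handoff
```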

Pattern 2: Router (Branching)

A router agent classifies the input and sends it to the right specialist agent.

                ┌→ [Agent A]
[Router] ───────┼→ [Agent B]
                └→ [Agent C]

When to use: Different inputs require fundamentally different handling. Inbound lead classification (demo request vs. pricing question vs. support ticket). Reply classification (positive vs. objection vs. out-of-office).

Example: Inbound reply router

| Classification | Routed to | Action |
|---|---|---|
| Positive reply ("interested, let's talk") | Meeting Booker Agent | Propose 2-3 meeting times, include Calendly link |
| Objection ("we already have a tool") | Objection Handler Agent | Acknowledge, reframe, offer a different angle |
| Question ("how does pricing work?") | Info Agent | Answer the specific question, then softly re-ask for the meeting |
| Out-of-office | OOO Handler Agent | Parse return date, reschedule follow-up |
| Not interested ("remove me") | Opt-Out Agent | Confirm removal, update CRM, stop sequence |
| Irrelevant / spam | None | Log and discard. No response |

Router rules:

  • The router should be fast and cheap. Use a smaller/faster model for classification. The router doesn't generate content; it classifies
  • Define every category explicitly with examples. Ambiguous categories cause misrouting. "Positive-ish" is not a category
  • Include a fallback/escalation path. If the router can't classify with confidence, route to a human, not a random agent
  • Log every routing decision with the confidence score. Low-confidence routes are where quality breaks down. Review them weekly
  • Never let the router also be a specialist. The router classifies. It does not act. Mixing classification and action in one agent creates bias
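
One way the router contract might look in code. The `classify` function is a hypothetical stub standing in for a small, fast classification model; the category names and the 0.8 confidence floor are illustrative, not prescriptive.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RouteDecision:
    category: str      # must be one of the explicit categories below
    confidence: float  # 0.0-1.0, reported by the classifier

CONFIDENCE_FLOOR = 0.8  # below this, escalate to a human, never guess

def classify(reply: str) -> RouteDecision:
    # Hypothetical stub for a small, fast model that ONLY classifies
    return RouteDecision(category="positive", confidence=0.92)

HANDLERS: dict[str, Callable[[str], str]] = {
    "positive":       lambda r: "meeting-booker: " + r,
    "objection":      lambda r: "objection-handler: " + r,
    "question":       lambda r: "info-agent: " + r,
    "out_of_office":  lambda r: "ooo-handler: " + r,
    "not_interested": lambda r: "opt-out-agent: " + r,
}

def route(reply: str) -> str:
    decision = classify(reply)
    # Log every routing decision with its confidence score
    print(f"route={decision.category} conf={decision.confidence:.2f}")
    if decision.confidence < CONFIDENCE_FLOOR:
        return "escalated to human"   # fallback path, never a random agent
    handler = HANDLERS.get(decision.category)
    if handler is None:               # irrelevant / spam and unknowns
        return "logged and discarded"
    return handler(reply)
```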

Pattern 3: Fan-Out / Fan-In (Parallel)

Multiple agents work on the same input simultaneously. A merge agent combines their outputs.

          ┌→ [Agent A] ─┐
Input ────┼→ [Agent B] ─┼→ [Merge Agent] → Output
          └→ [Agent C] ─┘

When to use: Multiple independent analyses of the same input. Account research from different sources. Multi-angle email generation for A/B testing. Scoring an account on multiple dimensions simultaneously.

Example: Parallel account research

| Agent | Input | Researches | Output |
|---|---|---|---|
| Funding Agent | Company name | Crunchbase, press, SEC filings | Funding history, investors, runway estimate |
| Hiring Agent | Company name | LinkedIn, job boards | Open roles, hiring velocity, org structure signals |
| Tech Stack Agent | Company name + domain | Job postings, BuiltWith, GitHub | Current stack, recent changes, integration points |
| Signal Agent | Company name | News, LinkedIn posts, G2 reviews | Active signals, recency, strength |
| Merge Agent | All four outputs | Combines, deduplicates, synthesizes | Unified account brief |

Fan-out rules:

  • Each parallel agent must be independent. If Agent B needs Agent A's output to do its job, it's not parallel. It's a pipeline
  • The merge agent is not optional. Raw concatenation of parallel outputs is not orchestration. The merge agent resolves conflicts, removes duplicates, and produces a unified document
  • Set timeouts per agent. If one parallel agent hangs, the others shouldn't wait indefinitely. Return partial results after the timeout
  • Fan-out works best when each agent accesses different data sources. If all agents are reading the same data and interpreting it differently, the outputs will conflict more than complement
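
A sketch of fan-out with per-agent timeouts and partial results, using Python's asyncio. The agent coroutines are hypothetical stubs; a real system would swap in calls to the data sources above.

```python
import asyncio

async def funding_agent(company: str) -> dict:
    # Hypothetical stub; a real agent would query funding data sources
    await asyncio.sleep(0.1)
    return {"funding": "$45M Series B"}

async def hiring_agent(company: str) -> dict:
    # Hypothetical stub; a real agent would query job boards
    await asyncio.sleep(0.1)
    return {"open_roles": 14}

async def research_fan_out(company: str, timeout_s: float = 30.0) -> dict:
    # Each independent agent gets its own timeout, so one hung agent
    # can't stall the rest; failures become flagged gaps, not a dead run
    tasks = {
        "funding": asyncio.wait_for(funding_agent(company), timeout_s),
        "hiring": asyncio.wait_for(hiring_agent(company), timeout_s),
    }
    outputs = await asyncio.gather(*tasks.values(), return_exceptions=True)
    partial = {
        name: (out if not isinstance(out, Exception) else None)
        for name, out in zip(tasks, outputs)
    }
    return merge_agent(partial)

def merge_agent(partial: dict) -> dict:
    # Hypothetical stub for the merge agent; here it only tags the gaps.
    # A real one resolves conflicts and deduplicates across sources
    return {"brief": partial, "missing": [k for k, v in partial.items() if v is None]}

print(asyncio.run(research_fan_out("Acme Corp")))
```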

Pattern 4: Loop (Iterative)

An agent produces output. A critic agent evaluates it. If it fails, the output cycles back to the producer with feedback.

[Producer Agent] → [Critic Agent] → Pass? → Output
        ↑                             │ Fail
        └──────── feedback ──────────┘

When to use: Output quality is critical and objectively measurable. Email writing with strict rules. Data extraction with validation. Any task where "good enough on first try" isn't reliable enough.

Example: Email writing with QA loop

| Cycle | Producer | Critic checks | Pass/fail |
|---|---|---|---|
| 1 | Writes 3-email sequence | Word count, banned phrases, signal presence, subject line rules, em-dash check | Fail: Email 1 is 95 words, uses "leveraging" |
| 2 | Rewrites with critic feedback | Same checks | Fail: Email 3 has a pitch in it |
| 3 | Rewrites Email 3 only | Same checks | Pass: all rules met |

Loop rules:

  • Cap iterations. 3 attempts max. If the producer can't pass the critic in 3 tries, the prompt or the rules need fixing, not a 4th attempt
  • The critic must use objective, binary checks. "Is this email good?" is not a valid critic check. "Is the word count under 80?" is
  • Feed specific failure reasons back to the producer. "Failed" is not feedback. "Email 1 is 15 words over limit. Remove the second sentence in paragraph 2" is feedback
  • Log every iteration. Patterns in critic feedback reveal systematic prompt issues. If the producer consistently fails on the same rule, fix the producer's prompt
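
A minimal producer/critic loop with a hard cap of 3 iterations. The producer is a hypothetical stub; the critic shows what objective, binary checks look like, and the word limit and banned-phrase list are illustrative.

```python
MAX_ITERATIONS = 3
WORD_LIMIT = 80                    # illustrative thresholds
BANNED = {"leveraging", "synergy"}

def critic(email: str) -> list[str]:
    # Objective, binary checks only; returns specific failure reasons
    failures = []
    words = len(email.split())
    if words > WORD_LIMIT:
        failures.append(f"{words - WORD_LIMIT} words over the {WORD_LIMIT}-word limit")
    for phrase in BANNED:
        if phrase in email.lower():
            failures.append(f"banned phrase: {phrase!r}")
    return failures

def producer(feedback: list[str]) -> str:
    # Hypothetical stub; a real producer prompts a model and includes
    # the critic's specific failure reasons on every retry
    return "Saw the new SRE postings. Guessing incident load is up?"

def write_with_qa_loop() -> str:
    feedback: list[str] = []
    for attempt in range(1, MAX_ITERATIONS + 1):
        draft = producer(feedback)
        feedback = critic(draft)
        if not feedback:
            return draft           # pass: all rules met
    # 3 strikes: fix the prompt or the rules, don't take a 4th attempt
    raise RuntimeError(f"failed after {MAX_ITERATIONS} attempts: {feedback}")
```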

Orchestration Architecture

Context management

The most common orchestration failure is context pollution: passing too much information between agents, degrading each agent's focus.

Rules:

  • Each agent receives only the context it needs. The email writer doesn't need raw Crunchbase HTML. It needs the structured account brief
  • Define explicit schemas for inter-agent data. Use structured output (JSON) for agent outputs that feed into other agents. Free-form text between agents causes parsing failures
  • Strip intermediate reasoning. Agent A's chain-of-thought is noise for Agent B. Pass conclusions, not the reasoning process
  • Include metadata with context: source, timestamp, confidence level. "Funding: $45M Series B (Crunchbase, Oct 2025, confirmed)" is better than "They raised $45M"
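
A sketch of what a metadata-carrying handoff might look like. The field names are illustrative; the point is that each fact travels with its source, date, and confidence, serialized as JSON at the agent boundary.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Fact:
    claim: str       # the conclusion only; strip the reasoning behind it
    source: str      # where it came from
    as_of: str       # ISO date it was observed
    confidence: str  # e.g. "confirmed" / "inferred"

brief = {
    "company": "Acme Corp",
    "facts": [asdict(Fact("Raised $45M Series B", "Crunchbase", "2025-10-01", "confirmed"))],
}

# Serialize at the agent boundary: the next agent parses a schema,
# not free-form prose, so a malformed handoff fails loudly
payload = json.dumps(brief)
```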

State management

Orchestrated systems need to track state across agents and across runs.

| State type | Where to store | Example |
|---|---|---|
| Run state (current execution) | In-memory or run database | Which agents have completed, current stage, intermediate outputs |
| Account state (persistent) | CRM or account database | Committee map, signal inventory, prior outreach history |
| Agent config (static) | Config files or environment | System prompts, model selection, temperature, tool access |
| Quality metrics (aggregate) | Analytics database | Pass rates, iteration counts, error patterns |

State rules:

  • Never store state in the agent's prompt. Prompts are for instructions, not data. Pass state through structured input
  • Every run should have a unique ID that traces through all agents. Debugging a 4-agent pipeline without trace IDs is impossible
  • Persist intermediate outputs. If Agent C fails, you should be able to restart from Agent C's input without re-running A and B
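
A minimal sketch of trace IDs plus persisted stage outputs, assuming a local `runs/` directory purely for illustration. A production system would likely use a run database, but the shape is the same.

```python
import json
import uuid
from pathlib import Path

RUN_DIR = Path("runs")  # illustrative; a run database plays the same role

def start_run() -> str:
    run_id = str(uuid.uuid4())  # one ID traces through every agent's logs
    (RUN_DIR / run_id).mkdir(parents=True, exist_ok=True)
    return run_id

def save_stage(run_id: str, stage: str, output: dict) -> None:
    # Persist every intermediate output so a failed Agent C can be
    # restarted from its input without re-running A and B
    (RUN_DIR / run_id / f"{stage}.json").write_text(json.dumps(output))

def load_stage(run_id: str, stage: str) -> dict:
    return json.loads((RUN_DIR / run_id / f"{stage}.json").read_text())
```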

Error handling

| Error type | Response | Example |
|---|---|---|
| Agent timeout | Return partial results. Flag incomplete fields | Research Agent hangs on one data source. Return what you have, mark the missing source |
| Agent produces invalid output | Retry once with explicit format instructions. If still invalid, escalate to human | Email Writer returns a 4-email sequence instead of 3 |
| Agent produces low-confidence output | Route to human review. Don't pass to next agent | Router classifies a reply as "positive" with 55% confidence |
| Upstream agent failure | Skip dependent agents. Return partial pipeline results | If Research Agent fails, don't run Email Writer. Return the failure with context |
| Rate limit / API error | Retry with exponential backoff. 3 attempts max | Model API returns 429. Wait 30s, retry. Wait 60s, retry. Then fail |

Error rules:

  • Never silently swallow errors. Every failure should be logged with the agent name, input, and error message
  • Design for partial success. A pipeline that produces 3 out of 4 outputs is more useful than one that produces 0 because step 2 failed
  • Human escalation is a valid error handling strategy. Not every edge case needs an automated fix. Some need a person
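
The backoff row above, as a sketch. In practice, catch only the specific rate-limit exception your client library raises; the bare `Exception` here is a placeholder.

```python
import time

def call_with_backoff(fn, attempts: int = 3, base_delay_s: float = 30.0):
    # Wait 30s, then 60s, then fail: delays double, capped at 3 attempts
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # placeholder: catch your client's rate-limit error
            if attempt == attempts - 1:
                raise      # never swallow; the caller logs agent, input, error
            time.sleep(base_delay_s * (2 ** attempt))
```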

Agent Design Principles for GTM

Keep agents narrow

| Good agent scope | Bad agent scope |
|---|---|
| "Research funding history for one company" | "Research everything about a company" |
| "Write Email 1 of a cold sequence given an account brief" | "Write a full campaign" |
| "Classify an inbound reply into one of 6 categories" | "Handle all inbound replies" |
| "Score ICP fit on 4 firmographic dimensions" | "Decide if we should target this account" |

Narrow agents are easier to test, faster to iterate, and produce more consistent output.

Match model to task

| Task type | Model recommendation | Why |
|---|---|---|
| Classification / routing | Fast, cheap model (Haiku) | Speed matters. Creativity doesn't |
| Data extraction / structuring | Mid-tier model (Sonnet) | Needs accuracy. Doesn't need creativity |
| Content generation (emails, copy) | Top-tier model (Opus/Sonnet) | Quality matters. Worth the cost and latency |
| QA / critic | Mid-tier model (Sonnet) | Rule-checking, not generation. Fast + accurate |
| Synthesis / merge | Top-tier model (Opus/Sonnet) | Combining multiple inputs requires strong reasoning |

Don't use Opus for classification. Don't use Haiku for email writing. Match the model to the cognitive demand.
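
One way to express this as static agent config. The model names are deliberately generic placeholders, not a product recommendation; map them to whatever tiers your provider offers.

```python
# Illustrative static config: map each agent role to the cheapest model
# tier that meets its cognitive demand
MODEL_FOR_ROLE = {
    "router":       {"model": "fast-cheap-model", "temperature": 0.0},
    "extractor":    {"model": "mid-tier-model",   "temperature": 0.0},
    "email_writer": {"model": "top-tier-model",   "temperature": 0.7},
    "critic":       {"model": "mid-tier-model",   "temperature": 0.0},
    "merge":        {"model": "top-tier-model",   "temperature": 0.3},
}
```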

Design for human-in-the-loop

Every production GTM agent system should have human checkpoints.

Where to insert human review:

| Stage | Why | Review type |
|---|---|---|
| After account selection / scoring | Humans catch fit signals models miss | Approve/reject list |
| After email generation | Brand voice, factual accuracy, tone | Edit or approve drafts |
| After reply classification | Misrouted replies damage relationships | Spot-check low-confidence classifications |
| Before CRM updates | Bad data in CRM cascades everywhere | Approve before write |
| Before sending | Final gate on anything customer-facing | Approve batch or individual |

Human-in-the-loop rules:

  • Default to human review for anything customer-facing until you have 30+ days of quality data showing the agent is reliable
  • Automate progressively. Start with human review on 100% of outputs. Drop to 50% after quality stabilizes. Drop to 10% spot-check after 30 days of consistent quality. Never drop to 0%
  • Make review fast. Present the agent's output alongside the input and the rules it was supposed to follow. The reviewer should approve or reject in under 30 seconds per item
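
A sketch of progressive automation as a sampling function. The 7-day cutoff for dropping to 50% is an assumption (the rules above say "after quality stabilizes"); the 30-day cutoff and the never-0% floor mirror the rules. Tune all of it to your own quality data.

```python
import random

def review_rate(days_of_stable_quality: int) -> float:
    if days_of_stable_quality < 7:   # assumed cutoff: quality not yet proven
        return 1.0                   # review everything
    if days_of_stable_quality < 30:
        return 0.5                   # quality stabilizing: review half
    return 0.1                       # long-run spot check; never drop to 0.0

def needs_human_review(days_of_stable_quality: int) -> bool:
    return random.random() < review_rate(days_of_stable_quality)
```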

Testing Orchestrated Systems

Unit testing (per agent)

Test each agent independently with known inputs and expected outputs.

| Test type | What to verify | Example |
|---|---|---|
| Happy path | Agent produces correct output for standard input | Research Agent returns valid account brief for a well-known company |
| Edge case | Agent handles unusual input gracefully | Research Agent handles a company with no funding data |
| Rule compliance | Output follows all specified rules | Email Writer output passes banned-phrase check, word count check |
| Format compliance | Output matches the expected schema | Research Agent returns valid JSON matching the account brief schema |
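
What rule and format compliance tests might look like, pytest-style. The agent call is a stub; in a real suite it would invoke the actual agent with a recorded input.

```python
# Hypothetical pytest-style tests; the agent is stubbed so the
# checks themselves are the point
BANNED = {"leveraging", "synergy"}

def email_writer_agent(brief: dict) -> list[str]:
    # Stub standing in for the real agent invocation
    return ["Saw the new SRE postings. Guessing incident load is up?"] * 3

def test_rule_compliance():
    for email in email_writer_agent({"company": "Acme Corp"}):
        assert len(email.split()) <= 80
        assert not any(p in email.lower() for p in BANNED)

def test_format_compliance():
    emails = email_writer_agent({"company": "Acme Corp"})
    assert isinstance(emails, list) and len(emails) == 3  # exactly 3 emails
```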

Integration testing (across agents)

Test agent-to-agent handoffs with real data flow.

  • Run the full pipeline end-to-end with 5 test accounts
  • Verify each intermediate output is valid input for the next agent
  • Check that the final output meets all quality criteria
  • Measure total latency. If the full pipeline takes 10 minutes per account, that may be acceptable for ABM but not for real-time reply handling

Regression testing

After any prompt change, re-run the full test suite before deploying.

  • Prompt changes to one agent can cascade. A change to the Research Agent's output format may break the Email Writer's parser
  • Keep a golden set of 10-20 test cases with known-good outputs. Compare new outputs against the golden set
  • Track quality metrics over time. A gradual decline in critic pass rates signals prompt drift
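
A sketch of golden-set regression, assuming golden cases are stored as JSON files with `input` and `expected_output` keys (an illustrative layout). Exact-match comparison works for structured outputs; generated prose needs rule-based checks instead.

```python
import json
from pathlib import Path

GOLDEN_DIR = Path("tests/golden")  # illustrative: 10-20 known-good cases

def run_regression(pipeline) -> list[str]:
    # Re-run the pipeline on every golden case; inspect any drift
    # before deploying a prompt change
    failures = []
    for case in sorted(GOLDEN_DIR.glob("*.json")):
        golden = json.loads(case.read_text())
        if pipeline(golden["input"]) != golden["expected_output"]:
            failures.append(case.name)
    return failures
```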

Anti-Pattern Check

  • One agent doing everything. If the system prompt is 3,000+ words covering research, writing, scoring, and CRM updates, split it. Monolith agents are slow, expensive, and unreliable
  • No schema between agents. Passing free-form text between agents causes parsing failures and silent quality degradation. Define structured contracts
  • Router and specialist combined. An agent that classifies a reply and also writes the response will be biased toward classifications that are easy to respond to. Separate the decisions
  • No human review on customer-facing output. Every email, every LinkedIn message, every reply should be human-reviewed until you have data proving the agent is reliable. "It looked good in testing" is not data
  • Retry loops without caps. An email writer that keeps failing the critic and retrying indefinitely will burn tokens and never converge. Cap at 3 iterations
  • Using the most expensive model for every agent. Classification doesn't need Opus. Data extraction doesn't need Opus. Reserve top-tier models for tasks that require strong reasoning or creative output
  • No trace IDs. When something goes wrong in a 4-agent pipeline, you need to trace the exact input and output at every stage. Without trace IDs, debugging is guesswork
  • Designing for full automation on Day 1. Start with human-in-the-loop. Earn automation through demonstrated quality. The cost of a bad automated email to a Tier 1 ABM account is higher than the cost of a human reviewing 50 drafts