---
name: agent-orchestration
slug: agent-orchestration
description: This skill should be used when the user asks to "orchestrate multiple agents", "coordinate agents", "design a multi-agent system", "build an agent pipeline", "chain agents together", "hand off between agents", "design agent workflows", "build an agent DAG", "route between agents", or any variation of designing and implementing systems where multiple AI agents work together on GTM tasks for B2B SaaS.
category: general
---

# Agent Orchestration

Agent orchestration is the design of systems where multiple specialized agents work together to complete a task no single agent could handle well alone. Each agent owns one job. The orchestration layer decides which agent runs, when, and what context it receives.

The principle: one agent, one job. An agent that researches accounts, writes emails, scores leads, and updates the CRM is not an agent. It's a monolith with an LLM inside. It will be mediocre at all four tasks.

## When to Use Multi-Agent vs Single Agent

| Scenario | Use single agent | Use multi-agent |
|----------|-----------------|----------------|
| Task is one step with clear input/output | Yes | No |
| Task requires 2+ distinct skill sets | No | Yes |
| Output quality varies by subtask | No | Yes (specialize each) |
| Task has branching logic (if X do Y, else Z) | Maybe | Yes (router + specialists) |
| Task requires human review at intermediate steps | No | Yes (pause between agents) |
| Task operates on different data at different stages | No | Yes (each agent gets relevant context) |

**Rule of thumb:** If you're writing one system prompt longer than 2,000 words to cover multiple distinct tasks, split into agents.

---

## The 4 Orchestration Patterns

### Pattern 1: Pipeline (Sequential)

Agents run in order. Each agent's output becomes the next agent's input.

```
[Agent A] → [Agent B] → [Agent C] → Output
```

**When to use:** The task has clear stages where each stage's output is the next stage's input. Account research → email writing → sequence loading.

**Example: Account research → cold email pipeline**

| Step | Agent | Input | Output |
|------|-------|-------|--------|
| 1 | Research Agent | Company name + ICP criteria | Account brief (company snapshot, signals, committee, problem hypothesis) |
| 2 | Email Writer Agent | Account brief | 3-email sequence (subject lines, bodies, per cold-outbound-email-writing rules) |
| 3 | Personalization Agent | Email drafts + LinkedIn data | Emails with per-contact personalization tokens inserted |
| 4 | QA Agent | Final emails | Pass/fail on word count, banned phrases, signal presence, subject line rules |

**Pipeline rules:**
- Define the contract between agents explicitly. Agent A's output schema must match Agent B's expected input schema. Mismatches cause silent quality degradation
- Each agent should be independently testable. If Agent B produces bad output, you need to know whether the problem is Agent B's prompt or Agent A's output
- Include a QA agent at the end of any pipeline that produces customer-facing output. The QA agent checks rules compliance, not creativity
- Keep pipelines to 4-5 stages max. Each stage introduces latency and potential error propagation. Beyond 5, refactor
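
The pipeline rules above can be sketched with typed contracts between stages. This is a minimal illustration, not a framework: the agent functions are hypothetical stand-ins for LLM calls, and the dataclass fields are assumptions.

```python
from dataclasses import dataclass

# Hypothetical inter-agent contracts. Each agent's output type is the
# next agent's input type, so a mismatch fails loudly instead of
# degrading quality silently.
@dataclass
class AccountBrief:
    company: str
    signals: list[str]
    problem_hypothesis: str

@dataclass
class EmailSequence:
    company: str
    emails: list[str]

def research_agent(company: str) -> AccountBrief:
    # Stand-in for an LLM call that researches the account.
    return AccountBrief(company, ["hiring SDRs"], "outbound capacity gap")

def email_writer_agent(brief: AccountBrief) -> EmailSequence:
    # Stand-in for an LLM call that drafts the 3-email sequence.
    body = f"Noticed {brief.company} is {brief.signals[0]}."
    return EmailSequence(brief.company, [body, "bump 1", "bump 2"])

def qa_agent(seq: EmailSequence) -> bool:
    # Objective checks only: email count and word limits, not "is it good?".
    return len(seq.emails) == 3 and all(len(e.split()) <= 80 for e in seq.emails)

def run_pipeline(company: str) -> EmailSequence:
    brief = research_agent(company)   # stage 1
    seq = email_writer_agent(brief)   # stage 2
    if not qa_agent(seq):             # final QA gate
        raise ValueError(f"QA failed for {company}")
    return seq
```

Because each stage is a plain function with a typed contract, each agent is independently testable with a fixture input.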

### Pattern 2: Router (Branching)

A router agent classifies the input and sends it to the right specialist agent.

```
                ┌→ [Agent A]
[Router] ───────┼→ [Agent B]
                └→ [Agent C]
```

**When to use:** Different inputs require fundamentally different handling. Inbound lead classification (demo request vs. pricing question vs. support ticket). Reply classification (positive vs. objection vs. out-of-office).

**Example: Inbound reply router**

| Classification | Routed to | Action |
|---------------|-----------|--------|
| Positive reply ("interested, let's talk") | Meeting Booker Agent | Propose 2-3 times, include Calendly link |
| Objection ("we already have a tool") | Objection Handler Agent | Acknowledge, reframe, offer a different angle |
| Question ("how does pricing work?") | Info Agent | Answer the specific question, then softly re-ask for the meeting |
| Out-of-office | OOO Handler Agent | Parse return date, reschedule follow-up |
| Not interested ("remove me") | Opt-Out Agent | Confirm removal, update CRM, stop sequence |
| Irrelevant / spam | Null | Log and discard. No response |

**Router rules:**
- The router should be fast and cheap. Use a smaller/faster model for classification. The router doesn't generate content, it classifies
- Define every category explicitly with examples. Ambiguous categories cause misrouting. "Positive-ish" is not a category
- Include a fallback/escalation path. If the router can't classify with confidence, route to a human, not a random agent
- Log every routing decision with the confidence score. Low-confidence routes are where quality breaks down. Review them weekly
- Never let the router also be a specialist. The router classifies. It does not act. Mixing classification and action in one agent creates bias
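
The router rules can be sketched as classification plus a confidence gate. The `classify` function here is a keyword stand-in for a cheap classifier model call; the categories and the 0.8 confidence floor are assumptions to tune.

```python
# Minimal router sketch: classify, log, and gate on confidence.
CATEGORIES = {"positive", "objection", "question", "ooo", "opt_out", "spam"}
CONFIDENCE_FLOOR = 0.8

def classify(reply: str) -> tuple[str, float]:
    # Stand-in: keyword heuristics in place of a small/fast model call.
    text = reply.lower()
    if "interested" in text:
        return "positive", 0.95
    if "remove me" in text or "unsubscribe" in text:
        return "opt_out", 0.99
    return "spam", 0.40  # uncertain guess -> low confidence

def route(reply: str) -> str:
    category, confidence = classify(reply)
    # Log every decision; low-confidence routes go to a human,
    # never to a random specialist.
    print(f"route={category} conf={confidence:.2f}")
    if category not in CATEGORIES or confidence < CONFIDENCE_FLOOR:
        return "human_review"
    return category
```

Note the router only returns a label; the specialist that acts on it is a separate agent, per the rule above.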

### Pattern 3: Fan-Out / Fan-In (Parallel)

Multiple agents work on the same input simultaneously. A merge agent combines their outputs.

```
         ┌→ [Agent A] ─┐
Input ───┼→ [Agent B] ─┼→ [Merge Agent] → Output
         └→ [Agent C] ─┘
```

**When to use:** Multiple independent analyses of the same input. Account research from different sources. Multi-angle email generation for A/B testing. Scoring an account on multiple dimensions simultaneously.

**Example: Parallel account research**

| Agent | Input | Researches | Output |
|-------|-------|-----------|--------|
| Funding Agent | Company name | Crunchbase, press, SEC filings | Funding history, investors, runway estimate |
| Hiring Agent | Company name | LinkedIn, job boards | Open roles, hiring velocity, org structure signals |
| Tech Stack Agent | Company name + domain | Job postings, BuiltWith, GitHub | Current stack, recent changes, integration points |
| Signal Agent | Company name | News, LinkedIn posts, G2 reviews | Active signals, recency, strength |
| Merge Agent | All four outputs | Combines, deduplicates, synthesizes | Unified account brief |

**Fan-out rules:**
- Each parallel agent must be independent. If Agent B needs Agent A's output to do its job, it's not parallel. It's a pipeline
- The merge agent is not optional. Raw concatenation of parallel outputs is not orchestration. The merge agent resolves conflicts, removes duplicates, and produces a unified document
- Set timeouts per agent. If one parallel agent hangs, the others shouldn't wait indefinitely. Return partial results after the timeout
- Fan-out works best when each agent accesses different data sources. If all agents are reading the same data and interpreting it differently, the outputs will conflict more than complement
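
A fan-out with per-agent timeouts and partial results can be sketched with `concurrent.futures`. The three research agents are hypothetical stand-ins for real data fetches; the merge here is a simple combine step.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

# Each parallel agent is independent: same input, different data source.
def funding_agent(company):
    return {"funding": f"{company}: Series B"}

def hiring_agent(company):
    return {"hiring": f"{company}: 12 open SDR roles"}

def tech_agent(company):
    return {"stack": f"{company}: Salesforce, Outreach"}

def fan_out(company, timeout_s=5.0):
    agents = {"funding": funding_agent, "hiring": hiring_agent, "tech": tech_agent}
    results, missing = {}, []
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(fn, company) for name, fn in agents.items()}
        for name, future in futures.items():
            try:
                results.update(future.result(timeout=timeout_s))
            except FuturesTimeout:
                missing.append(name)  # partial results beat no results
    return {"brief": results, "missing": missing}
```

A real merge agent would go further than this combine step: resolve conflicts and deduplicate before producing the unified brief.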

### Pattern 4: Loop (Iterative)

An agent produces output. A critic agent evaluates it. If it fails, the output cycles back to the producer with feedback.

```
[Producer Agent] → [Critic Agent] → Pass? → Output
                        ↓ Fail
                  [Feedback → Producer Agent]
```

**When to use:** Output quality is critical and objectively measurable. Email writing with strict rules. Data extraction with validation. Any task where "good enough on first try" isn't reliable enough.

**Example: Email writing with QA loop**

| Cycle | Producer | Critic checks | Pass/fail |
|-------|----------|--------------|-----------|
| 1 | Writes 3-email sequence | Word count, banned phrases, signal presence, subject line rules, em-dash check | Fail: Email 1 is 95 words, uses "leveraging" |
| 2 | Rewrites with critic feedback | Same checks | Fail: Email 3 has a pitch in it |
| 3 | Rewrites Email 3 only | Same checks | Pass: all rules met |

**Loop rules:**
- Cap iterations. 3 attempts max. If the producer can't pass the critic in 3 tries, the prompt or the rules need fixing, not a 4th attempt
- The critic must use objective, binary checks. "Is this email good?" is not a valid critic check. "Is the word count under 80?" is
- Feed specific failure reasons back to the producer. "Failed" is not feedback. "Email 1 is 15 words over limit. Remove the second sentence in paragraph 2" is feedback
- Log every iteration. Patterns in critic feedback reveal systematic prompt issues. If the producer consistently fails on the same rule, fix the producer's prompt
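
The loop rules translate to a producer/critic cycle with a hard cap and specific failure reasons. Both functions below are simplified stand-ins; the banned-phrase list and the fixed rewrite are illustrative assumptions.

```python
MAX_ITERATIONS = 3
BANNED = {"leveraging", "synergy"}

def produce(feedback: list[str]) -> str:
    # Stand-in for an LLM call that rewrites using critic feedback.
    draft = "Saw you are leveraging new hires to scale outbound."
    if feedback:  # pretend the producer fixed what the critic flagged
        draft = "Saw you are using new hires to scale outbound."
    return draft

def critique(draft: str) -> list[str]:
    # Objective, binary checks with specific failure reasons.
    failures = []
    if len(draft.split()) > 80:
        failures.append("over 80 words")
    failures += [f"banned phrase: {w}" for w in BANNED if w in draft.lower()]
    return failures

def loop():
    feedback: list[str] = []
    for attempt in range(1, MAX_ITERATIONS + 1):
        draft = produce(feedback)
        feedback = critique(draft)
        if not feedback:
            return draft, attempt
    # Don't attempt a 4th try: the prompt or rules need fixing.
    raise RuntimeError(f"failed after {MAX_ITERATIONS} attempts: {feedback}")
```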

---

## Orchestration Architecture

### Context management

The most common orchestration failure is context pollution: passing too much information between agents, degrading each agent's focus.

**Rules:**
- Each agent receives only the context it needs. The email writer doesn't need raw Crunchbase HTML. It needs the structured account brief
- Define explicit schemas for inter-agent data. Use structured output (JSON) for agent outputs that feed into other agents. Free-form text between agents causes parsing failures
- Strip intermediate reasoning. Agent A's chain-of-thought is noise for Agent B. Pass conclusions, not the reasoning process
- Include metadata with context: source, timestamp, confidence level. "Funding: $45M Series B (Crunchbase, Oct 2025, confirmed)" is better than "They raised $45M"
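
A structured inter-agent payload might look like the sketch below: conclusions with provenance, chain-of-thought stripped before handoff. The field names are assumptions, not a fixed schema.

```python
import json

# Raw agent output: useful conclusions mixed with reasoning noise.
raw_agent_output = {
    "reasoning": "First I checked Crunchbase, then...",  # noise downstream
    "funding": {
        "value": "$45M Series B",
        "source": "Crunchbase",
        "as_of": "2025-10",
        "confidence": "confirmed",
    },
}

def to_downstream_context(output: dict) -> str:
    # Pass conclusions with metadata; drop intermediate reasoning.
    context = {k: v for k, v in output.items() if k != "reasoning"}
    return json.dumps(context)
```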

### State management

Orchestrated systems need to track state across agents and across runs.

| State type | Where to store | Example |
|-----------|---------------|---------|
| Run state (current execution) | In-memory or run database | Which agents have completed, current stage, intermediate outputs |
| Account state (persistent) | CRM or account database | Committee map, signal inventory, prior outreach history |
| Agent config (static) | Config files or environment | System prompts, model selection, temperature, tool access |
| Quality metrics (aggregate) | Analytics database | Pass rates, iteration counts, error patterns |

**State rules:**
- Never store state in the agent's prompt. Prompts are for instructions, not data. Pass state through structured input
- Every run should have a unique ID that traces through all agents. Debugging a 4-agent pipeline without trace IDs is impossible
- Persist intermediate outputs. If Agent C fails, you should be able to restart from Agent C's input without re-running A and B
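
These rules can be sketched as a trace ID per run plus per-stage persistence, so a failed Agent C restarts from its own input. The file layout here is an assumption.

```python
import json
import uuid
from pathlib import Path

def new_run() -> str:
    # One trace ID that follows the run through every agent.
    return uuid.uuid4().hex

def persist_stage(run_dir: Path, run_id: str, stage: str, output: dict) -> Path:
    # One file per (run, stage) so any stage can be replayed in isolation.
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / f"{run_id}.{stage}.json"
    path.write_text(json.dumps({"run_id": run_id, "stage": stage, "output": output}))
    return path

def load_stage(run_dir: Path, run_id: str, stage: str) -> dict:
    # Restart point: Agent C's input is Agent B's persisted output.
    return json.loads((run_dir / f"{run_id}.{stage}.json").read_text())["output"]
```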

### Error handling

| Error type | Response | Example |
|-----------|----------|---------|
| Agent timeout | Return partial results. Flag incomplete fields | Research Agent hangs on one data source. Return what you have, mark the missing source |
| Agent produces invalid output | Retry once with explicit format instructions. If still invalid, escalate to human | Email Writer returns a 4-email sequence instead of 3 |
| Agent produces low-confidence output | Route to human review. Don't pass to next agent | Router classifies a reply as "positive" with 55% confidence |
| Upstream agent failure | Skip dependent agents. Return partial pipeline results | If Research Agent fails, don't run Email Writer. Return the failure with context |
| Rate limit / API error | Retry with exponential backoff. 3 attempts max | Model API returns 429. Wait 30s, retry. Wait 60s, retry. Then fail |

**Error rules:**
- Never silently swallow errors. Every failure should be logged with the agent name, input, and error message
- Design for partial success. A pipeline that produces 3 out of 4 outputs is more useful than one that produces 0 because step 2 failed
- Human escalation is a valid error handling strategy. Not every edge case needs an automated fix. Some need a person
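
The retry row in the table above can be sketched as a generic backoff wrapper. The injectable `sleep` keeps it testable; the 30-second base delay and 3-attempt cap mirror the table.

```python
import time

def call_with_backoff(call, attempts=3, base_delay_s=30, sleep=time.sleep):
    # Retry a flaky call with exponential backoff: 30s, 60s, then fail.
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # never swallow: the caller logs agent, input, error
            sleep(base_delay_s * (2 ** attempt))
```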

---

## Agent Design Principles for GTM

### Keep agents narrow

| Good agent scope | Bad agent scope |
|-----------------|----------------|
| "Research funding history for one company" | "Research everything about a company" |
| "Write Email 1 of a cold sequence given an account brief" | "Write a full campaign" |
| "Classify an inbound reply into one of 6 categories" | "Handle all inbound replies" |
| "Score ICP fit on 4 firmographic dimensions" | "Decide if we should target this account" |

Narrow agents are easier to test, faster to iterate, and produce more consistent output.

### Match model to task

| Task type | Model recommendation | Why |
|-----------|---------------------|-----|
| Classification / routing | Fast, cheap model (Haiku) | Speed matters. Creativity doesn't |
| Data extraction / structuring | Mid-tier model (Sonnet) | Needs accuracy. Doesn't need creativity |
| Content generation (emails, copy) | Top-tier model (Opus/Sonnet) | Quality matters. Worth the cost and latency |
| QA / critic | Mid-tier model (Sonnet) | Rule-checking, not generation. Fast + accurate |
| Synthesis / merge | Top-tier model (Opus/Sonnet) | Combining multiple inputs requires strong reasoning |

Don't use Opus for classification. Don't use Haiku for email writing. Match the model to the cognitive demand.
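
As static agent config, the table above might be expressed as a simple mapping. The tier names and temperatures here are illustrative assumptions; map them to your provider's actual model IDs.

```python
# Per-agent model config (static): kept out of prompts, loaded at startup.
AGENT_MODELS = {
    "router":       {"model": "haiku",  "temperature": 0.0},
    "extractor":    {"model": "sonnet", "temperature": 0.0},
    "email_writer": {"model": "opus",   "temperature": 0.7},
    "critic":       {"model": "sonnet", "temperature": 0.0},
    "merge":        {"model": "opus",   "temperature": 0.3},
}

def model_for(agent: str) -> str:
    return AGENT_MODELS[agent]["model"]
```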

### Design for human-in-the-loop

Every production GTM agent system should have human checkpoints.

**Where to insert human review:**

| Stage | Why | Review type |
|-------|-----|-------------|
| After account selection / scoring | Humans catch fit signals models miss | Approve/reject list |
| After email generation | Brand voice, factual accuracy, tone | Edit or approve drafts |
| After reply classification | Misrouted replies damage relationships | Spot-check low-confidence classifications |
| Before CRM updates | Bad data in CRM cascades everywhere | Approve before write |
| Before sending | Final gate on anything customer-facing | Approve batch or individual |

**Human-in-the-loop rules:**
- Default to human review for anything customer-facing until you have 30+ days of quality data showing the agent is reliable
- Automate progressively. Start with human review on 100% of outputs. Drop to 50% after quality stabilizes. Drop to 10% spot-check after 30 days of consistent quality. Never drop to 0%
- Make review fast. Present the agent's output alongside the input and the rules it was supposed to follow. The reviewer should approve or reject in under 30 seconds per item

---

## Testing Orchestrated Systems

### Unit testing (per agent)

Test each agent independently with known inputs and expected outputs.

| Test type | What to verify | Example |
|-----------|---------------|---------|
| Happy path | Agent produces correct output for standard input | Research Agent returns valid account brief for a well-known company |
| Edge case | Agent handles unusual input gracefully | Research Agent handles a company with no funding data |
| Rule compliance | Output follows all specified rules | Email Writer output passes banned-phrase check, word count check |
| Format compliance | Output matches the expected schema | Research Agent returns valid JSON matching the account brief schema |
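
A happy-path unit test for one agent can be plain assertions against its rules. The `email_writer` stand-in and its rule set here are illustrative assumptions.

```python
BANNED_PHRASES = {"leveraging", "circle back"}

def email_writer(brief: dict) -> dict:
    # Hypothetical agent under test; a real one would call an LLM.
    return {"emails": [f"Noticed {brief['company']} is hiring SDRs."] * 3}

def test_happy_path():
    out = email_writer({"company": "Acme"})
    assert len(out["emails"]) == 3                # format compliance
    for email in out["emails"]:
        assert len(email.split()) <= 80           # rule: word count
        assert not any(p in email.lower() for p in BANNED_PHRASES)

test_happy_path()
```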

### Integration testing (across agents)

Test agent-to-agent handoffs with real data flow.

- Run the full pipeline end-to-end with 5 test accounts
- Verify each intermediate output is valid input for the next agent
- Check that the final output meets all quality criteria
- Measure total latency. If the full pipeline takes 10 minutes per account, that may be acceptable for ABM but not for real-time reply handling

### Regression testing

After any prompt change, re-run the full test suite before deploying.

- Prompt changes to one agent can cascade. A change to the Research Agent's output format may break the Email Writer's parser
- Keep a golden set of 10-20 test cases with known-good outputs. Compare new outputs against the golden set
- Track quality metrics over time. A gradual decline in critic pass rates signals prompt drift
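
A golden-set comparison can be sketched as a pure function over stored cases that returns the IDs that drifted. The case structure (`input`/`output` keys) is an assumption.

```python
def regression_check(golden: dict[str, dict], run_agent) -> list[str]:
    # Re-run each golden case and report any case whose output drifted
    # from the known-good answer.
    failures = []
    for case_id, expected in golden.items():
        actual = run_agent(expected["input"])
        if actual != expected["output"]:
            failures.append(case_id)
    return failures
```

Run this after every prompt change, before deploying; a growing failure list is the prompt-drift signal described above.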

---

## Anti-Pattern Check

- One agent doing everything. If the system prompt is 3,000+ words covering research, writing, scoring, and CRM updates, split it. Monolith agents are slow, expensive, and unreliable
- No schema between agents. Passing free-form text between agents causes parsing failures and silent quality degradation. Define structured contracts
- Router and specialist combined. An agent that classifies a reply and also writes the response will be biased toward classifications that are easy to respond to. Separate the decisions
- No human review on customer-facing output. Every email, every LinkedIn message, every reply should be human-reviewed until you have data proving the agent is reliable. "It looked good in testing" is not data
- Retry loops without caps. An email writer that keeps failing the critic and retrying indefinitely will burn tokens and never converge. Cap at 3 iterations
- Using the most expensive model for every agent. Classification doesn't need Opus. Data extraction doesn't need Opus. Reserve top-tier models for tasks that require strong reasoning or creative output
- No trace IDs. When something goes wrong in a 4-agent pipeline, you need to trace the exact input and output at every stage. Without trace IDs, debugging is guesswork
- Designing for full automation on Day 1. Start with human-in-the-loop. Earn automation through demonstrated quality. The cost of a bad automated email to a Tier 1 ABM account is higher than the cost of a human reviewing 50 drafts