
agent-orchestration

This skill should be used when the user asks to "orchestrate multiple agents", "coordinate agents", "design a multi-agent system", "build an agent pipeline", "chain agents together", "hand off between agents", "design agent workflows", "build an agent DAG", "route between agents", or any variation of designing and implementing systems where multiple AI agents work together on GTM tasks for B2B SaaS.

Agent Orchestration

Agent orchestration is the design of systems where multiple specialized agents work together to complete a task no single agent could handle well alone. Each agent owns one job. The orchestration layer decides which agent runs, when, and what context it receives.

The principle: one agent, one job. An agent that researches accounts, writes emails, scores leads, and updates the CRM is not an agent. It's a monolith with an LLM inside. It will be mediocre at all four tasks.

When to Use Multi-Agent vs Single Agent

| Scenario | Use single agent | Use multi-agent |
|---|---|---|
| Task is one step with clear input/output | Yes | No |
| Task requires 2+ distinct skill sets | No | Yes |
| Output quality varies by subtask | No | Yes (specialize each) |
| Task has branching logic (if X do Y, else Z) | Maybe | Yes (router + specialists) |
| Task requires human review at intermediate steps | No | Yes (pause between agents) |
| Task operates on different data at different stages | No | Yes (each agent gets relevant context) |

Rule of thumb: If you're writing one system prompt longer than 2,000 words to cover multiple distinct tasks, split into agents.


The 4 Orchestration Patterns

Pattern 1: Pipeline (Sequential)

Agents run in order. Each agent's output becomes the next agent's input.

[Agent A] → [Agent B] → [Agent C] → Output

When to use: The task has clear stages where each stage's output is the next stage's input. Account research → email writing → sequence loading.

Example: Account research → cold email pipeline

| Step | Agent | Input | Output |
|---|---|---|---|
| 1 | Research Agent | Company name + ICP criteria | Account brief (company snapshot, signals, committee, problem hypothesis) |
| 2 | Email Writer Agent | Account brief | 3-email sequence (subject lines, bodies, per cold-outbound-email-writing rules) |
| 3 | Personalization Agent | Email drafts + LinkedIn data | Emails with per-contact personalization tokens inserted |
| 4 | QA Agent | Final emails | Pass/fail on word count, banned phrases, signal presence, subject line rules |

Pipeline rules:

  • Define the contract between agents explicitly. Agent A's output schema must match Agent B's expected input schema. Mismatches cause silent quality degradation
  • Each agent should be independently testable. If Agent B produces bad output, you need to know whether the problem is Agent B's prompt or Agent A's output
  • Include a QA agent at the end of any pipeline that produces customer-facing output. The QA agent checks rules compliance, not creativity
  • Keep pipelines to 4-5 stages max. Each stage introduces latency and potential error propagation. Beyond 5, refactor
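
A minimal sketch of the contract idea in Python, using stdlib dataclasses as the inter-agent schema. The agent functions are hypothetical stubs standing in for real model calls; the point is that Agent A's output type is Agent B's input type, so a mismatch fails at the handoff instead of degrading silently.

```python
from dataclasses import dataclass

# Explicit contract: the Research Agent's output type IS the Email
# Writer's input type, so a mismatch fails at the handoff, not downstream.
@dataclass
class AccountBrief:
    company: str
    snapshot: str
    signals: list[str]
    problem_hypothesis: str

@dataclass
class EmailSequence:
    brief: AccountBrief
    emails: list[str]  # exactly three bodies, per sequence rules

def research_agent(company: str, icp: str) -> AccountBrief:
    # Hypothetical stub: a real version would call a model plus data sources
    return AccountBrief(company, "snapshot...", ["hiring spike"], "hypothesis...")

def email_writer_agent(brief: AccountBrief) -> EmailSequence:
    # Hypothetical stub: consumes only the structured brief, never raw HTML
    return EmailSequence(brief, ["email 1", "email 2", "email 3"])

def run_pipeline(company: str, icp: str) -> EmailSequence:
    brief = research_agent(company, icp)  # stage 1
    return email_writer_agent(brief)      # stage 2: typed handoff
```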

Pattern 2: Router (Branching)

A router agent classifies the input and sends it to the right specialist agent.

                ┌→ [Agent A]
[Router] ───────┼→ [Agent B]
                └→ [Agent C]

When to use: Different inputs require fundamentally different handling. Inbound lead classification (demo request vs. pricing question vs. support ticket). Reply classification (positive vs. objection vs. out-of-office).

Example: Inbound reply router

| Classification | Routed to | Action |
|---|---|---|
| Positive reply ("interested, let's talk") | Meeting Booker Agent | Propose 2-3 meeting times, include Calendly link |
| Objection ("we already have a tool") | Objection Handler Agent | Acknowledge, reframe, offer a different angle |
| Question ("how does pricing work?") | Info Agent | Answer the specific question, then softly re-ask for the meeting |
| Out-of-office | OOO Handler Agent | Parse return date, reschedule follow-up |
| Not interested ("remove me") | Opt-Out Agent | Confirm removal, update CRM, stop sequence |
| Irrelevant / spam | None | Log and discard. No response |

Router rules:

  • The router should be fast and cheap. Use a smaller/faster model for classification. The router doesn't generate content; it classifies
  • Define every category explicitly with examples. Ambiguous categories cause misrouting. "Positive-ish" is not a category
  • Include a fallback/escalation path. If the router can't classify with confidence, route to a human, not a random agent
  • Log every routing decision with the confidence score. Low-confidence routes are where quality breaks down. Review them weekly
  • Never let the router also be a specialist. The router classifies. It does not act. Mixing classification and action in one agent creates bias
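
One way the router contract might look in code. The `classify` function is a hypothetical stub standing in for a small, fast classification model; the category names and the 0.8 confidence floor are illustrative, not prescriptive.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RouteDecision:
    category: str      # must be one of the explicit categories below
    confidence: float  # 0.0-1.0, reported by the classifier

CONFIDENCE_FLOOR = 0.8  # below this, escalate to a human, never guess

def classify(reply: str) -> RouteDecision:
    # Hypothetical stub for a small, fast model that ONLY classifies
    return RouteDecision(category="positive", confidence=0.92)

HANDLERS: dict[str, Callable[[str], str]] = {
    "positive":       lambda r: "meeting-booker: " + r,
    "objection":      lambda r: "objection-handler: " + r,
    "question":       lambda r: "info-agent: " + r,
    "out_of_office":  lambda r: "ooo-handler: " + r,
    "not_interested": lambda r: "opt-out-agent: " + r,
}

def route(reply: str) -> str:
    decision = classify(reply)
    # Log every routing decision with its confidence score
    print(f"route={decision.category} conf={decision.confidence:.2f}")
    if decision.confidence < CONFIDENCE_FLOOR:
        return "escalated to human"   # fallback path, never a random agent
    handler = HANDLERS.get(decision.category)
    if handler is None:               # irrelevant / spam and unknowns
        return "logged and discarded"
    return handler(reply)
```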

Pattern 3: Fan-Out / Fan-In (Parallel)

Multiple agents work on the same input simultaneously. A merge agent combines their outputs.

          ┌→ [Agent A] ─┐
Input ────┼→ [Agent B] ─┼→ [Merge Agent] → Output
          └→ [Agent C] ─┘

When to use: Multiple independent analyses of the same input. Account research from different sources. Multi-angle email generation for A/B testing. Scoring an account on multiple dimensions simultaneously.

Example: Parallel account research

| Agent | Input | Researches | Output |
|---|---|---|---|
| Funding Agent | Company name | Crunchbase, press, SEC filings | Funding history, investors, runway estimate |
| Hiring Agent | Company name | LinkedIn, job boards | Open roles, hiring velocity, org structure signals |
| Tech Stack Agent | Company name + domain | Job postings, BuiltWith, GitHub | Current stack, recent changes, integration points |
| Signal Agent | Company name | News, LinkedIn posts, G2 reviews | Active signals, recency, strength |
| Merge Agent | All four outputs | Combines, deduplicates, synthesizes | Unified account brief |

Fan-out rules:

  • Each parallel agent must be independent. If Agent B needs Agent A's output to do its job, it's not parallel. It's a pipeline
  • The merge agent is not optional. Raw concatenation of parallel outputs is not orchestration. The merge agent resolves conflicts, removes duplicates, and produces a unified document
  • Set timeouts per agent. If one parallel agent hangs, the others shouldn't wait indefinitely. Return partial results after the timeout
  • Fan-out works best when each agent accesses different data sources. If all agents are reading the same data and interpreting it differently, the outputs will conflict more than complement
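
A sketch of fan-out with per-agent timeouts and partial results, using Python's asyncio. The agent coroutines are hypothetical stubs; a real system would swap in calls to the data sources above.

```python
import asyncio

async def funding_agent(company: str) -> dict:
    # Hypothetical stub; a real agent would query funding data sources
    await asyncio.sleep(0.1)
    return {"funding": "$45M Series B"}

async def hiring_agent(company: str) -> dict:
    # Hypothetical stub; a real agent would query job boards
    await asyncio.sleep(0.1)
    return {"open_roles": 14}

async def research_fan_out(company: str, timeout_s: float = 30.0) -> dict:
    # Each independent agent gets its own timeout, so one hung agent
    # can't stall the rest; failures become flagged gaps, not a dead run
    tasks = {
        "funding": asyncio.wait_for(funding_agent(company), timeout_s),
        "hiring": asyncio.wait_for(hiring_agent(company), timeout_s),
    }
    outputs = await asyncio.gather(*tasks.values(), return_exceptions=True)
    partial = {
        name: (out if not isinstance(out, Exception) else None)
        for name, out in zip(tasks, outputs)
    }
    return merge_agent(partial)

def merge_agent(partial: dict) -> dict:
    # Hypothetical stub for the merge agent; here it only tags the gaps.
    # A real one resolves conflicts and deduplicates across sources
    return {"brief": partial, "missing": [k for k, v in partial.items() if v is None]}

print(asyncio.run(research_fan_out("Acme Corp")))
```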

Pattern 4: Loop (Iterative)

An agent produces output. A critic agent evaluates it. If it fails, the output cycles back to the producer with feedback.

[Producer Agent] → [Critic Agent] → Pass? → Output
        ↑                             │ Fail
        └──────── feedback ──────────┘

When to use: Output quality is critical and objectively measurable. Email writing with strict rules. Data extraction with validation. Any task where "good enough on first try" isn't reliable enough.

Example: Email writing with QA loop

| Cycle | Producer | Critic checks | Pass/fail |
|---|---|---|---|
| 1 | Writes 3-email sequence | Word count, banned phrases, signal presence, subject line rules, em-dash check | Fail: Email 1 is 95 words, uses "leveraging" |
| 2 | Rewrites with critic feedback | Same checks | Fail: Email 3 has a pitch in it |
| 3 | Rewrites Email 3 only | Same checks | Pass: all rules met |

Loop rules:

  • Cap iterations. 3 attempts max. If the producer can't pass the critic in 3 tries, the prompt or the rules need fixing, not a 4th attempt
  • The critic must use objective, binary checks. "Is this email good?" is not a valid critic check. "Is the word count under 80?" is
  • Feed specific failure reasons back to the producer. "Failed" is not feedback. "Email 1 is 15 words over limit. Remove the second sentence in paragraph 2" is feedback
  • Log every iteration. Patterns in critic feedback reveal systematic prompt issues. If the producer consistently fails on the same rule, fix the producer's prompt
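
A minimal producer/critic loop with a hard cap of 3 iterations. The producer is a hypothetical stub; the critic shows what objective, binary checks look like, and the word limit and banned-phrase list are illustrative.

```python
MAX_ITERATIONS = 3
WORD_LIMIT = 80                    # illustrative thresholds
BANNED = {"leveraging", "synergy"}

def critic(email: str) -> list[str]:
    # Objective, binary checks only; returns specific failure reasons
    failures = []
    words = len(email.split())
    if words > WORD_LIMIT:
        failures.append(f"{words - WORD_LIMIT} words over the {WORD_LIMIT}-word limit")
    for phrase in BANNED:
        if phrase in email.lower():
            failures.append(f"banned phrase: {phrase!r}")
    return failures

def producer(feedback: list[str]) -> str:
    # Hypothetical stub; a real producer prompts a model and includes
    # the critic's specific failure reasons on every retry
    return "Saw the new SRE postings. Guessing incident load is up?"

def write_with_qa_loop() -> str:
    feedback: list[str] = []
    for attempt in range(1, MAX_ITERATIONS + 1):
        draft = producer(feedback)
        feedback = critic(draft)
        if not feedback:
            return draft           # pass: all rules met
    # 3 strikes: fix the prompt or the rules, don't take a 4th attempt
    raise RuntimeError(f"failed after {MAX_ITERATIONS} attempts: {feedback}")
```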

Orchestration Architecture

Context management

The most common orchestration failure is context pollution: passing too much information between agents, degrading each agent's focus.

Rules:

  • Each agent receives only the context it needs. The email writer doesn't need raw Crunchbase HTML. It needs the structured account brief
  • Define explicit schemas for inter-agent data. Use structured output (JSON) for agent outputs that feed into other agents. Free-form text between agents causes parsing failures
  • Strip intermediate reasoning. Agent A's chain-of-thought is noise for Agent B. Pass conclusions, not the reasoning process
  • Include metadata with context: source, timestamp, confidence level. "Funding: $45M Series B (Crunchbase, Oct 2025, confirmed)" is better than "They raised $45M"
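
A sketch of what a metadata-carrying handoff might look like. The field names are illustrative; the point is that each fact travels with its source, date, and confidence, serialized as JSON at the agent boundary.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Fact:
    claim: str       # the conclusion only; strip the reasoning behind it
    source: str      # where it came from
    as_of: str       # ISO date it was observed
    confidence: str  # e.g. "confirmed" / "inferred"

brief = {
    "company": "Acme Corp",
    "facts": [asdict(Fact("Raised $45M Series B", "Crunchbase", "2025-10-01", "confirmed"))],
}

# Serialize at the agent boundary: the next agent parses a schema,
# not free-form prose, so a malformed handoff fails loudly
payload = json.dumps(brief)
```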

State management

Orchestrated systems need to track state across agents and across runs.

| State type | Where to store | Example |
|---|---|---|
| Run state (current execution) | In-memory or run database | Which agents have completed, current stage, intermediate outputs |
| Account state (persistent) | CRM or account database | Committee map, signal inventory, prior outreach history |
| Agent config (static) | Config files or environment | System prompts, model selection, temperature, tool access |
| Quality metrics (aggregate) | Analytics database | Pass rates, iteration counts, error patterns |

State rules:

  • Never store state in the agent's prompt. Prompts are for instructions, not data. Pass state through structured input
  • Every run should have a unique ID that traces through all agents. Debugging a 4-agent pipeline without trace IDs is impossible
  • Persist intermediate outputs. If Agent C fails, you should be able to restart from Agent C's input without re-running A and B
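
A minimal sketch of trace IDs plus persisted stage outputs, assuming a local `runs/` directory purely for illustration. A production system would likely use a run database, but the shape is the same.

```python
import json
import uuid
from pathlib import Path

RUN_DIR = Path("runs")  # illustrative; a run database plays the same role

def start_run() -> str:
    run_id = str(uuid.uuid4())  # one ID traces through every agent's logs
    (RUN_DIR / run_id).mkdir(parents=True, exist_ok=True)
    return run_id

def save_stage(run_id: str, stage: str, output: dict) -> None:
    # Persist every intermediate output so a failed Agent C can be
    # restarted from its input without re-running A and B
    (RUN_DIR / run_id / f"{stage}.json").write_text(json.dumps(output))

def load_stage(run_id: str, stage: str) -> dict:
    return json.loads((RUN_DIR / run_id / f"{stage}.json").read_text())
```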

Error handling

| Error type | Response | Example |
|---|---|---|
| Agent timeout | Return partial results. Flag incomplete fields | Research Agent hangs on one data source. Return what you have, mark the missing source |
| Agent produces invalid output | Retry once with explicit format instructions. If still invalid, escalate to human | Email Writer returns a 4-email sequence instead of 3 |
| Agent produces low-confidence output | Route to human review. Don't pass to next agent | Router classifies a reply as "positive" with 55% confidence |
| Upstream agent failure | Skip dependent agents. Return partial pipeline results | If Research Agent fails, don't run Email Writer. Return the failure with context |
| Rate limit / API error | Retry with exponential backoff. 3 attempts max | Model API returns 429. Wait 30s, retry. Wait 60s, retry. Then fail |

Error rules:

  • Never silently swallow errors. Every failure should be logged with the agent name, input, and error message
  • Design for partial success. A pipeline that produces 3 out of 4 outputs is more useful than one that produces 0 because step 2 failed
  • Human escalation is a valid error handling strategy. Not every edge case needs an automated fix. Some need a person
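
The backoff row above, as a sketch. In practice, catch only the specific rate-limit exception your client library raises; the bare `Exception` here is a placeholder.

```python
import time

def call_with_backoff(fn, attempts: int = 3, base_delay_s: float = 30.0):
    # Wait 30s, then 60s, then fail: delays double, capped at 3 attempts
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # placeholder: catch your client's rate-limit error
            if attempt == attempts - 1:
                raise      # never swallow; the caller logs agent, input, error
            time.sleep(base_delay_s * (2 ** attempt))
```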

Agent Design Principles for GTM

Keep agents narrow

| Good agent scope | Bad agent scope |
|---|---|
| "Research funding history for one company" | "Research everything about a company" |
| "Write Email 1 of a cold sequence given an account brief" | "Write a full campaign" |
| "Classify an inbound reply into one of 6 categories" | "Handle all inbound replies" |
| "Score ICP fit on 4 firmographic dimensions" | "Decide if we should target this account" |

Narrow agents are easier to test, faster to iterate, and produce more consistent output.

Match model to task

| Task type | Model recommendation | Why |
|---|---|---|
| Classification / routing | Fast, cheap model (Haiku) | Speed matters. Creativity doesn't |
| Data extraction / structuring | Mid-tier model (Sonnet) | Needs accuracy. Doesn't need creativity |
| Content generation (emails, copy) | Top-tier model (Opus/Sonnet) | Quality matters. Worth the cost and latency |
| QA / critic | Mid-tier model (Sonnet) | Rule-checking, not generation. Fast + accurate |
| Synthesis / merge | Top-tier model (Opus/Sonnet) | Combining multiple inputs requires strong reasoning |

Don't use Opus for classification. Don't use Haiku for email writing. Match the model to the cognitive demand.
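
One way to express this as static agent config. The model names are deliberately generic placeholders, not a product recommendation; map them to whatever tiers your provider offers.

```python
# Illustrative static config: map each agent role to the cheapest model
# tier that meets its cognitive demand
MODEL_FOR_ROLE = {
    "router":       {"model": "fast-cheap-model", "temperature": 0.0},
    "extractor":    {"model": "mid-tier-model",   "temperature": 0.0},
    "email_writer": {"model": "top-tier-model",   "temperature": 0.7},
    "critic":       {"model": "mid-tier-model",   "temperature": 0.0},
    "merge":        {"model": "top-tier-model",   "temperature": 0.3},
}
```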

Design for human-in-the-loop

Every production GTM agent system should have human checkpoints.

Where to insert human review:

| Stage | Why | Review type |
|---|---|---|
| After account selection / scoring | Humans catch fit signals models miss | Approve/reject list |
| After email generation | Brand voice, factual accuracy, tone | Edit or approve drafts |
| After reply classification | Misrouted replies damage relationships | Spot-check low-confidence classifications |
| Before CRM updates | Bad data in CRM cascades everywhere | Approve before write |
| Before sending | Final gate on anything customer-facing | Approve batch or individual |

Human-in-the-loop rules:

  • Default to human review for anything customer-facing until you have 30+ days of quality data showing the agent is reliable
  • Automate progressively. Start with human review on 100% of outputs. Drop to 50% after quality stabilizes. Drop to 10% spot-check after 30 days of consistent quality. Never drop to 0%
  • Make review fast. Present the agent's output alongside the input and the rules it was supposed to follow. The reviewer should approve or reject in under 30 seconds per item
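
A sketch of progressive automation as a sampling function. The 7-day cutoff for dropping to 50% is an assumption (the rules above say "after quality stabilizes"); the 30-day cutoff and the never-0% floor mirror the rules. Tune all of it to your own quality data.

```python
import random

def review_rate(days_of_stable_quality: int) -> float:
    if days_of_stable_quality < 7:   # assumed cutoff: quality not yet proven
        return 1.0                   # review everything
    if days_of_stable_quality < 30:
        return 0.5                   # quality stabilizing: review half
    return 0.1                       # long-run spot check; never drop to 0.0

def needs_human_review(days_of_stable_quality: int) -> bool:
    return random.random() < review_rate(days_of_stable_quality)
```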

Testing Orchestrated Systems

Unit testing (per agent)

Test each agent independently with known inputs and expected outputs.

| Test type | What to verify | Example |
|---|---|---|
| Happy path | Agent produces correct output for standard input | Research Agent returns valid account brief for a well-known company |
| Edge case | Agent handles unusual input gracefully | Research Agent handles a company with no funding data |
| Rule compliance | Output follows all specified rules | Email Writer output passes banned-phrase check, word count check |
| Format compliance | Output matches the expected schema | Research Agent returns valid JSON matching the account brief schema |
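
What rule and format compliance tests might look like, pytest-style. The agent call is a stub; in a real suite it would invoke the actual agent with a recorded input.

```python
# Hypothetical pytest-style tests; the agent is stubbed so the
# checks themselves are the point
BANNED = {"leveraging", "synergy"}

def email_writer_agent(brief: dict) -> list[str]:
    # Stub standing in for the real agent invocation
    return ["Saw the new SRE postings. Guessing incident load is up?"] * 3

def test_rule_compliance():
    for email in email_writer_agent({"company": "Acme Corp"}):
        assert len(email.split()) <= 80
        assert not any(p in email.lower() for p in BANNED)

def test_format_compliance():
    emails = email_writer_agent({"company": "Acme Corp"})
    assert isinstance(emails, list) and len(emails) == 3  # exactly 3 emails
```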

Integration testing (across agents)

Test agent-to-agent handoffs with real data flow.

  • Run the full pipeline end-to-end with 5 test accounts
  • Verify each intermediate output is valid input for the next agent
  • Check that the final output meets all quality criteria
  • Measure total latency. If the full pipeline takes 10 minutes per account, that may be acceptable for ABM but not for real-time reply handling

Regression testing

After any prompt change, re-run the full test suite before deploying.

  • Prompt changes to one agent can cascade. A change to the Research Agent's output format may break the Email Writer's parser
  • Keep a golden set of 10-20 test cases with known-good outputs. Compare new outputs against the golden set
  • Track quality metrics over time. A gradual decline in critic pass rates signals prompt drift
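
A sketch of golden-set regression, assuming golden cases are stored as JSON files with `input` and `expected_output` keys (an illustrative layout). Exact-match comparison works for structured outputs; generated prose needs rule-based checks instead.

```python
import json
from pathlib import Path

GOLDEN_DIR = Path("tests/golden")  # illustrative: 10-20 known-good cases

def run_regression(pipeline) -> list[str]:
    # Re-run the pipeline on every golden case; inspect any drift
    # before deploying a prompt change
    failures = []
    for case in sorted(GOLDEN_DIR.glob("*.json")):
        golden = json.loads(case.read_text())
        if pipeline(golden["input"]) != golden["expected_output"]:
            failures.append(case.name)
    return failures
```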

Anti-Pattern Check

  • One agent doing everything. If the system prompt is 3,000+ words covering research, writing, scoring, and CRM updates, split it. Monolith agents are slow, expensive, and unreliable
  • No schema between agents. Passing free-form text between agents causes parsing failures and silent quality degradation. Define structured contracts
  • Router and specialist combined. An agent that classifies a reply and also writes the response will be biased toward classifications that are easy to respond to. Separate the decisions
  • No human review on customer-facing output. Every email, every LinkedIn message, every reply should be human-reviewed until you have data proving the agent is reliable. "It looked good in testing" is not data
  • Retry loops without caps. An email writer that keeps failing the critic and retrying indefinitely will burn tokens and never converge. Cap at 3 iterations
  • Using the most expensive model for every agent. Classification doesn't need Opus. Data extraction doesn't need Opus. Reserve top-tier models for tasks that require strong reasoning or creative output
  • No trace IDs. When something goes wrong in a 4-agent pipeline, you need to trace the exact input and output at every stage. Without trace IDs, debugging is guesswork
  • Designing for full automation on Day 1. Start with human-in-the-loop. Earn automation through demonstrated quality. The cost of a bad automated email to a Tier 1 ABM account is higher than the cost of a human reviewing 50 drafts