This skill should be used when the user asks to "design agent handoffs", "hand off between agents", "pass context between agents", "build multi-agent pipelines", "design agent-to-agent communication", "orchestrate agent handoffs", "structure agent pipelines", "pass state between AI agents", "design agent chains", or any variation of designing how AI agents hand off work to each other in B2B SaaS GTM workflows.
Multi-Agent Handoffs

A multi-agent handoff is the moment one agent finishes its job and passes output to the next agent. Research agent finds signals. Email agent writes the sequence. Scoring agent ranks the prospect. Each agent does one thing well. The handoff is where quality breaks or compounds.

The principle: every handoff is a contract. Agent A promises a specific output schema. Agent B expects that exact schema. If the contract breaks, the downstream agent hallucinates, guesses, or fails silently. Define the contract before building either agent.

The Handoff Contract

What a contract includes

| Element | What it defines | Example |
| --- | --- | --- |
| Output schema | Exact fields Agent A must produce | { company_name: string, signal: string, signal_date: string, problem_hypothesis: string } |
| Required fields | Fields that must be non-null | company_name and signal are required; signal_date is optional |
| Field validation | Rules each field must satisfy | signal must be ≤ 50 words; signal_date must be within 90 days |
| Failure mode | What happens when a field is missing or invalid | Missing signal = skip prospect (don't pass to email agent) |
| Confidence score | Agent A's self-assessed confidence in the output | confidence: 0.0-1.0; below 0.7 = flag for human review |

Contract rules

  • Define the schema before building the agent. Write the output contract first. Then build Agent A to produce it and Agent B to consume it. Not the reverse
  • Required fields are non-negotiable. If Agent A can't populate a required field, the handoff fails. Don't pass partial data and hope Agent B handles it
  • Optional fields degrade gracefully. Agent B must produce valid output even when optional fields are null. Test with every optional field empty
  • Validate at the boundary. Run schema validation between every agent pair. A 2-line JSON schema check catches errors that would otherwise cascade through 3 downstream agents
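The boundary check in the last rule can be sketched in a few lines of Python. Field names follow the contract table above; the schema itself is illustrative, not prescriptive:

```python
# Required fields from the example contract. Adjust per handoff.
REQUIRED = ("company_name", "signal", "signal_source", "problem_hypothesis")

def validate_handoff(record: dict) -> list:
    """Return a list of contract violations; an empty list means
    Agent B may consume the record."""
    errors = []
    for field in REQUIRED:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    # Field-level rule from the contract: signal must be <= 50 words
    if record.get("signal") and len(record["signal"].split()) > 50:
        errors.append("signal exceeds 50 words")
    return errors
```

A record that fails validation never reaches Agent B; log the errors alongside the prospect ID instead.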

Common GTM Agent Pipelines

Pipeline 1: Research to Email

Research Agent → Email Writer Agent

Input: prospect record (name, title, company, LinkedIn URL)
  ↓
Research Agent produces:
  - company_name (required)
  - employee_count (optional)
  - funding_stage (optional)
  - signal (required): specific trigger event
  - signal_source (required): where the signal came from
  - problem_hypothesis (required): why this signal matters
  - proof_points (optional): relevant case studies or stats
  ↓
Schema validation (required fields present? types correct?)
  ↓
Email Writer Agent consumes research output
  → Produces 3-email sequence using signal + hypothesis

Handoff rules for research-to-email:

  • The research agent must cite the signal source. "Hiring a RevOps lead" is not enough. "Posted RevOps lead role on LinkedIn, 3 days ago" gives the email agent something verifiable
  • The problem hypothesis must connect signal to pain. "They're hiring" is an observation. "Hiring RevOps usually means attribution is broken and pipeline reporting is manual" is a hypothesis the email agent can use
  • If the research agent returns no signal, do not pass to the email agent. No signal = no Email 1. Route to nurture instead

Pipeline 2: Research to Scoring to Email

Research Agent → Scoring Agent → Email Writer Agent

Research Agent produces: full prospect research
  ↓
Scoring Agent consumes research, produces:
  - fit_score: 1-100 (firmographic match to ICP)
  - timing_score: 1-100 (signal recency and relevance)
  - tier: 1, 2, or 3
  - recommended_action: "sequence", "nurture", or "skip"
  ↓
Router (not an agent — just logic):
  tier 1 + "sequence" → Email Writer Agent (high-touch)
  tier 2 + "sequence" → Email Writer Agent (standard)
  tier 3 or "nurture" → Nurture queue
  "skip" → Disqualified list

Handoff rules for scoring-to-routing:

  • The scoring agent outputs a recommendation, not a decision. The router logic makes the decision. This separation means you can change routing rules without changing the scoring prompt
  • Tier assignment must include reasoning. tier: 1, reason: "Series B + hiring RevOps + 150 employees = classic ICP". The reasoning is logged for debugging and prompt improvement
  • Never let the email agent see the score. The email agent doesn't need to know the prospect is Tier 2. It needs the signal, hypothesis, and proof points. Leaking internal scores into email prompts causes the agent to adjust tone based on score, which produces worse emails
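The router in Pipeline 2 is plain code, not a prompt. A minimal sketch in Python, with the destination queue names invented for illustration:

```python
def route(score: dict) -> str:
    """Map the scoring agent's recommendation to a destination.
    Routing rules live here, not in the scoring prompt, so they
    can change without touching the agent."""
    action = score["recommended_action"]
    tier = score.get("tier")
    if action == "skip":
        return "disqualified"
    if action == "nurture" or tier == 3:
        return "nurture_queue"
    if tier == 1:
        return "email_writer_high_touch"
    return "email_writer_standard"
```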

Pipeline 3: Email to Reply Classifier to Follow-Up

Email Writer Agent → [emails sent] → Reply Classifier Agent → Follow-Up Router

Prospect replies to email
  ↓
Reply Classifier Agent consumes:
  - original_email: what was sent
  - reply_text: what the prospect wrote
  - thread_context: previous emails in the thread
  ↓
Reply Classifier produces:
  - classification: positive | negative | question | ooo | referral | unsubscribe
  - confidence: 0.0-1.0
  - sub_classification: (e.g., positive → "interested_now" | "interested_later")
  - recommended_action: specific next step
  - extracted_info: any new data from the reply (new contact, timeline, objection)
  ↓
Follow-Up Router:
  positive + high confidence → Alert SDR immediately + draft follow-up
  question → Draft answer + queue for SDR review
  ooo → Parse return date, reschedule sequence
  negative → Log, remove from sequence
  referral → Create new prospect record, draft outreach
  low confidence (any) → Queue for human classification
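The follow-up router can be sketched the same way: a confidence gate first, then a classification lookup. Destination names are illustrative, and routing unsubscribe to a suppression list is an assumption (the diagram above doesn't cover it):

```python
def route_reply(result: dict, threshold: float = 0.7) -> str:
    """Confidence-gated routing for classifier output."""
    if result["confidence"] < threshold:
        return "human_classification"  # low confidence, any class
    return {
        "positive": "alert_sdr_and_draft_followup",
        "question": "draft_answer_for_sdr_review",
        "ooo": "parse_return_date_and_reschedule",
        "negative": "log_and_remove_from_sequence",
        "referral": "create_prospect_and_draft_outreach",
        "unsubscribe": "suppress",  # assumed destination
    }.get(result["classification"], "human_classification")
```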

Handoff Patterns

Pattern 1: Strict pass-through

Agent A's output is Agent B's input with no transformation.

| When to use | When to avoid |
| --- | --- |
| Agents are tightly coupled and purpose-built for each other | Agents serve multiple upstream/downstream partners |
| Schema is stable and won't change | Schema is evolving or experimental |
| Pipeline is linear (A → B, no branching) | Pipeline has conditional routing |

Pattern 2: Transform layer

A lightweight function between agents that reshapes output.

Research Agent → Transform → Email Agent

Transform:
  - Selects only the fields the email agent needs
  - Truncates signal to 50 words if over
  - Sets defaults for optional fields
  - Validates required fields, rejects if missing

Use when: Agents were built independently or serve multiple pipelines. The transform adapts one agent's output to another's input without coupling them.
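A minimal transform in Python, using the Pipeline 1 field names. The helper is a sketch of the four steps listed above, not a prescribed interface:

```python
def transform_for_email(research: dict) -> dict:
    """Adapt the research agent's output to the email agent's input."""
    required = ("company_name", "signal", "signal_source", "problem_hypothesis")
    missing = [f for f in required if not research.get(f)]
    if missing:
        # Validate required fields, reject if missing
        raise ValueError(f"handoff rejected, missing required fields: {missing}")
    return {
        "company_name": research["company_name"],
        # Truncate signal to 50 words if over
        "signal": " ".join(research["signal"].split()[:50]),
        "signal_source": research["signal_source"],
        "problem_hypothesis": research["problem_hypothesis"],
        # Default for an optional field; other fields are dropped
        "proof_points": research.get("proof_points") or [],
    }
```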

Pattern 3: Accumulator

Each agent adds to a growing context object that passes through the pipeline.

{
  "prospect": { ... },          // original input
  "research": { ... },          // added by Research Agent
  "score": { ... },             // added by Scoring Agent
  "email_sequence": { ... },    // added by Email Agent
  "reply_classification": { ... } // added by Classifier Agent
}

Rules for accumulators:

  • Each agent reads only the fields it needs from the context. Don't dump the entire accumulator into the prompt
  • Each agent writes to its own namespace. Research agent writes to research, scoring agent writes to score. No cross-writing
  • The accumulator is the audit trail. Log the full object at each stage for debugging
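All three rules can be enforced by a single stage-runner. A Python sketch, assuming agents are plain functions from a filtered view of the context to a dict:

```python
def run_stage(context: dict, namespace: str, agent_fn, reads: tuple) -> dict:
    """Run one pipeline stage against the accumulator: the agent reads
    a filtered view and writes only to its own namespace."""
    view = {key: context[key] for key in reads}    # only the fields it needs
    if namespace in context:
        raise KeyError(f"namespace already written: {namespace}")
    return {**context, namespace: agent_fn(view)}  # append-only write
```

Because each stage returns a new object instead of mutating in place, the accumulator at every step can be logged as the audit trail.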

Error Handling at Handoffs

What can go wrong

| Failure | Cause | Impact if unhandled |
| --- | --- | --- |
| Missing required field | Research agent found no signal | Email agent writes a generic email or hallucinates a signal |
| Invalid field value | Employee count is "lots" instead of a number | Scoring agent scores incorrectly or crashes |
| Schema mismatch | Research agent adds a new field the email agent doesn't expect | Silent data loss (benign) or parsing error (bad) |
| Hallucinated data | Research agent fabricates a funding round | Email references a fake funding round; the prospect knows instantly |
| Timeout | Research agent takes 60 seconds on one prospect | Pipeline stalls; batch processing hangs |
| Confidence too low | Scoring agent is 40% confident in the tier | Wrong routing; a Tier 3 prospect gets Tier 1 treatment |

Error handling rules

  • Fail fast, fail loud. If a required field is missing, reject the handoff immediately. Don't pass partial data and hope the next agent handles it. Log the failure with the prospect ID and the missing field
  • Never let Agent B guess. If Agent A didn't provide a signal, Agent B must not invent one. The handoff validator rejects the record before Agent B ever sees it
  • Set timeouts per agent. Research agent: 30 seconds max. Email agent: 15 seconds max. If an agent exceeds its timeout, log the failure and skip to the next prospect
  • Route low-confidence outputs to humans. Any agent output below 0.7 confidence goes to a human review queue, not to the next agent. Automated pipelines run on high-confidence outputs only
  • Dead letter queue. Failed handoffs go to a queue for human review and retry. Don't silently drop prospects because one agent failed
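A minimal sketch of per-agent timeouts feeding a dead letter queue. The in-memory list stands in for a real persisted queue, and since Python threads can't be force-killed, production pipelines typically isolate agents in processes:

```python
import concurrent.futures

dead_letter_queue = []  # failed handoffs wait here for human review and retry

def run_with_timeout(agent_fn, record: dict, timeout_s: float):
    """Run one agent with a hard timeout. Failures are queued and
    logged, never silently dropped."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        try:
            return pool.submit(agent_fn, record).result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            dead_letter_queue.append({"record": record, "error": "timeout"})
        except Exception as exc:
            dead_letter_queue.append({"record": record, "error": str(exc)})
    return None
```

A `None` return means "skip to the next prospect"; the failed record is already on the queue with its reason attached.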

Context Management

What to pass vs. what to withhold

| Pass to downstream agent | Withhold from downstream agent |
| --- | --- |
| Fields the agent needs to do its job | Internal scores and confidence values |
| Source citations (where data came from) | Raw research notes or intermediate reasoning |
| Validated, cleaned data | Unvalidated or conflicting data points |
| The specific prompt context it needs | The full accumulator (pass a filtered view) |

Context rules

  • Minimal context, maximum relevance. The email agent needs: signal, hypothesis, proof points, prospect name, title, company. It does not need: employee count, funding amount, tech stack, or the 500-word research brief. Pass only what the prompt references
  • Never pass the upstream agent's prompt. Agent B should not see Agent A's system prompt. This prevents prompt leakage and keeps agents independent
  • Structured data over prose. Pass signal: "Hiring RevOps lead, posted 3 days ago on LinkedIn" not a paragraph of research. Structured data is easier for the downstream agent to use correctly
  • Include provenance. Every field should trace back to its source. signal_source: "LinkedIn job posting" lets the email agent (or a human reviewer) verify the claim

Testing Handoffs

What to test

| Test type | What it validates | How to run |
| --- | --- | --- |
| Schema compliance | Does Agent A's output match the contract? | JSON schema validation on every output |
| Required field coverage | Are all required fields populated? | Automated check: null/empty detection |
| Field quality | Are field values useful, not just present? | LLM-as-judge on a sample: "Is this signal specific enough for an email?" |
| End-to-end | Does the full pipeline produce good final output? | Run 20 prospects through the pipeline; human-review the emails |
| Failure injection | Does the pipeline handle missing fields correctly? | Deliberately pass records with missing required fields; verify they're rejected |
| Edge cases | Does the pipeline handle unusual inputs? | Companies with no funding data, prospects with no LinkedIn, signals older than 90 days |

Testing rules

  • Test the handoff, not just the agents. Each agent can pass its own tests but fail at the handoff. A research agent that produces great research in a format the email agent can't parse is a broken pipeline
  • Run end-to-end tests on every prompt change. Changing the research agent's prompt may change its output schema subtly. The email agent may not handle the new format. Always test the full pipeline
  • Maintain a golden set per handoff. 10 known-good handoff examples (Agent A output + Agent B output) that you re-run after every change
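A golden-set runner can be a few lines. A Python sketch, assuming one JSON file per example with input and expected_output keys (the file layout is an assumption, not a standard):

```python
import json
import pathlib

def run_golden_set(agent_fn, golden_dir):
    """Re-run every stored handoff example through the agent and diff
    against the known-good output. Returns the names of failing cases."""
    failures = []
    for case in sorted(pathlib.Path(golden_dir).glob("*.json")):
        example = json.loads(case.read_text())
        if agent_fn(example["input"]) != example["expected_output"]:
            failures.append(case.name)
    return failures
```

Run it after every prompt change; a non-empty result means the change altered output the downstream agent depends on.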

Pre-Build Checklist

Before building a multi-agent pipeline:

  • [ ] Each agent has a single, clear responsibility (one job per agent)
  • [ ] Output schema defined for every agent before implementation
  • [ ] Required vs optional fields documented for every handoff
  • [ ] Schema validation runs between every agent pair
  • [ ] Failure mode defined for every required field (what happens if missing?)
  • [ ] Timeout set per agent
  • [ ] Low-confidence outputs route to human review, not next agent
  • [ ] Dead letter queue exists for failed handoffs
  • [ ] Context passed to each agent is minimal and relevant
  • [ ] End-to-end golden set exists (10+ examples through full pipeline)
  • [ ] Handoff logging captures full input/output at each stage

Anti-Pattern Check

  • Passing the entire context to every agent. The email agent gets 2,000 tokens of research when it needs 200. Excess context increases hallucination risk, slows inference, and costs more. Filter to relevant fields only
  • No schema validation between agents. Agent A changes its output format slightly. Agent B silently misparses a field. 50 emails go out with wrong company names. Validate at every boundary
  • Letting Agent B guess when Agent A fails. Research agent returns no signal. Email agent writes "I noticed your company is doing great things." That's a hallucination dressed as a greeting. Reject at the handoff
  • Building agents that depend on each other's prompts. Changing Agent A's prompt breaks Agent B because B assumed a specific output format that wasn't in the contract. Agents should couple to schemas, not prompts
  • No dead letter queue. Failed handoffs are silently dropped. 15% of prospects never get emailed because the research agent timed out. Nobody notices for 2 weeks. Log and queue every failure
  • Scoring agent output visible to email agent. The email agent sees tier: 3, fit_score: 28 and writes a half-hearted email. Scores are for routing decisions, not for downstream agents. Filter them out
  • Testing agents in isolation only. Each agent passes its unit tests. The pipeline fails because Agent A's "date" field is ISO format and Agent B expects MM/DD/YYYY. Always test end-to-end
  • One monolithic agent instead of a pipeline. A single agent that researches, scores, writes emails, and classifies replies. When it fails, you can't tell which step broke. When you fix one step, you risk regressing another. One job per agent