Choose a single agent by default. Multi-agent systems consume roughly 15x more tokens than a chat baseline (Anthropic, 2025), fail in production at rates between 41% and 86.7% (Maxim AI, 2026), and add coordination bugs that are hard to debug. They are still the right choice for parallelizable, read-heavy work like open-ended research. This article gives you a 6-question decision framework. Answer yes to 4 or more and multi-agent is justified. Otherwise, scale the single agent.
What is the difference between a single agent and a multi-agent system?
A single agent is one LLM loop that plans, calls tools, and writes results inside one context window. A multi-agent system is a set of LLM loops with independent state, often coordinated by an orchestrator that delegates subtasks to specialized workers and synthesizes their outputs.
The distinction is not just architectural. It changes the failure mode, the bill, and the on-call rotation.
- Single agent: one trace to debug, one window to manage, sequential tool calls.
- Multi-agent: N traces, N windows, parallel tool calls, plus a coordination layer that becomes its own failure surface (the single-agent baseline is sketched below).
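To make the shape concrete, here is a minimal single-agent loop. It is a sketch, not a real SDK: `llm()` and `run_tool()` are hypothetical stand-ins. A multi-agent system replicates this loop N times, each with its own message list, and adds an orchestrator on top.

```python
# Minimal single-agent shape: one loop, one message list, one trace.
# llm() and run_tool() are hypothetical stand-ins for a model API and a tool executor.
def llm(messages: list[dict]) -> dict:
    return {"type": "answer", "content": "done"}  # stub

def run_tool(name: str, args: dict) -> str:
    return f"result of {name}"  # stub

def single_agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]   # the one context window
    while True:
        reply = llm(messages)
        if reply["type"] == "tool_call":             # sequential tool calls
            output = run_tool(reply["name"], reply["args"])
            messages.append({"role": "tool", "content": output})
        else:
            return reply["content"]                  # one trace to debug
```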
Anthropic's Building Effective Agents draws an even sharper line: most production systems are workflows (predefined code paths orchestrating LLM calls), not agents (LLMs dynamically directing their own tools). True multi-agent systems sit at the most autonomous and most expensive end of that spectrum.
When should I use a single agent vs multi-agent?
Use a single agent unless the task is genuinely parallelizable, context-bound, and high-value. That is the short answer. The longer answer is the 6-question framework below. Score one point per yes; a scoring sketch follows the list. Four or more justifies multi-agent. Three or fewer means scale the single agent.
1. Parallelism need. Can the task be split into independent strands that have no data dependency on each other?
2. Context window pressure. Is one 200K window genuinely insufficient even with aggressive summarization and tool-result compression?
3. Role specialization. Do subtasks need conflicting system prompts, different toolsets, or different models (e.g., Opus planner + Haiku workers)?
4. Reliability budget. Can you absorb a measured drop in end-to-end success rate in exchange for breadth, and do you have evals to detect it?
5. Observability maturity. Do you already have per-trace logging, tool-call replay, and an eval harness running on every change?
6. Ops cost. Is the dollar value of the output high enough that a 3-4x token bill is a rounding error?
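Scored in code, the rubric is a few lines. The field names below are ours, purely illustrative:

```python
# One point per yes; four or more justifies a multi-agent pilot.
QUESTIONS = [
    "parallelism_need", "context_pressure", "role_specialization",
    "reliability_budget", "observability_maturity", "ops_cost_headroom",
]

def recommend(answers: dict[str, bool]) -> str:
    score = sum(answers[q] for q in QUESTIONS)
    return "multi-agent pilot" if score >= 4 else "scale the single agent"

# Four yes answers -> pilot; three or fewer -> stay single-agent.
print(recommend({
    "parallelism_need": True, "context_pressure": True,
    "role_specialization": True, "reliability_budget": True,
    "observability_maturity": False, "ops_cost_headroom": False,
}))  # -> multi-agent pilot
```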
The framework is intentionally biased toward single-agent. Most teams overestimate parallelism (#1) and underestimate observability cost (#5).
What does Anthropic recommend for multi-agent design?
Anthropic recommends an orchestrator-worker pattern with a lead agent that decomposes the task and subagents that explore in parallel, each with their own context window. In their internal evaluation, this design with Claude Opus 4 as lead and Claude Sonnet 4 subagents outperformed single-agent Opus 4 by 90.2% on research tasks (Anthropic, 2025). The mechanics are sketched after the list below.
The headline result hides three conditions:
- The task was open-ended research with breadth-first parallelism built in.
- Token usage alone explained 80% of performance variance -- they were buying performance with compute.
- They explicitly wrote that multi-agent is for tasks where 'the value of the task is high enough to pay for the increased performance.'
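In code, the orchestrator-worker pattern reduces to fan-out/fan-in. A minimal asyncio sketch, not Anthropic's implementation; `call_model()` is a stand-in for a real messages API and the model labels are illustrative:

```python
import asyncio

async def call_model(model: str, prompt: str) -> str:
    return f"[{model}] {prompt[:40]}"  # stand-in for a real messages API call

async def research(question: str) -> str:
    # Lead agent decomposes the task into independent strands.
    plan = await call_model("lead", f"Split into parallel subqueries: {question}")
    subqueries = plan.splitlines()  # assume one subquery per line
    # Subagents explore in parallel, each in its own context window.
    findings = await asyncio.gather(
        *(call_model("worker", q) for q in subqueries)
    )
    # The lead synthesizes -- this is where the token bill multiplies.
    return await call_model("lead", f"Synthesize: {findings}")

print(asyncio.run(research("state of agent frameworks")))
```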
Anthropic's earlier essay Building Effective Agents is more conservative still: 'Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.' Both pieces point in the same direction. Earn the complexity.
What is Cognition's case against multi-agent systems?
Cognition (the team behind Devin) published Don't Build Multi-Agents in June 2025 with two principles:
- 'Share context, and share full agent traces, not just individual messages.'
- 'Actions carry implicit decisions, and conflicting decisions carry bad results.'
The argument: parallel subagents make implicit choices about style, edge cases, and code patterns that conflict with each other. Their canonical example is a subagent that builds a Super Mario Bros background while another builds Flappy Bird pipes -- both technically correct, mutually unusable.
Cognition's prescription is a single-threaded linear agent with continuous context. For long horizons, they introduce a dedicated compression model that distills history into key details rather than handing parts to peer agents.
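A sketch of that compression step. `summarize()` stands in for the dedicated compression model; the token threshold and message format are our assumptions:

```python
# When the window nears its budget, distill old history into key details
# instead of delegating to peer agents.
def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) // 4 for m in messages)  # rough heuristic

def summarize(messages: list[dict]) -> str:
    return "key decisions and constraints: ..."  # stand-in for the compression model

def maybe_compress(messages: list[dict], budget: int = 150_000) -> list[dict]:
    if estimate_tokens(messages) < budget:
        return messages
    digest = summarize(messages[:-5])            # keep the last few turns verbatim
    return [{"role": "system", "content": digest}] + messages[-5:]
```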
Ten months later, Cognition published Multi-Agents: What's Actually Working. The update narrows the safe zone: multiple agents can contribute reads, but writes must stay single-threaded. That is the architectural rule of thumb for 2026.
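The rule maps directly onto code shape: fan out the reads, serialize the writes. A sketch, with function names that are ours:

```python
import asyncio

async def read_task(agent_id: int, query: str) -> str:
    return f"notes[{agent_id}] on {query}"       # stand-in for a read-only subagent

def apply_write(note: str) -> None:
    print("single-threaded write:", note)        # one writer, consistent decisions

async def run(queries: list[str]) -> None:
    # Reads parallelize safely: no conflicting implicit decisions land anywhere.
    notes = await asyncio.gather(
        *(read_task(i, q) for i, q in enumerate(queries))
    )
    for note in notes:                           # writes stay sequential
        apply_write(note)

asyncio.run(run(["auth flow", "db schema", "api surface"]))
```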
What is the operational cost of a multi-agent system?
Multi-agent systems cost roughly 15x a chat baseline in tokens, vs ~4x for single agents (Anthropic, 2025). That is before you count orchestration overhead, eval runs, and per-agent observability.
A realistic 2026 cost stack for a production multi-agent system:
| Cost line | Single agent | Multi-agent |
|---|---|---|
| Token spend per task | 1x (baseline) | ~3-4x |
| Latency | 1-3s per tool call | 10-40s end-to-end |
| Tracing infra | One pipeline | Per-agent + orchestrator |
| Eval surface | One harness | Per-role harness + integration evals |
| On-call complexity | Linear | Multiplicative |
Latency numbers are from Maxim AI's reliability writeup (2026), which also reports pilot accuracy of 95-98% dropping to 80-87% in production.
The economic test is simple: task value > token + ops cost premium. Research at $0.50 per query is a bad fit. Research at $50 per query (legal discovery, M&A diligence, security audits) is a great fit.
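As arithmetic, with deliberately illustrative numbers:

```python
# Break-even check; every number here is illustrative -- plug in your own.
def multi_agent_pays_off(task_value: float,
                         single_agent_cost: float,
                         token_multiplier: float = 3.5,  # ~3-4x vs single agent
                         ops_overhead: float = 2.00) -> bool:
    multi_cost = single_agent_cost * token_multiplier + ops_overhead
    return task_value > multi_cost

print(multi_agent_pays_off(task_value=0.50, single_agent_cost=0.30))   # False
print(multi_agent_pays_off(task_value=50.00, single_agent_cost=0.30))  # True
```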
How do you know your single agent has hit its limit?
Three signals tell you a single agent has hit its ceiling:
- Context overflow despite compression. You have implemented summarization, retrieval, and tool-result truncation, and you still routinely hit the 200K window on real tasks.
- Sequential latency on independent work. Your trace shows the agent making 8 tool calls in series where 6 of them have no data dependency. Parallelism would cut wall time materially (a trace-scan heuristic follows this list).
- Tool or prompt conflict. You are role-switching system prompts mid-conversation, or the same agent needs incompatible tool definitions for different subtasks.
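The second signal is the easiest to check mechanically. A crude heuristic over a hypothetical trace format (a list of tool-call dicts):

```python
# Count tool calls whose arguments reference no earlier output -- candidates
# for parallel execution. The trace format here is hypothetical.
def independent_calls(trace: list[dict]) -> int:
    prior_outputs: list[str] = []
    count = 0
    for call in trace:
        args = str(call["args"])
        if not any(out and out in args for out in prior_outputs):
            count += 1                            # no data dependency detected
        prior_outputs.append(str(call.get("output", "")))
    return count

# 6 of 8 calls independent? Parallelism would cut wall time materially.
```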
If two of these are true, run a bounded multi-agent pilot with strict scope: one orchestrator, two to three subagents, full tracing, and an eval harness that measures the same task on the existing single agent for comparison.
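The comparison harness can be this small. `run_single`, `run_multi`, and `grade` are your two entry points and your success metric, not library calls:

```python
# Same tasks, both architectures, one metric -- the only honest comparison.
def run_pilot(tasks, run_single, run_multi, grade) -> dict[str, float]:
    wins = {"single": 0, "multi": 0}
    for task in tasks:
        wins["single"] += grade(task, run_single(task))   # grade returns 0 or 1
        wins["multi"] += grade(task, run_multi(task))
    return {k: v / len(tasks) for k, v in wins.items()}   # success rates
```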
If only one is true, the fix is almost always better tools, better context engineering, or a stronger model, not more agents. Multi-agent failure rates run 41-86.7% in production per Maxim AI (2026), and Gartner projects 40%+ of agentic AI projects will be canceled by 2027 -- mostly because teams jumped to coordination before solving capability.
What does the single agent vs multi-agent comparison look like side by side?
A direct comparison across the dimensions that matter for production:
| Dimension | Single Agent | Multi-Agent System |
|---|---|---|
| Token cost vs chat | ~4x | ~15x |
| Best task shape | Sequential, interdependent, write-heavy | Parallelizable, read-heavy, breadth-first |
| Latency profile | 1-3s per tool call | 10-40s end-to-end |
| Context strategy | One window + compression | Per-subagent windows + summarization |
| Dominant failure mode | Context overflow, drift | Coordination conflicts, divergent decisions |
| Observability | One trace | Trace per agent + orchestrator + handoffs |
| Strongest fit | Coding, customer ops, anything with shared state | Open-ended research, parallel search |
| Public exemplar | Devin (Cognition's linear agent) | Anthropic Research (lead + subagents, 90.2% lift) |
The pattern is consistent across both cited teams. Cognition kept Devin single-threaded for code because writes need consistency. Anthropic went multi-agent for research because reads benefit from breadth. Match your task shape to the architecture, not the other way around.
How should you scale a single agent before going multi-agent?
Before splitting into multiple agents, exhaust these single-agent levers in order:
1. Tool quality. Most 'agent reasoning' problems are tool problems. Better tool descriptions, fewer tools, idempotent operations.
2. Context engineering. Aggressive summarization of tool outputs, structured memory, retrieval over raw stuffing.
3. Prompt caching. Cache reads cost 10% of standard input tokens on Claude. Stable system prompts and tool definitions become near-free.
4. Model tiering inside one loop. Use Haiku for cheap classification calls, Sonnet for the main loop, Opus only on high-stakes turns (steps 3 and 4 are sketched after this list).
5. Subagents (not multi-agent). Spawn short-lived workers with bounded scope and explicit return contracts. The Claude Agent SDK supports this without committing to full multi-agent infra.
6. Compression agent. Cognition's pattern: a dedicated LLM whose only job is compressing history into key details when context pressure spikes.
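Steps 3 and 4 compose inside one loop. A hedged sketch against the Anthropic Python SDK; the model IDs and the tiering heuristic are illustrative, so check current names in the docs:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is configured

def pick_model(turn_kind: str) -> str:
    # Step 4: tier the model to the turn, all inside one agent loop.
    return {
        "classify": "claude-haiku-4-5",       # cheap classification calls
        "main": "claude-sonnet-4-5",          # the main loop
        "high_stakes": "claude-opus-4-5",     # high-stakes turns only
    }[turn_kind]

def call(turn_kind: str, system_prompt: str, user_msg: str):
    # Step 3: mark the stable system prompt cacheable; cache reads are
    # billed at ~10% of standard input tokens.
    return client.messages.create(
        model=pick_model(turn_kind),
        max_tokens=1024,
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": user_msg}],
    )
```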
Most teams skip steps 1-4 and jump to multi-agent because it sounds architecturally serious. The result is the 40%+ cancellation rate Gartner projects. Scale the single agent. Earn the complexity.