Choose a single agent by default. Multi-agent systems consume roughly 15x more tokens than a chat baseline (Anthropic, 2025), fail in production at rates between 41% and 86.7% (Maxim AI, 2026), and add coordination bugs that are hard to debug. They are still the right choice for parallelizable, read-heavy work like open-ended research. This article gives you a 6-question decision framework. Answer yes to 4 or more and multi-agent is justified. Otherwise, scale the single agent.
What is the difference between a single agent and a multi-agent system?
A single agent is one LLM loop that plans, calls tools, and writes results inside one context window. A multi-agent system is a set of LLM loops with independent state, often coordinated by an orchestrator that delegates subtasks to specialized workers and synthesizes their outputs.
The distinction is not just architectural. It changes the failure mode, the bill, and the on-call rotation.
- Single agent: one trace to debug, one window to manage, sequential tool calls.
- Multi-agent: N traces, N windows, parallel tool calls, plus a coordination layer that becomes its own failure surface (the single-agent baseline is sketched below).
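To make the shape concrete, here is a minimal single-agent loop. It is a sketch, not a real SDK: `llm()` and `run_tool()` are hypothetical stand-ins. A multi-agent system replicates this loop N times, each with its own message list, and adds an orchestrator on top.

```python
# Minimal single-agent shape: one loop, one message list, one trace.
# llm() and run_tool() are hypothetical stand-ins for a model API and a tool executor.
def llm(messages: list[dict]) -> dict:
    return {"type": "answer", "content": "done"}  # stub

def run_tool(name: str, args: dict) -> str:
    return f"result of {name}"  # stub

def single_agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]   # the one context window
    while True:
        reply = llm(messages)
        if reply["type"] == "tool_call":             # sequential tool calls
            output = run_tool(reply["name"], reply["args"])
            messages.append({"role": "tool", "content": output})
        else:
            return reply["content"]                  # one trace to debug
```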
Anthropic's Building Effective Agents draws an even sharper line: most production systems are workflows (predefined code paths orchestrating LLM calls), not agents (LLMs dynamically directing their own tools). True multi-agent systems sit at the most autonomous and most expensive end of that spectrum.
When should I use a single agent vs multi-agent?
Use a single agent unless the task is genuinely parallelizable, context-bound, and high-value. That is the short answer. The longer answer is the 6-question framework below. Score one point per yes; a scoring sketch follows the list. Four or more justifies multi-agent. Three or fewer means scale the single agent.
1. Parallelism need. Can the task be split into independent strands that have no data dependency on each other?
2. Context window pressure. Is one 200K window genuinely insufficient even with aggressive summarization and tool-result compression?
3. Role specialization. Do subtasks need conflicting system prompts, different toolsets, or different models (e.g., Opus planner + Haiku workers)?
4. Reliability budget. Can you absorb a measured drop in end-to-end success rate in exchange for breadth, and do you have evals to detect it?
5. Observability maturity. Do you already have per-trace logging, tool-call replay, and an eval harness running on every change?
6. Ops cost. Is the dollar value of the output high enough that a 3-4x token bill is a rounding error?
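Scored in code, the rubric is a few lines. The field names below are ours, purely illustrative:

```python
# One point per yes; four or more justifies a multi-agent pilot.
QUESTIONS = [
    "parallelism_need", "context_pressure", "role_specialization",
    "reliability_budget", "observability_maturity", "ops_cost_headroom",
]

def recommend(answers: dict[str, bool]) -> str:
    score = sum(answers[q] for q in QUESTIONS)
    return "multi-agent pilot" if score >= 4 else "scale the single agent"

# Four yes answers -> pilot; three or fewer -> stay single-agent.
print(recommend({
    "parallelism_need": True, "context_pressure": True,
    "role_specialization": True, "reliability_budget": True,
    "observability_maturity": False, "ops_cost_headroom": False,
}))  # -> multi-agent pilot
```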
The framework is intentionally biased toward single-agent. Most teams overestimate parallelism (#1) and underestimate observability cost (#5).
What does Anthropic recommend for multi-agent design?
Anthropic recommends an orchestrator-worker pattern with a lead agent that decomposes the task and subagents that explore in parallel, each with their own context window. In their internal evaluation, this design with Claude Opus 4 as lead and Claude Sonnet 4 subagents outperformed single-agent Opus 4 by 90.2% on research tasks (Anthropic, 2025). The mechanics are sketched after the list below.
The headline result hides three conditions:
- The task was open-ended research with breadth-first parallelism built in.
- Token usage alone explained 80% of performance variance -- they were buying performance with compute.
- They explicitly wrote that multi-agent is for tasks where 'the value of the task is high enough to pay for the increased performance.'
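In code, the orchestrator-worker pattern reduces to fan-out/fan-in. A minimal asyncio sketch, not Anthropic's implementation; `call_model()` is a stand-in for a real messages API and the model labels are illustrative:

```python
import asyncio

async def call_model(model: str, prompt: str) -> str:
    return f"[{model}] {prompt[:40]}"  # stand-in for a real messages API call

async def research(question: str) -> str:
    # Lead agent decomposes the task into independent strands.
    plan = await call_model("lead", f"Split into parallel subqueries: {question}")
    subqueries = plan.splitlines()  # assume one subquery per line
    # Subagents explore in parallel, each in its own context window.
    findings = await asyncio.gather(
        *(call_model("worker", q) for q in subqueries)
    )
    # The lead synthesizes -- this is where the token bill multiplies.
    return await call_model("lead", f"Synthesize: {findings}")

print(asyncio.run(research("state of agent frameworks")))
```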
Anthropic's earlier essay Building Effective Agents is more conservative still: 'Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.' Both pieces point in the same direction. Earn the complexity.
What is Cognition's case against multi-agent systems?
Cognition (the team behind Devin) published Don't Build Multi-Agents in June 2025 with two principles:
- 'Share context, and share full agent traces, not just individual messages.'
- 'Actions carry implicit decisions, and conflicting decisions carry bad results.'
The argument: parallel subagents make implicit choices about style, edge cases, and code patterns that conflict with each other. Their canonical example is a subagent that builds a Super Mario Bros background while another builds Flappy Bird pipes -- both technically correct, mutually unusable.
Cognition's prescription is a single-threaded linear agent with continuous context. For long horizons, they introduce a dedicated compression model that distills history into key details rather than handing parts to peer agents.
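A sketch of that compression step. `summarize()` stands in for the dedicated compression model; the token threshold and message format are our assumptions:

```python
# When the window nears its budget, distill old history into key details
# instead of delegating to peer agents.
def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) // 4 for m in messages)  # rough heuristic

def summarize(messages: list[dict]) -> str:
    return "key decisions and constraints: ..."  # stand-in for the compression model

def maybe_compress(messages: list[dict], budget: int = 150_000) -> list[dict]:
    if estimate_tokens(messages) < budget:
        return messages
    digest = summarize(messages[:-5])            # keep the last few turns verbatim
    return [{"role": "system", "content": digest}] + messages[-5:]
```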
Ten months later, Cognition published Multi-Agents: What's Actually Working. The update narrows the safe zone: multiple agents can contribute reads, but writes must stay single-threaded. That is the architectural rule of thumb for 2026.
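The rule maps directly onto code shape: fan out the reads, serialize the writes. A sketch, with function names that are ours:

```python
import asyncio

async def read_task(agent_id: int, query: str) -> str:
    return f"notes[{agent_id}] on {query}"       # stand-in for a read-only subagent

def apply_write(note: str) -> None:
    print("single-threaded write:", note)        # one writer, consistent decisions

async def run(queries: list[str]) -> None:
    # Reads parallelize safely: no conflicting implicit decisions land anywhere.
    notes = await asyncio.gather(
        *(read_task(i, q) for i, q in enumerate(queries))
    )
    for note in notes:                           # writes stay sequential
        apply_write(note)

asyncio.run(run(["auth flow", "db schema", "api surface"]))
```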
What is the operational cost of a multi-agent system?
Multi-agent systems cost roughly 15x a chat baseline in tokens, vs ~4x for single agents (Anthropic, 2025). That is before you count orchestration overhead, eval runs, and per-agent observability.
A realistic 2026 cost stack for a production multi-agent system:
| Cost line | Single agent | Multi-agent |
|---|---|---|
| Token spend per task | 1x (baseline) | ~3-4x |
| Latency | 1-3s per tool call | 10-40s end-to-end |
| Tracing infra | One pipeline | Per-agent + orchestrator |
| Eval surface | One harness | Per-role harness + integration evals |
| On-call complexity | Linear | Multiplicative |
Latency numbers are from Maxim AI's reliability writeup (2026), which also reports pilot accuracy of 95-98% dropping to 80-87% in production.
The economic test is simple: task value > token + ops cost premium. Research at $0.50 per query is a bad fit. Research at $50 per query (legal discovery, M&A diligence, security audits) is a great fit.
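As arithmetic, with deliberately illustrative numbers:

```python
# Break-even check; every number here is illustrative -- plug in your own.
def multi_agent_pays_off(task_value: float,
                         single_agent_cost: float,
                         token_multiplier: float = 3.5,  # ~3-4x vs single agent
                         ops_overhead: float = 2.00) -> bool:
    multi_cost = single_agent_cost * token_multiplier + ops_overhead
    return task_value > multi_cost

print(multi_agent_pays_off(task_value=0.50, single_agent_cost=0.30))   # False
print(multi_agent_pays_off(task_value=50.00, single_agent_cost=0.30))  # True
```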
How do you know your single agent has hit its limit?
Three signals tell you a single agent has hit its ceiling:
- Context overflow despite compression. You have implemented summarization, retrieval, and tool-result truncation, and you still routinely hit the 200K window on real tasks.
- Sequential latency on independent work. Your trace shows the agent making 8 tool calls in series where 6 of them have no data dependency. Parallelism would cut wall time materially (a trace-scan heuristic follows this list).
- Tool or prompt conflict. You are role-switching system prompts mid-conversation, or the same agent needs incompatible tool definitions for different subtasks.
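The second signal is the easiest to check mechanically. A crude heuristic over a hypothetical trace format (a list of tool-call dicts):

```python
# Count tool calls whose arguments reference no earlier output -- candidates
# for parallel execution. The trace format here is hypothetical.
def independent_calls(trace: list[dict]) -> int:
    prior_outputs: list[str] = []
    count = 0
    for call in trace:
        args = str(call["args"])
        if not any(out and out in args for out in prior_outputs):
            count += 1                            # no data dependency detected
        prior_outputs.append(str(call.get("output", "")))
    return count

# 6 of 8 calls independent? Parallelism would cut wall time materially.
```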
If two of these are true, run a bounded multi-agent pilot with strict scope: one orchestrator, two to three subagents, full tracing, and an eval harness that measures the same task on the existing single agent for comparison.
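The comparison harness can be this small. `run_single`, `run_multi`, and `grade` are your two entry points and your success metric, not library calls:

```python
# Same tasks, both architectures, one metric -- the only honest comparison.
def run_pilot(tasks, run_single, run_multi, grade) -> dict[str, float]:
    wins = {"single": 0, "multi": 0}
    for task in tasks:
        wins["single"] += grade(task, run_single(task))   # grade returns 0 or 1
        wins["multi"] += grade(task, run_multi(task))
    return {k: v / len(tasks) for k, v in wins.items()}   # success rates
```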
If only one is true, the fix is almost always better tools, better context engineering, or a stronger model, not more agents. Multi-agent failure rates run 41-86.7% in production per Maxim AI (2026), and Gartner projects 40%+ of agentic AI projects will be canceled by 2027 -- mostly because teams jumped to coordination before solving capability.
What does the single agent vs multi-agent comparison look like side by side?
A direct comparison across the dimensions that matter for production:
| Dimension | Single Agent | Multi-Agent System |
|---|---|---|
| Token cost vs chat | ~4x | ~15x |
| Best task shape | Sequential, interdependent, write-heavy | Parallelizable, read-heavy, breadth-first |
| Latency profile | 1-3s per tool call | 10-40s end-to-end |
| Context strategy | One window + compression | Per-subagent windows + summarization |
| Dominant failure mode | Context overflow, drift | Coordination conflicts, divergent decisions |
| Observability | One trace | Trace per agent + orchestrator + handoffs |
| Strongest fit | Coding, customer ops, anything with shared state | Open-ended research, parallel search |
| Public exemplar | Devin (Cognition's linear agent) | Anthropic Research (lead + subagents, 90.2% lift) |
The pattern is consistent across both cited teams. Cognition kept Devin single-threaded for code because writes need consistency. Anthropic went multi-agent for research because reads benefit from breadth. Match your task shape to the architecture, not the other way around.
How should you scale a single agent before going multi-agent?
Before splitting into multiple agents, exhaust these single-agent levers in order:
1. Tool quality. Most 'agent reasoning' problems are tool problems. Better tool descriptions, fewer tools, idempotent operations.
2. Context engineering. Aggressive summarization of tool outputs, structured memory, retrieval over raw stuffing.
3. Prompt caching. Cache reads cost 10% of standard input tokens on Claude. Stable system prompts and tool definitions become near-free.
4. Model tiering inside one loop. Use Haiku for cheap classification calls, Sonnet for the main loop, Opus only on high-stakes turns (steps 3 and 4 are sketched after this list).
5. Subagents (not multi-agent). Spawn short-lived workers with bounded scope and explicit return contracts. The Claude Agent SDK supports this without committing to full multi-agent infra.
6. Compression agent. Cognition's pattern: a dedicated LLM whose only job is compressing history into key details when context pressure spikes.
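Steps 3 and 4 compose inside one loop. A hedged sketch against the Anthropic Python SDK; the model IDs and the tiering heuristic are illustrative, so check current names in the docs:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is configured

def pick_model(turn_kind: str) -> str:
    # Step 4: tier the model to the turn, all inside one agent loop.
    return {
        "classify": "claude-haiku-4-5",       # cheap classification calls
        "main": "claude-sonnet-4-5",          # the main loop
        "high_stakes": "claude-opus-4-5",     # high-stakes turns only
    }[turn_kind]

def call(turn_kind: str, system_prompt: str, user_msg: str):
    # Step 3: mark the stable system prompt cacheable; cache reads are
    # billed at ~10% of standard input tokens.
    return client.messages.create(
        model=pick_model(turn_kind),
        max_tokens=1024,
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": user_msg}],
    )
```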
Most teams skip steps 1-4 and jump to multi-agent because it sounds architecturally serious. The result is the 40%+ cancellation rate Gartner projects. Scale the single agent. Earn the complexity.