95% per-agent reliability multiplied across 10 chained agents equals 60% end-to-end success. That is reliability compounding, and it is the reason most multi-agent demos die in production. The math comes from Lusser's Law (1957): when independent steps run in series, system reliability is the product of step reliabilities. Berkeley's MAST study measured failure rates of 41% to 86.7% across seven popular multi-agent frameworks. Better models shift the curve. Structure breaks it. This piece shows both: the simulation, and the five fixes production teams use.

What is reliability compounding in multi-agent systems?

Reliability compounding is the multiplicative decay of end-to-end success probability as agents are chained in series. If each agent succeeds with probability p, an n-agent pipeline succeeds with probability p^n. The decay is non-linear and brutal at scale.
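
The formula is short enough to check directly. A minimal illustration in plain Python:

```python
def end_to_end_reliability(p: float, n: int) -> float:
    """Lusser's Law: series reliability is the product of step reliabilities."""
    return p ** n

for n in (1, 3, 5, 10, 15):
    print(f"{n:>2} agents: {end_to_end_reliability(0.95, n):.1%} end-to-end success")
# at 10 agents, end-to-end success is 59.9%: four in ten requests fail
```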

The principle is Lusser's Law, formalized by rocketry engineer Robert Lusser in the 1950s while working on Wernher von Braun's program. It is the same math that governs aerospace reliability budgets and now governs your LangGraph DAG.

Applied to a 95% per-agent pipeline:

| Pipeline length | End-to-end success | Failure rate |
|---|---|---|
| 1 agent | 95.0% | 5.0% |
| 3 agents | 85.7% | 14.3% |
| 5 agents | 77.4% | 22.6% |
| 7 agents | 69.8% | 30.2% |
| 10 agents | 59.9% | 40.1% |
| 15 agents | 46.3% | 53.7% |

A 95% step accuracy feels great in isolation. Stack ten of them and four out of ten user requests fail. That is the gap between a working demo and a production-ready system.

[Chart: End-to-End Reliability vs. Pipeline Length at 95% per-agent reliability, falling from 95% at 1 agent to 46.3% at 15 agents. Source: Lusser's Law, R = 0.95^n]

Why do multi-agent systems become unreliable?

Multi-agent systems fail because errors compound silently and because most failures are structural, not a matter of model quality. The Berkeley Sky Computing Lab's MAST taxonomy (Cemri et al., 2025) studied 1,600+ traces across 7 frameworks (MetaGPT, ChatDev, HyperAgent, OpenManus, AppWorld, Magentic, AG2) and found failure rates of 41% to 86.7%.

MAST identifies 14 failure modes clustered into 3 categories:

  • System design failures (~37% of failures): bad role specifications, ambiguous task decomposition, missing termination conditions.
  • Inter-agent misalignment (~31%): format mismatches between agents (planner emits YAML, executor expects JSON), context collapse as windows fill, contradictory state.
  • Task verification failures (~31%): weak or missing checks on intermediate outputs, no detection of hallucinated facts before they propagate.

The insight from MAST and from Anthropic's Building Effective Agents guide: most multi-agent failures are not the LLM being wrong. They are the system letting a wrong intermediate output through to the next step, where it is treated as ground truth.

This is the silent error propagation pattern. A subtly wrong table extraction in step 2 looks fine to step 3, which builds a query on it, which step 4 executes, which step 5 reports. Every downstream agent treats the corrupted state as correct.

How many agents can you safely chain?

At 95% per-agent reliability you can safely chain only about four agents before end-to-end success drops below 80%; at five agents you are already down to 77.4%. At 99% reliability you can reach 10-15 agents while staying above 85%. Below 90% per step, every chained agent is a liability.

The table below shows end-to-end reliability across pipeline lengths and per-agent accuracy:

| Per-agent reliability | 3 agents | 5 agents | 10 agents | 20 agents |
|---|---|---|---|---|
| 99% | 97.0% | 95.1% | 90.4% | 81.8% |
| 98% | 94.1% | 90.4% | 81.7% | 66.8% |
| 95% | 85.7% | 77.4% | 59.9% | 35.8% |
| 90% | 72.9% | 59.0% | 34.9% | 12.2% |
| 85% | 61.4% | 44.4% | 19.7% | 3.9% |

Two practical rules fall out of this:

  1. Treat 99% per step as the production floor. Anything below that means you cannot chain more than 5 steps without aggressive guards (the helper sketched after this list makes the ceiling explicit).
  2. Prefer one agent with tools over five chained agents. Anthropic's guidance is consistent: simple, composable patterns beat orchestration complexity. A single ReAct agent with five tools has one decision point per loop. A linear chain of five agents has five compounding ones.
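
Rule 1 inverts cleanly: solve p^n >= R_target for n. A small helper makes the budgeting explicit (plain Python; the thresholds are illustrative):

```python
import math

def max_chain_length(p: float, target: float) -> int:
    """Largest n such that p**n >= target (Lusser's Law inverted)."""
    if not 0 < p < 1:
        raise ValueError("per-step reliability must be strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(p))

print(max_chain_length(0.95, 0.80))  # 4: a fifth agent drops you below 80%
print(max_chain_length(0.99, 0.85))  # 16: 99% per step buys a much longer chain
```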

[Chart: End-to-End Reliability at Different Per-Agent Accuracy Levels for a 10-agent pipeline, from 90.4% end-to-end at 99% per agent down to 19.7% at 85% per agent. Source: Lusser's Law, R = p^10]

Why don't better models solve this?

Frontier models shift the reliability curve but do not eliminate the compound effect. Between January 2025 and January 2026, single-step task completion rates on agentic benchmarks improved meaningfully, but per-step reliability of 99.9% across novel real-world tasks is still rare.

Research summarized in the Stevens Online analysis of LLM agent economics shows single-shot LLM accuracy plateauing at 60-70% on complex tasks. Hitting 95%+ requires multi-turn reasoning, tool use, and verification: the very things that introduce more steps.

This is the trap: improving accuracy means adding steps, and adding steps amplifies compounding. The only way out is structural: shorter chains, validation between steps, and budgets on retry depth.

A useful mental model: model improvement gives you another digit of reliability roughly every 18 months. Structural improvements (validators, judges, checkpoints) give you another digit this afternoon.

How can you increase the reliability of a multi-agent pipeline?

Five techniques break the compounding curve in production. Each has a different cost-reliability profile. Stack them in order of cost-effectiveness for your task.

| Technique | Per-step lift | Cost multiplier | Best for |
|---|---|---|---|
| Schema validator + repair loop | +8-15% | 1.2-1.6x | Structured outputs (JSON, SQL, code) |
| Scoped retry (transient errors only) | +1-3% | 1.05-1.3x | API failures, parse errors |
| Self-consistency (k=5 votes) | +5-12% | 5x | Verifiable reasoning steps |
| Evaluator-optimizer (judge agent) | +15-30% | 2-4x | Subjective quality |
| Human-in-the-loop checkpoint | +30-60% | High labor cost | High-stakes actions |

The biggest production wins come from combining cheap deterministic guards (validators, schemas) with one expensive non-deterministic guard (a judge agent or human checkpoint) at the highest-leverage step. PwC reported a 7x accuracy improvement (10% to 70%) after adding judge agents to their CrewAI code generation pipeline.

Schema validators and repair loops

The cheapest reliability win: validate every structured output (JSON, function call args, SQL) against a schema. On failure, feed the validation error back to the model with a 'fix this' prompt. This catches a large share of malformed-output failures at <2x cost. Tools like Pydantic, Zod, and Instructor implement this pattern. The analysis of common LLM pipeline errors notes that a single malformed JSON can cascade into 5-10 retry attempts without scoped repair, so cap repair budgets at 2-3 attempts.
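
A minimal sketch of the repair loop with Pydantic, assuming a hypothetical `call_llm(prompt) -> str` wrapper around your model API and an illustrative schema:

```python
from pydantic import BaseModel, ValidationError

class QueryPlan(BaseModel):
    """Illustrative schema for one structured step's output."""
    table: str
    columns: list[str]
    filter_sql: str

def validated_step(prompt: str, max_repairs: int = 2) -> QueryPlan:
    """Validate model output against the schema; on failure, feed the
    exact validation error back to the model. The capped budget keeps
    one malformed response from cascading into open-ended retries."""
    attempt_prompt = prompt
    for _ in range(max_repairs + 1):
        raw = call_llm(attempt_prompt)  # hypothetical model wrapper
        try:
            return QueryPlan.model_validate_json(raw)
        except ValidationError as err:
            attempt_prompt = (
                f"{prompt}\n\nYour previous output failed validation:\n"
                f"{err}\n\nReturn corrected JSON only."
            )
    raise RuntimeError(f"schema repair budget ({max_repairs}) exhausted")
```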

Scoped retries (not blanket retries)

Retry helps only for transient or parseable failures: rate limits, network timeouts, JSON parse errors, schema violations. Retrying a reasoning failure usually returns the same wrong answer at 2x cost. Anthropic's guidance is explicit: retry on signals, not on vibes. Set per-step retry budgets (3 max), exponential backoff, and a kill switch on the parent task if total retries exceed a threshold.
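
A sketch of signal-scoped retrying, assuming your step wrapper raises typed exceptions; the three classes mirror the split used in the guardrails section below, and the backoff constants are illustrative:

```python
import random
import time

class TransientError(Exception): ...   # rate limits, network timeouts
class SchemaError(Exception): ...      # parse and validation failures
class ReasoningError(Exception): ...   # well-formed but wrong output

def run_with_scoped_retry(step, max_attempts: int = 3):
    """Retry only failures where a fresh attempt has different odds.
    Reasoning failures fail fast: retrying buys the same wrong answer
    at double the cost."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except (TransientError, SchemaError):
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt + random.random())  # backoff with jitter
        except ReasoningError:
            raise  # systematic failure: escalate instead of retrying
```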

Self-consistency sampling

Sample the same step k times and majority-vote the answer. Effective for reasoning steps with verifiable outputs (math, classification, structured extraction). Cost scales linearly with k. Recent research on adaptive consistency reports that dynamically adjusting k based on interim agreement reduces samples 7.9x while losing <0.1% accuracy. Use this when correctness is checkable, not for open-ended generation.
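
A sketch of k-sample majority voting, again assuming a hypothetical `call_llm` wrapper sampling at a temperature above zero and answers short enough to compare for exact equality:

```python
from collections import Counter

def self_consistent_answer(prompt: str, k: int = 5) -> str:
    """Sample the same step k times and return the modal answer.
    Only worthwhile when outputs are checkable for equality
    (math results, class labels, extracted fields)."""
    samples = [call_llm(prompt).strip() for _ in range(k)]  # hypothetical wrapper
    answer, votes = Counter(samples).most_common(1)[0]
    if votes <= k // 2:
        # No strict majority: treat as low confidence and escalate.
        raise ValueError(f"no majority among {k} samples: {samples}")
    return answer
```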

Evaluator-optimizer (judge agents)

Documented in Anthropic's Building Effective Agents and the Anthropic cookbook: one model generates, a second model evaluates against criteria and returns structured feedback, the generator iterates. Adds 2-4x token cost. Best for tasks with clear evaluation criteria but subjective quality (writing, code review, plan critique). Avoid for real-time UX or when the evaluation criteria themselves are ambiguous.
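
A sketch of the loop, with hypothetical `generate(task) -> str` and `judge(task, draft) -> (float, str)` wrappers; the score threshold and round cap are illustrative:

```python
def evaluator_optimizer(task: str, max_rounds: int = 2, threshold: float = 0.8) -> str:
    """One model drafts, a second scores the draft against explicit
    criteria and returns feedback, and the feedback is folded into the
    next draft. The round cap keeps judge disagreement from looping."""
    draft = generate(task)                    # hypothetical generator call
    for _ in range(max_rounds):
        score, feedback = judge(task, draft)  # hypothetical judge call
        if score >= threshold:
            return draft
        draft = generate(f"{task}\n\nRevise the draft to address:\n{feedback}")
    return draft  # best effort: escalate to a human or fail the run
```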

Human-in-the-loop checkpoints

Place humans at chain boundaries before irreversible actions: external writes, payments, customer-facing emails, code merges. The LangChain runtime documentation describes durable checkpointing where pipelines pause indefinitely and resume from the exact interruption point. Note that EU AI Act Article 14 requires effective human oversight for high-risk AI systems, so checkpoint placement is a compliance lever, not just a reliability one.
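
Checkpoint mechanics vary by framework. A framework-free sketch of the pattern, using a local file as a stand-in for a durable store:

```python
import json
from pathlib import Path

PENDING = Path("pending_actions")  # stand-in for a durable store (use a DB in production)

def request_approval(run_id: str, action: dict) -> None:
    """Persist the pending action and pause the run. A reviewer approves
    by setting 'approved' to true in the record; nothing irreversible
    happens until then."""
    PENDING.mkdir(exist_ok=True)
    record = {"action": action, "approved": None}
    (PENDING / f"{run_id}.json").write_text(json.dumps(record))

def resume_if_approved(run_id: str) -> dict:
    """Resume from the exact interruption point instead of re-running
    upstream steps."""
    record = json.loads((PENDING / f"{run_id}.json").read_text())
    if record["approved"] is not True:
        raise RuntimeError(f"run {run_id} is still awaiting human approval")
    return record["action"]  # now safe to execute the irreversible step
```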

Does retrying actually help or just inflate cost?

Retrying helps when failures are transient. It inflates cost without helping when failures are systematic. This is the most common reliability-spending mistake in production agent stacks.

A failure is transient if a fresh attempt has materially different odds of success: rate-limited API calls, network timeouts, JSON-parse failures, schema violations the model can self-correct given the error. Retry these. Cap at 3 attempts with exponential backoff.

A failure is systematic if the model produced a confidently wrong answer based on its reasoning. Retrying produces the same wrong answer at 2x cost. The fix is not a retry; it is a different prompt, a different model, a validator that detects the error, or a judge that scores the output.

Production guardrails that work:

  • Per-step retry budget (max 3, hard cap)
  • Per-task total retry budget (kill the run at 10 cumulative retries)
  • Distinguish error types: structured exception classes (TransientError, SchemaError, ReasoningError) trigger different handling paths
  • Track retry rate as an SLO: if any step's retry rate exceeds 15%, that step needs a structural fix, not more retries (a budget sketch follows this list)
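
A sketch of the budget layer; the caps mirror the numbers above, and the step accounting is illustrative:

```python
class RetryBudget:
    """Run-level kill switch: steps may retry individually, but the
    task dies once cumulative retries cross a hard cap."""

    def __init__(self, per_step: int = 3, per_task: int = 10):
        self.per_step = per_step
        self.per_task = per_task
        self.total = 0
        self.by_step: dict[str, int] = {}

    def charge(self, step_name: str) -> None:
        """Call once per retry; raises when a budget is exhausted."""
        self.total += 1
        self.by_step[step_name] = self.by_step.get(step_name, 0) + 1
        if self.by_step[step_name] > self.per_step:
            raise RuntimeError(f"step {step_name!r} exhausted its retry budget")
        if self.total > self.per_task:
            raise RuntimeError("task-level retry budget exhausted; killing run")

    def retry_rate(self, step_name: str, calls: int) -> float:
        """Feed this into the SLO check: above 15% means the step needs
        a structural fix, not more retries."""
        return self.by_step.get(step_name, 0) / max(calls, 1)
```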

The hidden economics of AI agents analysis notes that without these caps, a single malformed response can cascade into 5-10 retry attempts per step, multiplying costs by an order of magnitude with no reliability gain.

What does the simulation show?

We ran a Monte Carlo simulation of 10,000 runs across pipeline configurations and posted the code on GitHub. The simulator models per-step reliability p, retry budget r, validator coverage v, and judge-agent gating, then reports end-to-end success rate and cost.
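
The repo has the full simulator; below is a stripped-down sketch of the core loop (parameter names and the repair-success rate are illustrative, not the repo's values):

```python
import random

def simulate(p: float, n_steps: int, validator_coverage: float = 0.0,
             repair_success: float = 0.8, runs: int = 10_000) -> float:
    """Monte Carlo estimate of end-to-end success for an n-step chain
    where validator-covered steps get one repair attempt on failure."""
    successes = 0
    for _ in range(runs):
        ok = True
        for _ in range(n_steps):
            if random.random() < p:
                continue  # step succeeded on the first try
            covered = random.random() < validator_coverage
            if covered and random.random() < repair_success:
                continue  # validator caught the failure and the repair landed
            ok = False
            break  # silent failure propagates; the run is lost
        successes += ok
    return successes / runs

print(simulate(0.95, 10))                          # ~0.60 baseline
print(simulate(0.95, 10, validator_coverage=0.5))  # ~0.74 with guards on half the steps
```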

Key findings from the simulation (95% baseline per-agent reliability, 10-step pipeline):

  • Baseline (no guards): 59.9% success, 1.0x cost
  • + Schema validator on 50% of steps: 74.2% success, 1.18x cost
  • + Scoped retries (max 3, transient only): 79.8% success, 1.31x cost
  • + Judge agent on the highest-risk step: 88.4% success, 1.62x cost
  • + Human checkpoint before irreversible action: 96.1% success, 1.65x token cost (plus labor)

The stacked configuration goes from 60% to 96% reliability at 1.65x token cost. That is the curve-breaking move: cheap deterministic guards on most steps, one expensive non-deterministic guard on the riskiest step, and a checkpoint at the boundary.

Fork the reliability-compounding-sim repo and plug in your own per-step accuracy numbers. The simulation takes <30 seconds to run on your pipeline configuration.

What should you do tomorrow?

Three actions, in order of cost-effectiveness, that any team running multi-agent pipelines can ship this week.

  1. Measure per-step reliability. Pick your top 3 production pipelines. For each step, log the rate of (a) parseable failures and (b) downstream-detected logic errors. Most teams discover at least one step running below 90% that they assumed was at 99%. A minimal measurement sketch follows this list.

  2. Add typed schema validation with a 2-attempt repair loop on every structured output. This is the highest reliability-per-dollar investment. Pydantic, Zod, or Instructor get you 90% of the way in an afternoon.

  3. Insert one judge agent at the highest-leverage step. Pick the step where a wrong output is most expensive downstream. Score outputs against 3-5 explicit criteria. Iterate up to 2 times before escalating to a human or failing the run.
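
For step 1, a minimal measurement sketch; the counter layout and exception choice are illustrative, so wire in your own error types:

```python
import functools
from collections import defaultdict

step_stats = defaultdict(lambda: {"calls": 0, "parse_failures": 0, "logic_errors": 0})

def measured(step_name: str):
    """Wrap a pipeline step and count outcomes. Logic errors are
    recorded later by downstream validators via record_logic_error,
    because they are invisible at the step that produced them."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            step_stats[step_name]["calls"] += 1
            try:
                return fn(*args, **kwargs)
            except ValueError:  # stand-in for your parse/validation exception
                step_stats[step_name]["parse_failures"] += 1
                raise
        return inner
    return wrap

def record_logic_error(step_name: str) -> None:
    step_stats[step_name]["logic_errors"] += 1

def reliability(step_name: str) -> float:
    stats = step_stats[step_name]
    failures = stats["parse_failures"] + stats["logic_errors"]
    return 1 - failures / max(stats["calls"], 1)
```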

If you do nothing else, do step 1. You cannot break a curve you have not measured.

| Mitigation | Reliability lift (per step) | Cost multiplier | Latency cost | Best fit |
|---|---|---|---|---|
| Bare retry on error | +1-3% | 1.05-1.3x | Low | Transient API failures, JSON parse errors |
| Self-consistency (k=5 votes) | +5-12% | 5x | High (parallel) | Reasoning steps with verifiable outputs |
| Schema validator + repair loop | +8-15% | 1.2-1.6x | Low-Medium | Structured outputs (JSON, SQL, code) |
| Evaluator-optimizer (judge agent) | +15-30% | 2-4x | Medium-High | Subjective quality (writing, code review) |
| Human-in-the-loop checkpoint | +30-60% | 1.1x token, high $/labor | Hours-days | High-stakes irreversible actions |