The nine canonical AI agent design patterns are ReAct, Reflection, Plan-and-Execute, Tool Use, Routing, Orchestrator-Worker, Evaluator-Optimizer, Parallelization, and Hierarchical Multi-Agent. Two come from research papers (ReAct, Yao et al. 2022; Reflexion, Shinn et al. 2023). Five were canonized by Anthropic's 'Building Effective Agents' (2024). The last two -- Plan-and-Execute and Hierarchical Multi-Agent -- were popularized by LangChain and by LangGraph's supervisor library, respectively. This guide gives you a two-line definition, code, when to reach for it, and when not to, for each.
What are AI agent design patterns?
AI agent design patterns are reusable architectural templates for building LLM systems that reason, use tools, and act autonomously. Each pattern solves a specific failure mode -- bad planning, bad tool use, bad self-correction -- and most production agents layer 3-5 of them.
The nine patterns split into three categories:
- Reasoning loops: ReAct, Reflection, Plan-and-Execute. How a single agent thinks.
- Composable workflows: Tool Use, Routing, Orchestrator-Worker, Evaluator-Optimizer, Parallelization. How an LLM call connects to other calls and systems. All five are documented in Anthropic's 2024 essay.
- Multi-agent topology: Hierarchical Multi-Agent. How a supervisor coordinates specialists.
Anthropic's blunt guidance: "start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short." In other words, the right pattern is the smallest one that passes your evals.
Pattern 1: ReAct -- how does it interleave reasoning and acting?
ReAct interleaves a Thought, an Action, and an Observation in a loop until the agent reaches a final answer. It came from Yao et al. (2022) at Princeton and Google, and is the default reasoning loop for tool-using agents in LangGraph, OpenAI Assistants, and most modern frameworks.
while True:
    thought = llm("Reason about state and goal")    # Thought
    action = llm("Pick a tool and args", thought)   # Action
    obs = execute(action)                           # Observation
    if done(obs):
        return final_answer
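The loop above hides the key mechanic: every thought, action, and observation is appended to a running history, so each step is conditioned on everything that came before. A slightly fuller sketch, assuming a generic llm() chat helper and a dispatch() tool runner (both hypothetical names):
def react(goal, tools, max_steps=10):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        thought = llm("Reason about the state and the goal", history)
        history.append(f"Thought: {thought}")
        action = llm("Pick one tool and its arguments", history, tools=tools)
        if action.name == "finish":                       # model signals it has enough
            return action.args["answer"]
        observation = dispatch(action.name, action.args)  # execute the chosen tool
        history.append(f"Action: {action.name}({action.args})")
        history.append(f"Observation: {observation}")
    return llm("Give the best final answer from the trajectory so far", history)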
When to use: open-ended tasks where the path is unknown -- debugging, research, exploratory tool use, customer triage with branching logic.
When NOT to use: tasks with predictable, repeatable steps. ReAct burns tokens on per-step reasoning that adds nothing if you already know the recipe. According to a 2026 LangChain benchmark, ReAct uses ~10 LLM calls on a 10-step task vs 1-2 for Plan-and-Execute.
Production example: our internal SEO research agent uses ReAct to crawl a competitor sitemap, decide which URLs to fetch, summarize each, and stop when it has enough citations. The path is impossible to predefine because each crawled page changes what to fetch next.
Pattern 2: Reflection -- how do agents critique their own output?
Reflection adds a self-critique step where the agent evaluates its own output, generates feedback, and revises before returning. The canonical paper is Reflexion (Shinn et al., 2023), which lifted GPT-4's HumanEval pass@1 from 80% to 91% by adding a verbal reinforcement loop.
draft = llm("Generate answer")
for _ in range(max_iters):
    critique = llm("Find flaws in draft", draft)
    if critique == "OK":
        break
    draft = llm("Revise using critique", draft, critique)
return draft
When to use: code generation, long-form writing, fact-checking, anything where a second pass meaningfully improves quality. Reflexion-style loops are how Cursor, Devin, and most coding agents close the gap to passing tests.
When NOT to use: latency-sensitive surfaces (chat autocomplete, voice). Each reflection iteration adds an LLM round-trip. Two iterations triple your latency budget.
Production example: our blog draft agent runs one reflection pass scoped to a tight rubric (citation density, banned phrases, TL;DR completeness). One pass, not three -- past that, returns drop and the agent starts second-guessing correct outputs.
Pattern 3: Plan-and-Execute -- when should you plan before acting?
Plan-and-Execute generates a full plan upfront, then executes each step with cheaper, faster models. A planner LLM produces a directed acyclic graph of subtasks; executors run the leaves. The pattern was popularized by LangChain's plan_and_execute agent in 2023 and refined by ReWOO and LLMCompiler.
plan = planner_llm(goal)                  # one expensive call
results = []
for step in plan.steps:
    results.append(executor_llm(step))    # cheap, parallelizable
return synthesize(results)
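The description above mentions a DAG of subtasks; the sequential loop is just the simplest case. If the planner also emits per-step dependencies, independent steps can run concurrently -- a rough sketch, assuming each step carries hypothetical id, prompt, and depends_on fields:
from concurrent.futures import ThreadPoolExecutor
def execute_plan(plan, executor_llm):
    results = {}                                  # step id -> output
    pending = list(plan.steps)
    with ThreadPoolExecutor() as pool:
        while pending:
            # any step whose dependencies are all done can run in this wave
            ready = [s for s in pending if all(d in results for d in s.depends_on)]
            if not ready:
                raise ValueError("plan has a cycle or a missing dependency")
            outputs = list(pool.map(
                lambda s: executor_llm(s.prompt, context=[results[d] for d in s.depends_on]),
                ready,
            ))
            for step, out in zip(ready, outputs):
                results[step.id] = out
            pending = [s for s in pending if s.id not in results]
    return results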
When to use: multi-step workflows with clear dependencies (data pipelines, ETL, report generation). Plan-and-Execute on a 10-step task uses 1-2 LLM calls vs ~10 for ReAct, per the 2026 LangChain benchmark, and hit 92% completion accuracy vs 85% for ReAct on the same suite.
When NOT to use: highly dynamic environments where the plan goes stale mid-execution. If observation N invalidates step N+5, you'll re-plan constantly and lose the cost advantage.
Production example: our weekly competitor monitoring agent plans the full crawl-and-diff workflow on Sunday night, then executes 40 cheap parallel scrapes. One Claude Sonnet 4.5 plan call; 40 Haiku scrape calls. Cost is ~85% lower than the equivalent ReAct loop.
Pattern 4: Tool Use -- how do agents call external functions?
Tool Use is the pattern where an LLM emits a structured function call, an external system executes it, and the result is fed back into the context. It's the foundational primitive every other agent pattern composes around. OpenAI shipped function calling in June 2023; Anthropic's tool_use schema followed; MCP standardized it in late 2024.
tools = [search_web, query_db, send_email]
response = llm(messages, tools=tools)
if response.tool_calls:
    for call in response.tool_calls:
        result = dispatch(call.name, call.args)
        messages.append(tool_result(call.id, result))
    response = llm(messages, tools=tools)
return response.content
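What the model actually sees is a schema per tool, not the Python function. A representative definition in Anthropic's tool_use format (the tool name, description, and fields here are illustrative):
search_web_tool = {
    "name": "search_web",
    "description": "Search the web and return the top results as JSON.",
    "input_schema": {                      # JSON Schema describing the arguments
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"},
            "max_results": {"type": "integer", "description": "How many results to return"},
        },
        "required": ["query"],
    },
}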
When to use: anything that needs fresh data, side effects, or computation outside the LLM. Per BenchLM's 2026 leaderboards, frontier models pick the right tool reliably -- the failure mode has shifted to using the result correctly.
When NOT to use: pure reasoning tasks (summarization, classification on text already in-context). Tools add latency and a failure surface. The Berkeley Function Calling Leaderboard shows non-trivial degradation when irrelevant tools are present in the prompt.
Production example: our pricing-research agent has exactly four tools (search, fetch_url, extract_table, compare_prices). Adding a fifth dropped eval accuracy by 8 points. We deleted it.
Pattern 5: Routing -- how do you direct queries to the right agent?
Routing classifies an input once and dispatches it to a specialized handler. It's the one-to-one dispatcher pattern. Anthropic recommends it whenever you have distinct verticals where conflating them in one prompt would degrade performance.
router = llm("Classify into: refund | technical | sales", query)
if router == "refund": return refund_agent(query)
if router == "technical": return tech_agent(query)
if router == "sales": return sales_agent(query)
When to use: mixed-intent inboxes (support, sales, billing in one channel), multi-domain assistants, model routing (cheap model for FAQs, expensive for code). LangChain's router docs note four flavors: rule-based, intent, semantic-embedding, and LLM-based.
When NOT to use: single-domain agents. A router on a one-handler agent is a useless extra LLM call. Also avoid when categories overlap heavily -- ambiguous routing is worse than a single broad-skilled agent.
Production example: our inbound DM agent routes between book_call, answer_question, and escalate_to_human. A 4B-parameter classifier handles the route in ~80ms; the downstream Claude Sonnet handles the substance. Replacing the router with a single Sonnet call increased cost 6x with no quality lift.
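Of the four flavors above, semantic-embedding routing is often the cheapest at scale: embed a handful of example queries per route once, then send each incoming query to the nearest centroid. A minimal sketch, assuming a hypothetical embed() function that returns a vector:
import numpy as np
ROUTE_EXAMPLES = {
    "book_call": ["can we get a demo", "schedule a call with your team"],
    "answer_question": ["how does pricing work", "do you support SSO"],
    "escalate_to_human": ["I need to talk to a person", "this is urgent"],
}
# Precompute one centroid per route from labeled examples (offline, once).
centroids = {
    route: np.mean([embed(q) for q in examples], axis=0)
    for route, examples in ROUTE_EXAMPLES.items()
}
def route(query):
    v = embed(query)
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda r: cosine(v, centroids[r]))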
Pattern 6: Orchestrator-Worker -- how does dynamic delegation work?
Orchestrator-Worker uses a central LLM to dynamically decompose a task into subtasks at runtime, delegate each to a worker LLM, and synthesize the results. Unlike Parallelization, subtasks are not pre-defined -- the orchestrator decides them per input. The pattern is documented in Anthropic's 'Building Effective Agents'.
subtasks = orchestrator_llm("Decompose this task", input)
results = [worker_llm(s) for s in subtasks] # often parallel
final = orchestrator_llm("Synthesize", results)
return final
When to use: compound tasks where you can't predict the subtasks in advance. Anthropic's example: a coding task where the number of files to change depends on the request. Other fits: deep-research agents, refactoring across unknown file counts, multi-source synthesis.
When NOT to use: tasks with fixed, known steps -- use Plan-and-Execute or a hard-coded workflow. The orchestrator's flexibility is also its cost; you pay for runtime decomposition.
Production example: our content-audit agent gets a single URL and an orchestrator decides whether to fetch the page, the sitemap, the schema markup, the backlink profile, or all four. Worker Haiku calls run in parallel; the orchestrator Sonnet synthesizes. We can't hardcode the decomposition because it depends on what the page actually is.
Pattern 7: Evaluator-Optimizer -- how do agents iterate to a higher bar?
Evaluator-Optimizer pairs a generator LLM with an evaluator LLM in a loop until the output passes a quality threshold. Where Reflection has the same model self-critique, Evaluator-Optimizer typically uses a separate, harder-to-please evaluator. Anthropic cites complex search and translation as canonical use cases.
output = generator_llm(input)
for _ in range(max_iters):
    verdict, feedback = evaluator_llm(input, output)
    if verdict == "PASS":
        return output
    output = generator_llm(input, prior=output, feedback=feedback)
return output    # best effort
When to use: when you have clear, evaluable criteria and the first draft routinely misses them -- literary translation, complex search refinement, code that must pass tests, copy that must meet a brand voice rubric.
When NOT to use: subjective tasks without crisp evaluation criteria. If the evaluator can't reliably distinguish good from bad, the loop oscillates or rubber-stamps. Also avoid when one good attempt is sufficient -- the second LLM is pure overhead.
Production example: our outbound email agent generates a draft, the evaluator scores it on (a) personalization, (b) CTA clarity, (c) banned-phrase count. If any score is below threshold, the generator revises with that specific feedback. Capped at 2 retries -- past that, returns flatline.
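The loop only works if the evaluator's verdict is machine-readable. In practice that means asking the evaluator for structured output and gating on explicit thresholds -- a sketch along the lines of the email example above (the rubric keys and threshold values are illustrative):
import json
THRESHOLDS = {"personalization": 7, "cta_clarity": 8, "banned_phrases": 0}
def evaluate(draft):
    raw = evaluator_llm(
        "Score this draft. Reply with JSON keys: personalization (0-10), "
        "cta_clarity (0-10), banned_phrases (count), feedback (one paragraph).",
        draft,
    )
    scores = json.loads(raw)
    passed = (
        scores["personalization"] >= THRESHOLDS["personalization"]
        and scores["cta_clarity"] >= THRESHOLDS["cta_clarity"]
        and scores["banned_phrases"] <= THRESHOLDS["banned_phrases"]
    )
    return ("PASS" if passed else "FAIL"), scores["feedback"]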
Pattern 8: Parallelization -- when should agents work in parallel?
Parallelization fans the same input out to multiple LLM calls running concurrently, then aggregates. Anthropic splits it into two flavors: Sectioning (different subtasks in parallel) and Voting (same task multiple times for higher confidence).
# Sectioning
results = parallel([
    llm("Check for PII", input),
    llm("Check for toxicity", input),
    llm("Check for off-topic", input),
])
return aggregate(results)
# Voting
votes = parallel([llm("Is this fraud?", input) for _ in range(5)])
return majority(votes)
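The parallel() helper is doing the real work here. With blocking LLM calls, a thread pool is the simplest way to get concurrent fan-out -- a minimal sketch, assuming each call is wrapped as a zero-argument callable so it runs inside the pool rather than at list-build time:
from concurrent.futures import ThreadPoolExecutor
def parallel(calls):
    # `calls` is a list of zero-argument callables; results come back in submission order
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(c) for c in calls]
        return [f.result() for f in futures]
# Usage for the voting flavor:
votes = parallel([lambda: llm("Is this fraud?", input) for _ in range(5)])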
When to use: independent subtasks (guardrails -- safety check + main response in parallel), high-stakes classifications where ensembling lifts accuracy, and any time wall-clock latency matters more than total tokens.
When NOT to use: sequentially dependent steps. If step B needs step A's output, parallelization is impossible. Voting is also wasteful for low-stakes calls -- 5x cost for a trivial accuracy lift is bad math.
Production example: our content publish flow runs three guardrails in parallel before posting (em-dash detector, hallucination cross-check, banned-phrase scanner). Total wall-clock latency = max of the three, not sum.
Pattern 9: Hierarchical Multi-Agent -- how do supervisors coordinate teams?
Hierarchical Multi-Agent organizes specialist agents under one or more supervisor agents that route work and aggregate results. LangGraph's Supervisor library is the reference implementation; you can stack supervisors-of-supervisors for large enterprise topologies.
supervisor = LLM("Decide which team handles this turn")
teams = {
    "research": research_supervisor,    # has 3 worker agents
    "writing": writing_supervisor,      # has 2 worker agents
    "qa": qa_supervisor,                # has 2 worker agents
}
while not done:
    next_team = supervisor(state)
    state = teams[next_team].run(state)
When to use: large workflows with genuinely distinct specialist domains where each domain has its own tools, prompts, and evals -- e.g., research + writing + legal review.
When NOT to use: anything a single ReAct agent can do. Hierarchical systems multiply context-passing failures, debugging surface area, and cost. Anthropic's published guidance is explicit: don't reach for multi-agent until single-agent evals fail.
Production example: our SEO content factory runs a Research supervisor (with 3 search/scrape workers) handing off to a Writing supervisor (drafter + editor) handing off to a QA supervisor (schema validator + brand checker). Total: 7 worker agents, 3 supervisors, one orchestrator on top. We started with one ReAct agent. Evals forced this topology over 6 months -- not architecture astronaut-ing.
How do you combine multiple patterns in a real agent?
Production agents typically layer 3-5 patterns. The default stack: Routing on the front, Tool Use as the primitive, ReAct as the loop, Reflection on the output, and Parallelization for guardrails. Per the Anthropic essay, "these building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases."
A realistic combined flow (a code skeleton follows the list):
- Router classifies the request into a domain.
- Orchestrator decomposes the task into subtasks if needed.
- Workers execute each subtask with Tool Use inside a ReAct loop.
- Parallelization runs guardrails alongside the main response.
- Evaluator-Optimizer retries on quality miss.
- Reflection does one final self-critique before returning.
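A skeleton of that stack, composing the helpers sketched in earlier sections (every name here is hypothetical):
def handle(request):
    domain = route(request)                                          # 1. Routing
    subtasks = orchestrator_llm("Decompose into subtasks", request)  # 2. Orchestrator
    results = [react(t, tools=TOOLS[domain]) for t in subtasks]      # 3. ReAct loop + Tool Use per subtask
    draft = synthesize(results)
    guards = parallel([lambda: check_pii(draft),                     # 4. Guardrails in parallel
                       lambda: check_banned_phrases(draft)])
    if not all(guards):
        return escalate_to_human(request)
    verdict, feedback = evaluate(draft)                              # 5. Evaluator-Optimizer, one retry
    if verdict == "FAIL":
        draft = generator_llm(request, prior=draft, feedback=feedback)
    critique = llm("Find flaws before returning", draft)             # 6. One Reflection pass
    return draft if critique == "OK" else llm("Revise using critique", draft, critique)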
The rule: every pattern in the stack must be justified by an eval that fails without it. Most teams over-engineer. Start with Tool Use + ReAct, run real evals, and add the next pattern only when evals point at a specific failure mode.
Which patterns are overkill for a single-task agent?
Hierarchical Multi-Agent, Orchestrator-Worker, and Evaluator-Optimizer are overkill for single-task agents. If your agent does one thing -- summarize a doc, classify a ticket, draft an email -- a basic LLM call or a tight ReAct loop wins on latency, cost, and debuggability.
Quick decision rules:
- One prompt, no tools = no pattern. Just call the LLM.
- One or two tool calls, predictable = Tool Use only.
- Variable steps, exploratory = ReAct.
- Quality gates important, latency tolerant = ReAct + Reflection.
- Multi-domain inbox = Routing + Tool Use.
- Anything multi-agent = only after single-agent evals plateau.
The 2024 Anthropic essay hammers this: agentic complexity should be earned, not assumed. Most production failures between 2024 and 2026 were architectural over-engineering, not model quality. Start small. Add patterns only when evals prove you must.
| Pattern | Best for | Avoid when | Canonical source |
|---|---|---|---|
| ReAct | Open-ended research, debugging, exploratory tool use | Tasks with predictable, repeatable steps | Yao et al., 2022 |
| Reflection | Code generation, long-form writing, fact-checking | Latency-sensitive UX (chat, autocomplete) | Shinn et al., 2023 |
| Plan-and-Execute | Multi-step workflows with clear dependencies | Highly dynamic environments where plans break | LangChain, 2023 |
| Tool Use | Anything needing real-world data or actions | Pure reasoning or summarization tasks | OpenAI function calling, 2023 |
| Routing | Mixed-intent inboxes, multi-domain support | Single-domain agents | Anthropic, 2024 |
| Orchestrator-Worker | Tasks where subtasks can't be predicted upfront | Tasks with fixed, known steps | Anthropic, 2024 |
| Evaluator-Optimizer | Translation, copywriting, search refinement | Tasks without clear evaluation criteria | Anthropic, 2024 |
| Parallelization | Guardrails, voting, independent subtasks | Sequentially dependent steps | Anthropic, 2024 |
| Hierarchical Multi-Agent | Large enterprise workflows with specialist teams | Anything a single agent can do | LangGraph Supervisor, 2025 |