The nine canonical AI agent design patterns are ReAct, Reflection, Plan-and-Execute, Tool Use, Routing, Orchestrator-Worker, Evaluator-Optimizer, Parallelization, and Hierarchical Multi-Agent. Two come from research papers (ReAct, Yao et al. 2022; Reflexion, Shinn et al. 2023). Five were canonized by Anthropic's 'Building Effective Agents' (2024). The last two -- Plan-and-Execute and Hierarchical Multi-Agent -- were popularized by LangChain and by LangGraph's supervisor library, respectively. This guide gives you a two-line definition, code, when to reach for it, and when not to, for each.
What are AI agent design patterns?
AI agent design patterns are reusable architectural templates for building LLM systems that reason, use tools, and act autonomously. Each pattern solves a specific failure mode -- bad planning, bad tool use, bad self-correction -- and most production agents layer 3-5 of them.
The nine patterns split into three categories:
- Reasoning loops: ReAct, Reflection, Plan-and-Execute. How a single agent thinks.
- Composable workflows: Tool Use, Routing, Orchestrator-Worker, Evaluator-Optimizer, Parallelization. How an LLM call connects to other calls and systems. All five are documented in Anthropic's 2024 essay.
- Multi-agent topology: Hierarchical Multi-Agent. How a supervisor coordinates specialists.
Anthropic's blunt guidance: "start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short." In other words, the right pattern is the smallest one that passes your evals.
Pattern 1: ReAct -- how does it interleave reasoning and acting?
ReAct interleaves a Thought, an Action, and an Observation in a loop until the agent reaches a final answer. It came from Yao et al. (2022) at Princeton and Google, and is the default reasoning loop for tool-using agents in LangGraph, OpenAI Assistants, and most modern frameworks.
while True:
    thought = llm("Reason about state and goal")    # Thought
    action = llm("Pick a tool and args", thought)   # Action
    obs = execute(action)                           # Observation
    if done(obs):
        return final_answer
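The loop above hides the key mechanic: every thought, action, and observation is appended to a running history, so each step is conditioned on everything that came before. A slightly fuller sketch, assuming a generic llm() chat helper and a dispatch() tool runner (both hypothetical names):
def react(goal, tools, max_steps=10):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        thought = llm("Reason about the state and the goal", history)
        history.append(f"Thought: {thought}")
        action = llm("Pick one tool and its arguments", history, tools=tools)
        if action.name == "finish":                       # model signals it has enough
            return action.args["answer"]
        observation = dispatch(action.name, action.args)  # execute the chosen tool
        history.append(f"Action: {action.name}({action.args})")
        history.append(f"Observation: {observation}")
    return llm("Give the best final answer from the trajectory so far", history)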
When to use: open-ended tasks where the path is unknown -- debugging, research, exploratory tool use, customer triage with branching logic.
When NOT to use: tasks with predictable, repeatable steps. ReAct burns tokens on per-step reasoning that adds nothing if you already know the recipe. According to a 2026 LangChain benchmark, ReAct uses ~10 LLM calls on a 10-step task vs 1-2 for Plan-and-Execute.
Production example: our internal SEO research agent uses ReAct to crawl a competitor sitemap, decide which URLs to fetch, summarize each, and stop when it has enough citations. The path is impossible to predefine because each crawled page changes what to fetch next.
Pattern 2: Reflection -- how do agents critique their own output?
Reflection adds a self-critique step where the agent evaluates its own output, generates feedback, and revises before returning. The canonical paper is Reflexion (Shinn et al., 2023), which lifted GPT-4's HumanEval pass@1 from 80% to 91% by adding a verbal reinforcement loop.
draft = llm("Generate answer")
for _ in range(max_iters):
    critique = llm("Find flaws in draft", draft)
    if critique == "OK":
        break
    draft = llm("Revise using critique", draft, critique)
return draft
When to use: code generation, long-form writing, fact-checking, anything where a second pass meaningfully improves quality. Reflexion-style loops are how Cursor, Devin, and most coding agents close the gap to passing tests.
When NOT to use: latency-sensitive surfaces (chat autocomplete, voice). Each reflection iteration adds an LLM round-trip. Two iterations triple your latency budget.
Production example: our blog draft agent runs one reflection pass scoped to a tight rubric (citation density, banned phrases, TL;DR completeness). One pass, not three -- past that, returns drop and the agent starts second-guessing correct outputs.
Pattern 3: Plan-and-Execute -- when should you plan before acting?
Plan-and-Execute generates a full plan upfront, then executes each step with cheaper, faster models. A planner LLM produces a directed acyclic graph of subtasks; executors run the leaves. The pattern was popularized by LangChain's plan_and_execute agent in 2023 and refined by ReWOO and LLMCompiler.
plan = planner_llm(goal)                  # one expensive call
results = []
for step in plan.steps:
    results.append(executor_llm(step))    # cheap, parallelizable
return synthesize(results)
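The description above mentions a DAG of subtasks; the sequential loop is just the simplest case. If the planner also emits per-step dependencies, independent steps can run concurrently -- a rough sketch, assuming each step carries hypothetical id, prompt, and depends_on fields:
from concurrent.futures import ThreadPoolExecutor
def execute_plan(plan, executor_llm):
    results = {}                                  # step id -> output
    pending = list(plan.steps)
    with ThreadPoolExecutor() as pool:
        while pending:
            # any step whose dependencies are all done can run in this wave
            ready = [s for s in pending if all(d in results for d in s.depends_on)]
            if not ready:
                raise ValueError("plan has a cycle or a missing dependency")
            outputs = list(pool.map(
                lambda s: executor_llm(s.prompt, context=[results[d] for d in s.depends_on]),
                ready,
            ))
            for step, out in zip(ready, outputs):
                results[step.id] = out
            pending = [s for s in pending if s.id not in results]
    return results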
When to use: multi-step workflows with clear dependencies (data pipelines, ETL, report generation). Plan-and-Execute on a 10-step task uses 1-2 LLM calls vs ~10 for ReAct, per the 2026 LangChain benchmark, and hit 92% completion accuracy vs 85% for ReAct on the same suite.
When NOT to use: highly dynamic environments where the plan goes stale mid-execution. If observation N invalidates step N+5, you'll re-plan constantly and lose the cost advantage.
Production example: our weekly competitor monitoring agent plans the full crawl-and-diff workflow on Sunday night, then executes 40 cheap parallel scrapes. One Claude Sonnet 4.5 plan call; 40 Haiku scrape calls. Cost is ~85% lower than the equivalent ReAct loop.
Pattern 4: Tool Use -- how do agents call external functions?
Tool Use is the pattern where an LLM emits a structured function call, an external system executes it, and the result is fed back into the context. It's the foundational primitive every other agent pattern composes around. OpenAI shipped function calling in June 2023; Anthropic's tool_use schema followed; MCP standardized it in late 2024.
tools = [search_web, query_db, send_email]
response = llm(messages, tools=tools)
if response.tool_calls:
    for call in response.tool_calls:
        result = dispatch(call.name, call.args)
        messages.append(tool_result(call.id, result))
    response = llm(messages, tools=tools)
return response.content
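What the model actually sees is a schema per tool, not the Python function. A representative definition in Anthropic's tool_use format (the tool name, description, and fields here are illustrative):
search_web_tool = {
    "name": "search_web",
    "description": "Search the web and return the top results as JSON.",
    "input_schema": {                      # JSON Schema describing the arguments
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"},
            "max_results": {"type": "integer", "description": "How many results to return"},
        },
        "required": ["query"],
    },
}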
When to use: anything that needs fresh data, side effects, or computation outside the LLM. Per BenchLM's 2026 leaderboards, frontier models pick the right tool reliably -- the failure mode has shifted to using the result correctly.
When NOT to use: pure reasoning tasks (summarization, classification on text already in-context). Tools add latency and a failure surface. The Berkeley Function Calling Leaderboard shows non-trivial degradation when irrelevant tools are present in the prompt.
Production example: our pricing-research agent has exactly four tools (search, fetch_url, extract_table, compare_prices). Adding a fifth dropped eval accuracy by 8 points. We deleted it.
Pattern 5: Routing -- how do you direct queries to the right agent?
Routing classifies an input once and dispatches it to a specialized handler. It's the one-to-one dispatcher pattern. Anthropic recommends it whenever you have distinct verticals where conflating them in one prompt would degrade performance.
router = llm("Classify into: refund | technical | sales", query)
if router == "refund": return refund_agent(query)
if router == "technical": return tech_agent(query)
if router == "sales": return sales_agent(query)
When to use: mixed-intent inboxes (support, sales, billing in one channel), multi-domain assistants, model routing (cheap model for FAQs, expensive for code). LangChain's router docs note four flavors: rule-based, intent, semantic-embedding, and LLM-based.
When NOT to use: single-domain agents. A router on a one-handler agent is a useless extra LLM call. Also avoid when categories overlap heavily -- ambiguous routing is worse than a single broad-skilled agent.
Production example: our inbound DM agent routes between book_call, answer_question, and escalate_to_human. A 4B-parameter classifier handles the route in ~80ms; the downstream Claude Sonnet handles the substance. Replacing the router with a single Sonnet call increased cost 6x with no quality lift.
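Of the four flavors above, semantic-embedding routing is often the cheapest at scale: embed a handful of example queries per route once, then send each incoming query to the nearest centroid. A minimal sketch, assuming a hypothetical embed() function that returns a vector:
import numpy as np
ROUTE_EXAMPLES = {
    "book_call": ["can we get a demo", "schedule a call with your team"],
    "answer_question": ["how does pricing work", "do you support SSO"],
    "escalate_to_human": ["I need to talk to a person", "this is urgent"],
}
# Precompute one centroid per route from labeled examples (offline, once).
centroids = {
    route: np.mean([embed(q) for q in examples], axis=0)
    for route, examples in ROUTE_EXAMPLES.items()
}
def route(query):
    v = embed(query)
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda r: cosine(v, centroids[r]))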
Pattern 6: Orchestrator-Worker -- how does dynamic delegation work?
Orchestrator-Worker uses a central LLM to dynamically decompose a task into subtasks at runtime, delegate each to a worker LLM, and synthesize the results. Unlike Parallelization, subtasks are not pre-defined -- the orchestrator decides them per input. The pattern is documented in Anthropic's 'Building Effective Agents'.
subtasks = orchestrator_llm("Decompose this task", input)
results = [worker_llm(s) for s in subtasks] # often parallel
final = orchestrator_llm("Synthesize", results)
return final
When to use: compound tasks where you can't predict the subtasks in advance. Anthropic's example: a coding task where the number of files to change depends on the request. Other fits: deep-research agents, refactoring across unknown file counts, multi-source synthesis.
When NOT to use: tasks with fixed, known steps -- use Plan-and-Execute or a hard-coded workflow. The orchestrator's flexibility is also its cost; you pay for runtime decomposition.
Production example: our content-audit agent gets a single URL and an orchestrator decides whether to fetch the page, the sitemap, the schema markup, the backlink profile, or all four. Worker Haiku calls run in parallel; the orchestrator Sonnet synthesizes. We can't hardcode the decomposition because it depends on what the page actually is.
Pattern 7: Evaluator-Optimizer -- how do agents iterate to a higher bar?
Evaluator-Optimizer pairs a generator LLM with an evaluator LLM in a loop until the output passes a quality threshold. Where Reflection has the same model self-critique, Evaluator-Optimizer typically uses a separate, harder-to-please evaluator. Anthropic cites complex search and translation as canonical use cases.
output = generator_llm(input)
for _ in range(max_iters):
    verdict, feedback = evaluator_llm(input, output)
    if verdict == "PASS":
        return output
    output = generator_llm(input, prior=output, feedback=feedback)
return output    # best effort
When to use: when you have clear, evaluable criteria and the first draft routinely misses them -- literary translation, complex search refinement, code that must pass tests, copy that must meet a brand voice rubric.
When NOT to use: subjective tasks without crisp evaluation criteria. If the evaluator can't reliably distinguish good from bad, the loop oscillates or rubber-stamps. Also avoid when one good attempt is sufficient -- the second LLM is pure overhead.
Production example: our outbound email agent generates a draft, the evaluator scores it on (a) personalization, (b) CTA clarity, (c) banned-phrase count. If any score is below threshold, the generator revises with that specific feedback. Capped at 2 retries -- past that, returns flatline.
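The loop only works if the evaluator's verdict is machine-readable. In practice that means asking the evaluator for structured output and gating on explicit thresholds -- a sketch along the lines of the email example above (the rubric keys and threshold values are illustrative):
import json
THRESHOLDS = {"personalization": 7, "cta_clarity": 8, "banned_phrases": 0}
def evaluate(draft):
    raw = evaluator_llm(
        "Score this draft. Reply with JSON keys: personalization (0-10), "
        "cta_clarity (0-10), banned_phrases (count), feedback (one paragraph).",
        draft,
    )
    scores = json.loads(raw)
    passed = (
        scores["personalization"] >= THRESHOLDS["personalization"]
        and scores["cta_clarity"] >= THRESHOLDS["cta_clarity"]
        and scores["banned_phrases"] <= THRESHOLDS["banned_phrases"]
    )
    return ("PASS" if passed else "FAIL"), scores["feedback"]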
Pattern 8: Parallelization -- when should agents work in parallel?
Parallelization fans the same input out to multiple LLM calls running concurrently, then aggregates. Anthropic splits it into two flavors: Sectioning (different subtasks in parallel) and Voting (same task multiple times for higher confidence).
# Sectioning
results = parallel([
    llm("Check for PII", input),
    llm("Check for toxicity", input),
    llm("Check for off-topic", input),
])
return aggregate(results)
# Voting
votes = parallel([llm("Is this fraud?", input) for _ in range(5)])
return majority(votes)
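The parallel() helper is doing the real work here. With blocking LLM calls, a thread pool is the simplest way to get concurrent fan-out -- a minimal sketch, assuming each call is wrapped as a zero-argument callable so it runs inside the pool rather than at list-build time:
from concurrent.futures import ThreadPoolExecutor
def parallel(calls):
    # `calls` is a list of zero-argument callables; results come back in submission order
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(c) for c in calls]
        return [f.result() for f in futures]
# Usage for the voting flavor:
votes = parallel([lambda: llm("Is this fraud?", input) for _ in range(5)])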
When to use: independent subtasks (guardrails -- safety check + main response in parallel), high-stakes classifications where ensembling lifts accuracy, and any time wall-clock latency matters more than total tokens.
When NOT to use: sequentially dependent steps. If step B needs step A's output, parallelization is impossible. Voting is also wasteful for low-stakes calls -- 5x cost for a trivial accuracy lift is bad math.
Production example: our content publish flow runs three guardrails in parallel before posting (em-dash detector, hallucination cross-check, banned-phrase scanner). Total wall-clock latency = max of the three, not sum.
Pattern 9: Hierarchical Multi-Agent -- how do supervisors coordinate teams?
Hierarchical Multi-Agent organizes specialist agents under one or more supervisor agents that route work and aggregate results. LangGraph's Supervisor library is the reference implementation; you can stack supervisors-of-supervisors for large enterprise topologies.
supervisor = LLM("Decide which team handles this turn")
teams = {
    "research": research_supervisor,    # has 3 worker agents
    "writing": writing_supervisor,      # has 2 worker agents
    "qa": qa_supervisor,                # has 2 worker agents
}
while not done:
    next_team = supervisor(state)
    state = teams[next_team].run(state)
When to use: large workflows with genuinely distinct specialist domains where each domain has its own tools, prompts, and evals -- e.g., research + writing + legal review.
When NOT to use: anything a single ReAct agent can do. Hierarchical systems multiply context-passing failures, debugging surface area, and cost. Anthropic's published guidance is explicit: don't reach for multi-agent until single-agent evals fail.
Production example: our SEO content factory runs a Research supervisor (with 3 search/scrape workers) handing off to a Writing supervisor (drafter + editor) handing off to a QA supervisor (schema validator + brand checker). Total: 7 worker agents, 3 supervisors, one orchestrator on top. We started with one ReAct agent. Evals forced this topology over 6 months -- not architecture astronaut-ing.
How do you combine multiple patterns in a real agent?
Production agents typically layer 3-5 patterns. The default stack: Routing on the front, Tool Use as the primitive, ReAct as the loop, Reflection on the output, and Parallelization for guardrails. Per the Anthropic essay, "these building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases."
A realistic combined flow (a code skeleton follows the list):
- Router classifies the request into a domain.
- Orchestrator decomposes the task into subtasks if needed.
- Workers execute each subtask with Tool Use inside a ReAct loop.
- Parallelization runs guardrails alongside the main response.
- Evaluator-Optimizer retries on quality miss.
- Reflection does one final self-critique before returning.
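A skeleton of that stack, composing the helpers sketched in earlier sections (every name here is hypothetical):
def handle(request):
    domain = route(request)                                          # 1. Routing
    subtasks = orchestrator_llm("Decompose into subtasks", request)  # 2. Orchestrator
    results = [react(t, tools=TOOLS[domain]) for t in subtasks]      # 3. ReAct loop + Tool Use per subtask
    draft = synthesize(results)
    guards = parallel([lambda: check_pii(draft),                     # 4. Guardrails in parallel
                       lambda: check_banned_phrases(draft)])
    if not all(guards):
        return escalate_to_human(request)
    verdict, feedback = evaluate(draft)                              # 5. Evaluator-Optimizer, one retry
    if verdict == "FAIL":
        draft = generator_llm(request, prior=draft, feedback=feedback)
    critique = llm("Find flaws before returning", draft)             # 6. One Reflection pass
    return draft if critique == "OK" else llm("Revise using critique", draft, critique)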
The rule: every pattern in the stack must be justified by an eval that fails without it. Most teams over-engineer. Start with Tool Use + ReAct, run real evals, and add the next pattern only when evals point at a specific failure mode.
Which patterns are overkill for a single-task agent?
Hierarchical Multi-Agent, Orchestrator-Worker, and Evaluator-Optimizer are overkill for single-task agents. If your agent does one thing -- summarize a doc, classify a ticket, draft an email -- a basic LLM call or a tight ReAct loop wins on latency, cost, and debuggability.
Quick decision rules:
- One prompt, no tools = no pattern. Just call the LLM.
- One or two tool calls, predictable = Tool Use only.
- Variable steps, exploratory = ReAct.
- Quality gates important, latency tolerant = ReAct + Reflection.
- Multi-domain inbox = Routing + Tool Use.
- Anything multi-agent = only after single-agent evals plateau.
The 2024 Anthropic essay hammers this: agentic complexity should be earned, not assumed. Most production failures between 2024 and 2026 were architectural over-engineering, not model quality. Start small. Add patterns only when evals prove you must.
| Pattern | Best for | Avoid when | Canonical source |
|---|---|---|---|
| ReAct | Open-ended research, debugging, exploratory tool use | Tasks with predictable, repeatable steps | Yao et al., 2022 |
| Reflection | Code generation, long-form writing, fact-checking | Latency-sensitive UX (chat, autocomplete) | Shinn et al., 2023 |
| Plan-and-Execute | Multi-step workflows with clear dependencies | Highly dynamic environments where plans break | LangChain, 2023 |
| Tool Use | Anything needing real-world data or actions | Pure reasoning or summarization tasks | OpenAI function calling, 2023 |
| Routing | Mixed-intent inboxes, multi-domain support | Single-domain agents | Anthropic, 2024 |
| Orchestrator-Worker | Tasks where subtasks can't be predicted upfront | Tasks with fixed, known steps | Anthropic, 2024 |
| Evaluator-Optimizer | Translation, copywriting, search refinement | Tasks without clear evaluation criteria | Anthropic, 2024 |
| Parallelization | Guardrails, voting, independent subtasks | Sequentially dependent steps | Anthropic, 2024 |
| Hierarchical Multi-Agent | Large enterprise workflows with specialist teams | Anything a single agent can do | LangGraph Supervisor, 2025 |