A production AI agent stack is the set of nine layers that turn a working prototype into a system you can run for real users at real volume: a model, a runtime like the Claude Agent SDK, a tool layer built on MCP, memory in Postgres, observability through OpenTelemetry and Langfuse, evals wired into CI, guardrails, a deploy target (Lambda, Cloud Run, or Modal), and cost controls. Six layers are non-negotiable. Three you can defer. This guide shows the picks, the alternatives, and the order to add them.

What does a production AI agent stack look like in 2026?

A production AI agent stack is a layered architecture where each layer solves one failure mode: the model produces tokens, the runtime executes the agent loop, MCP exposes tools, Postgres stores the conversation log, OpenTelemetry exports traces, evals gate deploys, guardrails block injection, the deploy target hosts the runtime, and cost controls cap runaway loops.

The shift from 2025 to 2026 is consolidation. Per the 2026 MCP Roadmap, MCP became the default tool protocol after being donated to the Linux Foundation in December 2025. OpenTelemetry won observability. Postgres won memory. The Claude Agent SDK, OpenAI Agents SDK, and Google ADK are the three runtimes shipping in production.

The stakes are high. According to Gartner (June 2025), more than 40% of agentic AI projects will be canceled by the end of 2027, primarily because of escalating costs, unclear business value, and inadequate risk controls. The teams that ship are the ones who build the boring layers: state, evals, cost caps, guardrails.

The nine layers, in order of who-touches-what-on-every-request:

# Layer Our pick Strong alternatives Skip until...
1 Model Claude Sonnet 4.5 + Haiku router GPT-5, Gemini 2.5 Pro Never. Required.
2 Runtime Claude Agent SDK OpenAI Agents SDK, LangGraph, Google ADK Never. Required.
3 Tools MCP (Streamable HTTP) Native function calling Never. Required.
4 Memory Postgres + Mem0 Redis, pgvector, Zep After 2-turn agents
5 Observability OpenTelemetry + Langfuse LangSmith, Braintrust, Arize Day 1. Don't skip.
6 Evals DeepEval or Braintrust in CI Confident AI, Galileo First production user
7 Guardrails Input/output filters + LLM judge Lakera, Cortex, Straiker First external input
8 Deploy Modal (solo) / Lambda + Fargate (enterprise) Cloud Run, Vercel, Railway Never. Required.
9 Cost controls Per-tenant token caps + circuit breakers LiteLLM gateway, Prefactor $1k/month spend
Production AI Agent Project Outcomes (2026)
Reach production
12%
Canceled by 2027 (forecast)
40%
Stall before ROI
72%
Source: Gartner 2025-2026 + Digital Applied

Which model layer should production agents use?

Production agents in 2026 run a two-tier model strategy: a flagship model (Claude Sonnet 4.5, GPT-5, or Gemini 2.5 Pro) for planning and tool selection, and a cheap model (Haiku, GPT-5 mini, Flash) for parsing, classification, and follow-up turns. Routing 30-50% of calls to the cheap tier is the single highest-leverage cost lever.

The data backs this up. Per Datagrid's cost optimization research, one team cut per-task cost from $0.15 to $0.054 by routing 40% of queries to a cheaper model -- a 64% reduction with no quality loss on the routed tasks.

Pick by use case:

  • Code, long agentic loops, computer use: Claude Sonnet 4.5
  • Open-ended reasoning, tool-heavy multi-step: GPT-5
  • Long context (1M+ tokens), search-grounded answers: Gemini 2.5 Pro
  • Cheap parsing, classification, summarization: Claude Haiku, GPT-5 mini, Gemini Flash

What to skip: Hosting your own open-weights model. Unless you're at >$50k/month in API spend or have hard data residency requirements, the operational cost of running Llama 3 or Qwen on H100s exceeds API cost once you account for engineering time, idle GPU burn, and cold starts.

Which runtime should you use to execute the agent loop?

The runtime is the code that runs the loop: prompt the model, parse the tool call, execute the tool, append the result, repeat until done. In 2026 you have three credible options: the Claude Agent SDK, the OpenAI Agents SDK, and Google's ADK. Frameworks like LangGraph sit a layer above and add graph orchestration.

Our pick: Claude Agent SDK. It ships the agent loop, tool registry, streaming protocol, sub-agent support, and Skills out of the box. Per Anthropic's engineering team, it's the same harness that powers Claude Code in production.

Critical pattern from production deployments:

"Treat the SDK session as ephemeral and the conversation log as the source of truth." -- Autoolize Production Playbook

This matters because every agent runtime crashes eventually. If your durability story is "the SDK is running," you'll lose user state. If it's "the SDK is replaying from Postgres," you survive container restarts, deploys, and OOMs.

When to choose the alternatives:

  • OpenAI Agents SDK: You're already invested in GPT models and Responses API.
  • LangGraph: You need explicit graph-based control flow with checkpoints.
  • Google ADK: You're on Vertex AI and need first-class Gemini grounding.
  • Build your own loop: Almost never. The runtimes above are 6 months of engineering you don't need to do.

Which tool layer should production agents use?

Use MCP (Model Context Protocol) for every tool the agent calls. MCP is the open standard, donated to the Linux Foundation in December 2025, that lets you expose tools, resources, and prompts to any agent runtime over a uniform JSON-RPC interface. You write the server once. Claude, ChatGPT, Cursor, and every other agent runtime can use it.

The adoption curve is real, not hype. Per Digital Applied's MCP Adoption Statistics 2026:

  • 78% of enterprise AI teams have at least one MCP-backed agent in production (April 2026)
  • 97 million monthly SDK downloads (March 2026, a 970x increase in 18 months)
  • 5,800+ public MCP servers
  • 67% of CTOs name MCP their default agent-integration standard within 12 months

Use Streamable HTTP transport, not stdio, for production. Streamable HTTP, finalized in the 2025 spec, lets you run MCP servers as remote services with auth, scaling, and observability. Stdio is a developer-laptop pattern.

The proxy pattern is the security default. Per Anthropic's secure deployment docs, instead of giving the agent a raw API key, run a proxy outside the agent's environment that injects credentials into outbound requests. The agent can call the API; it never sees the key. This is non-negotiable for multi-tenant deployments.

MCP Adoption Among Enterprise AI Teams (2026)
Have MCP-backed agent in production
78
Naming MCP as default integration standard
67
Monthly MCP SDK downloads (millions)
97
Source: Digital Applied MCP Adoption Statistics, 2026

How should you architect agent memory in production?

Production agent memory is a Postgres conversation log plus a structured memory layer. Postgres holds the source-of-truth message history (durable, ACID, queryable). On top of it, Mem0 or pgvector handles long-term episodic and semantic memory: facts the agent needs to recall across sessions.

Why Mem0: per the Mem0 production paper (arXiv:2504.19413), it outperforms OpenAI's native memory by 26% on the LOCOMO benchmark with lower latency and reduced token usage. It supports 24+ vector backends, but pgvector keeps your stack to one database.

Three memory types every agent needs:

  1. Episodic -- what happened in past sessions ("user asked about X last Tuesday"). Stored as embedded summaries.
  2. Semantic -- what the system knows about the user or domain ("user works at Acme, prefers SQL examples"). Stored as structured facts.
  3. Procedural -- how to do recurring tasks ("to refund an order, call X then Y"). Stored as reusable Skills or prompt templates.

What to skip until you need it: A dedicated vector database. Pinecone, Weaviate, Qdrant are excellent, but pgvector handles 10M+ vectors fine and saves you a service. Promote to a dedicated vector DB when query latency on pgvector exceeds 200ms p95 or when you cross 50M vectors.

Why is observability the layer you cannot skip?

Agent observability is structured tracing of every model call, tool invocation, and decision point -- not log files. Without it, debugging a misbehaving agent in production is impossible because you can't replay non-deterministic LLM outputs from logs alone.

The industry has converged on OpenTelemetry as the wire format and tools like Langfuse, LangSmith, and Braintrust as the backend. Per Langfuse's OTEL integration docs, Pydantic AI, smolagents, Strands Agents, and Amazon Bedrock AgentCore all emit OTEL traces natively. Pick the backend; the wire format is settled.

Our pick: OpenTelemetry + Langfuse (self-hosted or cloud). Reasons: open source, OTEL-native, ships LLM-specific helpers (token cost, prompt linking, scoring), and the data model maps to traces you can replay. AWS uses Langfuse as the reference observability stack for Bedrock AgentCore.

Wire it day one. Five fields per span are enough to start: input tokens, output tokens, latency, cost in USD, and tool name. Add user_id and tenant_id immediately so you can filter by who. Without these, your first production incident takes 8 hours instead of 20 minutes.

How do evals function as a CI gate for agent changes?

An eval CI gate runs your agent's golden test set on every pull request and fails the build if the score drops below threshold. This is how you ship prompt or model changes without regressions reaching users. It's the difference between "deploy and pray" and an actual engineering process.

Per Anthropic's evals guidance, automated evals are most valuable in CI/CD as the first line of defense against quality regressions. Galileo's CI/CD framework describes regression gates that block deployments reducing quality below defined thresholds.

Three eval layers you wire in order:

  1. Unit evals (10-30 examples per critical task). Run on every PR. <30 seconds. Hard fail.
  2. Integration evals (50-200 examples covering tool sequences). Run nightly + on release branches. Soft fail with PR comment.
  3. Online evals (sampled production traffic, scored by LLM-judge). Run continuously. Alert on drift.

Tool picks: DeepEval for open-source CI integration, Braintrust if you want a managed platform with native GitHub Action support. Both let you set per-metric thresholds that block the build.

What to skip: Building your own eval framework. The boilerplate (running parallel completions, scoring with LLM-judges, aggregating, reporting) is exactly what these tools solve.

What guardrails do production agents actually need?

Production agents need four guardrail types: input filtering (catch prompt injection before the model sees it), output filtering (catch data exfiltration before the user sees it), tool permission scoping (least-privilege per tenant), and a behavioral monitor (LLM-judge or rules engine sampling production traffic).

Prompt injection is the #1 vulnerability. Per Galileo's prompt injection research and Snowflake's Cortex AI Guardrails research, pattern-based filters miss encoded instructions, emoji-based bypasses, and multi-step hijacks -- which are the attacks actually used against agents in 2026. You need a semantic detector, not regex.

NVIDIA's AI Red Team has documented multimodal injections using emoji rebus puzzles that bypass existing guardrails. Even Claude Code, Gemini CLI, and GitHub Copilot have shipped with prompt injection vulnerabilities via comments in code.

Practical defaults:

  • Input filter: Lakera Guard or a Claude Haiku judge with a tight system prompt. <100ms added latency.
  • Output filter: Regex for PII + LLM judge for data exfiltration patterns.
  • Tool scoping: Per-tenant allowlists. The CRM agent for tenant A cannot access tenant B's data, ever.
  • Behavioral monitor: Sample 1-5% of production traffic, judge with a stronger model, alert on anomaly score.

What to skip on day one: A full red-teaming program. Wait until you have paying customers. But ship the four guardrails above before the first external user hits the system.

How do you deploy an AI agent in production?

Deploy shape depends on the agent's runtime profile. Short-lived agents (<60s, no streaming) go on Lambda or Cloud Run. Long-running, streaming, or stateful agents go on Fargate, Modal, or a sandboxed container platform. Forcing a long-running agent into Lambda is the most common deploy mistake teams make.

Per Anthropic's Agent SDK hosting docs, the SDK should run inside a sandboxed container with process isolation, resource limits, network egress control, and ephemeral filesystems. Container minimum cost is roughly $0.05/hour idle, so tune idle timeouts aggressively.

Deploy target by profile:

Profile Best target Why
Webhook-triggered, short tool agent AWS Lambda, Vercel Pay-per-invocation, scales to zero
Streaming chat with 30s+ turns Cloud Run, Modal, Fargate No 15-min Lambda timeout, WebSocket support
Long-running coding agent Modal, sandboxed containers Filesystem persistence, GPU access
Multi-tenant SaaS at scale Fargate + ALB, EKS Predictable cost curve, fine-grained networking
Solo dev shipping fast Modal Decorate a Python function, deploy in seconds

Modal is the solo-dev pick. Per Modal's docs, you decorate a Python function and deploy. No Dockerfile, no Terraform. Cold starts are seconds, scaling is automatic, and you get GPU access when needed. For solo founders, this saves a week of platform engineering.

How do you control AI agent costs in production?

Control agent costs with four layered caps: per-request token budget (kill runaway loops), per-task budget (prevent infinite tool retries), per-user daily limit (cap individual abuse), and per-tenant monthly budget (protect margin). All four enforce in the agent harness, not after the fact.

Per Runyard's cost control research, production teams use circuit breakers that halt agents when token consumption exceeds threshold, with real-time tracking on every step. Teams that actively track cost metrics reduce per-output cost by 20-40% within the first month (Datagrid).

Other costs you'll discover after you ship. Per Automation Labs' production cost analysis, tokens are usually not the biggest line item. Vector DB hosting, observability ingestion, container idle time, third-party tool API calls, and the engineer hours debugging non-determinism each often exceed the LLM bill. Expect $3,200-$13,000/month in operational spend for a single production agent past launch (Azilen).

Five cost moves with the highest ROI:

  1. Route 30-50% of calls to a cheap model with a router (Haiku, Flash, GPT-5 mini)
  2. Cache prompts >1024 tokens that repeat (Claude prompt caching = 90% off cached tokens)
  3. Compact conversation history before context fills, not after
  4. Set hard per-tenant monthly caps with auto-pause + email alert
  5. Use spot/preemptible instances for non-streaming background agents

What's the right stack for a solo developer vs an enterprise?

Solo dev: optimize for shipping. Skip every layer you can defer. Enterprise: optimize for survival. Wire every layer day one because the cost of a missing guardrail or runaway loop in a multi-tenant deployment is six figures, not a Twitter screenshot.

Solo developer minimal stack (ship in a weekend):

  • Model: Claude Sonnet 4.5 via API
  • Runtime: Claude Agent SDK
  • Tools: 1-2 MCP servers (build or pull from registry)
  • Memory: Postgres on Neon or Supabase, no Mem0 yet
  • Observability: Langfuse cloud (free tier)
  • Evals: 10 hand-written test cases in DeepEval
  • Guardrails: Skip. Add when you onboard external users.
  • Deploy: Modal
  • Cost controls: A single per-request token cap. Hard limit at $50/day.

Enterprise stack (multi-tenant SaaS, regulated industry):

  • Model: Multi-provider via gateway (Anthropic + OpenAI + Bedrock fallback)
  • Runtime: Claude Agent SDK in sandboxed Fargate containers
  • Tools: MCP servers behind a credential proxy with per-tenant scoping
  • Memory: Postgres + Mem0 + pgvector, encrypted at rest, per-tenant isolation
  • Observability: Self-hosted Langfuse + OpenTelemetry collector + SIEM export
  • Evals: DeepEval or Braintrust gating every PR + nightly integration evals
  • Guardrails: Lakera or Cortex input/output + LLM judge on 5% of traffic
  • Deploy: ECS Fargate behind ALB, Lambda for short tool agents
  • Cost controls: 4-layer cap (request, task, user, tenant) + LiteLLM gateway for unified billing

The gap between these two stacks is where most teams die. They build the solo stack, get traction, then try to retrofit guardrails and cost controls under load. Wire the enterprise layers before you sign the first enterprise contract.

Which components are non-negotiable vs optional?

Six layers are non-negotiable for any agent serving real users: model, runtime, tool layer, conversation log persistence, observability, and a basic cost cap. Three layers can be deferred: dedicated memory beyond Postgres, full eval CI, and advanced guardrails. But the deferral has expiration dates -- don't push them past the events listed below.

Non-negotiable from day one:

  1. Model + runtime -- you have no agent without these
  2. MCP tool layer -- raw function calls don't survive a model swap
  3. Postgres conversation log -- no log = no replay = no debugging
  4. OpenTelemetry tracing -- 80% of production debugging happens here
  5. Per-request token cap -- one runaway loop can cost $500 in 5 minutes
  6. Deploy target with sandbox isolation -- never run agents in your main app process

Defer until trigger event:

Layer Defer until
Mem0 / vector memory Agent has 3+ turns or cross-session recall requirement
Eval CI gate First paying customer or first prompt regression incident
LLM-judge guardrails First external (untrusted) input source
Multi-provider model gateway Vendor outage causes a real incident or $5k+/month spend
Per-tenant cost dashboards 10+ tenants or first "why is my bill so high" support ticket

What to never build:

  • Your own agent runtime (use Claude Agent SDK, OpenAI Agents SDK, or LangGraph)
  • Your own observability backend (use Langfuse, LangSmith, or Braintrust)
  • Your own MCP transport (use Streamable HTTP from the spec)
  • Your own eval scoring framework (use DeepEval or Braintrust)

The pattern: buy or open-source the infrastructure, write the business logic. Teams that invert this die in maintenance.

LayerOur pickStrong alternativesSkip until...
ModelClaude Sonnet 4.5 + Haiku routerGPT-5, Gemini 2.5 Pro, Llama 3.3 self-hostedNever -- required
RuntimeClaude Agent SDKOpenAI Agents SDK, LangGraph, Google ADKNever -- required
Tool layerMCP via Streamable HTTPNative function calling, OpenAPI pluginsNever -- required
MemoryPostgres conversation log + Mem0pgvector, Redis, Zep, PineconeDefer Mem0 until 3-turn agents
ObservabilityOpenTelemetry + LangfuseLangSmith, Braintrust, Arize, HeliconeDay 1 -- do not skip
Evals (CI gate)DeepEval or BraintrustConfident AI, Galileo, PromptfooFirst paying customer
GuardrailsLakera + LLM-judge samplingSnowflake Cortex, Straiker, AWS Bedrock GuardrailsFirst external input
Deploy targetModal (solo) / Fargate (enterprise)Lambda, Cloud Run, Vercel, Railway, EKSNever -- required
Cost controls4-layer caps + LiteLLM gatewayPrefactor, Helicone, custom middleware$1k/month spend