Most "AI agent tools" articles are framework lists. LangGraph, CrewAI, AutoGen, repeat. Useful, but a framework is the conductor, not the orchestra. The tools below are what we actually call into when an agent runs in production at Growth Engineer. Browser runtimes, code sandboxes, search APIs, auth-aware tool registries, memory layers, observability, durable workflow engines, deploy targets. For each one: what it does, when it earned its spot in our stack, and what it replaced. If you've already shipped your first agent with the Claude SDK, this is the next layer.
What tools do AI agents actually need (and why frameworks aren't the answer)?
Production agents need eight tooling layers beyond the framework: browser, sandbox, search/scrape, auth, memory, observability, durable execution, and deployment. A framework like LangGraph or the Claude Agent SDK is the conductor. These tools are the orchestra.
The gap is real. According to Databricks' 2026 State of AI Agents report, 51% of enterprises now run agents in production and 23% are scaling them. But LangChain's 2026 State of Agent Engineering data, summarized in Digital Applied's 2026 enterprise breakdown, shows 88% of agent pilots never reach production -- and 64% of teams cite evaluation and observability as the single largest blocker.
That is not a framework problem. It is an infrastructure problem.
The 15 tools below are organized by the question they answer in a real run. Each entry includes when it earned its spot in our stack and what it replaced. We are not getting paid by any of them.
If you want the deeper wiring of how they interconnect, see the full production agent stack.
What is the best browser automation tool for AI agents?
Browserbase and Browserless are the two browser runtimes worth considering in 2026. Pick Browserbase if you want the cleanest agent SDK on top of Playwright. Pick Browserless if you want self-hostability or BrowserQL.
The market consolidated fast. Per browser-use's GitHub, browser-use went from zero to 78K stars in months, and managed cloud browsers are now the default because anti-bot detection has outpaced what most teams can maintain in-house.
1. Browserbase + Stagehand
What it does. Browserbase runs headless Chrome sessions in the cloud with persistent state, anti-bot defenses, and CDP access. Their open-source SDK Stagehand gives you four primitives -- act, extract, observe, agent -- so an LLM can drive a browser with natural language instead of brittle selectors.
When it earned its spot. The first time we shipped an agent that scraped a JS-heavy SPA behind login. Our self-hosted Playwright fleet on EC2 was breaking weekly on Cloudflare challenges.
What it replaced. A homegrown Playwright cluster, sticky proxy rotation, and a Slack channel for failure alerts.
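Here is roughly what act and extract look like from the Python SDK. Treat it as a sketch: the config fields, the model choice, and the Hacker News example are ours rather than lifted from Stagehand's docs, and the SDK surface may have shifted since we last touched it. It assumes Browserbase and model API keys in the environment.

```python
# pip install stagehand  -- sketch only; verify config field names against the current docs
import asyncio
from stagehand import Stagehand, StagehandConfig

async def main():
    stagehand = Stagehand(StagehandConfig(
        env="BROWSERBASE",       # run the session on Browserbase rather than a local browser
        model_name="gpt-4o",     # the model that interprets act/extract instructions (our choice)
    ))
    await stagehand.init()
    page = stagehand.page

    await page.goto("https://news.ycombinator.com")
    await page.act("click the 'new' link in the top navigation")          # natural language, no selectors
    titles = await page.extract("the titles of the first five stories")   # natural-language extraction
    print(titles)

    await stagehand.close()

asyncio.run(main())
```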
2. Browserless
What it does. Browserless provides headless Chrome as a managed service that speaks Puppeteer, Playwright, Selenium, and their own GraphQL-style query language BrowserQL. Their 2026 State of Web Scraping report is a useful primary source on detection trends.
When it earned its spot. When a client required self-hosting for compliance and Browserbase's hosted-only model was a non-starter.
What it replaced. A custom Puppeteer cluster maintained by one engineer who eventually quit.
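The practical appeal is that existing Playwright or Puppeteer code barely changes: you swap the local browser launch for a remote CDP connection. A minimal sketch, assuming a Browserless API token; the WebSocket endpoint format below is illustrative, so copy the exact URL from your account dashboard.

```python
# pip install playwright  -- the only change from local Playwright is connect_over_cdp()
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(
        "wss://production-sfo.browserless.io?token=YOUR_BROWSERLESS_TOKEN"  # illustrative endpoint
    )
    page = browser.new_page()
    page.goto("https://example.com/dashboard")
    print(page.title())   # everything after connect is ordinary Playwright code
    browser.close()
```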
Which code execution sandbox should you use?
E2B is the default. Daytona is the answer when latency matters. Both let you run LLM-generated Python or JS in isolated cloud sandboxes and connect to any model -- unlike OpenAI's hosted Code Interpreter, which locks you to GPT.
3. E2B
What it does. E2B provides secure cloud sandboxes for AI-generated code with Python and JS/TS SDKs and a desktop variant for computer-use agents. Per E2B's homepage, the platform is used by 88% of Fortune 100 companies for frontier agentic workflows.
When it earned its spot. The day we let an agent execute pandas code on user-uploaded CSVs. Running that locally in Docker-in-Docker was reckless.
What it replaced. Docker-in-Docker on our own infra and a long list of CVE alerts we were ignoring.
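A minimal sketch of that pattern: run model-generated pandas code in an E2B sandbox instead of on your own machines. It assumes an E2B_API_KEY in the environment, and the method names follow the v1 Python SDK, so check the current docs if they have moved.

```python
# pip install e2b-code-interpreter
from e2b_code_interpreter import Sandbox

sandbox = Sandbox()   # fresh isolated cloud VM; nothing here touches your own infra

# The kind of code an agent generates against a user-uploaded CSV -- run it remotely
execution = sandbox.run_code(
    "import pandas as pd\n"
    "df = pd.DataFrame({'revenue': [120, 340, 95]})\n"
    "print(df['revenue'].describe())"
)
print("".join(execution.logs.stdout))   # stdout captured from the sandboxed run

sandbox.kill()   # tear the sandbox down once the agent is done with it
```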
4. Daytona
What it does. Daytona is an open-source secure infrastructure for running AI-generated code with sandboxes that, per Daytona's docs, spin up in under 90ms and support stateful snapshots across runs.
When it earned its spot. A real-time coding agent where 800ms of cold-start latency from our previous setup was the difference between magical and broken.
What it replaced. Fly.io machines we kept warm by hand and a Redis cache of half-finished runs.
How do you give an AI agent a useful view of the web?
You replace Google + custom scrapers with Exa for search and Firecrawl for fetch. Both are designed around the constraint that LLM context windows are expensive and most pages are 90% HTML noise.
5. Exa
What it does. Exa is an embeddings-first web search API built for agents. Per Exa's 2.0 announcement, Exa Fast hits sub-350ms P50 latency, roughly 30% faster than the next-fastest API. It returns token-efficient highlights -- the most relevant excerpts from each result -- using ~10x fewer tokens than full text.
When it earned its spot. When our research agent's prompts started hitting Google's rate limits and the parsed snippets were too short to ground answers.
What it replaced. Google Custom Search Engine plus a custom snippet extractor that broke every other Tuesday.
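A sketch of the highlights call we reach for most, assuming exa-py and an API key; the query and the result handling are illustrative.

```python
# pip install exa-py
from exa_py import Exa

exa = Exa(api_key="YOUR_EXA_API_KEY")

results = exa.search_and_contents(
    "evaluation frameworks for multi-step LLM agents",
    num_results=5,
    highlights=True,   # short relevant excerpts instead of full page text
)

for result in results.results:
    print(result.title, result.url)
    for highlight in (result.highlights or []):
        print("  -", highlight)
```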
6. Firecrawl
What it does. Firecrawl turns any URL into clean LLM-ready markdown or JSON, handles JS rendering automatically, and exposes /scrape, /crawl, /search, and /extract endpoints. Pass a JSON schema and it returns structured data with no parsing.
When it earned its spot. The third time we wrote a Cheerio + Readability fallback chain to clean up scraped HTML for context.
What it replaced. A 600-line scraping library, a graveyard of CSS selectors, and one engineer's weekend.
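A sketch of the single-call scrape, assuming the v1-style Python SDK; the method name and return shape have moved between SDK versions, so verify against the current docs before copying.

```python
# pip install firecrawl-py
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_KEY")

# One call: fetch, render JS if needed, strip boilerplate, return LLM-ready markdown
doc = app.scrape_url("https://example.com/pricing", formats=["markdown"])
print(doc.markdown[:500])
```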
How do you handle OAuth and tool calling for AI agents?
You don't write it yourself. You use Composio or Arcade.dev. OAuth, token refresh, and per-user permissions are the most demoralizing code in any agent codebase.
7. Composio
What it does. Composio is a tool registry with 1,000+ toolkits across Gmail, Slack, GitHub, Notion, Google Workspace, and more. Per Composio's AgentAuth page, it handles OAuth2, API keys, JWT, and token lifecycle across 250+ apps so your agent code can focus on reasoning.
When it earned its spot. The third time a different engineer rewrote a Google OAuth flow.
What it replaced. Hand-rolled OAuth callbacks, a Postgres oauth_tokens table, and a refresh cron job.
8. Arcade.dev
What it does. Arcade is an MCP runtime for production agents. Per Arcade's docs, agents act with user-specific permissions rather than service accounts, and Arcade injects credentials at call time so the LLM never sees the OAuth token. They claim 7,000+ pre-built integrations.
When it earned its spot. A multi-tenant Slack agent where one shared bot token would have been a security incident waiting to happen.
What it replaced. A service-account token in our .env file and an audit-trail conversation we did not want to have.
How do AI agents remember things across runs?
Letta is an agent runtime with memory built in. Mem0 is a memory layer you bolt onto whatever framework you already use. Pick by how much of your stack you want to own.
9. Letta
What it does. Letta (formerly MemGPT) is a stateful agent platform with tiered memory: core (in-context), recall (recent), and archival (vector-searchable). Per Letta's benchmarking post, Letta agents maintain task context across 500+ interactions, where typical RAG baselines fragment after 50.
When it earned its spot. A long-running customer-support agent where the same user came back two weeks later and we were tired of pretending we remembered them.
What it replaced. Stuffing everything into the system prompt and watching token costs explode.
10. Mem0
What it does. Mem0 is a memory layer that adds persistent recall to any agent framework -- LangGraph, CrewAI, the OpenAI SDK, the Anthropic SDK. Smaller surface area than Letta, faster to integrate.
When it earned its spot. A side-project agent where we did not want to migrate runtimes just to add memory.
What it replaced. A Postgres table called notes and a SQL query I refuse to share.
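A sketch of the bolt-on pattern, assuming the mem0ai package with its default config (which expects an LLM and embedding provider key such as OPENAI_API_KEY); the user ID and memory text are illustrative, and argument shapes vary slightly by version.

```python
# pip install mem0ai
from mem0 import Memory

memory = Memory()   # default config uses an LLM + embedder under the hood

# After a support conversation, persist what mattered, keyed to the user
memory.add("Prefers refunds as account credit, not card reversal", user_id="user-482")

# Two weeks later, a new agent run pulls the relevant memories back before responding
hits = memory.search("how does this user want refunds handled?", user_id="user-482")
print(hits)
```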
How do you observe and debug AI agent runs?
Use Langfuse for traces and evals. Use Helicone for proxying and caching. They are not interchangeable. Per groundcover's 2026 observability guide, only 4% of organizations have reached full agent observability maturity, even as 80% of Fortune 500 companies have agents in production. Most teams are flying blind.
11. Langfuse
What it does. Langfuse is an open-source LLM engineering platform with tracing, prompt management, evaluations, and datasets. Per Langfuse's homepage, it is used by 2,300+ companies and processes billions of observations per month. It organizes agent runs into trace trees with typed nodes for generations, retrievals, and tool calls.
When it earned its spot. The first multi-step agent we shipped where a single failure took 40 minutes to root-cause from logs.
What it replaced. console.log(), a Notion doc of bad outputs, and learned helplessness.
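The decorator is the lowest-friction way in: wrap your agent's functions and each run shows up as a trace tree. A sketch, assuming Langfuse keys in the environment; note the import path for observe differs between SDK v2 (langfuse.decorators) and v3.

```python
# pip install langfuse  -- assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST
from langfuse import observe   # v3 import path; in v2 it lives in langfuse.decorators


@observe()   # each call becomes a trace; nested @observe calls become child spans
def plan_step(question: str) -> str:
    return f"plan for: {question}"   # placeholder for an LLM or tool call


@observe()
def research_agent(question: str) -> str:
    return plan_step(question)   # shows up as a nested span in the same trace tree


research_agent("What changed in the Q3 churn numbers?")
```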
12. Helicone
What it does. Helicone is a one-line LLM proxy that logs requests and responses across OpenAI, Anthropic, Gemini, and others, with edge caching on Cloudflare per Helicone's caching docs. Cached responses serve in milliseconds and skip the provider entirely.
When it earned its spot. The day we noticed an agent was re-asking the same model the same question 20 times per run.
What it replaced. Direct provider SDK calls and an OpenAI bill that was 3x what it should have been.
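The one-line claim is close to literal: point your existing OpenAI client at Helicone's base URL and add two headers. A sketch with the cache header switched on; the header names are from Helicone's docs, and other providers work the same way via their respective gateway URLs.

```python
# pip install openai
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",   # route through Helicone instead of api.openai.com
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-Cache-Enabled": "true",    # identical requests get served from the edge cache
    },
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received'"}],
)
print(resp.choices[0].message.content)
```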
How do you ship and scale agents in production?
Modal handles serverless GPU compute. Inngest handles durable, multi-step workflows. Most agent runs need both -- compute that scales to zero and execution that survives a 30-minute LLM timeout.
13. Modal
What it does. Modal is a serverless cloud where you decorate a Python function and get GPU access, autoscaling, and per-second billing. Per Modal's pricing page, GPUs run from $0.000164/sec for a T4 to $0.001736/sec for a B200, and the free tier includes $30/month of compute.
When it earned its spot. When we needed to fine-tune a small model for an agent classifier and refused to pay for an always-on H100.
What it replaced. An EC2 g5.xlarge that cost $700 in a month we forgot to turn it off.
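A sketch of the decorate-a-function workflow. The GPU type, image packages, model, and labels are our placeholder choices, not requirements; the point is that this function runs on a serverless GPU and scales to zero between calls.

```python
# pip install modal  -- then `modal run this_file.py`
import modal

app = modal.App("agent-classifier")
image = modal.Image.debian_slim().pip_install("transformers", "torch")


@app.function(gpu="T4", image=image, timeout=600)
def classify(texts: list[str]) -> list[str]:
    # Runs on a serverless T4, billed per second, scaled to zero when idle.
    from transformers import pipeline

    clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    labels = ["billing", "bug", "feature-request"]
    return [clf(t, candidate_labels=labels)["labels"][0] for t in texts]


@app.local_entrypoint()
def main():
    print(classify.remote(["The invoice charged me twice this month"]))
```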
14. Inngest
What it does. Inngest is a durable execution platform for background jobs, AI workflows, and long-running agents. Per Inngest's blog, durable execution lets agents pause for hours awaiting human approval without losing state, with checkpointing between tool calls. Their engine handles 100M+ daily executions.
A strong alternative is Trigger.dev, which is also Apache 2.0 and supports no-timeout tasks plus human-in-the-loop pauses out of the box.
When it earned its spot. An agent run that took 12 minutes and kept dying when our serverless host hit its 5-minute limit.
What it replaced. A BullMQ queue, a Redis lock, and a state machine I would like to forget.
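A sketch of a durable run in Inngest's Python SDK, with placeholder stubs where the real tool and LLM calls go. The Context/Step signature and the step.sleep call follow the quickstart as we remember it and may differ in newer SDK versions; in practice you also serve the function through FastAPI or Flask so Inngest can invoke it over HTTP.

```python
# pip install inngest  -- sketch; verify decorator and signature against the current SDK
import datetime
import inngest

inngest_client = inngest.Inngest(app_id="agent-workflows")


def gather_sources(topic: str) -> list[str]:
    return [f"https://example.com/search?q={topic}"]   # placeholder for a real tool call


def write_draft(sources: list[str]) -> str:
    return f"Draft based on {len(sources)} sources"    # placeholder for a real LLM call


@inngest_client.create_function(
    fn_id="research-and-draft",
    trigger=inngest.TriggerEvent(event="agent/report.requested"),
)
async def research_and_draft(ctx: inngest.Context, step: inngest.Step) -> str:
    # Each step.run is checkpointed: if the process dies mid-run, the retry replays
    # completed steps from stored results instead of re-executing them.
    topic = ctx.event.data["topic"]
    sources = await step.run("gather-sources", lambda: gather_sources(topic))
    await step.sleep("cool-off", datetime.timedelta(minutes=5))   # pause without holding a worker
    return await step.run("write-draft", lambda: write_draft(sources))
```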
How do you let other developers use your agent's API?
Stainless generates idiomatic SDKs in 8 languages from a single OpenAPI spec. This sounds boring. It is the highest-leverage tool on the list once you start exposing your agent to other developers or to other agents via MCP.
15. Stainless
What it does. Stainless generates SDKs, docs, a CLI, and an MCP server from your OpenAPI spec. Per Stainless's announcement post, supported languages are TypeScript, Python, Go, Java, Kotlin, Ruby, PHP, and C#. The generated SDKs include retries, pagination, and structured errors by default. Stainless powers OpenAI's official client libraries, among others.
When it earned its spot. When external developers started building on our agent's API and we did not want to maintain a hand-written TypeScript client.
What it replaced. A TypeScript SDK maintained by one engineer and a Python client that lagged six versions behind.
How does this stack compare to typical 'AI agent tools' lists?
Most lists conflate three layers: orchestration frameworks, agent products, and infrastructure. This list is only the third.
| Layer | Examples | What it does |
|---|---|---|
| Agent products | Cursor, Claude Code, Replit Agent, Comet | End-user agents you don't build |
| Orchestration frameworks | LangGraph, CrewAI, Anthropic Agent SDK, OpenAI Agents SDK | Coordinate LLM + tool calls |
| Infrastructure (this list) | Browserbase, E2B, Composio, Langfuse, etc. | The tools the framework actually calls |
Frameworks are commoditizing fast. Per StackOne's 2026 agentic tools landscape, 120+ agentic AI tools now span 11 categories, and most teams swap their framework once or twice in a year. The infrastructure layer churns less because the integrations are stickier -- you do not casually rewrite OAuth.
If you want the framework comparison, that is a different article. If you want to actually ship, you need the layer above.
Which AI agent tools should you add first as your stack matures?
Start small. Add one tool per real failure mode. A 15-tool stack on day one is a recipe for paying for things you do not use.
A practical sequence we have seen work, in order:
- Day 1. Framework + LLM provider + one tool (usually Firecrawl or Exa). Ship something narrow.
- Day 7. Add Langfuse the moment a second user reports a bug. Without traces, you are guessing.
- Day 14. Add Helicone for caching when your provider bill jumps. One line of code, immediate ROI.
- Day 21. Add E2B or Daytona the day you let the agent run code. Don't run untrusted code on your infra.
- Day 30. Add Composio or Arcade.dev the third time you write OAuth. The third time, not the first.
- Day 45. Add Browserbase or Browserless when scraping fails the third week in a row.
- Day 60. Add Inngest or Trigger.dev when a single run regularly exceeds 5 minutes.
- Day 90. Add Letta or Mem0 when you can articulate exactly what the agent is supposed to remember.
- Day 120+. Add Modal for custom compute, Stainless when external developers show up.
The goal is not to use all 15. The goal is to know which one you reach for when a specific thing breaks.
| Tool | Category | What it does | What it replaced in our stack |
|---|---|---|---|
| Browserbase | Browser automation | Managed Chrome runtime + Stagehand SDK (act/extract/observe/agent) | Self-hosted Playwright fleet on EC2 |
| Browserless | Browser automation | Headless Chrome as a service with BrowserQL | Custom Puppeteer cluster with sticky sessions |
| E2B | Code sandbox | Secure cloud sandboxes for LLM-generated code | Docker-in-Docker on our own infra |
| Daytona | Code sandbox | Sub-90ms sandbox cold starts with stateful snapshots | Fly.io machines we kept warm by hand |
| Exa | Search API | Embeddings-first web search built for agents | Google Custom Search + brittle scraping |
| Firecrawl | Scraping | Clean markdown / JSON output from any URL | Cheerio + readability + a graveyard of selectors |
| Composio | Tool registry | 1,000+ toolkits with managed OAuth + token refresh | Hand-rolled OAuth flows per integration |
| Arcade.dev | Tool runtime | User-scoped OAuth; LLM never sees the token | Service-account tokens shared across users |
| Letta | Memory / runtime | Stateful agents with tiered memory (core/recall/archival) | Stuffing the context window and praying |
| Mem0 | Memory layer | Bolt-on memory for any framework | A Postgres table called `notes` |
| Langfuse | Tracing | Open-source traces, evals, prompt mgmt | console.log() and a Notion doc |
| Helicone | LLM proxy | One-line proxy with edge caching + rate limits | Direct provider SDK calls + duplicate spend |
| Modal | Compute | Serverless GPU + per-second billing | Always-on EC2 GPU we forgot to turn off |
| Inngest | Durable workflows | Step functions with retries, sleeps, human-in-the-loop | BullMQ + a state machine we regretted |
| Stainless | SDK generation | Idiomatic SDKs in 8 languages from one OpenAPI spec | A TypeScript client maintained by one engineer |