Median cost to run an AI agent task in 2026 is $0.026 to $0.241, depending on the model. We ran the same five agentic workflows -- web research, code review, email triage, log analysis, and content QA -- 100 times each across Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5, GPT-5-mini, and Gemini 2.5 Pro. This article reports median tokens, P95 latency, dollars per task, and accuracy. The harness is open source so you can re-run it. The headline: GPT-5-mini is 9.3x cheaper than Sonnet 4.5 but 11.2 points less accurate. Pick the cheapest model that clears your accuracy bar.
What did we benchmark and how?
We ran five agentic tasks 100 times each across five frontier models in April 2026, capturing tokens, latency, dollars, and accuracy for every run. All models used identical prompts, identical tool definitions (web_search, file_read, code_exec, http_get, summarize), and a hard 12-step ceiling per task.
The five tasks
- Web research: "Find three recent peer-reviewed studies on X, extract methodology and N, return JSON."
- Code review: 200-LOC pull request, model returns severity-ranked issue list with line numbers.
- Email triage: 25 emails in, classified into 6 buckets, draft reply for the top-priority one.
- Log analysis: 8K-line Kubernetes log, identify the root-cause request ID and explain failure.
- Content QA: 1,500-word draft, return list of factual errors with citations.
Harness setup
- All models accessed via official APIs (Anthropic, OpenAI, Google AI). No proxy, no caching enabled (we measure caching separately below).
- Tokens captured from API response usage fields. Latency measured wall-clock from request start to final tool-completed response.
- Accuracy graded by a held-out GPT-5 judge against a human-labeled gold set of 25 examples per task, with 50 judge grades spot-checked by humans.
- Code, prompts, and raw run logs are public on the agent-cost-benchmarks repo. Tracing was captured via OpenTelemetry-based agent observability.
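If you want to replicate the capture logic before cloning the repo, the core of it is small. Below is a minimal sketch of metering a single turn, using the Anthropic Messages API as the example provider -- the model id, system prompt, and tool list are illustrative placeholders, and the real harness loops over tool calls up to the 12-step ceiling and sums usage across turns:

```python
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def measure_turn(system_prompt: str, tools: list, user_msg: str) -> dict:
    """Wall-clock one model turn and read token usage from the API's usage field."""
    start = time.monotonic()
    response = client.messages.create(
        model="claude-sonnet-4-5",   # illustrative; swap per benchmarked model
        max_tokens=4096,
        system=system_prompt,
        tools=tools,
        messages=[{"role": "user", "content": user_msg}],
    )
    return {
        "input_tokens": response.usage.input_tokens,    # reported by the API,
        "output_tokens": response.usage.output_tokens,  # not estimated locally
        "latency_s": time.monotonic() - start,
    }
```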
How much does it cost to run an AI agent per task?
Median cost per task ranges from $0.026 to $0.241 across frontier models in 2026. That is a 9.3x spread for workloads that look identical from outside the agent. Token usage was surprisingly consistent across models -- 45K to 52K input tokens and 7K to 9K output tokens median. Cost differences are almost entirely a pricing function, not a behavior function.
Per-task cost (median across 5 tasks, n=500 runs)
| Model | $/task | Input $/M | Output $/M |
|---|---|---|---|
| GPT-5-mini | $0.026 | $0.25 | $2.00 |
| Claude Haiku 4.5 | $0.082 | $1.00 | $5.00 |
| Gemini 2.5 Pro | $0.131 | $1.25 | $10.00 |
| GPT-5 | $0.134 | $1.25 | $10.00 |
| Claude Sonnet 4.5 | $0.241 | $3.00 | $15.00 |
Pricing data: Anthropic pricing, OpenAI pricing, Google AI pricing.
For scale: the SWE-bench cost study reports that unconstrained agents cost $5-$8 per software-engineering task, with 35.5 API calls and 440K input tokens per task. Our task suite is shorter and cheaper because we cap turns at 12 and the tasks are not full SWE-bench resolutions.
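For readers sanity-checking the table, per-run dollars are just token counts times list prices. A minimal sketch with the prices from the table (in $ per million tokens); note the $/task column is the median of per-run costs, so multiplying median tokens by prices will not land exactly on it:

```python
# List prices in $ per million tokens, taken from the table above.
PRICES = {
    "gpt-5-mini":        {"in": 0.25, "out": 2.00},
    "claude-haiku-4.5":  {"in": 1.00, "out": 5.00},
    "gemini-2.5-pro":    {"in": 1.25, "out": 10.00},
    "gpt-5":             {"in": 1.25, "out": 10.00},
    "claude-sonnet-4.5": {"in": 3.00, "out": 15.00},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single agent run from its summed token usage."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# Example: a GPT-5-mini run with 45K input and 7.4K output tokens.
print(f"${run_cost('gpt-5-mini', 45_000, 7_400):.3f}")  # -> $0.026
```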
Which model is cheapest for AI agents in 2026?
GPT-5-mini is the cheapest frontier-adjacent model at $0.026 per task, 3.2x cheaper than Claude Haiku 4.5 and 9.3x cheaper than Claude Sonnet 4.5. But cheap is only useful if it clears your accuracy bar. In our benchmark GPT-5-mini hit 80.6% accuracy versus 93.1% for GPT-5 -- a gap that matters for high-stakes workflows.
Cost-per-correct-task (the metric that actually matters)
Adjust for accuracy and the picture changes:
| Model | $/task | Accuracy | Effective $/correct task |
|---|---|---|---|
| GPT-5-mini | $0.026 | 80.6% | $0.032 |
| Claude Haiku 4.5 | $0.082 | 85.2% | $0.096 |
| Gemini 2.5 Pro | $0.131 | 88.4% | $0.148 |
| GPT-5 | $0.134 | 93.1% | $0.144 |
| Claude Sonnet 4.5 | $0.241 | 91.8% | $0.262 |
GPT-5 actually delivers more correct answers per dollar than Sonnet 4.5 in this benchmark. The cheapest option overall remains GPT-5-mini, but only if a 19.4% failure rate is acceptable. As NVIDIA's TCO analysis argues, total cost-per-correct-output is the only metric that matters.
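To make the adjustment explicit: effective cost per correct task is cost per task divided by accuracy (ignoring the further cost of retrying failures). A short sketch using the numbers from the table above:

```python
# ($/task, accuracy) pairs from the table above.
RESULTS = {
    "gpt-5-mini":        (0.026, 0.806),
    "claude-haiku-4.5":  (0.082, 0.852),
    "gemini-2.5-pro":    (0.131, 0.884),
    "gpt-5":             (0.134, 0.931),
    "claude-sonnet-4.5": (0.241, 0.918),
}

def cost_per_correct(cost_per_task: float, accuracy: float) -> float:
    """Expected spend to obtain one correct task outcome."""
    return cost_per_task / accuracy

# Print models from cheapest to most expensive per correct task.
for model, (cost, acc) in sorted(RESULTS.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{model:18s} ${cost_per_correct(cost, acc):.3f} per correct task")
```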
What is the latency of a typical agent run?
P95 end-to-end latency ranges from 8.9 seconds (GPT-5-mini) to 27.8 seconds (Claude Sonnet 4.5) for 5-tool agent workflows. P95 inflated 1.6x to 3.2x over median in our runs, matching the 2026 P95 inflation finding that production SLOs need to be P95-anchored.
Latency breakdown (5-task average)
| Model | Median latency | P95 latency | Output tokens/sec |
|---|---|---|---|
| GPT-5-mini | 4.1s | 8.9s | ~85 t/s |
| Claude Haiku 4.5 | 5.6s | 11.4s | ~70 t/s |
| Gemini 2.5 Pro | 9.4s | 21.7s | ~52 t/s |
| GPT-5 | 11.0s | 24.6s | ~48 t/s |
| Claude Sonnet 4.5 | 12.3s | 27.8s | 42.5 t/s |
Our measured Claude Sonnet 4.5 throughput matched the 42.5 tokens/sec reported by Artificial Analysis. As the CodeAnt latency analysis notes, raw tokens-per-second misleads -- end-to-end task latency is what users feel.
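If you are replicating the latency table, the percentiles come straight from per-run wall-clock timings. A minimal sketch -- the timings below are synthetic placeholders rather than benchmark data, and the 12s threshold is the tool-using chat target from the list that follows:

```python
import numpy as np

# Per-run end-to-end latencies in seconds for one model/task pair
# (synthetic placeholder values; the real harness has 100 runs per cell).
latencies_s = np.array([4.0, 4.3, 3.9, 5.1, 4.4, 8.7, 4.2, 6.0, 4.1, 9.2])

p50 = np.percentile(latencies_s, 50)
p95 = np.percentile(latencies_s, 95)
print(f"P50 {p50:.1f}s, P95 {p95:.1f}s, inflation {p95 / p50:.1f}x")

# SLO check against a P95-anchored target, e.g. 12s for a tool-using chat agent.
assert p95 <= 12.0, "P95 latency budget exceeded"
```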
Latency targets by use case
- Interactive chat agent: under 3s P95 for short answers, under 12s P95 for tool-using
- Background workflow agent: 30-60s P95 acceptable
- Batch agent: optimize for total cost, not latency
How does Claude Haiku 4.5 compare to GPT-5-mini for agents?
Claude Haiku 4.5 costs 3.2x more than GPT-5-mini per task ($0.082 vs $0.026) but scores 4.6 points higher on accuracy (85.2% vs 80.6%) and is more reliable on long tool chains. The choice depends on whether your bottleneck is cost or correctness.
Where Haiku 4.5 wins
- Tool-use fidelity: Haiku 4.5 made 0.7 invalid tool calls per 100 runs in our benchmark; GPT-5-mini made 4.2.
- Instruction following on long contexts: at 40K+ input, Haiku 4.5 dropped 1.8 points of accuracy; GPT-5-mini dropped 6.4.
- Sub-agent reliability: when used as a worker model in Claude Agent SDK subagents, Haiku 4.5 returned valid structured output 98.1% of the time vs 91.4% for GPT-5-mini.
Where GPT-5-mini wins
- Raw cost: 3.2x cheaper per task -- decisive for high-volume classification or extraction.
- TTFT: 0.31s median time-to-first-token vs Haiku 4.5's 0.48s.
- Long output: 128K output window vs Haiku 4.5's 64K.
Rule of thumb: GPT-5-mini for stateless high-volume calls (email classification, content moderation, simple extraction). Haiku 4.5 for stateful sub-agents inside production multi-agent systems where one bad tool call breaks the chain. Anthropic's own Haiku 4.5 launch post reports Sonnet-4-level coding at 1/3 the price.
How do you reduce AI agent costs?
The five highest-leverage levers cut agent costs 60-95% without dropping accuracy. Most teams leave roughly 70% of potential savings on the table because they pay rack-rate input-token prices on every step.
1. Prompt caching (80-90% savings on cached input)
Anthropic and OpenAI both offer prompt caching. Anthropic's caching docs report cached reads at 10% of base price. ProjectDiscovery cut LLM costs 59% using caching on stable system prompts. For agents, cache the system prompt + tool definitions; expect 50-75% cache hit rate.
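Here is a minimal sketch of what "cache the system prompt + tool definitions" looks like with the Anthropic API: cache_control breakpoints mark the stable prefix so later calls read it at the cached rate. The model id, tool schema, and prompts are illustrative placeholders:

```python
import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web and return the top results.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        # A cache breakpoint on the last tool definition caches all tools above it.
        "cache_control": {"type": "ephemeral"},
    },
]

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a research agent. Follow the tool protocol exactly...",
            # Marks the system prompt as cacheable; later calls with the same
            # prefix are billed at the cached-read rate instead of rack rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=TOOLS,
    messages=[{"role": "user", "content": "Find three recent studies on topic X."}],
)

# usage reports how much of the prompt was written to / served from cache.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```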
2. Model routing (40-70% savings)
Route easy steps (intent detection, tool selection) to GPT-5-mini or Haiku 4.5 and only escalate the hard reasoning step to Sonnet 4.5 or GPT-5. Our internal stack does this and runs 4.1x cheaper at equal accuracy. See the production agent stack writeup.
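A minimal sketch of the routing pattern. The step taxonomy, the 40K-token cutoff, and the pick_model heuristic are illustrative assumptions -- in practice the signal might be step type, prompt length, or a cheap classifier call:

```python
# Hypothetical router: cheap model for mechanical steps, frontier model for hard reasoning.
CHEAP_MODEL = "gpt-5-mini"          # or "claude-haiku-4-5"
FRONTIER_MODEL = "claude-sonnet-4-5"

EASY_STEPS = {"intent_detection", "tool_selection", "extraction", "classification"}

def pick_model(step_type: str, context_tokens: int) -> str:
    """Route a single agent step to the cheapest model likely to clear the accuracy bar."""
    if step_type in EASY_STEPS and context_tokens < 40_000:
        return CHEAP_MODEL
    return FRONTIER_MODEL            # planning, root-cause analysis, final synthesis

# Example: tool selection on a short context goes cheap; log root-cause goes frontier.
print(pick_model("tool_selection", 6_000))        # gpt-5-mini
print(pick_model("root_cause_analysis", 55_000))  # claude-sonnet-4-5
```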
3. Tool result truncation (20-40% savings)
Log files, web pages, and DB results balloon context. Cap each tool result at 2-4K tokens and summarize the overflow before re-feeding it to the model. This is the single biggest fix for runaway costs.
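A minimal sketch of the truncation lever, assuming a rough 4-characters-per-token heuristic and a hypothetical cheap_summarize helper backed by a low-cost model:

```python
MAX_TOOL_RESULT_TOKENS = 3_000   # within the 2-4K band suggested above
CHARS_PER_TOKEN = 4              # rough heuristic; use a real tokenizer if you have one

def shrink_tool_result(raw: str, cheap_summarize) -> str:
    """Cap a tool result before it re-enters the context window.

    cheap_summarize is a hypothetical callable that compresses text with a
    low-cost model (e.g. GPT-5-mini or Haiku 4.5) and returns a short digest.
    """
    limit_chars = MAX_TOOL_RESULT_TOKENS * CHARS_PER_TOKEN
    if len(raw) <= limit_chars:
        return raw
    head = raw[: limit_chars // 2]                 # keep the opening of the result
    tail = raw[len(raw) - limit_chars // 4 :]      # and the end, where errors often land
    digest = cheap_summarize(raw)                  # one cheap call instead of re-feeding the full blob
    return f"{head}\n...[truncated; summary of the omitted middle follows]...\n{digest}\n...\n{tail}"
```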
4. Plan caching (50% cost, 27% latency reduction)
The Agentic Plan Caching paper (arXiv 2506.14852) shows that extracting and reusing plan templates across semantically similar tasks cuts cost by 50.31% and latency by 27.28%.
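Very roughly, the idea is a store of plan templates keyed by task signature, where a hit skips the expensive planning step. The sketch below is a loose illustration of that idea, not the paper's algorithm -- similar() and make_plan() are hypothetical stand-ins for an embedding-similarity check and a frontier-model planning call:

```python
from dataclasses import dataclass, field

@dataclass
class PlanCache:
    """Naive plan-template cache keyed by task signature (illustration only)."""
    store: dict = field(default_factory=dict)

    def get_plan(self, task: str, similar, make_plan) -> list:
        for signature, plan in self.store.items():
            if similar(task, signature):   # e.g. cosine similarity over embeddings
                return plan                # cache hit: no planning tokens spent
        plan = make_plan(task)             # cache miss: pay for planning once
        self.store[task] = plan
        return plan
```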
5. Dynamic turn limits (24% savings)
Replace hard turn caps with confidence-based termination -- stop when the agent reports it is done and confident, instead of always spending the full turn budget (a minimal sketch follows below). Stevens Online's analysis reports a 24% cost reduction at equal solve rate. Combine all five levers and a 2026 production agent should run at 10-30% of its naive baseline cost. For framework selection that supports these patterns, see how to evaluate an AI agent framework.
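A minimal sketch of that termination logic, assuming a hypothetical run_step contract that returns (state, done, confidence) for each tool-using turn; a hard safety cap stays in place as a backstop:

```python
def run_agent(task: str, run_step, confidence_threshold: float = 0.85, safety_cap: int = 12):
    """Stop when the agent is done and confident, rather than always burning a fixed turn budget.

    run_step is a hypothetical callable executing one tool-using turn;
    safety_cap keeps runaway loops bounded.
    """
    state = {"task": task, "history": []}
    for turn in range(safety_cap):
        state, done, confidence = run_step(state)
        if done and confidence >= confidence_threshold:
            return state, turn + 1     # early exit saves the remaining turns
    return state, safety_cap           # backstop: behaves like the old hard cap
```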
What does this mean for your agent budget?
Three takeaways for engineering leaders sizing an agent budget in 2026.
1. Token cost is no longer the constraint -- token efficiency is. Sticker prices have dropped 40-60% YoY. The teams overspending in 2026 are running naive multi-step loops without caching, routing, or truncation. Fix the architecture before negotiating with a vendor.
2. Pick on cost-per-correct-output, not cost-per-token. GPT-5 looks more expensive than Gemini 2.5 Pro until you adjust for accuracy -- then it is cheaper per correct answer. Run your own task-specific benchmark; vendor benchmarks generalize poorly to your prompt and tools.
3. Budget P95, not P50. A user-facing agent whose timeout rate looks like 1% when you plan around the median but lands at 8% once the P95 tail hits is broken UX and a refund liability. P95 inflated 1.6-3.2x over P50 in our runs and matches the 2026 industry pattern.
Sample 2026 budget for a production agent serving 100K tasks/month
- Naive (Sonnet 4.5, no caching, no routing): $24,100/month
- Caching only (74% hit rate): ~$8,400/month
- Caching + routing + truncation: ~$2,900/month
- All five levers + plan caching: ~$1,500/month
The difference between those two endpoints is one engineering sprint.
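To adapt numbers like these to your own volume and lever mix, a back-of-the-envelope estimator is enough. Every factor below is an assumption to replace with your own measurements -- the point is the structure, not the defaults:

```python
def monthly_budget(tasks_per_month: int,
                   cost_per_task: float,
                   cache_hit_rate: float = 0.0,     # share of input tokens served from cache
                   cached_discount: float = 0.9,    # cached reads at ~10% of base input price
                   input_share: float = 0.6,        # illustrative: fraction of task cost that is input tokens
                   routing_savings: float = 0.0,    # fraction of remaining cost removed by routing
                   truncation_savings: float = 0.0) -> float:
    """Rough monthly spend after stacking the levers; all factors are assumptions to tune."""
    cost = cost_per_task
    cost -= cost * input_share * cache_hit_rate * cached_discount   # prompt caching
    cost *= (1 - routing_savings)                                   # model routing
    cost *= (1 - truncation_savings)                                # tool result truncation
    return cost * tasks_per_month

# Example: 100K Sonnet 4.5 tasks/month at $0.241/task with no levers applied.
print(f"${monthly_budget(100_000, 0.241):,.0f}/month")   # $24,100/month naive baseline
```

Plug in your measured cache hit rate, routing mix, and truncation savings to see where your own stack lands between the naive and fully optimized endpoints.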
Full results by model (median across 5 tasks, 100 runs each)
| Model | Input $/M | Output $/M | Median tokens/task | P95 latency | $/task (median) | Accuracy |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | $3.00 | $15.00 | 48K in / 9.1K out | 27.8s | $0.241 | 91.8% |
| Claude Haiku 4.5 | $1.00 | $5.00 | 49K in / 7.6K out | 11.4s | $0.082 | 85.2% |
| GPT-5 | $1.25 | $10.00 | 46K in / 8.2K out | 24.6s | $0.134 | 93.1% |
| GPT-5-mini | $0.25 | $2.00 | 45K in / 7.4K out | 8.9s | $0.026 | 80.6% |
| Gemini 2.5 Pro | $1.25 | $10.00 | 52K in / 7.7K out | 21.7s | $0.131 | 88.4% |