Median cost to run an AI agent task in 2026 is $0.026 to $0.241, depending on the model. We ran the same five agentic workflows -- web research, code review, email triage, log analysis, and content QA -- 100 times each across Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5, GPT-5-mini, and Gemini 2.5 Pro. This article reports median tokens, P95 latency, dollars per task, and accuracy. The harness is open source so you can re-run it. The headline: GPT-5-mini is 9.3x cheaper than Sonnet 4.5 but 11.2 points less accurate. Pick the cheapest model that clears your accuracy bar.
What did we benchmark and how?
We ran five agentic tasks 100 times each across five frontier models in April 2026, capturing tokens, latency, dollars, and accuracy for every run. All models used identical prompts, identical tool definitions (web_search, file_read, code_exec, http_get, summarize), and a hard 12-step ceiling per task.
The five tasks
- Web research: "Find three recent peer-reviewed studies on X, extract methodology and N, return JSON."
- Code review: 200-LOC pull request, model returns severity-ranked issue list with line numbers.
- Email triage: 25 emails in, classified into 6 buckets, draft reply for the top-priority one.
- Log analysis: 8K-line Kubernetes log, identify the root-cause request ID and explain failure.
- Content QA: 1,500-word draft, return list of factual errors with citations.
Harness setup
- All models accessed via official APIs (Anthropic, OpenAI, Google AI). No proxy, no caching enabled (we measure caching separately below).
- Tokens captured from API response usage fields. Latency measured wall-clock from request start to final tool-completed response.
- Accuracy graded by a held-out GPT-5 judge against a human-labeled gold set of 25 examples per task, with 50 judge grades spot-checked by humans.
- Code, prompts, and raw run logs are public on the agent-cost-benchmarks repo. Tracing was captured via OpenTelemetry-based agent observability.
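If you want to replicate the capture logic before cloning the repo, the core of it is small. Below is a minimal sketch of metering a single turn, using the Anthropic Messages API as the example provider -- the model id, system prompt, and tool list are illustrative placeholders, and the real harness loops over tool calls up to the 12-step ceiling and sums usage across turns:

```python
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def measure_turn(system_prompt: str, tools: list, user_msg: str) -> dict:
    """Wall-clock one model turn and read token usage from the API's usage field."""
    start = time.monotonic()
    response = client.messages.create(
        model="claude-sonnet-4-5",   # illustrative; swap per benchmarked model
        max_tokens=4096,
        system=system_prompt,
        tools=tools,
        messages=[{"role": "user", "content": user_msg}],
    )
    return {
        "input_tokens": response.usage.input_tokens,    # reported by the API,
        "output_tokens": response.usage.output_tokens,  # not estimated locally
        "latency_s": time.monotonic() - start,
    }
```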
How much does it cost to run an AI agent per task?
Median cost per task ranges from $0.026 to $0.241 across frontier models in 2026. That is a 9.3x spread for workloads that look identical from outside the agent. Token usage was surprisingly consistent across models -- 45K to 52K input tokens and 7K to 9K output tokens median. Cost differences are almost entirely a pricing function, not a behavior function.
Per-task cost (median across 5 tasks, n=500 runs)
| Model | $/task | Input $/M | Output $/M |
|---|---|---|---|
| GPT-5-mini | $0.026 | $0.25 | $2.00 |
| Claude Haiku 4.5 | $0.082 | $1.00 | $5.00 |
| Gemini 2.5 Pro | $0.131 | $1.25 | $10.00 |
| GPT-5 | $0.134 | $1.25 | $10.00 |
| Claude Sonnet 4.5 | $0.241 | $3.00 | $15.00 |
Pricing data: Anthropic pricing, OpenAI pricing, Google AI pricing.
For scale: the SWE-bench cost study reports that unconstrained agents cost $5-$8 per software-engineering task, with 35.5 API calls and 440K input tokens per task. Our task suite is shorter and cheaper because we cap turns at 12 and the tasks are not full SWE-bench resolutions.
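For readers sanity-checking the table, per-run dollars are just token counts times list prices. A minimal sketch with the prices from the table (in $ per million tokens); note the $/task column is the median of per-run costs, so multiplying median tokens by prices will not land exactly on it:

```python
# List prices in $ per million tokens, taken from the table above.
PRICES = {
    "gpt-5-mini":        {"in": 0.25, "out": 2.00},
    "claude-haiku-4.5":  {"in": 1.00, "out": 5.00},
    "gemini-2.5-pro":    {"in": 1.25, "out": 10.00},
    "gpt-5":             {"in": 1.25, "out": 10.00},
    "claude-sonnet-4.5": {"in": 3.00, "out": 15.00},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single agent run from its summed token usage."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# Example: a GPT-5-mini run with 45K input and 7.4K output tokens.
print(f"${run_cost('gpt-5-mini', 45_000, 7_400):.3f}")  # -> $0.026
```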
Which model is cheapest for AI agents in 2026?
GPT-5-mini is the cheapest frontier-adjacent model at $0.026 per task, 3.2x cheaper than Claude Haiku 4.5 and 9.3x cheaper than Claude Sonnet 4.5. But cheap is only useful if it clears your accuracy bar. In our benchmark GPT-5-mini hit 80.6% accuracy versus 93.1% for GPT-5 -- a gap that matters for high-stakes workflows.
Cost-per-correct-task (the metric that actually matters)
Adjust for accuracy and the picture changes:
| Model | $/task | Accuracy | Effective $/correct task |
|---|---|---|---|
| GPT-5-mini | $0.026 | 80.6% | $0.032 |
| Claude Haiku 4.5 | $0.082 | 85.2% | $0.096 |
| Gemini 2.5 Pro | $0.131 | 88.4% | $0.148 |
| GPT-5 | $0.134 | 93.1% | $0.144 |
| Claude Sonnet 4.5 | $0.241 | 91.8% | $0.262 |
GPT-5 actually delivers more correct answers per dollar than Sonnet 4.5 in this benchmark. The cheapest option overall remains GPT-5-mini, but only if a 19.4% failure rate is acceptable. As NVIDIA's TCO analysis argues, total cost-per-correct-output is the only metric that matters.
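To make the adjustment explicit: effective cost per correct task is cost per task divided by accuracy (ignoring the further cost of retrying failures). A short sketch using the numbers from the table above:

```python
# ($/task, accuracy) pairs from the table above.
RESULTS = {
    "gpt-5-mini":        (0.026, 0.806),
    "claude-haiku-4.5":  (0.082, 0.852),
    "gemini-2.5-pro":    (0.131, 0.884),
    "gpt-5":             (0.134, 0.931),
    "claude-sonnet-4.5": (0.241, 0.918),
}

def cost_per_correct(cost_per_task: float, accuracy: float) -> float:
    """Expected spend to obtain one correct task outcome."""
    return cost_per_task / accuracy

# Print models from cheapest to most expensive per correct task.
for model, (cost, acc) in sorted(RESULTS.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{model:18s} ${cost_per_correct(cost, acc):.3f} per correct task")
```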
What is the latency of a typical agent run?
P95 end-to-end latency ranges from 8.9 seconds (GPT-5-mini) to 27.8 seconds (Claude Sonnet 4.5) for 5-tool agent workflows. P95 inflated 1.6x to 3.2x over median in our runs, matching the 2026 P95 inflation finding that production SLOs need to be P95-anchored.
Latency breakdown (5-task average)
| Model | Median latency | P95 latency | Output tokens/sec |
|---|---|---|---|
| GPT-5-mini | 4.1s | 8.9s | ~85 t/s |
| Claude Haiku 4.5 | 5.6s | 11.4s | ~70 t/s |
| Gemini 2.5 Pro | 9.4s | 21.7s | ~52 t/s |
| GPT-5 | 11.0s | 24.6s | ~48 t/s |
| Claude Sonnet 4.5 | 12.3s | 27.8s | 42.5 t/s |
Our measured Claude Sonnet 4.5 throughput matched the 42.5 tokens/sec reported by Artificial Analysis. As the CodeAnt latency analysis notes, raw tokens-per-second misleads -- end-to-end task latency is what users feel.
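If you are replicating the latency table, the percentiles come straight from per-run wall-clock timings. A minimal sketch -- the timings below are synthetic placeholders rather than benchmark data, and the 12s threshold is the tool-using chat target from the list that follows:

```python
import numpy as np

# Per-run end-to-end latencies in seconds for one model/task pair
# (synthetic placeholder values; the real harness has 100 runs per cell).
latencies_s = np.array([4.0, 4.3, 3.9, 5.1, 4.4, 8.7, 4.2, 6.0, 4.1, 9.2])

p50 = np.percentile(latencies_s, 50)
p95 = np.percentile(latencies_s, 95)
print(f"P50 {p50:.1f}s, P95 {p95:.1f}s, inflation {p95 / p50:.1f}x")

# SLO check against a P95-anchored target, e.g. 12s for a tool-using chat agent.
assert p95 <= 12.0, "P95 latency budget exceeded"
```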
Latency targets by use case
- Interactive chat agent: under 3s P95 for short answers, under 12s P95 for tool-using
- Background workflow agent: 30-60s P95 acceptable
- Batch agent: optimize for total cost, not latency
How does Claude Haiku 4.5 compare to GPT-5-mini for agents?
Claude Haiku 4.5 costs 3.2x more than GPT-5-mini per task ($0.082 vs $0.026) but scores 4.6 points higher on accuracy (85.2% vs 80.6%) and is more reliable on long tool chains. The choice depends on whether your bottleneck is cost or correctness.
Where Haiku 4.5 wins
- Tool-use fidelity: Haiku 4.5 made 0.7 invalid tool calls per 100 runs in our benchmark; GPT-5-mini made 4.2.
- Instruction following on long contexts: at 40K+ input, Haiku 4.5 dropped 1.8 points of accuracy; GPT-5-mini dropped 6.4.
- Sub-agent reliability: when used as a worker model in Claude Agent SDK subagents, Haiku 4.5 returned valid structured output 98.1% of the time vs 91.4% for GPT-5-mini.
Where GPT-5-mini wins
- Raw cost: 3.2x cheaper per task -- decisive for high-volume classification or extraction.
- TTFT: 0.31s median time-to-first-token vs Haiku 4.5's 0.48s.
- Long output: 128K output window vs Haiku 4.5's 64K.
Rule of thumb: GPT-5-mini for stateless high-volume calls (email classification, content moderation, simple extraction). Haiku 4.5 for stateful sub-agents inside production multi-agent systems where one bad tool call breaks the chain. Anthropic's own Haiku 4.5 launch post reports Sonnet-4-level coding at 1/3 the price.
How do you reduce AI agent costs?
The five highest-leverage levers cut agent costs 60-95% without dropping accuracy. Most teams leave roughly 70% of potential savings on the table because they pay rack-rate input-token prices on every step.
1. Prompt caching (80-90% savings on cached input)
Anthropic and OpenAI both offer prompt caching. Anthropic's caching docs report cached reads at 10% of base price. ProjectDiscovery cut LLM costs 59% using caching on stable system prompts. For agents, cache the system prompt + tool definitions; expect 50-75% cache hit rate.
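Here is a minimal sketch of what "cache the system prompt + tool definitions" looks like with the Anthropic API: cache_control breakpoints mark the stable prefix so later calls read it at the cached rate. The model id, tool schema, and prompts are illustrative placeholders:

```python
import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web and return the top results.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        # A cache breakpoint on the last tool definition caches all tools above it.
        "cache_control": {"type": "ephemeral"},
    },
]

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a research agent. Follow the tool protocol exactly...",
            # Marks the system prompt as cacheable; later calls with the same
            # prefix are billed at the cached-read rate instead of rack rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=TOOLS,
    messages=[{"role": "user", "content": "Find three recent studies on topic X."}],
)

# usage reports how much of the prompt was written to / served from cache.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```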
2. Model routing (40-70% savings)
Route easy steps (intent detection, tool selection) to GPT-5-mini or Haiku 4.5 and only escalate the hard reasoning step to Sonnet 4.5 or GPT-5. Our internal stack does this and runs 4.1x cheaper at equal accuracy. See the production agent stack writeup.
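A minimal sketch of the routing pattern. The step taxonomy, the 40K-token cutoff, and the pick_model heuristic are illustrative assumptions -- in practice the signal might be step type, prompt length, or a cheap classifier call:

```python
# Hypothetical router: cheap model for mechanical steps, frontier model for hard reasoning.
CHEAP_MODEL = "gpt-5-mini"          # or "claude-haiku-4-5"
FRONTIER_MODEL = "claude-sonnet-4-5"

EASY_STEPS = {"intent_detection", "tool_selection", "extraction", "classification"}

def pick_model(step_type: str, context_tokens: int) -> str:
    """Route a single agent step to the cheapest model likely to clear the accuracy bar."""
    if step_type in EASY_STEPS and context_tokens < 40_000:
        return CHEAP_MODEL
    return FRONTIER_MODEL            # planning, root-cause analysis, final synthesis

# Example: tool selection on a short context goes cheap; log root-cause goes frontier.
print(pick_model("tool_selection", 6_000))        # gpt-5-mini
print(pick_model("root_cause_analysis", 55_000))  # claude-sonnet-4-5
```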
3. Tool result truncation (20-40% savings)
Log files, web pages, and DB results balloon context. Cap each tool result at 2-4K tokens and summarize the overflow before re-feeding it to the model. This is the single biggest fix for runaway costs.
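A minimal sketch of the truncation lever, assuming a rough 4-characters-per-token heuristic and a hypothetical cheap_summarize helper backed by a low-cost model:

```python
MAX_TOOL_RESULT_TOKENS = 3_000   # within the 2-4K band suggested above
CHARS_PER_TOKEN = 4              # rough heuristic; use a real tokenizer if you have one

def shrink_tool_result(raw: str, cheap_summarize) -> str:
    """Cap a tool result before it re-enters the context window.

    cheap_summarize is a hypothetical callable that compresses text with a
    low-cost model (e.g. GPT-5-mini or Haiku 4.5) and returns a short digest.
    """
    limit_chars = MAX_TOOL_RESULT_TOKENS * CHARS_PER_TOKEN
    if len(raw) <= limit_chars:
        return raw
    head = raw[: limit_chars // 2]                 # keep the opening of the result
    tail = raw[len(raw) - limit_chars // 4 :]      # and the end, where errors often land
    digest = cheap_summarize(raw)                  # one cheap call instead of re-feeding the full blob
    return f"{head}\n...[truncated; summary of the omitted middle follows]...\n{digest}\n...\n{tail}"
```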
4. Plan caching (50% cost, 27% latency reduction)
The Agentic Plan Caching paper (arXiv 2506.14852) shows that extracting and reusing plan templates across semantically similar tasks cuts cost by 50.31% and latency by 27.28%.
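Very roughly, the idea is a store of plan templates keyed by task signature, where a hit skips the expensive planning step. The sketch below is a loose illustration of that idea, not the paper's algorithm -- similar() and make_plan() are hypothetical stand-ins for an embedding-similarity check and a frontier-model planning call:

```python
from dataclasses import dataclass, field

@dataclass
class PlanCache:
    """Naive plan-template cache keyed by task signature (illustration only)."""
    store: dict = field(default_factory=dict)

    def get_plan(self, task: str, similar, make_plan) -> list:
        for signature, plan in self.store.items():
            if similar(task, signature):   # e.g. cosine similarity over embeddings
                return plan                # cache hit: no planning tokens spent
        plan = make_plan(task)             # cache miss: pay for planning once
        self.store[task] = plan
        return plan
```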
5. Dynamic turn limits (24% savings)
Replace hard turn caps with confidence-based termination -- stop when the agent reports it is done and confident, instead of always spending the full turn budget (a minimal sketch follows below). Stevens Online's analysis reports a 24% cost reduction at equal solve rate. Combine all five levers and a 2026 production agent should run at 10-30% of its naive baseline cost. For framework selection that supports these patterns, see how to evaluate an AI agent framework.
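A minimal sketch of that termination logic, assuming a hypothetical run_step contract that returns (state, done, confidence) for each tool-using turn; a hard safety cap stays in place as a backstop:

```python
def run_agent(task: str, run_step, confidence_threshold: float = 0.85, safety_cap: int = 12):
    """Stop when the agent is done and confident, rather than always burning a fixed turn budget.

    run_step is a hypothetical callable executing one tool-using turn;
    safety_cap keeps runaway loops bounded.
    """
    state = {"task": task, "history": []}
    for turn in range(safety_cap):
        state, done, confidence = run_step(state)
        if done and confidence >= confidence_threshold:
            return state, turn + 1     # early exit saves the remaining turns
    return state, safety_cap           # backstop: behaves like the old hard cap
```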
What does this mean for your agent budget?
Three takeaways for engineering leaders sizing an agent budget in 2026.
1. Token cost is no longer the constraint -- token efficiency is. Sticker prices have dropped 40-60% YoY. The teams overspending in 2026 are running naive multi-step loops without caching, routing, or truncation. Fix the architecture before negotiating with a vendor.
2. Pick on cost-per-correct-output, not cost-per-token. GPT-5 looks more expensive than Gemini 2.5 Pro until you adjust for accuracy -- then it is cheaper per correct answer. Run your own task-specific benchmark; vendor benchmarks generalize poorly to your prompt and tools.
3. Budget P95, not P50. A user-facing agent whose timeout rate looks like 1% when you plan around the median but lands at 8% once the P95 tail hits is broken UX and a refund liability. P95 inflated 1.6-3.2x over P50 in our runs and matches the 2026 industry pattern.
Sample 2026 budget for a production agent serving 100K tasks/month
- Naive (Sonnet 4.5, no caching, no routing): $24,100/month
- Caching only (74% hit rate): ~$8,400/month
- Caching + routing + truncation: ~$2,900/month
- All five levers + plan caching: ~$1,500/month
The difference between those two endpoints is one engineering sprint.
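To adapt numbers like these to your own volume and lever mix, a back-of-the-envelope estimator is enough. Every factor below is an assumption to replace with your own measurements -- the point is the structure, not the defaults:

```python
def monthly_budget(tasks_per_month: int,
                   cost_per_task: float,
                   cache_hit_rate: float = 0.0,     # share of input tokens served from cache
                   cached_discount: float = 0.9,    # cached reads at ~10% of base input price
                   input_share: float = 0.6,        # illustrative: fraction of task cost that is input tokens
                   routing_savings: float = 0.0,    # fraction of remaining cost removed by routing
                   truncation_savings: float = 0.0) -> float:
    """Rough monthly spend after stacking the levers; all factors are assumptions to tune."""
    cost = cost_per_task
    cost -= cost * input_share * cache_hit_rate * cached_discount   # prompt caching
    cost *= (1 - routing_savings)                                   # model routing
    cost *= (1 - truncation_savings)                                # tool result truncation
    return cost * tasks_per_month

# Example: 100K Sonnet 4.5 tasks/month at $0.241/task with no levers applied.
print(f"${monthly_budget(100_000, 0.241):,.0f}/month")   # $24,100/month naive baseline
```

Plug in your measured cache hit rate, routing mix, and truncation savings to see where your own stack lands between the naive and fully optimized endpoints.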
Full results by model (median across 5 tasks, 100 runs each)
| Model | Input $/M | Output $/M | Median tokens/task | P95 latency | $/task (median) | Accuracy |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | $3.00 | $15.00 | 48K in / 9.1K out | 27.8s | $0.241 | 91.8% |
| Claude Haiku 4.5 | $1.00 | $5.00 | 49K in / 7.6K out | 11.4s | $0.082 | 85.2% |
| GPT-5 | $1.25 | $10.00 | 46K in / 8.2K out | 24.6s | $0.134 | 93.1% |
| GPT-5-mini | $0.25 | $2.00 | 45K in / 7.4K out | 8.9s | $0.026 | 80.6% |
| Gemini 2.5 Pro | $1.25 | $10.00 | 52K in / 7.7K out | 21.7s | $0.131 | 88.4% |