Most teams ship AI agents with vibes-based evals: a few cherry-picked prompts, a thumbs-up from the founder, and a deploy. Then production happens. According to RAND's 2025 study, 80.3% of enterprise AI projects fail to deliver business value, with 33.8% abandoned before reaching production. The fix is not better models. It is a real evaluation harness. This guide gives you the exact 5-step framework we use to ship agents to production with confidence: golden sets, 4-axis metrics, LLM-as-judge calibration, and eval-on-PR.

Why do most AI agents fail in production?

Most agents fail because teams confuse demos with evidence. A demo proves the agent can do the task once. An eval proves it does the task reliably across the distribution of real inputs.

RAND (2025) analysed 65 enterprise AI initiatives and found that 80.3% failed to deliver business value, twice the failure rate of non-AI IT projects. The breakdown:

  • 33.8% abandoned before production
  • 28.4% reach production but underdeliver
  • 18.1% run but never recoup costs

The Sierra team's tau-bench paper (2024) shows the technical version of the same story: even GPT-4o solves fewer than 50% of tasks at pass^1, and only around 25% of retail tasks at pass^8 (the share of tasks it solves on all eight of eight independent attempts). Frontier agents are not reliable by default. Without an eval harness, you cannot detect regressions, you cannot defend a prompt change, and you cannot tell a stakeholder why your agent broke. The framework below fixes that.

Figure: Why AI Agent Projects Fail in Production (RAND, 2025) -- abandoned before production 33.8%; reach production but underdeliver 28.4%; run but never recoup costs 18.1%; deliver expected value 19.7%. Source: RAND Corporation, 'Why AI Projects Fail and How They Can Succeed' (2025).

What is the 5-step framework for evaluating an AI agent?

Run these five steps in order. Each one is non-negotiable.

  1. Define one narrow task. A single agent, a single user goal, a single success definition. Multi-task agents need multi-task suites; do not fake it with one bucket.
  2. Build a golden set of 50-200 real traces. Capture from production or seed from staging. Label expected tool calls and end states.
  3. Pick metrics across four axes: task completion, tool selection accuracy, trajectory quality, cost/latency. Each axis has its own pass threshold.
  4. Run the suite in CI on every PR. Use pytest-evals or DeepEval. Cache LLM responses. Fail the build if scores drop below threshold.
  5. Gate merges on the threshold. No green eval, no merge. Track score history per metric so you can diff PR-over-PR.

The rest of this article walks through each step, with the YAML config, the Python harness, and the CI job we actually ship.

Step 1: How do you define the task an agent is being evaluated on?

Pick one task with one success state. The task definition is the spec. If you cannot write the success criterion in two sentences, your agent does not know what success means either.

A good task definition has four fields:

  • Inputs: schema of the user message and any context (user_id, account state, locale).
  • Tools available: the exact tool list the agent is given at runtime.
  • Policy constraints: things the agent must never do (charge twice, leak PII, exceed refund limits).
  • Success state: a deterministic check, ideally a database diff or a structured output match.

Example for a refund agent:

Task: Issue a refund per the policy in refund_policy.md. Success: orders.status == 'refunded' AND transactions.amount_cents matches the requested amount AND no PII appears in the agent's final message.

This is the tau-bench approach: grade by end-state DB diff, not by judging the conversation. End-state grading is cheap, deterministic, and immune to LLM-judge drift.
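
A minimal sketch of that kind of deterministic grader, assuming a dict-shaped snapshot of the relevant tables and an illustrative PII regex (both are assumptions, not a reference implementation):

# Illustrative end-state check for the refund task. The snapshot shape and the
# PII pattern are assumptions; swap in your own schema and detectors.
import re

PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b|\b\d{16}\b")  # SSN- or card-shaped strings

def refund_success(db_snapshot: dict, requested_cents: int, final_message: str) -> bool:
    # Grade the database diff and the final message, not the conversation.
    return (
        db_snapshot["orders"]["status"] == "refunded"
        and db_snapshot["transactions"]["amount_cents"] == requested_cents
        and PII_PATTERN.search(final_message) is None
    )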

Step 2: How do you build a golden set for AI agent evals?

Build a golden set of 50-200 hand-labelled traces before you write a single metric. This is the most important step, and the one teams skip.

Anthropic's eval team and Microsoft's Copilot team both recommend 100-150 examples as the minimum for a non-trivial domain. Diversity beats volume: a 100-case set covering 10 distinct failure modes is worth more than 1,000 cases of the same easy path.

How to source the set:

  1. Pull 30-50 real traces from production (or staging if you are pre-launch).
  2. Add 20-30 adversarial cases your team brainstormed: jailbreaks, ambiguous inputs, broken tool responses.
  3. Add 20-30 edge cases discovered from incidents or bug reports.
  4. Label every example with: task_id, inputs, expected_tools (ordered list), expected_end_state, policy_flags.

The Maxim AI golden-set guide (2026) recommends versioning the dataset in git and reviewing changes to it like code. Treat it as a living asset: add a new case within 48 hours of every production incident, and retire stale cases quarterly.

A minimal golden-set CSV row looks like:

task_id,user_input,expected_tools,expected_end_state_json,policy_flags
refund_001,"I want a refund for order 7782","[lookup_order,issue_refund]","{\"orders.status\":\"refunded\"}","no_pii"

What are the four metrics every AI agent eval needs?

Score every run across four orthogonal axes. Skipping an axis hides a failure mode.

Eval Axis | What it measures | Metric | Pass threshold
Task completion | Did the agent achieve the goal? | End-state diff vs golden state | >=85%
Tool selection accuracy | Right tool, right args? | Tool-name exact match + arg F1 | >=90% / >=80%
Trajectory quality | Was the path efficient and policy-compliant? | LLM-as-judge, binary per turn | >=80% turns correct
Cost & latency | Is it economically shippable? | p50/p95 tokens, $/task, p95 ms | Set per task type

Why all four? A 2025 TRAJECT-Bench analysis found that agents with high outcome accuracy can still have low tool-selection accuracy: they brute-force the right answer through redundant tool calls, blowing latency and cost. Outcome alone does not catch this. Trajectory metrics do.

For tool selection, follow Galileo's 2026 framework: exact-match on the tool name plus an argument-level F1 score. For trajectory, score per turn with a binary pass/fail rubric, not a 1-5 scale. Microsoft's Azure AI Foundry team (2026) found LLM judges hit human-level agreement on binary calls and drift on fine-grained scales.
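
As a sketch, the argument-level F1 can be computed by treating each (tool name, arg name, arg value) triple as one item. This assumes both sides are lists of call objects with a name and an args dict, and it is one reasonable reading of the framework, not Galileo's exact definition:

# Illustrative argument-level F1 over (tool name, arg name, arg value) triples.
def arg_level_f1(actual_calls, expected_calls) -> float:
    actual = {(c.name, k, str(v)) for c in actual_calls for k, v in c.args.items()}
    expected = {(c.name, k, str(v)) for c in expected_calls for k, v in c.args.items()}
    if not actual and not expected:
        return 1.0  # nothing expected, nothing called: vacuously correct
    true_pos = len(actual & expected)
    precision = true_pos / len(actual) if actual else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)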

Should you use LLM-as-a-judge for agent evals?

Yes, but only for subjective dimensions, only with binary outputs, and only after calibrating against humans. Microsoft's evaluation-of-judges study (2026) found that frontier models hit human-level agreement on objective metrics like tool-call accuracy but become unreliable on fine-grained 1-5 scales.

The rules we follow:

  • Use deterministic checks first. End-state diffs, JSON schema match, regex on PII -- always cheaper and more reliable than a judge.
  • Use LLM-as-judge only for trajectory quality and policy compliance, where the answer is genuinely subjective.
  • Force binary outputs. Pass/fail beats 1-5 for stability.
  • Calibrate against 30-50 human-labelled examples. Compute Cohen's Kappa; if kappa < 0.7, rewrite the rubric (a calibration sketch follows this list).
  • Use a 3-judge consensus for high-stakes evals. Three independent judges plus majority vote hits ~97% macro F1 in published research.
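
The calibration check itself is a few lines, assuming scikit-learn is available and the human and judge verdicts are stored as parallel 0/1 lists (the data here is illustrative):

# Judge calibration sketch: agreement between human labels and judge verdicts.
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]  # human pass/fail on the calibration cases
judge_labels = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]  # LLM-judge verdicts on the same cases

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Cohen's kappa: {kappa:.2f}")
if kappa < 0.7:
    print("Judge not calibrated -- rewrite the rubric before trusting its scores.")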

A solid judge prompt is short, has the rubric inline, gives one positive and one negative example, and asks for the verdict before the rationale. Long judge prompts produce verbose, biased rationales.
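
A sketch of what such a judge could look like in code, using the Anthropic Python SDK. The prompt wording, the PASS/FAIL parsing, and the model default are assumptions, and the inline positive/negative examples are left to the rubric file for brevity:

# evals/judges.py -- illustrative binary trajectory judge, not a reference implementation.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_trajectory(trace: str, rubric_path: str, model: str = "claude-sonnet-4.5") -> bool:
    rubric = Path(rubric_path).read_text()
    prompt = (
        "Grade the agent trajectory against the rubric.\n\n"
        f"<rubric>\n{rubric}\n</rubric>\n\n"
        f"<trajectory>\n{trace}\n</trajectory>\n\n"
        "Reply with the verdict first, PASS or FAIL, then one sentence of rationale."
    )
    response = client.messages.create(
        model=model,  # judge_model from the eval config
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    # Binary output: anything that does not start with PASS counts as a fail.
    return response.content[0].text.strip().upper().startswith("PASS")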

How do you configure an agent eval suite (YAML)?

Keep the eval config in a single YAML file checked into the repo. This is the contract between the eval harness and the team. We use this shape:

# evals/refund_agent.yaml
name: refund_agent_v3
task: refund
agent:
  entrypoint: agents.refund:run
  model: claude-sonnet-4.5
  tools: [lookup_order, issue_refund, send_email]

golden_set: evals/datasets/refund_v3.csv
min_cases: 120

metrics:
  - id: task_completion
    type: end_state_diff
    threshold: 0.85
  - id: tool_selection
    type: tool_match
    name_threshold: 0.90
    args_f1_threshold: 0.80
  - id: trajectory
    type: llm_judge
    judge_model: claude-sonnet-4.5
    rubric: evals/rubrics/refund_trajectory.md
    output: binary
    threshold: 0.80
  - id: cost
    type: budget
    p95_tokens: 8000
    p95_latency_ms: 12000
    max_dollars_per_task: 0.04

ci:
  fail_on: any_metric_below_threshold
  baseline_branch: main
  regression_tolerance: 0.02

The regression_tolerance matters: eval scores are noisy run-to-run, so we let a score drop by up to 2 percentage points before failing the build. Anything bigger is a real regression.

How do you write an agent eval harness in Python?

Use pytest-evals or DeepEval. Both wrap the golden set as parametrized pytest cases and emit a structured report. A minimal harness using pytest-evals:

# evals/test_refund_agent.py
import pytest

# eval_bag is a fixture provided by the pytest-evals plugin, so it is not imported.
from agents.refund import run as run_agent
from evals.dataset import load_golden_set
from evals.judges import judge_trajectory
from evals.metrics import arg_level_f1  # arg-level F1 helper (module path assumed)

GOLDEN = load_golden_set("evals/datasets/refund_v3.csv")

@pytest.mark.eval(name="refund_agent")
@pytest.mark.parametrize("case", GOLDEN, ids=[c.task_id for c in GOLDEN])
def test_refund_agent(case, eval_bag):
    result = run_agent(case.user_input, ctx=case.ctx)

    # Axis 1: task completion -- deterministic end-state diff.
    eval_bag.task_completion = result.end_state == case.expected_end_state
    # Axis 2: tool selection -- exact tool-name sequence plus arg-level F1.
    eval_bag.tool_name_match = (
        [t.name for t in result.tool_calls] == case.expected_tools
    )
    eval_bag.args_f1 = arg_level_f1(result.tool_calls, case.expected_tool_args)
    # Axis 3: trajectory quality -- binary LLM-judge verdict against the rubric.
    eval_bag.trajectory_pass = judge_trajectory(
        result.trace, rubric_path="evals/rubrics/refund_trajectory.md"
    )
    # Axis 4: cost and latency -- raw numbers; thresholds are applied in the report step.
    eval_bag.tokens = result.usage.total_tokens
    eval_bag.latency_ms = result.latency_ms

Key patterns:

  • Cache LLM responses to disk keyed by (model, prompt-hash, tool-call-hash). Without caching, every CI run costs the same as a full re-run. LangSmith's pytest integration does this out of the box; a hand-rolled version is sketched after this list.
  • Run cases in parallel with pytest-xdist. Most agent calls are I/O-bound.
  • Emit one JSON report per run with per-case scores, aggregate scores, and cost. This is what the dashboard reads.
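
A hand-rolled version of that cache is small. This sketch assumes responses are JSON-serialisable and that you wrap whatever function actually calls the model:

# Illustrative disk cache for LLM responses, keyed by (model, prompt hash, tool-call hash).
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".eval_cache")  # matches the path restored by the CI cache step below

def cached_completion(model: str, prompt: str, tool_calls_repr: str, call_fn):
    key = hashlib.sha256(f"{model}|{prompt}|{tool_calls_repr}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: no API call, no cost
    response = call_fn()                     # cache miss: make the real LLM call
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps(response))
    return response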

How do you run agent evals in CI?

Wire the eval harness as a GitHub Action that runs on every PR touching agents/, prompts/, or evals/. Block the merge when any axis drops below threshold.

# .github/workflows/eval-on-pr.yml
name: eval-on-pr
on:
  pull_request:
    paths: ["agents/**", "prompts/**", "evals/**"]
jobs:
  evals:
    runs-on: ubuntu-latest
    timeout-minutes: 25
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -e ".[evals]"
      - name: Restore eval cache
        uses: actions/cache@v4
        with:
          path: .eval_cache
          key: evals-${{ hashFiles('evals/datasets/**') }}
      - name: Run eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: pytest evals/ -n 8 --eval-report=eval_report.json
      - name: Diff against main baseline
        run: python evals/scripts/diff_baseline.py eval_report.json
      - name: Comment results on PR
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: eval_report.md

The diff_baseline.py step is the gate: it pulls the latest main eval report from S3, computes the per-axis delta, and exits non-zero if any axis regressed beyond regression_tolerance. The PR comment renders the per-axis scores as a table next to main so reviewers see exactly what changed.
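
A sketch of that gate, assuming the baseline report has already been fetched from S3 to a local file and that both reports expose a per-axis aggregates map (the markdown rendering for the PR comment is omitted):

# evals/scripts/diff_baseline.py -- illustrative baseline gate; report schema assumed.
import json
import sys

REGRESSION_TOLERANCE = 0.02  # mirrors regression_tolerance in the eval config

def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)  # expects {"aggregates": {"task_completion": 0.91, ...}}

def main() -> None:
    pr = load(sys.argv[1])
    baseline = load("baseline_eval_report.json")  # latest main report, pulled from S3
    regressed = False
    for axis, base_score in baseline["aggregates"].items():
        pr_score = pr["aggregates"].get(axis, 0.0)
        delta = pr_score - base_score
        print(f"{axis}: main={base_score:.3f} pr={pr_score:.3f} delta={delta:+.3f}")
        if delta < -REGRESSION_TOLERANCE:
            regressed = True
    sys.exit(1 if regressed else 0)

if __name__ == "__main__":
    main()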

This is what "eval-on-PR" actually means. Not a Notion checklist. Not a manual run. A blocking CI check, every time.

What does an agent eval dashboard track?

Three things: per-axis score over time, cost per task, and the failing-cases drill-down. Everything else is decoration.

The minimum dashboard:

  1. Score history, one line per axis (task completion, tool selection, trajectory, cost-pass-rate), x-axis is git SHA. The y-axis is the % of golden-set cases passing that axis. Annotate model and prompt changes.
  2. Cost-per-task distribution. Histogram of $/task, labelled with median and p95. This is the metric finance asks about.
  3. Failing-cases list. For the latest run, the 10 cases that failed any axis, with the trace, the expected outcome, and the actual outcome side by side. This is where debugging happens.

We build this in a thin Streamlit app that reads the JSON eval reports from S3, but Braintrust, LangSmith, and DeepEval all give you this out of the box. The vendor choice is less important than the rule: every eval run produces a structured report, every report ends up in the dashboard, every regression has an owner.
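
A score-history view in that style is only a few lines of Streamlit. The report filenames, the git_sha and aggregates fields, and the axis column names are assumptions about the JSON report schema:

# dashboard.py -- minimal score-history view over the JSON eval reports.
import glob
import json

import pandas as pd
import streamlit as st

rows = []
for path in sorted(glob.glob("reports/eval_report_*.json")):  # one report per eval run
    with open(path) as f:
        report = json.load(f)
    rows.append({"sha": report["git_sha"], **report["aggregates"]})

df = pd.DataFrame(rows).set_index("sha")
st.title("Refund agent eval history")
st.line_chart(df[["task_completion", "tool_selection", "trajectory", "cost_pass_rate"]])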

If you cannot answer "what was our task-completion score last Tuesday?" in 30 seconds, your dashboard is broken.

How do you keep an agent eval suite from rotting?

Eval rot is the silent killer. Suites pass forever, then production breaks, then nobody trusts the suite. Three rules prevent this.

1. Mine production for new cases weekly. Pull the 20 lowest-confidence traces, the 20 longest traces, and any trace flagged by an incident. Label them. Add them to the golden set. Anthropic's eval guide calls this "closing the loop" and identifies it as the single biggest predictor of long-term eval value.

2. Re-baseline whenever the world moves. Re-run the full suite on a fresh model (Claude 4.5 -> 5, GPT-4o -> 5) within 24 hours of release. Re-baseline whenever you change the prompt, the tool schema, or the policy doc. Without re-baselines, you are comparing against a ghost.

3. Assign an eval owner. One person owns the harness and the dataset. They review new cases, kill flaky ones, and report aggregate trends in weekly engineering review. No owner means no eval. Anthropic's recommendation: a dedicated evals team owns infrastructure, while domain experts contribute cases.

The goal is not a perfect score. The goal is a trusted signal that survives team turnover, model upgrades, and prompt rewrites.
