
agent-evaluation

This skill should be used when the user asks to "evaluate agent output", "measure agent quality", "build an eval for an AI agent", "assess LLM agent performance", "create evaluation criteria for agents", "test agent output quality", "build an agent eval framework", "score agent responses", "measure AI agent accuracy", or any variation of evaluating and measuring the quality of AI agent output for B2B SaaS GTM tasks.

Agent Evaluation

Agent evaluation measures whether an AI agent produces output that meets a defined quality bar. Without evaluation, you're guessing whether the agent works. With evaluation, you know: the agent passes 94% of accuracy checks, violates zero banned-phrase rules, and produces output rated 4.2/5 by human reviewers. Evaluation turns "it seems fine" into a measurable quality system.

The principle: define what "good" means before building the agent. Write the eval before the prompt. If you can't describe what a good output looks like in measurable terms, you can't evaluate whether the agent produces it.

The Evaluation Framework

4 dimensions of agent quality

| Dimension | What it measures | How to measure | Non-negotiable? |
|---|---|---|---|
| Accuracy | Are the facts correct? No hallucinations, no fabricated data | Cross-check output against input data. Human spot-check | Yes. 95%+ required |
| Compliance | Does output follow all rules? Word limits, banned phrases, format, style | Automated rule checker | Yes. 100% required |
| Completeness | Are all required fields populated? No missing data | Automated schema check | Yes. 90%+ required |
| Quality | Is the output genuinely good? Would a human use it without editing? | Human rating (1-5 scale) | No. Target ≥ 4.0 average |
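
A minimal sketch of how these dimensions and thresholds might be represented in an eval harness. The dataclass and field names are illustrative, not a required schema.

```python
from dataclasses import dataclass

# Pass thresholds from the table above. Quality is a 1-5 human rating; the rest are pass rates.
THRESHOLDS = {"accuracy": 0.95, "compliance": 1.00, "completeness": 0.90, "quality": 4.0}

@dataclass
class DimensionScores:
    accuracy: float      # fraction of accuracy checks passed
    compliance: float    # fraction of rule checks passed
    completeness: float  # fraction of required fields populated
    quality: float       # mean human rating, 1-5

    def passes(self) -> bool:
        return (
            self.accuracy >= THRESHOLDS["accuracy"]
            and self.compliance >= THRESHOLDS["compliance"]
            and self.completeness >= THRESHOLDS["completeness"]
            and self.quality >= THRESHOLDS["quality"]
        )
```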

Dimension priority

Accuracy > Compliance > Completeness > Quality. In that order.

  • An accurate output that's slightly too long (compliance miss) is fixable
  • A compliant output that contains fabricated data (accuracy miss) is dangerous
  • An output that's missing one field (completeness miss) is incomplete but safe
  • A low-quality output that's accurate and compliant just needs prompt improvement

Building Evals for GTM Agents

Eval structure

Every agent eval has three components:

1. TEST SET: A curated set of inputs with known-good outputs
2. EVAL CRITERIA: Specific, measurable checks applied to each output
3. SCORING: How to aggregate individual checks into an overall score
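
A minimal sketch of how the three components fit together, assuming inputs, outputs, and ground truth are plain dicts. All names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    input_record: dict       # e.g. a prospect record exported from the CRM
    expected_output: dict    # the human-verified ideal output ("ground truth")

@dataclass
class Criterion:
    name: str
    check: Callable[[dict, dict], bool]  # (agent_output, expected_output) -> pass/fail
    pass_threshold: float                # fraction of test cases that must pass

def run_eval(test_set: list[TestCase],
             criteria: list[Criterion],
             agent: Callable[[dict], dict]) -> dict:
    """Run the agent once per input, then score every criterion across the whole set."""
    outputs = [agent(case.input_record) for case in test_set]
    scores = {}
    for criterion in criteria:
        passed = sum(
            criterion.check(output, case.expected_output)
            for output, case in zip(outputs, test_set)
        )
        scores[criterion.name] = passed / len(test_set)
    return scores
```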

Creating the test set

| Test set type | What it contains | Size | When to use |
|---|---|---|---|
| Golden set | Inputs with human-verified ideal outputs. The "ground truth" | 20-50 examples | Primary eval. Used for every prompt change |
| Edge case set | Inputs that represent unusual or difficult scenarios | 10-20 examples | Catches failure modes. Run after the golden set passes |
| Regression set | Inputs where the agent previously failed (fixed and added here) | Grows over time | Prevents re-introducing old bugs |
| Production sample | Random sample from real production runs | 20-50 per eval cycle | Measures real-world quality, not just test-set quality |

Test set rules:

  • Build the golden set from real data. Export 20-50 real prospect records. Have a human write the ideal output for each. This is the ground truth the agent is measured against
  • Include bad data in the test set. Prospects with missing fields, incomplete LinkedIn profiles, companies with no funding data. The agent must handle bad input gracefully
  • Add every failure to the regression set. When the agent produces a bad output in production, add that input + the correct output to the regression set. The test set grows stronger over time
  • 20 examples minimum for any meaningful eval. Below 20, individual outliers dominate the score. Above 50, you get diminishing returns (unless the agent handles very diverse tasks)
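
One way to store and grow these sets is a JSONL file per set, one object per line. A sketch, with illustrative file and field names:

```python
import json

def load_test_set(path: str) -> list[dict]:
    """Each line: {"input": {...}, "expected_output": {...}}."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def add_to_regression_set(failed_input: dict, correct_output: dict,
                          path: str = "regression_set.jsonl") -> None:
    """When a production failure is found, append the input plus the corrected
    output so the old failure can never silently return."""
    with open(path, "a") as f:
        f.write(json.dumps({"input": failed_input, "expected_output": correct_output}) + "\n")

golden_set = load_test_set("golden_set.jsonl")      # 20-50 human-verified examples
edge_cases = load_test_set("edge_case_set.jsonl")   # 10-20 difficult or messy inputs
assert len(golden_set) >= 20, "Below 20 examples, single outliers dominate the score"
```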

Eval Criteria by Agent Type

Research Agent eval

| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Company name correct | Accuracy | Exact match against input | 100% |
| Employee count within 20% of actual | Accuracy | Compare to enrichment ground truth | 90% |
| Funding data correct (round, amount, date) | Accuracy | Cross-check against Crunchbase | 95% |
| Industry correct | Accuracy | Match against ground truth | 95% |
| Signals are real and verifiable | Accuracy | Human spot-check: can the signal be verified? | 90% |
| Problem hypothesis is grounded in evidence | Quality | Human rating: is the hypothesis supported by the data, not fabricated? | 4.0/5 average |
| All required fields populated | Completeness | Automated schema check | 90% |
| Output format matches schema | Compliance | Automated format check | 100% |
| No hallucinated data | Accuracy | Cross-check every claim against input | 99%+ |
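
A sketch of what the automated accuracy checks in this table could look like, assuming the agent output and the enrichment ground truth are flat dicts with illustrative field names. The proper-noun check is deliberately crude: it flags candidates for human review, it is not a final judgment.

```python
def company_name_correct(output: dict, truth: dict) -> bool:
    return output.get("company_name", "").strip().lower() == truth.get("company_name", "").strip().lower()

def employee_count_within_20_pct(output: dict, truth: dict) -> bool:
    actual, claimed = truth.get("employee_count"), output.get("employee_count")
    if not actual or claimed is None:
        return False
    return abs(claimed - actual) / actual <= 0.20

def no_unsupported_proper_nouns(output_text: str, input_text: str) -> bool:
    """Crude hallucination screen: every capitalized token in the output should
    also appear somewhere in the input data it was generated from. Sentence-initial
    words cause false positives, so failures go to a human, not to auto-reject."""
    input_lower = input_text.lower()
    for token in output_text.split():
        word = token.strip(".,:;()\"'")
        if word[:1].isupper() and word.lower() not in input_lower:
            return False
    return True
```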

Email Writer Agent eval

| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Word count within limit | Compliance | Automated count. Email 1 ≤ 80, Email 2 ≤ 90, Email 3 ≤ 30 | 100% |
| No banned phrases | Compliance | Regex check against banned phrase list | 100% |
| No em-dashes | Compliance | Character check for "—" | 100% |
| Subject line ≤ 5 words, lowercase | Compliance | Automated check | 100% |
| First word is not "I" | Compliance | Automated check | 100% |
| Signal reference is from input data (no hallucination) | Accuracy | Cross-check signal in output against input signal field | 99%+ |
| Personalization token present and accurate | Accuracy | Verify the personalized element matches the input data | 95% |
| Email reads naturally (not robotic) | Quality | Human rating 1-5 | 4.0/5 average |
| Each email uses a different opener pattern | Quality | Human review: are the openers genuinely different? | 90% |
| Proof point is specific (named company or stat) | Quality | Check for named company or specific number in Email 2 | 90% |
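
Most of the compliance rows in this table are automatable. A sketch of that layer, with an illustrative banned-phrase list and the word limits from the table:

```python
WORD_LIMITS = {1: 80, 2: 90, 3: 30}
# Illustrative placeholders; substitute your own playbook's banned phrases.
BANNED_PHRASES = ["quick question", "just following up", "hope this finds you well"]

def check_email(email_number: int, subject: str, body: str) -> dict:
    words = body.split()
    first_word = (words[:1] or [""])[0].strip(",.'\"").lower()
    checks = {
        "word_count": len(words) <= WORD_LIMITS[email_number],
        "no_banned_phrases": not any(p in body.lower() for p in BANNED_PHRASES),
        "no_em_dashes": "—" not in (subject + body),
        "subject_max_5_words": len(subject.split()) <= 5,
        "subject_lowercase": subject == subject.lower(),
        "first_word_not_i": first_word != "i",
    }
    checks["pass"] = all(checks.values())
    return checks
```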

Reply Classifier Agent eval

| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Classification matches ground truth | Accuracy | Compare to human-labeled classification | 92%+ |
| Confidence score correlates with accuracy | Accuracy | High-confidence classifications should be more accurate than low-confidence ones | Correlation > 0.7 |
| Low-confidence outputs flagged for human review | Compliance | Outputs below the confidence threshold are flagged | 100% |
| All classification categories covered in test set | Completeness | At least 3 examples per category in the test set | N/A (test set design) |
| Edge cases handled (multi-intent replies, sarcasm) | Quality | Human review of edge case set | 80%+ on edge cases |
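
A sketch of the classifier checks, assuming each eval record carries a predicted label, a confidence between 0 and 1, and a human ground-truth label. Field names are illustrative; `statistics.correlation` needs Python 3.10+ and at least two non-constant data points.

```python
from statistics import correlation

CONFIDENCE_THRESHOLD = 0.7  # illustrative; below this the reply goes to a human

def classifier_eval(records: list[dict]) -> dict:
    correct = [int(r["predicted"] == r["ground_truth"]) for r in records]
    confidences = [r["confidence"] for r in records]
    flagged = [r for r in records if r["confidence"] < CONFIDENCE_THRESHOLD]
    return {
        "accuracy": sum(correct) / len(records),                               # target 92%+
        "confidence_accuracy_correlation": correlation(confidences, correct),  # target > 0.7
        "flagged_for_human_review": len(flagged),                              # must all be routed to a human
    }
```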

Lead Scorer Agent eval

| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Score matches manual scoring within ±10 points | Accuracy | Compare to human-scored ground truth | 85% |
| Tier assignment matches (Tier 1 vs 2 vs 3) | Accuracy | Compare tier to human assignment | 90% |
| Missing data handled correctly (not scored as 0, noted as missing) | Completeness | Check handling of null/empty input fields | 100% |
| Score reasoning is documented | Completeness | Output includes breakdown per dimension | 95% |
| Anti-ICP flags correctly applied | Accuracy | Competitors, disqualified verticals get negative scores | 100% |
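
A sketch of the scorer checks, assuming 0-100 integer scores, string tier labels, and an output field listing which inputs were missing. Names are illustrative.

```python
def score_within_tolerance(agent_score: int, human_score: int, tolerance: int = 10) -> bool:
    return abs(agent_score - human_score) <= tolerance   # target: 85% of cases pass

def tier_matches(agent_tier: str, human_tier: str) -> bool:
    return agent_tier == human_tier                       # target: 90% of cases pass

def missing_data_handled(output: dict, input_record: dict, required_fields: list[str]) -> bool:
    """Missing inputs must be flagged as missing, never silently scored as zero."""
    missing = [f for f in required_fields if not input_record.get(f)]
    return all(f in output.get("missing_fields", []) for f in missing)
```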

Automated vs Human Evaluation

| Eval type | What it checks | Speed | Cost | Reliability |
|---|---|---|---|---|
| Automated (programmatic) | Rule compliance, format, word count, banned phrases, schema | Instant. Every output | Free (compute only) | 100% for binary rules. Can't assess quality/naturalness |
| LLM-as-judge | Quality, naturalness, relevance. Another LLM evaluates the output | Fast (seconds per eval) | $0.01-0.05 per eval | 70-85% agreement with humans. Good for screening, not final judgment |
| Human review | Everything. The gold standard | Slow (1-3 min per output) | $0.50-2.00 per eval (reviewer time) | Highest. The ground truth all other methods are calibrated against |

When to use each

| Phase | Automated | LLM-as-judge | Human |
|---|---|---|---|
| Every output (production) | Yes (100%) | Optional (for quality screening) | No (too slow/expensive) |
| Daily quality check | Yes (100% of daily output) | Yes (flag low-quality for human review) | 10-20% spot-check |
| Prompt change validation | Yes (full golden set) | Yes (full golden set) | Yes (20-50 examples from golden set) |
| Monthly quality audit | Yes (full regression set) | Optional | Yes (50 production samples) |

Eval layer rules

  • Automated checks run on every output. Zero cost, instant feedback. Word count, banned phrases, format, schema. These are binary pass/fail. No reason to skip
  • LLM-as-judge is a middle layer. Use a second LLM (can be the same model with a different prompt) to rate quality 1-5. Useful for flagging outputs that pass automated checks but feel robotic or irrelevant
  • Human review is the calibration layer. Everything else is calibrated against human judgment. Without regular human review, automated checks may pass outputs that are technically compliant but qualitatively bad
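
A sketch of an LLM-as-judge layer. `call_llm` is a placeholder for whatever model client you already use, and the rubric wording and 3.5 cut-off are assumptions, not fixed values.

```python
JUDGE_PROMPT = """Rate the following outbound email 1-5 for naturalness and
relevance to the prospect's situation. Reply with only the number.

Prospect data:
{prospect}

Email:
{email}
"""

def judge_quality(prospect: str, email: str, call_llm) -> float:
    raw = call_llm(JUDGE_PROMPT.format(prospect=prospect, email=email))
    try:
        return float(raw.strip())
    except ValueError:
        return 0.0  # unparseable rating: treat as failing, flag for human review

def needs_human_review(score: float, threshold: float = 3.5) -> bool:
    return score < threshold
```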

The Eval Pipeline

How to run evals continuously

Agent produces output
  ↓
Layer 1: Automated checks (instant)
  - Format compliance (schema, word count, structure)
  - Rule compliance (banned phrases, em-dashes, subject line rules)
  - Accuracy cross-check (proper nouns in output exist in input)
  ↓
  PASS → Layer 2
  FAIL → Reject. Log failure. Regenerate or flag for human fix

Layer 2: LLM-as-judge (optional, seconds)
  - Quality rating 1-5
  - Naturalness check
  - Relevance to the prospect's situation
  ↓
  Score ≥ 3.5 → PASS → Output goes to production
  Score < 3.5 → Flag for human review

Layer 3: Human spot-check (daily/weekly)
  - Random sample of 10-20% of passed outputs
  - Human rates 1-5 on quality
  - Flags hallucinations, tone issues, irrelevant content
  - Results feed back into prompt improvement
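
A sketch of how layers 1 and 2 and the layer-3 sampling could be wired together. The check and judge callables are the kind sketched in earlier sections; the 3.5 cut-off and 15% sample rate mirror the pipeline above and are adjustable assumptions.

```python
import random

def evaluate_output(output: dict, input_record: dict, automated_checks, judge) -> str:
    # Layer 1: automated checks, binary pass/fail, run on every output
    for check in automated_checks:
        if not check(output, input_record):
            return "reject"          # log the failure, then regenerate or flag for a human fix
    # Layer 2: LLM-as-judge quality screen
    score = judge(output, input_record)
    if score < 3.5:
        return "human_review"
    # Layer 3 sampling: 10-20% of passed outputs go into the human spot-check queue
    if random.random() < 0.15:
        return "pass_and_sample"
    return "pass"
```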

Tracking Eval Results Over Time

The eval dashboard

| Metric | What to track | Frequency |
|---|---|---|
| Automated pass rate | % of outputs passing all automated checks | Per batch |
| LLM-as-judge average score | Mean quality score from the judge model | Daily |
| Human review average score | Mean quality score from human reviewers | Weekly |
| Hallucination rate | % of outputs containing fabricated claims | Per batch |
| Rule violation rate | % of outputs violating at least one rule | Per batch |
| Completeness rate | % of outputs with all required fields populated | Per batch |
| Regression test pass rate | % of regression set inputs that produce correct outputs | Per prompt change |
| Golden set accuracy | % of golden set inputs that match the expected output quality | Per prompt change |
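
A sketch of computing the per-batch metrics from a list of eval records, assuming each record already carries its per-output check results (field names are illustrative):

```python
def batch_metrics(records: list[dict]) -> dict:
    n = len(records)
    return {
        "automated_pass_rate": sum(r["automated_pass"] for r in records) / n,
        "hallucination_rate": sum(r["hallucinated"] for r in records) / n,
        "rule_violation_rate": sum(r["rule_violations"] > 0 for r in records) / n,
        "completeness_rate": sum(r["all_fields_present"] for r in records) / n,
    }
```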

Quality trend analysis

| Trend | What it means | Action |
|---|---|---|
| Automated pass rate declining | New edge cases or data quality issues | Add failing cases to the regression set. Fix the prompt |
| Human scores declining | Output quality is drifting. Prompt may be stale | Review recent outputs vs 30-day-ago outputs. Update prompt |
| Hallucination rate increasing | Data pipeline change or prompt regression | Check input data quality. Review prompt for hallucination-prone patterns |
| LLM-as-judge and human scores diverging | The judge model is miscalibrated | Re-calibrate the judge prompt against recent human ratings |

Eval-Driven Prompt Improvement

The improvement cycle

1. Run eval on current prompt (golden set + production sample)
2. Identify weakest criterion (lowest pass rate or score)
3. Analyze failures: what pattern causes the failures?
4. Modify the prompt to address the pattern
5. Re-run eval on the SAME test set
6. Compare results: did the change improve the weak criterion
   WITHOUT degrading other criteria?
7. If yes: deploy the new prompt
8. If no: revert and try a different approach
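
A sketch of the step-6 comparison: accept the new prompt only if the weak criterion improves and nothing else degrades beyond a small tolerance. The 2% tolerance is illustrative.

```python
def should_deploy(old: dict, new: dict, target_criterion: str, tolerance: float = 0.02) -> bool:
    """old/new map criterion name -> pass rate (or normalized score) on the same test set."""
    improved = new[target_criterion] > old[target_criterion]
    no_regressions = all(
        new[name] >= old[name] - tolerance
        for name in old
        if name != target_criterion
    )
    return improved and no_regressions
```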

Improvement rules

  • One prompt change per eval cycle. Changing 5 things at once makes it impossible to know which change helped. Change one instruction, one example, or one rule per iteration
  • Always re-run the full golden set after a change. A change that fixes one problem may introduce another. The golden set catches regressions
  • Track every prompt version with its eval results. Version 1: accuracy 91%, quality 3.8. Version 2: accuracy 94%, quality 4.1. Version 3: accuracy 93%, quality 4.3. The version history shows the improvement trajectory
  • Don't optimize for the test set at the expense of production. If the prompt scores perfectly on the golden set but poorly on production samples, the golden set isn't representative. Add more diverse examples

Pre-Deploy Eval Checklist

Before deploying any agent or prompt change to production:

  • [ ] Golden set (20+ examples) created with human-verified ideal outputs
  • [ ] Edge case set (10+ examples) covering missing data, unusual inputs
  • [ ] Automated checks implemented for all compliance rules
  • [ ] Agent passes 95%+ on accuracy checks against golden set
  • [ ] Agent passes 100% on compliance checks (word count, banned phrases, format)
  • [ ] Agent passes 90%+ on completeness checks
  • [ ] Human review of 20+ outputs scores ≥ 4.0/5 average quality
  • [ ] Hallucination rate < 2% on golden set
  • [ ] Regression set passes at 100% (no re-introduced old failures)
  • [ ] Eval results documented with prompt version number

Anti-Pattern Check

  • No eval before deployment. "The prompt looks good in testing" is not an eval. Build a golden set. Run automated checks. Score with human reviewers. Then deploy
  • Eval set too small (5 examples). 5 examples means one failure changes the pass rate by 20%. Not meaningful. Minimum 20 examples for any eval
  • No automated checks on production output. Automated checks are free and instant. Every output should pass word count, banned phrases, and format checks before being used. There's no reason to skip this
  • Evaluating accuracy without ground truth. "The output seems correct" is not an accuracy check. Compare every factual claim in the output against the input data or a verified source. Ground truth or it's not an eval
  • Golden set never updated. The golden set from 3 months ago doesn't include the new ICP segment, the updated messaging, or the recent edge cases. Update quarterly with fresh real-world examples
  • Optimizing for test set at the expense of production. Agent passes 100% on the golden set but produces mediocre output on real prospects. The golden set is too easy or not representative. Add harder, more diverse examples
  • No regression testing. A prompt change fixes one problem and re-introduces two old ones. Without a regression set, old failures silently return. Add every failure to the regression set and re-run after every change
  • LLM-as-judge without human calibration. The judge model rates everything 4.5/5. Humans rate the same outputs 3.2/5. The judge is miscalibrated. Calibrate the judge prompt by comparing its ratings to human ratings on 50+ examples