Agent Evaluation
Agent evaluation measures whether an AI agent produces output that meets a defined quality bar. Without evaluation, you're guessing whether the agent works. With evaluation, you know: the agent passes 94% of accuracy checks, violates zero banned-phrase rules, and produces output rated 4.2/5 by human reviewers. Evaluation turns "it seems fine" into a measurable quality system.
The principle: define what "good" means before building the agent. Write the eval before the prompt. If you can't describe what a good output looks like in measurable terms, you can't evaluate whether the agent produces it.
The Evaluation Framework
4 dimensions of agent quality
| Dimension | What it measures | How to measure | Non-negotiable? |
|---|---|---|---|
| Accuracy | Are the facts correct? No hallucinations, no fabricated data | Cross-check output against input data. Human spot-check | Yes. 95%+ required |
| Compliance | Does output follow all rules? Word limits, banned phrases, format, style | Automated rule checker | Yes. 100% required |
| Completeness | Are all required fields populated? No missing data | Automated schema check | Yes. 90%+ required |
| Quality | Is the output genuinely good? Would a human use it without editing? | Human rating (1-5 scale) | Target ≥ 4.0 average |
Dimension priority
Accuracy > Compliance > Completeness > Quality. In that order.
- An accurate output that's slightly too long (compliance miss) is fixable
- A compliant output that contains fabricated data (accuracy miss) is dangerous
- An output that's missing one field (completeness miss) is incomplete but safe
- A low-quality output that's accurate and compliant just needs prompt improvement
Building Evals for GTM Agents
Eval structure
Every agent eval has three components:
1. TEST SET: A curated set of inputs with known-good outputs
2. EVAL CRITERIA: Specific, measurable checks applied to each output
3. SCORING: How to aggregate individual checks into an overall score
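A minimal sketch of how these three components might be represented in code. The `EvalCase`, `Criterion`, and `score` names and the field layout are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Callable

# 1. TEST SET: one entry pairs an input with its human-verified ideal output.
@dataclass
class EvalCase:
    case_id: str
    input: dict             # e.g. the prospect record handed to the agent
    expected_output: dict   # the human-written "ground truth" output

# 2. EVAL CRITERIA: a named, measurable check applied to each output.
@dataclass
class Criterion:
    name: str
    check: Callable[[dict, "EvalCase"], bool]  # pass/fail for one output
    threshold: float                           # required pass rate, e.g. 0.95

# 3. SCORING: aggregate individual checks into per-criterion pass rates.
def score(outputs: dict, cases: list, criteria: list) -> dict:
    results = {}
    for c in criteria:
        passed = sum(c.check(outputs[case.case_id], case) for case in cases)
        results[c.name] = passed / len(cases)
    return results
```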
Creating the test set
| Test set type | What it contains | Size | When to use |
|---|---|---|---|
| Golden set | Inputs with human-verified ideal outputs. The "ground truth" | 20-50 examples | Primary eval. Used for every prompt change |
| Edge case set | Inputs that represent unusual or difficult scenarios | 10-20 examples | Catches failure modes. Run after golden set passes |
| Regression set | Inputs where the agent previously failed (fixed and added here) | Grows over time | Prevents re-introducing old bugs |
| Production sample | Random sample from real production runs | 20-50 per eval cycle | Measures real-world quality, not just test-set quality |
Test set rules:
- Build the golden set from real data. Export 20-50 real prospect records. Have a human write the ideal output for each. This is the ground truth the agent is measured against
- Include bad data in the test set. Prospects with missing fields, incomplete LinkedIn profiles, companies with no funding data. The agent must handle bad input gracefully
- Add every failure to the regression set. When the agent produces a bad output in production, add that input + the correct output to the regression set (see the sketch after this list). The test set grows stronger over time
- 20 examples minimum for any meaningful eval. Below 20, individual outliers dominate the score. Above 50, you get diminishing returns (unless the agent handles very diverse tasks)
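The regression-set rule above can be as lightweight as appending the failing input, with its human-corrected output, to a file that every eval run reads. A sketch; the file location and field names are assumptions:

```python
import json
from pathlib import Path

REGRESSION_SET = Path("eval/regression_set.jsonl")  # illustrative location

def add_to_regression_set(case_id: str, agent_input: dict, corrected_output: dict, note: str) -> None:
    """Append a production failure (plus its human-corrected output) to the regression set."""
    record = {
        "case_id": case_id,
        "input": agent_input,
        "expected_output": corrected_output,
        "note": note,  # e.g. "fabricated a funding round for a bootstrapped company"
    }
    with REGRESSION_SET.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```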
Eval Criteria by Agent Type
Research Agent eval
| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Company name correct | Accuracy | Exact match against input | 100% |
| Employee count within 20% of actual | Accuracy | Compare to enrichment ground truth | 90% |
| Funding data correct (round, amount, date) | Accuracy | Cross-check against Crunchbase | 95% |
| Industry correct | Accuracy | Match against ground truth | 95% |
| Signals are real and verifiable | Accuracy | Human spot-check: can the signal be verified? | 90% |
| Problem hypothesis is grounded in evidence | Quality | Human rating: is the hypothesis supported by the data, not fabricated? | 4.0/5 average |
| All required fields populated | Completeness | Automated schema check | 90% |
| Output format matches schema | Compliance | Automated format check | 100% |
| No hallucinated data | Accuracy | Cross-check every claim against input | 99%+ |
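The "no hallucinated data" check can be partly automated by confirming that names and numbers in the output also appear somewhere in the input. A rough sketch; anything it flags is a candidate for human review, not a confirmed hallucination:

```python
import re

def hallucination_suspects(output_text: str, input_record: dict) -> list:
    """Return proper nouns and numbers in the output that never appear in the input."""
    source = " ".join(str(v) for v in input_record.values()).lower()
    # Capitalized names and standalone figures are the claims most worth cross-checking.
    claims = re.findall(r"[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*|\$?\d[\d,.]*", output_text)
    return [c for c in claims if c.lower() not in source]
```

Outputs with suspects count against the 99%+ threshold only after a human confirms the claim is actually fabricated.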
Email Writer Agent eval
| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Word count within limit | Compliance | Automated count. Email 1 ≤ 80, Email 2 ≤ 90, Email 3 ≤ 30 | 100% |
| No banned phrases | Compliance | Regex check against banned phrase list | 100% |
| No em-dashes | Compliance | Character check for "—" | 100% |
| Subject line ≤ 5 words, lowercase | Compliance | Automated check | 100% |
| First word is not "I" | Compliance | Automated check | 100% |
| Signal reference is from input data (no hallucination) | Accuracy | Cross-check signal in output against input signal field | 99%+ |
| Personalization token present and accurate | Accuracy | Verify the personalized element matches the input data | 95% |
| Email reads naturally (not robotic) | Quality | Human rating 1-5 | 4.0/5 average |
| Each email uses a different opener pattern | Quality | Human review: are the openers genuinely different? | 90% |
| Proof point is specific (named company or stat) | Quality | Check for named company or specific number in Email 2 | 90% |
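Every Compliance row above is binary and cheap to automate. A sketch of such a checker, with the word limits taken from the table; the banned-phrase list and function names are assumptions:

```python
BANNED_PHRASES = ["i hope this finds you well", "quick question", "touching base"]  # illustrative list
WORD_LIMITS = {1: 80, 2: 90, 3: 30}  # max words for Email 1, 2, 3

def check_email(email_number: int, subject: str, body: str) -> dict:
    """Run the binary compliance checks on one email. Every value must be True to pass."""
    words = body.split()
    lowered = body.lower()
    return {
        "word_count_within_limit": len(words) <= WORD_LIMITS[email_number],
        "no_banned_phrases": not any(p in lowered for p in BANNED_PHRASES),
        "no_em_dashes": "\u2014" not in body and "\u2014" not in subject,
        "subject_5_words_lowercase": len(subject.split()) <= 5 and subject == subject.lower(),
        "first_word_not_i": not words or words[0] != "I",
    }
```

Run it on every output; a single False rejects the email before it reaches a prospect.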
Reply Classifier Agent eval
| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Classification matches ground truth | Accuracy | Compare to human-labeled classification | 92%+ |
| Confidence score correlates with accuracy | Accuracy | High-confidence classifications should be more accurate than low-confidence ones | Correlation > 0.7 |
| Low-confidence outputs flagged for human review | Compliance | Outputs below confidence threshold are flagged | 100% |
| All classification categories covered in test set | Completeness | At least 3 examples per category in the test set | N/A (test set design) |
| Edge cases handled (multi-intent replies, sarcasm) | Quality | Human review of edge case set | 80%+ on edge cases |
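The confidence-correlation row can be computed directly from eval results by correlating each reply's confidence with whether its classification matched the human label. A sketch in plain Python; the result-dict field names are assumptions:

```python
from statistics import mean

def confidence_accuracy_correlation(results: list) -> float:
    """Pearson correlation between confidence and correctness. Target per the table: > 0.7.

    Each item looks like {"confidence": 0.87, "predicted": "interested", "label": "interested"}.
    """
    conf = [r["confidence"] for r in results]
    correct = [1.0 if r["predicted"] == r["label"] else 0.0 for r in results]
    mc, ma = mean(conf), mean(correct)
    cov = sum((c - mc) * (a - ma) for c, a in zip(conf, correct))
    spread = (sum((c - mc) ** 2 for c in conf) * sum((a - ma) ** 2 for a in correct)) ** 0.5
    return cov / spread if spread else 0.0
```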
Lead Scorer Agent eval
| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Score matches manual scoring within ±10 points | Accuracy | Compare to human-scored ground truth | 85% |
| Tier assignment matches (Tier 1 vs 2 vs 3) | Accuracy | Compare tier to human assignment | 90% |
| Missing data handled correctly (not scored as 0, noted as missing) | Completeness | Check handling of null/empty input fields | 100% |
| Score reasoning is documented | Completeness | Output includes breakdown per dimension | 95% |
| Anti-ICP flags correctly applied | Accuracy | Competitors, disqualified verticals get negative scores | 100% |
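The score-tolerance and missing-data rows reduce to simple comparisons against human-scored ground truth. A sketch; the ±10-point tolerance mirrors the table, while the field names are assumptions:

```python
def score_within_tolerance(agent_score: float, human_score: float, tolerance: float = 10.0) -> bool:
    """Pass if the agent's score lands within ±10 points of the manual score."""
    return abs(agent_score - human_score) <= tolerance

def missing_data_handled(output: dict, input_record: dict, scored_fields: list) -> bool:
    """Null or empty input fields must be reported as missing, not silently scored as 0."""
    missing = [f for f in scored_fields if input_record.get(f) in (None, "", [])]
    return all(f in output.get("missing_fields", []) for f in missing)
```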
Automated vs Human Evaluation
| Eval type | What it checks | Speed | Cost | Reliability |
|---|---|---|---|---|
| Automated (programmatic) | Rule compliance, format, word count, banned phrases, schema | Instant. Every output | Free (compute only) | 100% for binary rules. Can't assess quality/naturalness |
| LLM-as-judge | Quality, naturalness, relevance. Another LLM evaluates the output | Fast (seconds per eval) | $0.01-0.05 per eval | 70-85% agreement with humans. Good for screening, not final judgment |
| Human review | Everything. The gold standard | Slow (1-3 min per output) | $0.50-2.00 per eval (reviewer time) | Highest. The ground truth all other methods are calibrated against |
When to use each
| Phase | Automated | LLM-as-judge | Human |
|---|---|---|---|
| Every output (production) | Yes (100%) | Optional (for quality screening) | No (too slow/expensive) |
| Daily quality check | Yes (100% of daily output) | Yes (flag low-quality for human review) | 10-20% spot-check |
| Prompt change validation | Yes (full golden set) | Yes (full golden set) | Yes (20-50 examples from golden set) |
| Monthly quality audit | Yes (full regression set) | Optional | Yes (50 production samples) |
Eval layer rules
- Automated checks run on every output. Zero cost, instant feedback. Word count, banned phrases, format, schema. These are binary pass/fail. No reason to skip
- LLM-as-judge is a middle layer. Use a second LLM (can be the same model with a different prompt) to rate quality 1-5. Useful for flagging outputs that pass automated checks but feel robotic or irrelevant (a minimal judge sketch follows this list)
- Human review is the calibration layer. Everything else is calibrated against human judgment. Without regular human review, automated checks may pass outputs that are technically compliant but qualitatively bad
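A minimal LLM-as-judge sketch. The prompt wording, the 1-5 rubric, and the `call_llm` helper are assumptions; swap in your own model client and calibrate the judge against human ratings before trusting its scores.

```python
import json

JUDGE_PROMPT = """You are reviewing a cold email written for the prospect below.
Rate it 1-5 for quality: 5 = a human would send it unedited, 1 = unusable.
Penalize robotic phrasing, irrelevance to the prospect, and generic filler.
Reply with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}

PROSPECT DATA:
{prospect}

EMAIL:
{email}
"""

def judge_output(prospect: dict, email: str, call_llm) -> dict:
    """Ask a second model (or the same model with a different prompt) to rate one output.

    `call_llm` is a hypothetical callable: prompt string in, completion string out.
    Assumes the model returns the JSON it was asked for; add error handling in practice.
    """
    raw = call_llm(JUDGE_PROMPT.format(prospect=json.dumps(prospect), email=email))
    verdict = json.loads(raw)
    verdict["needs_human_review"] = verdict["score"] < 3.5  # pipeline threshold, see below
    return verdict
```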
The Eval Pipeline
How to run evals continuously
Agent produces output
↓
Layer 1: Automated checks (instant)
- Format compliance (schema, word count, structure)
- Rule compliance (banned phrases, em-dashes, subject line rules)
- Accuracy cross-check (proper nouns in output exist in input)
↓
PASS → Layer 2
FAIL → Reject. Log failure. Regenerate or flag for human fix
Layer 2: LLM-as-judge (optional, seconds)
- Quality rating 1-5
- Naturalness check
- Relevance to the prospect's situation
↓
Score ≥ 3.5 → PASS → Output goes to production
Score < 3.5 → Flag for human review
Layer 3: Human spot-check (daily/weekly)
- Random sample of 10-20% of passed outputs
- Human rates 1-5 on quality
- Flags hallucinations, tone issues, irrelevant content
- Results feed back into prompt improvement
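Wired together, the three layers might look like the sketch below. The check, judge, and review-queue pieces are passed in as callables so the sketch stays self-contained; the 3.5 threshold and the 10-20% spot-check rate come from the diagram above.

```python
import random
from typing import Callable

def run_pipeline(
    prospect: dict,
    output: dict,
    automated_checks: Callable[[dict, dict], dict],  # e.g. check_email wrapped for this agent
    judge: Callable[[dict, dict], dict],             # e.g. judge_output from the earlier sketch
    review_queue: list,                              # stand-in for your human-review tooling
    spot_check_rate: float = 0.15,
) -> str:
    # Layer 1: automated checks run on every output. Any failure rejects it.
    checks = automated_checks(output, prospect)
    if not all(checks.values()):
        return "rejected"  # log the failure, then regenerate or flag for a human fix

    # Layer 2: LLM-as-judge. Below 3.5 goes to human review instead of production.
    verdict = judge(prospect, output)
    if verdict["score"] < 3.5:
        review_queue.append((prospect, output, verdict))
        return "flagged"

    # Layer 3: a random 10-20% slice of passed outputs still gets a human spot-check.
    if random.random() < spot_check_rate:
        review_queue.append((prospect, output, verdict))
    return "passed"
```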
Tracking Eval Results Over Time
The eval dashboard
| Metric | What to track | Frequency |
|---|---|---|
| Automated pass rate | % of outputs passing all automated checks | Per batch |
| LLM-as-judge average score | Mean quality score from the judge model | Daily |
| Human review average score | Mean quality score from human reviewers | Weekly |
| Hallucination rate | % of outputs containing fabricated claims | Per batch |
| Rule violation rate | % of outputs violating at least one rule | Per batch |
| Completeness rate | % of outputs with all required fields populated | Per batch |
| Regression test pass rate | % of regression set inputs that produce correct outputs | Per prompt change |
| Golden set accuracy | % of golden set inputs that match the expected output quality | Per prompt change |
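Most of the per-batch rows fall out of one pass over the automated check results. A sketch, assuming each output's checks are stored as a dict of booleans like the earlier examples; adjust the key names to your own checker:

```python
def batch_metrics(check_results: list) -> dict:
    """Aggregate per-output check results (dicts of booleans) into per-batch dashboard numbers."""
    n = len(check_results)
    if n == 0:
        return {}
    return {
        "automated_pass_rate": sum(all(r.values()) for r in check_results) / n,
        "rule_violation_rate": sum(not all(r.values()) for r in check_results) / n,
        "hallucination_rate": sum(not r.get("no_hallucinated_data", True) for r in check_results) / n,
        "completeness_rate": sum(r.get("all_fields_populated", False) for r in check_results) / n,
    }
```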
Quality trend analysis
| Trend | What it means | Action |
|---|---|---|
| Automated pass rate declining | New edge cases or data quality issues | Add failing cases to the regression set. Fix the prompt |
| Human scores declining | Output quality is drifting. Prompt may be stale | Review recent outputs vs 30-day-ago outputs. Update prompt |
| Hallucination rate increasing | Data pipeline change or prompt regression | Check input data quality. Review prompt for hallucination-prone patterns |
| LLM-as-judge and human scores diverging | The judge model is miscalibrated | Re-calibrate the judge prompt against recent human ratings |
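Judge-vs-human divergence is worth computing on a schedule rather than noticing by feel. A sketch that compares the two ratings on the same outputs; the 0.5-point gap and 0.7 correlation cut-offs are assumptions, not fixed rules:

```python
from statistics import mean

def judge_calibration(pairs: list) -> dict:
    """`pairs` holds (judge_score, human_score) tuples for the same outputs, both on the 1-5 scale."""
    judge = [j for j, _ in pairs]
    human = [h for _, h in pairs]
    mj, mh = mean(judge), mean(human)
    gap = mj - mh  # positive = the judge is more generous than humans
    cov = sum((j - mj) * (h - mh) for j, h in pairs)
    spread = (sum((j - mj) ** 2 for j in judge) * sum((h - mh) ** 2 for h in human)) ** 0.5
    corr = cov / spread if spread else 0.0
    return {"mean_gap": gap, "correlation": corr, "recalibrate": abs(gap) > 0.5 or corr < 0.7}
```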
Eval-Driven Prompt Improvement
The improvement cycle
1. Run eval on current prompt (golden set + production sample)
2. Identify weakest criterion (lowest pass rate or score)
3. Analyze failures: what pattern causes the failures?
4. Modify the prompt to address the pattern
5. Re-run eval on the SAME test set
6. Compare results: did the change improve the weak criterion WITHOUT degrading other criteria?
7. If yes: deploy the new prompt
8. If no: revert and try a different approach
Improvement rules
- One prompt change per eval cycle. Changing 5 things at once makes it impossible to know which change helped. Change one instruction, one example, or one rule per iteration
- Always re-run the full golden set after a change. A change that fixes one problem may introduce another. The golden set catches regressions
- Track every prompt version with its eval results. Version 1: accuracy 91%, quality 3.8. Version 2: accuracy 94%, quality 4.1. Version 3: accuracy 93%, quality 4.3. The version history shows the improvement trajectory (a record-keeping sketch follows this list)
- Don't optimize for the test set at the expense of production. If the prompt scores perfectly on the golden set but poorly on production samples, the golden set isn't representative. Add more diverse examples
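Version tracking and regression comparison are easier when every eval run is stored next to the prompt version that produced it. A sketch of that record-keeping; the file location, field names, and zero-tolerance default are assumptions:

```python
import json
from datetime import date
from pathlib import Path

EVAL_LOG = Path("eval/prompt_eval_log.jsonl")  # illustrative location

def log_eval_run(prompt_version: str, results: dict) -> None:
    """Append one eval run, e.g. results = {"accuracy": 0.94, "quality": 4.1}."""
    entry = {"version": prompt_version, "date": str(date.today()), "results": results}
    with EVAL_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def regressions(old: dict, new: dict, tolerance: float = 0.0) -> list:
    """Criteria where the new prompt scores worse than the old one (beyond the tolerance)."""
    return [k for k in old if k in new and new[k] < old[k] - tolerance]
```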
Pre-Deploy Eval Checklist
Before deploying any agent or prompt change to production:
- [ ] Golden set (20+ examples) created with human-verified ideal outputs
- [ ] Edge case set (10+ examples) covering missing data, unusual inputs
- [ ] Automated checks implemented for all compliance rules
- [ ] Agent passes 95%+ on accuracy checks against golden set
- [ ] Agent passes 100% on compliance checks (word count, banned phrases, format)
- [ ] Agent passes 90%+ on completeness checks
- [ ] Human review of 20+ outputs scores ≥ 4.0/5 average quality
- [ ] Hallucination rate < 2% on golden set
- [ ] Regression set passes at 100% (no re-introduced old failures)
- [ ] Eval results documented with prompt version number
Anti-Pattern Check
- No eval before deployment. "The prompt looks good in testing" is not an eval. Build a golden set. Run automated checks. Score with human reviewers. Then deploy
- Eval set too small (5 examples). 5 examples means one failure changes the pass rate by 20%. Not meaningful. Minimum 20 examples for any eval
- No automated checks on production output. Automated checks are free and instant. Every output should pass word count, banned phrases, and format checks before being used. There's no reason to skip this
- Evaluating accuracy without ground truth. "The output seems correct" is not an accuracy check. Compare every factual claim in the output against the input data or a verified source. Ground truth or it's not an eval
- Golden set never updated. The golden set from 3 months ago doesn't include the new ICP segment, the updated messaging, or the recent edge cases. Update quarterly with fresh real-world examples
- Optimizing for test set at the expense of production. Agent passes 100% on the golden set but produces mediocre output on real prospects. The golden set is too easy or not representative. Add harder, more diverse examples
- No regression testing. A prompt change fixes one problem and re-introduces two old ones. Without a regression set, old failures silently return. Add every failure to the regression set and re-run after every change
- LLM-as-judge without human calibration. The judge model rates everything 4.5/5. Humans rate the same outputs 3.2/5. The judge is miscalibrated. Calibrate the judge prompt by comparing its ratings to human ratings on 50+ examples