Agent Evaluation
Agent evaluation measures whether an AI agent produces output that meets a defined quality bar. Without evaluation, you're guessing whether the agent works. With evaluation, you know: the agent passes 94% of accuracy checks, violates zero banned-phrase rules, and produces output rated 4.2/5 by human reviewers. Evaluation turns "it seems fine" into a measurable quality system.
The principle: define what "good" means before building the agent. Write the eval before the prompt. If you can't describe what a good output looks like in measurable terms, you can't evaluate whether the agent produces it.
The Evaluation Framework
4 dimensions of agent quality
| Dimension | What it measures | How to measure | Non-negotiable? |
|---|---|---|---|
| Accuracy | Are the facts correct? No hallucinations, no fabricated data | Cross-check output against input data. Human spot-check | Yes. 95%+ required |
| Compliance | Does output follow all rules? Word limits, banned phrases, format, style | Automated rule checker | Yes. 100% required |
| Completeness | Are all required fields populated? No missing data | Automated schema check | Yes. 90%+ required |
| Quality | Is the output genuinely good? Would a human use it without editing? | Human rating (1-5 scale) | Target ≥ 4.0 average |
Dimension priority
Accuracy > Compliance > Completeness > Quality. In that order.
- An accurate output that's slightly too long (compliance miss) is fixable
- A compliant output that contains fabricated data (accuracy miss) is dangerous
- An output that's missing one field (completeness miss) is incomplete but safe
- A low-quality output that's accurate and compliant just needs prompt improvement
Building Evals for GTM Agents
Eval structure
Every agent eval has three components:
1. TEST SET: A curated set of inputs with known-good outputs
2. EVAL CRITERIA: Specific, measurable checks applied to each output
3. SCORING: How to aggregate individual checks into an overall score
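A minimal sketch of how these three components might be represented in code. The `EvalCase`, `Criterion`, and `score` names and the field layout are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Callable

# 1. TEST SET: one entry pairs an input with its human-verified ideal output.
@dataclass
class EvalCase:
    case_id: str
    input: dict             # e.g. the prospect record handed to the agent
    expected_output: dict   # the human-written "ground truth" output

# 2. EVAL CRITERIA: a named, measurable check applied to each output.
@dataclass
class Criterion:
    name: str
    check: Callable[[dict, "EvalCase"], bool]  # pass/fail for one output
    threshold: float                           # required pass rate, e.g. 0.95

# 3. SCORING: aggregate individual checks into per-criterion pass rates.
def score(outputs: dict, cases: list, criteria: list) -> dict:
    results = {}
    for c in criteria:
        passed = sum(c.check(outputs[case.case_id], case) for case in cases)
        results[c.name] = passed / len(cases)
    return results
```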
Creating the test set
| Test set type | What it contains | Size | When to use |
|---|---|---|---|
| Golden set | Inputs with human-verified ideal outputs. The "ground truth" | 20-50 examples | Primary eval. Used for every prompt change |
| Edge case set | Inputs that represent unusual or difficult scenarios | 10-20 examples | Catches failure modes. Run after golden set passes |
| Regression set | Inputs where the agent previously failed (fixed and added here) | Grows over time | Prevents re-introducing old bugs |
| Production sample | Random sample from real production runs | 20-50 per eval cycle | Measures real-world quality, not just test-set quality |
Test set rules:
- Build the golden set from real data. Export 20-50 real prospect records. Have a human write the ideal output for each. This is the ground truth the agent is measured against
- Include bad data in the test set. Prospects with missing fields, incomplete LinkedIn profiles, companies with no funding data. The agent must handle bad input gracefully
- Add every failure to the regression set. When the agent produces a bad output in production, add that input + the correct output to the regression set (see the sketch after this list). The test set grows stronger over time
- 20 examples minimum for any meaningful eval. Below 20, individual outliers dominate the score. Above 50, you get diminishing returns (unless the agent handles very diverse tasks)
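The regression-set rule above can be as lightweight as appending the failing input, with its human-corrected output, to a file that every eval run reads. A sketch; the file location and field names are assumptions:

```python
import json
from pathlib import Path

REGRESSION_SET = Path("eval/regression_set.jsonl")  # illustrative location

def add_to_regression_set(case_id: str, agent_input: dict, corrected_output: dict, note: str) -> None:
    """Append a production failure (plus its human-corrected output) to the regression set."""
    record = {
        "case_id": case_id,
        "input": agent_input,
        "expected_output": corrected_output,
        "note": note,  # e.g. "fabricated a funding round for a bootstrapped company"
    }
    with REGRESSION_SET.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```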
Eval Criteria by Agent Type
Research Agent eval
| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Company name correct | Accuracy | Exact match against input | 100% |
| Employee count within 20% of actual | Accuracy | Compare to enrichment ground truth | 90% |
| Funding data correct (round, amount, date) | Accuracy | Cross-check against Crunchbase | 95% |
| Industry correct | Accuracy | Match against ground truth | 95% |
| Signals are real and verifiable | Accuracy | Human spot-check: can the signal be verified? | 90% |
| Problem hypothesis is grounded in evidence | Quality | Human rating: is the hypothesis supported by the data, not fabricated? | 4.0/5 average |
| All required fields populated | Completeness | Automated schema check | 90% |
| Output format matches schema | Compliance | Automated format check | 100% |
| No hallucinated data | Accuracy | Cross-check every claim against input | 99%+ |
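The "no hallucinated data" check can be partly automated by confirming that names and numbers in the output also appear somewhere in the input. A rough sketch; anything it flags is a candidate for human review, not a confirmed hallucination:

```python
import re

def hallucination_suspects(output_text: str, input_record: dict) -> list:
    """Return proper nouns and numbers in the output that never appear in the input."""
    source = " ".join(str(v) for v in input_record.values()).lower()
    # Capitalized names and standalone figures are the claims most worth cross-checking.
    claims = re.findall(r"[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*|\$?\d[\d,.]*", output_text)
    return [c for c in claims if c.lower() not in source]
```

Outputs with suspects count against the 99%+ threshold only after a human confirms the claim is actually fabricated.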
Email Writer Agent eval
| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Word count within limit | Compliance | Automated count. Email 1 ≤ 80, Email 2 ≤ 90, Email 3 ≤ 30 | 100% |
| No banned phrases | Compliance | Regex check against banned phrase list | 100% |
| No em-dashes | Compliance | Character check for "—" | 100% |
| Subject line ≤ 5 words, lowercase | Compliance | Automated check | 100% |
| First word is not "I" | Compliance | Automated check | 100% |
| Signal reference is from input data (no hallucination) | Accuracy | Cross-check signal in output against input signal field | 99%+ |
| Personalization token present and accurate | Accuracy | Verify the personalized element matches the input data | 95% |
| Email reads naturally (not robotic) | Quality | Human rating 1-5 | 4.0/5 average |
| Each email uses a different opener pattern | Quality | Human review: are the openers genuinely different? | 90% |
| Proof point is specific (named company or stat) | Quality | Check for named company or specific number in Email 2 | 90% |
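Every Compliance row above is binary and cheap to automate. A sketch of such a checker, with the word limits taken from the table; the banned-phrase list and function names are assumptions:

```python
BANNED_PHRASES = ["i hope this finds you well", "quick question", "touching base"]  # illustrative list
WORD_LIMITS = {1: 80, 2: 90, 3: 30}  # max words for Email 1, 2, 3

def check_email(email_number: int, subject: str, body: str) -> dict:
    """Run the binary compliance checks on one email. Every value must be True to pass."""
    words = body.split()
    lowered = body.lower()
    return {
        "word_count_within_limit": len(words) <= WORD_LIMITS[email_number],
        "no_banned_phrases": not any(p in lowered for p in BANNED_PHRASES),
        "no_em_dashes": "\u2014" not in body and "\u2014" not in subject,
        "subject_5_words_lowercase": len(subject.split()) <= 5 and subject == subject.lower(),
        "first_word_not_i": not words or words[0] != "I",
    }
```

Run it on every output; a single False rejects the email before it reaches a prospect.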
Reply Classifier Agent eval
| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Classification matches ground truth | Accuracy | Compare to human-labeled classification | 92%+ |
| Confidence score correlates with accuracy | Accuracy | High-confidence classifications should be more accurate than low-confidence ones | Correlation > 0.7 |
| Low-confidence outputs flagged for human review | Compliance | Outputs below confidence threshold are flagged | 100% |
| All classification categories covered in test set | Completeness | At least 3 examples per category in the test set | N/A (test set design) |
| Edge cases handled (multi-intent replies, sarcasm) | Quality | Human review of edge case set | 80%+ on edge cases |
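The confidence-correlation row can be computed directly from eval results by correlating each reply's confidence with whether its classification matched the human label. A sketch in plain Python; the result-dict field names are assumptions:

```python
from statistics import mean

def confidence_accuracy_correlation(results: list) -> float:
    """Pearson correlation between confidence and correctness. Target per the table: > 0.7.

    Each item looks like {"confidence": 0.87, "predicted": "interested", "label": "interested"}.
    """
    conf = [r["confidence"] for r in results]
    correct = [1.0 if r["predicted"] == r["label"] else 0.0 for r in results]
    mc, ma = mean(conf), mean(correct)
    cov = sum((c - mc) * (a - ma) for c, a in zip(conf, correct))
    spread = (sum((c - mc) ** 2 for c in conf) * sum((a - ma) ** 2 for a in correct)) ** 0.5
    return cov / spread if spread else 0.0
```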
Lead Scorer Agent eval
| Criterion | Check type | How to measure | Pass threshold |
|---|---|---|---|
| Score matches manual scoring within ±10 points | Accuracy | Compare to human-scored ground truth | 85% |
| Tier assignment matches (Tier 1 vs 2 vs 3) | Accuracy | Compare tier to human assignment | 90% |
| Missing data handled correctly (not scored as 0, noted as missing) | Completeness | Check handling of null/empty input fields | 100% |
| Score reasoning is documented | Completeness | Output includes breakdown per dimension | 95% |
| Anti-ICP flags correctly applied | Accuracy | Competitors, disqualified verticals get negative scores | 100% |
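The score-tolerance and missing-data rows reduce to simple comparisons against human-scored ground truth. A sketch; the ±10-point tolerance mirrors the table, while the field names are assumptions:

```python
def score_within_tolerance(agent_score: float, human_score: float, tolerance: float = 10.0) -> bool:
    """Pass if the agent's score lands within ±10 points of the manual score."""
    return abs(agent_score - human_score) <= tolerance

def missing_data_handled(output: dict, input_record: dict, scored_fields: list) -> bool:
    """Null or empty input fields must be reported as missing, not silently scored as 0."""
    missing = [f for f in scored_fields if input_record.get(f) in (None, "", [])]
    return all(f in output.get("missing_fields", []) for f in missing)
```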
Automated vs Human Evaluation
| Eval type | What it checks | Speed | Cost | Reliability |
|---|---|---|---|---|
| Automated (programmatic) | Rule compliance, format, word count, banned phrases, schema | Instant. Every output | Free (compute only) | 100% for binary rules. Can't assess quality/naturalness |
| LLM-as-judge | Quality, naturalness, relevance. Another LLM evaluates the output | Fast (seconds per eval) | $0.01-0.05 per eval | 70-85% agreement with humans. Good for screening, not final judgment |
| Human review | Everything. The gold standard | Slow (1-3 min per output) | $0.50-2.00 per eval (reviewer time) | Highest. The ground truth all other methods are calibrated against |
When to use each
| Phase | Automated | LLM-as-judge | Human |
|---|---|---|---|
| Every output (production) | Yes (100%) | Optional (for quality screening) | No (too slow/expensive) |
| Daily quality check | Yes (100% of daily output) | Yes (flag low-quality for human review) | 10-20% spot-check |
| Prompt change validation | Yes (full golden set) | Yes (full golden set) | Yes (20-50 examples from golden set) |
| Monthly quality audit | Yes (full regression set) | Optional | Yes (50 production samples) |
Eval layer rules
- Automated checks run on every output. Zero cost, instant feedback. Word count, banned phrases, format, schema. These are binary pass/fail. No reason to skip
- LLM-as-judge is a middle layer. Use a second LLM (can be the same model with a different prompt) to rate quality 1-5. Useful for flagging outputs that pass automated checks but feel robotic or irrelevant (a minimal judge sketch follows this list)
- Human review is the calibration layer. Everything else is calibrated against human judgment. Without regular human review, automated checks may pass outputs that are technically compliant but qualitatively bad
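A minimal LLM-as-judge sketch. The prompt wording, the 1-5 rubric, and the `call_llm` helper are assumptions; swap in your own model client and calibrate the judge against human ratings before trusting its scores.

```python
import json

JUDGE_PROMPT = """You are reviewing a cold email written for the prospect below.
Rate it 1-5 for quality: 5 = a human would send it unedited, 1 = unusable.
Penalize robotic phrasing, irrelevance to the prospect, and generic filler.
Reply with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}

PROSPECT DATA:
{prospect}

EMAIL:
{email}
"""

def judge_output(prospect: dict, email: str, call_llm) -> dict:
    """Ask a second model (or the same model with a different prompt) to rate one output.

    `call_llm` is a hypothetical callable: prompt string in, completion string out.
    Assumes the model returns the JSON it was asked for; add error handling in practice.
    """
    raw = call_llm(JUDGE_PROMPT.format(prospect=json.dumps(prospect), email=email))
    verdict = json.loads(raw)
    verdict["needs_human_review"] = verdict["score"] < 3.5  # pipeline threshold, see below
    return verdict
```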
The Eval Pipeline
How to run evals continuously
Agent produces output
↓
Layer 1: Automated checks (instant)
- Format compliance (schema, word count, structure)
- Rule compliance (banned phrases, em-dashes, subject line rules)
- Accuracy cross-check (proper nouns in output exist in input)
↓
PASS → Layer 2
FAIL → Reject. Log failure. Regenerate or flag for human fix
Layer 2: LLM-as-judge (optional, seconds)
- Quality rating 1-5
- Naturalness check
- Relevance to the prospect's situation
↓
Score ≥ 3.5 → PASS → Output goes to production
Score < 3.5 → Flag for human review
Layer 3: Human spot-check (daily/weekly)
- Random sample of 10-20% of passed outputs
- Human rates 1-5 on quality
- Flags hallucinations, tone issues, irrelevant content
- Results feed back into prompt improvement
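Wired together, the three layers might look like the sketch below. The check, judge, and review-queue pieces are passed in as callables so the sketch stays self-contained; the 3.5 threshold and the 10-20% spot-check rate come from the diagram above.

```python
import random
from typing import Callable

def run_pipeline(
    prospect: dict,
    output: dict,
    automated_checks: Callable[[dict, dict], dict],  # e.g. check_email wrapped for this agent
    judge: Callable[[dict, dict], dict],             # e.g. judge_output from the earlier sketch
    review_queue: list,                              # stand-in for your human-review tooling
    spot_check_rate: float = 0.15,
) -> str:
    # Layer 1: automated checks run on every output. Any failure rejects it.
    checks = automated_checks(output, prospect)
    if not all(checks.values()):
        return "rejected"  # log the failure, then regenerate or flag for a human fix

    # Layer 2: LLM-as-judge. Below 3.5 goes to human review instead of production.
    verdict = judge(prospect, output)
    if verdict["score"] < 3.5:
        review_queue.append((prospect, output, verdict))
        return "flagged"

    # Layer 3: a random 10-20% slice of passed outputs still gets a human spot-check.
    if random.random() < spot_check_rate:
        review_queue.append((prospect, output, verdict))
    return "passed"
```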
Tracking Eval Results Over Time
The eval dashboard
| Metric | What to track | Frequency |
|---|---|---|
| Automated pass rate | % of outputs passing all automated checks | Per batch |
| LLM-as-judge average score | Mean quality score from the judge model | Daily |
| Human review average score | Mean quality score from human reviewers | Weekly |
| Hallucination rate | % of outputs containing fabricated claims | Per batch |
| Rule violation rate | % of outputs violating at least one rule | Per batch |
| Completeness rate | % of outputs with all required fields populated | Per batch |
| Regression test pass rate | % of regression set inputs that produce correct outputs | Per prompt change |
| Golden set accuracy | % of golden set inputs that match the expected output quality | Per prompt change |
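Most of the per-batch rows fall out of one pass over the automated check results. A sketch, assuming each output's checks are stored as a dict of booleans like the earlier examples; adjust the key names to your own checker:

```python
def batch_metrics(check_results: list) -> dict:
    """Aggregate per-output check results (dicts of booleans) into per-batch dashboard numbers."""
    n = len(check_results)
    if n == 0:
        return {}
    return {
        "automated_pass_rate": sum(all(r.values()) for r in check_results) / n,
        "rule_violation_rate": sum(not all(r.values()) for r in check_results) / n,
        "hallucination_rate": sum(not r.get("no_hallucinated_data", True) for r in check_results) / n,
        "completeness_rate": sum(r.get("all_fields_populated", False) for r in check_results) / n,
    }
```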
Quality trend analysis
| Trend | What it means | Action |
|---|---|---|
| Automated pass rate declining | New edge cases or data quality issues | Add failing cases to the regression set. Fix the prompt |
| Human scores declining | Output quality is drifting. Prompt may be stale | Review recent outputs vs 30-day-ago outputs. Update prompt |
| Hallucination rate increasing | Data pipeline change or prompt regression | Check input data quality. Review prompt for hallucination-prone patterns |
| LLM-as-judge and human scores diverging | The judge model is miscalibrated | Re-calibrate the judge prompt against recent human ratings |
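Judge-vs-human divergence is worth computing on a schedule rather than noticing by feel. A sketch that compares the two ratings on the same outputs; the 0.5-point gap and 0.7 correlation cut-offs are assumptions, not fixed rules:

```python
from statistics import mean

def judge_calibration(pairs: list) -> dict:
    """`pairs` holds (judge_score, human_score) tuples for the same outputs, both on the 1-5 scale."""
    judge = [j for j, _ in pairs]
    human = [h for _, h in pairs]
    mj, mh = mean(judge), mean(human)
    gap = mj - mh  # positive = the judge is more generous than humans
    cov = sum((j - mj) * (h - mh) for j, h in pairs)
    spread = (sum((j - mj) ** 2 for j in judge) * sum((h - mh) ** 2 for h in human)) ** 0.5
    corr = cov / spread if spread else 0.0
    return {"mean_gap": gap, "correlation": corr, "recalibrate": abs(gap) > 0.5 or corr < 0.7}
```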
Eval-Driven Prompt Improvement
The improvement cycle
1. Run eval on current prompt (golden set + production sample)
2. Identify weakest criterion (lowest pass rate or score)
3. Analyze failures: what pattern causes the failures?
4. Modify the prompt to address the pattern
5. Re-run eval on the SAME test set
6. Compare results: did the change improve the weak criterion WITHOUT degrading other criteria?
7. If yes: deploy the new prompt
8. If no: revert and try a different approach
Improvement rules
- One prompt change per eval cycle. Changing 5 things at once makes it impossible to know which change helped. Change one instruction, one example, or one rule per iteration
- Always re-run the full golden set after a change. A change that fixes one problem may introduce another. The golden set catches regressions
- Track every prompt version with its eval results. Version 1: accuracy 91%, quality 3.8. Version 2: accuracy 94%, quality 4.1. Version 3: accuracy 93%, quality 4.3. The version history shows the improvement trajectory (a record-keeping sketch follows this list)
- Don't optimize for the test set at the expense of production. If the prompt scores perfectly on the golden set but poorly on production samples, the golden set isn't representative. Add more diverse examples
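Version tracking and regression comparison are easier when every eval run is stored next to the prompt version that produced it. A sketch of that record-keeping; the file location, field names, and zero-tolerance default are assumptions:

```python
import json
from datetime import date
from pathlib import Path

EVAL_LOG = Path("eval/prompt_eval_log.jsonl")  # illustrative location

def log_eval_run(prompt_version: str, results: dict) -> None:
    """Append one eval run, e.g. results = {"accuracy": 0.94, "quality": 4.1}."""
    entry = {"version": prompt_version, "date": str(date.today()), "results": results}
    with EVAL_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def regressions(old: dict, new: dict, tolerance: float = 0.0) -> list:
    """Criteria where the new prompt scores worse than the old one (beyond the tolerance)."""
    return [k for k in old if k in new and new[k] < old[k] - tolerance]
```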
Pre-Deploy Eval Checklist
Before deploying any agent or prompt change to production:
- [ ] Golden set (20+ examples) created with human-verified ideal outputs
- [ ] Edge case set (10+ examples) covering missing data, unusual inputs
- [ ] Automated checks implemented for all compliance rules
- [ ] Agent passes 95%+ on accuracy checks against golden set
- [ ] Agent passes 100% on compliance checks (word count, banned phrases, format)
- [ ] Agent passes 90%+ on completeness checks
- [ ] Human review of 20+ outputs scores ≥ 4.0/5 average quality
- [ ] Hallucination rate < 2% on golden set
- [ ] Regression set passes at 100% (no re-introduced old failures)
- [ ] Eval results documented with prompt version number
Anti-Pattern Check
- No eval before deployment. "The prompt looks good in testing" is not an eval. Build a golden set. Run automated checks. Score with human reviewers. Then deploy
- Eval set too small (5 examples). 5 examples means one failure changes the pass rate by 20%. Not meaningful. Minimum 20 examples for any eval
- No automated checks on production output. Automated checks are free and instant. Every output should pass word count, banned phrases, and format checks before being used. There's no reason to skip this
- Evaluating accuracy without ground truth. "The output seems correct" is not an accuracy check. Compare every factual claim in the output against the input data or a verified source. Ground truth or it's not an eval
- Golden set never updated. The golden set from 3 months ago doesn't include the new ICP segment, the updated messaging, or the recent edge cases. Update quarterly with fresh real-world examples
- Optimizing for test set at the expense of production. Agent passes 100% on the golden set but produces mediocre output on real prospects. The golden set is too easy or not representative. Add harder, more diverse examples
- No regression testing. A prompt change fixes one problem and re-introduces two old ones. Without a regression set, old failures silently return. Add every failure to the regression set and re-run after every change
- LLM-as-judge without human calibration. The judge model rates everything 4.5/5. Humans rate the same outputs 3.2/5. The judge is miscalibrated. Calibrate the judge prompt by comparing its ratings to human ratings on 50+ examples