LLM Eval for Cold Email
LLM eval for cold email is a systematic quality check applied to every AI-generated cold email before it reaches a prospect. The eval combines automated rule checks (word count, banned phrases, format) with LLM-as-judge scoring (tone, relevance, naturalness) and human spot-checks. The output is a pass/fail decision per email plus a quality score that tracks agent performance over time.
The principle: a cold email eval should mirror what a skilled SDR manager would check in a 15-second review. Is it short? Is it relevant? Does it sound human? Is the signal real? Is the ask appropriate? Codify that review into automated + LLM + human layers.
The 3-Layer Eval
Layer 1: Automated rule checks (every email, instant)
| Check | Rule | How to implement | Pass/fail |
|---|---|---|---|
| Word count (Email 1) | ≤ 80 words | `len(email.split()) <= 80` | Fail if over |
| Word count (Email 2) | ≤ 90 words | Same | Fail if over |
| Word count (Email 3) | ≤ 30 words | Same | Fail if over |
| Banned phrases | No "leveraging", "synergies", "unlock", "in today's fast-paced world", etc. | Regex match against banned list | Fail if any match |
| Em-dashes | No "—" in body | `"—" not in email_body` | Fail if present |
| Subject line length | ≤ 5 words | `len(subject.split()) <= 5` | Fail if over |
| Subject line case | All lowercase (except proper nouns) | Check for uppercase words | Warn if Title Case |
| First word | Not "I" | `not email_body.strip().startswith("I ")` | Fail if starts with "I" |
| Calendar link (Email 1) | No Calendly/calendar link in Email 1 | URL pattern match | Fail if present |
| "Demo" mention | No "demo" anywhere | `"demo" not in email_body.lower()` | Fail if present |
| Personalization present | At least one merge variable rendered (not blank) | Check for unrendered `{{tokens}}` or blank personalization | Fail if no personalization |
| Email signature | First name only; no elaborate signature block | Check for excessive signature content | Warn if > 3 lines |
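As a rough sketch (Python; the field names, banned list, and URL pattern are illustrative, not a reference implementation), Layer 1 can be a single function that returns the list of violated rules:

```python
import re

BANNED_PHRASES = ["leveraging", "synergies", "unlock",
                  "in today's fast-paced world"]            # extend per your list
CALENDAR_URL = re.compile(r"calendly\.com|cal\.com", re.I)  # illustrative pattern
UNRENDERED = re.compile(r"\{\{.*?\}\}")                     # merge token left blank
MAX_WORDS = {1: 80, 2: 90, 3: 30}

def rule_check(subject: str, body: str, email_number: int = 1) -> list[str]:
    """Layer 1: return the list of violated rules; empty list means pass."""
    failures = []
    if len(body.split()) > MAX_WORDS[email_number]:
        failures.append(f"word_count > {MAX_WORDS[email_number]}")
    if any(p in body.lower() for p in BANNED_PHRASES):
        failures.append("banned_phrase")
    if "—" in body:
        failures.append("em_dash")
    if len(subject.split()) > 5:
        failures.append("subject_too_long")
    if body.strip().startswith("I "):
        failures.append("starts_with_I")
    if email_number == 1 and CALENDAR_URL.search(body):
        failures.append("calendar_link_in_email_1")
    if "demo" in body.lower():
        failures.append("demo_mention")
    if UNRENDERED.search(body):
        failures.append("unrendered_merge_variable")
    return failures
```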
Layer 2: LLM-as-judge (every email or flagged sample)
Use a second LLM call to rate the email on qualitative dimensions the automated checks can't measure.
Judge prompt:

```
You are a cold email quality reviewer for B2B SaaS outbound.
Rate this email on 5 dimensions, each 1-5.

Dimensions:

1. RELEVANCE: Does the personalization connect to the prospect's
   actual situation based on the input data?
   (5 = clearly relevant, 1 = generic or irrelevant)
2. NATURALNESS: Does it read like a human peer wrote it, not a
   bot or a marketer?
   (5 = completely natural, 1 = obviously AI-generated)
3. SIGNAL QUALITY: Is the signal (the reason for reaching out)
   specific, recent, and verifiable?
   (5 = specific verifiable signal, 1 = no signal or generic)
4. ASK APPROPRIATENESS: Is the CTA reasonable? (15 min, not 45.
   No "demo". No calendar link in Email 1.)
   (5 = perfect ask, 1 = aggressive or inappropriate)
5. OVERALL: Would you send this email as-is to a VP at a
   target account?
   (5 = send immediately, 1 = would never send)

Input data provided to the email writer:
{input_data}

Email to evaluate:
Subject: {subject}
Body: {body}

Respond with JSON:
{
  "relevance": N,
  "naturalness": N,
  "signal_quality": N,
  "ask_appropriateness": N,
  "overall": N,
  "issues": ["list of specific issues found"],
  "pass": true/false (pass if overall >= 4 AND no dimension < 3)
}
```
Judge rules:
- Use Sonnet for the judge (fast, accurate enough for scoring). Run the judge as a fresh call with its own prompt, not in the same context that generated the email
- Pass threshold: overall ≥ 4 AND no individual dimension < 3. A 5/5 on naturalness doesn't compensate for a 2/5 on relevance
- Log all judge scores. The trend over time shows whether the generator prompt is improving or degrading
- Calibrate the judge monthly against human ratings. If the judge consistently scores 0.5+ points higher than human reviewers, adjust the judge prompt
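A minimal sketch of the judge call, assuming the Anthropic Python SDK and a `JUDGE_PROMPT` string holding the template above (the model ID is an assumption; substitute your current Sonnet version):

```python
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

DIMENSIONS = ["relevance", "naturalness", "signal_quality", "ask_appropriateness"]

def judge_email(input_data: str, subject: str, body: str) -> dict:
    # Plain .replace() rather than .format(): the literal JSON braces
    # in the prompt template would otherwise break str.format
    prompt = (JUDGE_PROMPT.replace("{input_data}", input_data)
                          .replace("{subject}", subject)
                          .replace("{body}", body))
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumption: substitute your Sonnet model ID
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the judge returns bare JSON; strip surrounding text if it doesn't
    scores = json.loads(response.content[0].text)
    # Enforce the threshold in code instead of trusting the judge's own "pass"
    scores["pass"] = scores["overall"] >= 4 and all(scores[d] >= 3 for d in DIMENSIONS)
    return scores
```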
Layer 3: Human spot-check (10-20% sample)
A human reviewer checks a random sample of emails that passed Layers 1 and 2.
Human review checklist (15-20 seconds per email):
- [ ] Signal is real (could you verify it with a quick search?)
- [ ] Personalization connects to a relevant problem (not just trivia)
- [ ] Tone sounds like a peer, not a vendor or a bot
- [ ] You would actually send this email to a real prospect
- [ ] No claims about the prospect that aren't in the input data
Human review rules:
- Review 10-20% of emails that passed automated + LLM checks
- Rate each 1-5 on the same dimensions as the LLM judge
- Compare human scores to LLM judge scores. They should agree within ±0.5 on average. If they diverge, recalibrate the judge
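A small sketch of the sampling and calibration math (record shapes are illustrative):

```python
import random

def sample_for_review(passed: list, rate: float = 0.15) -> list:
    """Random 10-20% sample of emails that cleared Layers 1 and 2."""
    if not passed:
        return []
    return random.sample(passed, max(1, round(len(passed) * rate)))

def judge_human_gap(pairs: list[tuple[float, float]]) -> float:
    """Mean (judge - human) overall-score gap; recalibrate if it exceeds ±0.5."""
    return sum(judge - human for judge, human in pairs) / len(pairs)
```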
Hallucination-Specific Checks
Hallucination is the highest-risk failure mode in AI-generated cold email. One fabricated claim ("congrats on the acquisition" when no acquisition happened) permanently damages the relationship.
Hallucination detection methods
| Method | What it catches | Implementation |
|---|---|---|
| Proper noun cross-check | Company names, person names, event names in the email that aren't in the input data | Extract all proper nouns from output. Check each against input data fields. Flag unmatched nouns |
| Funding claim verification | "Series B" or "$45M raise" that doesn't match input | Extract funding references. Compare to input funding field |
| Metric claim verification | "12% improvement" or "3x pipeline" not from input | Extract numbers with context. Check against input proof points |
| Event reference check | "Your talk at SaaStr" or "the product launch" not from input | Extract event references. Compare to input signals |
| Role/title verification | "As Head of RevOps" when input says "Director of Sales" | Compare role references to input title field |
Hallucination rules
- Cross-check every proper noun. If the email mentions a company name, person name, or event that isn't in the input data, it's fabricated. Flag and reject
- Cross-check every number. If the email cites "3x improvement" and the input data contains no such number, it's fabricated
- "Not found" is safer than guessing. If the input data is missing a field, the email should skip that element, not fill it with a plausible-sounding claim
- Hallucination rate target: < 2% per batch. If more than 2% of emails contain fabricated claims, the generator prompt needs anti-hallucination reinforcement
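A sketch of the first two cross-checks using deliberately blunt heuristics; a production pipeline might substitute a proper NER model for the noun extraction:

```python
import re

def unmatched_proper_nouns(body: str, input_data: dict) -> list[str]:
    """Capitalized words (excluding sentence starts) absent from the input data."""
    haystack = " ".join(str(v) for v in input_data.values()).lower()
    flagged = set()
    for sentence in re.split(r"[.!?]\s+", body):
        for word in sentence.split()[1:]:        # skip the sentence-initial word
            token = word.strip(",.!?()'\"")
            if token.istitle() and token.lower() not in haystack:
                flagged.add(token)
    return sorted(flagged)

def unmatched_numbers(body: str, input_data: dict) -> list[str]:
    """Numeric claims ($45M, 3x, 12%) with no counterpart in the input data."""
    haystack = " ".join(str(v) for v in input_data.values())
    claims = re.findall(r"\$?\d[\d,.]*[%xXMBkK]?", body)
    return sorted({c for c in claims if c not in haystack})
```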
Eval Scoring and Thresholds
Per-email scoring
```
Automated checks: PASS / FAIL (binary)
  → Must pass all automated checks to proceed

LLM judge score: 1-5 per dimension, 5 dimensions
  → Overall = the judge's OVERALL rating (it should track the
    average of the other four dimensions)
  → Pass: overall ≥ 4.0 AND no dimension < 3.0

Human review (if sampled): 1-5 per dimension
  → Same thresholds as the LLM judge

Final decision:
  Automated PASS + Judge PASS → Send
  Automated PASS + Judge FAIL → Regenerate with feedback
  Automated FAIL → Regenerate (don't judge, just fix the rule violation)
```
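A sketch of that gating logic, reusing the hypothetical `rule_check` and `judge_email` helpers from the earlier sketches:

```python
def evaluate_email(subject: str, body: str, input_data: dict,
                   email_number: int = 1) -> dict:
    # Layer 1 first: never spend judge tokens on a rule violation
    rule_failures = rule_check(subject, body, email_number)
    if rule_failures:
        return {"decision": "regenerate", "layer": "rules", "reason": rule_failures}

    scores = judge_email(str(input_data), subject, body)
    if not scores["pass"]:
        # The judge's "issues" list doubles as regeneration feedback
        return {"decision": "regenerate", "layer": "judge", "reason": scores["issues"]}

    return {"decision": "send", "scores": scores}
```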
Batch-level scoring
| Metric | How to calculate | Target |
|---|---|---|
| Automated pass rate | % of emails passing all automated checks | > 95% |
| Judge pass rate | % of emails scoring ≥ 4.0 overall | > 85% |
| Human approval rate | % of human-reviewed emails approved | > 90% |
| Hallucination rate | % of emails with fabricated claims | < 2% |
| Average overall score | Mean LLM judge overall score across batch | ≥ 4.0 |
| Regeneration rate | % of emails that needed regeneration (failed auto or judge) | < 15% |
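A sketch of the batch rollup over the per-email results from the gating sketch above (hallucination rate would come from the cross-checks, tracked separately):

```python
def batch_metrics(results: list[dict]) -> dict:
    """Roll per-email decisions up into the batch-level metrics above."""
    n = len(results)
    sent = [r for r in results if r["decision"] == "send"]
    rule_fails = sum(r.get("layer") == "rules" for r in results)
    return {
        "automated_pass_rate": (n - rule_fails) / n,
        "judge_pass_rate": len(sent) / max(n - rule_fails, 1),
        "regeneration_rate": (n - len(sent)) / n,
        "avg_overall": sum(r["scores"]["overall"] for r in sent) / max(len(sent), 1),
    }
```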
Eval-Driven Prompt Improvement
Finding patterns in failures
1. Export all failed emails from the last batch
2. Categorize failures:
- Rule violations (which rule? how often?)
- Low relevance scores (is the personalization generic?)
- Low naturalness scores (what makes it sound like AI?)
- Hallucinations (what was fabricated? why?)
3. Identify the top 2 failure categories
4. Modify the generator prompt to address those categories
5. Re-run the eval on the golden set
6. Deploy if scores improve without regression
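Steps 2 and 3 reduce to a frequency count when each failed email carries a `reason` list, as in the gating sketch:

```python
from collections import Counter

def top_failure_categories(failed: list[dict], k: int = 2) -> list[tuple[str, int]]:
    """Tally failure reasons across a batch; fix the top-k categories first."""
    counts = Counter(reason for email in failed for reason in email["reason"])
    return counts.most_common(k)
```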
Common failure patterns and prompt fixes
| Pattern | Evidence | Prompt fix |
|---|---|---|
| Emails are too long | 30% fail word count check | Add explicit: "Maximum 80 words. Count before outputting. If over, cut the least essential sentence" |
| Openers start with "I noticed" | 40% of emails start with "I noticed" or "I came across" | Add to banned list: "Never start with 'I noticed', 'I came across', or 'I saw that'. Start with the signal or the prospect's name" |
| Personalization is generic | LLM judge relevance score averaging 3.2 | Add better examples of good vs bad personalization. Include 2 examples of generic (bad) and 2 of specific (good) |
| Tone is too formal | LLM judge naturalness score averaging 3.5 | Add: "Write like you'd message a colleague, not draft a cover letter. Short sentences. Casual. No marketing language" |
| Hallucinated proof points | 3% of emails cite results not in input data | Strengthen: "Only reference facts provided in the input. If no proof point is in the input, omit the proof point. Never fabricate statistics or company names" |
| Em-dashes appear despite ban | 5% of emails contain em-dashes | Add to rules (some models need explicit character-level instruction): "Never use the em-dash character (—). Use periods to separate clauses instead" |
Production Monitoring
Weekly eval dashboard
| Metric | This week | Last week | Trend | Action needed? |
|---|---|---|---|---|
| Emails generated | 200 | 180 | ↑ | No |
| Automated pass rate | 96% | 97% | ↓ | Monitor. If drops below 95%, investigate |
| Judge pass rate | 87% | 89% | ↓ | Monitor. If drops below 85%, investigate |
| Human approval rate | 92% | 93% | → | Stable. Good |
| Hallucination rate | 1.5% | 1.0% | ↑ | Watch closely. If hits 2%, pause and fix |
| Average overall score | 4.2 | 4.3 | ↓ | Minor. Within range |
| Reply rate on sent emails | 11% | 12% | ↓ | Normal variation. Monitor over 4 weeks |
When to pause and fix
| Trigger | Action |
|---|---|
| Automated pass rate < 90% | Prompt regression. Check for recent changes. Revert if necessary |
| Judge pass rate < 80% | Quality declining. Review recent outputs vs 30-day-ago outputs. Update prompt |
| Hallucination rate > 3% | Urgent. Pause AI sending. Review data pipeline for missing fields. Strengthen anti-hallucination rules |
| Human approval rate < 85% | The judge is miscalibrated or quality has genuinely dropped. Re-calibrate judge. Review prompt |
| Reply rate declining 4 weeks straight | May be quality issue or list issue. Compare reply rates on AI vs template to isolate |
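These triggers are plain threshold comparisons, so they are easy to automate. A sketch, with assumed metric names:

```python
PAUSE_TRIGGERS = {
    "automated_pass_rate": lambda v: v < 0.90,
    "judge_pass_rate":     lambda v: v < 0.80,
    "hallucination_rate":  lambda v: v > 0.03,
    "human_approval_rate": lambda v: v < 0.85,
}

def fired_triggers(weekly_metrics: dict) -> list[str]:
    """Names of any pause-and-fix triggers this week's metrics hit."""
    return [name for name, tripped in PAUSE_TRIGGERS.items()
            if name in weekly_metrics and tripped(weekly_metrics[name])]
```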
Building the Golden Set for Cold Email
What to include
| Example type | Count | Purpose |
|---|---|---|
| Ideal outputs (human-written best examples) | 10-15 | The quality target. What perfect looks like |
| Good outputs (AI-generated, human-approved) | 10 | Realistic good. What acceptable looks like |
| Bad outputs (known failures with labeled issues) | 10 | What to avoid. Tests whether the eval catches failures |
| Edge cases (sparse input data, unusual companies) | 5-10 | Tests graceful degradation when data is thin |
Golden set rules
- Include the input data alongside each expected output. The eval needs to verify the output against the input. Without input data, accuracy checks are impossible
- Update the golden set quarterly. Add new failure modes, new edge cases, and new examples of ideal output as your ICP and messaging evolve
- Use real prospect data (anonymized if necessary). Fake data ("Acme Corp, 50 employees") produces different outputs than real data ("Ramp, 180 employees, Series B"). Use real data for realistic eval
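One way to structure an entry (a sketch; the fields are illustrative), plus a regression check that the eval still catches the labeled-bad examples:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    example_type: str        # "ideal" | "good" | "bad" | "edge"
    input_data: dict         # the prospect/signal data the generator saw
    subject: str
    body: str
    labeled_issues: list = field(default_factory=list)  # for "bad" examples

def eval_catches_known_failures(golden: list) -> float:
    """Share of labeled-bad examples the eval correctly rejects."""
    bad = [g for g in golden if g.example_type == "bad"]
    caught = sum(evaluate_email(g.subject, g.body, g.input_data)["decision"] != "send"
                 for g in bad)
    return caught / len(bad)
```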
Measurement
| Metric | Definition | Target | Frequency |
|---|---|---|---|
| Eval coverage | % of AI-generated emails that go through all 3 eval layers | 100% automated, 100% judge (or flagged), 10-20% human | Per batch |
| False positive rate | % of emails the eval approved that humans later identified as bad | < 5% | Monthly |
| False negative rate | % of emails the eval rejected that humans would have approved | < 10% | Monthly |
| Eval overhead time | Total time spent on eval (automated + judge + human) per batch of 200 | < 30 minutes (automated instant, judge < 5 min, human 20 min for 20% sample) | Per batch |
| Prompt improvement velocity | How much does quality improve per prompt iteration? | Measurable improvement on golden set per iteration | Per iteration |
Anti-Pattern Check
- No automated checks before LLM judge. The judge evaluates 200 emails. 30 of them are over word count. The judge spends compute (and money) evaluating emails that should have been caught by a free regex check. Automated checks first, always
- LLM judge uses the same prompt as the generator. The judge and the generator have the same blind spots. Use a different prompt with explicit evaluation criteria. The judge should be a critic, not a clone of the writer
- No hallucination-specific check. The generic quality eval rates an email 4.5/5. But the email says "congrats on the Series C" when the company raised a Series B. Generic quality scoring doesn't catch factual errors. Cross-check every proper noun and number
- Golden set has 5 examples. Not enough to evaluate reliably. 30-40 examples minimum across ideal, good, bad, and edge cases
- Eval exists but nobody reads the results. Weekly dashboard goes to a Slack channel nobody monitors. If the scores are declining and nobody notices for 3 weeks, the eval is theater. Assign an owner. Review weekly
- Same eval since launch, never updated. The ICP shifted. New banned phrases emerged. The messaging changed. The eval criteria must evolve with the product. Update quarterly
- Human spot-check stopped after Month 1. "The agent is good now, we don't need to check." Quality drifts. Prompts degrade. Models update. 10% human spot-check is permanent. Never drop to 0%