LLM Eval for Cold Email
LLM eval for cold email is a systematic quality check applied to every AI-generated cold email before it reaches a prospect. The eval combines automated rule checks (word count, banned phrases, format) with LLM-as-judge scoring (tone, relevance, naturalness) and human spot-checks. The output is a pass/fail decision per email plus a quality score that tracks agent performance over time.
The principle: a cold email eval should mirror what a skilled SDR manager would check in a 15-second review. Is it short? Is it relevant? Does it sound human? Is the signal real? Is the ask appropriate? Codify that review into automated + LLM + human layers.
The 3-Layer Eval
Layer 1: Automated rule checks (every email, instant)
| Check | Rule | How to implement | Pass/fail |
|---|---|---|---|
| Word count (Email 1) | ≤ 80 words | `len(email.split()) <= 80` | Fail if over |
| Word count (Email 2) | ≤ 90 words | Same | Fail if over |
| Word count (Email 3) | ≤ 30 words | Same | Fail if over |
| Banned phrases | No "leveraging", "synergies", "unlock", "in today's fast-paced world", etc. | Regex match against banned list | Fail if any match |
| Em-dashes | No "—" in body | `"—" not in email_body` | Fail if present |
| Subject line length | ≤ 5 words | `len(subject.split()) <= 5` | Fail if over |
| Subject line case | All lowercase (except proper nouns) | Check for uppercase words | Warn if Title Case |
| First word | Not "I" | `not email_body.strip().startswith("I ")` | Fail if starts with "I" |
| Calendar link (Email 1) | No Calendly/calendar link in Email 1 | URL pattern match | Fail if present |
| "Demo" mention | No "demo" anywhere | `"demo" not in email_body.lower()` | Fail if present |
| Personalization present | At least one merge variable rendered (not blank) | Check for unrendered `{{tokens}}` or blank personalization | Fail if no personalization |
| Email signature | First name only; no elaborate signature block | Check for excessive signature content | Warn if > 3 lines |
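As a rough sketch (Python; the field names, banned list, and URL pattern are illustrative, not a reference implementation), Layer 1 can be a single function that returns the list of violated rules:

```python
import re

BANNED_PHRASES = ["leveraging", "synergies", "unlock",
                  "in today's fast-paced world"]            # extend per your list
CALENDAR_URL = re.compile(r"calendly\.com|cal\.com", re.I)  # illustrative pattern
UNRENDERED = re.compile(r"\{\{.*?\}\}")                     # merge token left blank
MAX_WORDS = {1: 80, 2: 90, 3: 30}

def rule_check(subject: str, body: str, email_number: int = 1) -> list[str]:
    """Layer 1: return the list of violated rules; empty list means pass."""
    failures = []
    if len(body.split()) > MAX_WORDS[email_number]:
        failures.append(f"word_count > {MAX_WORDS[email_number]}")
    if any(p in body.lower() for p in BANNED_PHRASES):
        failures.append("banned_phrase")
    if "—" in body:
        failures.append("em_dash")
    if len(subject.split()) > 5:
        failures.append("subject_too_long")
    if body.strip().startswith("I "):
        failures.append("starts_with_I")
    if email_number == 1 and CALENDAR_URL.search(body):
        failures.append("calendar_link_in_email_1")
    if "demo" in body.lower():
        failures.append("demo_mention")
    if UNRENDERED.search(body):
        failures.append("unrendered_merge_variable")
    return failures
```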
Layer 2: LLM-as-judge (every email or flagged sample)
Use a second LLM call to rate the email on qualitative dimensions the automated checks can't measure.
Judge prompt:

```
You are a cold email quality reviewer for B2B SaaS outbound.
Rate this email on 5 dimensions, each 1-5.

Dimensions:

1. RELEVANCE: Does the personalization connect to the prospect's
   actual situation based on the input data?
   (5 = clearly relevant, 1 = generic or irrelevant)
2. NATURALNESS: Does it read like a human peer wrote it, not a
   bot or a marketer?
   (5 = completely natural, 1 = obviously AI-generated)
3. SIGNAL QUALITY: Is the signal (the reason for reaching out)
   specific, recent, and verifiable?
   (5 = specific verifiable signal, 1 = no signal or generic)
4. ASK APPROPRIATENESS: Is the CTA reasonable? (15 min, not 45.
   No "demo". No calendar link in Email 1.)
   (5 = perfect ask, 1 = aggressive or inappropriate)
5. OVERALL: Would you send this email as-is to a VP at a
   target account?
   (5 = send immediately, 1 = would never send)

Input data provided to the email writer:
{input_data}

Email to evaluate:
Subject: {subject}
Body: {body}

Respond with JSON:
{
  "relevance": N,
  "naturalness": N,
  "signal_quality": N,
  "ask_appropriateness": N,
  "overall": N,
  "issues": ["list of specific issues found"],
  "pass": true/false (pass if overall >= 4 AND no dimension < 3)
}
```
Judge rules:
- Use Sonnet for the judge (fast, accurate enough for scoring). Run the judge as a fresh call with its own prompt, not in the same context that generated the email
- Pass threshold: overall ≥ 4 AND no individual dimension < 3. A 5/5 on naturalness doesn't compensate for a 2/5 on relevance
- Log all judge scores. The trend over time shows whether the generator prompt is improving or degrading
- Calibrate the judge monthly against human ratings. If the judge consistently scores 0.5+ points higher than human reviewers, adjust the judge prompt
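A minimal sketch of the judge call, assuming the Anthropic Python SDK and a `JUDGE_PROMPT` string holding the template above (the model ID is an assumption; substitute your current Sonnet version):

```python
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

DIMENSIONS = ["relevance", "naturalness", "signal_quality", "ask_appropriateness"]

def judge_email(input_data: str, subject: str, body: str) -> dict:
    # Plain .replace() rather than .format(): the literal JSON braces
    # in the prompt template would otherwise break str.format
    prompt = (JUDGE_PROMPT.replace("{input_data}", input_data)
                          .replace("{subject}", subject)
                          .replace("{body}", body))
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumption: substitute your Sonnet model ID
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the judge returns bare JSON; strip surrounding text if it doesn't
    scores = json.loads(response.content[0].text)
    # Enforce the threshold in code instead of trusting the judge's own "pass"
    scores["pass"] = scores["overall"] >= 4 and all(scores[d] >= 3 for d in DIMENSIONS)
    return scores
```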
Layer 3: Human spot-check (10-20% sample)
A human reviewer checks a random sample of emails that passed Layers 1 and 2.
Human review checklist (15-20 seconds per email):
- [ ] Signal is real (could you verify it with a quick search?)
- [ ] Personalization connects to a relevant problem (not just trivia)
- [ ] Tone sounds like a peer, not a vendor or a bot
- [ ] You would actually send this email to a real prospect
- [ ] No claims about the prospect that aren't in the input data
Human review rules:
- Review 10-20% of emails that passed automated + LLM checks
- Rate each 1-5 on the same dimensions as the LLM judge
- Compare human scores to LLM judge scores. They should agree within ±0.5 on average. If they diverge, recalibrate the judge
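A small sketch of the sampling and calibration math (record shapes are illustrative):

```python
import random

def sample_for_review(passed: list, rate: float = 0.15) -> list:
    """Random 10-20% sample of emails that cleared Layers 1 and 2."""
    if not passed:
        return []
    return random.sample(passed, max(1, round(len(passed) * rate)))

def judge_human_gap(pairs: list[tuple[float, float]]) -> float:
    """Mean (judge - human) overall-score gap; recalibrate if it exceeds ±0.5."""
    return sum(judge - human for judge, human in pairs) / len(pairs)
```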
Hallucination-Specific Checks
Hallucination is the highest-risk failure mode in AI-generated cold email. One fabricated claim ("congrats on the acquisition" when no acquisition happened) permanently damages the relationship.
Hallucination detection methods
| Method | What it catches | Implementation |
|---|---|---|
| Proper noun cross-check | Company names, person names, event names in the email that aren't in the input data | Extract all proper nouns from output. Check each against input data fields. Flag unmatched nouns |
| Funding claim verification | "Series B" or "$45M raise" that doesn't match input | Extract funding references. Compare to input funding field |
| Metric claim verification | "12% improvement" or "3x pipeline" not from input | Extract numbers with context. Check against input proof points |
| Event reference check | "Your talk at SaaStr" or "the product launch" not from input | Extract event references. Compare to input signals |
| Role/title verification | "As Head of RevOps" when input says "Director of Sales" | Compare role references to input title field |
Hallucination rules
- Cross-check every proper noun. If the email mentions a company name, person name, or event that isn't in the input data, it's fabricated. Flag and reject
- Cross-check every number. If the email cites "3x improvement" and the input data contains no such number, it's fabricated
- "Not found" is safer than guessing. If the input data is missing a field, the email should skip that element, not fill it with a plausible-sounding claim
- Hallucination rate target: < 2% per batch. If more than 2% of emails contain fabricated claims, the generator prompt needs anti-hallucination reinforcement
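A sketch of the first two cross-checks using deliberately blunt heuristics; a production pipeline might substitute a proper NER model for the noun extraction:

```python
import re

def unmatched_proper_nouns(body: str, input_data: dict) -> list[str]:
    """Capitalized words (excluding sentence starts) absent from the input data."""
    haystack = " ".join(str(v) for v in input_data.values()).lower()
    flagged = set()
    for sentence in re.split(r"[.!?]\s+", body):
        for word in sentence.split()[1:]:        # skip the sentence-initial word
            token = word.strip(",.!?()'\"")
            if token.istitle() and token.lower() not in haystack:
                flagged.add(token)
    return sorted(flagged)

def unmatched_numbers(body: str, input_data: dict) -> list[str]:
    """Numeric claims ($45M, 3x, 12%) with no counterpart in the input data."""
    haystack = " ".join(str(v) for v in input_data.values())
    claims = re.findall(r"\$?\d[\d,.]*[%xXMBkK]?", body)
    return sorted({c for c in claims if c not in haystack})
```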
Eval Scoring and Thresholds
Per-email scoring
```
Automated checks: PASS / FAIL (binary)
  → Must pass all automated checks to proceed

LLM judge score: 1-5 per dimension, 5 dimensions
  → Overall = the judge's OVERALL rating (it should track the
    average of the other four dimensions)
  → Pass: overall ≥ 4.0 AND no dimension < 3.0

Human review (if sampled): 1-5 per dimension
  → Same thresholds as the LLM judge

Final decision:
  Automated PASS + Judge PASS → Send
  Automated PASS + Judge FAIL → Regenerate with feedback
  Automated FAIL → Regenerate (don't judge, just fix the rule violation)
```
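A sketch of that gating logic, reusing the hypothetical `rule_check` and `judge_email` helpers from the earlier sketches:

```python
def evaluate_email(subject: str, body: str, input_data: dict,
                   email_number: int = 1) -> dict:
    # Layer 1 first: never spend judge tokens on a rule violation
    rule_failures = rule_check(subject, body, email_number)
    if rule_failures:
        return {"decision": "regenerate", "layer": "rules", "reason": rule_failures}

    scores = judge_email(str(input_data), subject, body)
    if not scores["pass"]:
        # The judge's "issues" list doubles as regeneration feedback
        return {"decision": "regenerate", "layer": "judge", "reason": scores["issues"]}

    return {"decision": "send", "scores": scores}
```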
Batch-level scoring
| Metric | How to calculate | Target |
|---|---|---|
| Automated pass rate | % of emails passing all automated checks | > 95% |
| Judge pass rate | % of emails scoring ≥ 4.0 overall | > 85% |
| Human approval rate | % of human-reviewed emails approved | > 90% |
| Hallucination rate | % of emails with fabricated claims | < 2% |
| Average overall score | Mean LLM judge overall score across batch | ≥ 4.0 |
| Regeneration rate | % of emails that needed regeneration (failed auto or judge) | < 15% |
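A sketch of the batch rollup over the per-email results from the gating sketch above (hallucination rate would come from the cross-checks, tracked separately):

```python
def batch_metrics(results: list[dict]) -> dict:
    """Roll per-email decisions up into the batch-level metrics above."""
    n = len(results)
    sent = [r for r in results if r["decision"] == "send"]
    rule_fails = sum(r.get("layer") == "rules" for r in results)
    return {
        "automated_pass_rate": (n - rule_fails) / n,
        "judge_pass_rate": len(sent) / max(n - rule_fails, 1),
        "regeneration_rate": (n - len(sent)) / n,
        "avg_overall": sum(r["scores"]["overall"] for r in sent) / max(len(sent), 1),
    }
```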
Eval-Driven Prompt Improvement
Finding patterns in failures
1. Export all failed emails from the last batch
2. Categorize failures:
- Rule violations (which rule? how often?)
- Low relevance scores (is the personalization generic?)
- Low naturalness scores (what makes it sound like AI?)
- Hallucinations (what was fabricated? why?)
3. Identify the top 2 failure categories
4. Modify the generator prompt to address those categories
5. Re-run the eval on the golden set
6. Deploy if scores improve without regression
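Steps 2 and 3 reduce to a frequency count when each failed email carries a `reason` list, as in the gating sketch:

```python
from collections import Counter

def top_failure_categories(failed: list[dict], k: int = 2) -> list[tuple[str, int]]:
    """Tally failure reasons across a batch; fix the top-k categories first."""
    counts = Counter(reason for email in failed for reason in email["reason"])
    return counts.most_common(k)
```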
Common failure patterns and prompt fixes
| Pattern | Evidence | Prompt fix |
|---|---|---|
| Emails are too long | 30% fail word count check | Add explicit: "Maximum 80 words. Count before outputting. If over, cut the least essential sentence" |
| Openers start with "I noticed" | 40% of emails start with "I noticed" or "I came across" | Add to banned list: "Never start with 'I noticed', 'I came across', or 'I saw that'. Start with the signal or the prospect's name" |
| Personalization is generic | LLM judge relevance score averaging 3.2 | Add better examples of good vs bad personalization. Include 2 examples of generic (bad) and 2 of specific (good) |
| Tone is too formal | LLM judge naturalness score averaging 3.5 | Add: "Write like you'd message a colleague, not draft a cover letter. Short sentences. Casual. No marketing language" |
| Hallucinated proof points | 3% of emails cite results not in input data | Strengthen: "Only reference facts provided in the input. If no proof point is in the input, omit the proof point. Never fabricate statistics or company names" |
| Em-dashes appear despite ban | 5% of emails contain em-dashes | Add to rules (some models need explicit character-level instruction): "Never use the em-dash character (—). Use periods to separate clauses instead" |
Production Monitoring
Weekly eval dashboard
| Metric | This week | Last week | Trend | Action needed? |
|---|---|---|---|---|
| Emails generated | 200 | 180 | ↑ | No |
| Automated pass rate | 96% | 97% | ↓ | Monitor. If drops below 95%, investigate |
| Judge pass rate | 87% | 89% | ↓ | Monitor. If drops below 85%, investigate |
| Human approval rate | 92% | 93% | → | Stable. Good |
| Hallucination rate | 1.5% | 1.0% | ↑ | Watch closely. If hits 2%, pause and fix |
| Average overall score | 4.2 | 4.3 | ↓ | Minor. Within range |
| Reply rate on sent emails | 11% | 12% | ↓ | Normal variation. Monitor over 4 weeks |
When to pause and fix
| Trigger | Action |
|---|---|
| Automated pass rate < 90% | Prompt regression. Check for recent changes. Revert if necessary |
| Judge pass rate < 80% | Quality declining. Review recent outputs vs 30-day-ago outputs. Update prompt |
| Hallucination rate > 3% | Urgent. Pause AI sending. Review data pipeline for missing fields. Strengthen anti-hallucination rules |
| Human approval rate < 85% | The judge is miscalibrated or quality has genuinely dropped. Re-calibrate judge. Review prompt |
| Reply rate declining 4 weeks straight | May be quality issue or list issue. Compare reply rates on AI vs template to isolate |
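These triggers are plain threshold comparisons, so they are easy to automate. A sketch, with assumed metric names:

```python
PAUSE_TRIGGERS = {
    "automated_pass_rate": lambda v: v < 0.90,
    "judge_pass_rate":     lambda v: v < 0.80,
    "hallucination_rate":  lambda v: v > 0.03,
    "human_approval_rate": lambda v: v < 0.85,
}

def fired_triggers(weekly_metrics: dict) -> list[str]:
    """Names of any pause-and-fix triggers this week's metrics hit."""
    return [name for name, tripped in PAUSE_TRIGGERS.items()
            if name in weekly_metrics and tripped(weekly_metrics[name])]
```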
Building the Golden Set for Cold Email
What to include
| Example type | Count | Purpose |
|---|---|---|
| Ideal outputs (human-written best examples) | 10-15 | The quality target. What perfect looks like |
| Good outputs (AI-generated, human-approved) | 10 | Realistic good. What acceptable looks like |
| Bad outputs (known failures with labeled issues) | 10 | What to avoid. Tests whether the eval catches failures |
| Edge cases (sparse input data, unusual companies) | 5-10 | Tests graceful degradation when data is thin |
Golden set rules
- Include the input data alongside each expected output. The eval needs to verify the output against the input. Without input data, accuracy checks are impossible
- Update the golden set quarterly. Add new failure modes, new edge cases, and new examples of ideal output as your ICP and messaging evolve
- Use real prospect data (anonymized if necessary). Fake data ("Acme Corp, 50 employees") produces different outputs than real data ("Ramp, 180 employees, Series B"). Use real data for realistic eval
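One way to structure an entry (a sketch; the fields are illustrative), plus a regression check that the eval still catches the labeled-bad examples:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    example_type: str        # "ideal" | "good" | "bad" | "edge"
    input_data: dict         # the prospect/signal data the generator saw
    subject: str
    body: str
    labeled_issues: list = field(default_factory=list)  # for "bad" examples

def eval_catches_known_failures(golden: list) -> float:
    """Share of labeled-bad examples the eval correctly rejects."""
    bad = [g for g in golden if g.example_type == "bad"]
    caught = sum(evaluate_email(g.subject, g.body, g.input_data)["decision"] != "send"
                 for g in bad)
    return caught / len(bad)
```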
Measurement
| Metric | Definition | Target | Frequency |
|---|---|---|---|
| Eval coverage | % of AI-generated emails that go through all 3 eval layers | 100% automated, 100% judge (or flagged), 10-20% human | Per batch |
| False positive rate | % of emails the eval approved that humans later identified as bad | < 5% | Monthly |
| False negative rate | % of emails the eval rejected that humans would have approved | < 10% | Monthly |
| Eval overhead time | Total time spent on eval (automated + judge + human) per batch of 200 | < 30 minutes (automated instant, judge < 5 min, human 20 min for 20% sample) | Per batch |
| Prompt improvement velocity | How much does quality improve per prompt iteration? | Measurable improvement on golden set per iteration | Per iteration |
Anti-Pattern Check
- No automated checks before LLM judge. The judge evaluates 200 emails. 30 of them are over word count. The judge spends compute (and money) evaluating emails that should have been caught by a free regex check. Automated checks first, always
- LLM judge uses the same prompt as the generator. The judge and the generator have the same blind spots. Use a different prompt with explicit evaluation criteria. The judge should be a critic, not a clone of the writer
- No hallucination-specific check. The generic quality eval rates an email 4.5/5. But the email says "congrats on the Series C" when the company raised a Series B. Generic quality scoring doesn't catch factual errors. Cross-check every proper noun and number
- Golden set has 5 examples. Not enough to evaluate reliably. 30-40 examples minimum across ideal, good, bad, and edge cases
- Eval exists but nobody reads the results. Weekly dashboard goes to a Slack channel nobody monitors. If the scores are declining and nobody notices for 3 weeks, the eval is theater. Assign an owner. Review weekly
- Same eval since launch, never updated. The ICP shifted. New banned phrases emerged. The messaging changed. The eval criteria must evolve with the product. Update quarterly
- Human spot-check stopped after Month 1. "The agent is good now, we don't need to check." Quality drifts. Prompts degrade. Models update. 10% human spot-check is permanent. Never drop to 0%