This skill should be used when the user asks to "evaluate AI-generated cold emails", "score cold email quality with AI", "build an eval for cold email agents", "assess LLM cold email output", "create a cold email quality checker", "evaluate AI outbound email quality", "build a cold email critic agent", "test AI-written sales emails", "measure AI cold email performance", or any variation of evaluating the quality of LLM-generated cold outbound email for B2B SaaS.

LLM Eval for Cold Email

LLM eval for cold email is a systematic quality check applied to every AI-generated cold email before it reaches a prospect. The eval combines automated rule checks (word count, banned phrases, format) with LLM-as-judge scoring (tone, relevance, naturalness) and human spot-checks. The output is a pass/fail decision per email plus a quality score that tracks agent performance over time.

The principle: a cold email eval should mirror what a skilled SDR manager would check in a 15-second review. Is it short? Is it relevant? Does it sound human? Is the signal real? Is the ask appropriate? Codify that review into automated + LLM + human layers.

The 3-Layer Eval

Layer 1: Automated rule checks (every email, instant)

| Check | Rule | How to implement | Pass/fail |
| --- | --- | --- | --- |
| Word count (Email 1) | ≤ 80 words | `len(email.split()) <= 80` | Fail if over |
| Word count (Email 2) | ≤ 90 words | Same | Fail if over |
| Word count (Email 3) | ≤ 30 words | Same | Fail if over |
| Banned phrases | No "leveraging", "synergies", "unlock", "in today's fast-paced world", etc. | Regex match against banned list | Fail if any match |
| Em-dashes | No "—" in body | `"—" not in email_body` | Fail if present |
| Subject line length | ≤ 5 words | `len(subject.split()) <= 5` | Fail if over |
| Subject line case | All lowercase (except proper nouns) | Check for uppercase words | Warn if Title Case |
| First word | Not "I" | `not email_body.strip().startswith("I ")` | Fail if starts with "I" |
| Calendar link (Email 1) | No Calendly/calendar link in Email 1 | URL pattern match | Fail if present |
| "Demo" mention | No "demo" anywhere | `"demo" not in email_body.lower()` | Fail if present |
| Personalization present | At least one merge variable rendered (not blank) | Check for unrendered `{{tokens}}` or blank personalization | Fail if no personalization |
| Email signature | First name only; no elaborate signature block | Check for excessive signature content | Warn if > 3 lines |
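The rules above can be collapsed into a single function that returns every violation at once. This is a minimal sketch, not a production checker: the banned-phrase list and the calendar-link patterns are assumptions to replace with your own.

```python
import re

# Illustrative banned list; extend with your own style guide.
BANNED_PHRASES = ["leveraging", "synergies", "unlock", "in today's fast-paced world"]

def automated_checks(subject: str, body: str, email_position: int = 1) -> list:
    """Return a list of Layer 1 rule violations; an empty list means PASS."""
    failures = []
    word_limits = {1: 80, 2: 90, 3: 30}  # per-email word caps
    if len(body.split()) > word_limits.get(email_position, 80):
        failures.append("word count over limit")
    lower = body.lower()
    failures += [f'banned phrase: "{p}"' for p in BANNED_PHRASES if p in lower]
    if "—" in body:
        failures.append("em-dash present")
    if len(subject.split()) > 5:
        failures.append("subject line over 5 words")
    if body.strip().startswith("I "):
        failures.append('starts with "I"')
    if email_position == 1 and re.search(r"calendly\.com|cal\.com", lower):
        failures.append("calendar link in Email 1")  # assumed link patterns
    if "demo" in lower:
        failures.append('"demo" mentioned')
    if re.search(r"\{\{.*?\}\}", body):
        failures.append("unrendered merge token")
    return failures
```

Run it on every draft before any LLM call: the list doubles as regeneration feedback.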

Layer 2: LLM-as-judge (every email or flagged sample)

Use a second LLM call to rate the email on qualitative dimensions the automated checks can't measure.

Judge prompt:

You are a cold email quality reviewer for B2B SaaS outbound.
Rate this email on 5 dimensions, each 1-5.

Dimensions:
1. RELEVANCE: Does the personalization connect to the
   prospect's actual situation based on the input data?
   (5 = clearly relevant, 1 = generic or irrelevant)

2. NATURALNESS: Does it read like a human peer wrote it,
   not a bot or a marketer?
   (5 = completely natural, 1 = obviously AI-generated)

3. SIGNAL QUALITY: Is the signal (the reason for reaching
   out) specific, recent, and verifiable?
   (5 = specific verifiable signal, 1 = no signal or generic)

4. ASK APPROPRIATENESS: Is the CTA reasonable? (15 min,
   not 45. No "demo". No calendar link in Email 1)
   (5 = perfect ask, 1 = aggressive or inappropriate)

5. OVERALL: Would you send this email as-is to a VP at a
   target account?
   (5 = send immediately, 1 = would never send)

Input data provided to the email writer:
{input_data}

Email to evaluate:
Subject: {subject}
Body: {body}

Respond with JSON:
{
  "relevance": N,
  "naturalness": N,
  "signal_quality": N,
  "ask_appropriateness": N,
  "overall": N,
  "issues": ["list of specific issues found"],
  "pass": true/false (pass if overall >= 4 AND no dimension < 3)
}

Judge rules:

  • Use Sonnet for the judge (fast, accurate enough for scoring). Don't use the same model instance that generated the email
  • Pass threshold: overall ≥ 4 AND no individual dimension < 3. A 5/5 on naturalness doesn't compensate for a 2/5 on relevance
  • Log all judge scores. The trend over time shows whether the generator prompt is improving or degrading
  • Calibrate the judge monthly against human ratings. If the judge consistently scores 0.5+ points higher than human reviewers, adjust the judge prompt
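Once the judge responds, applying the pass threshold is mechanical. A sketch, assuming the judge returned the JSON schema from the prompt above (the API call itself is omitted):

```python
import json

DIMENSIONS = ("relevance", "naturalness", "signal_quality",
              "ask_appropriateness", "overall")

def judge_passes(judge_response: str) -> bool:
    """Apply the threshold: overall >= 4 AND no dimension < 3."""
    scores = json.loads(judge_response)
    return scores["overall"] >= 4 and all(scores[d] >= 3 for d in DIMENSIONS)
```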

Layer 3: Human spot-check (10-20% sample)

A human reviewer checks a random sample of emails that passed Layers 1 and 2.

Human review checklist (15-20 seconds per email):

  • [ ] Signal is real (could you verify it with a quick search?)
  • [ ] Personalization connects to a relevant problem (not just trivia)
  • [ ] Tone sounds like a peer, not a vendor or a bot
  • [ ] You would actually send this email to a real prospect
  • [ ] No claims about the prospect that aren't in the input data

Human review rules:

  • Review 10-20% of emails that passed automated + LLM checks
  • Rate each 1-5 on the same dimensions as the LLM judge
  • Compare human scores to LLM judge scores. They should agree within ±0.5 on average. If they diverge, recalibrate the judge
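The ±0.5 agreement check can be computed directly from paired reviews. A sketch with illustrative function names; scores are the overall ratings for the same emails from each reviewer:

```python
def calibration_gap(human_scores: list, judge_scores: list) -> float:
    """Signed mean difference (judge minus human) over paired reviews."""
    assert len(human_scores) == len(judge_scores)
    diffs = [j - h for h, j in zip(human_scores, judge_scores)]
    return sum(diffs) / len(diffs)

def needs_recalibration(human_scores, judge_scores, tolerance=0.5) -> bool:
    # A positive gap above tolerance means the judge is too lenient;
    # a negative one means it is too harsh. Either triggers recalibration.
    return abs(calibration_gap(human_scores, judge_scores)) > tolerance
```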

Hallucination-Specific Checks

Hallucination is the highest-risk failure mode in AI-generated cold email. One fabricated claim ("congrats on the acquisition" when no acquisition happened) permanently damages the relationship.

Hallucination detection methods

| Method | What it catches | Implementation |
| --- | --- | --- |
| Proper noun cross-check | Company names, person names, event names in the email that aren't in the input data | Extract all proper nouns from output. Check each against input data fields. Flag unmatched nouns |
| Funding claim verification | "Series B" or "$45M raise" that doesn't match input | Extract funding references. Compare to input funding field |
| Metric claim verification | "12% improvement" or "3x pipeline" not from input | Extract numbers with context. Check against input proof points |
| Event reference check | "Your talk at SaaStr" or "the product launch" not from input | Extract event references. Compare to input signals |
| Role/title verification | "As Head of RevOps" when input says "Director of Sales" | Compare role references to input title field |

Hallucination rules

  • Cross-check every proper noun. If the email mentions a company name, person name, or event that isn't in the input data, it's fabricated. Flag and reject
  • Cross-check every number. If the email cites "3x improvement" and the input data contains no such number, it's fabricated
  • "Not found" is safer than guessing. If the input data is missing a field, the email should skip that element, not fill it with a plausible-sounding claim
  • Hallucination rate target: < 2% per batch. If more than 2% of emails contain fabricated claims, the generator prompt needs anti-hallucination reinforcement
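The proper-noun cross-check can be approximated without an NER model. The sketch below flags mid-sentence capitalized words that never appear in the input data; it is deliberately naive (sentence-initial words are skipped because their capitalization is ambiguous) and a real pipeline would use proper named-entity extraction.

```python
def unmatched_proper_nouns(email_body: str, input_data: dict) -> list:
    """Flag mid-sentence capitalized words absent from the input data."""
    haystack = " ".join(str(v) for v in input_data.values()).lower()
    flagged = []
    words = email_body.split()
    for i, raw in enumerate(words):
        word = raw.strip(".,!?()'\"")
        # Words after sentence-ending punctuation are ambiguous; skip them.
        sentence_start = i == 0 or words[i - 1].rstrip("'\"").endswith((".", "!", "?"))
        if word[:1].isupper() and not sentence_start and word.lower() not in haystack:
            flagged.append(word)
    return flagged
```

Any non-empty return is a reject per the rules above: the noun is treated as fabricated until the input data proves otherwise.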

Eval Scoring and Thresholds

Per-email scoring

Automated checks: PASS / FAIL (binary)
  → Must pass all automated checks to proceed

LLM judge score: 1-5 per dimension, 5 dimensions
  → Overall = the judge's own overall rating (it should roughly track the mean of the other four dimensions)
  → Pass: overall ≥ 4.0 AND no dimension < 3.0

Human review (if sampled): 1-5 per dimension
  → Same thresholds as LLM judge

Final decision:
  Automated PASS + Judge PASS → Send
  Automated PASS + Judge FAIL → Regenerate with feedback
  Automated FAIL → Regenerate (don't judge, just fix the rule violation)
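The decision table above reduces to a small routing function. A sketch; the return strings and dimension names are illustrative:

```python
JUDGE_DIMENSIONS = ("relevance", "naturalness", "signal_quality",
                    "ask_appropriateness", "overall")

def route_email(rule_failures: list, judge_scores: dict) -> str:
    """Route per the decision table: automated rules first, then the judge."""
    if rule_failures:
        # Automated FAIL: regenerate without spending a judge call.
        return "regenerate: fix rule violations"
    judge_pass = (judge_scores["overall"] >= 4.0
                  and min(judge_scores[d] for d in JUDGE_DIMENSIONS) >= 3.0)
    return "send" if judge_pass else "regenerate: with judge feedback"
```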

Batch-level scoring

| Metric | How to calculate | Target |
| --- | --- | --- |
| Automated pass rate | % of emails passing all automated checks | > 95% |
| Judge pass rate | % of emails scoring ≥ 4.0 overall | > 85% |
| Human approval rate | % of human-reviewed emails approved | > 90% |
| Hallucination rate | % of emails with fabricated claims | < 2% |
| Average overall score | Mean LLM judge overall score across batch | ≥ 4.0 |
| Regeneration rate | % of emails that needed regeneration (failed auto or judge) | < 15% |
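These batch metrics fall out of the per-email eval records. A sketch assuming each record is a dict with illustrative keys (`auto_pass`, `judge_overall`, `hallucinated`, `regenerated`):

```python
def batch_metrics(emails: list) -> dict:
    """Compute batch-level rates from per-email eval records."""
    n = len(emails)
    overalls = [e["judge_overall"] for e in emails]
    return {
        "automated_pass_rate": sum(e["auto_pass"] for e in emails) / n,
        "judge_pass_rate": sum(s >= 4.0 for s in overalls) / n,
        "hallucination_rate": sum(e["hallucinated"] for e in emails) / n,
        "avg_overall": sum(overalls) / n,
        "regeneration_rate": sum(e["regenerated"] for e in emails) / n,
    }
```

Log the output per batch; the weekly dashboard is then just a diff of consecutive batches.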

Eval-Driven Prompt Improvement

Finding patterns in failures

1. Export all failed emails from the last batch
2. Categorize failures:
   - Rule violations (which rule? how often?)
   - Low relevance scores (is the personalization generic?)
   - Low naturalness scores (what makes it sound like AI?)
   - Hallucinations (what was fabricated? why?)
3. Identify the top 2 failure categories
4. Modify the generator prompt to address those categories
5. Re-run the eval on the golden set
6. Deploy if scores improve without regression

Common failure patterns and prompt fixes

| Pattern | Evidence | Prompt fix |
| --- | --- | --- |
| Emails are too long | 30% fail word count check | Add explicit: "Maximum 80 words. Count before outputting. If over, cut the least essential sentence" |
| Openers start with "I noticed" | 40% of emails start with "I noticed" or "I came across" | Add to banned list: "Never start with 'I noticed', 'I came across', or 'I saw that'. Start with the signal or the prospect's name" |
| Personalization is generic | LLM judge relevance score averaging 3.2 | Add better examples of good vs bad personalization. Include 2 examples of generic (bad) and 2 of specific (good) |
| Tone is too formal | LLM judge naturalness score averaging 3.5 | Add: "Write like you'd message a colleague, not draft a cover letter. Short sentences. Casual. No marketing language" |
| Hallucinated proof points | 3% of emails cite results not in input data | Strengthen: "Only reference facts provided in the input. If no proof point is in the input, omit the proof point. Never fabricate statistics or company names" |
| Em-dashes appear despite ban | 5% of emails contain em-dashes | Add to rules (some models need explicit character-level instruction): "Never use the em-dash character (—). Use periods to separate clauses instead" |

Production Monitoring

Weekly eval dashboard

| Metric | This week | Last week | Trend | Action needed? |
| --- | --- | --- | --- | --- |
| Emails generated | 200 | 180 | ↑ | No |
| Automated pass rate | 96% | 97% | ↓ | Monitor. If it drops below 95%, investigate |
| Judge pass rate | 87% | 89% | ↓ | Monitor. If it drops below 85%, investigate |
| Human approval rate | 92% | 93% | ↓ | Stable. Good |
| Hallucination rate | 1.5% | 1.0% | ↑ | Watch closely. If it hits 2%, pause and fix |
| Average overall score | 4.2 | 4.3 | ↓ | Minor. Within range |
| Reply rate on sent emails | 11% | 12% | ↓ | Normal variation. Monitor over 4 weeks |

When to pause and fix

| Trigger | Action |
| --- | --- |
| Automated pass rate < 90% | Prompt regression. Check for recent changes. Revert if necessary |
| Judge pass rate < 80% | Quality declining. Review recent outputs vs 30-day-ago outputs. Update prompt |
| Hallucination rate > 3% | Urgent. Pause AI sending. Review data pipeline for missing fields. Strengthen anti-hallucination rules |
| Human approval rate < 85% | The judge is miscalibrated or quality has genuinely dropped. Recalibrate the judge. Review the prompt |
| Reply rate declining 4 weeks straight | May be a quality issue or a list issue. Compare reply rates on AI vs template emails to isolate |

Building the Golden Set for Cold Email

What to include

| Example type | Count | Purpose |
| --- | --- | --- |
| Ideal outputs (human-written best examples) | 10-15 | The quality target. What perfect looks like |
| Good outputs (AI-generated, human-approved) | 10 | Realistic good. What acceptable looks like |
| Bad outputs (known failures with labeled issues) | 10 | What to avoid. Tests whether the eval catches failures |
| Edge cases (sparse input data, unusual companies) | 5-10 | Tests graceful degradation when data is thin |

Golden set rules

  • Include the input data alongside each expected output. The eval needs to verify the output against the input. Without input data, accuracy checks are impossible
  • Update the golden set quarterly. Add new failure modes, new edge cases, and new examples of ideal output as your ICP and messaging evolve
  • Use real prospect data (anonymized if necessary). Fake data ("Acme Corp, 50 employees") produces different outputs than real data ("Ramp, 180 employees, Series B"). Use real data for realistic eval

Measurement

| Metric | Definition | Target | Frequency |
| --- | --- | --- | --- |
| Eval coverage | % of AI-generated emails that go through all 3 eval layers | 100% automated, 100% judge (or flagged), 10-20% human | Per batch |
| False positive rate | % of emails the eval approved that humans later identified as bad | < 5% | Monthly |
| False negative rate | % of emails the eval rejected that humans would have approved | < 10% | Monthly |
| Eval overhead time | Total time spent on eval (automated + judge + human) per batch of 200 | < 30 minutes (automated instant, judge < 5 min, human 20 min for 20% sample) | Per batch |
| Prompt improvement velocity | Quality improvement per prompt iteration | Measurable improvement on golden set per iteration | Per iteration |

Anti-Pattern Check

  • No automated checks before LLM judge. The judge evaluates 200 emails. 30 of them are over word count. The judge spends compute (and money) evaluating emails that should have been caught by a free regex check. Automated checks first, always
  • LLM judge uses the same prompt as the generator. The judge and the generator have the same blind spots. Use a different prompt with explicit evaluation criteria. The judge should be a critic, not a clone of the writer
  • No hallucination-specific check. The generic quality eval rates an email 4.5/5. But the email says "congrats on the Series C" when the company raised a Series B. Generic quality scoring doesn't catch factual errors. Cross-check every proper noun and number
  • Golden set has 5 examples. Not enough to evaluate reliably. 30-40 examples minimum across ideal, good, bad, and edge cases
  • Eval exists but nobody reads the results. Weekly dashboard goes to a Slack channel nobody monitors. If the scores are declining and nobody notices for 3 weeks, the eval is theater. Assign an owner. Review weekly
  • Same eval since launch, never updated. The ICP shifted. New banned phrases emerged. The messaging changed. The eval criteria must evolve with the product. Update quarterly
  • Human spot-check stopped after Month 1. "The agent is good now, we don't need to check." Quality drifts. Prompts degrade. Models update. 10% human spot-check is permanent. Never drop to 0%