
human-in-the-loop-review

This skill should be used when the user asks to "design human review for AI output", "set up human-in-the-loop", "build a review process for agent output", "create a QA process for AI-generated content", "design human oversight for agents", "set up approval gates for AI", "build a review workflow for AI emails", "design human QA for agent output", "create a human review cadence for AI", or any variation of designing human review and approval processes for AI agent output in B2B SaaS GTM.

Human-in-the-Loop Review

Human-in-the-loop (HITL) review is the process of having a human check, edit, or approve AI-generated output before it reaches a customer or enters a system of record. Every production AI agent that produces customer-facing output needs HITL. The question is not whether to include human review. It's how much, at what stage, and how to reduce it over time without reducing quality.

The principle: start at 100% human review. Earn the right to reduce it through demonstrated quality. 100% → 50% → 25% → 10% spot-check. Never 0%. Even the most reliable agent produces occasional bad output. The cost of one bad email to a Tier 1 ABM account exceeds the cost of reviewing 1,000 good emails.

The HITL Spectrum

| Level | What the human does | When to use | Review rate |
| --- | --- | --- | --- |
| Full review | Human reads every output and approves, edits, or rejects it before it goes out | First 2-4 weeks of any new agent or prompt | 100% |
| Selective review | Human reviews flagged outputs (low confidence, edge cases) plus a random sample | After 2-4 weeks of stable quality, with automated checks passing consistently | 25-50% |
| Spot-check | Human reviews a random 10-20% sample; the rest goes out automatically | After 30+ days of stable quality with < 2% error rate | 10-20% |
| Exception-only | Human reviews only outputs that fail automated checks | Mature agents with 60+ days of production data and < 1% error rate | 1-5% |

Progression rules

  • Start at 100%. Always. No exceptions. The first 100-200 outputs from any new agent should be human-reviewed. This catches prompt issues, hallucination patterns, and edge cases before customers see them
  • Advance one level at a time. 100% → 50% → 25% → 10%. Don't jump from 100% to 10% because "it's been working great for a week." One week is not enough data
  • Advancement criteria: Move to the next level when automated check pass rate > 98% AND human review quality score > 4.0/5 AND hallucination rate < 2% for 2+ consecutive weeks (see the sketch after this list)
  • Regression triggers: If quality drops at any level, revert to the previous level. If error rate hits 5%+ at spot-check level, return to selective review until quality stabilizes
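
To make the advancement criteria concrete, here is a minimal sketch in Python, assuming weekly quality metrics are already being collected; the class and field names are hypothetical, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class WeeklyMetrics:
    """One week of quality metrics for a single agent (hypothetical schema)."""
    auto_check_pass_rate: float   # fraction of outputs passing automated checks
    review_quality_score: float   # mean human review score on a 1-5 scale
    hallucination_rate: float     # fraction of outputs with fabricated facts

def can_advance(recent_weeks: list[WeeklyMetrics]) -> bool:
    """Advance one HITL level only when the last 2+ consecutive weeks
    meet all three thresholds from the progression rules above."""
    if len(recent_weeks) < 2:
        return False
    return all(
        week.auto_check_pass_rate > 0.98
        and week.review_quality_score > 4.0
        and week.hallucination_rate < 0.02
        for week in recent_weeks[-2:]
    )
```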

Review Workflow Design

The review queue

Agent produces output batch (e.g., 50 personalized emails)
  ↓
Automated checks run (instant)
  - Format compliance (word count, structure)
  - Rule compliance (banned phrases, em-dashes)
  - Accuracy cross-check (proper nouns match input)
  ↓
  PASS → Enters review queue
  FAIL → Rejected. Logged. Sent back for regeneration

Review queue:
  At 100% review: all 50 outputs in queue
  At 25% review: ~13 random + any flagged outputs
  At 10% review: ~5 random + any flagged outputs
  ↓
Human reviewer checks each queued output
  → Approve: output goes to production (sent to prospect)
  → Edit: reviewer fixes the issue, output goes to production
  → Reject: output is discarded. Input sent back for regeneration
  ↓
Results logged: approved, edited, or rejected + reason
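
The automated checks at the top of the pipeline are cheap to script. A minimal sketch, assuming the input record carries a `company_name` field; the banned-phrase list and word limit are illustrative, not prescribed:

```python
BANNED_PHRASES = ["quick question", "just checking in"]  # illustrative list

def run_automated_checks(email: str, input_data: dict, max_words: int = 120) -> list[str]:
    """Return a list of failure reasons; an empty list means PASS
    and the output moves on to the human review queue."""
    failures = []
    if len(email.split()) > max_words:
        failures.append("over word limit")
    if "\u2014" in email:  # em-dash, banned per the rule-compliance check
        failures.append("contains em-dash")
    for phrase in BANNED_PHRASES:
        if phrase in email.lower():
            failures.append(f"banned phrase: {phrase!r}")
    # Simplified accuracy cross-check: the company name from the input
    # data must appear verbatim if the email is personalized to it.
    company = input_data.get("company_name", "")
    if company and company.lower() not in email.lower():
        failures.append("company name from input not found in email")
    return failures
```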

Review workflow rules

  • Automated checks run BEFORE human review. Don't waste human time on outputs that fail automated checks. Fix the obvious errors programmatically, then give humans the nuanced ones
  • Flagged outputs always get reviewed. Even at 10% spot-check, any output flagged by automated checks (low confidence, unusual input, potential hallucination) gets human review. Spot-check applies to unflagged outputs only (see the sampling sketch after this list)
  • Every review decision is logged. Approve, edit, or reject + the reason. This data feeds prompt improvement. If 20% of outputs are edited for the same reason ("tone too formal"), that's a prompt fix
  • Set a target review time per output. 15-30 seconds for email review. 1-2 minutes for research briefs. If review takes longer, the output quality is too low or the reviewer is over-editing
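
A sketch of the queue-selection rule from the bullets above: every flagged output goes in, plus a random sample of the unflagged ones at the current review rate. The `flagged` field is an assumption about how the automated checks mark outputs:

```python
import random

def build_review_queue(outputs: list[dict], review_rate: float) -> list[dict]:
    """Queue every flagged output for human review, plus a random
    sample of the unflagged ones at the current review rate."""
    flagged = [o for o in outputs if o.get("flagged")]
    unflagged = [o for o in outputs if not o.get("flagged")]
    sample_size = round(len(unflagged) * review_rate)
    return flagged + random.sample(unflagged, sample_size)
```

At a 25% rate on a batch of 50 unflagged outputs this queues ~13, matching the numbers in the diagram above.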

What the Reviewer Checks

Review checklist by agent type

Email writer agent:

| Check | How to verify | Time | Priority |
| --- | --- | --- | --- |
| Factual accuracy | Does the email reference real facts from the input data? No fabricated signals, company details, or proof points | 5 sec | P0 |
| Tone and voice | Does it sound like a peer, not a bot? Natural, not robotic? | 5 sec | P1 |
| Relevance | Does the personalization connect to the prospect's actual situation? | 5 sec | P1 |
| Word count | Within limit? (Should be auto-checked, but verify) | 2 sec | P0 |
| CTA appropriate | Is the ask reasonable? 15 minutes, not 45? No "book a demo"? | 2 sec | P1 |
| Would you send this? | The ultimate gut check. If you'd be embarrassed to send it, reject it | 3 sec | P0 |

Research agent:

| Check | How to verify | Time |
| --- | --- | --- |
| Company data correct | Quick check: do the company size, funding, and industry match what you can verify? | 10 sec |
| Signals are real | Can you verify the signal with a quick Google/LinkedIn check? | 15 sec |
| Problem hypothesis is reasonable | Does the hypothesis follow from the data, or is it a stretch? | 10 sec |
| No hallucinated companies or people | Are all named companies and people real and correctly referenced? | 10 sec |

Reply classifier agent:

| Check | How to verify | Time |
| --- | --- | --- |
| Classification matches the reply content | Read the reply. Does the classification (positive, negative, OOO, question) match? | 5 sec |
| Recommended action is appropriate | Is the suggested next step reasonable for this classification? | 3 sec |
| Edge cases handled | Are multi-intent replies ("interested but OOO until the 15th") classified correctly? | 5 sec |

Who Does the Review

Reviewer profiles

| Reviewer | Best for | Pros | Cons |
| --- | --- | --- | --- |
| The SDR/AE who sends the email | Email review | Knows the prospect. Can add context. Owns the relationship | Takes time from selling. May rubber-stamp to save time |
| SDR Manager | Email and sequence review | Quality-focused. Can coach from review data | Limited time. Can't review every email at scale |
| Marketing (content/copy person) | Email template and quality review | Strong writing instinct. Catches tone issues | Doesn't know the prospects individually |
| RevOps / dedicated QA person | All agent types | Systematic. Process-oriented. Can review at volume | May not have domain expertise for quality judgment |
| AI (LLM-as-judge) | Pre-screening and quality scoring | Fast, cheap, consistent | 70-85% agreement with humans. Not reliable as the sole reviewer |

Reviewer assignment rules

  • At 100% review: the sender reviews their own output. The SDR checks the AI-generated emails before they go out. This ensures the sender owns the quality and can add last-second personalization
  • At 25-50% review: SDR Manager spot-checks. The manager reviews a sample of outputs across all reps. This catches quality patterns that individual reps miss
  • At 10% review: rotate between SDR Manager and RevOps. Distributed spot-checking prevents reviewer fatigue
  • Never have the person who wrote the prompt be the sole reviewer. They have blind spots. They'll approve outputs that match their expectations, even if the expectations are wrong. Use an independent reviewer

Review Time Budget

How much time does HITL cost?

| Review rate | Outputs per day | Time per output | Daily review time | Monthly review time |
| --- | --- | --- | --- | --- |
| 100% (full review) | 50 emails | 20 seconds | ~17 minutes | ~6 hours |
| 50% (selective) | 25 emails | 20 seconds | ~8 minutes | ~3 hours |
| 25% (selective) | 13 emails | 20 seconds | ~4 minutes | ~1.5 hours |
| 10% (spot-check) | 5 emails | 30 seconds (more careful) | ~2.5 minutes | ~1 hour |
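
The arithmetic behind the table, as a quick sketch assuming roughly 22 working days per month:

```python
def review_time(outputs_per_day: int, seconds_each: int, workdays: int = 22) -> tuple[float, float]:
    """Return (daily minutes, monthly hours) spent on review."""
    daily_minutes = outputs_per_day * seconds_each / 60
    monthly_hours = daily_minutes * workdays / 60
    return daily_minutes, monthly_hours

daily, monthly = review_time(50, 20)   # full review: 50 emails at 20 seconds each
print(f"{daily:.0f} min/day, {monthly:.1f} hr/month")  # ~17 min/day, ~6.1 hr/month
```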

Time budget rules:

  • Full review of 50 emails takes ~17 minutes. This is less than the time saved by not writing 50 emails manually. HITL at 100% is still a net time savings vs manual email writing
  • Target 20-30 seconds per email review. If review consistently takes 60+ seconds, the AI output quality is too low. Fix the prompt, don't budget more review time
  • Schedule review as a time block. "Review AI output" at 8:30am for 15 minutes, not "review throughout the day." Batched review is faster than reviewing one at a time as they arrive

Handling Review Results

What to do with each review decision

| Decision | Action | Data logged |
| --- | --- | --- |
| Approve | Output goes to production as-is | Timestamp, reviewer, "approved" |
| Edit (minor) | Reviewer makes a small fix (typo, awkward phrasing). Output goes to production | Timestamp, reviewer, "edited - minor", what was changed |
| Edit (major) | Reviewer significantly rewrites. Output goes to production | Timestamp, reviewer, "edited - major", what was changed and why |
| Reject | Output discarded. Input sent back for regeneration with feedback | Timestamp, reviewer, "rejected", rejection reason |
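
A sketch of what one logged decision might look like; the field names are illustrative, not a prescribed schema:

```python
from datetime import datetime, timezone

VALID_DECISIONS = {"approved", "edited - minor", "edited - major", "rejected"}

def log_review(output_id: str, reviewer: str, decision: str,
               reason: str | None = None, diff: str | None = None) -> dict:
    """Build one review-log entry matching the table above."""
    assert decision in VALID_DECISIONS
    return {
        "output_id": output_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reviewer": reviewer,
        "decision": decision,
        "reason": reason,   # required for rejects; useful for major edits
        "diff": diff,       # what was changed, for edits
    }
```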

Using review data to improve the agent

| Pattern in review data | What it tells you | Prompt fix |
| --- | --- | --- |
| 15% of outputs edited for the same reason (e.g., "tone too formal") | Systematic prompt issue | Add an instruction to the prompt ("Write in a casual, conversational tone") plus an example of the desired tone |
| 5% of outputs rejected for hallucination | The "don't fabricate" instruction isn't strong enough, or the input data has gaps | Strengthen the anti-hallucination rule ("If a field is missing, say 'Not found' instead of guessing"). Check the input data pipeline for empty fields |
| Reviewers consistently add the same line | There's a missing element the prompt doesn't generate | Add the element to the prompt: "Always include [X] in the output" |
| Review time increasing over time | Output quality is degrading (prompt drift, data quality decline, model update) | Re-run evals. Compare current output to the best outputs from 30 days ago. Find the drift |
| Edit rate below 5% for 4+ weeks | Agent quality is stable. Ready to reduce the review rate | Advance to the next HITL level (e.g., 100% → 50%) |
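
Finding the "same reason" patterns is a small aggregation over the review log. A sketch, assuming the log entries from the earlier `log_review` example:

```python
from collections import Counter

def edit_reason_patterns(log: list[dict], threshold: float = 0.15) -> list[str]:
    """Return edit/reject reasons that account for more than `threshold`
    of all reviewed outputs: these are candidates for a prompt fix."""
    if not log:
        return []
    reasons = Counter(entry["reason"] for entry in log
                      if entry["decision"] != "approved" and entry["reason"])
    return [reason for reason, count in reasons.items()
            if count / len(log) > threshold]
```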

Designing for Review Speed

Make review as fast as possible

| Technique | How it helps | Implementation |
| --- | --- | --- |
| Side-by-side display | Show input data next to output. Reviewer checks accuracy without switching tabs | Build a review UI or use a spreadsheet with input in column A, output in column B |
| Highlight personalized elements | Bold or color the AI-generated personalization so the reviewer can focus on what to check | Template the display to highlight dynamic content |
| Pre-check indicators | Show automated check results (pass/fail for word count, banned phrases) alongside the output | Reviewer skips checks that already passed and focuses on quality and accuracy |
| One-click approve/reject | Approve with one click. Reject with one click plus a dropdown reason | Build into the review tool or use a simple form |
| Batch review mode | Show 10-20 outputs in a scrollable list. Reviewer approves/rejects in sequence | Faster than opening each output individually |

Review speed rules

  • The review interface matters. A reviewer checking emails in a CRM, switching to a LinkedIn tab, switching to a spreadsheet to verify, switching back to approve is slow. Build a side-by-side view where input and output are visible together
  • Pre-pass automated checks. Don't make the human count words or search for banned phrases. That's the automation's job. The human checks tone, accuracy, and relevance. The things only a human can judge
  • One-click approve/reject. If approving an email takes 3 clicks and a confirmation dialog, the reviewer burns 5 seconds per email on interface friction. That's 4 minutes per batch of 50. Streamline

HITL for Different GTM Workflows

Cold email (AI-generated or AI-personalized)

| Phase | Review approach |
| --- | --- |
| First 2 weeks | 100% review. SDR reviews every email before send |
| Weeks 3-4 | 50% review. SDR reviews half, manager spot-checks the rest |
| Weeks 5-8 | 25% spot-check. Random sample plus any flagged outputs |
| Week 8+ | 10% spot-check. Focus on new prompt versions or new ICP segments |

Research briefs (AI-generated account research)

| Phase | Review approach |
| --- | --- |
| First 2 weeks | 100% review. AE or SDR verifies every brief before using it |
| Weeks 3-4 | 50% review. Focus on accuracy (company data, signals) |
| Week 5+ | 25% spot-check. Auto-checks handle format; a human checks accuracy on the sample |

Reply classification (AI-classified inbound replies)

| Phase | Review approach |
| --- | --- |
| First 2 weeks | 100% review. Every classification verified by an SDR |
| Weeks 3-4 | Review low-confidence classifications only (< 80% confidence) |
| Week 5+ | Spot-check 10% of all classifications plus 100% of low-confidence ones |
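
The week 5+ policy reduces to a small routing function. A sketch; the 0.80 confidence threshold comes from the table above, everything else is illustrative:

```python
import random

def needs_review(classification: dict, spot_check_rate: float = 0.10) -> bool:
    """Week 5+ policy: review 100% of low-confidence classifications
    plus a random 10% sample of the rest."""
    if classification["confidence"] < 0.80:
        return True
    return random.random() < spot_check_rate
```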

CRM data updates (AI-enriched or AI-processed)

| Phase | Review approach |
| --- | --- |
| Always | 100% review before write. AI proposes the update; a human approves before the CRM is modified |
| Exception | Bulk enrichment (filling missing fields) can run at 25% review after validation on the first batch |

CRM write rule: Never allow an AI agent to write to CRM without human approval. One wrong update cascading through workflows, automations, and reports causes damage that takes hours to fix. Always review CRM writes.


Measurement

| Metric | Definition | Target | Frequency |
| --- | --- | --- | --- |
| Review rate | % of outputs reviewed by a human | Decreasing over time (100% → 10%) | Weekly |
| Approve rate | % of reviewed outputs approved without edits | > 85% (at the current review level) | Weekly |
| Edit rate (minor) | % of reviewed outputs with minor edits | < 10% | Weekly |
| Edit rate (major) | % of reviewed outputs with major rewrites | < 3% | Weekly |
| Reject rate | % of reviewed outputs rejected entirely | < 2% | Weekly |
| Average review time per output | Seconds per review | 15-30 sec for emails; 60-120 sec for briefs | Monthly |
| Reviewer agreement (if multiple reviewers) | Do two reviewers make the same decision on the same output? | > 85% agreement | Quarterly (calibration) |
| Time saved vs manual creation | Hours saved by AI + review vs fully manual | Track to justify the investment | Monthly |
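
Most of these rates fall directly out of the review log. A sketch, reusing the decision labels from the `log_review` example earlier:

```python
def weekly_metrics(log: list[dict]) -> dict:
    """Compute approve/edit/reject rates from one week of review-log entries."""
    if not log:
        return {}
    total = len(log)
    def share(decision: str) -> float:
        return sum(1 for entry in log if entry["decision"] == decision) / total
    return {
        "approve_rate": share("approved"),
        "edit_rate_minor": share("edited - minor"),
        "edit_rate_major": share("edited - major"),
        "reject_rate": share("rejected"),
    }
```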

Advancement criteria

| Current level | Advance to next level when | Revert to previous level when |
| --- | --- | --- |
| 100% review | Approve rate > 90% for 2 consecutive weeks | N/A (this is the starting point) |
| 50% review | Approve rate > 90% AND reject rate < 2% for 2 weeks | Reject rate > 5% in any week |
| 25% review | Approve rate > 92% AND reject rate < 1% for 4 weeks | Reject rate > 3% in any week |
| 10% spot-check | Approve rate > 95% AND reject rate < 0.5% for 4 weeks | Reject rate > 2% in any week |
| Exception-only | Only reached after 60+ days at 10% with near-zero rejects | Any pattern of rejects triggers a return to 10% |
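
The same table expressed as data, a sketch for wiring into a weekly metrics job; the thresholds are copied from the table, the structure is hypothetical:

```python
# Per level: advancement thresholds (approve/reject/weeks) and the weekly
# reject rate that forces a revert. None means not applicable at that level.
CRITERIA = {
    "100%": {"approve": 0.90, "reject": None,  "weeks": 2, "revert_reject": None},
    "50%":  {"approve": 0.90, "reject": 0.02,  "weeks": 2, "revert_reject": 0.05},
    "25%":  {"approve": 0.92, "reject": 0.01,  "weeks": 4, "revert_reject": 0.03},
    "10%":  {"approve": 0.95, "reject": 0.005, "weeks": 4, "revert_reject": 0.02},
}

def should_revert(level: str, weekly_reject_rate: float) -> bool:
    """Revert one HITL level when the weekly reject rate breaches the limit."""
    limit = CRITERIA[level]["revert_reject"]
    return limit is not None and weekly_reject_rate > limit
```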

Anti-Pattern Check

  • Starting at 0% review. "The agent seems to work. Let's just send." One hallucinated claim to a Tier 1 ABM account. One wrong company name. One fabricated proof point. The damage exceeds the cost of 100% review for a month. Start at 100%. Always
  • Reviewing but not logging results. Reviews happen but nobody tracks approve/edit/reject rates. Without data, you can't measure quality improvement, identify prompt issues, or justify reducing the review rate. Log every decision
  • Reviewer rubber-stamps to save time. The SDR approves 50 emails in 2 minutes without reading any of them. Review fatigue. Fix with rotation (different reviewer each day), batch review UI (fast approval flow), or quality audits (manager spot-checks reviewer decisions)
  • Skipping HITL for "low-risk" output. "It's just a research brief, nobody sees it but us." The research brief informs the cold email. A wrong fact in the brief becomes a hallucinated claim in the email. Review the upstream output, not just the customer-facing output
  • Same review rate for 6 months. If the agent has been at 100% review for 6 months with 95% approve rate, you're over-reviewing. Advance to 50%. The review rate should decrease over time as quality stabilizes
  • No advancement criteria. "We'll reduce review when we feel confident." Confidence is not a metric. Define specific criteria: approve rate, reject rate, duration at current level. Advance based on data
  • AI writes to CRM without review. An enrichment agent updates 500 contact records. 15 have wrong companies. The wrong data cascades into lead scoring, routing, and outbound. 15 wrong records = 15 embarrassing emails. Always review CRM writes
  • One reviewer for everything. One person reviews all AI output across all agents. They become the bottleneck. Rotate reviewers. Train 2-3 people. Distribute the load