---
name: human-in-the-loop-review
slug: human-in-the-loop-review
description: This skill should be used when the user asks to "design human review for AI output", "set up human-in-the-loop", "build a review process for agent output", "create a QA process for AI-generated content", "design human oversight for agents", "set up approval gates for AI", "build a review workflow for AI emails", "design human QA for agent output", "create a human review cadence for AI", or any variation of designing human review and approval processes for AI agent output in B2B SaaS GTM.
category: general
---

# Human-in-the-Loop Review

Human-in-the-loop (HITL) review is the process of having a human check, edit, or approve AI-generated output before it reaches a customer or enters a system of record. Every production AI agent that produces customer-facing output needs HITL. The question is not whether to include human review. It's how much, at what stage, and how to reduce it over time without reducing quality.

The principle: start at 100% human review. Earn the right to reduce it through demonstrated quality. 100% → 50% → 25% → 10% spot-check. Never 0%. Even the most reliable agent produces occasional bad output. The cost of one bad email to a Tier 1 ABM account exceeds the cost of reviewing 1,000 good emails.

## The HITL Spectrum

| Level | What the human does | When to use | Review rate |
|-------|-------------------|-------------|-------------|
| Full review | Human reads every output. Approves, edits, or rejects before it goes out | First 2-4 weeks of any new agent or prompt | 100% |
| Selective review | Human reviews flagged outputs (low confidence, edge cases) + random sample | After 2-4 weeks of stable quality. Automated checks pass consistently | 25-50% |
| Spot-check | Human reviews a random 10-20% sample. The rest goes out automatically | After 30+ days of stable quality with < 2% error rate | 10-20% |
| Exception-only | Human reviews only outputs that fail automated checks | Mature agents with 60+ days of production data and < 1% error rate | 1-5% |

### Progression rules

- **Start at 100%. Always.** No exceptions. The first 100-200 outputs from any new agent should be human-reviewed. This catches prompt issues, hallucination patterns, and edge cases before customers see them
- **Advance one level at a time.** 100% → 50% → 25% → 10%. Don't jump from 100% to 10% because "it's been working great for a week." One week is not enough data
- **Advancement criteria:** Move to the next level when automated check pass rate > 98% AND human review quality score > 4.0/5 AND hallucination rate < 2% for 2+ consecutive weeks
- **Regression triggers:** If quality drops at any level, revert to the previous level. If error rate hits 5%+ at spot-check level, return to selective review until quality stabilizes
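The advancement criteria above can be expressed as a small check over weekly metrics. This is a minimal sketch, assuming quality metrics are logged once per week as dicts; the field names are illustrative, not a prescribed schema.

```python
# Sketch: has this agent earned the next review level?
# Assumes one metrics dict per week; all names are illustrative.

def ready_to_advance(weekly_metrics: list[dict]) -> bool:
    """True if the last 2+ consecutive weeks meet all advancement criteria:
    automated check pass rate > 98%, review quality score > 4.0/5,
    hallucination rate < 2%."""
    if len(weekly_metrics) < 2:
        return False
    return all(
        week["auto_check_pass_rate"] > 0.98
        and week["review_quality_score"] > 4.0
        and week["hallucination_rate"] < 0.02
        for week in weekly_metrics[-2:]
    )

weeks = [
    {"auto_check_pass_rate": 0.99, "review_quality_score": 4.3, "hallucination_rate": 0.010},
    {"auto_check_pass_rate": 0.985, "review_quality_score": 4.1, "hallucination_rate": 0.015},
]
print(ready_to_advance(weeks))  # True: both weeks clear every threshold
```

A single strong week returns `False` by design, which enforces the "2+ consecutive weeks" rule mechanically rather than by judgment.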

---

## Review Workflow Design

### The review queue

```
Agent produces output batch (e.g., 50 personalized emails)
  ↓
Automated checks run (instant)
  - Format compliance (word count, structure)
  - Rule compliance (banned phrases, em-dashes)
  - Accuracy cross-check (proper nouns match input)
  ↓
  PASS → Enters review queue
  FAIL → Rejected. Logged. Sent back for regeneration

Review queue:
  At 100% review: all 50 outputs in queue
  At 25% review: ~13 random + any flagged outputs
  At 10% review: ~5 random + any flagged outputs
  ↓
Human reviewer checks each queued output
  → Approve: output goes to production (sent to prospect)
  → Edit: reviewer fixes the issue, output goes to production
  → Reject: output is discarded. Input sent back for regeneration
  ↓
Results logged: approved, edited, or rejected + reason
```

### Review workflow rules

- **Automated checks run BEFORE human review.** Don't waste human time on outputs that fail automated checks. Fix the obvious errors programmatically, then give humans the nuanced ones
- **Flagged outputs always get reviewed.** Even at 10% spot-check, any output flagged by automated checks (low confidence, unusual input, potential hallucination) gets human review. Spot-check applies to unflagged outputs only
- **Every review decision is logged.** Approve, edit, or reject + the reason. This data feeds prompt improvement. If 20% of outputs are edited for the same reason ("tone too formal"), that's a prompt fix
- **Set a target review time per output.** 15-30 seconds for email review. 1-2 minutes for research briefs. If review takes longer, the output quality is too low or the reviewer is over-editing

---

## What the Reviewer Checks

### Review checklist by agent type

**Email writer agent:**

| Check | How to verify | Time | Priority |
|-------|-------------|------|----------|
| Factual accuracy | Does the email reference real facts from the input data? No fabricated signals, company details, or proof points | 5 sec | P0 |
| Tone and voice | Does it sound like a peer, not a bot? Natural, not robotic? | 5 sec | P1 |
| Relevance | Does the personalization connect to the prospect's actual situation? | 5 sec | P1 |
| Word count | Within limit? (Should be auto-checked, but verify) | 2 sec | P0 |
| CTA appropriate | Is the ask reasonable? 15 minutes, not 45? No "book a demo"? | 2 sec | P1 |
| Would you send this? | The ultimate gut check. If you'd be embarrassed to send it, reject | 3 sec | P0 |

**Research agent:**

| Check | How to verify | Time |
|-------|-------------|------|
| Company data correct | Quick check: does the company size, funding, industry match what you can verify? | 10 sec |
| Signals are real | Can you verify the signal with a quick Google/LinkedIn check? | 15 sec |
| Problem hypothesis is reasonable | Does the hypothesis follow from the data, or is it a stretch? | 10 sec |
| No hallucinated companies or people | Are all named companies and people real and correctly referenced? | 10 sec |

**Reply classifier agent:**

| Check | How to verify | Time |
|-------|-------------|------|
| Classification matches the reply content | Read the reply. Does the classification (positive, negative, OOO, question) match? | 5 sec |
| Recommended action is appropriate | Is the suggested next step reasonable for this classification? | 3 sec |
| Edge cases handled | Multi-intent replies ("interested but OOO until the 15th") classified correctly? | 5 sec |

---

## Who Does the Review

### Reviewer profiles

| Reviewer | Best for | Pros | Cons |
|----------|---------|------|------|
| The SDR/AE who sends the email | Email review | Knows the prospect. Can add context. Owns the relationship | Takes time from selling. May rubber-stamp to save time |
| SDR Manager | Email and sequence review | Quality-focused. Can coach from review data | Limited time. Can't review every email at scale |
| Marketing (content/copy person) | Email template and quality review | Strong writing instinct. Catches tone issues | Doesn't know the prospects individually |
| RevOps / dedicated QA person | All agent types | Systematic. Process-oriented. Can review at volume | May not have domain expertise for quality judgment |
| AI (LLM-as-judge) | Pre-screening. Quality scoring | Fast, cheap, consistent | 70-85% agreement with humans. Not reliable as sole reviewer |

### Reviewer assignment rules

- **At 100% review: the sender reviews their own output.** The SDR checks the AI-generated emails before they go out. This ensures the sender owns the quality and can add last-second personalization
- **At 25-50% review: SDR Manager spot-checks.** The manager reviews a sample of outputs across all reps. This catches quality patterns that individual reps miss
- **At 10% review: rotate between SDR Manager and RevOps.** Distributed spot-checking prevents reviewer fatigue
- **Never have the person who wrote the prompt be the sole reviewer.** They have blind spots. They'll approve outputs that match their expectations, even if the expectations are wrong. Use an independent reviewer

---

## Review Time Budget

### How much time does HITL cost?

| Review rate | Outputs per day | Time per output | Daily review time | Monthly review time |
|------------|----------------|----------------|------------------|-------------------|
| 100% (full review) | 50 emails | 20 seconds | ~17 minutes | ~6 hours |
| 50% (selective) | 25 emails | 20 seconds | ~8 minutes | ~3 hours |
| 25% (selective) | 13 emails | 20 seconds | ~4 minutes | ~1.5 hours |
| 10% (spot-check) | 5 emails | 30 seconds (more careful) | ~2.5 minutes | ~1 hour |

**Time budget rules:**
- **Full review of 50 emails takes ~17 minutes.** This is less than the time saved by not writing 50 emails manually. HITL at 100% is still a net time savings vs manual email writing
- **Target 20-30 seconds per email review.** If review consistently takes 60+ seconds, the AI output quality is too low. Fix the prompt, don't budget more review time
- **Schedule review as a time block.** "Review AI output" at 8:30am for 15 minutes, not "review throughout the day." Batched review is faster than reviewing one at a time as they arrive
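The table's arithmetic is straightforward to reproduce. A minimal sketch, assuming roughly 20 working days per month:

```python
def review_time(outputs_per_day: int, seconds_per_output: int,
                working_days: int = 20) -> tuple[float, float]:
    """Return (daily minutes, monthly hours) of review time."""
    daily_min = outputs_per_day * seconds_per_output / 60
    monthly_hr = daily_min * working_days / 60
    return round(daily_min, 1), round(monthly_hr, 1)

print(review_time(50, 20))  # (16.7, 5.6) -> ~17 min/day, ~6 h/month at 100% review
print(review_time(5, 30))   # (2.5, 0.8)  -> ~2.5 min/day at 10% spot-check
```

Plugging in your own batch sizes makes the cost of each review level concrete before you commit to it.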

---

## Handling Review Results

### What to do with each review decision

| Decision | Action | Data logged |
|----------|--------|-----------|
| Approve | Output goes to production as-is | Timestamp, reviewer, "approved" |
| Edit (minor) | Reviewer makes a small fix (typo, awkward phrasing). Output goes to production | Timestamp, reviewer, "edited - minor", what was changed |
| Edit (major) | Reviewer significantly rewrites. Output goes to production | Timestamp, reviewer, "edited - major", what was changed and why |
| Reject | Output discarded. Input sent back for regeneration with feedback | Timestamp, reviewer, "rejected", rejection reason |

### Using review data to improve the agent

| Pattern in review data | What it tells you | Prompt fix |
|----------------------|-------------------|-----------|
| 15% of outputs edited for same reason (e.g., "tone too formal") | Systematic prompt issue | Add instruction to prompt: "Write in a casual, conversational tone" + add an example of the desired tone |
| 5% of outputs rejected for hallucination | The "don't fabricate" instruction isn't strong enough, or input data has gaps | Strengthen the anti-hallucination rule. Add: "If a field is missing, say 'Not found' instead of guessing." Check input data pipeline for empty fields |
| Reviewers consistently add the same line | There's a missing element the prompt doesn't generate | Add the element to the prompt: "Always include [X] in the output" |
| Review time increasing over time | Output quality is degrading (prompt drift, data quality decline, model update) | Re-run evals. Compare current output to the best outputs from 30 days ago. Find the drift |
| Edit rate below 5% for 4+ weeks | Agent quality is stable. Ready to reduce review rate | Advance to the next HITL level (e.g., 100% → 50%) |
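Detecting the first pattern in the table (a recurring edit reason above a share threshold) is a simple aggregation over the review log. A sketch, assuming each logged decision is a dict with `decision` and `reason` fields as in the logging table above:

```python
from collections import Counter

def recurring_edit_reasons(decisions: list[dict], threshold: float = 0.15) -> list[str]:
    """Return edit reasons accounting for >= threshold of all reviewed
    outputs -- a sign of a systematic prompt issue, not one-off fixes."""
    total = len(decisions)
    reasons = Counter(d["reason"] for d in decisions if d["decision"].startswith("edited"))
    return [r for r, n in reasons.items() if n / total >= threshold]

log = (
    [{"decision": "approved", "reason": None}] * 40
    + [{"decision": "edited - minor", "reason": "tone too formal"}] * 8
    + [{"decision": "edited - minor", "reason": "typo"}] * 2
)
print(recurring_edit_reasons(log))  # ['tone too formal'] -- 8/50 = 16% share
```

Running this weekly over the logged decisions turns "reviewers keep fixing the same thing" from an anecdote into a prompt-fix ticket.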

---

## Designing for Review Speed

### Make review as fast as possible

| Technique | How it helps | Implementation |
|-----------|-------------|---------------|
| Side-by-side display | Show input data next to output. Reviewer checks accuracy without switching tabs | Build a review UI or use a spreadsheet with input in column A, output in column B |
| Highlight personalized elements | Bold or color the AI-generated personalization so the reviewer can focus on what to check | Template the display to highlight dynamic content |
| Pre-check indicators | Show automated check results (pass/fail for word count, banned phrases) alongside the output | Reviewer skips checks that already passed and focuses on quality and accuracy |
| One-click approve/reject | Approve with one click. Reject with one click + dropdown reason | Build into the review tool or use a simple form |
| Batch review mode | Show 10-20 outputs in a scrollable list. Reviewer approves/rejects in sequence | Faster than opening each output individually |

### Review speed rules

- **The review interface matters.** A reviewer who checks an email in the CRM, switches to a LinkedIn tab, switches to a spreadsheet to verify, then switches back to approve is slow. Build a side-by-side view where input and output are visible together
- **Pre-pass automated checks.** Don't make the human count words or search for banned phrases. That's the automation's job. The human checks tone, accuracy, and relevance. The things only a human can judge
- **One-click approve/reject.** If approving an email takes 3 clicks and a confirmation dialog, the reviewer burns 5 seconds per email on interface friction. That's 4 minutes per batch of 50. Streamline

---

## HITL for Different GTM Workflows

### Cold email (AI-generated or AI-personalized)

| Phase | Review approach |
|-------|---------------|
| First 2 weeks | 100% review. SDR reviews every email before send |
| Week 3-4 | 50% review. SDR reviews half, manager spot-checks the rest |
| Week 5-8 | 25% spot-check. Random sample + any flagged outputs |
| Week 9+ | 10% spot-check. Focus on new prompt versions or new ICP segments |

### Research briefs (AI-generated account research)

| Phase | Review approach |
|-------|---------------|
| First 2 weeks | 100% review. AE or SDR verifies every brief before using |
| Week 3-4 | 50% review. Focus on accuracy (company data, signals) |
| Week 5+ | 25% spot-check. Auto-checks handle format. Human checks accuracy on the sample |

### Reply classification (AI-classified inbound replies)

| Phase | Review approach |
|-------|---------------|
| First 2 weeks | 100% review. Every classification verified by SDR |
| Week 3-4 | Review low-confidence classifications only (< 80% confidence) |
| Week 5+ | Spot-check 10% of all classifications + 100% of low-confidence |

### CRM data updates (AI-enriched or AI-processed)

| Phase | Review approach |
|-------|---------------|
| Always | 100% review before write. AI proposes the update. Human approves before CRM is modified |
| Exception | Bulk enrichment (filling missing fields) can run at 25% review after validation on the first batch |

**CRM write rule:** Never allow an AI agent to write to CRM without human approval. One wrong update cascading through workflows, automations, and reports causes damage that takes hours to fix. Always review CRM writes.

---

## Measurement

| Metric | Definition | Target | Frequency |
|--------|-----------|--------|-----------|
| Review rate | % of outputs reviewed by a human | Decreasing over time (100% → 10%) | Weekly |
| Approve rate | % of reviewed outputs approved without edits | > 85% (at current review level) | Weekly |
| Edit rate (minor) | % of reviewed outputs with minor edits | < 10% | Weekly |
| Edit rate (major) | % of reviewed outputs with major rewrites | < 3% | Weekly |
| Reject rate | % of reviewed outputs rejected entirely | < 2% | Weekly |
| Average review time per output | Seconds per review | 15-30 sec for emails. 60-120 sec for briefs | Monthly |
| Reviewer agreement (if multiple reviewers) | Do two reviewers make the same decision on the same output? | > 85% agreement | Quarterly (calibration) |
| Time saved vs manual creation | Hours saved by AI + review vs fully manual | Track to justify the investment | Monthly |
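Reviewer agreement for the quarterly calibration is a simple match rate over paired decisions. A minimal sketch, assuming both reviewers judged the same outputs in the same order:

```python
def agreement_rate(decisions_a: list[str], decisions_b: list[str]) -> float:
    """Fraction of outputs where two reviewers made the same call."""
    assert len(decisions_a) == len(decisions_b)
    matches = sum(a == b for a, b in zip(decisions_a, decisions_b))
    return matches / len(decisions_a)

a = ["approve", "approve", "reject", "edit", "approve"]
b = ["approve", "edit", "reject", "edit", "approve"]
print(agreement_rate(a, b))  # 0.8 -- below the 85% calibration target
```

When agreement falls below target, the disagreements themselves are the calibration agenda: review them together and align on the rubric.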

### Advancement criteria

| Current level | Advance to next level when | Revert to previous level when |
|--------------|---------------------------|------------------------------|
| 100% review | Approve rate > 90% for 2 consecutive weeks | N/A (this is the starting point) |
| 50% review | Approve rate > 90% AND reject rate < 2% for 2 weeks | Reject rate > 5% in any week |
| 25% review | Approve rate > 92% AND reject rate < 1% for 4 weeks | Reject rate > 3% in any week |
| 10% spot-check | Approve rate > 95% AND reject rate < 0.5% for 4 weeks | Reject rate > 2% in any week |
| Exception-only | Only reached after 60+ days at 10% with near-zero rejects | Any pattern of rejects triggers return to 10% |
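The advancement and reversion rules above form a small state machine over the review levels. This is an illustrative sketch of that table (the exception-only level is omitted for brevity); the thresholds mirror the rows above, and the structure is an assumption, not a prescribed implementation.

```python
# Sketch of the advancement/reversion table as a transition function.
# Thresholds mirror the table above; names and structure are illustrative.

LEVELS = ["100%", "50%", "25%", "10%"]

CRITERIA = {  # level -> (min approve rate, max reject rate, weeks required)
    "100%": (0.90, None, 2),
    "50%": (0.90, 0.02, 2),
    "25%": (0.92, 0.01, 4),
    "10%": (0.95, 0.005, 4),
}

REVERT_REJECT = {"50%": 0.05, "25%": 0.03, "10%": 0.02}  # weekly reject rate that forces reversion

def next_level(level: str, weekly: list[dict]) -> str:
    """Advance, revert, or hold based on recent weekly review metrics."""
    approve_min, reject_max, weeks = CRITERIA[level]
    if level in REVERT_REJECT and weekly[-1]["reject_rate"] > REVERT_REJECT[level]:
        return LEVELS[LEVELS.index(level) - 1]  # revert one level
    recent = weekly[-weeks:]
    ok = len(recent) == weeks and all(
        w["approve_rate"] > approve_min
        and (reject_max is None or w["reject_rate"] < reject_max)
        for w in recent
    )
    if ok and level != LEVELS[-1]:
        return LEVELS[LEVELS.index(level) + 1]
    return level

print(next_level("100%", [{"approve_rate": 0.93, "reject_rate": 0.01}] * 2))  # 50%
```

Encoding the table this way removes "we feel confident" from the decision entirely: the weekly metrics, not the mood, move the level.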

---

## Anti-Pattern Check

- Starting at 0% review. "The agent seems to work. Let's just send." One hallucinated claim to a Tier 1 ABM account. One wrong company name. One fabricated proof point. The damage exceeds the cost of 100% review for a month. Start at 100%. Always
- Reviewing but not logging results. Reviews happen but nobody tracks approve/edit/reject rates. Without data, you can't measure quality improvement, identify prompt issues, or justify reducing the review rate. Log every decision
- Reviewer rubber-stamps to save time. The SDR approves 50 emails in 2 minutes without reading any of them. Review fatigue. Fix with rotation (different reviewer each day), batch review UI (fast approval flow), or quality audits (manager spot-checks reviewer decisions)
- Skipping HITL for "low-risk" output. "It's just a research brief, nobody sees it but us." The research brief informs the cold email. A wrong fact in the brief becomes a hallucinated claim in the email. Review the upstream output, not just the customer-facing output
- Same review rate for 6 months. If the agent has been at 100% review for 6 months with 95% approve rate, you're over-reviewing. Advance to 50%. The review rate should decrease over time as quality stabilizes
- No advancement criteria. "We'll reduce review when we feel confident." Confidence is not a metric. Define specific criteria: approve rate, reject rate, duration at current level. Advance based on data
- AI writes to CRM without review. An enrichment agent updates 500 contact records. 15 have wrong companies. The wrong data cascades into lead scoring, routing, and outbound. 15 wrong records = 15 embarrassing emails. Always review CRM writes
- One reviewer for everything. One person reviews all AI output across all agents. They become the bottleneck. Rotate reviewers. Train 2-3 people. Distribute the load