
human-in-the-loop-review

This skill should be used when the user asks to "design human review for AI output", "set up human-in-the-loop", "build a review process for agent output", "create a QA process for AI-generated content", "design human oversight for agents", "set up approval gates for AI", "build a review workflow for AI emails", "design human QA for agent output", "create a human review cadence for AI", or any variation of designing human review and approval processes for AI agent output in B2B SaaS GTM.

Human-in-the-Loop Review

Human-in-the-loop (HITL) review is the process of having a human check, edit, or approve AI-generated output before it reaches a customer or enters a system of record. Every production AI agent that produces customer-facing output needs HITL. The question is not whether to include human review. It's how much, at what stage, and how to reduce it over time without reducing quality.

The principle: start at 100% human review. Earn the right to reduce it through demonstrated quality. 100% → 50% → 25% → 10% spot-check. Never 0%. Even the most reliable agent produces occasional bad output. The cost of one bad email to a Tier 1 ABM account exceeds the cost of reviewing 1,000 good emails.

The HITL Spectrum

| Level | What the human does | When to use | Review rate |
| --- | --- | --- | --- |
| Full review | Human reads every output and approves, edits, or rejects it before it goes out | First 2-4 weeks of any new agent or prompt | 100% |
| Selective review | Human reviews flagged outputs (low confidence, edge cases) plus a random sample | After 2-4 weeks of stable quality, with automated checks passing consistently | 25-50% |
| Spot-check | Human reviews a random 10-20% sample; the rest goes out automatically | After 30+ days of stable quality with < 2% error rate | 10-20% |
| Exception-only | Human reviews only outputs that fail automated checks | Mature agents with 60+ days of production data and < 1% error rate | 1-5% |

Progression rules

  • Start at 100%. Always. No exceptions. The first 100-200 outputs from any new agent should be human-reviewed. This catches prompt issues, hallucination patterns, and edge cases before customers see them
  • Advance one level at a time. 100% → 50% → 25% → 10%. Don't jump from 100% to 10% because "it's been working great for a week." One week is not enough data
  • Advancement criteria: Move to the next level when automated check pass rate > 98% AND human review quality score > 4.0/5 AND hallucination rate < 2% for 2+ consecutive weeks (see the sketch after this list)
  • Regression triggers: If quality drops at any level, revert to the previous level. If error rate hits 5%+ at spot-check level, return to selective review until quality stabilizes
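
To make the advancement criteria concrete, here is a minimal sketch in Python, assuming weekly quality metrics are already being collected; the class and field names are hypothetical, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class WeeklyMetrics:
    """One week of quality metrics for a single agent (hypothetical schema)."""
    auto_check_pass_rate: float   # fraction of outputs passing automated checks
    review_quality_score: float   # mean human review score on a 1-5 scale
    hallucination_rate: float     # fraction of outputs with fabricated facts

def can_advance(recent_weeks: list[WeeklyMetrics]) -> bool:
    """Advance one HITL level only when the last 2+ consecutive weeks
    meet all three thresholds from the progression rules above."""
    if len(recent_weeks) < 2:
        return False
    return all(
        week.auto_check_pass_rate > 0.98
        and week.review_quality_score > 4.0
        and week.hallucination_rate < 0.02
        for week in recent_weeks[-2:]
    )
```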

Review Workflow Design

The review queue

Agent produces output batch (e.g., 50 personalized emails)
  ↓
Automated checks run (instant)
  - Format compliance (word count, structure)
  - Rule compliance (banned phrases, em-dashes)
  - Accuracy cross-check (proper nouns match input)
  ↓
  PASS → Enters review queue
  FAIL → Rejected. Logged. Sent back for regeneration

Review queue:
  At 100% review: all 50 outputs in queue
  At 25% review: ~13 random + any flagged outputs
  At 10% review: ~5 random + any flagged outputs
  ↓
Human reviewer checks each queued output
  → Approve: output goes to production (sent to prospect)
  → Edit: reviewer fixes the issue, output goes to production
  → Reject: output is discarded. Input sent back for regeneration
  ↓
Results logged: approved, edited, or rejected + reason
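
The automated checks at the top of the pipeline are cheap to script. A minimal sketch, assuming the input record carries a `company_name` field; the banned-phrase list and word limit are illustrative, not prescribed:

```python
BANNED_PHRASES = ["quick question", "just checking in"]  # illustrative list

def run_automated_checks(email: str, input_data: dict, max_words: int = 120) -> list[str]:
    """Return a list of failure reasons; an empty list means PASS
    and the output moves on to the human review queue."""
    failures = []
    if len(email.split()) > max_words:
        failures.append("over word limit")
    if "\u2014" in email:  # em-dash, banned per the rule-compliance check
        failures.append("contains em-dash")
    for phrase in BANNED_PHRASES:
        if phrase in email.lower():
            failures.append(f"banned phrase: {phrase!r}")
    # Simplified accuracy cross-check: the company name from the input
    # data must appear verbatim if the email is personalized to it.
    company = input_data.get("company_name", "")
    if company and company.lower() not in email.lower():
        failures.append("company name from input not found in email")
    return failures
```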

Review workflow rules

  • Automated checks run BEFORE human review. Don't waste human time on outputs that fail automated checks. Fix the obvious errors programmatically, then give humans the nuanced ones
  • Flagged outputs always get reviewed. Even at 10% spot-check, any output flagged by automated checks (low confidence, unusual input, potential hallucination) gets human review. Spot-check applies to unflagged outputs only (see the sampling sketch after this list)
  • Every review decision is logged. Approve, edit, or reject + the reason. This data feeds prompt improvement. If 20% of outputs are edited for the same reason ("tone too formal"), that's a prompt fix
  • Set a target review time per output. 15-30 seconds for email review. 1-2 minutes for research briefs. If review takes longer, the output quality is too low or the reviewer is over-editing
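
A sketch of the queue-selection rule from the bullets above: every flagged output goes in, plus a random sample of the unflagged ones at the current review rate. The `flagged` field is an assumption about how the automated checks mark outputs:

```python
import random

def build_review_queue(outputs: list[dict], review_rate: float) -> list[dict]:
    """Queue every flagged output for human review, plus a random
    sample of the unflagged ones at the current review rate."""
    flagged = [o for o in outputs if o.get("flagged")]
    unflagged = [o for o in outputs if not o.get("flagged")]
    sample_size = round(len(unflagged) * review_rate)
    return flagged + random.sample(unflagged, sample_size)
```

At a 25% rate on a batch of 50 unflagged outputs this queues ~13, matching the numbers in the diagram above.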

What the Reviewer Checks

Review checklist by agent type

Email writer agent:

| Check | How to verify | Time | Priority |
| --- | --- | --- | --- |
| Factual accuracy | Does the email reference real facts from the input data? No fabricated signals, company details, or proof points | 5 sec | P0 |
| Tone and voice | Does it sound like a peer, not a bot? Natural, not robotic? | 5 sec | P1 |
| Relevance | Does the personalization connect to the prospect's actual situation? | 5 sec | P1 |
| Word count | Within limit? (Should be auto-checked, but verify) | 2 sec | P0 |
| CTA appropriate | Is the ask reasonable? 15 minutes, not 45? No "book a demo"? | 2 sec | P1 |
| Would you send this? | The ultimate gut check. If you'd be embarrassed to send it, reject it | 3 sec | P0 |

Research agent:

| Check | How to verify | Time |
| --- | --- | --- |
| Company data correct | Quick check: do the company size, funding, and industry match what you can verify? | 10 sec |
| Signals are real | Can you verify the signal with a quick Google/LinkedIn check? | 15 sec |
| Problem hypothesis is reasonable | Does the hypothesis follow from the data, or is it a stretch? | 10 sec |
| No hallucinated companies or people | Are all named companies and people real and correctly referenced? | 10 sec |

Reply classifier agent:

| Check | How to verify | Time |
| --- | --- | --- |
| Classification matches the reply content | Read the reply. Does the classification (positive, negative, OOO, question) match? | 5 sec |
| Recommended action is appropriate | Is the suggested next step reasonable for this classification? | 3 sec |
| Edge cases handled | Are multi-intent replies ("interested but OOO until the 15th") classified correctly? | 5 sec |

Who Does the Review

Reviewer profiles

| Reviewer | Best for | Pros | Cons |
| --- | --- | --- | --- |
| The SDR/AE who sends the email | Email review | Knows the prospect. Can add context. Owns the relationship | Takes time from selling. May rubber-stamp to save time |
| SDR Manager | Email and sequence review | Quality-focused. Can coach from review data | Limited time. Can't review every email at scale |
| Marketing (content/copy person) | Email template and quality review | Strong writing instinct. Catches tone issues | Doesn't know the prospects individually |
| RevOps / dedicated QA person | All agent types | Systematic. Process-oriented. Can review at volume | May not have domain expertise for quality judgment |
| AI (LLM-as-judge) | Pre-screening and quality scoring | Fast, cheap, consistent | 70-85% agreement with humans. Not reliable as the sole reviewer |

Reviewer assignment rules

  • At 100% review: the sender reviews their own output. The SDR checks the AI-generated emails before they go out. This ensures the sender owns the quality and can add last-second personalization
  • At 25-50% review: SDR Manager spot-checks. The manager reviews a sample of outputs across all reps. This catches quality patterns that individual reps miss
  • At 10% review: rotate between SDR Manager and RevOps. Distributed spot-checking prevents reviewer fatigue
  • Never have the person who wrote the prompt be the sole reviewer. They have blind spots. They'll approve outputs that match their expectations, even if the expectations are wrong. Use an independent reviewer

Review Time Budget

How much time does HITL cost?

| Review rate | Outputs per day | Time per output | Daily review time | Monthly review time |
| --- | --- | --- | --- | --- |
| 100% (full review) | 50 emails | 20 seconds | ~17 minutes | ~6 hours |
| 50% (selective) | 25 emails | 20 seconds | ~8 minutes | ~3 hours |
| 25% (selective) | 13 emails | 20 seconds | ~4 minutes | ~1.5 hours |
| 10% (spot-check) | 5 emails | 30 seconds (more careful) | ~2.5 minutes | ~1 hour |
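
The arithmetic behind the table, as a quick sketch assuming roughly 22 working days per month:

```python
def review_time(outputs_per_day: int, seconds_each: int, workdays: int = 22) -> tuple[float, float]:
    """Return (daily minutes, monthly hours) spent on review."""
    daily_minutes = outputs_per_day * seconds_each / 60
    monthly_hours = daily_minutes * workdays / 60
    return daily_minutes, monthly_hours

daily, monthly = review_time(50, 20)   # full review: 50 emails at 20 seconds each
print(f"{daily:.0f} min/day, {monthly:.1f} hr/month")  # ~17 min/day, ~6.1 hr/month
```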

Time budget rules:

  • Full review of 50 emails takes ~17 minutes. This is less than the time saved by not writing 50 emails manually. HITL at 100% is still a net time savings vs manual email writing
  • Target 20-30 seconds per email review. If review consistently takes 60+ seconds, the AI output quality is too low. Fix the prompt, don't budget more review time
  • Schedule review as a time block. "Review AI output" at 8:30am for 15 minutes, not "review throughout the day." Batched review is faster than reviewing one at a time as they arrive

Handling Review Results

What to do with each review decision

| Decision | Action | Data logged |
| --- | --- | --- |
| Approve | Output goes to production as-is | Timestamp, reviewer, "approved" |
| Edit (minor) | Reviewer makes a small fix (typo, awkward phrasing). Output goes to production | Timestamp, reviewer, "edited - minor", what was changed |
| Edit (major) | Reviewer significantly rewrites. Output goes to production | Timestamp, reviewer, "edited - major", what was changed and why |
| Reject | Output discarded. Input sent back for regeneration with feedback | Timestamp, reviewer, "rejected", rejection reason |
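
A sketch of what one logged decision might look like; the field names are illustrative, not a prescribed schema:

```python
from datetime import datetime, timezone

VALID_DECISIONS = {"approved", "edited - minor", "edited - major", "rejected"}

def log_review(output_id: str, reviewer: str, decision: str,
               reason: str | None = None, diff: str | None = None) -> dict:
    """Build one review-log entry matching the table above."""
    assert decision in VALID_DECISIONS
    return {
        "output_id": output_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reviewer": reviewer,
        "decision": decision,
        "reason": reason,   # required for rejects; useful for major edits
        "diff": diff,       # what was changed, for edits
    }
```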

Using review data to improve the agent

| Pattern in review data | What it tells you | Prompt fix |
| --- | --- | --- |
| 15% of outputs edited for the same reason (e.g., "tone too formal") | Systematic prompt issue | Add an instruction to the prompt ("Write in a casual, conversational tone") plus an example of the desired tone |
| 5% of outputs rejected for hallucination | The "don't fabricate" instruction isn't strong enough, or the input data has gaps | Strengthen the anti-hallucination rule ("If a field is missing, say 'Not found' instead of guessing"). Check the input data pipeline for empty fields |
| Reviewers consistently add the same line | There's a missing element the prompt doesn't generate | Add the element to the prompt: "Always include [X] in the output" |
| Review time increasing over time | Output quality is degrading (prompt drift, data quality decline, model update) | Re-run evals. Compare current output to the best outputs from 30 days ago. Find the drift |
| Edit rate below 5% for 4+ weeks | Agent quality is stable. Ready to reduce the review rate | Advance to the next HITL level (e.g., 100% → 50%) |
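
Finding the "same reason" patterns is a small aggregation over the review log. A sketch, assuming the log entries from the earlier `log_review` example:

```python
from collections import Counter

def edit_reason_patterns(log: list[dict], threshold: float = 0.15) -> list[str]:
    """Return edit/reject reasons that account for more than `threshold`
    of all reviewed outputs: these are candidates for a prompt fix."""
    if not log:
        return []
    reasons = Counter(entry["reason"] for entry in log
                      if entry["decision"] != "approved" and entry["reason"])
    return [reason for reason, count in reasons.items()
            if count / len(log) > threshold]
```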

Designing for Review Speed

Make review as fast as possible

| Technique | How it helps | Implementation |
| --- | --- | --- |
| Side-by-side display | Show input data next to output. Reviewer checks accuracy without switching tabs | Build a review UI or use a spreadsheet with input in column A, output in column B |
| Highlight personalized elements | Bold or color the AI-generated personalization so the reviewer can focus on what to check | Template the display to highlight dynamic content |
| Pre-check indicators | Show automated check results (pass/fail for word count, banned phrases) alongside the output | Reviewer skips checks that already passed and focuses on quality and accuracy |
| One-click approve/reject | Approve with one click. Reject with one click plus a dropdown reason | Build into the review tool or use a simple form |
| Batch review mode | Show 10-20 outputs in a scrollable list. Reviewer approves/rejects in sequence | Faster than opening each output individually |

Review speed rules

  • The review interface matters. A reviewer checking emails in a CRM, switching to a LinkedIn tab, switching to a spreadsheet to verify, switching back to approve is slow. Build a side-by-side view where input and output are visible together
  • Pre-pass automated checks. Don't make the human count words or search for banned phrases. That's the automation's job. The human checks tone, accuracy, and relevance. The things only a human can judge
  • One-click approve/reject. If approving an email takes 3 clicks and a confirmation dialog, the reviewer burns 5 seconds per email on interface friction. That's 4 minutes per batch of 50. Streamline

HITL for Different GTM Workflows

Cold email (AI-generated or AI-personalized)

| Phase | Review approach |
| --- | --- |
| First 2 weeks | 100% review. SDR reviews every email before send |
| Weeks 3-4 | 50% review. SDR reviews half, manager spot-checks the rest |
| Weeks 5-8 | 25% spot-check. Random sample plus any flagged outputs |
| Week 8+ | 10% spot-check. Focus on new prompt versions or new ICP segments |

Research briefs (AI-generated account research)

| Phase | Review approach |
| --- | --- |
| First 2 weeks | 100% review. AE or SDR verifies every brief before using it |
| Weeks 3-4 | 50% review. Focus on accuracy (company data, signals) |
| Week 5+ | 25% spot-check. Auto-checks handle format; a human checks accuracy on the sample |

Reply classification (AI-classified inbound replies)

| Phase | Review approach |
| --- | --- |
| First 2 weeks | 100% review. Every classification verified by an SDR |
| Weeks 3-4 | Review low-confidence classifications only (< 80% confidence) |
| Week 5+ | Spot-check 10% of all classifications plus 100% of low-confidence ones |
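
The week 5+ policy reduces to a small routing function. A sketch; the 0.80 confidence threshold comes from the table above, everything else is illustrative:

```python
import random

def needs_review(classification: dict, spot_check_rate: float = 0.10) -> bool:
    """Week 5+ policy: review 100% of low-confidence classifications
    plus a random 10% sample of the rest."""
    if classification["confidence"] < 0.80:
        return True
    return random.random() < spot_check_rate
```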

CRM data updates (AI-enriched or AI-processed)

| Phase | Review approach |
| --- | --- |
| Always | 100% review before write. AI proposes the update; a human approves before the CRM is modified |
| Exception | Bulk enrichment (filling missing fields) can run at 25% review after validation on the first batch |

CRM write rule: Never allow an AI agent to write to CRM without human approval. One wrong update cascading through workflows, automations, and reports causes damage that takes hours to fix. Always review CRM writes.


Measurement

| Metric | Definition | Target | Frequency |
| --- | --- | --- | --- |
| Review rate | % of outputs reviewed by a human | Decreasing over time (100% → 10%) | Weekly |
| Approve rate | % of reviewed outputs approved without edits | > 85% (at the current review level) | Weekly |
| Edit rate (minor) | % of reviewed outputs with minor edits | < 10% | Weekly |
| Edit rate (major) | % of reviewed outputs with major rewrites | < 3% | Weekly |
| Reject rate | % of reviewed outputs rejected entirely | < 2% | Weekly |
| Average review time per output | Seconds per review | 15-30 sec for emails; 60-120 sec for briefs | Monthly |
| Reviewer agreement (if multiple reviewers) | Do two reviewers make the same decision on the same output? | > 85% agreement | Quarterly (calibration) |
| Time saved vs manual creation | Hours saved by AI + review vs fully manual | Track to justify the investment | Monthly |
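
Most of these rates fall directly out of the review log. A sketch, reusing the decision labels from the `log_review` example earlier:

```python
def weekly_metrics(log: list[dict]) -> dict:
    """Compute approve/edit/reject rates from one week of review-log entries."""
    if not log:
        return {}
    total = len(log)
    def share(decision: str) -> float:
        return sum(1 for entry in log if entry["decision"] == decision) / total
    return {
        "approve_rate": share("approved"),
        "edit_rate_minor": share("edited - minor"),
        "edit_rate_major": share("edited - major"),
        "reject_rate": share("rejected"),
    }
```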

Advancement criteria

| Current level | Advance to next level when | Revert to previous level when |
| --- | --- | --- |
| 100% review | Approve rate > 90% for 2 consecutive weeks | N/A (this is the starting point) |
| 50% review | Approve rate > 90% AND reject rate < 2% for 2 weeks | Reject rate > 5% in any week |
| 25% review | Approve rate > 92% AND reject rate < 1% for 4 weeks | Reject rate > 3% in any week |
| 10% spot-check | Approve rate > 95% AND reject rate < 0.5% for 4 weeks | Reject rate > 2% in any week |
| Exception-only | Only reached after 60+ days at 10% with near-zero rejects | Any pattern of rejects triggers a return to 10% |
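
The same table expressed as data, a sketch for wiring into a weekly metrics job; the thresholds are copied from the table, the structure is hypothetical:

```python
# Per level: advancement thresholds (approve/reject/weeks) and the weekly
# reject rate that forces a revert. None means not applicable at that level.
CRITERIA = {
    "100%": {"approve": 0.90, "reject": None,  "weeks": 2, "revert_reject": None},
    "50%":  {"approve": 0.90, "reject": 0.02,  "weeks": 2, "revert_reject": 0.05},
    "25%":  {"approve": 0.92, "reject": 0.01,  "weeks": 4, "revert_reject": 0.03},
    "10%":  {"approve": 0.95, "reject": 0.005, "weeks": 4, "revert_reject": 0.02},
}

def should_revert(level: str, weekly_reject_rate: float) -> bool:
    """Revert one HITL level when the weekly reject rate breaches the limit."""
    limit = CRITERIA[level]["revert_reject"]
    return limit is not None and weekly_reject_rate > limit
```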

Anti-Pattern Check

  • Starting at 0% review. "The agent seems to work. Let's just send." One hallucinated claim to a Tier 1 ABM account. One wrong company name. One fabricated proof point. The damage exceeds the cost of 100% review for a month. Start at 100%. Always
  • Reviewing but not logging results. Reviews happen but nobody tracks approve/edit/reject rates. Without data, you can't measure quality improvement, identify prompt issues, or justify reducing the review rate. Log every decision
  • Reviewer rubber-stamps to save time. The SDR approves 50 emails in 2 minutes without reading any of them. Review fatigue. Fix with rotation (different reviewer each day), batch review UI (fast approval flow), or quality audits (manager spot-checks reviewer decisions)
  • Skipping HITL for "low-risk" output. "It's just a research brief, nobody sees it but us." The research brief informs the cold email. A wrong fact in the brief becomes a hallucinated claim in the email. Review the upstream output, not just the customer-facing output
  • Same review rate for 6 months. If the agent has been at 100% review for 6 months with 95% approve rate, you're over-reviewing. Advance to 50%. The review rate should decrease over time as quality stabilizes
  • No advancement criteria. "We'll reduce review when we feel confident." Confidence is not a metric. Define specific criteria: approve rate, reject rate, duration at current level. Advance based on data
  • AI writes to CRM without review. An enrichment agent updates 500 contact records. 15 have wrong companies. The wrong data cascades into lead scoring, routing, and outbound. 15 wrong records = 15 embarrassing emails. Always review CRM writes
  • One reviewer for everything. One person reviews all AI output across all agents. They become the bottleneck. Rotate reviewers. Train 2-3 people. Distribute the load