Most growth experiments fail at readout because they were designed to ship, not to learn. The fix is a one-page design doc that locks six decisions before any code is written: a falsifiable hypothesis, one primary metric, one or two guardrails, an MDE-driven sample size, a fixed duration, and a pre-registered decision rule. Get those right and the readout becomes a five-minute confirmation. Skip any of them and you will spend the meeting debating noise, defending a peeked p-value, or explaining why the win shipped and retention dropped 4% the next month.

Why do most growth experiments fail at readout?

Most growth experiments fail at readout because the design was optimized for shipping fast, not for producing a credible decision. The result is a meeting where smart people argue about noise.

The failure rates are public and brutal. According to Ronny Kohavi's Trustworthy Online Controlled Experiments (2020), Airbnb sees 92% of experiments fail, Bing 85%, and Microsoft overall around 66%. An Optimizely meta-analysis of 20,000 customer experiments found only 10% had a statistically significant lift on the primary metric.

Failure isn't the problem. Failing without learning is. The recurring patterns:

  • The hypothesis was vague, so no result could falsify it
  • The test was underpowered, so the 'win' was the winner's curse at work -- an inflated effect with a true lift near zero
  • Someone peeked and called it early, pushing the false positive rate from 5% to 26.1% (Evan Miller)
  • No guardrails were defined, so the conversion lift hid a latency or retention regression
  • The decision rule was negotiated after seeing the data

Every one of these is a design problem, not an analysis problem. You cannot fix them at readout.

Growth Experiment Failure Rates at Top Tech Companies

Company / dataset | Experiments that fail
Airbnb | 92%
Bing | 85%
Google (avg) | 85%
Booking.com | 90%
Microsoft (overall) | 66%
Optimizely meta-analysis | 90%

Source: Ronny Kohavi, Trustworthy Online Controlled Experiments (2020); Optimizely meta-analysis of 20,000 tests

What is a growth experiment design checklist?

A growth experiment design checklist is a one-page document you complete before writing code, containing the six decisions that determine whether the experiment will produce a credible result.

The six items, in order:

  1. Hypothesis statement -- falsifiable, with mechanism and expected direction
  2. Primary metric -- one number the test will be judged on
  3. Guardrail metrics -- 1-2 metrics you refuse to let regress
  4. MDE + sample size -- the smallest effect worth detecting and the users required
  5. Duration + segmentation -- how long the test runs and which users are included
  6. Pre-registered decision rule -- what result ships, kills, or iterates

Design Element | Common Failure | Fix
Hypothesis | 'We should try X' | 'Because [evidence], if we [change], then [metric] will move by [MDE] for [segment]'
Primary metric | Picked after readout | Picked, written, signed off before launch
Guardrails | None defined | 1-2 with explicit regression thresholds
MDE | Set to make math easy | Set to smallest lift worth shipping
Sample size | 'Run it a week' | Calculated via Evan Miller or Optimizely
Decision rule | Vibes at readout | Pre-registered in the design doc

If your team uses a prioritization scoring framework like ICE or RICE to pick experiments, the checklist comes after scoring, before tickets.

What should a growth experiment hypothesis include?

A growth experiment hypothesis is a falsifiable prediction stating that a specific change will move a specific metric by a specific amount for a specific segment, grounded in specific evidence. 'We should try a new headline' is not a hypothesis. It is a wish.

The template that survives readout:

Because [evidence -- session recordings, funnel analytics, qualitative data], if we [specific change to one variable], then [primary metric] will [direction] by at least [MDE] for [segment], because [mechanism].

A real example:

Because session recordings show 38% of mobile users abandon checkout at the address step, if we replace the address form with a Google Places autocomplete, then mobile checkout completion will increase by at least 4% (relative) for new users on iOS and Android, because reducing keystrokes lowers form abandonment.

This hypothesis is falsifiable: a result of -2% kills it. It has a mechanism: fewer keystrokes equals less abandonment. It has a target effect size: 4% relative, which drives the sample size calculation. It has a segment: mobile new users on iOS and Android.

Compare to the version that gets rejected at readout: 'Test if Google Places autocomplete improves checkout.' Improves what, by how much, for whom, why? You cannot calculate sample size for that. You cannot pre-register a decision rule for that. You will end up arguing about it in three weeks.

How do you pick a primary metric and guardrail metrics?

Pick one primary metric the test will be judged on. Add 1-2 guardrail metrics you refuse to let regress. That's it. More primaries means more multiple comparisons, more false positives, and more arguing about which win counted.

The primary metric should be:

  • Closest to the change (don't measure revenue if you changed a button color -- measure click-through)
  • Sensitive enough to move within the test window (long-cycle metrics like 90-day retention rarely fit a 14-day test)
  • Aligned with the Overall Evaluation Criterion (OEC) Kohavi describes -- the north-star metric the org actually optimizes

Guardrail metrics are the second half. According to PostHog's and Mixpanel's guides on guardrail metrics, they catch the unintended harm a primary-only readout misses.

Standard guardrails by experiment type:

Experiment type | Primary metric | Guardrails
Pricing page test | Trial sign-up rate | Revenue per visitor, refund rate
Onboarding flow | Activation rate | D7 retention, support ticket volume
Performance / infra | Page latency p95 | Error rate, conversion rate
Email subject line | Open rate | Unsubscribe rate, reply rate
Checkout UX | Conversion rate | Average order value, fraud rate

Write the threshold for each guardrail in the design doc. 'Latency p95 must not regress more than 50ms.' 'Refund rate must not increase more than 0.3 percentage points.' If a guardrail breaches, the experiment does not ship even if the primary won.
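
Those thresholds belong somewhere executable, not just in prose. A minimal sketch in Python -- the Guardrail class is illustrative, not any platform's API, and it reuses the two example thresholds above:

from dataclasses import dataclass

@dataclass
class Guardrail:
    metric: str
    max_regression: float  # largest regression tolerated, in the metric's own units

    def breached(self, observed_regression: float) -> bool:
        # observed_regression is positive when the metric got worse
        return observed_regression > self.max_regression

guardrails = [
    Guardrail("latency_p95_ms", max_regression=50.0),   # 'must not regress more than 50ms'
    Guardrail("refund_rate_pp", max_regression=0.3),    # 'must not increase more than 0.3 pp'
]

# At readout (hypothetical numbers): latency regressed 35ms, refund rate held flat
observed = {"latency_p95_ms": 35.0, "refund_rate_pp": 0.0}
breaches = [g.metric for g in guardrails if g.breached(observed[g.metric])]
print(breaches or "all guardrails within threshold")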

How do you calculate sample size for an A/B test?

Calculate sample size with a power analysis using four inputs: baseline conversion rate, minimum detectable effect (MDE), significance level (alpha, typically 0.05), and statistical power (typically 0.80). Plug them into Evan Miller's sample size calculator, Optimizely's, or your platform's built-in tool.

The four inputs:

  1. Baseline conversion rate -- the current rate of the primary metric (e.g., 4.2% checkout conversion)
  2. MDE -- the smallest relative or absolute lift you care about detecting (e.g., +5% relative)
  3. Alpha -- the false positive rate you'll accept, almost always 0.05
  4. Power -- the probability of detecting a true effect, almost always 0.80

A worked example: baseline 4%, MDE +10% relative (so detecting 4% vs 4.4%), alpha 0.05, power 0.80. The calculator returns roughly 39,500 users per variant for a two-sided test.
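
If you would rather script the power analysis than use a web calculator, statsmodels gives essentially the same answer. A minimal sketch of the worked example above (statsmodels' proportion_effectsize and NormalIndPower do the work; the numbers are the example's):

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                        # current checkout conversion
target = baseline * 1.10               # +10% relative MDE -> 4.4%

effect = proportion_effectsize(target, baseline)   # Cohen's h for two proportions
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_variant))            # roughly 39,000-40,000; web calculators differ slightly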

The MDE choice is where teams cheat. According to DRIP's guide on MDE:

  • Most ecommerce stores: 2-5% relative MDE is realistic
  • High-traffic sites (1M+ monthly visitors): 1-2%
  • Low-traffic sites: 5-10%, accept it or batch tests

Statsig's guidance is blunter: a 15-20% relative MDE is typical for conversion rates, but if you set MDE high just to make sample size convenient, you'll declare 'no effect' on every test that produces a real but small lift.

If the calculator says you need 80,000 users per variant and you only have 8,000, the answer is not 'run it anyway.' The answer is: pick a higher-funnel metric, batch the test, or run it longer. See our list of growth experiment platforms for tools with built-in calculators.
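
The trade-off is brutal because the required sample grows roughly with the inverse square of the MDE: halve the MDE and you need roughly four times the users. A quick sweep using the same statsmodels calls as above (4% baseline assumed):

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04
for mde in (0.02, 0.05, 0.10, 0.20):     # relative MDEs
    effect = proportion_effectsize(baseline * (1 + mde), baseline)
    n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                     power=0.80, alternative="two-sided")
    print(f"{mde:.0%} MDE -> ~{round(n):,} users per variant")

At a 4% baseline this runs from roughly a million users per variant at a 2% MDE down to roughly ten thousand at 20% -- which is exactly why underpowered teams are tempted to quietly inflate the MDE.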

How long should a growth experiment run, and how should you segment it?

Run for the duration your sample size calculation requires, with a minimum of one full business cycle (7-14 days) to capture day-of-week and weekend effects. Lock the duration in the design doc. Do not stop early because results 'look significant.'

Duration rules of thumb:

  • Minimum: 7 days, even if sample size is hit on day 3. Tuesday users behave differently from Saturday users.
  • Standard: 14 days for most B2C funnels, capturing two weekends
  • B2B / enterprise: 21-28 days minimum, since buying cycles are longer and weekly seasonality is stronger
  • Maximum: 4-6 weeks. Past that, cookie churn and novelty effects corrupt the data
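
Turning the sample size into a calendar duration is division plus the floors above. A small sketch (the daily traffic figure is hypothetical):

import math

required_per_variant = 39_500        # from the power analysis
variants = 2
eligible_users_per_day = 9_000       # hypothetical daily traffic entering the test

days = math.ceil(required_per_variant * variants / eligible_users_per_day)
days = max(days, 7)                  # never below one full business cycle
days = math.ceil(days / 7) * 7       # round up to whole weeks to balance weekday/weekend mix
print(days)                          # -> 14 for this example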

Segmentation decisions also belong in the design doc, not the readout. Pre-register:

  • Which segments are in the test (e.g., new users only, mobile only, US-only)
  • Which segment cuts you'll analyze (device, channel, country, plan tier)
  • That segment cuts are exploratory, not confirmatory -- you cannot declare a win on a subgroup if the overall test lost

The trap: running across all users, then 'discovering' the test won for one segment. With 10 segment cuts and alpha 0.05, the probability that at least one segment shows a false positive is 40%. That is data dredging, and a sharp readout audience will catch it. Pre-register your segments, or treat any subgroup finding as a hypothesis for the next experiment.
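
That 40% is just the family-wise error rate, and the arithmetic fits in three lines -- along with Bonferroni, the blunt correction you would need if you insisted on treating segment cuts as confirmatory:

alpha, cuts = 0.05, 10
print(f"{1 - (1 - alpha) ** cuts:.0%}")   # -> 40%: chance at least one cut is a false positive
print(alpha / cuts)                       # -> 0.005: Bonferroni-adjusted threshold per cut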

How do you pre-register your decision criteria?

Pre-registering decision criteria means writing down -- before the test launches -- exactly what result will cause you to ship, kill, or iterate. This is the 'what would change my mind' commitment. According to the Center for Open Science's preregistration guidance, a clear pre-registration specifies the analysis, the comparison, and how results map to corroborating or falsifying the prediction.

The decision rule template:

Ship if: primary metric > +[MDE]% with statistical significance (p < 0.05), AND no guardrail regresses more than its threshold.

Kill if: primary metric <= 0% (flat or negative), OR any guardrail breaches its threshold.

Iterate if: primary metric is directionally positive but below MDE or not statistically significant -- redesign and retest, do not ship.
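
The rule is mechanical enough to write as code, which is a decent test of whether it is actually pre-registered. A minimal sketch mirroring the template (the function and its arguments are illustrative):

def decide(relative_lift: float, p_value: float, mde: float,
           guardrail_breaches: list[str]) -> str:
    # Map readout numbers to the pre-registered decision.
    if guardrail_breaches or relative_lift <= 0:
        return "KILL"
    if relative_lift >= mde and p_value < 0.05:
        return "SHIP"
    return "ITERATE"   # directionally positive but below MDE or not significant

print(decide(relative_lift=0.05, p_value=0.03, mde=0.04, guardrail_breaches=[]))   # -> SHIP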

This matters because of the peeking problem. From Evan Miller's How Not To Run an A/B Test:

  • Plan to check once at the end: false positive rate is the advertised 5%
  • Check 10 times during the test: false positive rate jumps to 16%
  • Check after every new batch of data and stop at p<0.05: false positive rate hits 26.1% -- one in four 'wins' is noise

If your team needs to peek (and most do, for monitoring), use sequential testing methods like Bayesian sequential analysis or alpha-spending functions. These are designed to allow legal peeks. Vanilla fixed-horizon tests are not.
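
You can verify the inflation yourself with an A/A simulation: give both variants the same true conversion rate, peek after every batch, and count how often 'significance' shows up anyway. Every hit is a false positive by construction. A sketch -- batch size and peek count are illustrative, and the exact rate depends on how often you peek:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeking_false_positive_rate(sims=2000, peeks=20, batch=500, p=0.04, alpha=0.05):
    hits = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            conv_a += rng.binomial(batch, p)     # control conversions
            conv_b += rng.binomial(batch, p)     # variant conversions, same true rate
            n += batch
            pooled = (conv_a + conv_b) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * 2 / n)
            if se == 0:
                continue
            z = (conv_b / n - conv_a / n) / se
            if 2 * norm.sf(abs(z)) < alpha:      # peek: stop the moment p < 0.05
                hits += 1
                break
    return hits / sims

print(peeking_false_positive_rate())   # well above the nominal 5%; with 20 peeks expect roughly 20-25%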

Write the decision rule in the same doc as the hypothesis. Get the PM, eng lead, and analyst to sign it. When the readout happens, the conversation is 'did we hit the rule?' not 'how do we frame this?'

False Positive Rate When Peeking at A/B Tests

Peeking behavior | False positive rate
Check once (planned) | 5%
Check 10 times | 16%
Check after every batch (stop at p<.05) | 26.1%

Source: Evan Miller, How Not To Run an A/B Test

What are 3 readout horror stories you can avoid?

Three patterns that detonate readouts, illustrated with numbers anchored in the public benchmarks cited above.

Horror story 1: The underpowered winner. A growth team runs a CTA color test for 5 days. Sample size: 1,200 per variant. Result: variant +27% conversion, p=0.04. They ship. Three months later, the metric has not budged. According to research summarized by Towards Data Science, in underpowered tests the winner's curse inflates effects: the only way a small true effect crosses p<0.05 in a small sample is if random noise pushes it far above its true value. The 27% lift was probably a 1-2% true lift, plus 25% noise.

Horror story 2: The peeker. A team plans a 14-day pricing test. They check on day 3 (looks flat), day 5 (variant -3%), day 7 (variant +6%, p=0.048!). They stop the test and ship. Per Evan Miller, peeking after every batch and stopping at significance pushes the false positive rate to 26.1%. Six weeks later, an analyst re-runs the data through the full planned duration: variant ends at +0.4%, not significant. The team had shipped noise.

Horror story 3: No guardrails. An onboarding redesign hits +12% activation, p<0.001. The team ships. Two months later, D30 retention is down 4% for the activated cohort -- the new flow activated lower-intent users who churned faster. Net revenue impact: negative. If retention had been a pre-registered guardrail with a -2% threshold, the experiment would have flagged for review instead of auto-shipping.

The pattern across all three: design errors masquerading as analysis errors. Each one is preventable in the design doc, free. Each one is unfixable at readout, expensive.

What does a complete growth experiment design doc look like?

A complete growth experiment design doc fits on one page and contains every decision the readout will hinge on. Use this template:

EXPERIMENT NAME: [short, searchable]
OWNER: [name]
STATUS: Design Review | Live | Readout | Shipped/Killed

1. HYPOTHESIS
Because [evidence], if we [change], then [metric]
will [direction] by at least [MDE] for [segment],
because [mechanism].

2. PRIMARY METRIC
[Single metric, definition, current baseline]

3. GUARDRAILS (1-2)
- [Metric]: must not regress more than [threshold]
- [Metric]: must not regress more than [threshold]

4. SAMPLE SIZE
- Baseline: [X%]
- MDE: [Y% relative]
- Alpha: 0.05  Power: 0.80
- Required: [N users per variant]
- Calculator used: [Evan Miller / Optimizely / internal]

5. DURATION & SEGMENTATION
- Start: [date]  End: [date, locked]
- Included: [segments]
- Excluded: [segments]
- Pre-registered cuts: [list]

6. DECISION RULE
- SHIP IF: primary > +[MDE]%, p<0.05, all guardrails within threshold
- KILL IF: primary <= 0% OR any guardrail breaches
- ITERATE IF: directionally positive but below MDE or n.s.

7. SIGN-OFF
PM: [ ]  Eng: [ ]  Analyst: [ ]  Date: [ ]

If this doc isn't completed and signed, the experiment doesn't launch. That single rule, applied consistently, eliminates 80% of readout debates. It is the cheapest piece of process you will ever install on a growth team.

Design Element | What It Is | Common Failure Mode | Fix
Hypothesis | Falsifiable prediction with mechanism + expected effect size | Vague 'we should try X' framing | Use 'Because [evidence], if we [change], then [metric] will [direction] by [size] for [segment]'
Primary metric | The single number the test will be judged on | Multiple primaries, picked after seeing data | Pick one. Write it in the doc before launch.
Guardrails | Metrics you refuse to let regress (latency, churn, support tickets) | No guardrails -- shipped wins quietly hurt retention | 1-2 guardrails with explicit thresholds
MDE | Smallest effect the test can reliably detect | Set MDE to whatever makes sample size convenient | Anchor MDE to the smallest lift worth shipping
Sample size | Users per variant required for 80% power at chosen MDE | Stop when 'it looks significant' | Calculate up front with Evan Miller's calculator. Lock duration.
Decision rule | What result ships, kills, or extends the test | Vibes-based readout debate | Pre-register: 'Ship if X. Kill if Y. Iterate if Z.'