Most growth experiments fail at readout because they were designed to ship, not to learn. The fix is a one-page design doc that locks six decisions before any code is written: a falsifiable hypothesis, one primary metric, one or two guardrails, an MDE-driven sample size, a fixed duration, and a pre-registered decision rule. Get those right and the readout becomes a five-minute confirmation. Skip any of them and you will spend the meeting debating noise, defending a peeked p-value, or explaining why the win shipped and retention dropped 4% the next month.

Why do most growth experiments fail at readout?

Most growth experiments fail at readout because the design was optimized for shipping fast, not for producing a credible decision. The result is a meeting where smart people argue about noise.

The failure rates are public and brutal. According to Ronny Kohavi's Trustworthy Online Controlled Experiments (2020), Airbnb sees 92% of experiments fail, Bing 85%, and Microsoft overall around 66%. An Optimizely meta-analysis of 20,000 customer experiments found only 10% had a statistically significant lift on the primary metric.

Failure isn't the problem. Failing without learning is. The recurring patterns:

  • The hypothesis was vague, so no result could falsify it
  • The test was underpowered, so the 'win' was the winner's curse at work -- an inflated effect with a true lift near zero
  • Someone peeked and called it early, pushing the false positive rate from 5% to 26.1% (Evan Miller)
  • No guardrails were defined, so the conversion lift hid a latency or retention regression
  • The decision rule was negotiated after seeing the data

Every one of these is a design problem, not an analysis problem. You cannot fix them at readout.

Growth Experiment Failure Rates at Top Tech Companies

Company / dataset | Experiments that fail
Airbnb | 92%
Bing | 85%
Google (avg) | 85%
Booking.com | 90%
Microsoft (overall) | 66%
Optimizely meta-analysis | 90%

Source: Ronny Kohavi, Trustworthy Online Controlled Experiments (2020); Optimizely meta-analysis of 20,000 tests

What is a growth experiment design checklist?

A growth experiment design checklist is a one-page document you complete before writing code, containing the six decisions that determine whether the experiment will produce a credible result.

The six items, in order:

  1. Hypothesis statement -- falsifiable, with mechanism and expected direction
  2. Primary metric -- one number the test will be judged on
  3. Guardrail metrics -- 1-2 metrics you refuse to let regress
  4. MDE + sample size -- the smallest effect worth detecting and the users required
  5. Duration + segmentation -- how long the test runs and which users are included
  6. Pre-registered decision rule -- what result ships, kills, or iterates

Design Element | Common Failure | Fix
Hypothesis | 'We should try X' | 'Because [evidence], if we [change], then [metric] will move by [MDE] for [segment]'
Primary metric | Picked after readout | Picked, written, signed off before launch
Guardrails | None defined | 1-2 with explicit regression thresholds
MDE | Set to make math easy | Set to smallest lift worth shipping
Sample size | 'Run it a week' | Calculated via Evan Miller or Optimizely
Decision rule | Vibes at readout | Pre-registered in the design doc

If your team uses a prioritization scoring framework like ICE or RICE to pick experiments, the checklist comes after scoring, before tickets.

What should a growth experiment hypothesis include?

A growth experiment hypothesis is a falsifiable prediction stating that a specific change will move a specific metric by a specific amount for a specific segment, grounded in specific evidence. 'We should try a new headline' is not a hypothesis. It is a wish.

The template that survives readout:

Because [evidence -- session recordings, funnel analytics, qualitative data], if we [specific change to one variable], then [primary metric] will [direction] by at least [MDE] for [segment], because [mechanism].

A real example:

Because session recordings show 38% of mobile users abandon checkout at the address step, if we replace the address form with a Google Places autocomplete, then mobile checkout completion will increase by at least 4% (relative) for new users on iOS and Android, because reducing keystrokes lowers form abandonment.

This hypothesis is falsifiable: a result of -2% kills it. It has a mechanism: fewer keystrokes equals less abandonment. It has a target effect size: 4% relative, which drives the sample size calculation. It has a segment: mobile new users on iOS and Android.

Compare to the version that gets rejected at readout: 'Test if Google Places autocomplete improves checkout.' Improves what, by how much, for whom, why? You cannot calculate sample size for that. You cannot pre-register a decision rule for that. You will end up arguing about it in three weeks.

How do you pick a primary metric and guardrail metrics?

Pick one primary metric the test will be judged on. Add 1-2 guardrail metrics you refuse to let regress. That's it. More primaries means more multiple comparisons, more false positives, and more arguing about which win counted.

The primary metric should be:

  • Closest to the change (don't measure revenue if you changed a button color -- measure click-through)
  • Sensitive enough to move within the test window (long-cycle metrics like 90-day retention rarely fit a 14-day test)
  • Aligned with the Overall Evaluation Criterion (OEC) Kohavi describes -- the north-star metric the org actually optimizes

Guardrail metrics are the second half. According to PostHog's and Mixpanel's guides on guardrail metrics, they catch the unintended harm a primary-only readout misses.

Standard guardrails by experiment type:

Experiment type | Primary metric | Guardrails
Pricing page test | Trial sign-up rate | Revenue per visitor, refund rate
Onboarding flow | Activation rate | D7 retention, support ticket volume
Performance / infra | Page latency p95 | Error rate, conversion rate
Email subject line | Open rate | Unsubscribe rate, reply rate
Checkout UX | Conversion rate | Average order value, fraud rate

Write the threshold for each guardrail in the design doc. 'Latency p95 must not regress more than 50ms.' 'Refund rate must not increase more than 0.3 percentage points.' If a guardrail breaches, the experiment does not ship even if the primary won.
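
Those thresholds belong somewhere executable, not just in prose. A minimal sketch in Python -- the Guardrail class is illustrative, not any platform's API, and it reuses the two example thresholds above:

from dataclasses import dataclass

@dataclass
class Guardrail:
    metric: str
    max_regression: float  # largest regression tolerated, in the metric's own units

    def breached(self, observed_regression: float) -> bool:
        # observed_regression is positive when the metric got worse
        return observed_regression > self.max_regression

guardrails = [
    Guardrail("latency_p95_ms", max_regression=50.0),   # 'must not regress more than 50ms'
    Guardrail("refund_rate_pp", max_regression=0.3),    # 'must not increase more than 0.3 pp'
]

# At readout (hypothetical numbers): latency regressed 35ms, refund rate held flat
observed = {"latency_p95_ms": 35.0, "refund_rate_pp": 0.0}
breaches = [g.metric for g in guardrails if g.breached(observed[g.metric])]
print(breaches or "all guardrails within threshold")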

How do you calculate sample size for an A/B test?

Calculate sample size with a power analysis using four inputs: baseline conversion rate, minimum detectable effect (MDE), significance level (alpha, typically 0.05), and statistical power (typically 0.80). Plug them into Evan Miller's sample size calculator, Optimizely's, or your platform's built-in tool.

The four inputs:

  1. Baseline conversion rate -- the current rate of the primary metric (e.g., 4.2% checkout conversion)
  2. MDE -- the smallest relative or absolute lift you care about detecting (e.g., +5% relative)
  3. Alpha -- the false positive rate you'll accept, almost always 0.05
  4. Power -- the probability of detecting a true effect, almost always 0.80

A worked example: baseline 4%, MDE +10% relative (so detecting 4% vs 4.4%), alpha 0.05, power 0.80. The calculator returns roughly 39,500 users per variant for a two-sided test.
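
If you would rather script the power analysis than use a web calculator, statsmodels gives essentially the same answer. A minimal sketch of the worked example above (statsmodels' proportion_effectsize and NormalIndPower do the work; the numbers are the example's):

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                        # current checkout conversion
target = baseline * 1.10               # +10% relative MDE -> 4.4%

effect = proportion_effectsize(target, baseline)   # Cohen's h for two proportions
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_variant))            # roughly 39,000-40,000; web calculators differ slightly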

The MDE choice is where teams cheat. According to DRIP's guide on MDE:

  • Most ecommerce stores: 2-5% relative MDE is realistic
  • High-traffic sites (1M+ monthly visitors): 1-2%
  • Low-traffic sites: 5-10%, accept it or batch tests

Statsig's guidance is blunter: a 15-20% relative MDE is typical for conversion rates, but if you set MDE high just to make sample size convenient, you'll declare 'no effect' on every test that produces a real but small lift.

If the calculator says you need 80,000 users per variant and you only have 8,000, the answer is not 'run it anyway.' The answer is: pick a higher-funnel metric, batch the test, or run it longer. See our list of growth experiment platforms for tools with built-in calculators.
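
The trade-off is brutal because the required sample grows roughly with the inverse square of the MDE: halve the MDE and you need roughly four times the users. A quick sweep using the same statsmodels calls as above (4% baseline assumed):

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04
for mde in (0.02, 0.05, 0.10, 0.20):     # relative MDEs
    effect = proportion_effectsize(baseline * (1 + mde), baseline)
    n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                     power=0.80, alternative="two-sided")
    print(f"{mde:.0%} MDE -> ~{round(n):,} users per variant")

At a 4% baseline this runs from roughly a million users per variant at a 2% MDE down to roughly ten thousand at 20% -- which is exactly why underpowered teams are tempted to quietly inflate the MDE.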

How long should a growth experiment run, and how should you segment it?

Run for the duration your sample size calculation requires, with a minimum of one full business cycle (7-14 days) to capture day-of-week and weekend effects. Lock the duration in the design doc. Do not stop early because results 'look significant.'

Duration rules of thumb:

  • Minimum: 7 days, even if sample size is hit on day 3. Tuesday users behave differently from Saturday users.
  • Standard: 14 days for most B2C funnels, capturing two weekends
  • B2B / enterprise: 21-28 days minimum, since buying cycles are longer and weekly seasonality is stronger
  • Maximum: 4-6 weeks. Past that, cookie churn and novelty effects corrupt the data
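
Turning the sample size into a calendar duration is division plus the floors above. A small sketch (the daily traffic figure is hypothetical):

import math

required_per_variant = 39_500        # from the power analysis
variants = 2
eligible_users_per_day = 9_000       # hypothetical daily traffic entering the test

days = math.ceil(required_per_variant * variants / eligible_users_per_day)
days = max(days, 7)                  # never below one full business cycle
days = math.ceil(days / 7) * 7       # round up to whole weeks to balance weekday/weekend mix
print(days)                          # -> 14 for this example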

Segmentation decisions also belong in the design doc, not the readout. Pre-register:

  • Which segments are in the test (e.g., new users only, mobile only, US-only)
  • Which segment cuts you'll analyze (device, channel, country, plan tier)
  • That segment cuts are exploratory, not confirmatory -- you cannot declare a win on a subgroup if the overall test lost

The trap: running across all users, then 'discovering' the test won for one segment. With 10 segment cuts and alpha 0.05, the probability that at least one segment shows a false positive is 40%. That is data dredging, and a sharp readout audience will catch it. Pre-register your segments, or treat any subgroup finding as a hypothesis for the next experiment.
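
That 40% is just the family-wise error rate, and the arithmetic fits in three lines -- along with Bonferroni, the blunt correction you would need if you insisted on treating segment cuts as confirmatory:

alpha, cuts = 0.05, 10
print(f"{1 - (1 - alpha) ** cuts:.0%}")   # -> 40%: chance at least one cut is a false positive
print(alpha / cuts)                       # -> 0.005: Bonferroni-adjusted threshold per cut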

How do you pre-register your decision criteria?

Pre-registering decision criteria means writing down -- before the test launches -- exactly what result will cause you to ship, kill, or iterate. This is the 'what would change my mind' commitment. According to the Center for Open Science's preregistration guidance, a clear pre-registration specifies the analysis, the comparison, and how results map to corroborating or falsifying the prediction.

The decision rule template:

Ship if: primary metric > +[MDE]% with statistical significance (p < 0.05), AND no guardrail regresses more than its threshold.

Kill if: primary metric <= 0% (flat or negative), OR any guardrail breaches its threshold.

Iterate if: primary metric is directionally positive but below MDE or not statistically significant -- redesign and retest, do not ship.
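
The rule is mechanical enough to write as code, which is a decent test of whether it is actually pre-registered. A minimal sketch mirroring the template (the function and its arguments are illustrative):

def decide(relative_lift: float, p_value: float, mde: float,
           guardrail_breaches: list[str]) -> str:
    # Map readout numbers to the pre-registered decision.
    if guardrail_breaches or relative_lift <= 0:
        return "KILL"
    if relative_lift >= mde and p_value < 0.05:
        return "SHIP"
    return "ITERATE"   # directionally positive but below MDE or not significant

print(decide(relative_lift=0.05, p_value=0.03, mde=0.04, guardrail_breaches=[]))   # -> SHIP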

This matters because of the peeking problem. From Evan Miller's How Not To Run an A/B Test:

  • Plan to check once at the end: false positive rate is the advertised 5%
  • Check 10 times during the test: false positive rate jumps to 16%
  • Check after every new batch of data and stop at p<0.05: false positive rate hits 26.1% -- one in four 'wins' is noise

If your team needs to peek (and most do, for monitoring), use sequential testing methods like Bayesian sequential analysis or alpha-spending functions. These are designed to allow legal peeks. Vanilla fixed-horizon tests are not.
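
You can verify the inflation yourself with an A/A simulation: give both variants the same true conversion rate, peek after every batch, and count how often 'significance' shows up anyway. Every hit is a false positive by construction. A sketch -- batch size and peek count are illustrative, and the exact rate depends on how often you peek:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeking_false_positive_rate(sims=2000, peeks=20, batch=500, p=0.04, alpha=0.05):
    hits = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            conv_a += rng.binomial(batch, p)     # control conversions
            conv_b += rng.binomial(batch, p)     # variant conversions, same true rate
            n += batch
            pooled = (conv_a + conv_b) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * 2 / n)
            if se == 0:
                continue
            z = (conv_b / n - conv_a / n) / se
            if 2 * norm.sf(abs(z)) < alpha:      # peek: stop the moment p < 0.05
                hits += 1
                break
    return hits / sims

print(peeking_false_positive_rate())   # well above the nominal 5%; with 20 peeks expect roughly 20-25%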

Write the decision rule in the same doc as the hypothesis. Get the PM, eng lead, and analyst to sign it. When the readout happens, the conversation is 'did we hit the rule?' not 'how do we frame this?'

False Positive Rate When Peeking at A/B Tests

Peeking behavior | False positive rate
Check once (planned) | 5%
Check 10 times | 16%
Check after every batch (stop at p<.05) | 26.1%

Source: Evan Miller, How Not To Run an A/B Test

What are 3 readout horror stories you can avoid?

Three patterns that detonate readouts, illustrated with numbers anchored in the public benchmarks cited above.

Horror story 1: The underpowered winner. A growth team runs a CTA color test for 5 days. Sample size: 1,200 per variant. Result: variant +27% conversion, p=0.04. They ship. Three months later, the metric has not budged. According to research summarized by Towards Data Science, in underpowered tests the winner's curse inflates effects: the only way a small true effect crosses p<0.05 in a small sample is if random noise pushes it far above its true value. The 27% lift was probably a 1-2% true lift, plus 25% noise.

Horror story 2: The peeker. A team plans a 14-day pricing test. They check on day 3 (looks flat), day 5 (variant -3%), day 7 (variant +6%, p=0.048!). They stop the test and ship. Per Evan Miller, peeking after every batch and stopping at significance pushes the false positive rate to 26.1%. Six weeks later, an analyst re-runs the data through the full planned duration: variant ends at +0.4%, not significant. The team had shipped noise.

Horror story 3: No guardrails. An onboarding redesign hits +12% activation, p<0.001. The team ships. Two months later, D30 retention is down 4% for the activated cohort -- the new flow activated lower-intent users who churned faster. Net revenue impact: negative. If retention had been a pre-registered guardrail with a -2% threshold, the experiment would have flagged for review instead of auto-shipping.

The pattern across all three: design errors masquerading as analysis errors. Each one is preventable in the design doc, free. Each one is unfixable at readout, expensive.

What does a complete growth experiment design doc look like?

A complete growth experiment design doc fits on one page and contains every decision the readout will hinge on. Use this template:

EXPERIMENT NAME: [short, searchable]
OWNER: [name]
STATUS: Design Review | Live | Readout | Shipped/Killed

1. HYPOTHESIS
Because [evidence], if we [change], then [metric]
will [direction] by at least [MDE] for [segment],
because [mechanism].

2. PRIMARY METRIC
[Single metric, definition, current baseline]

3. GUARDRAILS (1-2)
- [Metric]: must not regress more than [threshold]
- [Metric]: must not regress more than [threshold]

4. SAMPLE SIZE
- Baseline: [X%]
- MDE: [Y% relative]
- Alpha: 0.05  Power: 0.80
- Required: [N users per variant]
- Calculator used: [Evan Miller / Optimizely / internal]

5. DURATION & SEGMENTATION
- Start: [date]  End: [date, locked]
- Included: [segments]
- Excluded: [segments]
- Pre-registered cuts: [list]

6. DECISION RULE
- SHIP IF: primary > +[MDE]%, p<0.05, all guardrails within threshold
- KILL IF: primary <= 0% OR any guardrail breaches
- ITERATE IF: directionally positive but below MDE or n.s.

7. SIGN-OFF
PM: [ ]  Eng: [ ]  Analyst: [ ]  Date: [ ]

If this doc isn't completed and signed, the experiment doesn't launch. That single rule, applied consistently, eliminates 80% of readout debates. It is the cheapest piece of process you will ever install on a growth team.

Design Element | What It Is | Common Failure Mode | Fix
Hypothesis | Falsifiable prediction with mechanism + expected effect size | Vague 'we should try X' framing | Use 'Because [evidence], if we [change], then [metric] will [direction] by [size] for [segment]'
Primary metric | The single number the test will be judged on | Multiple primaries, picked after seeing data | Pick one. Write it in the doc before launch.
Guardrails | Metrics you refuse to let regress (latency, churn, support tickets) | No guardrails -- shipped wins quietly hurt retention | 1-2 guardrails with explicit thresholds
MDE | Smallest effect the test can reliably detect | Set MDE to whatever makes sample size convenient | Anchor MDE to the smallest lift worth shipping
Sample size | Users per variant required for 80% power at chosen MDE | Stop when 'it looks significant' | Calculate up front with Evan Miller's calculator. Lock duration.
Decision rule | What result ships, kills, or extends the test | Vibes-based readout debate | Pre-register: 'Ship if X. Kill if Y. Iterate if Z.'