A growth experimentation framework is the repeatable system a team uses to turn raw ideas into instrumented, shipped tests on a predictable weekly cadence. A real framework has five working parts: (1) a hypothesis template, (2) a prioritization rubric like ICE or RICE, (3) an experiment doc that travels from kickoff to readout, (4) a weekly experiment review, and (5) written kill criteria. Most teams have a Notion list of ideas and call it a framework. It is not. This guide walks the operational loop end-to-end, with templates and example docs.
What is a growth experimentation framework?
A growth experimentation framework is a closed-loop operating system that takes an idea, scores it, ships it as a controlled test, reads out the result, and feeds the learning back into the next experiment. It is the difference between a team that runs five tests a quarter and one that runs fifty.
The framework has five mandatory components:
- A hypothesis template. Standardized format every idea must fit into before it can be queued.
- A scoring rubric. ICE, RICE, or PIE -- pick one, apply it consistently.
- An experiment doc. A single living document per test, owned by one DRI, covering hypothesis, design, instrumentation, results, and decision.
- A weekly experiment review meeting. 30 minutes, same time, same agenda.
- Pre-registered kill criteria. A date, a guardrail, and a minimum sample size, written before launch.
If any of those five is missing, you do not have a framework. You have a Trello board with good intentions. According to Reforge, implementing exactly these components in its growth experiment management system tripled testing velocity and doubled cross-functional idea contribution.
Why do most growth teams have a backlog instead of a framework?
Most teams collect ideas faster than they ship tests, so the backlog grows while velocity stays flat. The root cause is almost always the same: the team has decided what to test but never decided how to test, when to read out, or who owns the call to ship, scale, or kill.
The symptoms are predictable:
- A 200-row Notion database where 80% of cards are stuck in 'Idea' or 'Prioritized' status.
- Tests that launch but never close (no readout, no decision, no learning logged).
- Experiments scored once during planning and never re-scored as new data lands.
- Recurring debates about 'should we kill this' three weeks past the original end date.
- A 'growth meeting' that is actually a status update, not a decision forum.
Booking.com runs roughly 1,000 concurrent experiments and ships test variants across 75 countries in under an hour. They do not get there with a better backlog. They get there because every experiment passes through the same five-component loop, every time, with no exceptions for the HiPPO (highest-paid person's opinion).
What are the 7 steps of a growth experimentation cycle?
A complete growth experimentation cycle has seven steps, each with a named owner and a definition of done. Skip a step and the loop breaks: you either launch tests you cannot read, or read tests you cannot act on. A minimal code sketch of the loop follows the list.
- Idea capture. Anyone in the company can drop an idea into the inbox. Required fields: problem statement, rough hypothesis, suspected lever (acquisition, activation, retention, revenue, referral).
- Prioritization scoring. The growth lead applies ICE, RICE, or PIE and ranks the queue weekly.
- Experiment design. The DRI writes the full experiment doc: hypothesis, variants, audience, primary metric, guardrails, sample size estimate, and kill criteria. See our guide to designing a growth experiment.
- Instrumentation review. Engineering or analytics confirms the events, segments, and dashboards exist before code ships. No instrumentation, no launch.
- Launch. Test goes live. Owner posts the link to the experiment doc and the dashboard in the team channel.
- Monitoring. Daily check on guardrails for the first 72 hours. After that, hands-off until the predetermined end date or a kill trigger fires.
- Readout + decision. DRI presents the result in the weekly review. Decision: ship, iterate, kill, or extend (with a stated reason). Learning is logged in the team's permanent test library.
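Here is that sketch: a toy state machine that refuses to skip steps. The names are illustrative, not from any real experimentation tool.

```python
from dataclasses import dataclass
from enum import Enum

class Step(Enum):
    IDEA = 1          # idea capture
    SCORED = 2        # prioritization scoring
    DESIGNED = 3      # experiment design
    INSTRUMENTED = 4  # instrumentation review
    LAUNCHED = 5      # launch
    MONITORED = 6     # monitoring
    DECIDED = 7       # readout + decision

@dataclass
class Experiment:
    name: str
    dri: str          # exactly one named owner
    step: Step = Step.IDEA

    def advance(self, to: Step) -> None:
        """Enforce the loop: each step requires the previous one, none skipped."""
        if to.value != self.step.value + 1:
            raise ValueError(f"cannot jump from {self.step.name} to {to.name}")
        self.step = to

exp = Experiment("template-gallery", dri="maya")
exp.advance(Step.SCORED)
exp.advance(Step.LAUNCHED)  # raises ValueError: DESIGNED and INSTRUMENTED were skipped
```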
This is the same loop HubSpot's growth team uses, where every growth squad maintains an Airtable visible to the rest of the company.
Who owns each step?
Every step has exactly one DRI. Capture is open, but prioritization is owned by the growth lead. Design and instrumentation review are co-owned by the experiment DRI and the engineer or analyst on the squad. Launch and decision are owned by the DRI. The weekly review is run by the growth lead. If two people own a step, no one does.
How do you write a good growth experiment hypothesis?
A good growth experiment hypothesis is a falsifiable, data-backed prediction in this exact form: Because [observation], we believe [change] for [audience] will cause [metric] to move [direction + size] within [timeframe]. If your hypothesis cannot be mapped onto that template, it is not a hypothesis -- it is a wish.
The template has five non-negotiable parts, sketched as a structured record after the list:
- Observation. The data you are reacting to. Quantitative (event log, funnel drop) or qualitative (5+ user interviews, support tickets).
- Change. A specific, single intervention. Not 'redesign onboarding.' Specifically: 'replace step-2 form with progressive profiling.'
- Audience. Who sees the change. New signups in last 7 days, mobile only, US English locale.
- Metric and effect size. Primary metric and the minimum detectable effect you would consider a win. 'Activation rate from 38% to 43%.'
- Timeframe. Pre-committed end date or sample size.
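That record, as a minimal Python sketch. The field names mirror the five parts above; nothing here is tied to any particular tool.

```python
from dataclasses import dataclass, fields

@dataclass
class Hypothesis:
    observation: str  # the data you are reacting to
    change: str       # one specific intervention
    audience: str     # who sees the change
    metric: str       # primary metric + minimum detectable effect
    timeframe: str    # pre-committed end date or sample size

    def validate(self) -> None:
        """Refuse to queue a hypothesis with any blank part."""
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"not a hypothesis, a wish -- missing: {', '.join(missing)}")

Hypothesis(
    observation="62% of new signups abandon at workspace creation (n=412, 14 days)",
    change="replace empty workspace with a 3-template gallery",
    audience="all new B2B signups",
    metric="Day-1 activation 38% -> 43% (MDE 5pp)",
    timeframe="21-day window",
).validate()  # passes; blank any field and it raises
```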
Bad hypothesis (real example, anonymized)
'We should test a new onboarding flow because users seem confused.'
No data, no specific change, no audience, no metric, no end date. Cannot be falsified. Cannot be prioritized.
Good hypothesis
'Because session recordings show 62% of new signups abandon at the workspace-creation step (n=412 over 14 days), we believe replacing the empty workspace with a 3-template gallery for all new B2B signups will lift Day-1 activation from 38% to 43% (MDE 5pp at 80% power) within a 21-day test window. Guardrail: trial-to-paid conversion not down >2pp.'
The second one is testable, scoreable, and shippable. Specificity is the single biggest predictor of whether an experiment finishes with a clear decision.
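The 'MDE 5pp at 80% power' clause is checkable before launch. A minimal sketch of the standard two-proportion sample-size formula, assuming a two-sided test at alpha = 0.05 (scipy is the only dependency):

```python
from scipy.stats import norm

def n_per_arm(p_baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per arm for a two-proportion z-test, two-sided."""
    p1, p2 = p_baseline, p_baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 at alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 at 80% power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / mde ** 2) + 1

print(n_per_arm(0.38, 0.05))  # -> 1512 per arm for the hypothesis above
```

If the funnel cannot supply roughly 3,000 eligible signups in 21 days, the hypothesis fails its own timeframe clause and should be redesigned before it is queued.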
How do you decide which growth experiment to run first?
Score every queued experiment with one prioritization framework, applied identically across the team, and rerun the score weekly. The three frameworks worth knowing are ICE, RICE, and PIE (larger product orgs sometimes add a custom weighted score). Pick one based on what data you actually have, not which acronym sounds best.
| Framework | Factors | Best for | Origin |
|---|---|---|---|
| ICE | Impact, Confidence, Ease (1-10 each) | Small teams, rapid prioritization, early-stage products | Sean Ellis at GrowthHackers |
| RICE | Reach, Impact, Confidence, Effort | Teams with quantitative reach data, mature funnels | Sean McBride at Intercom |
| PIE | Potential, Importance, Ease | CRO-heavy teams testing existing pages | Chris Goward at WiderFunnel |
| Weighted scoring | Custom factors with weights | Cross-functional teams with competing objectives | Internal product orgs |
Use ICE if you are a small team running fewer than 30 experiments a quarter. It is fast and forgiving. Use RICE when ideas vary widely in how many users they touch -- a homepage test reaches 10x more people than a billing-page test, and ICE will not capture that. Use PIE if 80% of what you ship is page-level CRO.
Whatever you choose, two rules apply:
- Score in the meeting, not before. Forces calibration across the team.
- Re-score the top 10 every week. New data changes Confidence. New roadmap changes Effort.
Do not stack-rank with three competing frameworks. According to Optimizely's analysis of 127,000 experiments, only ~12% of tests produce a clear win, so prioritization is less about picking the winner and more about maximizing learning per week.
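A minimal sketch of consistent scoring. ICE is on the 1-10 scale described above; the RICE here is simplified (canonical RICE uses a 0.25-3 impact scale and percentage confidence). Field names and numbers are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    impact: int       # 1-10
    confidence: int   # 1-10
    ease: int         # 1-10 (ICE)
    reach: int        # users touched per quarter (RICE)
    effort: float     # person-weeks (RICE)

def ice(i: Idea) -> int:
    return i.impact * i.confidence * i.ease

def rice(i: Idea) -> float:
    return i.reach * i.impact * i.confidence / i.effort

queue = [
    Idea("homepage hero test", impact=6, confidence=5, ease=7, reach=40_000, effort=2),
    Idea("billing-page copy", impact=7, confidence=8, ease=9, reach=4_000, effort=0.5),
]
# Re-score weekly: new data changes confidence, new roadmap changes effort.
for idea in sorted(queue, key=rice, reverse=True):
    print(f"{idea.name:20s} ICE={ice(idea):4d}  RICE={rice(idea):>9,.0f}")
```

Note the divergence: the billing-page idea wins on ICE (504 vs 210), but the homepage test wins on RICE (600,000 vs 448,000) once reach enters -- exactly the homepage-vs-billing-page gap described above.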
What is the right experiment cadence for a small team?
For a 3-5 person growth team, the right cadence is 2-4 launched experiments per week with a 30-minute weekly review. That is the band where you ship enough to learn but not so much that instrumentation and analysis quality collapse.
The scaling curve looks like this:
- Solo founder / 1-person growth: 1 experiment / week. Anything more and you cannot read results properly.
- 3-5 person team: 2-4 / week. This is the sweet spot for most Series A-B SaaS.
- 10+ person team: 5-10 / week. Requires dedicated experimentation tooling and a full-time analyst.
- Booking.com / Airbnb scale: 100-700+ / week, with Airbnb growing from 100 to 700 experiments per week over two years.
Velocity is the right north-star metric for an experimentation program, not win rate. According to Reforge's analysis, tripling velocity matters more than improving hit rate, because win rates cluster in a narrow band (~12-33%) while velocity is uncapped.
Do not chase Booking.com's number from a standing start. Their 1,000 concurrent tests run on a decade of platform investment. Start with 2 launched per week, hold that for 6 weeks, then evaluate where the bottleneck is (idea quality, instrumentation, or readout discipline) before increasing.
What does a growth experiment readout look like?
A growth experiment readout is a 1-page document and a 5-minute verbal presentation in the weekly review that ends with a decision: ship, iterate, kill, or extend. It is not a report. Reports describe. Readouts decide.
Every readout has these eight sections, in order:
- Hypothesis (paste the original, do not rewrite).
- Variant + audience (what shipped to whom).
- Sample size + duration (n per arm, days live).
- Primary metric result (with confidence interval, p-value, or Bayesian probability).
- Guardrail metrics (did anything regress?).
- Segment cuts (top 2-3 only -- do not p-hack).
- Surprise / what we did not expect.
- Decision and rationale (ship / iterate / kill / extend, with one-sentence why).
Example readout summary line
'Template gallery lifted Day-1 activation 38% -> 41.2% (95% CI: +1.4pp to +5.0pp, p=0.03, n=2,840). Trial-to-paid flat. Decision: ship to 100%, iterate on template count next quarter.'
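The numbers in a line like that come from a standard two-proportion comparison. A minimal scipy-only sketch with hypothetical counts (a real program would lean on its experimentation platform's stats engine; the counts below are illustrative, not a reconstruction of the quoted line):

```python
from scipy.stats import norm

def readout(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Difference in proportions with a 95% CI and a two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = diff / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return diff, (diff - 1.96 * se, diff + 1.96 * se), p_value

# Hypothetical arms: control 38.0%, variant 41.2%, n=2,840 per arm
diff, ci, p = readout(1079, 2840, 1170, 2840)
print(f"lift {diff:+.1%}, 95% CI ({ci[0]:+.1%}, {ci[1]:+.1%}), p={p:.3f}")
# -> lift +3.2%, 95% CI (+0.7%, +5.7%), p=0.014
```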
The readout lives in the same experiment doc the team has been writing in since kickoff -- not a new deck. Continuity is the point. According to Optimizely's experimentation reporting guidance, teams that maintain a single continuous doc per experiment are 2-3x more likely to revisit and reuse historical learnings.
What about inconclusive results?
Inconclusive is a valid outcome and should be called as 'kill' by default unless the team has a specific reason to extend. An inconclusive result after a properly powered test is information: the effect is too small to matter at this sample size. Logging 'kill -- effect too small to matter at our scale' is more honest, and more useful, than running it for another two weeks hoping the p-value drops.
How do you run a weekly growth experiment review meeting?
The weekly experiment review is a 30-minute, same-time, fixed-agenda meeting that is the operating heartbeat of the framework. Skip it for two weeks and the loop breaks: launches drift, readouts pile up, decisions stall, the backlog grows.
The agenda has four blocks, timeboxed:
| Block | Time | What happens |
|---|---|---|
| 1. Readouts | 12 min | Each completed experiment gets 3-5 min: result + decision. No re-litigating. |
| 2. In-flight check | 5 min | Anything red on guardrails? Anything ready to read out next week? |
| 3. Launch queue | 8 min | This week's launches: instrumentation confirmed, doc complete, kill criteria written. |
| 4. Backlog re-score | 5 min | Top 10 in the queue -- any score changes since last week? Surface new ideas. |
Rules of the room:
- DRI presents their own readout. No proxies.
- Decisions are made in the meeting. If something cannot be decided, the blocker is named and a decision date is set.
- Outcomes go into the test library before the meeting ends. No 'I'll write it up later.'
- Stakeholders observe; they do not relitigate scoring.
HubSpot's growth team's Airtable-based system and Reforge's Pipefy-based system share the same shape: short, structured, decisive, weekly.
What are good kill criteria for a growth experiment?
Good kill criteria are written before launch, are objective, and trigger automatically -- so the decision to stop is not made under emotional or political pressure when the experiment is already live. Every experiment doc must include three kill triggers.
The three triggers, in priority order (an automated check is sketched after the list):
- Guardrail breach. A predefined regression in a protected metric. Example: trial-to-paid conversion drops more than 2pp, or page errors increase >0.5%. Kill immediately, do not wait for the end date.
- Hard end date. A pre-committed last day. Typically 14-28 days. If the test has not reached significance by then, the effect is too small to matter. Kill.
- Sample-size floor. Minimum n per arm to power the test. If the experiment cannot reach this volume in the time window, kill before launch -- not after.
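What 'trigger automatically' can look like in practice: a daily job that evaluates the three triggers in order. A minimal sketch -- the field names are illustrative, and the inputs would be wired to your own metrics store:

```python
from datetime import date

def kill_check(exp: dict, today: date) -> str | None:
    """Return a kill reason if any pre-registered trigger fires, else None."""
    # 1. Guardrail breach: kill immediately, do not wait for the end date.
    for name, (current, floor) in exp["guardrails"].items():
        if current < floor:
            return f"guardrail breach: {name} at {current:.1%} (floor {floor:.1%})"
    # 2. Hard end date: past it, the effect is too small to matter.
    if today > exp["end_date"]:
        return "hard end date passed without a significant result"
    # 3. Sample-size floor: projected volume cannot power the test.
    days_left = (exp["end_date"] - today).days
    projected = exp["n_per_arm"] + exp["daily_n_per_arm"] * days_left
    if projected < exp["min_n_per_arm"]:
        return f"cannot reach n={exp['min_n_per_arm']} per arm by the end date"
    return None

experiment = {
    "guardrails": {"trial_to_paid": (0.118, 0.10)},  # (current, pre-registered floor)
    "end_date": date(2026, 3, 14),
    "n_per_arm": 600, "daily_n_per_arm": 70, "min_n_per_arm": 1512,
}
print(kill_check(experiment, date(2026, 3, 1)) or "keep running")
# -> cannot reach n=1512 per arm by the end date
```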
Why pre-registering kills matters
According to Optimizely's research on failed experiments, 88% of implemented ideas do not produce a positive significant change. Without kill criteria, those failed tests stay live for weeks while teams 'wait for more data' -- which is almost always confirmation bias dressed up as patience.
The kill-list framework used by operators applies a three-test rule: strategy fit, evidence, and cost to ship. If a project fails two of three, it is killed. Apply the same logic to experiments: if a test fails the guardrail, the end date, or the sample floor, kill it. No reprieves.
The payoff is throughput. Teams that kill ruthlessly run more experiments, learn faster, and burn fewer engineering cycles maintaining variants nobody is reading.
What are the most common growth experimentation framework failure modes?
Five failure modes account for nearly every framework that quietly dies after the first quarter. Watch for these specifically; they are easier to prevent than to fix.
- The vanity hit-rate. Teams optimize for 'experiments that win' instead of 'learnings per week.' Win rates hover around 12-33% across major programs, so optimizing for them caps your ceiling. Optimize velocity and decision quality instead.
- No instrumentation gate. Tests launch before the events exist. Two weeks later: 'we cannot actually measure this.' Add a hard gate: no events in production, no launch.
- The HiPPO override. An exec kills or ships an experiment based on opinion mid-flight. Solve with pre-registered kill criteria and a no-relitigation rule in the weekly review.
- Backlog hoarding. Ideas accumulate, never get scored, never get killed. Apply a 90-day rule: any idea in the backlog longer than 90 days without scoring gets archived.
- No central learning library. Every readout vanishes into Slack. Six months later, the team retests the same idea. Maintain a single, searchable library of every experiment doc, ever.
The pattern across all five: lack of a forcing function. A framework only works when the cadence and the constraints are non-negotiable. The moment you make exceptions, you are back to a backlog with extra steps.
How does this framework compare to GoodUI, Reforge, and Optimizely's approaches?
The framework above is a synthesis of three well-documented public systems. Each one weights the loop differently; pick the emphasis that matches your team.
| Source | Core insight | What they emphasize | Best fit for |
|---|---|---|---|
| GoodUI Evidence | Pattern-based prediction beats fresh ideation | Pre-test prediction; 71% prediction accuracy across 610 tests, 141 patterns | CRO-heavy teams, e-commerce, landing pages |
| Reforge GEMS | Velocity > win rate; cross-functional idea sourcing | The operating system: queue, schedule, weekly review | Mid-stage SaaS growth teams |
| Optimizely | Statistical rigor and guardrails | Win-rate analysis (12% overall, 10% revenue), trustworthy reads | Mature programs, large enterprise |
| Sean Ellis high-tempo testing | Tempo as the primary lever | ICE scoring + weekly meeting cadence | Early-stage, small teams |
If you are starting from zero, copy Sean Ellis's high-tempo testing structure verbatim for the first 90 days: ICE scoring, weekly meeting, hypothesis template, kill criteria. After 90 days, layer in Reforge's instrumentation discipline. After 180 days, add Optimizely's statistical rigor and GoodUI's pattern library. Trying to install all four in month one is how frameworks die.
Get the growth experimentation framework Notion template
We packaged the full operating system -- hypothesis template, ICE/RICE scoring rubric, experiment doc template, weekly review agenda, readout format, and pre-registered kill criteria -- into a single Notion workspace you can duplicate in 30 seconds.
What is in the template:
- Idea inbox with required fields and a Slack capture form.
- Scoring queue with ICE, RICE, and PIE columns (use whichever fits).
- Experiment doc template that stays with the test from kickoff to readout.
- Pre-built weekly review agenda with timeboxes.
- Readout 1-pager with the 8-section structure.
- Test library (the searchable archive of every experiment, win or loss).
- Kill criteria checklist required before any launch.
Duplicate the template, customize the scoring rubric for your team, and run your first weekly review within seven days. For deeper context on each component, see our companion guides on ICE vs RICE vs PIE, how to design a growth experiment, and the best growth experiment tools for 2026.