A growth experimentation framework is the repeatable system a team uses to turn raw ideas into instrumented, shipped tests on a predictable weekly cadence. A real framework has five working parts: (1) a hypothesis template, (2) a prioritization rubric like ICE or RICE, (3) an experiment doc that travels from kickoff to readout, (4) a weekly experiment review, and (5) written kill criteria. Most teams have a Notion list of ideas and call it a framework. It is not. This guide walks the operational loop end-to-end, with templates and example docs.
What is a growth experimentation framework?
A growth experimentation framework is a closed-loop operating system that takes an idea, scores it, ships it as a controlled test, reads out the result, and feeds the learning back into the next experiment. It is the difference between a team that runs five tests a quarter and one that runs fifty.
The framework has five mandatory components:
- A hypothesis template. Standardized format every idea must fit into before it can be queued.
- A scoring rubric. ICE, RICE, or PIE -- pick one, apply it consistently.
- An experiment doc. A single living document per test, owned by one DRI, covering hypothesis, design, instrumentation, results, and decision.
- A weekly experiment review meeting. 30 minutes, same time, same agenda.
- Pre-registered kill criteria. A date, a guardrail, and a minimum sample size, written before launch.
If any of those five is missing, you do not have a framework. You have a Trello board with good intentions. According to Reforge, implementing exactly these components in its growth experiment management system tripled testing velocity and doubled cross-functional idea contribution.
Why do most growth teams have a backlog instead of a framework?
Most teams collect ideas faster than they ship tests, so the backlog grows while velocity stays flat. The root cause is almost always the same: the team has decided what to test but never decided how to test, when to read out, or who owns the call to ship, scale, or kill.
The symptoms are predictable:
- A 200-row Notion database where 80% of cards are stuck in 'Idea' or 'Prioritized' status.
- Tests that launch but never close (no readout, no decision, no learning logged).
- Experiments scored once during planning and never re-scored as new data lands.
- Recurring debates about 'should we kill this' three weeks past the original end date.
- A 'growth meeting' that is actually a status update, not a decision forum.
Booking.com runs roughly 1,000 concurrent experiments and ships test variants across 75 countries in under an hour. They do not get there with a better backlog. They get there because every experiment passes through the same five-component loop, every time, with no exceptions for the HiPPO (highest-paid person's opinion).
What are the 7 steps of a growth experimentation cycle?
A complete growth experimentation cycle has seven steps, each with a named owner and a definition of done. Skip a step and the loop breaks: you either launch tests you cannot read, or read tests you cannot act on. A minimal code sketch of the loop follows the list.
- Idea capture. Anyone in the company can drop an idea into the inbox. Required fields: problem statement, rough hypothesis, suspected lever (acquisition, activation, retention, revenue, referral).
- Prioritization scoring. The growth lead applies ICE, RICE, or PIE and ranks the queue weekly.
- Experiment design. The DRI writes the full experiment doc: hypothesis, variants, audience, primary metric, guardrails, sample size estimate, and kill criteria. See our guide to designing a growth experiment.
- Instrumentation review. Engineering or analytics confirms the events, segments, and dashboards exist before code ships. No instrumentation, no launch.
- Launch. Test goes live. Owner posts the link to the experiment doc and the dashboard in the team channel.
- Monitoring. Daily check on guardrails for the first 72 hours. After that, hands-off until the predetermined end date or a kill trigger fires.
- Readout + decision. DRI presents the result in the weekly review. Decision: ship, iterate, kill, or extend (with a stated reason). Learning is logged in the team's permanent test library.
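Here is that sketch: a toy state machine that refuses to skip steps. The names are illustrative, not from any real experimentation tool.

```python
from dataclasses import dataclass
from enum import Enum

class Step(Enum):
    IDEA = 1          # idea capture
    SCORED = 2        # prioritization scoring
    DESIGNED = 3      # experiment design
    INSTRUMENTED = 4  # instrumentation review
    LAUNCHED = 5      # launch
    MONITORED = 6     # monitoring
    DECIDED = 7       # readout + decision

@dataclass
class Experiment:
    name: str
    dri: str          # exactly one named owner
    step: Step = Step.IDEA

    def advance(self, to: Step) -> None:
        """Enforce the loop: each step requires the previous one, none skipped."""
        if to.value != self.step.value + 1:
            raise ValueError(f"cannot jump from {self.step.name} to {to.name}")
        self.step = to

exp = Experiment("template-gallery", dri="maya")
exp.advance(Step.SCORED)
exp.advance(Step.LAUNCHED)  # raises ValueError: DESIGNED and INSTRUMENTED were skipped
```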
This is the same loop HubSpot's growth team uses, where every growth squad maintains an Airtable visible to the rest of the company.
Who owns each step?
Every step has exactly one DRI. Capture is open, but prioritization is owned by the growth lead. Design and instrumentation review are co-owned by the experiment DRI and the engineer or analyst on the squad. Launch and decision are owned by the DRI. The weekly review is run by the growth lead. If two people own a step, no one does.
How do you write a good growth experiment hypothesis?
A good growth experiment hypothesis is a falsifiable, data-backed prediction in this exact form: Because [observation], we believe [change] for [audience] will cause [metric] to move [direction + size] within [timeframe]. If your hypothesis cannot be mapped onto that template, it is not a hypothesis -- it is a wish.
The template has five non-negotiable parts, sketched as a structured record after the list:
- Observation. The data you are reacting to. Quantitative (event log, funnel drop) or qualitative (5+ user interviews, support tickets).
- Change. A specific, single intervention. Not 'redesign onboarding.' Specifically: 'replace step-2 form with progressive profiling.'
- Audience. Who sees the change. New signups in last 7 days, mobile only, US English locale.
- Metric and effect size. Primary metric and the minimum detectable effect you would consider a win. 'Activation rate from 38% to 43%.'
- Timeframe. Pre-committed end date or sample size.
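That record, as a minimal Python sketch. The field names mirror the five parts above; nothing here is tied to any particular tool.

```python
from dataclasses import dataclass, fields

@dataclass
class Hypothesis:
    observation: str  # the data you are reacting to
    change: str       # one specific intervention
    audience: str     # who sees the change
    metric: str       # primary metric + minimum detectable effect
    timeframe: str    # pre-committed end date or sample size

    def validate(self) -> None:
        """Refuse to queue a hypothesis with any blank part."""
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"not a hypothesis, a wish -- missing: {', '.join(missing)}")

Hypothesis(
    observation="62% of new signups abandon at workspace creation (n=412, 14 days)",
    change="replace empty workspace with a 3-template gallery",
    audience="all new B2B signups",
    metric="Day-1 activation 38% -> 43% (MDE 5pp)",
    timeframe="21-day window",
).validate()  # passes; blank any field and it raises
```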
Bad hypothesis (real example, anonymized)
'We should test a new onboarding flow because users seem confused.'
No data, no specific change, no audience, no metric, no end date. Cannot be falsified. Cannot be prioritized.
Good hypothesis
'Because session recordings show 62% of new signups abandon at the workspace-creation step (n=412 over 14 days), we believe replacing the empty workspace with a 3-template gallery for all new B2B signups will lift Day-1 activation from 38% to 43% (MDE 5pp at 80% power) within a 21-day test window. Guardrail: trial-to-paid conversion not down >2pp.'
The second one is testable, scoreable, and shippable. Specificity is the single biggest predictor of whether an experiment finishes with a clear decision.
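The 'MDE 5pp at 80% power' clause is checkable before launch. A minimal sketch of the standard two-proportion sample-size formula, assuming a two-sided test at alpha = 0.05 (scipy is the only dependency):

```python
from scipy.stats import norm

def n_per_arm(p_baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per arm for a two-proportion z-test, two-sided."""
    p1, p2 = p_baseline, p_baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 at alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 at 80% power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / mde ** 2) + 1

print(n_per_arm(0.38, 0.05))  # -> 1512 per arm for the hypothesis above
```

If the funnel cannot supply roughly 3,000 eligible signups in 21 days, the hypothesis fails its own timeframe clause and should be redesigned before it is queued.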
How do you decide which growth experiment to run first?
Score every queued experiment with one prioritization framework, applied identically across the team, and rerun the score weekly. The three frameworks worth knowing are ICE, RICE, and PIE (larger product orgs sometimes add a custom weighted score). Pick one based on what data you actually have, not which acronym sounds best.
| Framework | Factors | Best for | Origin |
|---|---|---|---|
| ICE | Impact, Confidence, Ease (1-10 each) | Small teams, rapid prioritization, early-stage products | Sean Ellis at GrowthHackers |
| RICE | Reach, Impact, Confidence, Effort | Teams with quantitative reach data, mature funnels | Sean McBride at Intercom |
| PIE | Potential, Importance, Ease | CRO-heavy teams testing existing pages | Chris Goward at WiderFunnel |
| Weighted scoring | Custom factors with weights | Cross-functional teams with competing objectives | Internal product orgs |
Use ICE if you are a small team running fewer than 30 experiments a quarter. It is fast and forgiving. Use RICE when ideas vary widely in how many users they touch -- a homepage test reaches 10x more people than a billing-page test, and ICE will not capture that. Use PIE if 80% of what you ship is page-level CRO.
Whatever you choose, two rules apply:
- Score in the meeting, not before. Forces calibration across the team.
- Re-score the top 10 every week. New data changes Confidence. New roadmap changes Effort.
Do not stack-rank with three competing frameworks. According to Optimizely's analysis of 127,000 experiments, only ~12% of tests produce a clear win, so prioritization is less about picking the winner and more about maximizing learning per week.
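A minimal sketch of consistent scoring. ICE is on the 1-10 scale described above; the RICE here is simplified (canonical RICE uses a 0.25-3 impact scale and percentage confidence). Field names and numbers are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    impact: int       # 1-10
    confidence: int   # 1-10
    ease: int         # 1-10 (ICE)
    reach: int        # users touched per quarter (RICE)
    effort: float     # person-weeks (RICE)

def ice(i: Idea) -> int:
    return i.impact * i.confidence * i.ease

def rice(i: Idea) -> float:
    return i.reach * i.impact * i.confidence / i.effort

queue = [
    Idea("homepage hero test", impact=6, confidence=5, ease=7, reach=40_000, effort=2),
    Idea("billing-page copy", impact=7, confidence=8, ease=9, reach=4_000, effort=0.5),
]
# Re-score weekly: new data changes confidence, new roadmap changes effort.
for idea in sorted(queue, key=rice, reverse=True):
    print(f"{idea.name:20s} ICE={ice(idea):4d}  RICE={rice(idea):>9,.0f}")
```

Note the divergence: the billing-page idea wins on ICE (504 vs 210), but the homepage test wins on RICE (600,000 vs 448,000) once reach enters -- exactly the homepage-vs-billing-page gap described above.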
What is the right experiment cadence for a small team?
For a 3-5 person growth team, the right cadence is 2-4 launched experiments per week with a 30-minute weekly review. That is the band where you ship enough to learn but not so much that instrumentation and analysis quality collapse.
The scaling curve looks like this:
- Solo founder / 1-person growth: 1 experiment / week. Anything more and you cannot read results properly.
- 3-5 person team: 2-4 / week. This is the sweet spot for most Series A-B SaaS.
- 10+ person team: 5-10 / week. Requires dedicated experimentation tooling and a full-time analyst.
- Booking.com / Airbnb scale: 100-700+ / week, with Airbnb growing from 100 to 700 experiments per week over two years.
Velocity is the right north-star metric for an experimentation program, not win rate. According to Reforge's analysis, tripling velocity matters more than improving hit rate, because win rates cluster in a narrow band (~12-33%) while velocity is uncapped.
Do not chase Booking.com's number from a standing start. Their 1,000 concurrent tests run on a decade of platform investment. Start with 2 launched per week, hold that for 6 weeks, then evaluate where the bottleneck is (idea quality, instrumentation, or readout discipline) before increasing.
What does a growth experiment readout look like?
A growth experiment readout is a 1-page document and a 5-minute verbal presentation in the weekly review that ends with a decision: ship, iterate, kill, or extend. It is not a report. Reports describe. Readouts decide.
Every readout has these eight sections, in order:
- Hypothesis (paste the original, do not rewrite).
- Variant + audience (what shipped to whom).
- Sample size + duration (n per arm, days live).
- Primary metric result (with confidence interval, p-value, or Bayesian probability).
- Guardrail metrics (did anything regress?).
- Segment cuts (top 2-3 only -- do not p-hack).
- Surprise / what we did not expect.
- Decision and rationale (ship / iterate / kill / extend, with one-sentence why).
Example readout summary line
'Template gallery lifted Day-1 activation 38% -> 41.2% (95% CI: +1.4pp to +5.0pp, p=0.03, n=2,840). Trial-to-paid flat. Decision: ship to 100%, iterate on template count next quarter.'
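The numbers in a line like that come from a standard two-proportion comparison. A minimal scipy-only sketch with hypothetical counts (a real program would lean on its experimentation platform's stats engine; the counts below are illustrative, not a reconstruction of the quoted line):

```python
from scipy.stats import norm

def readout(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Difference in proportions with a 95% CI and a two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = diff / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return diff, (diff - 1.96 * se, diff + 1.96 * se), p_value

# Hypothetical arms: control 38.0%, variant 41.2%, n=2,840 per arm
diff, ci, p = readout(1079, 2840, 1170, 2840)
print(f"lift {diff:+.1%}, 95% CI ({ci[0]:+.1%}, {ci[1]:+.1%}), p={p:.3f}")
# -> lift +3.2%, 95% CI (+0.7%, +5.7%), p=0.014
```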
The readout lives in the same experiment doc the team has been writing in since kickoff -- not a new deck. Continuity is the point. According to Optimizely's experimentation reporting guidance, teams that maintain a single continuous doc per experiment are 2-3x more likely to revisit and reuse historical learnings.
What about inconclusive results?
Inconclusive is a valid outcome and should be called as 'kill' by default unless the team has a specific reason to extend. An inconclusive result after a properly powered test is information: the effect is too small to matter at this sample size. Logging 'kill -- effect too small to matter at our scale' is more honest, and more useful, than running it for another two weeks hoping the p-value drops.
How do you run a weekly growth experiment review meeting?
The weekly experiment review is a 30-minute, same-time, fixed-agenda meeting that is the operating heartbeat of the framework. Skip it for two weeks and the loop breaks: launches drift, readouts pile up, decisions stall, the backlog grows.
The agenda has four blocks, timeboxed:
| Block | Time | What happens |
|---|---|---|
| 1. Readouts | 12 min | Each completed experiment gets 3-5 min: result + decision. No re-litigating. |
| 2. In-flight check | 5 min | Anything red on guardrails? Anything ready to read out next week? |
| 3. Launch queue | 8 min | This week's launches: instrumentation confirmed, doc complete, kill criteria written. |
| 4. Backlog re-score | 5 min | Top 10 in the queue -- any score changes since last week? Surface new ideas. |
Rules of the room:
- DRI presents their own readout. No proxies.
- Decisions are made in the meeting. If something cannot be decided, the blocker is named and a decision date is set.
- Outcomes go into the test library before the meeting ends. No 'I'll write it up later.'
- Stakeholders observe; they do not relitigate scoring.
HubSpot's growth team's Airtable-based system and Reforge's Pipefy-based system share the same shape: short, structured, decisive, weekly.
What are good kill criteria for a growth experiment?
Good kill criteria are written before launch, are objective, and trigger automatically -- so the decision to stop is not made under emotional or political pressure when the experiment is already live. Every experiment doc must include three kill triggers.
The three triggers, in priority order (an automated check is sketched after the list):
- Guardrail breach. A predefined regression in a protected metric. Example: trial-to-paid conversion drops more than 2pp, or page errors increase >0.5%. Kill immediately, do not wait for the end date.
- Hard end date. A pre-committed last day. Typically 14-28 days. If the test has not reached significance by then, the effect is too small to matter. Kill.
- Sample-size floor. Minimum n per arm to power the test. If the experiment cannot reach this volume in the time window, kill before launch -- not after.
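What 'trigger automatically' can look like in practice: a daily job that evaluates the three triggers in order. A minimal sketch -- the field names are illustrative, and the inputs would be wired to your own metrics store:

```python
from datetime import date

def kill_check(exp: dict, today: date) -> str | None:
    """Return a kill reason if any pre-registered trigger fires, else None."""
    # 1. Guardrail breach: kill immediately, do not wait for the end date.
    for name, (current, floor) in exp["guardrails"].items():
        if current < floor:
            return f"guardrail breach: {name} at {current:.1%} (floor {floor:.1%})"
    # 2. Hard end date: past it, the effect is too small to matter.
    if today > exp["end_date"]:
        return "hard end date passed without a significant result"
    # 3. Sample-size floor: projected volume cannot power the test.
    days_left = (exp["end_date"] - today).days
    projected = exp["n_per_arm"] + exp["daily_n_per_arm"] * days_left
    if projected < exp["min_n_per_arm"]:
        return f"cannot reach n={exp['min_n_per_arm']} per arm by the end date"
    return None

experiment = {
    "guardrails": {"trial_to_paid": (0.118, 0.10)},  # (current, pre-registered floor)
    "end_date": date(2026, 3, 14),
    "n_per_arm": 600, "daily_n_per_arm": 70, "min_n_per_arm": 1512,
}
print(kill_check(experiment, date(2026, 3, 1)) or "keep running")
# -> cannot reach n=1512 per arm by the end date
```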
Why pre-registering kills matters
According to Optimizely's research on failed experiments, 88% of implemented ideas do not produce a positive significant change. Without kill criteria, those failed tests stay live for weeks while teams 'wait for more data' -- which is almost always confirmation bias dressed up as patience.
The kill-list framework used by operators applies a three-test rule: strategy fit, evidence, and cost to ship. If a project fails two of three, it is killed. Apply the same logic to experiments: if a test fails the guardrail, the end date, or the sample floor, kill it. No reprieves.
The payoff is throughput. Teams that kill ruthlessly run more experiments, learn faster, and burn fewer engineering cycles maintaining variants nobody is reading.
What are the most common growth experimentation framework failure modes?
Five failure modes account for nearly every framework that quietly dies after the first quarter. Watch for these specifically; they are easier to prevent than to fix.
- The vanity hit-rate. Teams optimize for 'experiments that win' instead of 'learnings per week.' Win rates hover around 12-33% across major programs, so optimizing for them caps your ceiling. Optimize velocity and decision quality instead.
- No instrumentation gate. Tests launch before the events exist. Two weeks later: 'we cannot actually measure this.' Add a hard gate: no events in production, no launch.
- The HiPPO override. An exec kills or ships an experiment based on opinion mid-flight. Solve with pre-registered kill criteria and a no-relitigation rule in the weekly review.
- Backlog hoarding. Ideas accumulate, never get scored, never get killed. Apply a 90-day rule: any idea in the backlog longer than 90 days without scoring gets archived.
- No central learning library. Every readout vanishes into Slack. Six months later, the team retests the same idea. Maintain a single, searchable library of every experiment doc, ever.
The pattern across all five: lack of a forcing function. A framework only works when the cadence and the constraints are non-negotiable. The moment you make exceptions, you are back to a backlog with extra steps.
How does this framework compare to GoodUI, Reforge, and Optimizely's approaches?
The framework above is a synthesis of three well-documented public systems. Each one weights the loop differently; pick the emphasis that matches your team.
| Source | Core insight | What they emphasize | Best fit for |
|---|---|---|---|
| GoodUI Evidence | Pattern-based prediction beats fresh ideation | Pre-test prediction; 71% prediction accuracy across 610 tests, 141 patterns | CRO-heavy teams, e-commerce, landing pages |
| Reforge GEMS | Velocity > win rate; cross-functional idea sourcing | The operating system: queue, schedule, weekly review | Mid-stage SaaS growth teams |
| Optimizely | Statistical rigor and guardrails | Win-rate analysis (12% overall, 10% revenue), trustworthy reads | Mature programs, large enterprise |
| Sean Ellis high-tempo testing | Tempo as the primary lever | ICE scoring + weekly meeting cadence | Early-stage, small teams |
If you are starting from zero, copy Sean Ellis's high-tempo testing structure verbatim for the first 90 days: ICE scoring, weekly meeting, hypothesis template, kill criteria. After 90 days, layer in Reforge's instrumentation discipline. After 180 days, add Optimizely's statistical rigor and GoodUI's pattern library. Trying to install all four in month one is how frameworks die.
Get the growth experimentation framework Notion template
We packaged the full operating system -- hypothesis template, ICE/RICE scoring rubric, experiment doc template, weekly review agenda, readout format, and pre-registered kill criteria -- into a single Notion workspace you can duplicate in 30 seconds.
What is in the template:
- Idea inbox with required fields and a Slack capture form.
- Scoring queue with ICE, RICE, and PIE columns (use whichever fits).
- Experiment doc template that stays with the test from kickoff to readout.
- Pre-built weekly review agenda with timeboxes.
- Readout 1-pager with the 8-section structure.
- Test library (the searchable archive of every experiment, win or loss).
- Kill criteria checklist required before any launch.
Duplicate the template, customize the scoring rubric for your team, and run your first weekly review within seven days. For deeper context on each component, see our companion guides on ICE vs RICE vs PIE, how to design a growth experiment, and the best growth experiment tools for 2026.