If you run growth experiments, you have hit these 17 questions. They show up in r/datascience threads, in growth Slacks at 11pm, and on every PM's whiteboard the day before a launch decision. This guide gives each one a declarative, source-backed answer, the kind you can paste into a doc and ship a decision from. No hedging, no "it depends" without specifics. Inline citations point to Evan Miller, Microsoft's Experimentation Platform, Spotify Engineering, and the original CUPED paper.
How long should an A/B test run?
Two weeks minimum, six to eight weeks maximum. Sample size is set by your baseline conversion rate and your minimum detectable effect (MDE). Calendar duration is set by traffic volume plus business cycles. Run for at least one full weekly cycle even if you hit sample size sooner. Three sub-questions cover the edges of this rule.
What's the minimum runtime for an A/B test?
Two weeks, even if you hit your calculated sample size sooner. Evan Miller's sample size calculator tells you how many participants you need, not how many days. Two weeks captures one full weekday/weekend cycle and avoids day-of-week bias. The 6-8 week cap exists because longer tests suffer from cookie deletion, browser updates, and seasonal contamination. According to HubSpot (2026), 2-6 weeks is the practical sweet spot for most B2C and SaaS surfaces.
What if my MDE is too small?
A small MDE blows up your sample size and your timeline. According to Convert.com (2026), detecting a 1% relative lift on a 5% baseline conversion rate can require ~5 million total visitors and ~10 years; detecting a 10% lift takes ~50,000 visitors and 5 weeks. Most experimentation experts cap relative MDE at 2-5%. If your traffic envelope cannot power a 5% MDE in 6 weeks, you are not running that test. Either pick a bolder treatment or change methods (CUPED, switchback, segment-level rollout).
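You can sanity-check those traffic numbers yourself. The sketch below uses statsmodels' power calculator for a two-proportion test, assuming 80% power and a two-sided α of 0.05 (conventional defaults, not figures from Convert.com); the exact totals come out a little different from the cited ones, but the order of magnitude, millions of visitors for a 1% relative MDE versus tens of thousands for 10%, is the same.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def visitors_needed(baseline: float, relative_mde: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Total visitors (both arms) needed to detect a relative lift over a baseline rate."""
    effect = abs(proportion_effectsize(baseline, baseline * (1 + relative_mde)))
    per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                           power=power, ratio=1.0)
    return int(round(2 * per_arm))

print(visitors_needed(0.05, 0.01))   # ~6 million visitors for a 1% relative lift on a 5% baseline
print(visitors_needed(0.05, 0.10))   # ~60 thousand visitors for a 10% relative lift
```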
How do you run experiments with low traffic?
Three options. First, raise your MDE and test bolder changes (full redesigns, not button colors). According to VWO (2026), bold changes are the only way to get statistical significance in low-traffic environments. Second, pool traffic across URL patterns or page templates. Third, use variance reduction (CUPED) or switch to switchback experiments where the entire audience alternates between treatment and control over time windows. DoorDash, Uber, and Airbnb use switchback for marketplace tests.
Can you peek at A/B test results?
Not with classical fixed-horizon frequentist tests. Peeking inflates your false positive rate every time you check. The fix is to use a method designed for peeking (sequential or Bayesian) or commit to a pre-registered sample size and check only at the end. Three sub-questions clarify when peeking is safe and when it ruins the test.
Why does peeking break A/B tests?
Each look is another roll of the dice. Standard t-tests assume one analysis at a fixed sample size. According to Johari et al., SIGKDD 2017, repeated significance testing inflates Type I error: with 5 looks at α=0.05, the actual false positive rate climbs above 14%; with 20+ looks, Evan Miller (2026) shows it can reach 30-40%. The math does not care about your intent. If you check, the rate inflates.
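A quick simulation makes the inflation concrete. The sketch below runs A/A experiments (true effect zero), peeks after each of five interim batches, and counts how often any look crosses the threshold: with the naive 1.96 cutoff the false positive rate lands near 14%, while a stricter per-look cutoff of roughly 2.41 (the Pocock-style boundary for five looks, previewing the sequential methods in the next answer) pulls it back toward 5%. Batch sizes and simulation counts are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(n_looks: int, z_crit: float,
                        n_per_look: int = 500, n_sims: int = 2_000) -> float:
    """Simulate A/A tests analyzed after each interim batch; count how often
    ANY look crosses the critical value even though the true effect is zero."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(size=(n_looks, n_per_look))
        b = rng.normal(size=(n_looks, n_per_look))
        for k in range(1, n_looks + 1):
            t, _ = stats.ttest_ind(a[:k].ravel(), b[:k].ravel())
            if abs(t) > z_crit:
                hits += 1
                break
    return hits / n_sims

print(false_positive_rate(n_looks=5, z_crit=1.96))   # ~0.14, not the promised 0.05
print(false_positive_rate(n_looks=5, z_crit=2.41))   # back near 0.05 with a per-look correction
```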
When is sequential testing safe?
When your platform supports it. Sequential methods (mSPRT, group sequential designs, always-valid p-values) are designed for peeking. According to Spotify Engineering (2023), sequential testing adjusts significance thresholds continuously, so daily dashboard checks do not inflate Type I error. GrowthBook, Statsig, and Eppo all ship sequential testing as a built-in option. Use it when stakeholders demand interim looks or when early stopping has real business value.
Should I use Bayesian or frequentist methods?
Both are statistically valid. Pick the one your stakeholders can interpret. According to CXL (2026), frequentist methods suit small MDEs (1-2%) where strict statistical control matters. Bayesian methods give probability statements ("92% chance B beats A") and tolerate peeking by design, suiting MDEs of 5%+ and business-facing reporting. Most modern platforms (Statsig, GrowthBook, Eppo, VWO) support both. The wrong move is mixing methods mid-test or switching after seeing results.
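As an illustration of what the Bayesian readout looks like, the sketch below computes P(B beats A) by sampling Beta posteriors over each arm's conversion rate under a flat Beta(1, 1) prior. The conversion counts are made up for the example; swap in your own.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative counts, not from the article: conversions / visitors per arm.
a_conv, a_n = 520, 10_000
b_conv, b_n = 565, 10_000

# Posterior over each arm's conversion rate under a Beta(1, 1) prior.
a_rate = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=200_000)
b_rate = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=200_000)

prob_b_beats_a = (b_rate > a_rate).mean()
expected_lift = (b_rate / a_rate - 1).mean()
print(f"P(B beats A) = {prob_b_beats_a:.1%}, expected relative lift = {expected_lift:.2%}")
```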
How do you validate experiment setup?
Validate plumbing before you trust outcomes. Sample ratio mismatch, broken bucketing, and metric drift turn winning experiments into expensive lies. According to Microsoft Research, about 6% of experiments at Microsoft and Booking experience SRM. Three setup-validation questions follow.
Should you run A/A tests?
Yes, when standing up new infrastructure or new metrics. According to AB Tasty (2026), A/A tests catch SRM, broken event tracking, and false-positive-prone metrics before they corrupt real experiments. Run them after platform migrations, new SDK integrations, or new metric definitions. Skip them on mature setups with proven plumbing. Critical caveat: do not peek at the A/A test. A premature "significant" result is expected statistical noise, not a setup failure; broken plumbing, not chance fluctuation, is what the A/A is designed to expose.
What is sample ratio mismatch (SRM)?
SRM is when your observed traffic split (e.g., 52/48) significantly diverges from your assigned split (e.g., 50/50). According to Microsoft Research, about 6% of experiments at Microsoft and Booking show SRM, and the affected results are not trustworthy. Detect it with a chi-squared test on assignment counts. Common causes: bot filtering applied unevenly, redirect-based variants, broken IDs, biased segmentation. Lukas Vermeer's SRM Checker is the standard validator. SRM-flagged experiments must be discarded, not interpreted.
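The detection itself is a one-liner with scipy. The sketch below runs the chi-squared test on illustrative assignment counts; flagging at p < 0.001 follows the convention popularized by the SRM literature and Lukas Vermeer's checker, and the counts are invented.

```python
from scipy import stats

intended_split = [0.5, 0.5]                 # the allocation you configured
observed = [52_000, 48_000]                 # users actually logged per arm (illustrative)

expected = [sum(observed) * share for share in intended_split]
chi2, p = stats.chisquare(observed, f_exp=expected)

# Common convention: flag SRM when p < 0.001 and discard the experiment.
print(f"chi2 = {chi2:.1f}, p = {p:.2e}, SRM flagged: {p < 0.001}")
```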
What if my variants leak across groups?
Variant leakage (interference, spillover) violates SUTVA: one user's treatment affects another user's outcome. According to LinkedIn Engineering (2019), social-graph features routinely leak. Fixes: cluster randomization (assign whole networks, geos, or households together), switchback testing for marketplaces, ego-network designs for social products. Netflix often rolls features by region. To detect leakage, run two parallel tests, one user-level and one cluster-level, and compare effect sizes. Diverging estimates mean leakage is biasing the user-level result.
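Mechanically, cluster randomization just means hashing the cluster identifier instead of the user identifier, so every member of a geo, household, or graph component lands in the same arm. A minimal sketch, with a made-up experiment name and a 50/50 split:

```python
import hashlib

def assign_cluster(cluster_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign an entire cluster (geo, household, social-graph component)
    to one arm so that connected users always see the same variant."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if bucket < treatment_share else "control"

print(assign_cluster("geo:austin-tx", "feed_ranking_v3"))
```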
How do you increase A/B test power without more traffic?
Variance reduction. Techniques like CUPED compress confidence intervals so you can detect smaller effects with the same sample size. According to Microsoft's Experimentation Platform team, CUPED reduced experiment runtimes from 8 weeks to 5-6 weeks on Bing and Office. Three power-boosting questions follow.
What is CUPED in experimentation?
CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance-reduction technique published by Alex Deng and team at Microsoft (2013). It uses pre-experiment user behavior as a covariate to strip predictable variance from experiment metrics. The result: 10-50% variance reduction depending on covariate quality, which translates directly to faster experiments or smaller detectable MDEs. Most modern platforms (Statsig, GrowthBook, Eppo, Optimizely, Amplitude) ship CUPED as a one-toggle feature.
How do you analyze a CUPED test?
Five steps. (1) Collect 1-2 weeks of pre-experiment data per user before the test starts. (2) Pick a covariate X highly correlated with target metric Y but unaffected by treatment (usually the same metric in the pre-period). (3) Calculate theta: θ = Cov(Y,X) / Var(X). (4) Compute the adjusted metric: Y_cuped = Y - θ(X - mean(X)). (5) Run a standard t-test on Y_cuped. According to Statsig's CUPED docs, variance reduction equals roughly the squared correlation between X and Y. New users without pre-period data fall back to the unadjusted metric.
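Here is the whole procedure on synthetic data, assuming the covariate is the same metric measured in the pre-period. The data and effect size are invented for illustration, but the five steps map one-to-one onto the code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Steps 1-2: a pre-experiment covariate X that predicts the in-experiment metric Y.
n = 20_000
x = rng.gamma(2.0, 5.0, size=2 * n)                       # pre-period engagement per user
y = 0.8 * x + rng.normal(0, 4, size=2 * n)                # in-experiment metric
treated = np.zeros(2 * n, dtype=bool)
treated[n:] = True
y[treated] += 0.5                                         # small true treatment effect

# Step 3: theta = Cov(Y, X) / Var(X), computed on pooled data.
theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)

# Step 4: the adjusted metric Y_cuped = Y - theta * (X - mean(X)).
y_cuped = y - theta * (x - x.mean())

# Step 5: a standard t-test on the adjusted metric; same point estimate, tighter variance.
print(stats.ttest_ind(y[treated], y[~treated]).pvalue)
print(stats.ttest_ind(y_cuped[treated], y_cuped[~treated]).pvalue)
```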
When should you use switchback experiments?
When user-level randomization is invalid because of network effects, marketplaces, or shared resources. According to Bojinov & Simchi-Levi, HBS (2021), switchback alternates 100% of traffic between treatment and control over time windows. DoorDash, Uber, Lyft, and Airbnb use switchback for pricing, dispatch, and matching algorithms. Choose a switching interval long enough for the effect to manifest but short enough to generate many control/treatment periods, typically 1-6 hours for marketplaces.
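Assignment in a switchback is by time window rather than by user. A minimal sketch, with a made-up experiment name and a 2-hour interval; the analysis then compares window-level averages rather than user-level ones.

```python
import hashlib
from datetime import datetime, timezone

def switchback_arm(ts: datetime, experiment: str, interval_hours: int = 2) -> str:
    """Assign the ENTIRE audience in a time window to one arm; the window,
    not the user, is the unit of randomization and of analysis."""
    window = int(ts.timestamp() // (interval_hours * 3600))
    digest = hashlib.sha256(f"{experiment}:{window}".encode()).hexdigest()
    return "treatment" if int(digest[:8], 16) % 2 == 0 else "control"

print(switchback_arm(datetime(2024, 6, 1, 14, 30, tzinfo=timezone.utc), "dispatch_algo_v2"))
```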
How do you handle multiple metrics and multiple tests?
Correct for multiplicity, set guardrails, and pre-register your primary metric. According to Spotify's Confidence team (2024), running 20 metrics at α=0.05 yields a 64% chance of at least one false positive. Three sub-questions cover correction methods, guardrails, and concurrent test management.
How do you handle the multiple testing problem?
Apply a multiple-comparisons correction, and pre-register a primary metric. Bonferroni divides α by the number of tests (0.05 / 20 = 0.0025). It is conservative but easy to explain. Holm-Bonferroni and Šidák are uniformly more powerful. For dozens or hundreds of metrics, the Benjamini-Hochberg FDR procedure controls false discovery rate instead of family-wise error rate, sacrificing strictness for power. The cheapest fix: pick one primary metric before the test starts and treat everything else as secondary.
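All three corrections are one call in statsmodels. The p-values below are invented to show how the methods differ in strictness on the same inputs.

```python
from statsmodels.stats.multitest import multipletests

# Illustrative p-values from eight secondary metrics in one experiment.
pvals = [0.001, 0.004, 0.012, 0.030, 0.041, 0.048, 0.210, 0.650]

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:>10}: {reject.sum()} of {len(pvals)} metrics still significant")
```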
What are guardrail metrics?
Guardrails are metrics you do not want to harm, even if your primary metric improves. According to Spotify Engineering (2024), Spotify uses music-minutes-played as a guardrail when testing podcast features, ensuring podcast wins do not cannibalize music. Common guardrails: page load time, error rate, retention, support ticket volume, revenue per user. Set a non-inferiority threshold (e.g., "no more than 1% degradation") and check guardrails before declaring a winner. Airbnb and Netflix make guardrail checks non-negotiable shipping criteria.
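A guardrail check is a one-sided non-inferiority test, not a superiority test. A minimal sketch for a "higher is better" metric, assuming a 1% relative margin and per-user observations; if your platform ships a built-in guardrail feature, that is the better default.

```python
import numpy as np
from scipy import stats

def guardrail_ok(control: np.ndarray, treatment: np.ndarray,
                 rel_margin: float = 0.01, alpha: float = 0.05) -> bool:
    """Pass only if we can rule out a relative degradation worse than rel_margin
    (e.g. 'no more than 1% worse') for a higher-is-better guardrail metric."""
    floor = -rel_margin * control.mean()                  # the margin in absolute terms
    diff = treatment.mean() - control.mean()
    se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
                 control.var(ddof=1) / len(control))
    z = (diff - floor) / se                               # H0: diff <= floor vs H1: diff > floor
    return z > stats.norm.ppf(1 - alpha)
```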
Can you run multiple A/B tests simultaneously?
Yes, and most mature programs do. According to Microsoft Research, serious interaction effects between concurrent experiments are rare in practice. Standard approach: randomize independently across tests so users land in random combinations, treating each test in isolation. Audit for interactions only when tests touch the same surface area or pixel. For deeply coupled changes, use mutually exclusive cohorts (isolated traffic) or run them as a multivariate test (MVT) so interactions are explicitly modeled.
How do you decide which experiments win?
Statistical significance is necessary but not sufficient. Effect size, business impact, novelty decay, and the winner's curse all matter. Spotify reports a learning rate of ~64%, far higher than its win rate, because learning what does not work is often the highest-value outcome. Two final questions on decision-making.
What if a result is statistically significant but practically meaningless?
Do not ship it just because the p-value passed. According to Statsig (2026), statistical significance only means the effect is non-zero, not that it is worth implementing. A 1.5% conversion lift can be statistically significant but lose against an 18-month engineering payback period. Set a minimum effect of interest (MEI) before running the test, and discount for the winner's curse: chosen winners overstate their true effect by ~10-30% because you selected on the noisy upper tail.
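One way to encode that discipline is a shipping gate that checks the lower confidence bound against the MEI, with a haircut for the winner's curse. The 20% discount below is a placeholder within the 10-30% range above, not a number from Statsig.

```python
from scipy import stats

def worth_shipping(lift: float, se: float, mei: float,
                   alpha: float = 0.05, winners_curse_discount: float = 0.2) -> bool:
    """Practical-significance gate: the discounted lower confidence bound on the lift
    must clear the minimum effect of interest, not just zero."""
    lower = lift - stats.norm.ppf(1 - alpha) * se         # one-sided lower bound on the lift
    return lower * (1 - winners_curse_discount) >= mei

print(worth_shipping(lift=0.015, se=0.004, mei=0.02))     # stat-sig 1.5% lift, below a 2% MEI: don't ship
```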
How do you account for novelty effect?
Run longer, segment by user tenure, and track how the effect moves over time. According to the arXiv paper "Novelty and Primacy" (Hohnhold et al., 2021), novelty effects often dissipate within 2-4 weeks. Compare new-user vs. returning-user cohorts: if returning users show a fading lift while new users hold steady, novelty is your culprit. For redesigns, also check primacy effect (existing users stumble on changes), which biases against the variant. Both effects argue for tests of at least three weeks.
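The segmentation is a plain groupby once your exposure log is in a dataframe. The column names below (tenure, week, variant, metric) are assumptions about your schema, not a standard.

```python
import pandas as pd

def lift_by_tenure_and_week(df: pd.DataFrame) -> pd.DataFrame:
    """Expects one row per user-week with columns: tenure ('new'/'returning'),
    week (int), variant ('control'/'treatment'), metric (per-user value).
    A lift that fades week over week for returning users but holds for new
    users points to a novelty effect; the reverse pattern suggests primacy."""
    means = (df.groupby(["tenure", "week", "variant"])["metric"]
               .mean()
               .unstack("variant"))
    means["relative_lift"] = means["treatment"] / means["control"] - 1
    return means[["relative_lift"]]
```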
| Question | Short answer | Primary source |
|---|---|---|
| How long should A/B tests run? | 2-8 weeks, sample size driven by MDE | Evan Miller |
| Can you peek? | Only with sequential or Bayesian methods | Johari et al. SIGKDD 2017 |
| MDE too small? | Cap at 2-5% relative or change methods | Convert.com 2026 |
| Low traffic fix? | Bolder MDE, CUPED, switchback | VWO / Statsig |
| Variant leakage? | Cluster randomize or switchback | LinkedIn Engineering |
| A/A tests? | Yes, on new infra and metrics | AB Tasty 2026 |
| What is SRM? | Traffic split deviation; ~6% of tests have it | Microsoft Research |
| What is CUPED? | Pre-experiment data variance reduction; 10-50% gain | Deng et al. Microsoft 2013 |
| Switchback when? | Marketplaces, network effects, shared resources | Bojinov & Simchi-Levi HBS |
| Bayesian or frequentist? | Both valid; frequentist for small MDE, Bayesian for stakeholder reporting | CXL 2026 |
| Multiple testing? | Bonferroni / Holm / BH-FDR; pre-register primary | Spotify Confidence |
| Guardrails? | Metrics you must not harm; non-inferiority threshold | Spotify Engineering |
| Concurrent tests? | Yes, interactions are rare in practice | Microsoft Research |
| Stat sig but tiny? | Do not ship; set MEI before test | Statsig 2026 |
| Novelty effect? | Run 3+ weeks; segment new vs returning | Hohnhold et al. 2021 |
| Winner's curse? | Discount selected winners by 10-30% | Statsig 2026 |