If you run growth experiments, you have hit these 17 questions. They show up in r/datascience threads, in growth Slacks at 11pm, and on every PM's whiteboard the day before a launch decision. This guide gives each one a declarative, source-backed answer, the kind you can paste into a doc and ship a decision from. No hedging, no "it depends" without specifics. Inline citations point to Evan Miller, Microsoft's Experimentation Platform, Spotify Engineering, and the original CUPED paper.
How long should an A/B test run?
Two weeks minimum, six to eight weeks maximum. Sample size is set by your baseline conversion rate and your minimum detectable effect (MDE). Calendar duration is set by traffic volume plus business cycles. Run for at least one full weekly cycle even if you hit sample size sooner. Three sub-questions cover the edges of this rule.
What's the minimum runtime for an A/B test?
Two weeks, even if you hit your calculated sample size sooner. Evan Miller's sample size calculator tells you how many participants you need, not how many days. Two weeks captures one full weekday/weekend cycle and avoids day-of-week bias. The 6-8 week cap exists because longer tests suffer from cookie deletion, browser updates, and seasonal contamination. According to HubSpot (2026), 2-6 weeks is the practical sweet spot for most B2C and SaaS surfaces.
What if my MDE is too small?
A small MDE blows up your sample size and your timeline. According to Convert.com (2026), detecting a 1% relative lift on a 5% baseline conversion rate can require ~5 million total visitors and ~10 years; detecting a 10% lift takes ~50,000 visitors and 5 weeks. Most experimentation experts cap relative MDE at 2-5%. If your traffic envelope cannot power a 5% MDE in 6 weeks, you are not running that test. Either pick a bolder treatment or change methods (CUPED, switchback, segment-level rollout).
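You can sanity-check those traffic numbers yourself. The sketch below uses statsmodels' power calculator for a two-proportion test, assuming 80% power and a two-sided α of 0.05 (conventional defaults, not figures from Convert.com); the exact totals come out a little different from the cited ones, but the order of magnitude, millions of visitors for a 1% relative MDE versus tens of thousands for 10%, is the same.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def visitors_needed(baseline: float, relative_mde: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Total visitors (both arms) needed to detect a relative lift over a baseline rate."""
    effect = abs(proportion_effectsize(baseline, baseline * (1 + relative_mde)))
    per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                           power=power, ratio=1.0)
    return int(round(2 * per_arm))

print(visitors_needed(0.05, 0.01))   # ~6 million visitors for a 1% relative lift on a 5% baseline
print(visitors_needed(0.05, 0.10))   # ~60 thousand visitors for a 10% relative lift
```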
How do you run experiments with low traffic?
Three options. First, raise your MDE and test bolder changes (full redesigns, not button colors). According to VWO (2026), bold changes are the only way to get statistical significance in low-traffic environments. Second, pool traffic across URL patterns or page templates. Third, use variance reduction (CUPED) or switch to switchback experiments where the entire audience alternates between treatment and control over time windows. DoorDash, Uber, and Airbnb use switchback for marketplace tests.
Can you peek at A/B test results?
Not with classical fixed-horizon frequentist tests. Peeking inflates your false positive rate every time you check. The fix is to use a method designed for peeking (sequential or Bayesian) or commit to a pre-registered sample size and check only at the end. Three sub-questions clarify when peeking is safe and when it ruins the test.
Why does peeking break A/B tests?
Each look is another roll of the dice. Standard t-tests assume one analysis at a fixed sample size. According to Johari et al., SIGKDD 2017, repeated significance testing inflates Type I error: with 5 looks at α=0.05, the actual false positive rate climbs above 14%; with 20+ looks, Evan Miller (2026) shows it can reach 30-40%. The math does not care about your intent. If you check, the rate inflates.
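A quick simulation makes the inflation concrete. The sketch below runs A/A experiments (true effect zero), peeks after each of five interim batches, and counts how often any look crosses the threshold: with the naive 1.96 cutoff the false positive rate lands near 14%, while a stricter per-look cutoff of roughly 2.41 (the Pocock-style boundary for five looks, previewing the sequential methods in the next answer) pulls it back toward 5%. Batch sizes and simulation counts are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(n_looks: int, z_crit: float,
                        n_per_look: int = 500, n_sims: int = 2_000) -> float:
    """Simulate A/A tests analyzed after each interim batch; count how often
    ANY look crosses the critical value even though the true effect is zero."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(size=(n_looks, n_per_look))
        b = rng.normal(size=(n_looks, n_per_look))
        for k in range(1, n_looks + 1):
            t, _ = stats.ttest_ind(a[:k].ravel(), b[:k].ravel())
            if abs(t) > z_crit:
                hits += 1
                break
    return hits / n_sims

print(false_positive_rate(n_looks=5, z_crit=1.96))   # ~0.14, not the promised 0.05
print(false_positive_rate(n_looks=5, z_crit=2.41))   # back near 0.05 with a per-look correction
```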
When is sequential testing safe?
When your platform supports it. Sequential methods (mSPRT, group sequential designs, always-valid p-values) are designed for peeking. According to Spotify Engineering (2023), sequential testing adjusts significance thresholds continuously, so daily dashboard checks do not inflate Type I error. GrowthBook, Statsig, and Eppo all ship sequential testing as a built-in option. Use it when stakeholders demand interim looks or when early stopping has real business value.
Should I use Bayesian or frequentist methods?
Both are statistically valid. Pick the one your stakeholders can interpret. According to CXL (2026), frequentist methods suit small MDEs (1-2%) where strict statistical control matters. Bayesian methods give probability statements ("92% chance B beats A") and tolerate peeking by design, suiting MDEs of 5%+ and business-facing reporting. Most modern platforms (Statsig, GrowthBook, Eppo, VWO) support both. The wrong move is mixing methods mid-test or switching after seeing results.
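As an illustration of what the Bayesian readout looks like, the sketch below computes P(B beats A) by sampling Beta posteriors over each arm's conversion rate under a flat Beta(1, 1) prior. The conversion counts are made up for the example; swap in your own.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative counts, not from the article: conversions / visitors per arm.
a_conv, a_n = 520, 10_000
b_conv, b_n = 565, 10_000

# Posterior over each arm's conversion rate under a Beta(1, 1) prior.
a_rate = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=200_000)
b_rate = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=200_000)

prob_b_beats_a = (b_rate > a_rate).mean()
expected_lift = (b_rate / a_rate - 1).mean()
print(f"P(B beats A) = {prob_b_beats_a:.1%}, expected relative lift = {expected_lift:.2%}")
```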
How do you validate experiment setup?
Validate plumbing before you trust outcomes. Sample ratio mismatch, broken bucketing, and metric drift turn winning experiments into expensive lies. According to Microsoft Research, about 6% of experiments at Microsoft and Booking experience SRM. Three setup-validation questions follow.
Should you run A/A tests?
Yes, when standing up new infrastructure or new metrics. According to AB Tasty (2026), A/A tests catch SRM, broken event tracking, and false-positive-prone metrics before they corrupt real experiments. Run them after platform migrations, new SDK integrations, or new metric definitions. Skip them on mature setups with proven plumbing. Critical caveat: do not peek at the A/A test. A premature "significant" result is expected statistical noise, not a setup failure; broken plumbing, not chance fluctuation, is what the A/A is designed to expose.
What is sample ratio mismatch (SRM)?
SRM is when your observed traffic split (e.g., 52/48) significantly diverges from your assigned split (e.g., 50/50). According to Microsoft Research, about 6% of experiments at Microsoft and Booking show SRM, and the affected results are not trustworthy. Detect it with a chi-squared test on assignment counts. Common causes: bot filtering applied unevenly, redirect-based variants, broken IDs, biased segmentation. Lukas Vermeer's SRM Checker is the standard validator. SRM-flagged experiments must be discarded, not interpreted.
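The detection itself is a one-liner with scipy. The sketch below runs the chi-squared test on illustrative assignment counts; flagging at p < 0.001 follows the convention popularized by the SRM literature and Lukas Vermeer's checker, and the counts are invented.

```python
from scipy import stats

intended_split = [0.5, 0.5]                 # the allocation you configured
observed = [52_000, 48_000]                 # users actually logged per arm (illustrative)

expected = [sum(observed) * share for share in intended_split]
chi2, p = stats.chisquare(observed, f_exp=expected)

# Common convention: flag SRM when p < 0.001 and discard the experiment.
print(f"chi2 = {chi2:.1f}, p = {p:.2e}, SRM flagged: {p < 0.001}")
```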
What if my variants leak across groups?
Variant leakage (interference, spillover) violates SUTVA: one user's treatment affects another user's outcome. According to LinkedIn Engineering (2019), social-graph features routinely leak. Fixes: cluster randomization (assign whole networks, geos, or households together), switchback testing for marketplaces, ego-network designs for social products. Netflix often rolls features by region. To detect leakage, run two parallel tests, one user-level and one cluster-level, and compare effect sizes. Diverging estimates mean leakage is biasing the user-level result.
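Mechanically, cluster randomization just means hashing the cluster identifier instead of the user identifier, so every member of a geo, household, or graph component lands in the same arm. A minimal sketch, with a made-up experiment name and a 50/50 split:

```python
import hashlib

def assign_cluster(cluster_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign an entire cluster (geo, household, social-graph component)
    to one arm so that connected users always see the same variant."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if bucket < treatment_share else "control"

print(assign_cluster("geo:austin-tx", "feed_ranking_v3"))
```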
How do you increase A/B test power without more traffic?
Variance reduction. Techniques like CUPED compress confidence intervals so you can detect smaller effects with the same sample size. According to Microsoft's Experimentation Platform team, CUPED reduced experiment runtimes from 8 weeks to 5-6 weeks on Bing and Office. Three power-boosting questions follow.
What is CUPED in experimentation?
CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance-reduction technique published by Alex Deng and team at Microsoft (2013). It uses pre-experiment user behavior as a covariate to strip predictable variance from experiment metrics. The result: 10-50% variance reduction depending on covariate quality, which translates directly to faster experiments or smaller detectable MDEs. Most modern platforms (Statsig, GrowthBook, Eppo, Optimizely, Amplitude) ship CUPED as a one-toggle feature.
How do you analyze a CUPED test?
Five steps. (1) Collect 1-2 weeks of pre-experiment data per user before the test starts. (2) Pick a covariate X highly correlated with target metric Y but unaffected by treatment (usually the same metric in the pre-period). (3) Calculate theta: θ = Cov(Y,X) / Var(X). (4) Compute the adjusted metric: Y_cuped = Y - θ(X - mean(X)). (5) Run a standard t-test on Y_cuped. According to Statsig's CUPED docs, variance reduction equals roughly the squared correlation between X and Y. New users without pre-period data fall back to the unadjusted metric.
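Here is the whole procedure on synthetic data, assuming the covariate is the same metric measured in the pre-period. The data and effect size are invented for illustration, but the five steps map one-to-one onto the code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Steps 1-2: a pre-experiment covariate X that predicts the in-experiment metric Y.
n = 20_000
x = rng.gamma(2.0, 5.0, size=2 * n)                       # pre-period engagement per user
y = 0.8 * x + rng.normal(0, 4, size=2 * n)                # in-experiment metric
treated = np.zeros(2 * n, dtype=bool)
treated[n:] = True
y[treated] += 0.5                                         # small true treatment effect

# Step 3: theta = Cov(Y, X) / Var(X), computed on pooled data.
theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)

# Step 4: the adjusted metric Y_cuped = Y - theta * (X - mean(X)).
y_cuped = y - theta * (x - x.mean())

# Step 5: a standard t-test on the adjusted metric; same point estimate, tighter variance.
print(stats.ttest_ind(y[treated], y[~treated]).pvalue)
print(stats.ttest_ind(y_cuped[treated], y_cuped[~treated]).pvalue)
```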
When should you use switchback experiments?
When user-level randomization is invalid because of network effects, marketplaces, or shared resources. According to Bojinov & Simchi-Levi, HBS (2021), switchback alternates 100% of traffic between treatment and control over time windows. DoorDash, Uber, Lyft, and Airbnb use switchback for pricing, dispatch, and matching algorithms. Choose a switching interval long enough for the effect to manifest but short enough to generate many control/treatment periods, typically 1-6 hours for marketplaces.
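Assignment in a switchback is by time window rather than by user. A minimal sketch, with a made-up experiment name and a 2-hour interval; the analysis then compares window-level averages rather than user-level ones.

```python
import hashlib
from datetime import datetime, timezone

def switchback_arm(ts: datetime, experiment: str, interval_hours: int = 2) -> str:
    """Assign the ENTIRE audience in a time window to one arm; the window,
    not the user, is the unit of randomization and of analysis."""
    window = int(ts.timestamp() // (interval_hours * 3600))
    digest = hashlib.sha256(f"{experiment}:{window}".encode()).hexdigest()
    return "treatment" if int(digest[:8], 16) % 2 == 0 else "control"

print(switchback_arm(datetime(2024, 6, 1, 14, 30, tzinfo=timezone.utc), "dispatch_algo_v2"))
```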
How do you handle multiple metrics and multiple tests?
Correct for multiplicity, set guardrails, and pre-register your primary metric. According to Spotify's Confidence team (2024), running 20 metrics at α=0.05 yields a 64% chance of at least one false positive. Three sub-questions cover correction methods, guardrails, and concurrent test management.
How do you handle the multiple testing problem?
Apply a multiple-comparisons correction, and pre-register a primary metric. Bonferroni divides α by the number of tests (0.05 / 20 = 0.0025). It is conservative but easy to explain. Holm-Bonferroni and Šidák are uniformly more powerful. For dozens or hundreds of metrics, the Benjamini-Hochberg FDR procedure controls false discovery rate instead of family-wise error rate, sacrificing strictness for power. The cheapest fix: pick one primary metric before the test starts and treat everything else as secondary.
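All three corrections are one call in statsmodels. The p-values below are invented to show how the methods differ in strictness on the same inputs.

```python
from statsmodels.stats.multitest import multipletests

# Illustrative p-values from eight secondary metrics in one experiment.
pvals = [0.001, 0.004, 0.012, 0.030, 0.041, 0.048, 0.210, 0.650]

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:>10}: {reject.sum()} of {len(pvals)} metrics still significant")
```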
What are guardrail metrics?
Guardrails are metrics you do not want to harm, even if your primary metric improves. According to Spotify Engineering (2024), Spotify uses music-minutes-played as a guardrail when testing podcast features, ensuring podcast wins do not cannibalize music. Common guardrails: page load time, error rate, retention, support ticket volume, revenue per user. Set a non-inferiority threshold (e.g., "no more than 1% degradation") and check guardrails before declaring a winner. Airbnb and Netflix make guardrail checks non-negotiable shipping criteria.
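A guardrail check is a one-sided non-inferiority test, not a superiority test. A minimal sketch for a "higher is better" metric, assuming a 1% relative margin and per-user observations; if your platform ships a built-in guardrail feature, that is the better default.

```python
import numpy as np
from scipy import stats

def guardrail_ok(control: np.ndarray, treatment: np.ndarray,
                 rel_margin: float = 0.01, alpha: float = 0.05) -> bool:
    """Pass only if we can rule out a relative degradation worse than rel_margin
    (e.g. 'no more than 1% worse') for a higher-is-better guardrail metric."""
    floor = -rel_margin * control.mean()                  # the margin in absolute terms
    diff = treatment.mean() - control.mean()
    se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
                 control.var(ddof=1) / len(control))
    z = (diff - floor) / se                               # H0: diff <= floor vs H1: diff > floor
    return z > stats.norm.ppf(1 - alpha)
```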
Can you run multiple A/B tests simultaneously?
Yes, and most mature programs do. According to Microsoft Research, serious interaction effects between concurrent experiments are rare in practice. Standard approach: randomize independently across tests so users land in random combinations, treating each test in isolation. Audit for interactions only when tests touch the same surface area or pixel. For deeply coupled changes, use mutually exclusive cohorts (isolated traffic) or run them as a multivariate test (MVT) so interactions are explicitly modeled.
How do you decide which experiments win?
Statistical significance is necessary but not sufficient. Effect size, business impact, novelty decay, and the winner's curse all matter. Spotify reports a learning rate of ~64%, far higher than its win rate, because learning what does not work is often the highest-value outcome. Two final questions on decision-making.
What if a result is statistically significant but practically meaningless?
Do not ship it just because the p-value passed. According to Statsig (2026), statistical significance only means the effect is non-zero, not that it is worth implementing. A 1.5% conversion lift can be statistically significant but lose against an 18-month engineering payback period. Set a minimum effect of interest (MEI) before running the test, and discount for the winner's curse: chosen winners overstate their true effect by ~10-30% because you selected on the noisy upper tail.
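One way to encode that discipline is a shipping gate that checks the lower confidence bound against the MEI, with a haircut for the winner's curse. The 20% discount below is a placeholder within the 10-30% range above, not a number from Statsig.

```python
from scipy import stats

def worth_shipping(lift: float, se: float, mei: float,
                   alpha: float = 0.05, winners_curse_discount: float = 0.2) -> bool:
    """Practical-significance gate: the discounted lower confidence bound on the lift
    must clear the minimum effect of interest, not just zero."""
    lower = lift - stats.norm.ppf(1 - alpha) * se         # one-sided lower bound on the lift
    return lower * (1 - winners_curse_discount) >= mei

print(worth_shipping(lift=0.015, se=0.004, mei=0.02))     # stat-sig 1.5% lift, below a 2% MEI: don't ship
```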
How do you account for novelty effect?
Run longer, segment by user tenure, and track how the effect moves over time. According to the arXiv paper "Novelty and Primacy" (Hohnhold et al., 2021), novelty effects often dissipate within 2-4 weeks. Compare new-user vs. returning-user cohorts: if returning users show a fading lift while new users hold steady, novelty is your culprit. For redesigns, also check primacy effect (existing users stumble on changes), which biases against the variant. Both effects argue for tests of at least three weeks.
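The segmentation is a plain groupby once your exposure log is in a dataframe. The column names below (tenure, week, variant, metric) are assumptions about your schema, not a standard.

```python
import pandas as pd

def lift_by_tenure_and_week(df: pd.DataFrame) -> pd.DataFrame:
    """Expects one row per user-week with columns: tenure ('new'/'returning'),
    week (int), variant ('control'/'treatment'), metric (per-user value).
    A lift that fades week over week for returning users but holds for new
    users points to a novelty effect; the reverse pattern suggests primacy."""
    means = (df.groupby(["tenure", "week", "variant"])["metric"]
               .mean()
               .unstack("variant"))
    means["relative_lift"] = means["treatment"] / means["control"] - 1
    return means[["relative_lift"]]
```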
| Question | Short answer | Primary source |
|---|---|---|
| How long should A/B tests run? | 2-8 weeks, sample size driven by MDE | Evan Miller |
| Can you peek? | Only with sequential or Bayesian methods | Johari et al. SIGKDD 2017 |
| MDE too small? | Cap at 2-5% relative or change methods | Convert.com 2026 |
| Low traffic fix? | Bolder MDE, CUPED, switchback | VWO / Statsig |
| Variant leakage? | Cluster randomize or switchback | LinkedIn Engineering |
| A/A tests? | Yes, on new infra and metrics | AB Tasty 2026 |
| What is SRM? | Traffic split deviation; ~6% of tests have it | Microsoft Research |
| What is CUPED? | Pre-experiment data variance reduction; 10-50% gain | Deng et al. Microsoft 2013 |
| Switchback when? | Marketplaces, network effects, shared resources | Bojinov & Simchi-Levi HBS |
| Bayesian or frequentist? | Both valid; frequentist for small MDE, Bayesian for stakeholder reporting | CXL 2026 |
| Multiple testing? | Bonferroni / Holm / BH-FDR; pre-register primary | Spotify Confidence |
| Guardrails? | Metrics you must not harm; non-inferiority threshold | Spotify Engineering |
| Concurrent tests? | Yes, interactions are rare in practice | Microsoft Research |
| Stat sig but tiny? | Do not ship; set MEI before test | Statsig 2026 |
| Novelty effect? | Run 3+ weeks; segment new vs returning | Hohnhold et al. 2021 |
| Winner's curse? | Discount selected winners by 10-30% | Statsig 2026 |