
    Why Most A/B Tests Are Lying to You

By ProfitlyAI · March 11, 2026 · 15 min read


Thursday, 3 PM. A product manager at a Series B SaaS company opens her A/B testing dashboard for the fourth time that day, a half-drunk cold brew beside her laptop. The screen reads: Variant B, +8.3% conversion lift, 96% statistical significance.

She screenshots the result. Posts it in the #product-wins Slack channel with a celebration emoji. The head of engineering replies with a thumbs-up and starts planning the rollout sprint.

Here’s what the dashboard didn’t show her: if she had waited three more days (the originally planned test duration), that significance would have dropped to 74%. The +8.3% lift would have shrunk to +1.2%. Below the noise floor. Not real.

If you’ve ever stopped a test early because it “hit significance,” you’ve probably shipped a version of this mistake. You’re in good company. At Google and Bing, only 10% to 20% of controlled experiments generate positive results, according to Ronny Kohavi’s research published in the Harvard Business Review. At Microsoft broadly, one-third of experiments prove effective, one-third are neutral, and one-third actively hurt the metrics they were meant to improve. Most ideas don’t work. The experiments that “prove” they do are often telling you what you want to hear.

If your A/B testing tool lets you peek at results daily and stop whenever the confidence bar turns green, it’s not a testing tool. It’s a random number generator with a nicer UI.

The four statistical sins below account for the majority of unreliable A/B test results. Each takes less than fifteen minutes to fix. By the end of this article, you’ll have a five-item pre-test checklist and a decision framework for choosing between frequentist, Bayesian, and sequential testing that you can apply to your next experiment Monday morning.


The Peeking Problem: 26% of Your Winners Aren’t Real

Every time you check your A/B test results before the planned end date, you’re running a new statistical test. Not metaphorically. Literally.

Frequentist significance tests are designed for a single analysis at a pre-determined sample size. When you check results after 100 visitors, then 200, then 500, then 1,000, you’re not running one test. You’re running four. Every look gives noise another chance to masquerade as signal.

Evan Miller quantified this in his widely cited analysis “How Not to Run an A/B Test.” If you check results after every batch of new data and stop the moment you see p < 0.05, the actual false positive rate isn’t 5%.

    It’s 26.1%.

One in four “winners” is pure noise.

The mechanics are simple. A significance test controls the false positive rate at 5% for a single analysis point. Multiple checks create multiple opportunities for random fluctuations to cross the significance threshold. As Miller puts it: “If you peek at an ongoing experiment ten times, then what you think is 1% significance is really just 5%.”
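To see the inflation concretely, here is a minimal Monte Carlo sketch (not Miller’s exact setup): an A/A test where both variants convert at the same rate, checked with a two-proportion z-test after every batch, and stopped the moment p < 0.05. The conversion rate, batch size, and number of looks are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_sims=2000, batch=100, n_batches=10, rate=0.05):
    """A/A test: no real difference exists, yet we stop as soon as p < 0.05."""
    false_positives = 0
    for _ in range(n_sims):
        a = rng.binomial(1, rate, size=batch * n_batches)
        b = rng.binomial(1, rate, size=batch * n_batches)
        for look in range(1, n_batches + 1):
            n = look * batch
            ca, cb = a[:n].sum(), b[:n].sum()
            pooled = (ca + cb) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * 2 / n)
            if se == 0:
                continue  # no conversions yet in either arm; nothing to test
            z = (cb - ca) / n / se
            if 2 * (1 - norm.cdf(abs(z))) < 0.05:
                false_positives += 1
                break  # the team "calls it" the moment the dashboard turns green
    return false_positives / n_sims

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

With only ten looks the rate already lands several times above 5%; add more looks and it keeps climbing toward the 26% figure.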

Checking results repeatedly and stopping at significance inflates your false positive rate by more than 5x. Image by the author.

This is the most common sin in A/B testing, and the most expensive. Teams make product decisions, allocate engineering resources, and report revenue impact to leadership based on results that had a one-in-four chance of being imaginary.

The fix is simple but unpopular: calculate your required sample size before you start, and don’t look at the results until you hit it. If that discipline feels painful (and for most teams, it does), sequential testing offers a middle path. More on that in the framework below.

Check your test results after every batch of visitors, and you’ll “find” a winner 26% of the time. Even when there isn’t one.


The Power Vacuum: Small Samples, Inflated Effects

Peeking creates false winners. The second sin makes real winners look bigger than they are.

Statistical power is the probability that your test will detect a real effect when one exists. The standard target is 80%, meaning a 20% chance you’ll miss a real effect even when it’s there. To hit 80% power, you need a specific sample size, and that number depends on three things: your baseline conversion rate, the smallest effect you want to detect, and your significance threshold.

Most teams skip the power calculation. They run the test “until it’s significant” or “for two weeks,” whichever comes first. This creates a phenomenon known as the winner’s curse.

Here’s how it works. In an underpowered test, the random variation in your data is large relative to the real effect. The only way a real-but-small effect reaches statistical significance in a small sample is if random noise pushes the measured effect far above its true value. So the very act of reaching significance in an underpowered test guarantees that your estimated effect is inflated.
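A quick simulation makes the winner’s curse visible. The baseline rate, true lift, and per-arm sample size below are hypothetical; the point is that when a test is badly underpowered, the runs that happen to reach significance systematically overstate the true effect.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical setup: 3.2% baseline, a true +0.5pp lift, but only 2,000 users per
# arm -- far below what a power analysis would call for.
base, true_lift, n = 0.032, 0.005, 2000
significant_lifts = []

for _ in range(20000):
    a = rng.binomial(n, base) / n              # observed rate, control
    b = rng.binomial(n, base + true_lift) / n  # observed rate, variant
    pooled = (a + b) / 2
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (b - a) / se
    if 2 * (1 - norm.cdf(abs(z))) < 0.05:      # keep only the runs that "hit significance"
        significant_lifts.append(b - a)

print(f"true lift: {true_lift:.4f}")
print(f"average observed lift among significant runs: {np.mean(significant_lifts):.4f}")
# The conditional average is well above the true lift -- the winner's curse in action.
```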

When small samples produce significant results, the observed effect is typically inflated well above the true value. Image by the author.

A team might celebrate a +8% conversion lift, ship the change, and then watch the actual number settle at +2% over the following quarter. The test wasn’t wrong exactly (there was a real effect), but the team based their revenue projections on an inflated number. An artifact of insufficient sample size.

An underpowered test that reaches significance doesn’t find the truth. It finds an exaggeration of the truth.

The fix: run a power analysis before every test. Set your Minimum Detectable Effect (MDE) at the smallest change that would justify the engineering and product effort to ship. Calculate the sample size needed at 80% power. Then run the test until you reach that number. No early exits.
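For illustration, a plain normal-approximation power calculation could look like the sketch below. The baseline and MDE mirror the worked example later in the article; this simple formula lands around 21,000 per arm, in the same ballpark as (but not identical to) the ~25,000 figure quoted later, since calculators differ in their corrections and conventions.

```python
from scipy.stats import norm

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test (plain normal approximation)."""
    p_var = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return int((z_alpha + z_beta) ** 2 * variance / mde ** 2) + 1

# Assumed figures matching the worked example below: 3.2% baseline, 0.5pp MDE.
print(sample_size_per_arm(0.032, 0.005))  # ~21,000 here; tools with extra corrections land higher
```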


The Multiple Comparisons Trap

The third sin scales with ambition. Your A/B test tracks conversion rate, average order value, bounce rate, time on page, and click-through rate on the call-to-action. Five metrics. Standard practice.

Here’s the problem. At a 5% significance level per metric, the probability of at least one false positive across all five isn’t 5%. It’s 22.6%.

The math: 1 − (1 − 0.05)^5 = 0.226.

Scale that to 20 metrics (common in analytics-heavy teams) and the probability hits 64.2%. You’re more likely to find noise that looks real than to avoid it entirely.

At 20 metrics and a standard 5% threshold, you have a nearly two-in-three chance of celebrating noise. Image by the author.

Test 20 metrics at a 5% threshold and you have a 64% chance of celebrating noise.

This is the multiple comparisons problem, and most practitioners know it exists in theory but don’t correct for it in practice. They declare one primary metric, then quietly celebrate when a secondary metric hits significance. Or they run the same test across four user segments and count a segment-level win as a real result.

Two corrections exist, and major platforms already support them. Benjamini-Hochberg controls the expected proportion of false discoveries among your significant results (less conservative, preserves more power). Holm-Bonferroni controls the probability of even one false positive (more conservative, appropriate when a single wrong call has serious consequences). Optimizely uses a tiered version of Benjamini-Hochberg. GrowthBook offers both.
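Both the inflation math and the corrections take only a few lines. The sketch below uses statsmodels’ multipletests with hypothetical p-values; Optimizely and GrowthBook have their own implementations, so treat this purely as an illustration of the two correction families.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Familywise false positive probability with k independent metrics at alpha = 0.05
for k in (5, 20):
    print(f"{k} metrics: {1 - (1 - 0.05) ** k:.3f}")  # 0.226 and 0.642

# Hypothetical raw p-values from five metrics in a single experiment
p_values = np.array([0.012, 0.034, 0.21, 0.049, 0.38])

for method in ("fdr_bh", "holm"):  # Benjamini-Hochberg and Holm-Bonferroni
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, reject, np.round(p_adjusted, 3))
```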

The fix: declare one primary metric before the test begins. Everything else is exploratory. If you must evaluate multiple metrics formally, apply a correction. If your platform doesn’t offer one, you need a different platform.


When “Significant” Doesn’t Mean Significant

The fourth sin is the quietest and possibly the most expensive. A test can be statistically significant and practically worthless at the same time.

Statistical significance answers exactly one question: “Is this result likely due to chance?” It says nothing about whether the difference is big enough to matter. A test with 2 million visitors can detect a 0.02 percentage point lift in conversion with high confidence. That lift is real. It’s also not worth a single sprint of engineering time to ship.

The gap between “real” and “worth acting on” is where practical significance lives. Most teams never define it.

Before any test, set a practical significance threshold: the minimum effect size that justifies implementation. This should reflect the engineering cost of shipping the change, the opportunity cost of the test’s runtime, and the downstream revenue impact. If a 0.5 percentage point lift translates to $200K in annual revenue and the change takes one sprint to build, that’s your threshold. Anything below it is a “true but useless” finding.

The fix: calculate your MDE before the test begins, not only for the power analysis (though it’s the same number), but as a decision gate. Even if a test reaches significance, if the measured effect falls below the MDE, you don’t ship. Write this number down. Get stakeholder agreement before launch.
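In code, the decision gate is almost trivially simple, which is part of the point. A minimal sketch with hypothetical readings:

```python
def ship_decision(observed_lift, p_value, mde, alpha=0.05):
    """Two gates: statistical significance AND practical significance (the MDE)."""
    if p_value >= alpha:
        return "no-ship: not statistically significant"
    if observed_lift < mde:
        return "no-ship: real, but below the practical threshold"
    return "ship"

# A huge-sample test with a tiny effect, and one that clears both gates
print(ship_decision(observed_lift=0.0002, p_value=0.001, mde=0.005))
print(ship_decision(observed_lift=0.0060, p_value=0.020, mde=0.005))
```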


The Bayesian Fix That Doesn’t Fix Anything

If you’ve read this far, a thought might be forming: “I’ll just switch to Bayesian A/B testing. It handles peeking. It gives me ‘probability of being best’ instead of confusing p-values. Problem solved.”

This is the most popular misconception in modern experimentation.

Bayesian A/B testing does solve one real problem: communication. Telling a VP “there’s a 94% probability that Variant B is better” is clearer than “we reject the null hypothesis at α = 0.05.” Business stakeholders understand the first statement intuitively. The second requires a statistics lecture.

But Bayesian testing doesn’t solve the peeking problem.

In October 2025, Alex Molas published a detailed simulation study showing that Bayesian A/B tests with fixed posterior thresholds suffer from the same false positive inflation when you peek and stop on success. Using a 95% “probability to beat control” as a stopping rule, checked after every 100 observations, produced false positive rates of 80%. Not 5%. Not 26%. Eighty percent.

David Robinson at Variance Explained reached a parallel conclusion: a fixed posterior threshold used as a stopping rule doesn’t control error rates in the way most practitioners assume. The posterior stays interpretable at any sample size. But interpretability is not the same as error control.
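A small simulation in the spirit of those studies (not a reproduction of either) shows the mechanism: an A/A test with Beta(1,1) priors, stopped the moment the posterior “probability B beats A” crosses 95%. The traffic numbers and look schedule are assumptions; the false-trigger rate still lands far above 5%.

```python
import numpy as np

rng = np.random.default_rng(7)

def bayesian_peeking_rate(n_sims=500, batch=100, n_batches=20, rate=0.05,
                          threshold=0.95, draws=2000):
    """A/A test with Beta(1,1) priors: stop when P(B > A) first exceeds the threshold."""
    stops = 0
    for _ in range(n_sims):
        a = rng.binomial(1, rate, size=batch * n_batches)
        b = rng.binomial(1, rate, size=batch * n_batches)
        for look in range(1, n_batches + 1):
            n = look * batch
            sa, sb = a[:n].sum(), b[:n].sum()
            post_a = rng.beta(1 + sa, 1 + n - sa, size=draws)
            post_b = rng.beta(1 + sb, 1 + n - sb, size=draws)
            if (post_b > post_a).mean() > threshold:
                stops += 1
                break  # declare B the winner and stop the test
    return stops / n_sims

print(bayesian_peeking_rate())  # far above the 5% a naive reading of "95% probability" implies
```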

None of this means Bayesian methods are useless. For low-stakes directional decisions (picking a blog headline, choosing an email subject line) where Type I error control isn’t critical, the intuitive probability framework is genuinely better. For high-stakes product decisions where you need reliable error guarantees, “just go Bayesian” is not an answer. It’s a costume change on the same problem.

Switching from frequentist to Bayesian doesn’t cure peeking. It just changes the number you’re misinterpreting.

The real solution isn’t a change in methodology. It’s a pre-test protocol that forces statistical discipline regardless of which framework you choose.


The Pre-Test Protocol

This is the section the rest of the article was building toward. Everything above established why you need it. Everything below shows what changes once you have it.

The Five-Point Pre-Test Checklist

Run through these five items before pressing “Start” on any A/B test. Each one is pass/fail. If any item fails, fix it before launching.

1. Sample size calculated. Set your MDE (the smallest effect worth shipping). Calculate the required sample size at 80% power and 5% significance using Evan Miller’s free calculator or your platform’s built-in tool. Example: Baseline conversion 3.2%, MDE 0.5 percentage points → ~25,000 per variant.
2. Runtime fixed and documented. Divide the required sample size by daily eligible traffic. Round up. Add a buffer for weekday/weekend variation (minimum 7 full days, even if the sample size is reached sooner). Write down the end date. Example: 8,300 eligible visitors/day, 50,000 total needed → 6 days minimum, rounded to 14 days to capture weekly cycles.
3. One primary metric declared. Write it down before the test begins. Secondary metrics are exploratory only. If you must evaluate multiple metrics formally, apply a Benjamini-Hochberg or Holm-Bonferroni correction. Example: “Primary: checkout conversion rate. Secondary (exploratory): average order value, cart abandonment rate.”
4. Practical significance threshold set. Define the minimum effect that justifies implementation. Agree on this with engineering and product stakeholders before launch. If the test reaches statistical significance but falls below this threshold, you don’t ship. Example: “Minimum +0.5 percentage points on conversion (worth ~$200K annually, justifies a 2-week sprint).”
5. Analysis method chosen. Pick one: frequentist, Bayesian, or sequential. Document why. Use the decision matrix below. Example: “Sequential testing. Two planned analyses at day 7 and day 14. Alpha spending via O’Brien-Fleming bounds.”
Image by the author.

Worked Example: Checkout Flow Test

A mid-market e-commerce team (500K monthly visitors) wants to test a new single-page checkout against their current multi-step flow. Here’s how they run the checklist:

1. MDE: 0.5 percentage points (from a 3.2% baseline to 3.7%). At 500K monthly visitors with a $65 average order value, a 0.5pp lift generates roughly $195K in incremental annual revenue. The new checkout costs about 2 weeks of engineering time (~$15K loaded). The ROI clears the bar.

2. Sample size: At 80% power and 5% significance, this requires ~25,000 per variant. 50,000 total.

3. Runtime: 250K monthly visitors reach checkout. That’s ~8,300/day. 50,000 total ÷ 8,300/day ≈ 6 days. Rounded to 14 days to capture weekday/weekend effects.

4. Primary metric: Checkout conversion rate. Average order value and cart abandonment tracked as exploratory (no correction needed since they won’t drive the ship/no-ship decision).

5. Method: Sequential testing. High traffic, and stakeholders want weekly progress updates. Two pre-planned analyses: day 7 and day 14. Alpha spending via O’Brien-Fleming bounds.

Result: At day 7, the observed lift is +0.3 percentage points. The sequential boundary isn’t crossed. Continue. At day 14, the lift is +0.6 percentage points. Boundary crossed. Ship it.
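For readers curious where those day-7 and day-14 boundaries come from, here is a rough sketch of a Lan-DeMets O’Brien-Fleming-type alpha-spending schedule. It only shows how the 5% error budget is spread across the two looks; exact group-sequential boundaries account for the correlation between looks and are best computed with dedicated software.

```python
from scipy.stats import norm

def obf_alpha_spent(t, alpha=0.05):
    """Lan-DeMets O'Brien-Fleming-type spending function: cumulative two-sided
    alpha spent at information fraction t (0 < t <= 1)."""
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / t ** 0.5))

previous = 0.0
for label, t in [("day 7  (t = 0.5)", 0.5), ("day 14 (t = 1.0)", 1.0)]:
    cumulative = obf_alpha_spent(t)
    print(f"{label}: cumulative alpha {cumulative:.4f}, incremental {cumulative - previous:.4f}")
    previous = cumulative
# Almost no alpha is spent at the interim look, so an early "winner" must clear a far
# stricter bar than z = 1.96; most of the 5% budget is saved for the day-14 analysis.
```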

Without the protocol: The PM checks daily, sees +1.1 percentage points on day 3 with 93% “significance,” and declares a winner. She ships based on a number that’s nearly double the truth. Revenue projections overshoot by 83%. The actual lift settles at +0.6 points over the next quarter. Leadership loses trust in the experimentation program.

The best A/B test is the one where you wrote down “what would change our minds?” before pressing Start.


What Rigorous Testing Actually Buys You

At Microsoft Bing, an engineer picked up a low-priority idea that had been shelved for months: a small change to how ad headlines displayed in search results. The change seemed too minor to prioritize. Someone ran an A/B test.

The result was a 12% increase in revenue per search, worth over $100 million annually in the U.S. alone. It became the single most valuable change Bing ever shipped.

This story, documented by Ronny Kohavi in the Harvard Business Review, carries two lessons. First, intuition about what matters is wrong most of the time. At Google and Bing, 80% to 90% of experiments show no positive effect. As Kohavi puts it: “Any figure that looks interesting or different is usually wrong.” You need rigorous testing precisely because your instincts aren’t good enough.

Second, rigorous testing compounds. Bing’s experimentation program identified dozens of revenue-improving changes per month, collectively boosting revenue per search by 10% to 25% each year. This accumulation was a major factor in Bing growing its U.S. search share from 8% in 2009 to 23%.

The fifteen minutes you spend on a pre-test checklist isn’t overhead. It’s the difference between an experimentation program that compounds real gains and one that ships noise, erodes stakeholder trust, and makes A/B testing look like theater.

That product manager from 3 PM Thursday? She’s going to run another test next week. So are you.

The dashboard will still show a confidence percentage. It will still turn green when it crosses a threshold. The UI is designed to make calling a winner feel satisfying and definitive.

But now you know what the dashboard doesn’t show. The 26.1%. The winner’s curse. The 64% false alarm rate. The Bayesian mirage.

Your next test starts soon. The checklist takes fifteen minutes. The decision matrix takes five. That’s 20 minutes between shipping signal and shipping noise.

Which one will it be?


    References

    1. Evan Miller, “How Not To Run an A/B Test”
    2. Alex Molas, “Bayesian A/B Testing Is Not Immune to Peeking” (October 2025)
3. David Robinson, “Is Bayesian A/B Testing Immune to Peeking? Not Exactly”, Variance Explained
4. Ron Kohavi, Stefan Thomke, “The Surprising Power of Online Experiments”, Harvard Business Review (September 2017)
    5. Optimizely, “False Discovery Rate Control”, Help Documentation
    6. GrowthBook, “Multiple Testing Corrections”, Documentation
    7. Analytics-Toolkit, “Underpowered A/B Tests: Confusions, Myths, and Reality” (2020)
    8. Statsig, “Effect Size: Practical vs Statistical Significance”
    9. Statsig, “Sequential Testing: How to Peek at A/B Test Results Without Ruining Validity”


