
A/B Test Sample Size Math: MDE, Duration, and Why Most D2C Tests Lie

July 28, 2025

The test we called a winner that cost us a quarter

Early in our agency's history, we called a test a winner off an observed 12 percent lift, after nine days and 3,200 sessions per variant. The client shipped it. Over the next 60 days, revenue from the winning variant trailed the original by 4 percent. We had run an underpowered test, peeked twice, and shipped a variant that had no real effect. That quarter cost the client roughly $80K in lost revenue and cost us trust. Every sample size calculation we run now comes from that lesson.

TL;DR
▸ Most D2C tests are underpowered because brands do not plan sample size before launch
▸ MDE is the lift you can detect; smaller MDE requires much larger samples (non-linear growth)
▸ Plan for two-week minimum duration to account for weekly cycles, regardless of what sample calculators say
▸ Call winners only when both sample target and duration target are hit, never one or the other

The formula, explained without the textbook

Sample size per variant for a simple two-variant test scales with four things. The baseline conversion rate of the control. The minimum detectable effect you want to reliably catch. The statistical power you require, typically 0.80. The significance level, typically 0.05. The formula approximates as:

n = 16 * p * (1 - p) / delta^2

Where p is the baseline conversion rate as a decimal and delta is the absolute difference you want to detect. The constant 16 comes from combining 80 percent power with a 5 percent significance level. That is the quick math. For a 2 percent baseline CVR trying to detect a 10 percent relative lift (an absolute delta of 0.002), you need roughly:

n = 16 * 0.02 * 0.98 / (0.002)^2 = 78,400 sessions per variant

That is 156,800 total sessions. For a store doing 50,000 monthly sessions, that is a three-month test. For a store doing 300,000 monthly sessions, that is a two-week test. The implication: small stores cannot run small-lift tests. Either target bigger MDEs, test bigger changes, or accept longer durations.
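
Here is that quick math as a few lines of Python, for anyone who wants to sanity-check a calculator. The function name and rounding are mine; this is a sketch of the rule-of-16 approximation, not a replacement for a proper power calculation.

```python
def sessions_per_variant(baseline_cvr: float, relative_mde: float) -> int:
    """Rule-of-16 approximation: ~80% power, 5% significance, two-variant test."""
    delta = baseline_cvr * relative_mde        # absolute lift you want to detect
    return round(16 * baseline_cvr * (1 - baseline_cvr) / delta ** 2)

# 2% baseline CVR, 10% relative lift -> 78,400 sessions per variant, 156,800 total
n = sessions_per_variant(0.02, 0.10)
print(n, 2 * n)
```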

The MDE table every tester should keep open

| Baseline CVR | Target relative lift | Sessions per variant (rounded) | Total sessions needed |
| --- | --- | --- | --- |
| 1.5% | 20% | 26,300 | 52,600 |
| 1.5% | 10% | 105,100 | 210,200 |
| 2.0% | 20% | 19,600 | 39,200 |
| 2.0% | 10% | 78,400 | 156,800 |
| 3.0% | 20% | 12,900 | 25,800 |
| 3.0% | 10% | 51,700 | 103,400 |
| 4.0% | 20% | 9,600 | 19,200 |
| 4.0% | 10% | 38,400 | 76,800 |

The pattern is clear. Every halving of MDE roughly quadruples the required sample size. Stores targeting 5 percent lifts are looking at sample requirements four times larger than what the table shows for 10 percent lifts. Most D2C stores cannot afford that sample economically, which means they should not plan tests for 5 percent effects. They should plan for 15 to 20 percent effects and design tests that can plausibly generate them.
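
The same helper, rounded to the nearest hundred, regenerates the table and makes the quadrupling visible. A sketch using the rule-of-16 approximation above:

```python
def sessions_per_variant(p: float, relative_mde: float) -> int:
    """Rule-of-16 approximation, rounded to the nearest hundred as in the table above."""
    delta = p * relative_mde
    return int(round(16 * p * (1 - p) / delta ** 2, -2))

for p in (0.015, 0.02, 0.03, 0.04):
    n20 = sessions_per_variant(p, 0.20)
    n10 = sessions_per_variant(p, 0.10)
    # halving the relative MDE (20% -> 10%) roughly quadruples the per-variant requirement
    print(f"{p:.1%} baseline: {n20:,} at 20% MDE, {n10:,} at 10% MDE ({n10 / n20:.1f}x)")
```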

The two-week minimum rule

Traffic on D2C stores cycles weekly. Weekend buyers behave differently from weekday buyers. Email traffic spikes on Tuesday and Thursday mornings. Paid traffic skews by hour of day. A seven-day test captures one of each day type exactly once, which means weekend noise has nowhere to cancel out. A fourteen-day test captures two of each day type, which is the minimum for weekly cycle noise to dampen.

The rule: never call a test in under fourteen days, regardless of when you hit the sample size target. If you hit sample at day ten, you wait four more days. If you have not hit sample at day fourteen, you either extend the test or revisit the MDE target.

One exception. If a variant is dramatically losing at day four or five with a confidence interval that excludes positive lift, killing early is fine. You are not declaring a winner. You are declaring a loser and stopping the bleed. The asymmetry matters: killing losers early is responsible experimentation; calling winners early is not.
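
Written as a sketch, the asymmetry is two separate gates rather than one. The function names and the day-four threshold are illustrative:

```python
def can_call_winner(days_run: int, sessions_per_arm: int,
                    target_sessions: int, min_days: int = 14) -> bool:
    # Winner calls require BOTH gates: the pre-planned sample and two full weekly cycles.
    return days_run >= min_days and sessions_per_arm >= target_sessions

def can_kill_loser(days_run: int, lift_ci_upper: float, min_days: int = 4) -> bool:
    # Early kills only need the lift's confidence interval to sit entirely below zero.
    return days_run >= min_days and lift_ci_upper < 0
```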

Why peeking is worse than you think

The math of repeated significance testing says that if you check your result daily and stop as soon as p drops below 0.05, your actual false positive rate is not 5 percent. It is closer to 25 to 40 percent depending on how often you peek and how long the test runs. You are inflating the false positive rate five- to eightfold.
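
You can watch the inflation happen with an A/A simulation that peeks daily. This is a sketch with made-up traffic numbers, not production tooling; the exact rate it prints depends on how long the simulated test runs and how often it looks.

```python
import numpy as np

rng = np.random.default_rng(7)

def daily_peeking_false_positive_rate(n_sims=2000, days=28, daily_sessions=1000, cvr=0.02):
    """A/A test: both arms share the same true CVR, so every 'win' is a false positive."""
    false_positives = 0
    for _ in range(n_sims):
        conv = np.zeros(2)
        sessions = np.zeros(2)
        for _ in range(days):
            sessions += daily_sessions
            conv += rng.binomial(daily_sessions, cvr, size=2)
            p_a, p_b = conv / sessions
            p_pool = conv.sum() / sessions.sum()
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / sessions[0] + 1 / sessions[1]))
            if abs(p_b - p_a) / se > 1.96:   # today's peek looks "significant"
                false_positives += 1
                break
    return false_positives / n_sims

# Nominal alpha is 0.05; with 28 daily peeks the realized rate comes out several times higher.
print(daily_peeking_false_positive_rate())
```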

The fix is one of two things. Run the test blind until the pre-calculated sample size is hit and look only then. Or use a sequential testing method (SPRT, always-valid p-values) designed to let you peek without inflating error rates; CUPED-style variance reduction helps power but does not, on its own, make peeking safe. Modern testing tools like Optimizely, VWO, and Statsig all offer variants of sequential methods. Use them if you are going to peek. Otherwise, wait.

Most D2C teams are not using sequential methods. They are using classic frequentist tests and peeking anyway. That combination is the single biggest reason tests fail to replicate after shipping. Our analytics and reporting practice treats testing rigor as a core deliverable because the cost of bad testing compounds across quarters.

The sample-size planning workflow

Four steps before every test.

First, pull the baseline conversion rate for the exact page, audience, and window you plan to test on. Not site-wide CVR. Not a quarterly average. The CVR for this specific slice of traffic over the last 28 days.

Second, decide the target MDE. Be honest. If the variant is a subtle copy change, a 20 percent lift target is fantasy and you should not run the test. Save the slot for a bigger change.

Third, calculate sample size using the formula above or a standard calculator. Evan Miller's calculator or the one built into your testing tool. Write down the number. Get buy-in from whoever approves the roadmap.

Fourth, calculate duration by dividing required sessions by current traffic rate. If duration is more than eight weeks, the test is probably not worth running at all. Pick a bigger change or merge traffic from related pages.
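
The workflow condenses to a few lines. This sketch wires steps two through four together; the input names are mine, the eight-week cutoff and two-week floor come from the steps above, and step one (pulling the right baseline) still has to happen in your analytics tool.

```python
import math

def plan_test(baseline_cvr: float, relative_mde: float,
              daily_sessions: int, arms: int = 2) -> dict:
    """Steps two through four: MDE -> sample size -> duration -> go/no-go."""
    delta = baseline_cvr * relative_mde
    per_variant = math.ceil(16 * baseline_cvr * (1 - baseline_cvr) / delta ** 2)
    total = per_variant * arms
    duration_days = max(14, math.ceil(total / daily_sessions))  # two-week floor
    return {
        "sessions_per_variant": per_variant,
        "total_sessions": total,
        "duration_days": duration_days,
        "worth_running": duration_days <= 56,  # more than eight weeks: pick a bigger change
    }

# 2% baseline CVR, 15% target MDE, ~3,300 sessions per day (~100k per month)
print(plan_test(0.02, 0.15, 3_300))
```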

Revenue per visitor versus conversion rate

CVR is easier to power. RPV is what actually matters. Relative to the effect you are trying to detect, RPV variance is typically 3x to 10x higher than CVR variance, which means a test powered on CVR is significantly underpowered for RPV.

The fix: plan sample size on RPV, not CVR. Use the historical standard deviation of RPV over the last 28 days to set the target effect size. The math is harder (use the two-sample t-test formula rather than the binomial one), and the conclusion is usually a sample size 2x to 5x larger.
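
A sketch of the continuous-metric version, using the same rule-of-16 shorthand applied to a two-sample t-test. The RPV and standard deviation figures below are invented for illustration; pull your own from the last 28 days.

```python
def rpv_sessions_per_variant(rpv_sd: float, absolute_lift: float) -> int:
    """Two-sample sizing for a continuous metric at ~80% power, 5% significance."""
    return round(16 * rpv_sd ** 2 / absolute_lift ** 2)

# Example: $2.40 RPV with a 28-day standard deviation of $28,
# powered to detect a 10% lift ($0.24 per visitor).
print(rpv_sessions_per_variant(28.0, 0.24))   # ~218,000 per variant vs 78,400 on CVR alone
```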

Brands that test on CVR and ship winners often see mixed RPV outcomes. "We lifted CVR by 10 percent but revenue didn't move" is the most common report. The reason is almost always that the winner also lowered AOV, and the test was not powered to detect the AOV shift. Our break-even ROAS guide covers the downstream math on why AOV shifts matter for paid economics.

When not to run a test

Three scenarios where testing is the wrong tool.

First, when the answer is obvious. If your PDP is missing shipping info above the fold (point 5 in the PDP grader), do not test whether adding it helps. It helps. Ship it. Testing obvious fixes wastes sample.

Second, when the change is directional rather than a binary variant. Gradual color palette updates across a brand refresh are not test candidates. They are brand decisions.

Third, when traffic is below the threshold to power any meaningful test. Stores under 20,000 monthly sessions are better off focusing on qualitative research, heuristic audits, and the heuristic-driven changes they imply. Run tests when you have earned the traffic to test.

The CRO services we run treat testing as one tool among several. Smaller brands get heuristic audits and priority changes. Larger brands get test-and-iterate cycles. Treating testing as the only valid method wastes cycles for brands that cannot afford it.

Reporting tests honestly to stakeholders

Three fields every test report should contain. The calculated sample size target. The actual sample collected. Whether the test ran the full planned duration. Without these three, you cannot tell whether a declared winner is real or noise.

Our test readouts also include the confidence interval, not just the point estimate. A "12 percent lift" with a confidence interval of -3 to 28 percent is a maybe. A "12 percent lift" with a confidence interval of 6 to 18 percent is a probable. Stakeholders need to see the range, not the single number.
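
One way to produce that range is a normal approximation on the absolute difference, divided by the control rate. It is a simplification rather than a full delta-method interval, and the counts below are invented, but it is enough to show how the range changes the read:

```python
def relative_lift_ci(conv_a: int, n_a: int, conv_b: int, n_b: int, z: float = 1.96):
    """Approximate 95% CI for the relative lift of variant B over control A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    return (diff - z * se) / p_a, (diff + z * se) / p_a

# Control: 800 orders on 40,000 sessions (2.0%); variant: 896 on 40,000 (2.24%).
low, high = relative_lift_ci(800, 40_000, 896, 40_000)
print(f"+12% lift, 95% CI {low:+.0%} to {high:+.0%}")   # roughly +2% to +22%: a maybe leaning probable
```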

For a broader testing program structure, the ecommerce CRO tests that beat best practices post has the thinking on which hypotheses are worth testing. The ecommerce CRO checklist has the foundational work that testing rides on top of.

The honesty tax

Honest testing means fewer winners to celebrate and more tests called flat. That is uncomfortable. It is also correct. A quarterly program that ships three real winners is more valuable than a program that ships twelve declared winners of which only two replicate. The math is unforgiving. Respect it or pay the tax in false-positive shipped variants that quietly cost you revenue.

For brands running paid traffic, testing rigor is even more important because false-positive CRO wins distort the MER picture downstream. See the attribution for D2C MER essay for the blended measurement layer that helps reveal whether shipped CRO wins are real.

What to do this week

▸ Pull 28-day baseline CVR and RPV for your top three testing surfaces (PDP, landing page, cart)
▸ For each surface, calculate required sample size at 10 percent, 15 percent, and 20 percent MDE targets
▸ Compare required samples against your monthly traffic to set realistic testing cadence
▸ Audit your current testing tool setup to confirm whether you are running sequential or classic frequentist tests
▸ Commit to the two-week minimum duration rule across your testing program starting next test
▸ Build a test report template that includes sample target, actual sample, duration, and confidence interval
