
A/B Testing on Shopify: Tools, Setup, and Common Traps

July 31, 2025


A/B testing on Shopify used to be dominated by Google Optimize, which is now gone. The current landscape has four viable tools, each with tradeoffs. Most DIY testing programs fail on the setup and statistical side rather than the tool side, so knowing both layers matters.

Here is how to run A/B testing on Shopify properly in 2026.

The tools

Shopify's native A/B testing (via Shopify Editions)

Shopify launched native A/B testing in 2024 for Plus merchants and rolled it out broadly in 2025.

Pros: Native integration, free, no external scripts, no CLS (cumulative layout shift) issues.

Cons: Limited to theme-level tests, cannot test complex logic. Reporting is basic.

When to use: Simple hero, collection, or PDP variants where the change is cosmetic.

Intelligems

Built specifically for Shopify. Popular in the DTC community.

Pros: Shopify-native, handles pricing tests (can A/B test actual product prices, not just visual elements), strong reporting.

Cons: Pricier than alternatives. Learning curve for non-technical users.

When to use: Mid-size to large Shopify stores wanting to test price, shipping, or offer variations.

Convert.com

Platform-agnostic, works on Shopify, WooCommerce, custom sites.

Pros: Mature, full-featured, handles complex targeting and segmentation.

Cons: $349+ per month. Script-based, so it can cause CLS issues if implemented sloppily.

When to use: Brands testing across multiple properties or needing enterprise features.

VWO

Similar to Convert in positioning.

Pros: Rich feature set, heatmaps and session recording built in.

Cons: Pricing starts at $200+ per month. Enterprise-oriented UX.

When to use: Mid-market brands wanting a single tool for testing plus user research.

What not to use

Manual theme duplicates. "I will duplicate the theme, change a thing, and route 50 percent of traffic." This does not work because Shopify's caching and theme architecture make consistent variant delivery impossible.

Google Optimize. It is gone. Do not try to revive it.

Free tools with script-based testing. They cause CLS, degrade the UX, and confound results.

The setup that determines whether your tests are trustworthy

Tool choice matters less than whether your setup is correct. Here is what correct looks like.

Define your primary metric before the test

Most tests go sideways because the team did not agree on what success looks like.

Primary metric must be revenue-related. Conversion rate is fine. Revenue per visitor is better. Total revenue is best if you have enough volume.

Guardrail metrics. Bounce rate, time on site, returns/refund rate. A test that wins on conversion but pushes bounce or return rates in the wrong direction is not a real win.

Do not switch metrics mid-test. That is p-hacking, and it invalidates the result.

Calculate required sample size before launching

Use a sample size calculator. Input your baseline conversion rate, your minimum detectable effect (MDE), and your desired statistical power.

Typical ecom: 2 percent baseline CVR, 10 percent MDE, 80 percent power, 95 percent confidence. You need roughly 80,000 visitors per variant. That means about 160,000 total visitors to the test, which for a store doing 30,000 monthly visits takes over 5 months.
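
If you want to sanity-check that math rather than trust a calculator blindly, here is a rough Python sketch using statsmodels. The 2 percent baseline, 10 percent relative MDE, and 30,000 monthly visits are the example numbers above; treat the output as an estimate, since calculators differ slightly in their assumptions.

    import math

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline_cvr = 0.02               # current conversion rate
    mde_relative = 0.10               # smallest relative lift worth detecting
    target_cvr = baseline_cvr * (1 + mde_relative)

    # Cohen's h for the two proportions, then solve for visitors per variant
    # at 95 percent confidence (alpha = 0.05) and 80 percent power.
    effect = proportion_effectsize(target_cvr, baseline_cvr)
    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
    )

    monthly_visits = 30_000
    total_visitors = 2 * n_per_variant
    days_needed = total_visitors / (monthly_visits / 30)
    weeks_needed = math.ceil(days_needed / 7)   # round up to whole weeks

    print(f"~{n_per_variant:,.0f} visitors per variant")
    print(f"~{total_visitors:,.0f} visitors total, about {weeks_needed} full weeks of traffic")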

If you cannot run that long: either increase your MDE (only test changes big enough to plausibly move the metric by more than 10 percent), test on a high-traffic page (the PDP for your top SKU), or accept that you will not reach statistical power and ship high-confidence changes without formal testing.

Run for whole weeks

Test duration should be a multiple of 7 days to avoid weekday/weekend bias. Running 6 days or 9 days contaminates the result.

Do not peek

Peeking at results mid-test and stopping when you see "significance" gives false positives. Run the full pre-calculated sample. Then decide.
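
A quick way to convince yourself (or a stakeholder) of this: simulate an A/A test where both variants share the same true conversion rate, then stop at the first "significant" checkpoint. The traffic numbers below are made up for illustration.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    true_cvr = 0.02                    # identical in both variants: any "win" is noise
    n_per_variant = 50_000
    checkpoints = range(5_000, n_per_variant + 1, 5_000)

    def two_proportion_p(conv_a, n_a, conv_b, n_b):
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (conv_a / n_a - conv_b / n_b) / se
        return 2 * (1 - norm.cdf(abs(z)))

    runs, false_positives = 1_000, 0
    for _ in range(runs):
        a = rng.random(n_per_variant) < true_cvr
        b = rng.random(n_per_variant) < true_cvr
        for n in checkpoints:
            if two_proportion_p(a[:n].sum(), n, b[:n].sum(), n) < 0.05:
                false_positives += 1   # stopped early on a "significant" reading
                break

    print(f"False positive rate with peeking: {false_positives / runs:.0%}")
    # Expect well above the nominal 5 percent with this many looks.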

Segment post-hoc carefully

Segmenting results by device, country, or traffic source after the test is fine for insight. Declaring a test a "win on mobile" after it failed overall is p-hacking.
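
If you do slice a finished test, at least correct for the number of slices you looked at. Here is a minimal sketch using a Holm correction on per-segment p-values; the segment names and values are hypothetical.

    from statsmodels.stats.multitest import multipletests

    # Hypothetical p-values from slicing one finished test four ways
    segments = {"mobile": 0.03, "desktop": 0.40, "US": 0.12, "non-US": 0.25}

    reject, corrected, _, _ = multipletests(
        list(segments.values()), alpha=0.05, method="holm"
    )
    for (name, raw), adj, significant in zip(segments.items(), corrected, reject):
        print(f"{name}: raw p = {raw:.2f}, corrected p = {adj:.2f}, significant: {significant}")
    # The "mobile win" at p = 0.03 stops looking significant once corrected.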

Common traps

Sample Ratio Mismatch (SRM)

If your 50/50 split comes back 52/48 at any real volume, something is wrong with assignment. Could be caching, could be bot traffic, could be a bug. Do not trust the test result.

Every testing tool reports traffic split. Check it before interpreting outcomes.
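
Checking for SRM yourself takes a few lines if you can export assignment counts. The counts below are hypothetical, mirroring the 52/48 example above; SRM checks conventionally use a strict threshold such as p < 0.001.

    from scipy.stats import chisquare

    observed = [26_000, 24_000]             # hypothetical visitors assigned to A and B
    expected = [sum(observed) * 0.5] * 2    # the intended 50/50 split

    stat, p_value = chisquare(observed, f_exp=expected)
    if p_value < 0.001:
        print(f"Likely SRM (p = {p_value:.2g}); do not trust this test")
    else:
        print(f"Assignment looks consistent with 50/50 (p = {p_value:.2g})")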

Cross-device users

A user who sees variant A on mobile and variant B on desktop confuses the experiment. Most tools handle this okay with fingerprinting or login-based identity, but it is worth verifying.

Bot traffic

Scrapers, uptime monitors, and crawlers can skew tests if they hit cache differently. Exclude bot traffic via your testing tool or via GA4 filters.

Promotional windows

Running a test through Black Friday contaminates the data. Heavy discount users behave differently than normal users. Pause tests during major promotions.

Winner's curse

Many "winning" tests at 95 percent confidence with small samples do not replicate. The winner is smaller than reported or zero. Re-run critical tests before permanently shipping. Ship the change but keep testing in parallel.

Novelty effect

A new design gets a temporary lift simply because it is new. After 2 to 3 weeks, behavior normalizes. Run tests long enough to get past the novelty window before reading the effect.

The reporting that actually matters

Every test result should disclose:

  • Traffic per variant
  • Conversion count per variant
  • Conversion rate per variant
  • Revenue per visitor per variant
  • 95 percent confidence interval
  • P-value
  • Test duration in days
  • Any exclusions applied

Reports missing this detail are hiding weakness. Good CRO teams disclose losses as clearly as wins because the knowledge compounds for future tests.
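
If your tool exports raw counts, the core numbers on that list are easy to recompute and double-check. Below is a sketch with hypothetical figures, using a standard two-proportion z-test and a Wald confidence interval for the difference in conversion rate.

    import math

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical export: visitors, conversions, and revenue per variant
    data = {
        "control": {"visitors": 80_000, "conversions": 1_600, "revenue": 128_000.0},
        "variant": {"visitors": 80_000, "conversions": 1_760, "revenue": 141_000.0},
    }

    for name, d in data.items():
        cvr = d["conversions"] / d["visitors"]
        rpv = d["revenue"] / d["visitors"]
        print(f"{name}: {d['visitors']:,} visitors, {d['conversions']:,} conversions, "
              f"CVR {cvr:.2%}, revenue per visitor ${rpv:.2f}")

    # Two-sided z-test on conversion rate
    counts = [data["variant"]["conversions"], data["control"]["conversions"]]
    nobs = [data["variant"]["visitors"], data["control"]["visitors"]]
    _, p_value = proportions_ztest(counts, nobs)

    # 95 percent Wald confidence interval for the difference in conversion rate
    p_var, p_ctl = counts[0] / nobs[0], counts[1] / nobs[1]
    se = math.sqrt(p_var * (1 - p_var) / nobs[0] + p_ctl * (1 - p_ctl) / nobs[1])
    diff = p_var - p_ctl
    print(f"p-value: {p_value:.3f}")
    print(f"95% CI for the CVR difference: [{diff - 1.96 * se:.4f}, {diff + 1.96 * se:.4f}]")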

Testing cadence

For a store doing 20,000+ monthly sessions, 2 to 3 tests running at any time is reasonable. Tests can overlap as long as they do not share a primary metric or a page.

For smaller stores, run one test at a time and ship high-confidence changes without formal testing.

A reasonable year in testing

A well-run program produces:

  • 15 to 25 tests shipped in a year
  • 5 to 10 statistically significant wins
  • 2 to 5 tests where the result informed strategy even without shipping (negative results have value)
  • Ongoing documentation of what was tested, what won, what lost, and what to try next

Most boutique stores overestimate how many tests they can run and underestimate how much compounding comes from sustained testing over 2 to 3 years.

When to hire help

A DIY testing program works if you have someone comfortable with statistics, Shopify theme development, and project management. That person needs 8 to 12 hours per week of dedicated capacity.

If you do not have that person, an agency is usually cost-effective. A good CRO retainer ($4,000 to $5,000 per month for boutique brands) includes test design, development, execution, and reporting.

Our CRO service runs exactly this program. It starts with a research sprint to build a hypothesis backlog, then executes tests monthly.

