A/B testing compares two versions of something — a headline, a pricing page, an onboarding step — by splitting traffic between them and measuring which converts better. Done right, it replaces opinions with evidence. Done wrong, it manufactures confident nonsense.

The one rule: decide your sample size before you start, and don't peek-and-stop.

The two numbers that matter

Statistical significance (p-value). The probability your result is a fluke. The convention: ship only when p < 0.05 — meaning less than a 5% chance the difference is random noise.
Sample size. How many visitors each variant needs before the test can detect a real difference. Small lifts need big samples: detecting a 2-point improvement on a 10% baseline takes thousands of visitors per variant, not hundreds.

A founder-sized testing process

Write the hypothesis first. "Changing X will improve metric Y because Z." If you can't fill the Z, you're guessing, not testing.
Test one variable. Change the headline or the CTA — not the whole page, or you won't know what worked.
Compute the required sample size up front (any calculator works; GrowthPilot does it automatically per test).
Run until the sample is reached — typically 1–2 full weeks minimum, to cover weekday/weekend behavior.
Declare a winner only at significance. No winner is a valid result: it tells you that change doesn't matter, so spend your energy elsewhere.

5 mistakes that invalidate tests

Peeking and stopping early. Checking daily and stopping at the first p < 0.05 inflates false positives massively. Set the sample, then judge once.
Testing with too little traffic. Below ~1,000 conversions/month, prefer big swings (offer, pricing, positioning) over button colors — small effects are undetectable.
Running many variants on small traffic. Each extra variant splits your sample and multiplies false-positive risk.
Ignoring seasonality. A test spanning Black Friday measures Black Friday, not your variant.
Shipping "almost significant" losers. p = 0.08 is not "nearly there"; it's "we don't know."

What to test first

Order by leverage: pricing page → signup flow → onboarding → headline/value prop → everything else. A 10% lift on activation compounds through every downstream AAARRR stage; a 10% lift on a footer link does nothing.

FAQ

What is statistical significance in A/B testing? A measure (p-value) of how likely your observed difference is pure chance. p < 0.05 is the standard shipping threshold.

How long should an A/B test run? Until the pre-computed sample size is reached, and at least one full business cycle (1–2 weeks) — whichever is longer.

How much traffic do I need to A/B test? To detect small lifts you typically need thousands of visitors per variant. With less traffic, test bigger changes.

Can an A/B test have no winner? Yes, and it's informative: the change doesn't move the metric, so you can stop debating it.

Create tests, track significance automatically, and declare winners with guardrails in GrowthPilot's built-in A/B testing module.