A/B Testing for Founders: Sample Size, Significance, and 5 Mistakes to Avoid
A/B testing compares two versions of something — a headline, a pricing page, an onboarding step — by splitting traffic between them and measuring which converts better. Done right, it replaces opinions with evidence. Done wrong, it manufactures confident nonsense.
The one rule: decide your sample size before you start, and don't peek-and-stop.
The two numbers that matter
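- Statistical significance (p-value). The probability of seeing a difference at least as large as yours if the two versions actually performed the same. The convention: ship only when p < 0.05, meaning a gap this big would appear less than 5% of the time from random noise alone. It is not the probability that your result is a fluke.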
- Sample size. How many visitors each variant needs before the test can reliably detect a real difference. Small lifts need big samples: detecting a 2-point improvement on a 10% baseline (10% to 12%) takes roughly 3,800 visitors per variant at p < 0.05 and 80% power, not hundreds.
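To make the sample-size claim concrete, here is a minimal sketch of the standard normal-approximation formula for comparing two conversion rates. The function name, the two-sided test, and the 80% power default are assumptions chosen for illustration; online calculators may use slightly different formulas and return slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, lift, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect an absolute lift in conversion rate.

    Uses the normal approximation for a two-sided two-proportion test.
    `power` (assumed 80% here) is the chance of detecting the lift if it is real.
    """
    p1 = baseline
    p2 = baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / lift ** 2)


# 10% baseline, 2-point lift: a few thousand visitors per variant.
print(sample_size_per_variant(0.10, 0.02))

# Halving the lift to 1 point roughly quadruples the requirement.
print(sample_size_per_variant(0.10, 0.01))
```

The required sample grows with the inverse square of the lift, which is why low-traffic sites are better off testing big changes.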
A founder-sized testing process
- Write the hypothesis first. "Changing X will improve metric Y because Z." If you can't fill the Z, you're guessing, not testing.
- Test one variable. Change the headline or the CTA — not the whole page, or you won't know what worked.
- Compute the required sample size up front (any calculator works; GrowthPilot does it automatically per test).
- Run until the sample is reached — typically 1–2 full weeks minimum, to cover weekday/weekend behavior.
- Declare a winner only at significance. No winner is a valid result: it tells you that any effect of the change is too small for this test to detect, so spend your energy elsewhere.
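The last step of the process can be sketched as a single significance check run once the planned sample is in. The article does not say which test to use; this example assumes a pooled two-proportion z-test, and the function names and visitor counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def verdict(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return 'A wins', 'B wins', or 'no winner' at the given threshold."""
    p = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    if p >= alpha:
        return "no winner"
    return "B wins" if conv_b / n_b > conv_a / n_a else "A wins"


# Hypothetical results after 4,000 visitors per variant.
print(verdict(400, 4000, 480, 4000))  # 10.0% vs 12.0%
print(verdict(400, 4000, 420, 4000))  # 10.0% vs 10.5%
```

With these made-up counts, the 2-point gap clears the threshold and the half-point gap does not, which is the "no winner" case rather than a failed test.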
5 mistakes that invalidate tests
- Peeking and stopping early. Checking daily and stopping at the first p < 0.05 pushes the false-positive rate well above the nominal 5%, because every extra look is another chance for noise to cross the line. Set the sample, then judge once.
- Testing with too little traffic. Below ~1,000 conversions/month, prefer big swings (offer, pricing, positioning) over button colors — small effects are undetectable.
- Running many variants on small traffic. Each extra variant splits your sample and multiplies false-positive risk.
- Ignoring seasonality. A test spanning Black Friday measures Black Friday, not your variant.
- Shipping "almost significant" results. p = 0.08 is not "nearly there"; it's "we don't know."
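The peeking mistake is easy to demonstrate with a simulation. The sketch below runs A/A tests, where both variants share the same true conversion rate, so every "significant" result is a false positive. The traffic level, run count, and seed are arbitrary assumptions, and the exact rates it prints depend on them.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(runs=500, days=14, daily_visitors=100, rate=0.10, seed=1):
    """Compare 'stop at the first p < 0.05' with 'judge once at the end'."""
    rng = random.Random(seed)
    peeking_hits = 0
    single_look_hits = 0
    for run in range(runs):
        conv_a = conv_b = visitors = 0
        significant_on_any_day = False
        last_p = 1.0
        for day in range(days):
            visitors += daily_visitors
            # Both variants convert at the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            last_p = p_value(conv_a, visitors, conv_b, visitors)
            if last_p < 0.05:
                significant_on_any_day = True  # a daily peeker would stop here
        peeking_hits += significant_on_any_day
        single_look_hits += last_p < 0.05
    return peeking_hits / runs, single_look_hits / runs


peeking_rate, single_look_rate = simulate_aa_tests()
print(f"False positives with daily peeking: {peeking_rate:.1%}")
print(f"False positives judging once:       {single_look_rate:.1%}")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out several times higher even though nothing differs between the variants.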
What to test first
Order by leverage: pricing page → signup flow → onboarding → headline/value prop → everything else. A 10% lift on activation compounds through every downstream AAARRR stage; a 10% lift on a footer link does nothing.
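The compounding argument is plain multiplication, sketched below with a made-up funnel. The stage names and every rate are hypothetical numbers chosen only to show the arithmetic.

```python
def paying_customers(visitors, signup_rate, activation_rate, paid_rate):
    """Customers produced by a simple three-stage funnel (hypothetical stages)."""
    return visitors * signup_rate * activation_rate * paid_rate


baseline = paying_customers(10_000, 0.05, 0.40, 0.25)
# A 10% relative lift on activation: 40% becomes 44%.
lifted = paying_customers(10_000, 0.05, 0.44, 0.25)

print(round(baseline), round(lifted))
print(f"Relative gain in customers: {lifted / baseline - 1:.0%}")
```

Because the stages multiply, a 10% relative lift at activation carries through as a 10% gain in customers at the end of the funnel, whereas a lift on an element almost nobody reaches is multiplied by a number close to zero.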
FAQ
What is statistical significance in A/B testing? A measure (p-value) of how often a difference at least as large as the one you observed would appear by pure chance if the variants actually performed the same. p < 0.05 is the standard shipping threshold.
How long should an A/B test run? Until the pre-computed sample size is reached, and at least one full business cycle (1–2 weeks) — whichever is longer.
How much traffic do I need to A/B test? To detect small lifts you typically need thousands of visitors per variant. With less traffic, test bigger changes.
Can an A/B test have no winner? Yes, and it's informative: any effect of the change is too small for the test to detect, so you can stop debating it.
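The "whichever is longer" rule for test duration can be sketched as a small calculation. The function name, the even traffic split, and the 7-day floor are assumptions for illustration; use 14 days if your business cycle is closer to two weeks.

```python
import math


def days_to_run(sample_per_variant, daily_visitors, variants=2, min_days=7):
    """Days needed to reach the planned sample, never shorter than min_days.

    Assumes all daily traffic enters the test and is split evenly.
    """
    days_for_sample = math.ceil(sample_per_variant * variants / daily_visitors)
    return max(days_for_sample, min_days)


# Hypothetical: 3,839 visitors per variant, 400 visitors a day in total.
print(days_to_run(3839, 400))

# Plenty of traffic: the sample fills quickly, so the full-week floor applies.
print(days_to_run(3839, 2000))
```

With low traffic the sample size sets the duration; with high traffic the minimum cycle does, so the test still covers both weekday and weekend behavior.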
Create tests, track significance automatically, and declare winners with guardrails in GrowthPilot's built-in A/B testing module.