To run a valid A/B test, a common rule of thumb is a minimum of 1,000 visitors per variant, but the real number depends on your current conversion rate, the size of the improvement you want to detect, and your desired level of statistical confidence. That floor only holds for pages with high conversion rates and large expected lifts. For conversion rates of a few percent and modest target improvements, the requirement often runs to tens of thousands of visitors per variant before you can draw reliable conclusions. The sections below unpack each factor that shapes your required sample size, from significance thresholds to traffic levels and common pitfalls.

What makes an A/B test statistically valid?

An A/B test is statistically valid when it has a large enough sample size to detect a real difference, runs long enough to capture natural variation in visitor behaviour, and reaches a predefined significance threshold before you declare a winner. Without these three conditions, any result you see may simply be noise rather than a genuine signal.

Validity starts with proper test design. Before you launch, you need to define your primary metric (for example, click-through rate or form completions), set your significance level (typically 95%), and calculate the minimum sample size required to detect your target improvement. Running a test without these parameters in place means you are essentially guessing when to stop, which dramatically increases the chance of a false positive.

Two additional factors reinforce validity. First, traffic must be split randomly and simultaneously between variants, so that external factors such as day of the week or a marketing campaign affect both groups equally. Second, you should only test one variable at a time. Changing the headline and the button colour simultaneously makes it impossible to know which change drove the result.
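One common way to get a random but stable split is to hash a visitor identifier together with the experiment name, so the same visitor always sees the same variant. The sketch below is illustrative only: the function name, the `visitor_id` format, and the experiment label are assumptions, not the behaviour of any particular testing tool.

```python
import hashlib


def assign_variant(visitor_id, experiment, variants=("A", "B")):
    """Deterministically map a visitor to a variant with a roughly even split."""
    # Including the experiment name means separate experiments split independently.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Hypothetical visitor ID and experiment name, for illustration only.
print(assign_variant("visitor-12345", "headline-test"))
```

Because assignment depends only on the ID and the experiment name, both variants receive traffic over the same period, which is what keeps day-of-week and campaign effects balanced between groups.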

How is sample size calculated for an A/B test?

Sample size for an A/B test is calculated using four inputs: your baseline conversion rate, the minimum detectable effect (MDE) you want to identify, your chosen statistical power, and your significance level. These inputs are fed into a statistical formula that determines how many visitors each variant needs to produce a trustworthy result.

Here is what each input means in practice:

  • Baseline conversion rate: The current rate at which visitors complete your goal, for example 3% of visitors filling in a contact form.
  • Minimum detectable effect (MDE): The smallest relative improvement you care about detecting, such as a 20% lift from 3% to 3.6%.
  • Statistical power: Usually set at 80%, this is the probability that your test will detect a real effect if one exists.
  • Significance level: Usually set at 95% (alpha of 0.05), meaning you accept a 5% chance of a false positive when there is no real difference between the variants.

The lower your baseline conversion rate and the smaller the effect you want to detect, the larger your required sample size. A page converting at 1% needs far more visitors to detect a 10% improvement than a page converting at 10% does. Most online calculators handle this maths for you, but understanding the inputs helps you set realistic expectations before you start.
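For readers who want to see the maths, here is a minimal sketch of one standard formula: the normal approximation for a two-sided test comparing two proportions. The function name and defaults are assumptions for illustration, and individual online calculators may use slightly different formulas, so expect their outputs to differ by a few percent.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # conversion rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# The example from the list above: 3% baseline, 20% relative lift.
print(sample_size_per_variant(0.03, 0.20))

# Lower baselines need far more traffic for the same relative lift.
print(sample_size_per_variant(0.01, 0.10), sample_size_per_variant(0.10, 0.10))
```

Under these assumptions, the 3% baseline with a 20% relative lift comes out at just under 14,000 visitors per variant.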

What’s the minimum number of visitors needed per variant?

As a practical rule of thumb, you need at least 1,000 visitors per variant before results become meaningful, giving you 2,000 total visitors across a two-variant test. However, this figure is only reliable if your conversion rate is reasonably high (above 5%) and you are looking for a large improvement, on the order of a 50% relative lift or more. For lower conversion rates or smaller effects, the minimum rises considerably.

To put this in concrete terms: if your landing page converts at 2% and you want to detect a 15% relative improvement (from 2% to 2.3%), a sample size calculator set to 95% significance and 80% power will typically return a requirement of roughly 35,000 to 40,000 visitors per variant. Trying to call a winner at 1,000 visitors in that scenario would be statistically irresponsible.

The key takeaway is that “minimum visitors” is not a fixed number. It is the output of a calculation, not a shortcut. Always calculate your required sample size before launching, not after you have seen the data.

How long should an A/B test run?

An A/B test should run for a minimum of two full business cycles, which for most websites means at least two weeks, regardless of how quickly you accumulate the required number of visitors. Running a test for fewer than seven days risks capturing skewed data because visitor behaviour varies significantly between weekdays and weekends.

Time matters for two reasons beyond raw visitor counts. First, the novelty effect: visitors who encounter a new design may behave differently in the first few days simply because it is unfamiliar. This effect fades over time. Second, weekly cycles: a test that captures only Mondays and Tuesdays may not reflect the behaviour of your Thursday and Friday audience.

A reasonable maximum runtime is four to six weeks. Leaving a test running indefinitely introduces the risk of external changes, such as a new marketing campaign or seasonal shift, contaminating your results. If you have not reached significance after six weeks, the effect you are testing for may simply be too small to matter at your current traffic levels.
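The runtime guidance above can be turned into a quick planning check. This is a sketch under stated assumptions: the function name is hypothetical, the two-week floor and six-week ceiling come from the guidance in this section, and `daily_visitors` means the traffic actually entering the test.

```python
from math import ceil

MAX_DAYS = 42  # six weeks, the upper bound suggested above


def planned_duration_days(visitors_per_variant, daily_visitors, variants=2, min_days=14):
    """Days to run: enough for the sample, at least two weeks, in whole weeks."""
    days_for_sample = ceil(visitors_per_variant * variants / daily_visitors)
    days = max(days_for_sample, min_days)
    return ceil(days / 7) * 7  # round up so every weekday is covered equally


days = planned_duration_days(visitors_per_variant=14000, daily_visitors=1500)
print(days, "days;", "too long" if days > MAX_DAYS else "within range")
```

If the planned duration exceeds six weeks, that is a sign to target a larger effect or a higher-traffic page rather than to let the test run on.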

What is statistical significance and why does 95% matter?

Statistical significance is a measure of how unlikely the observed difference between two variants would be if there were no real difference between them. A 95% significance level means that, if the variants truly performed the same, a difference at least this large would show up by chance in no more than 5% of tests. It does not mean there is a 95% probability that the observed effect is genuine. This threshold is the most widely accepted standard in A/B testing.

The 95% threshold matters because it balances the cost of acting on a false positive against the cost of missing a real improvement. If you use a lower threshold, say 80%, you will call winners faster but make incorrect decisions more often. If you demand 99% significance, you will need roughly 50% more visitors per variant at 80% power, and correspondingly longer test durations, which is impractical for most teams.

It is worth noting that 95% significance does not mean your result is certainly correct. It means that if there were no real difference between the variants and you ran this test 100 times under identical conditions, you would expect a false positive roughly five times. For high-stakes decisions, such as a complete homepage redesign, some teams prefer to aim for 99% confidence. For lower-stakes tests, 95% is a sensible and well-established standard.
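To make the threshold concrete, here is a minimal sketch of a two-sided two-proportion z-test, one common way to check significance for conversion rates. The function name and the visitor counts are illustrative assumptions, and your testing tool may use a different method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best estimate if both variants truly convert the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # zero if nobody converts
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Illustrative numbers: 300 of 10,000 convert on A, 360 of 10,000 on B.
p = two_proportion_p_value(300, 10000, 360, 10000)
print("significant at 95%" if p < 0.05 else "not significant at 95%")
```

A p-value below 0.05 corresponds to clearing the 95% threshold. It still says nothing about how large the true effect is, only that a gap this size would be unusual if the variants were identical.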

Can you run a valid A/B test with low website traffic?

Yes, you can run a valid A/B test with low traffic, but you need to adjust your approach. With fewer than 5,000 monthly visitors, you should test only high-impact changes that are likely to produce large, detectable effects, accept longer test durations, and be willing to wait weeks or months for results. Testing minor tweaks like button colour on a low-traffic site is unlikely to ever produce statistically reliable data.

Practical strategies for low-traffic sites include:

  • Focus on high-value pages: Test your highest-traffic pages first, even if overall site traffic is low.
  • Test bold changes: A completely different page layout is more likely to produce a detectable effect than a subtle copy tweak.
  • Use a higher MDE: Accept that you can only reliably detect large improvements, and design your variants accordingly.
  • Consider qualitative research first: User interviews, heatmaps, and session recordings can guide you towards changes worth testing before you commit traffic to an experiment.
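To see what "a higher MDE" means for your own traffic, you can invert the sample size calculation and ask what the smallest lift is that a given number of visitors can reliably detect. The sketch below assumes the normal-approximation formula for a two-sided test at 95% significance and 80% power; the function names and the traffic figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def required_sample(p1, p2, alpha=0.05, power=0.80):
    """Visitors per variant needed to tell conversion rate p1 from p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    top = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return top / (p2 - p1) ** 2


def smallest_detectable_lift(baseline, visitors_per_variant):
    """Bisect for the smallest relative lift the available traffic can detect."""
    low, high = 0.0, (1 - baseline) / baseline  # upper bound: variant converts at 100%
    for _ in range(60):
        mid = (low + high) / 2
        if required_sample(baseline, baseline * (1 + mid)) > visitors_per_variant:
            low = mid   # lift too small to detect with this traffic
        else:
            high = mid  # detectable, so try a smaller lift
    return high


# Hypothetical site: 3% baseline, 2,500 visitors per variant per month.
lift = smallest_detectable_lift(0.03, 2500)
print(f"Smallest detectable relative lift: {lift:.0%}")
```

With these illustrative numbers, the detectable lift lands at around half again the baseline, which is why bold changes are the realistic option on low-traffic sites.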

One approach to avoid is peeking at results daily and stopping the test early when you see a promising number. With low traffic, early results are highly unstable and almost always misleading.

Which A/B test calculator tools give accurate sample sizes?

Several reliable, free A/B test calculators provide accurate sample size estimates. The most widely used options are Evan Miller’s Sample Size Calculator, AB Testguide’s calculator, and the VWO A/B Test Duration Calculator. Each uses established statistical formulas and allows you to input your baseline rate, MDE, power, and significance level to produce a reliable estimate.

When choosing a calculator, look for one that lets you adjust all four core inputs independently. Some simplified tools only ask for a conversion rate and desired lift, which can produce misleading results if they assume fixed power and significance values you are not aware of. Transparency about the underlying assumptions is a sign of a trustworthy tool.

For teams running tests regularly, it is worth building a simple internal template that logs the pre-test calculation alongside the test results. This creates accountability and prevents the common habit of calculating sample size after the fact to justify an early call.

What are the most common A/B testing mistakes that invalidate results?

The most common A/B testing mistakes that invalidate results are stopping a test too early, running multiple tests on the same page simultaneously, failing to define a primary metric before launch, and testing too many variables at once. Each of these errors introduces bias or noise that makes it impossible to draw reliable conclusions.

Here is a breakdown of the most frequent mistakes and why they matter:

  1. Peeking and stopping early: Checking results before reaching your target sample size and stopping when you see significance dramatically inflates your false positive rate. Commit to your pre-calculated endpoint before you start.
  2. Running overlapping tests: If two tests share the same audience or page, the results of each will be contaminated by the other. Use audience segmentation or a testing roadmap to prevent overlap.
  3. Changing the test mid-run: Adjusting a variant, pausing traffic, or running a promotion during the test period introduces confounding variables that make your data untrustworthy.
  4. Ignoring the full business cycle: Tests that run for less than a week often miss natural variation in visitor behaviour, leading to skewed results.
  5. Celebrating secondary metrics: If your primary metric did not improve but a secondary one did, that is not a win. Defining your success metric before launch keeps you honest.
  6. Testing on too small a segment: Splitting already-low traffic into sub-segments (for example, mobile users only) reduces your sample size further and makes it nearly impossible to reach significance.

Avoiding these mistakes is less about statistical expertise and more about discipline. The most common errors are procedural, not mathematical.
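The cost of peeking is easy to demonstrate with a small simulation. The sketch below runs A/A tests, where both variants share the same true conversion rate, so every "significant" result is a false positive. All parameters (number of runs, number of looks, visitors per look, the 5% conversion rate, and the seed) are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0, 1):
        return 1.0  # no variation yet, so nothing to test
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(runs=400, looks=10, visitors_per_look=200, rate=0.05, seed=1):
    """Return (false positive rate with peeking, rate when testing once at the end)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        would_have_stopped = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            if p_value(conv_a, n, conv_b, n) < 0.05:
                would_have_stopped = True  # a peeker would call a winner here
        peeking_hits += would_have_stopped
        fixed_hits += p_value(conv_a, n, conv_b, n) < 0.05  # single look at the end
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"False positives with peeking: {peeking_rate:.1%}, without: {fixed_rate:.1%}")
```

The single end-of-test check should land near the nominal 5%, while the peeking rate comes out well above it, even though nothing about the variants differs.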

How Spotler helps with A/B testing

Running valid A/B tests requires the right infrastructure as much as the right methodology. We built A/B testing directly into Spotler’s website personalisation tooling so that marketing teams can test, measure, and act on results without needing a separate tool or technical support.

With Spotler Website Personalisation for B2B, you can:

  • Set up A/B tests on personalised content blocks, overlays, and page variants for specific audience segments
  • Measure which personalisation performs best per audience group, from returning leads to visitors arriving via email campaigns
  • Automatically build enriched visitor profiles in the background, giving you the segmentation data needed to run more targeted tests
  • Connect test results directly to your email marketing automation and other channels within the Spotler Marketing Cloud, so winning variants inform your broader campaign strategy

Whether you are testing a headline for B2B visitors or a call-to-action for returning customers, our platform gives you the sample sizes, segmentation controls, and integrated analytics to run tests that actually hold up. Explore Spotler Website Personalisation and see how built-in A/B testing fits into your wider marketing setup.

Frequently Asked Questions

What should I do if my A/B test reaches the required sample size but hasn't reached statistical significance?

If you've hit your pre-calculated sample size without reaching significance, the most likely explanation is that the effect you're testing for is smaller than your minimum detectable effect — or doesn't exist. At this point, the responsible call is to end the test and treat it as a null result rather than extending it indefinitely in search of significance. Use the learnings to inform a bolder variant or redirect your testing effort to a higher-impact page element.

How do I choose the right minimum detectable effect (MDE) before launching a test?

Start by asking what improvement would actually be worth acting on for your business. If a 5% lift in conversions wouldn't meaningfully change your revenue or justify a permanent design change, there's little point designing a test to detect it. A practical starting point for most teams is an MDE of 10–20% relative improvement, which balances realistic ambition with achievable sample sizes. Just be aware that the smaller your MDE, the more visitors you'll need — so set it based on business value, not optimism.

Can I run A/B tests on personalised content without skewing my results?

Yes, but it requires careful audience segmentation from the outset. When testing personalised content, your control and variant must be served to comparable audience segments — for example, both groups should be returning visitors from the same traffic source, not a mix of new and returning users. Mixing segments introduces selection bias that makes it impossible to attribute results to the change you've made. Platforms with built-in audience controls, such as Spotler Website Personalisation, handle this segmentation automatically to keep your test groups comparable.

Is it ever acceptable to stop an A/B test before reaching the pre-calculated sample size?

In most cases, no. Stopping early — even when results look promising — dramatically inflates your false positive rate and is one of the most common ways teams end up acting on misleading data. The only defensible exception is if the variant is causing clear harm, such as a significant drop in conversions or a technical error affecting user experience. Outside of that scenario, commit to your pre-calculated endpoint and resist the temptation to call it early.

How do I prioritise which pages or elements to A/B test first?

Prioritise pages that combine high traffic volume with high business impact — your homepage, primary landing pages, and key conversion steps such as checkout or contact forms are typically the best starting points. A simple prioritisation framework is to score each test candidate on three factors: potential impact on your primary metric, the volume of traffic it receives, and how confident you are in the hypothesis behind the change. Start with the highest-scoring combination rather than testing elements at random.
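A scoring framework like this fits in a few lines. In the sketch below, the 1-to-5 scale, the choice to multiply the three factors, and the example candidates are all assumptions made for illustration; some teams prefer to add or weight the scores instead.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    impact: int      # expected effect on the primary metric, 1 (low) to 5 (high)
    traffic: int     # volume of visitors the page receives, 1 to 5
    confidence: int  # strength of the hypothesis behind the change, 1 to 5

    @property
    def score(self):
        # Multiplying means a weak score on any one factor drags the total down.
        return self.impact * self.traffic * self.confidence


# Hypothetical backlog of test ideas.
backlog = [
    Candidate("Homepage headline", impact=4, traffic=5, confidence=3),
    Candidate("Contact form length", impact=5, traffic=3, confidence=4),
    Candidate("Footer link colour", impact=1, traffic=5, confidence=2),
]

for candidate in sorted(backlog, key=lambda c: c.score, reverse=True):
    print(candidate.score, candidate.name)
```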

What's the difference between A/B testing and multivariate testing, and when should I use each?

An A/B test compares two versions of a single variable — for example, two different headlines — making it straightforward to attribute any difference in performance to that one change. Multivariate testing simultaneously tests multiple elements and their combinations, which can be powerful but requires significantly more traffic to produce reliable results for each combination. For most teams, A/B testing is the right default. Multivariate testing is only worth considering when you have very high traffic volumes and need to understand how multiple elements interact with each other.
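The traffic cost of multivariate testing follows from simple arithmetic. The sketch below assumes a full-factorial test in which every combination is treated as its own variant needing the same sample; the function name and the figures are illustrative.

```python
from math import prod


def multivariate_visitors(options_per_element, visitors_per_combination):
    """Return (number of combinations, total visitors needed) for a full-factorial test."""
    combinations = prod(options_per_element)
    return combinations, combinations * visitors_per_combination


# Hypothetical test: 2 headlines x 3 images x 2 button labels.
print(multivariate_visitors([2, 3, 2], 14000))

# A plain A/B test is the special case of one element with two options.
print(multivariate_visitors([2], 14000))
```

Three modest elements already produce twelve combinations, which multiplies the traffic requirement sixfold compared with a two-variant test at the same per-variant sample size.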

How should I document and share A/B test results across my team?

A lightweight test log goes a long way towards building a culture of evidence-based decision-making. For each test, record the hypothesis, the pre-test sample size calculation, the primary metric, the duration, the result, and the action taken — whether that's implementing the winner, discarding both variants, or running a follow-up test. Sharing this log across your marketing and product teams prevents duplicate testing, surfaces patterns over time, and ensures that winning variants are actually implemented rather than forgotten.
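As a starting point, the fields listed above map onto a small record structure. This is a minimal sketch: the class name, field names, and example values are assumptions, and a shared spreadsheet with the same columns would serve equally well.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExperimentLogEntry:
    hypothesis: str
    primary_metric: str
    planned_visitors_per_variant: int  # from the pre-test sample size calculation
    duration_days: int
    result: str
    action_taken: str


# Hypothetical entry, for illustration only.
entry = ExperimentLogEntry(
    hypothesis="A shorter contact form will increase completions",
    primary_metric="form completions",
    planned_visitors_per_variant=14000,
    duration_days=21,
    result="no significant difference",
    action_taken="discarded variant; planning a bolder follow-up test",
)

# Serialise to JSON so entries can be appended to a shared log file.
print(json.dumps(asdict(entry), indent=2))
```

Recording the planned sample size before launch is the part that matters most, because it makes any early call visible after the fact.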

