How to Run an A/B Test


A/B testing is one of the most powerful tools in the arsenal of a value-based designer. Its primary purpose is to measure the economic impact of a change, so you can decide whether to make a new design decision. In its simplest form, you split traffic evenly between control (your original design) and variant (representing your design decision), and you measure what happens.

We’ve run over 600 experiments for all sorts of clients since 2013, and the number-one question I hear from store owners, designers, and developers is: what should I test?

Creating an A/B test isn’t so hard. You can use Google Optimize to do so. But without good ideas, you have no reason to be testing in the first place.

We all want to know what works, of course, but it’s not quite so easy as changing your headline and watching the numbers go up. Fortunately, there is a surefire winning strategy: research.

Good testing ideas take time and effort to generate, and research is the best and most reliable way to do so. If you’re wondering how to research, we have several evergreen guides published right here. When you’ve researched a little and you’re ready to turn that research into revenue-generating design decisions, you’ll want to synthesize it.

Turning research into a test

When you’re making sense of research, ask two questions:

  • Am I observing a trend? One interview is all well and good, but it may be an outlier. Better to confirm your findings with other sources first – even if it’s just a couple.
  • Does this represent a lost revenue opportunity for my store? If you find any revenue leaks in your research, you need to plug them. Some of these mean fixing bugs (a page loading slowly on mobile, say); others mean running a test to hedge risk (such as revising a pitch, or creating a totally new value proposition).

Here are a few questions that you should ask of any testable idea:

  • What research backs up this change?
  • What does this change have to do with the business’s overarching goals?
  • What is the next step on this page?
  • In what ways is the next step clear to the customer?
  • Does the page appear credible? How?
  • Does the change aid credibility?
  • Does the change improve usability? How?

These questions largely reflect the ones that ConversionXL and Baymard Institute both use when analyzing various ecommerce sites for what works and what doesn’t.

The hypothesis

A/B testing is an application of the scientific method to your design process – and good scientific processes start with a hypothesis.

In testing, a hypothesis contains three components:

  1. The design change. Rewording the headline, changing your CTA, reworking the layout, etc.
  2. The goal. Conversions from trial, average revenue per user (ARPU), etc.
  3. The intended lift. An increase of 5%, say. Note that this is a specific number: you cannot run a test without a specific magnitude of change in mind.
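
To make this concrete, a complete hypothesis might read something like: “Rewording the home page headline to emphasize free shipping will increase completed checkouts by 10%.” Design change, goal, intended lift – nothing more.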

A hypothesis does not need to contain:

  • The reason why. “The text is more persuasive.” Oh, really? What about it is more persuasive? Do you really know your customers’ motivations with that level of granularity? Are you a mind-reader?
  • The impact on other goals. You have a goal and its expected change. That’s it. Other goals are going to be harder to measure, because their initial rates differ, so the relative change will differ as well (for example, proportionally more people sign up for your mailing list than convert from trial).
  • The impact on other elements on the page. You’ve changed something in your masthead. Will it result in people paying less attention to something further down the page? That matters for your design, sure, but not for your hypothesis.
  • Literally anything else. You are probably thinking many things right now. Trust me, your hypothesis does not contain those things, either. It contains only those three things.

So, if you’re just starting out with your first tests, you may feel tempted to skip creating a hypothesis. Don’t. Skipping it will be disastrous for your testing practice.

Why a hypothesis is essential

A hypothesis keeps you focused on the only goal that matters to your – or any – organization: revenue generation. It also allows you to gain the real benefit from testing: moving away from internal debates, speculation, and blind faith.

You can’t test rationally without a hypothesis. A hypothesis is the goal that you rally an organization behind. With a hypothesis, you have a clear change, a rational outcome, and an expectation around how it will economically perform. Without a hypothesis, you’re doing the exact same toxic stuff you were doing before you started testing: testing to settle a debate, testing because a change might be “more persuasive”, or testing because other cool people are testing. They’re cool because they’re acting coolly, not because they’re testing.

Creating hypotheses

A hypothesis is the exact connection point between testing and design research. Why? Because it includes the precise design change you’ll be making – in the service of an A/B test.

How to choose a goal

You can track many things in your testing framework, but to calculate a sample size and attach your goal to a specific design decision, only one goal can act as the primary goal.

Let’s say you have a clear sense of what to change in a test. How do you measure it?

This should be obvious. It’s not. Let’s dive into how to configure the best goals possible for a test, and how to pay attention to them when it’s time to call the final result.

What goals to configure

First, Revenue

Your tests should exist to generate revenue for your business – otherwise, there’s no reason for you to run a test. Revenue is the first and most important goal for any A/B test.

Revenue tracking exists in all major A/B testing frameworks; here are how-to guides for VWO and Optimizely. You should add revenue tracking to the snippet on your purchase confirmation page.

When you’re in your framework, then, add revenue tracking as your primary goal, and set the goal to the correct URL. It should look a little like this, configured to your own site:

[Screenshot: revenue tracking goal configuration]

Next, Confirming the Sale

Just as you have to install revenue tracking on the confirmation page, you should also create a second goal that tracks views of this page. This allows you to confirm that you’re getting enough qualified, wallet-out traffic.

The page should be unique to the customer’s experience of the site. In a SaaS app, your thank you page shouldn’t be the same URL as the product’s dashboard. In an ecommerce site, your thank you page shouldn’t be the same as the order status page.

Then, Track Every Step in the Funnel

You should configure goals that match every step in your funnel, and take them a lot less seriously than the goals that actually close a sale. This gives you a complete portrait of your funnel, which is valuable for spotting significant drop-off points that might be best addressed with further testing.

Here’s how these goals look for a typical ecommerce site:

[Screenshot: funnel-step goals for a typical ecommerce site]

Finally, Engagement

Both frameworks have “engagement” as a goal, to show the proportion of customers that actually engage with a page. This is basically a freebie, so add it just to make sure the test is generating data.

I don’t think engagement is a valuable metric otherwise – not even on blogs or news sites. If you must measure your readers’ attention – which, to be clear, is very rare in A/B testing – it’s much better for you to come up with more granular metrics.

Bonus for SaaS: Logins

Create a goal that tracks people who log in. Why? Because your funnel doesn’t need to preach to the converted – and there’s a huge difference between disqualifying 70% and 95% of the total traffic coming to your home page.

When calculating sample size, then, use the total number of hits minus the number of views to your login page as your metric. If, say, your home page gets 100,000 hits a month and 70,000 of those are existing customers heading to the login page, only the remaining 30,000 visits count as qualified traffic.

How to configure a goal

A goal can be configured in a variety of ways, and every major testing framework will have a robust set of criteria for you to use.

CSS Selectors

This tracks clicks and taps on any CSS selector you desire. Here’s a huge reference of them. These are great for:

  • Clicks to your site’s navigation.
  • Clicks to your site’s footer.
  • Clicks anywhere on a pricing <div>, measured against clicks on actual CTA buttons within that <div>.
  • Clicks to a primary CTA button that’s scattered in multiple places on your home page.
  • Clicks to an add to cart button, when there are multiple ways to check out (think PayPal, Amazon, on-site, etc).
  • Attempts to enter a search query, versus submissions of that query.
  • Focuses on a form field, such as the field for entering an email address. (This is great for fixing validation bugs!)

Overall, CSS selector goals are great for addressing one-off issues with the usability of the site – and they’re especially great for validating the insights that you get from your heat maps.
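
If it helps to picture what a selector-based click goal actually does, here’s a minimal client-side sketch in TypeScript. The .cta-primary selector is just an example, and trackGoal is a hypothetical stand-in for whatever call your framework makes when a goal fires:

// Minimal sketch of a CSS-selector click goal. trackGoal is a
// hypothetical stand-in for your framework's own tracking call.
function trackGoal(name: string): void {
  console.log(`goal fired: ${name}`);
}

// Fire the goal whenever anyone clicks a primary CTA, no matter
// where that button appears on the page.
document.querySelectorAll<HTMLElement>('a.cta-primary, button.cta-primary')
  .forEach((el) => {
    el.addEventListener('click', () => trackGoal('cta-primary-click'));
  });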

Form submissions

Frameworks will track form submissions to a specific URL as well. You’ll want to enter the URL that the form submits to – its action attribute – if it has one.

If it doesn’t have a direct target, and you don’t have any other forms that someone can use at this point in the funnel, use a wildcard with an asterisk so it can capture all form submissions everywhere.

Page views & link clicks

This is the most common goal by far: tracking views of a specific page. And it comes in two forms.

  • Page views track hits to a specific page, regardless of source.
  • Link clicks track clicks on links to a specific page, made from the test page you’re measuring.

I use these goals to measure each step in a funnel, usually as page view goals. Why? Ultimately, it doesn’t matter how much tire-kicking a person does, as long as they convert. And I can get a sense of the quantity of tire-kicking by running a heat map.

You can measure views to a specific page by indicating the exact URL (“URL matches”, among the string-matching options below). You can measure views to many pages by using the other string-matching options:

  • “Matches pattern” uses wild-card matching to specify a page. For example, all URLs with blah can be specified with *blah*. All URLs with blah and bleh can be specified with *bl?h*.
  • “Contains” returns all URLs containing that string. For the blah example above, simply typing blah will do the same thing.
  • “Starts with” and “ends with” are what you think – but “starts with” needs to begin with the exact protocol and subdomain. “https://example.com” will only match “https://example.com”, not “https://www.example.com” or “http://example.com”.
  • “URL matches regex” uses regular expressions to match your URLs, which is terrific for highly complex string matching. Rather than spend ten thousand words teaching you about regular expressions, I refer you to Reuven Lerner’s authoritative source on the subject.
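
If you’d rather see these match types as code, here’s a rough TypeScript approximation of how they behave. It’s a sketch of the general idea, not a reimplementation of any framework’s exact matching rules:

// Rough approximations of the four match types – not any framework's exact rules.
const url = 'https://example.com/blog/blah-post/';

// "Matches pattern": * stands in for any run of characters
// (the single-character ? wildcard is left out of this sketch).
const matchesPattern = (pattern: string): boolean =>
  new RegExp('^' + pattern.split('*').map(escapeRegex).join('.*') + '$').test(url);

// "Contains", "starts with", and "ends with" map onto plain string methods.
const contains = (s: string): boolean => url.includes(s);
const startsWith = (s: string): boolean => url.startsWith(s); // include the exact protocol!
const endsWith = (s: string): boolean => url.endsWith(s);

// "URL matches regex" hands you the full power of regular expressions.
const matchesRegex = (re: RegExp): boolean => re.test(url);

// Escape regex metacharacters in the literal parts of a wildcard pattern.
function escapeRegex(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

console.log(matchesPattern('*blah*'));           // true
console.log(contains('blah'));                   // true
console.log(startsWith('https://example.com'));  // true
console.log(endsWith('-post/'));                 // true
console.log(matchesRegex(/\/blog\/.+-post\//));  // true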

What if your goal isn’t firing?

You might be using the wrong pattern for your URLs, or the target URL may not have your framework’s tracking code.

In the former case, you need to change the goal so it accurately reflects what’s happening on the site – and you probably need to flush your test’s data and start over.

In the latter case, you need to install the tracking code on the corresponding test pages. Remember that VWO’s tracking code should be placed at the end of the <head> tag, and Optimizely’s should be placed at the beginning of it. Here are the instructions for installing Google Optimize.

Measure as Carefully as You Test

Don’t forget that it’s quite possible to over-measure – and act on the wrong insights.

First, remember that revenue outweighs all other goals. You are testing to generate revenue, always.

Next, revenue-related goals outweigh other goals. Hits to a confirmation page correspond directly to revenue, as long as you aren’t providing something for free.

Finally, rank behavior-based goals last. Clicks on navigation, engagement, and the like are sometimes useful for framing design decisions, but consider them more like a part of your research process than an actual testing outcome.

How to Calculate Intended Lift

So, you have your primary goal. You have (presumably) assessed the conversion rate of this goal in Google Analytics. (If you haven’t, go do that now). Now, you need to calculate the minimum detectable effect, or the intended lift, of the test.

This depends heavily on the traffic you currently get. The more traffic you have, the smaller the lift you can detect – which means the more likely you are to get small wins from your testing, in addition to home runs.

You calculate your intended lift backwards: take your current traffic, your conversion rate, and the intended duration of the test (which should be longer than 1 week and shorter than 4 weeks), and throw those numbers into Evan Miller’s tool for calculating sample size.

Optimizely has a ton more on minimum detectable effect and how to determine whether you have enough traffic to get reliably significant results.

Prioritization

How do you know what to test next? We wrote about this in our definitive guide to prioritization.

Minimum sample size

Testing is worthless without valid statistical significance for your findings – and you need enough qualified, wallet-out traffic to get there. If you want a lengthy mathematical explanation for this, here’s one.

What does this mean for you in practice? Before you run any test, you need to calculate the number of people that should be visiting it. Fortunately, there’s an easy way to calculate your minimum traffic. Let’s go into how to do this!

Maximum timeframe

First off, you should be getting the minimum traffic for an A/B test within a month’s time. Why a month? Several reasons:

  • It won’t be worth your organization’s time or resources to run tests so infrequently. If each test takes months to conclude, you are unlikely to get a high ROI from A/B testing within a year of effort.
  • One-off fluctuations in signups – either by outreach campaigns, holidays, or other circumstances – are more likely to influence your test results.
  • Your organization will be more likely to spend time fighting over the meaning of small variations in data. That is not a positive outcome of A/B testing.
  • You will not be able to call tests unless they’re total home runs, for reasons I’ll describe below.

Sample size is calculated with two numbers:

  1. Your conversion rate. If you don’t have this already calculated, you should configure a goal for your “thank you” page in Google Analytics – and calculate your conversion rate accordingly.
  2. The minimum detectable effect (or MDE) you want from the test, in relative percentage to your conversion rate. This is subjective, and contingent on your hypothesis.

A note on minimum detectable effect

The lower the minimum detectable effect, the more visitors you need to call a test. Do you think that a new headline will double conversions? Great, your minimum detectable effect is 100%. Do you think it’ll move the needle less? Then your minimum detectable effect should be lower.

Put another way, if you want to be certain that a test causes a small lift in revenue-generating conversions – let’s say 5% – then you will need more traffic than a hypothesis that causes your conversions to double. This is because it’s easier to statistically call big winners than small winners. It also means that the less traffic you have, the fewer tests you’ll be able to call.

You should not reverse-engineer your minimum detectable effect from your current traffic levels. A test either fulfills your hypothesis or it doesn’t, and science is historically quite unkind to those who try to cheat statistics.

How to calculate sample size

I use Evan Miller’s sample size calculator for all of my clients. You throw your conversion rate and MDE numbers in there, along with the level of confidence you want your test to reach.

I recommend at least 95% confidence for all tests. Why? Because anything less means you still have a high chance for a null result in practice. Lower confidence raises the chance that you’ll run a test, see a winner, roll it out, and still have it lose in the long run.

Let’s say your conversion rate is 3% and your hypothesis’s MDE is 10% – so you’re trying to run a test that conclusively lifts your conversion rate to 3.3%. Here’s an example of how I fill this form out.

Note that the resulting number there is per variation. Are you running a typical A/B test with a control and 1 variant? You’ll need to double the resulting number to get your true minimum traffic. Are you running a test with 3 variants? Quadruple the number. You get the idea. This can result in very high numbers very quickly.
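
If you’d like to sanity-check the math yourself, here’s a small TypeScript sketch using the standard two-proportion sample size formula. It assumes 95% confidence and 80% power, so its output should land in the same ballpark as Evan Miller’s calculator, though not necessarily match it to the visitor:

// Approximate sample size per variation, using the standard
// two-proportion formula. Assumes 95% confidence (two-sided) and
// 80% power; results should roughly track Evan Miller's calculator.
function sampleSizePerVariation(baselineRate: number, relativeMde: number): number {
  const zAlpha = 1.96;  // 95% confidence, two-sided
  const zBeta = 0.8416; // 80% power
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde);
  const pBar = (p1 + p2) / 2;

  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));

  return Math.ceil((numerator ** 2) / ((p2 - p1) ** 2));
}

// 3% baseline conversion rate, 10% relative MDE (3% → 3.3%):
const perVariation = sampleSizePerVariation(0.03, 0.10);
console.log(perVariation);     // ≈ 53,000 per variation with this formula
console.log(perVariation * 2); // total minimum traffic for control + one variant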

If you see a number that’s clearly beyond the traffic you’d ever expect to get in a month, work on one-off optimizations to your funnel instead. Don’t A/B test. It’ll be a waste of your company’s time and resources. Testing isn’t how the struggling get good, it’s how the good get better.

“But nickd, I can launch a giant AdWords campaign, right?”

You could, yes. But first, ask yourself these questions:

  • Is AdWords traffic more or less likely to convert than organic traffic?
  • Is AdWords traffic sustainable across multiple years of A/B tests?
  • Are AdWords customers the right kinds of customers for my business?

Put another way: will you get a decent ROI on AdWords, or are you just running a big ad campaign so you can juice the numbers of your A/B test?

A better way: outreach, writing, and PR

If you have too little traffic for testing right now, there is fortunately a surefire way to get more. You should write about your field of expertise, guest post and podcast on others’ sites to increase your reach, and overall educate your audience about your specific point of view.

I have found no better substitute for this – and lord knows I walk the walk. If you want traffic, you need to toot your own horn, period.

Building the test

Most tests are built using a framework’s WYSIWYG in-page editor. In there, you click on certain elements, change what needs to be changed in a given variant, and the framework’s tracking code uses JavaScript DOM-editing fanciness to deliver the variant to the right proportion of your customers.

This is terrific for smaller tests on coloration, copy, and imagery. But what happens when you want to test a more radical rework? Or if you have some more dynamic, complicated content? That’s when you have to get others involved – and yes, you’ll have to get your hands dirty with code.

Working with code is a core part of any solid A/B testing strategy. You can’t just change calls to action and expect them to land every single time. You will need to work with code for most of your tests.

The Context

For a very long time, most testing efforts were hard-coded and self-hosted. People would create their own frameworks that handled all the work of delivery, tracking, and reporting. For example, take a look at A/Bingo, a Ruby framework for running your own tests. That’s one of the more sophisticated open source ones, should you still wish to roll your own.

Around the same time, Google Website Optimizer (since rolled under the Analytics tarp as Content Experiments) did roughly the same thing – but there was no WYSIWYG component, it was hard to analyze (this is Google Analytics, after all, where nothing is allowed to be easy or convenient), and most of your experiments still needed to be server-side.

We’ve come a long way since then. You can click, change text, add some goals, and deploy your test. If it weren’t for the convenience provided by modern testing frameworks, Draft Revise wouldn’t exist.

Worst case, you can leave the analysis up to the two big testing framework providers now. You don’t have to sweat the statistical analysis (much) or wring your hands over whether a variant is actually generating more revenue for you. That’s extremely good – not only for your own testing efforts, but for getting more people in the fold as well.

Yet this is still not enough.

When The Framework Alone Isn’t Enough

When isn’t your framework’s WYSIWYG editor enough?

  • Any dynamic content. Testing pricing? You’ll want to haul a developer in. Testing something that changes from region to region? That won’t work, either. WYSIWYG editors are only good at handling static content – unless you plan to remove the control’s element entirely and code it by hand, which could trigger any of a litany of JavaScript bugs.
  • Big, fancy JavaScript-y things. Got a carousel that dances across the screen for no good reason? You are likely bracing for a world of hurt if you want to manually rewrite all the JavaScript to turn that off directly in your framework. This is a case where it’s not impossible to use a WYSIWYG editor, but development work will be substantially easier for you if you develop your own solution.
  • Multi-page tests. Do you want to test changing the name of your product from, say, “Amazon Echo” to “Amazon Badonkadonk?” I mean, it’s your company, Jeff, but it seems like you might have “Amazon Echo” listed in many different places, including in images and <title> tags. Frameworks have become terrific at tracking across sessions, but unless you have the whole site set to declare Amazon Echo as a variable, and you’re ready to swap it out at a moment’s notice, a single line of injected JavaScript won’t do the trick – you’ll probably want something on the back end to change it out everywhere.
  • Massive site reworks. Did you just redesign your site, and you’re looking to measure the new thing’s performance against the old thing? Yeah, I would strongly caution against using a WYSIWYG editor for such an undertaking.

This is a lot, right? And some of it seems pretty critical. Bold reworks? Pricing changes? That all is testing 101 for most of us. My take: if you want to dabble in testing by doing most of the stuff with the least impact, by all means rely solely on a framework. But if you really want to build a solid practice that will endure no matter what you want to do to your site, you’re going to have to work with code – or enlist the help of someone who can do it for you.

Redirecting the page: the 20% solution

All frameworks should let you create a split page test, which shunts the right proportion of visitors to a whole separate variant page. Put another way, if you run control on /home and variant on /home/foo, 50% of your visitors should be automatically redirected to /home/foo, and measured accordingly. This allows you to bypass the WYSIWYG editor for a home-rolled solution.

You can always create a new page and forward people there – and that allows you to make whatever changes you’d like. This only works for tests on a single page, though: if you’re making changes that affect any other pages on your site (say, with pricing), you’ll want to go with something a little more involved. Fortunately, most of the work is in your first-time setup, and you’ll be able to reuse it in your future testing efforts.

Creating a whole new site: the big rework solution

For situations where you’re taking the wrecking ball to your whole site and evaluating a new version, you can create a whole new set of static pages at the root of the site. (Rails allows you to do this in a /public/ folder.) Keep the control’s home page at /home/index.html and put the variant’s at /welcome/index.html – and then create synonyms for the other pages: /plans/ instead of /pricing/, /join/ instead of /signup/, etc.

Then, use your split page test to send your variant’s traffic to /welcome/ instead of /home/. This is particularly good for SaaS businesses whose funnels are typically three pages (home, pricing, sign up); for ecommerce sites that involve many different pages and a lot of dynamic functionality, you may want to deploy to a whole different server, and redirect people to a subdomain like (say) shop.example.com instead of example.com.

GET queries: the 80% solution

For most substantial tests, I recommend clients set up a solution that redirects the page using a GET query.

For those who don’t know, a GET query is the appended string that looks like ?foo=1 at the end of a URL, which allows a page to grab variables for its own dynamic processing. For example, on most New York Times articles of a sufficient length, appending ?pagewanted=all to the end of the URL allows you to view the whole thing on one page. Additional GET queries are delimited with an ampersand, so ?pagewanted=all&foo=1 sets two variables: pagewanted to all and foo to 1. You can then pull GET queries in using your dynamic language of choice.

You’ll be employing a split page test here as well: just redirect your variant’s traffic to something like http://example.com/?v=1. Then, index.php pulls in the GET variable, determines that it’s been set to 1, and serves a variant (or changes the appropriate content) instead.

So, here’s some really bare-bones PHP:

<?php
// Read the variant flag from the query string (e.g. /?v=1).
$variant = isset($_GET['v']) ? $_GET['v'] : '0';

if ($variant == '1') {
    // index-variant.php should also append ?v=1 to every link that
    // points to a local site page, so visitors stay in the variant.
    include 'index-variant.php';
} else {
    include 'index-control.php';
}

Two tactics are important here:

  • I always maintain two different pages that allow me to vet what’s on control and what’s on variant. That way, I don’t need to go mucking around in the actual code. I can always use variables to switch stuff like pricing out on the pages themselves.
  • See the comment in that code about appending ?v=1 to every local link? That’s so we can keep people on a variant through the whole conversion funnel. You don’t just want people to hit index.php?v=1; you want people to go to /pricing/?v=1, /signup/?v=1, etc. (A client-side sketch of this follows below.)
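
Here’s one way to handle that link rewriting on the client – a minimal TypeScript sketch that assumes the variant flag is already present in the URL. You could just as easily do this server-side in your templates:

// If this visitor is in the variant, carry ?v=1 onto every internal
// link so they stay in the variant for the whole funnel.
if (new URLSearchParams(window.location.search).get('v') === '1') {
  document.querySelectorAll<HTMLAnchorElement>('a[href]').forEach((link) => {
    const target = new URL(link.href, window.location.origin);
    if (target.origin === window.location.origin) {
      target.searchParams.set('v', '1');
      link.href = target.toString();
    }
  });
}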

What if the customer notices ?v=1 being appended, and deletes it in a refresh for some outlandish reason? Frameworks throw a cookie that persists across a session, and they’ll recognize that you’re supposed to be receiving a variant, so ?v=1 will come back – like a very persistent cockroach of URL modification.

What if the customer doesn’t have cookies enabled? They won’t get the test in the first place, and they’ll always be served control. Their decision will keep them from being counted in our test’s final analysis as well. (This goes for any tests you run, ever – not just these crazy ones.)

This Should Cover Most Things

I’m sure that you’re reading this and thinking of some situation where everything would catastrophically break for you. And that may be true for your context. But a little upfront reworking will go a long way in testing – and it’ll allow you considerably more freedom from the constraints of your testing framework’s WYSIWYG editor. Doing this work is critical to ensuring that you have a durable, long-term testing strategy that makes you money into perpetuity.

Note, in each of these cases, how the testing framework is never fully cut out of the picture. You’re still using it to redirect traffic and, crucially, gather insights into what your customers are doing. You’re still using it to plan and call a test. But you’re offloading the variant generation onto your own plate – which allows you near-infinite latitude in what you can test.

Please trust me here, as I speak from a wealth of experience: you don’t want to come up with a really promising test idea, run into the limitations of your framework, and then get into a political battle about what to do next. It will be enormously frustrating for you.

It’s far easier to have clarity in how to proceed – and creating the tests on your own allows you to proceed with confidence. Otherwise, you’ll probably just deploy the variant to production without testing it. And that’s not really the point of having a testing strategy, now is it?

Learning from tests

Once you get the testing process going, you can even mine past tests for research insights. How do A/B tests themselves give us more research data?

  • Most testing frameworks offer heat & scroll maps of both the control & the variant – so even if you get a winning variant that changes customer behavior significantly, you’re able to maintain an up-to-date heat map.
  • All testing frameworks integrate with analytics software, in order to provide greater insight into the economic impact of your design decisions.
  • And finally, every test you run should go into a database of past test results that you maintain for record-keeping purposes. This allows you to understand what’s worked and what hasn’t in the past – which gives you greater insight into what to test next.

And don’t forget: you should always be researching while tests run. Why? Doing so lets you come up with the next set of tests, from idea to prototype. And maximizing the amount of time that a test is actively running will make testing maximally valuable for your store.

Concluding thoughts

If you have enough traffic to get statistically significant results, you should be doing what you can to optimize your store now. (Better to start now than two weeks before the holidays!) Off-the-shelf themes always need to be improved, and optimization can bump conversion rates by at least 1%. That said, research should always happen for stores of any size, and there isn’t really much of a downside to it.

Testing is daunting for many – not because it’s hard (it isn’t!), but because it requires a mindset shift: one which focuses on customer inquiry and careful, incremental improvement.
