
How to Run an A/B Test

 

You’ve probably heard about A/B testing, but you have no idea where to start. A/B testing is one tool in the arsenal of a value-based designer, and its primary purpose is to measure the economic impact of a change in order to determine whether or not to make a new design decision. In its simplest form, you split traffic evenly between control and variant, and see what happens.
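Conceptually, that even split is just a sticky coin flip. Here's a minimal sketch in JavaScript – an illustration only, not how any particular framework implements it (real frameworks also persist the assignment in a cookie so a returning visitor always sees the same version):

```javascript
// Deterministically assign a visitor to "control" or "variant" based on
// a stable visitor ID, so the split stays consistent across page loads.
function assignBucket(visitorId) {
  // Simple string hash; any stable hash function works for this sketch.
  let hash = 0;
  for (const ch of visitorId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  // Even split: half of the hash space goes to each bucket.
  return hash % 2 === 0 ? "control" : "variant";
}
```

Because the bucket is derived from a stable ID rather than a fresh random number, the same person lands in the same group every time – which is what makes the measurement valid.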

I’ve run over 500 experiments for all sorts of clients since 2013, and the number-one question I hear from store owners, designers, and developers is: what should I test?

Creating an A/B test isn’t so hard (you can use Google Optimize to do so), and most frameworks bundle a toolkit for analyzing and acting on your findings. But without good ideas, you have no reason to be testing in the first place.

We all want to know what works, of course, but it isn’t as easy as changing your headline and watching the numbers go up. Fortunately, there is a surefire winning strategy: research.

Good testing ideas take time and effort to generate, and research is the best and most reliable way to generate them. For one, research results in a higher success rate – which means more revenue for your store. And in ecommerce in particular, your off-the-shelf theme needs to be optimized in order to perform as well as it possibly can. With this in mind, I’ll present the best ways to quickly and cheaply conduct research for your own store.

What do we mean by research?

First, let’s get on the same page. What is research? I define research as any information-collecting process that requires direct communication with people who use – or could use – your product. Research can take many forms; here are just a few:

  • Literally calling and interviewing customers. Recruit people on your website and get on the phone with them. Talk about how they use your product, and try to assess how it fits into the broader context of their lives.
  • Tracking people. Heat maps, analytics, scroll maps, and referral tracking are all forms of this. If you have looked at data and acted on it, that’s a form of research.
  • Surveying people. Throw something on Wufoo or Typeform, call it your business’s annual survey, and ask questions about their use, demographics, and desires.
  • Usability testing. You have a prototype or an existing product. You want to see how people use it. So you sit someone down at a computer, ask them to complete a predetermined set of tasks, and assess how successful (or unsuccessful) they were. I use UserTesting to run usability tests.

What to do first

Research as much or as little as you’d like, but you should always use research to support your testing decisions. There are a few things you should do no matter what:

Heat & Scroll Maps

You should always run heat and scroll maps on key pages in your funnel, in order to understand where people are clicking and how many of them are scrolling. I typically use Hotjar to do this, although Crazy Egg, Clicktale, and Mouseflow do roughly the same thing.

Heat maps teach you where people click, and scroll maps help for longer pages – they show whether you’re actually capturing customers’ attention. If people are clicking frequently on an element that isn’t clickable, you probably need to link it somewhere. If people stop scrolling after a given section, you probably need to rework or remove it.

All elements on the page should support conversion and revenue generation, period. Heat maps often confirm what we privately suspect: customers simply don’t care about the ancillary bells & whistles we add to make ourselves feel good about our features, instead focusing on the benefits and outcomes that our product can provide.

I can’t tell you how many times I’ve stared at a heat map where the product’s big new marquee feature was a huge dark spot on the page, with a click rate best measured in scientific notation. Most people visit your product page, vet the price, and do little beyond skim. Heat maps teach us that your page needs to be as tight and conversion-focused as humanly possible.

Google Analytics

Next, install Google Analytics on the site, if it’s not there already. No matter what, you need to spend some time fine-tuning your Google Analytics installation to make it clear how people are really behaving.

I guarantee you that your Google Analytics installation has significant room for improvement. You probably installed it and forgot about it. Or you just look at how many people are hitting the site.

Google Analytics is a byzantine horror. It’s insanely painful and distracting to work with in any detailed fashion. Nobody wants to do it. This means, as an A/B tester, you have a hugely expensive problem that people with money want solved.

It behooves you to wrangle the beast. Learn Google Analytics, establish solid goals that reflect conversion and revenue, and monitor it continuously to ensure that you’re meeting your metrics. You can use Google Analytics to learn:

  • What people are doing. Where are they going? How do they typically interact with your funnel?
  • The impact of every feature on every page. How focused are each of your funnel’s features around conversion and revenue generation? Heat and scroll maps help with this, too.
  • Whether specific browsers are hurting conversion. Go to “Audience → Technology → Browser & OS report” in Google Analytics. This tells you whether a specific browser or operating system is leaking revenue for you. Now you have a development problem on your hands.
  • Whether any pages are too slow. Speed absolutely matters, especially with mobile. Go to “Behavior → Site Speed → Page Timings” and check if there are any outliers. Fix them. You will make money.

ConversionXL has more resources on Google Analytics: how to configure it and how to set up goals, segments, and events.

Heuristic Evaluations

“Heuristic evaluation” sounds like something you pay a doctor to do, but it’s not that hard. It means you create a checklist of criteria that a website should meet in order to follow conversion best practices, then evaluate whether your site passes or fails each criterion.

The notion of heuristic evaluation goes back to the early days of usability research in the mid-Eighties. It’s likely that you have a practice quite similar to it in your own organization: think of it as unit testing, KPIs, or brand guidelines, and you’re on the right track.

Here are some of the oldest criteria for heuristic evaluation, by Jakob Nielsen. Craft a series of heuristics that best fit your site’s revenue goals, and get at least two others to evaluate the site alongside you.

Surveys

Finally, run a survey of prospective customers – and get post-purchase surveys set up for all future orders. Why? Because any research-driven process should provide a mixture of quantitative (analytics, heat maps, A/B testing results, etc) and qualitative (stories, interviews, etc) information. Gathering qualitative insights can teach you about your customers in ways you couldn’t even imagine.

You can use Typeform or Wufoo to configure the survey and collect the responses. Just include a survey callout at the top of your home page, and put respondents in a contest to win a free month of your service.

Some interesting questions include:

  • Why did you choose us?
  • What do you use us for?
  • What value have you gotten out of our product lately?
  • Did you take a look at any of our competitors?
  • Are there any aspects to our service that you find frustrating, or which you’d be likely to change?
  • How easy was it to check out? (This is a great one to put on the “thank you” page!)
  • What new things would you like to see from us?
  • How did you hear about our service?
  • On a scale from 0 to 10, how likely are you to recommend us to a friend or colleague?

All of these should give you ample information for crafting the right pitch, addressing the right concerns in a marketing page, and shepherding people through the process.

Synthesis

Now that you have some research, you need to make sense of it. Synthesis is the process of generating insights from your research. It forms the core of the design process – yet it is something you and I do every single day, regardless of our career paths or design know-how.

When making sense of research, ask two questions:

  • Am I observing a trend? One interview is all well and good, but it may be an outlier. Better to confirm your findings with other sources first – even if it’s just a couple.
  • Does this represent a lost revenue opportunity for my store? If you find any revenue leaks in your research, you need to plug them. Some of these mean fixing bugs (a page loading slowly on mobile, say); others mean running a test to hedge risk (such as revising a pitch, or creating a totally new value proposition).

Here are a few questions that you should ask of any testable idea:

  • What research backs up this change?
  • What does this change have to do with the business’s overarching goals?
  • What is the next step on this page?
  • In what ways is the next step clear to the customer?
  • Does the page appear credible? How?
  • Does the change we’re trying to make aid in credibility?
  • Does the change improve usability? How?

These questions largely reflect the ones that ConversionXL and Baymard Institute both use when analyzing various ecommerce sites for what works and what doesn’t.

A/B testing is an application of the scientific method to your design process – and good scientific processes start with a hypothesis.

In testing, a hypothesis contains three components:

  1. The design change. Rewording the headline, changing your CTA, reworking the layout, etc.
  2. The goal. Conversions from trial, average revenue per user (ARPU), etc.
  3. The intended lift. An increase by 5%, say. Note that this involves a specific number. This is because you cannot run a test without a specific magnitude of the change in mind.
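One way to keep yourself honest about those three components is to write every hypothesis down in a fixed shape before you build anything. The field names below are my own illustration, not something any framework requires:

```javascript
// A hypothesis is exactly three things: the change, the goal, and the
// intended lift. Nothing else belongs in it.
const hypothesis = {
  change: "Reword the headline to lead with free shipping", // the design change
  goal: "Completed checkouts (views of the thank-you page)", // the goal
  intendedLiftPct: 5, // the intended lift: a specific relative number
};
```

If you can't fill in all three fields – especially a specific number for the lift – you don't have a hypothesis yet.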

A hypothesis does not need to contain:

  • The reason why. “The text is more persuasive.” Oh, really? What about it is more persuasive? Do you really know your customers’ motivations with that level of granularity? Are you a mind-reader?
  • The impact on other goals. You have a goal and its expected change. That’s it. Other goals are going to be harder to measure, because their initial rates differ, so the relative change will differ as well (for example, more people are proportionally signing up for your mailing list than converting from trial).
  • The impact on other elements on the page. You’ve changed something in your masthead. Will it result in people paying less attention to something further down the page? That matters for your design, sure, but not for your hypothesis.
  • Literally anything else. You are probably thinking many things right now. Trust me, your hypothesis does not contain those things, either. It contains only those three things.

So, for those of you who are just starting out with your first tests: you may feel it necessary to skip the creation of a hypothesis. Don’t. It will be disastrous for your testing practice.

Why a Hypothesis is Essential

A hypothesis keeps you focused on the only goal that matters to your – or any – organization: revenue generation. It also allows you to gain the real benefit from testing: moving away from internal debates, speculation, and blind faith.

You can’t test rationally without a hypothesis. A hypothesis is the goal that you rally an organization behind. With a hypothesis, you have a clear change, a rational outcome, and an expectation around how it will economically perform. Without a hypothesis, you’re doing the exact same toxic stuff you were doing before you started testing: testing to settle a debate, testing because a change might be “more persuasive”, or testing because other cool people are testing. They’re cool because they’re acting coolly, not because they’re testing.

How You Create Hypotheses

A hypothesis is the exact connection point between testing and design research. Why? Because it includes the precise design change you’ll be making – in the service of an A/B test.

How to Choose a Goal

You can track many things in your testing framework, but to calculate a sample size and attach your goal to a specific design decision, only one goal can act as the primary goal.

Let’s say you have a clear sense of what to change in a test. How do you measure it?

This should be obvious. It’s not. Let’s dive into how to configure the best goals possible for a test, and how to pay attention to them when it’s time to call the final result.

What goals to configure

First, Revenue

Your tests should exist to generate revenue for your business – otherwise, there’s no reason for you to run a test. Revenue is the first and most important goal for any A/B test.

Revenue tracking exists in all major A/B testing frameworks; here are how-to guides for VWO and Optimizely. On the confirmation page for your purchase, you should add revenue tracking to your snippet.

When you’re in your framework, then, add revenue tracking as your primary goal, and set the goal to the correct URL. It should look a little like this, configured to your own site:

[Screenshot: revenue tracking goal configuration]
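The exact call differs by framework – the VWO and Optimizely guides linked above have the real syntax – but the shape is always the same: on the confirmation page, push the order total into the framework’s queue. Everything in this sketch (`trackRevenue`, the queue itself) is a hypothetical stand-in:

```javascript
// On the purchase confirmation page, report the order total to your
// testing framework. The queue and the "trackRevenue" command name are
// hypothetical stand-ins; consult your framework's docs for the real call.
function reportRevenue(queue, orderTotalCents) {
  // Queue-style APIs survive the framework's script loading late:
  // commands pushed early get replayed once the script arrives.
  queue.push(["trackRevenue", orderTotalCents]);
  return queue;
}

// On the thank-you page, something like (hypothetical global):
// reportRevenue(window.myFramework = window.myFramework || [], 4999);
```

The one universal rule: this snippet belongs on the confirmation page and nowhere else, or you’ll attribute revenue to visits that never converted.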

Next, Confirming the Sale

Just as you have to install revenue tracking on the confirmation page, you should also create a second goal that tracks views of this page. This allows you to confirm that you’re getting enough qualified, wallet-out traffic.

The page should be unique to the customer’s experience of the site. In a SaaS app, your thank you page shouldn’t be the same URL as the product’s dashboard. In an ecommerce site, your thank you page shouldn’t be the same as the order status page.

Then, Track Every Step in the Funnel

You should configure goals that match every step in your funnel, and take them a lot less seriously than the goals that actually close a sale. This allows you to get a whole portrait of your funnel, which is valuable for assessing whether there are any significant drop-off points which may be best addressed with further testing.

Here’s how these goals look for a typical ecommerce site:

[Screenshot: funnel goal configuration for a typical ecommerce site]

Finally, Engagement

Both frameworks have “engagement” as a goal, to show the proportion of customers that actually engage with a page. This is basically a freebie, so add it just to make sure the test is generating data.

I don’t think engagement is a valuable metric otherwise – not even on blogs or news sites. If you must measure your readers’ attention – which, to be clear, is very rare in A/B testing – it’s much better for you to come up with more granular metrics.

Bonus for SaaS: Logins

Create a goal that tracks people who log in. Why? Because your funnel doesn’t need to preach to the converted – and there’s a huge difference between disqualifying 70% and 95% of the total traffic coming to your home page.

When calculating sample size, then, use the total number of hits minus the number of views to your login page as your metric.

How to configure a goal

A goal can be configured in a variety of ways, and every major testing framework will have a robust set of criteria for you to use.

CSS Selectors

This tracks clicks and taps on any CSS selector you desire. Here’s a huge reference of them. These are great for:

  • Clicks to your site’s navigation.
  • Clicks to your site’s footer.
  • Clicks anywhere on a pricing <div>, measured against clicks on actual CTA buttons within that <div>.
  • Clicks to a primary CTA button that’s scattered in multiple places on your home page.
  • Clicks to an add to cart button, when there are multiple ways to check out (think PayPal, Amazon, on-site, etc).
  • Attempts to enter a search query, versus submissions of that query.
  • Focuses on a form field, such as to enter one’s email address. (This is great for fixing validation bugs!)

Overall, CSS selector goals are great for addressing one-off issues with the usability of the site – and they’re especially great for validating the insights that you get from your heat maps.
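Under the hood, a CSS selector goal is just event delegation: listen at the top of the document, and fire the goal whenever a click lands inside a matching element. A rough sketch, with `fireGoal` as a hypothetical stand-in for your framework’s conversion call:

```javascript
// Fire a goal whenever a click lands inside any element matching the
// selector. Uses event delegation so elements added later still count.
// `fireGoal` is a hypothetical stand-in for the framework's own call.
function trackClicks(root, selector, fireGoal) {
  root.addEventListener("click", (event) => {
    // closest() walks up from the clicked node, so clicks on children
    // of the target element (an icon inside a button, say) still match.
    if (event.target.closest(selector)) {
      fireGoal(selector);
    }
  });
}

// Example: trackClicks(document, ".add-to-cart", mySendGoalFunction);
```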

Form submissions

Frameworks will track form submissions to a specific URL as well. You’ll want to enter the URL the form submits to – its action attribute – if it exists.

If it doesn’t have a direct target, and you don’t have any other forms that someone can use at this point in the funnel, use a wildcard with an asterisk so it can capture all form submissions everywhere.

Page views & link clicks

This is the most common goal by far: tracking views of a specific page. And it comes in two forms.

  • Page views track hits to a specific page, regardless of source.
  • Link clicks track clicks to a specific page on the test page you’re measuring.

I use these to measure each step in a funnel, usually as page view goals. Why? Ultimately, it doesn’t matter how much tire-kicking the person is doing, as long as they convert. And I can get a sense for the quantity of tire-kicking by running a heat map.

You can measure views to a specific page by indicating the exact URL (“URL matches”, in the string-matching queries above). You can measure views to many pages by using other string-matching queries:

  • “Matches pattern” uses wild-card matching to specify a page. For example, all URLs with blah can be specified with *blah*. All URLs with blah and bleh can be specified with *bl?h*.
  • “Contains” simply returns all URLs containing that string. As with the blah example above, simply typing blah will do the same thing.
  • “Starts with” and “ends with” are what you think – but you need to begin “starts with” using the correct trailing protocol. “https://example.com” will only work on “https://example.com”, not “https://www.example.com” or “http://example.com”.
  • “URL matches regex” uses regular expressions to match your URLs, which is terrific for highly complex string matching. Rather than spend ten thousand words teaching you about regular expressions, I refer you to Reuven Lerner’s authoritative source on the subject.
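All of those matching modes boil down to simple string tests. Here’s a rough sketch of what each mode means – my own illustration, not any framework’s actual implementation:

```javascript
// Rough sketch of the four URL-matching modes. Frameworks implement
// these internally; this just makes each mode's meaning concrete.
function urlMatches(mode, pattern, url) {
  switch (mode) {
    case "contains":
      return url.includes(pattern);
    case "starts with":
      return url.startsWith(pattern);
    case "ends with":
      return url.endsWith(pattern);
    case "matches pattern": {
      // Wildcards: * matches any run of characters, ? matches one.
      const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
      const regex = new RegExp(
        "^" + escaped.replace(/\*/g, ".*").replace(/\?/g, ".") + "$"
      );
      return regex.test(url);
    }
    default:
      throw new Error("unknown mode: " + mode);
  }
}
```

Note how “starts with” really does fail on a `www.` prefix, exactly as described above – which is why it’s the mode most likely to silently miss traffic.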

What if your goal isn’t firing?

You might be using the wrong pattern for your URLs, or the target URL may not have your framework’s tracking code.

In the former case, you need to change the goal so it accurately reflects what’s happening on the site – and you probably need to flush your test’s data and start over.

In the latter case, you need to install the tracking code on the corresponding test pages. Remember that VWO’s tracking code should be placed at the end of the <head> tag, and Optimizely’s should be placed at the beginning of it. Here are the instructions for installing Google Optimize.

Measure as Carefully as You Test

Don’t forget that it’s quite possible to over-measure – and act on the wrong insights.

First, remember that revenue outweighs all other goals. You are testing to generate revenue, always.

Next, revenue-related goals outweigh other goals. Hits to a confirmation page correspond directly to revenue, as long as you aren’t providing something for free.

Finally, rank behavior-based goals last. Clicks on navigation, engagement, and the like are sometimes useful for framing design decisions, but consider them more like a part of your research process than an actual testing outcome.

How to Calculate Intended Lift

So, you have your primary goal. You have (presumably) assessed the conversion rate of this goal in Google Analytics. (If you haven’t, go do that now). Now, you need to calculate the minimum detectable effect, or the intended lift, of the test.

This depends heavily on the traffic you currently get. The more traffic you have, the smaller the lift you can detect – which means the more likely you are to get small wins from your testing, in addition to home runs.

You calculate your intended lift backwards: take your current traffic, your conversion rate, and your intended test duration (longer than one week and shorter than four), and throw those numbers into Evan Miller’s tool for calculating sample size.

Optimizely has a ton more on minimum detectable effect and how to determine whether you have enough traffic to get reliably significant results.

Prioritization

How do you know what to test next? There are tons of different resources for this, many of which have their own TLAs:

  • PIE, by WiderFunnel.
  • ICE, a classic business-101 way to look at optimization problems.
  • PXL, by ConversionXL, which weighs many factors in what to test next.

Obviously, we have our own methodology in the Draft Method as well. Here’s what you can do to prioritize new test ideas quickly.

First, ignore any ideas that are not – or can’t be – supported by research. We’ve mentioned this over and over in the past, but researched test ideas always outperform vague guessing.

PXL’s framework accounts for this by devoting four criteria explicitly to research; the Draft Method accounts for this by making research a prerequisite for any test idea. If an idea hasn’t been researched, you need to follow the hunch and confirm it through analytics, heat maps, usability tests, or customer interviews.

There are 3 parameters that you should assess for every test idea, scored from 1 to 10:

Parameter 1: Feasibility

In short, how hard is this to build out? Does it require development effort, new prototypes, wireframes, or conditional logic?

Score a 10 if you can knock this out in 5 minutes using your testing framework alone; score a 1 if it requires major refactors of the software, new testing frameworks, building out something in-house, etc.

Parameter 2: Impact

“Impact” is usually the area with the most guesswork. How likely is this change to make an impact on the overall site?

Here are some questions to ask when assessing impact:

  • Does the change apply to a large swath of prospective customers?
  • Does the change apply to a segment of customers that broadly converts below the norm?
  • How drastic is the change being made? Are you just changing a button color, or are you throwing the button away and replacing it with something entirely new?
  • Is the element located above the fold, especially on mobile?
  • Does the element map directly to customer motivations?
  • Does the element simplify the page?
  • Is the element noticeable within 5 seconds?
  • Does the test occur on a high-traffic page?
  • Does the test occur on a page that’s within the conversion funnel?
  • How much of the business’s revenue depends on this particular element performing better?

Score a 10 if this is changing a highly load-bearing element in a radical way; score a 1 if you’re changing your button from green to blue. (Don’t do that.)

Parameter 3: Business Alignment

On a scale from 1 to 10, how much does this align with your business’s goals?

Keep in mind that this is not an answer to how much someone in sales wants the test to happen, nor is it an answer to how passionately the CEO cares. Business strategy is always a long-term play.

If tests are obviously contradictory to the core branding, values, or terms of the business, they should only go live with the greatest of caution and care – and only after considerable research in support of them.

On the other hand, if the business’s long-term strategy jibes strongly with the sort of test you’re putting together, that’s a solid motivator for launching the test sooner.

Add ‘Em Up

You now have an aggregate number from 3 to 30.

At this point, I usually throw away any tests that score below 10; there’s always lower-hanging fruit to be found, even if it requires more research from the team.

And then I sort the rest, going slowly down the list from 30 to 11 – and always trying to find more high-ranking tests in the meantime. Tool-wise, we’ve already established that Trello – or some similar sort of kanban board – is a great way to manage the order of tests. Here’s the template we use at Draft for all of our clients, if you need a place to get started.
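The whole scoring-and-sorting routine is simple enough to sketch in a few lines. The idea names and scores below are made-up examples:

```javascript
// Score each researched test idea 1-10 on the three parameters, sum
// them, drop anything below 10, and sort best-first.
function prioritize(ideas) {
  return ideas
    .map((idea) => ({
      ...idea,
      score: idea.feasibility + idea.impact + idea.alignment,
    }))
    .filter((idea) => idea.score >= 10) // throw away the low scorers
    .sort((a, b) => b.score - a.score); // best ideas first
}

const queue = prioritize([
  { name: "Rework pricing pitch", feasibility: 6, impact: 9, alignment: 8 }, // 23
  { name: "Green button to blue", feasibility: 10, impact: 1, alignment: 2 }, // 13
  { name: "Simplify checkout form", feasibility: 7, impact: 8, alignment: 9 }, // 24
]);
// queue is now ordered: checkout form (24), pricing pitch (23),
// button color (13) - which survives the cutoff but ranks dead last.
```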

Remember that you’re never done generating new test ideas, and refreshing the list with high-priority, high-impact tests is the best way to keep an optimization practice fresh!

Minimum sample size

Testing is worthless without valid statistical significance for your findings – and you need enough qualified, wallet-out traffic to get there. If you want a lengthy mathematical explanation for this, here’s one.

What does this mean for you in practice? Before you run any test, you need to calculate the amount of people that should be visiting it. Fortunately, there’s an easy way to calculate your minimum traffic. Let’s go into how to do this!

Maximum timeframe

First off, you should be getting the minimum traffic for an A/B test within a month’s time. Why a month? Several reasons:

  • It won’t be worth your organization’s time or resources to run tests so infrequently. You are unlikely to get a high ROI from A/B testing within a year of effort.
  • One-off fluctuations in signups – either by outreach campaigns, holidays, or other circumstances – are more likely to influence your test results.
  • Your organization will be more likely to spend time fighting over the meaning of small variations in data. That is not a positive outcome of A/B testing.
  • You will not be able to call tests unless they’re total home runs, for reasons I’ll describe below.

Sample size is calculated with two numbers:

  1. Your conversion rate. If you don’t have this already calculated, you should configure a goal for your “thank you” page in Google Analytics – and calculate your conversion rate accordingly.
  2. The minimum detectable effect (or MDE) you want from the test, in relative percentage to your conversion rate. This is subjective, and contingent on your hypothesis.

A note on minimum detectable effect

The lower the minimum detectable effect, the more visitors you need to call a test. Do you think that a new headline will double conversions? Great, your minimum detectable effect is 100%. Do you think it’ll move the needle less? Then your minimum detectable effect should be lower.

Put another way, if you want to be certain that a test causes a small lift in revenue-generating conversions – let’s say 5% – then you will need more traffic than a hypothesis that causes your conversions to double. This is because it’s easier to statistically call big winners than small winners. It also means that the less traffic you have, the fewer tests you’ll be able to call.

You should not reverse-engineer your minimum detectable effect from your current traffic levels. A test either fulfills your hypothesis or it doesn’t, and science is historically quite unkind to those who try to cheat statistics.

How to calculate sample size

I use Evan Miller’s sample size calculator for all of my clients. You throw your conversion rate and MDE in there, along with the level of confidence you want your test to reach.

I recommend at least 95% confidence for all tests. Why? Because anything less means you still have a high chance for a null result in practice. Lower confidence raises the chance that you’ll run a test, see a winner, roll it out, and still have it lose in the long run.

Let’s say your conversion rate is 3% and your hypothesis’s MDE is 10% – so you’re trying to run a test that conclusively lifts your conversion rate to 3.3%. Here’s an example of how I fill this form out.

Note that the resulting number there is per variation. Are you running a typical A/B test with a control and 1 variant? You’ll need to double the resulting number to get your true minimum traffic. Are you running a test with 3 variants? Quadruple the number. You get the idea. This can result in very high numbers very quickly.
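If you want to sanity-check the calculator, the underlying math is a standard two-proportion power calculation. This sketch hard-codes 95% significance and 80% power (the common defaults), so its output lands in the same ballpark as Evan Miller’s tool, though it may not match digit-for-digit:

```javascript
// Per-variation sample size for detecting a lift from baseRate to
// baseRate * (1 + relativeMde), at 95% significance and 80% power.
// baseRate and relativeMde are fractions, e.g. 0.03 and 0.10.
function sampleSizePerVariation(baseRate, relativeMde) {
  const p1 = baseRate;
  const p2 = baseRate * (1 + relativeMde); // e.g. 3% lifted to 3.3%
  const zAlpha = 1.96; // two-sided z for 95% significance
  const zBeta = 0.84;  // z for 80% power
  const pBar = (p1 + p2) / 2;
  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil(numerator ** 2 / (p2 - p1) ** 2);
}

// The 3% -> 3.3% example: roughly 53,000 visitors per variation,
// so about 106,000 total for control plus one variant.
const perVariation = sampleSizePerVariation(0.03, 0.1);
```

Notice how the denominator is the squared gap between the two rates: halve the MDE and the required traffic roughly quadruples, which is exactly why small lifts are so expensive to detect.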

If you see a number that’s clearly beyond the traffic you’d ever expect to get in a month, work on one-off optimizations to your funnel instead. Don’t A/B test. It’ll be a waste of your company’s time and resources. Testing isn’t how the struggling get good, it’s how the good get better.

“But nickd, I can launch a giant AdWords campaign, right?”

You could, yes. But first, ask yourself these questions:

  • Is AdWords traffic more or less likely to convert than organic traffic?
  • Is AdWords traffic sustainable across multiple years of A/B tests?
  • Are AdWords customers the right kinds of customers for my business?

Put another way: will you get a decent ROI on AdWords, or are you just running a big ad campaign so you can juice the numbers of your A/B test?

A better way: outreach, writing, and PR

If you have too little traffic for testing right now, there is fortunately a surefire way to get more. You should write about your field of expertise, guest post and podcast on others’ sites to increase your reach, and overall educate your audience about your specific point of view.

I have found no better substitute for this – and lord knows I walk the walk. If you want traffic, you need to toot your own horn, period.

Building the test

Most tests are built using a framework’s WYSIWYG in-page editor. In there, you click on certain elements, change what needs to be changed in a given variant, and the framework’s tracking code uses JavaScript DOM-editing fanciness to deliver the variant to the right proportion of your customers.

This is terrific for smaller tests on coloration, copy, and imagery. But what happens when you want to test a more radical rework? Or if you have some more dynamic, complicated content? That’s when you have to get others involved – and yes, you’ll have to get your hands dirty with code.

Working with code is a core part of any solid A/B testing strategy. You can’t just change calls to action and expect them to land every single time. You will need to work with code for most of your tests.
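To see why, it helps to know what a WYSIWYG variant actually is: injected JavaScript that rewrites the page for the variant group. A headline change might boil down to something like this (the selector and copy are made-up examples):

```javascript
// Roughly what a framework's WYSIWYG editor generates for a headline
// test: injected JavaScript that edits the DOM for the variant group.
function applyVariant(doc) {
  const headline = doc.querySelector("h1.masthead-title"); // example selector
  if (headline) {
    headline.textContent = "New, benefit-focused headline"; // example copy
  }
  return headline;
}

// In the browser, the framework runs: applyVariant(document);
```

That works fine for static text. The trouble starts when the element is rendered dynamically, animated, or repeated across pages – which is exactly the list that follows.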

The Context

For a very long time, most testing efforts were hard-coded and self-hosted. People would create their own frameworks that handled all the work of delivery, tracking, and reporting. For example, take a look at A/Bingo, a Ruby framework for running your own tests. That’s one of the more sophisticated open source ones, should you still wish to roll your own.

Around the same time, Google Website Optimizer, since rolled under the Analytics tarp as Content Experiments, did roughly the same thing – but you didn’t have the WYSIWYG component, it was hard to analyze (this is Google Analytics, after all, where nothing is allowed to be easy or convenient), and most of your experiments still needed to be server-side.

We’ve come a long way since then. You can click, change text, add some goals, and deploy your test. If it weren’t for the convenience provided by modern testing frameworks, Draft Revise wouldn’t exist.

Worst case, you can leave the analysis up to the two big testing framework providers now. You don’t have to sweat the statistical analysis (much) or wring your hands over whether a variant is actually generating more revenue for you. That’s extremely good – not only for your own testing efforts, but for getting more people in the fold as well.

Yet this is still not enough.

When The Framework Alone Isn’t Enough

When isn’t your framework’s WYSIWYG editor enough?

  • Any dynamic content. Testing pricing? You’ll want to haul a developer in. Testing something that changes from region to region? That won’t work, either. WYSIWYG editors are only good at handling static content – unless you plan to remove the control’s element entirely and code it by hand, which could trigger any of a litany of JavaScript bugs.
  • Big, fancy JavaScript-y things. Got a carousel that dances across the screen for no good reason? You are likely bracing for a world of hurt if you want to manually rewrite all the JavaScript to turn that off directly in your framework. This is a case where it’s not impossible to use a WYSIWYG editor, but development work will be substantially easier for you if you develop your own solution.
  • Multi-page tests. Do you want to test changing the name of your product from, say, “Amazon Echo” to “Amazon Badonkadonk?” I mean, it’s your company, Jeff, but it seems like you might have “Amazon Echo” listed in many different places, including in images and <title> tags. Frameworks have become terrific at tracking across sessions, but unless you have the whole site set to declare Amazon Echo as a variable, and you’re ready to swap it out at a moment’s notice, a single line of injected JavaScript won’t do the trick – you’ll probably want something on the back end to change it out everywhere.
  • Massive site reworks. Did you just redesign your site, and you’re looking to measure the new thing’s performance against the old thing? Yeah, I would strongly caution against using a WYSIWYG editor for such an undertaking.

This is a lot, right? And some of it seems pretty critical. Bold reworks? Pricing changes? That’s Testing 101 for most of us. My take: if you only want to dabble in testing by trying the lowest-impact stuff, by all means rely solely on a framework. But if you really want to build a solid practice that will endure no matter what you do to your site, you’re going to have to work with code – or enlist the help of someone who can do it for you.

Redirecting the page: the 20% solution

All frameworks should let you create a split page test, which shunts the right proportion of visitors to a whole separate variant page. Put another way, if you run control on /home and variant on /home/foo, 50% of your visitors should be automatically redirected to /home/foo, and measured accordingly. This allows you to bypass the WYSIWYG editor for a home-rolled solution.

You can always create a new page and forward people there – and that allows you to make whatever changes you’d like. This only works for tests on a single page, though: if you’re making changes that affect any other pages on your site (say, with pricing), you’ll want to go with something a little more involved. Fortunately, most of the work is in your first-time setup, and you’ll be able to reuse it in your future testing efforts.
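Your framework handles the redirect for you, but it helps to have a mental model of what it’s doing under the hood. Here’s a sketch of that bucketing logic – the function names are my own, not any framework’s API:

```javascript
// Sketch of the bucketing a split page test performs.
// pickBucket is pure so the split is easy to reason about and test.
function pickBucket(random, splitRatio = 0.5) {
  return random < splitRatio ? 'control' : 'variant';
}

// Map a bucket to the page that visitor should see.
function destinationFor(bucket, controlPath, variantPath) {
  return bucket === 'variant' ? variantPath : controlPath;
}

// Browser usage (not run here): bucket once, remember it, redirect.
//   const bucket = pickBucket(Math.random());
//   document.cookie = 'ab_bucket=' + bucket + '; path=/';
//   if (bucket === 'variant') location.replace('/home/foo');
```

The important property is that the assignment happens once and is remembered, so a given visitor always sees the same side of the test.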

Creating a whole new site: the big rework solution

For situations where you’re taking the wrecking ball to your whole site and evaluating a new version, you can create a whole new set of static pages at the root of the site. (Rails lets you do this in the /public/ folder.) Make the control home page /home/index.html and the variant /welcome/index.html, then create synonyms for the other pages: /plans/ instead of /pricing/, /join/ instead of /signup/, and so on.

Then, use your split page test to send your variant’s traffic to /welcome/ instead of /home/. This is particularly good for SaaS businesses whose funnels are typically three pages (home, pricing, sign up); for ecommerce sites that involve many different pages and a lot of dynamic functionality, you may want to deploy to a whole different server, and redirect people to a subdomain like (say) shop.example.com instead of example.com.

GET queries: the 80% solution

For most substantial tests, I recommend clients set up a solution that redirects the page using a GET query.

For those who don’t know: a GET query is a string like ?foo=1 appended to your URL, which lets the page grab variables for its own dynamic processing. For example, on most New York Times articles of sufficient length, appending ?pagewanted=all to the end of the URL lets you view the whole article on one page. Additional parameters are delimited with an ampersand, so ?pagewanted=all&foo=1 sets two variables: pagewanted to all and foo to 1. You can then pull these values in with your dynamic language of choice.
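To make that concrete, here’s a small sketch of pulling a named parameter out of a query string. URLSearchParams is built into both browsers and Node, so nothing needs to be hand-parsed:

```javascript
// Extract a named parameter from a query string like "?pagewanted=all&foo=1".
// Returns the value as a string, or null if the parameter is absent.
function getQueryParam(search, name) {
  return new URLSearchParams(search).get(name);
}

// Browser usage (not run here): getQueryParam(location.search, 'v');
```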

You’ll be employing a split page test here as well: just redirect your variant’s traffic to something like http://example.com/?v=1. Then, index.php pulls in the GET variable, determines that it’s been set to 1, and serves a variant (or changes the appropriate content) instead.

So, here’s a minimal PHP sketch of the idea:

$variant = isset($_GET['v']) ? $_GET['v'] : null;

if ($variant === '1') {
    include 'index-variant.php';
    // index-variant.php also appends ?v=1 to every link that
    // points to a local site page, so visitors stay on the variant.
} else {
    include 'index-control.php';
}

Two tactics are important here:

  • I always maintain two different pages that allow me to vet what’s on control and what’s on variant. That way, I don’t need to go mucking around in the actual code. I can always use variables to switch stuff like pricing out on the pages themselves.
  • See the line about appending ?v=1 to all local A tags? That’s so we can keep people on a variant through the whole conversion funnel. You don’t just want people to hit index.php?v=1; you want people to go to /pricing/?v=1, /signup/?v=1, etc.
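That link-rewriting step can also be done client-side. Here’s a sketch – appendVariant is my own helper, not part of any framework – that adds the variant flag to a site-local URL while preserving any query string that’s already there:

```javascript
// Append v=1 to a site-local path, keeping existing query parameters.
// The base URL is only used so the URL parser accepts a bare path.
function appendVariant(href, value = '1') {
  const url = new URL(href, 'https://example.com');
  url.searchParams.set('v', value);
  return url.pathname + url.search;
}

// Browser usage (not run here): rewrite every local anchor on the page.
//   for (const a of document.querySelectorAll('a[href^="/"]')) {
//     a.href = appendVariant(a.getAttribute('href'));
//   }
```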

What if the customer notices ?v=1 being appended, and deletes it on a refresh for some outlandish reason? Frameworks set a cookie that persists across the session; they’ll recognize that the visitor is supposed to receive the variant, so ?v=1 will come back – like a very persistent cockroach of URL modification.

What if the customer doesn’t have cookies enabled? They won’t get the test in the first place, and they’ll always be served control. Their decision will keep them from being counted in our test’s final analysis as well. (This goes for any tests you run, ever – not just these crazy ones.)
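If you want a belt-and-suspenders fallback of your own, a small script can check the cookie and restore the flag. A sketch, assuming the cookie is named ab_variant – the actual name will differ per framework:

```javascript
// Read one value out of a document.cookie-style string.
function cookieValue(cookieString, name) {
  for (const pair of cookieString.split('; ')) {
    const [key, ...rest] = pair.split('=');
    if (key === name) return rest.join('=');
  }
  return null;
}

// True when the cookie says "variant" but the URL has lost ?v=1.
function needsVariantRestore(search, cookieString) {
  const inUrl = new URLSearchParams(search).get('v');
  return inUrl === null && cookieValue(cookieString, 'ab_variant') === '1';
}

// Browser usage (not run here):
//   if (needsVariantRestore(location.search, document.cookie)) {
//     location.search = '?v=1';
//   }
```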

This Should Cover Most Things

I’m sure that you’re reading this and thinking of some situation where everything would catastrophically break for you. And that may be true for your context. But a little upfront rework goes a long way in testing – and it’ll allow you considerably more freedom from the constraints of your testing framework’s WYSIWYG editor. Doing this work is critical to ensuring that you have a durable, long-term testing strategy that makes you money in perpetuity.

Note how, in each of these cases, the testing framework is never fully cut out of the picture. You’re still using it to redirect traffic and, crucially, to gather insights into what your customers are doing. You’re still using it to plan a test and call its result. But you’re taking variant generation onto your own plate – which gives you near-infinite latitude in what you can test.

Please trust me here, as I speak from a wealth of experience: you don’t want to come up with a really promising test idea, run into the limitations of your framework, and then get into a political battle about what to do next. It will be enormously frustrating for you.

It’s far easier to have clarity in how to proceed – and creating the tests on your own allows you to proceed with confidence. Otherwise, you’ll probably just deploy the variant to production without testing it. And that’s not really the point of having a testing strategy, now is it?

Learning from tests

Once you get the testing process going, you can even mine past tests for research insights. How do A/B tests themselves give us more research data?

  • Most testing frameworks offer heat & scroll maps of both the control & the variant – so even if you get a winning variant that changes customer behavior significantly, you’re able to maintain an up-to-date heat map.
  • All testing frameworks integrate with analytics software, in order to provide greater insight into the economic impact of your design decisions.
  • And finally, every test you run should go into a database of past test results that you maintain for record-keeping purposes. This allows you to understand what’s worked and what hasn’t in the past – which gives you greater insight into what to test next. Here’s a guide to maintaining a Trello board of test ideas, which I usually package with my course The A/B Testing Manual.

And don’t forget: you should always be gathering research insights while tests run. Why? Doing so lets you come up with the next set of tests, from idea to prototype. And maximizing the amount of time that a test is actively running will make testing maximally valuable for your store.

Concluding thoughts

If you have enough traffic to get statistically significant results, you should be doing what you can to optimize your store now. (Better to start now than two weeks before the holidays!) Off-the-shelf themes always need to be improved, and optimization can bump conversion rates by at least 1%. That said, research should always happen for stores of any size, and there isn’t really much of a downside to it.

Testing is daunting for many – not because it’s hard (it isn’t!), but because it requires a mindset shift: one which focuses on customer inquiry and careful, incremental improvement.
