Today, we will talk about A/B testing, or more specifically, some of the statistics behind A/B testing. People tend to abuse this a little, so it helps to have an intuitive understanding of what’s going on under the hood, so to speak.
First, we should ask why we do A/B testing at all. The basic idea is to try to increase revenue. For example, say somebody has an idea for how they want to change their website, perhaps by changing the checkout procedure, or other ways that users or customers interface with the website – something as simple as a design change. This can affect search results, or the way that traffic is rerouted through the website; it could affect multiple different things.
Let’s define a couple of terms here. Think back to ninth grade science experiments: controls and variants.
The control is how things currently stand. Take an example of a Groupon banner. It’s normally green in color, so it’s our control.
Next, we have the variant treatment. Say for example we’ve decided to redesign the Groupon banner and make it hot pink, and we think that this will help the product to sell like hotcakes. Now, chances are that probably is not the case, but this is what we need to determine.
So the question is this: which banner is going to be better – the original control, or the hot pink variant?
Can we just test them both and see which does better? Well, to understand this we need to get into the math and understand some terms that we use in statistics. John Forbes Nash, the man who inspired the film A Beautiful Mind, once said, “You don’t have to be a mathematician to have a feel for numbers.”
We realize that math is not everyone’s favorite topic. Some people like it but some people get a little math-phobic. Let’s do an intuitive overview – no equations, but a few graphs, just to get a feel for how the numbers can fluctuate.
Let’s say I have ten pennies and ten dimes and a room with 20 people. There is nothing special about these coins, I just took them from my change jar.
I’m going to distribute the pennies amongst people in one area, and ask them to flip the coins. If they get heads, they should raise their hands. I’ll then distribute these dimes to the people over here, and if they get heads, they should raise their hands.
My hypothesis is that one of these types of coins, either pennies or dimes, will be better than the other at producing heads. If that’s true, then whenever we do a coin toss, we can use that knowledge to predict the result. We will know that the statistical chance of that coin producing heads will be higher or lower.
The ten people with pennies flipped their coins, and seven out of ten produced heads. The ten people with dimes flipped their coins, and five out of ten produced heads.
We have found that pennies do better, with a 40% lift in comparison to dimes. When we say lift, we mean relative difference. Our experiment produced 40% more heads with pennies than dimes. So we should always use pennies if we are tossing a coin and want to win by calling heads.
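To make the arithmetic concrete, here is a small sketch in Python. The 7-of-10 and 5-of-10 counts are the ones from the experiment above; `flip_group` is just a hypothetical helper for rerunning the experiment yourself:

```python
import random

def flip_group(n, p_heads=0.5):
    """Simulate n people each flipping one coin; return the number of heads."""
    return sum(random.random() < p_heads for _ in range(n))

# Observed result from the experiment above.
penny_rate = 7 / 10   # pennies: 7 of 10 heads
dime_rate = 5 / 10    # dimes: 5 of 10 heads

# "Lift" is the relative difference between the two rates.
lift = (penny_rate - dime_rate) / dime_rate
print(f"lift = {lift:.0%}")  # 40%
```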
Is This Reliable So Far?
So what’s wrong with this hypothesis? We don’t have significance.
Significance is the chance that you would see a result at least this extreme, given that there is no real difference between the control and variant. In fact, getting seven heads (or five) is quite likely if you are only flipping ten coins.
Significance is what helps you limit your false-positive rate: thinking there is a difference when in fact there is none. When you flip a fair coin, you have a 50:50 chance of heads or tails, and we can look at the distribution of the possible number of heads. Strictly speaking that distribution is binomial, but for our purposes it is well approximated by a normal distribution.
Some people refer to this graph as a Gaussian curve, others as a bell curve, although that term is falling out of favor these days. This graph shows the likelihood of achieving the same number of heads if you flip a coin ten times.
The mean is five heads; however, you are still fairly likely to get four or six, or even three or seven heads, as we saw. Ten random coin flips can produce any of these results, so if you only flip ten coins in your experiment, you cannot say deterministically what the result will be. Each run can look very different from another.
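You can compute this distribution exactly with plain Python. A quick check shows that seven heads in ten flips is not unusual at all:

```python
from math import comb

def binom_pmf(k, n=10, p=0.5):
    """Exact probability of k heads in n coin flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

for k in range(11):
    print(f"{k:2d} heads: {binom_pmf(k):.3f}")

# Seven or more heads out of ten happens about 17% of the time by pure chance.
p_seven_plus = sum(binom_pmf(k) for k in range(7, 11))
print(f"P(>= 7 heads) = {p_seven_plus:.3f}")  # 0.172
```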
However, as you start increasing your number of flips, something interesting happens.
The width of this normal distribution is known as the standard error of the mean. We have an estimate of the mean, plus some error around it, whether we do 10, 100, or 1,000 coin flips. As we flip more and more coins, the spread gets narrower and narrower, and our estimate keeps getting closer and closer to the true mean, at least in terms of percentages.
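Concretely, for a fair coin (true rate 50%), the standard error of the observed heads percentage shrinks like one over the square root of the number of flips:

```python
from math import sqrt

def se_of_proportion(n, p=0.5):
    """Standard error of the observed heads proportion after n flips."""
    return sqrt(p * (1 - p) / n)

# Each 10x increase in flips shrinks the spread by a factor of sqrt(10).
for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5}: SE = {se_of_proportion(n):.4f}")
```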
While we have been hypothesizing the number of heads, what we care about here are percentages. If we relate this back to A/B testing, we are concerned with conversion rates – for example, the percentage of website users that actually make a purchase.
The whole idea here is that the more we flip the coins, the closer we get to our true mean. Now, it turns out that normal distributions have some very nice properties. One is known as the 68-95-99.7 rule, or the Empirical Rule.
The Empirical Rule
The Empirical Rule states that 68% of the time you will be within one standard error of the mean. So in this case, we’re flipping our coins. If you flip your coins 100 times, then 68% of the time, you’ll be within 45 to 55 heads.
Similarly, 95% of the time you'll be within two standard errors of the mean, so between 40 and 60 heads. (Here, one standard error of the mean is five heads.)
If you're flipping a hundred times and we go out to three standard errors of the mean, we will be within that range 99.7% of the time. This is useful precisely because, with enough samples, our data should be approximately normally distributed.
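The rule is easy to verify from the normal distribution itself; `erf` from Python's standard library gives the probability of landing within k standard errors:

```python
from math import erf, sqrt

def within_k_se(k):
    """Probability a normally distributed result lands within k SE of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SE: {within_k_se(k):.1%}")
# within 1 SE: 68.3%, within 2 SE: 95.4%, within 3 SE: 99.7%
```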
Now let's say we want to start comparing means, which is what we did with our pennies and dimes, and which is what you want to do in your A/B testing: you want to see whether your variant is doing better than your control by comparing the mean conversion rates between them. To do this, we conduct a standard statistical test known as the t-test.
Ultimately, a t-test measures the difference in mean averages, in units of our uncertainty and units of our standard error of the mean. If the difference in mean is much larger, then we can say that it is statistically significant.
There are two different ways that this can happen.
- The first is to look for a big difference: for example, with a heavily weighted coin we don't need many flips to see a big difference versus a fair coin. Likewise, if we take what we think is a terrible user experience, we will quickly see a big difference.
- The other way is to get more users, to increase traffic. Just like doing more coin flips, this will end up decreasing our uncertainty, and so the ratio of the two will end up increasing.
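For conversion rates, the usual concrete form of this test is a two-proportion z-test, which for large samples behaves like the t-test. A minimal sketch, with entirely made-up conversion counts:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Difference between two conversion rates, in units of its standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def two_sided_p(z):
    """Two-sided p-value for a z statistic (normal approximation)."""
    return 1 - erf(abs(z) / sqrt(2))

# Hypothetical data: 500 of 10,000 control conversions vs 560 of 10,000 variant.
z = two_proportion_z(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {two_sided_p(z):.3f}")
```

With these invented numbers the z statistic comes out just below 1.96, so the 12% relative lift would not quite reach significance at these sample sizes.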
So, how big do we need to go?
This is where the term p-values comes in.
Some people talk about p-hacking, which we will discuss later. The p-value is the probability of seeing a result at least as extreme as yours, assuming there is no difference at all between your control and your variant. Under that assumption, t-values are again normally distributed.
What people end up doing is choosing a common cutoff: a p-value of less than 5%. If you conduct an A/B test and there truly is no difference between control and variant, so you're literally testing control versus control, then there is a well-defined spread of possible t-values that you'll get.
95% of the time you will be within a t-value of -1.96 to 1.96.
You will be in that range 95% of the time – you will only be above it 2.5% of the time, and below it for the remaining 2.5% of the time. Therefore, there is only a 5% chance that you will be outside that range and this is why we call it a p-value: less than 5%.
However, again, all of this is conditioned on there being no difference between the means. We should mention that the less-than-5% threshold is the convention in biostatistics.
This kind of hypothesis testing is also very common in astronomy and physics, where we tend to use a more stringent threshold, like 0.3% (three standard errors). It depends on the field, but in terms of betting odds, 5% is very good.
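The 5% false-positive rate is easy to check by simulation: run many "A/A tests" where both arms sample the same fair coin, and count how often the statistic strays past ±1.96 anyway. (All numbers below are invented for the demo.)

```python
import random
from math import sqrt

random.seed(0)  # reproducible demo

def aa_test_z(n=1000):
    """z statistic for an A/A test: both arms sample the same fair coin."""
    rate_a = sum(random.random() < 0.5 for _ in range(n)) / n
    rate_b = sum(random.random() < 0.5 for _ in range(n)) / n
    se = sqrt(0.25 / n + 0.25 / n)  # standard error of the difference
    return (rate_b - rate_a) / se

trials = [aa_test_z() for _ in range(2000)]
false_positive_rate = sum(abs(z) > 1.96 for z in trials) / len(trials)
print(f"A/A false-positive rate: {false_positive_rate:.1%}")  # close to 5%
```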
Something that people don’t talk about as much is statistical power.
This is just as important as statistical significance, but it is not discussed as much. Suppose there actually is a true difference, a large difference, between the mean of the control and the mean of the variant.
What are the chances that our test would detect it and come out significant? That is your power. In other words, power limits your false-negative rate: a real difference that your test fails to flag as significant. You want to limit this as much as possible, because this is a source of revenue that you could be tapping into.
A common threshold is 80% power. To use it, you first set a sensitivity: how much lift you want to be able to detect. 80% power then means that if the true difference in means is equal to that sensitivity, four out of five times your test will pick it up.
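Here is a rough sample-size sketch using the normal approximation. The 5% baseline rate and 10% target lift are invented for illustration; 1.96 and 0.84 are the z-scores corresponding to 5% significance (two-sided) and 80% power:

```python
from math import ceil

def users_per_arm(p_base, lift, z_alpha=1.96, z_power=0.84):
    """Approximate users needed per arm to detect a relative lift
    at 5% significance and 80% power (normal approximation)."""
    p_var = p_base * (1 + lift)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_var - p_base) ** 2)

# Hypothetical: 5% baseline conversion rate, detect a 10% relative lift.
print(users_per_arm(0.05, 0.10))  # on the order of 30,000 users per arm
```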
It is difficult to decide what you want your sensitivity to be: ideally we would like to detect every lift, no matter how small, but in practice that does not work out.
On the other hand, in an A/B test it depends on how many samples you have. As you sample more and more, the distribution of outcomes narrows, and the probability that your A/B test comes out significant keeps growing, until, ideally, about 80% of the possible outcomes lie beyond your significance threshold.
We should mention that in this example B is doing better than A; if A were doing better, the distribution would shift the other way. If your variant is worse, it's still useful to know that something is doing worse. How quickly you can tell depends on the actual difference between your means, which ties back into sensitivity.
If there's only a 1% lift, the distribution shifts very slowly, and we're going to need a lot of users to be able to see that fact. And if we go from looking for a 1% lift to looking for a 0.5% lift between the two means, we're going to need four times as many users to see the difference.
So as you get to finer and finer sensitivity, you start getting penalized more in order to see the results, in terms of the amount of time required for testing.
On the other hand, that can also work in your favor. We just care about the big lift, so if we want to go from a 1% to a 2% lift, and we only care about 2% lifts or greater, then we now only need a quarter of the users. Therefore, we can do the test in a quarter of the time.
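The scaling behind this is simply that the required sample size goes like one over the square of the lift you want to detect:

```python
def relative_sample_size(lift_ref, lift_new):
    """How many times more (or fewer) users you need when the target lift
    changes from lift_ref to lift_new; n scales like 1 / lift**2."""
    return (lift_ref / lift_new) ** 2

print(f"{relative_sample_size(0.01, 0.005):.0f}x the users")  # halve the lift
print(f"{relative_sample_size(0.01, 0.02):.2f}x the users")   # double the lift
```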
Now that we have covered some of the basics, let’s discuss some common mistakes.
There are many ways to incorrectly calculate statistics. The first is misinterpreting p-values. Again, the true definition of significance is this: if there is no real difference between your A and B, there is only a 5% chance that your p-value comes out below 5%, or equivalently, that your t-value lands above 1.96 or below -1.96.
Some people turn this around and think that if they have a p-value of less than 5%, then that means there is only a 5% chance that there is no real difference. This is not true. You cannot properly attribute the causation there. They may think that, since there is only a 5% chance that there is no difference, then that must mean there’s a 95% chance there’s a real difference. This error is ubiquitous.
You need to be very careful when you’re talking about this.
If you are familiar with Planet 9: the original paper presenting evidence for a distant giant planet ended by stating that such a clustering has only a 0.007% probability of being due to chance. That was because the authors got a p-value of 0.007%, but stating it that way misinterprets the probability. They shouldn't be doing this, yet the error is ubiquitous.
Scientists make this mistake all the time, and the reason it bites is the false-positive paradox: when you test things that are unlikely to be true, most of your positives will be false.
Let’s imagine a scenario. Say we’ve just watched the news and there is a terrible new disease on the outbreak called cat flu. There’s bird flu, the swine flu, but cat flu causes you to cough up hairballs. It’s terrible. The good news is that it is very rare: only one in a million people actually have it.
So that's good, but being a hypochondriac, I go to my doctor. The doctor says: don't worry, there's a test for it, and it's 95% accurate, with only a 5% chance of a false positive. Great, I say. Give me the test.
I take the test. It comes back positive. Should I be worried?
No. I should not be worried.
There is a 5% chance that the test throws a false positive, but only a one-in-a-million chance that I actually have cat flu. A positive result is therefore almost certainly a false positive.
Balance those two numbers against each other and the lesson carries over: just because the p-value is less than 5%, that doesn't mean there is only a 5% chance that there's no real difference.
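Bayes' rule makes the cat-flu arithmetic explicit (the numbers are the ones from the story above):

```python
def p_disease_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test), via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Cat flu: one-in-a-million prevalence, 95% sensitive test, 5% false positives.
p = p_disease_given_positive(1e-6, 0.95, 0.05)
print(f"P(cat flu | positive test) = {p:.6f}")  # about 0.00002, essentially zero
```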
How Does This Apply To Affiliate Marketing?
Let's say we've created a version of a website, and we believe it will earn tons of revenue, but it turns out that users can't actually buy anything. Revenue starts dropping very quickly, and we have a tension. On one hand, the product managers want to stop the test because we're losing money. On the other hand, the statistician points out that we haven't reached the planned sample size yet, and therefore we need to keep hemorrhaging money until we're sure this is a bad situation to be in.
There are a couple of ways around this.
- Arguably the best way is group sequential testing, a method developed in the 70s and refined in the 90s, originally for clinical trials: heart medication studies and artificial valve studies, tests that you must be able to stop early because they can be a matter of life or death. A/B testing is not life or death, but the same sound mathematical machinery applies.
- An easy way is the Haybittle-Peto rule: if your t-value ever gets beyond about ±3 (below -3, for a losing variant), you can stop the test, because a value that extreme is very unlikely to be a false positive.
- A better way is the O'Brien-Fleming design, where the significance threshold itself changes over time: the boundary starts out very strict and relaxes toward the usual 1.96 as you approach your full sample size.
Another issue is multiple variants. Say you want to change the color of the “Buy” button. It’s currently green, and you want to see if red incites more clicks. You also want to see if orange or blue does better, and you want to test fuchsia too. You want to come up with tons of variants and test all of them.
The problem here is that each of those tests has a 5% chance to produce a false-positive, to make you think that there is a difference, when there actually isn’t. If you actually work out the math, it turns out that if you have ten comparisons, you have about a 40% chance of producing a false-positive.
So we need to correct for that. We can apply what's known as a Bonferroni correction, where we pull back the p-value threshold: instead of testing at 5%, if we're testing ten things we test each at 0.5%, which corresponds to a larger t-value threshold for significance.
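Both the 40% figure and the Bonferroni fix are one-liners to verify:

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(f"10 tests at 5% each: {familywise_error(0.05, 10):.0%}")     # 40%
print(f"Bonferroni, 0.5% each: {familywise_error(0.005, 10):.1%}")  # 4.9%
```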
A slightly more sophisticated correction is known as ANOVA. It's similar to a t-test, but based on what's called an F-test. ANOVA lets us test all the variants together; only if the group as a whole shows a difference do we start doing pairwise t-tests between the control and variants. However, be careful with your time series.
If you run your test for a long time, you will have a hard time making the case that you're looking at a single population. Your conversion rate might be changing over time: more people are coming to the site, more people are buying products. The t-test assumes that you are sampling from a single population, and over a long window that assumption no longer necessarily holds.
The big thing is to be sure that you're testing your treatments simultaneously: a user comes in, a random number generator assigns them to one treatment or the other. Don't test one and then test the other, because if you happen to run the second one over, say, Black Friday, it may look great, with way more conversions, for reasons that have nothing to do with the treatment.
Bigger Is Not Always Better
Another possible issue is needing a huge sample size. This is a big problem for people who don't have a lot of traffic. If you're a small site and you want your A/B test to be sensitive to lifts of 0.5% or bigger, you could need on the order of 10 to 20 million users. This is not feasible for a low-traffic website; the experiment would take far too long, and you don't want to run a test for 600 days. Again, this is the problem of the time series.
There are a couple of solutions here. First of all, focus on intermediate metrics. People talk about the funnel: first a user sees a product, and then they decide to put in their checkout box, then they click on the “Buy” button and then they actually purchase it. This is a very simplified view of the funnel.
If you go up the funnel and look at some of those metrics – checkout views, for example – the event is far more likely to occur, and as a result you need a far smaller sample size.
Think of it with coins. If there's a 50% chance of getting heads, we don't need a large sample to measure that rate. But suppose there were only a 1% chance of getting heads: to see a small lift on top of that, we would need to flip a lot more coins just to see heads come up at all. So focus on metrics that happen frequently.
In this case, use checkout views as the metric, because actual purchases are a much rarer event.
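Using a rough normal-approximation sample-size formula as a sketch (the rates are invented for illustration: purchases at a 1% base rate versus checkout views at 20%), the rarer metric needs dramatically more users for the same relative lift:

```python
from math import ceil

def users_per_arm(p_base, lift, z=2.8):  # z = 1.96 + 0.84: 5% sig., 80% power
    """Rough users per arm to detect a relative lift at baseline rate p_base."""
    p_var = p_base * (1 + lift)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil(z ** 2 * variance / (p_var - p_base) ** 2)

purchases = users_per_arm(0.01, 0.05)  # rare event: purchases
checkouts = users_per_arm(0.20, 0.05)  # common event: checkout views
print(f"purchases:      {purchases:,} users per arm")
print(f"checkout views: {checkouts:,} users per arm")
```

With these made-up rates, the purchase metric needs roughly 25 times as many users as the checkout-view metric to detect the same 5% relative lift.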
We hope this A/B testing analysis was helpful.