A/B Testing Mastery & Statistical Fundamentals For Testing (Review)
Week 2 of a 12-Week Series on Growth Marketing
Each week for 12 weeks, I am writing about what I am learning through the Growth Marketing Mini-Degree from the CXL Institute. This week, I worked through the last two courses in Module 2: A/B Testing Mastery and Statistics Fundamentals for Testing.
The first course, A/B Testing Mastery, was a deep dive into everything A/B testing: what it is, when to use it, how to implement it, common mistakes, and the best practices that produce optimized results.
What is A/B Testing? Right now, I’m running an A/B test on my website to see whether a video on the sales page is effective or not. I am sending 50% of my traffic to the control page (the original page with the video) and the other 50% to the variation page, which has no video on the sales page. This is a simple A/B test. It’s important to note that to accurately assess which variation “wins”, I need to give my test (1) an ample amount of time to run and (2) enough data.
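For the curious, here is a rough sketch of what that 50/50 split can look like under the hood. This isn’t from the course; the visitor ID and the hashing approach are just my own illustration of how a testing tool might bucket visitors so each person always sees the same version.

```python
import hashlib

def assign_variant(visitor_id: str) -> str:
    """Deterministically bucket a visitor into 'control' or 'variation' (50/50 split)."""
    # Hash the visitor ID so the same visitor always lands on the same page.
    bucket = int(hashlib.md5(visitor_id.encode()).hexdigest(), 16) % 100
    return "control" if bucket < 50 else "variation"

print(assign_variant("visitor-123"))  # the same ID always returns the same page
```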
A/B testing has been improving by leaps and bounds, particularly over the past 10–12 years, and in 2021 it is largely democratized: software now lets businesses A/B test almost anything they want. Combined with the flood of data now available, A/B testing is easier than ever to do. So, how do we know where to start and what to test?
“Our success at Amazon is a function of how many experiments we do per year, per month, per week, per day…” — Jeff Bezos
When should we run experiments, and how many? The ROAR model is a four-phase optimization model that helps us figure this out. The four phases of ROAR are: Risk, Optimization, Automation, Re-think. Below, the horizontal axis shows time span and the vertical axis shows conversions per month.
As you can see, borders exist between phases when it comes to thresholds for enough data and assessing risk, and that brings us to the next question: do we have enough data to run an A/B test? We need enough statistical power, which means meeting a threshold for data, among other things. Statistical power is the probability that a test reaches a statistically significant outcome when there is a real effect to detect. In the ROAR model, we need at least 1,000 conversions per month; below that, the amount of data available for a true experiment is just too low. Conversions don’t have to be just sales; they can also be clicks, downloads, etc., whatever metric we are measuring. Another border exists at the 10,000 conversions/month threshold. At that point, we can start four A/B tests per week, or about 200 tests per year, and we can also assign a structure to the A/B testing, as well as a team to carry out the various tests. But what do you do if you’re in the risk phase, below 1,000 conversions per month? Take more risk: run tests at the rate that works for your business, and scale your testing alongside your conversions as they grow.
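To make “enough data” concrete, here is a quick sketch of the kind of sample-size math a power calculator does, using Python’s statsmodels. The 5% baseline conversion rate and the lift to 5.5% are numbers I made up for illustration, not figures from the course.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Made-up numbers: a 5% baseline conversion rate, hoping to detect a lift to 5.5%.
baseline, lifted = 0.05, 0.055
effect = proportion_effectsize(lifted, baseline)

# Visitors needed in EACH variation for 80% power at a 5% significance level.
n_per_variant = NormalIndPower().solve_power(effect_size=effect, power=0.8, alpha=0.05)
print(round(n_per_variant))  # on the order of ~15,000 visitors per variation
```

Even a modest lift like this needs thousands of visitors per variation, which is why the conversion thresholds above matter so much.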
Another key factor in running an A/B test is setting the hypothesis. Doing this before the test starts keeps everyone aligned on the problem, the proposed solution, and the projected outcome. Simply put, finding the “why” will help us test the right things.
In presenting the A/B test findings, we consider the differences between Bayesian statistics and Frequentist statistics when choosing how to analyze the outcome and talk about our findings. In reality, this is more of a philosophical lens through which to see your results. While it may not matter much in the long run whether you are a Bayesian or a Frequentist, I found the description in the next course, Statistics Fundamentals for Testing, very helpful in bringing to life the differences between the two.
In Bayesian statistics, the hypothesis is assigned a probability of being true, which makes it somewhat more intuitive to discuss because the hypothesis and prior knowledge can interact. In Frequentist statistics, no probability is assigned to the hypothesis. But how do we see this difference play out in the natural world?
The black dog example was so helpful to me: a Bayesian view would say that if you can see only one side of a dog and it appears black, you are looking at a “black dog”, inferring that it is overwhelmingly likely the side you can’t see is also black and that the dog is not split down the middle and white on the other side. From a Frequentist view, we would not be able to use that prior knowledge to inform our result, so we can’t call the dog conclusively black.
This illuminates the difference between how we talk about data, but also how our brains are wired to work. By nature, we operate from a Bayesian perspective simply because we are always using context clues to help problem solve. And usually, when we get better knowledge, we update and integrate that into our business. That is the essence of the Bayesian approach. Both Bayesian and Frequentist are attempting to solve the same problem, just from different approaches.
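To make the contrast concrete, here’s a toy comparison of the two lenses applied to the same made-up A/B result. The conversion counts are invented, and the flat Beta prior is my own simplifying assumption rather than anything prescribed by the course.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)

# Invented results: 1,000 visitors per page, 50 vs. 58 conversions.
conv_a, n_a = 50, 1000   # control (with video)
conv_b, n_b = 58, 1000   # variation (no video)

# Frequentist lens: a two-proportion z-test asks how surprising this data would be
# if there were truly no difference; no probability is put on the hypothesis itself.
z_stat, p_value = proportions_ztest([conv_b, conv_a], [n_b, n_a])
print(f"Frequentist p-value: {p_value:.3f}")

# Bayesian lens: start from a flat prior, update it with the observed data,
# and read off the probability that the variation really beats the control.
posterior_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
posterior_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)
print(f"Bayesian P(variation beats control): {(posterior_b > posterior_a).mean():.2f}")
```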
In the context of statistics, getting caught in the above philosophical debate is just one of the four statistics traps to be aware of. What are the other three?
The first mistake is “regression to the mean and sampling error”: if we stop the test too early and don’t reach our sample size, our results may turn out to be a false positive or a false negative. Sampling error arises because we never know the true conversion rate; we only see an estimate from the sample we happen to collect. It’s tempting to stop the test the moment we think we have statistical significance, but it’s important to wait the full duration of the test to avoid this error.
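Here is a small simulation (with invented traffic numbers) of why peeking is dangerous: both pages convert at exactly the same rate, yet checking for significance every day and stopping at the first “win” declares a false winner far more often than the 5% we would expect.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n_simulations, days, daily_visitors, true_rate = 500, 20, 200, 0.05
false_winners = 0

for _ in range(n_simulations):
    conv_a = conv_b = visitors_a = visitors_b = 0
    for _ in range(days):
        # Both pages share the SAME true conversion rate, so any "winner" is pure noise.
        conv_a += rng.binomial(daily_visitors, true_rate)
        conv_b += rng.binomial(daily_visitors, true_rate)
        visitors_a += daily_visitors
        visitors_b += daily_visitors
        _, p = proportions_ztest([conv_a, conv_b], [visitors_a, visitors_b])
        if p < 0.05:           # peeking: stop as soon as the test "looks" significant
            false_winners += 1
            break

print(f"False positive rate with daily peeking: {false_winners / n_simulations:.0%}")
```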
The second mistake is testing too many variants at once; this can cause problems with the validity of the test. You can correct for this with a multiple-comparison adjustment, either through your testing software or by calculating it yourself, but it’s always good to have the hypothesis driving the test’s “why”, and when that is the case, it’s easier to avoid running too many variants at the same time.
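As a sketch of what that correction can look like, one common approach (not necessarily the one your testing software uses) is the Bonferroni adjustment; the p-values below are invented.

```python
from statsmodels.stats.multitest import multipletests

# Invented p-values from comparing five variants against the control.
p_values = [0.04, 0.03, 0.20, 0.45, 0.01]

# Bonferroni: each comparison must clear a stricter bar (alpha divided by the
# number of tests) before we declare that variant a winner.
reject, corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(list(zip(corrected.round(2), reject)))
```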
The third mistake is giving too much importance to “micro-metrics” (a click to a product page or an “add to cart”), because a boost in a micro-metric does not always translate into a boost in “macro-metrics” such as profit or overall conversion rate.
That’s it for this week! Next week, I will discuss the Google Analytics course and what I am learning in Module 3. Thanks for following along!
Emily