A/B testing @ Outbrain – Wabbit
What Is A/B Testing
A/B testing is a method widely used to validate assumptions about web site optimizations. With A/B tests we can test two configurations, configuration A and configuration B, of a web page design and compare them according to some metrics that define what a success result is. In other words, you test your new design against the current design and measure which one produces better results. To decide which design is better than the other, you split the traffic to your web page between these two configurations and then you can measure which configuration had better performance and apply this configuration as the default configuration of your site.
What To Test?
The choice of what to test depends on your goals. In Outbrain each configuration is called an A/B test variant. The idea of Outbrain’s A/B testing is to allow publishers to test two different designs of their widgets, and measure which design had better Click Through Rate (CTR) and Revenue Per 1,000 Impressions (RPM) performance.
In the core of the system there are more than 450 settings that define the configuration of each widget, which is installed on a blog or a group of sites.
There are more than two hundred online settings that directly affect the widget. Each of these settings can be tested within A/B test variants. For example, one of these online settings is called “Widget Structure”. This setting configures the look and feel of the widget.
Widget structure – look and feel of the widget
If your goal is to test an addition of a new widget structure, you can configure the variant A with the new widget structure addition, against variant B that uses the original design of the widget structure and serves as the control group.
When the test comes to an end many questions may come up. How did it affect the customers? Did the new design of the widget structure deliver better CTR and RPM performance? Maybe if we changed the title of the new widget structure it would have resulted in better performance? Maybe if we changed the images size of the old widget structure, it would have resulted in better performance? All of these questions can be answered one by one if we set appropriate A/B test variants.
Even though each A/B test in our system is unique, there are certain widget settings that are usually tested for every variant:
- Number of paid recommendations
- Number of organic recommendations
- Image size in the widget
- The number of recommendations on the widget unit
- Widget structure
A/B Tests in Outbrain
Once you decided that you want to create a new A/B test, you can do it using an internal tool named Wabbit – Widget A/B testing tool. The tool gives you the ability to create/edit an existing A/B test or to pull internal reports with Key Performance Indicator (KPI) performance for the test.
The A/B test can be defined on a specific widget on one site or it can be done on a group of sites that use the same widget.
When the test ends, we pull the A/B test report to measure which configuration had better performance. If the data indicates one of the configurations is an improvement according to our KPIs and the test has experienced enough traffic to be considered significant, we give the option to apply the new configuration as the default for the widget.
- In Outbrain we recommend running experiments for at least two weeks and no more than a month. The main reason for that is to eliminate the “day of the week” effect because users who visit the site on the weekend might represent different segment than those who visit the site during the week.
- On the other hand, running an A/B test more than a month leads to unreliable test results, such as cookie expiration that causes the users to start to see different configurations which compromise the consistency of the test.
- At Outbrain, we also recommend allocating at least 5% of traffic toward an AB test to increase the probability of ending the test with results that have more than a 90% confidence level based on statistical analysis. Here’s a calculator from KissMetrics that will allow you to easily figure out if you’re A/B test results are significant.