A/B and Multi-armed Bandit Tests

Zachary Greenberg
4 min read · Jul 10, 2021

The job of a data scientist is to make sense of data. With that in mind, there is real crossover with many other fields that rely on the scientific method and experimentation, such as medicine. When learning about experimental design, we learn that something can be measured as long as there is something to compare it to. There are many ways to run these kinds of tests, but for now I am going to shine a light on A/B Testing and Multi-armed Bandit tests.

What is A/B Testing?

In an A/B test, we have two groups of randomly selected individuals or events; after the experiment is complete, we draw our conclusions by comparing the groups. During the experiment, we are exploring and observing the groups. After the experiment, we are exploiting the information we gained from it. These two concepts, exploration and exploitation, are particularly important for Multi-armed Bandit tests, but I will introduce them now.

As I mentioned above, A/B testing is a prime example of what goes on in the medical field. Using it, we can assess the possible significance of a breakthrough drug. The analysis of the results is typically carried out through hypothesis testing.

For data scientists, hypothesis testing is fundamental but also fairly simple to perform. Honestly, the hard part is collecting the data for the experiment. For a brief example, let's say our experiment has run and we have our data. We want to compare SAT scores between students who received tutoring and those who did not:

#Declare what you are testing! The test below is a TWO-SIDED test!
#NULL HYPOTHESIS: There is no difference between tutoring and non-tutoring groups.
#ALT HYPOTHESIS: There is a difference between tutoring and non-tutoring groups.
from scipy import stats

#we use the .pvalue attribute to obtain the p-value, or the
#.statistic attribute to get the test statistic
stats.ttest_ind(tutor_group, no_tutor_group).pvalue

Under the hood of the stats module, ttest_ind performs the calculation for a two-sample t-test on independent samples. This is exactly how we would go about an A/B test in Python.
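To make this concrete, here is a self-contained sketch of the same test using simulated data. The group sizes, means, and spread below are invented purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated SAT scores: the tutored group gets a modest (hypothetical) boost.
tutor_group = rng.normal(loc=1120, scale=150, size=100)
no_tutor_group = rng.normal(loc=1050, scale=150, size=100)

result = stats.ttest_ind(tutor_group, no_tutor_group)
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.4f}")

# With a significance level of 0.05, reject the null hypothesis
# if the p-value falls below it.
if result.pvalue < 0.05:
    print("Reject the null: the groups appear to differ.")
else:
    print("Fail to reject the null.")
```

In a real experiment the two arrays would come from your collected data rather than a random number generator.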

What are Multi-armed Bandit tests?

Multi-armed Bandit tests are in the same vein as A/B tests; however, they gradually exploit the best option while the experiment is still running. A great diagram (linked in the references) shows these two tests side by side:

A/B testing waits until the end to determine the best version, while a Multi-armed Bandit test tries to identify the best version during the ongoing process.

There are many ways to go about Multi-armed Bandit tests. In my opinion, the most successful ones use machine learning to automate the process, and one of the best ways to do this is the epsilon greedy method. There is a plethora of algorithms for this, but it is a dense topic, so I will stick to something simpler for now.

What is the Epsilon Greedy method?

It is fairly simple. Essentially, it structures how you go about exploitation in a smart way, letting you check that you are going with the best option. Epsilon is a threshold that you set. By generating a random number and comparing it to epsilon, we decide which action to take. In the code below, if the number is above epsilon, we explore; if not, we exploit. (Note that many references define it the other way around, exploring with probability epsilon; the logic below follows the convention just described.) Here is a simple way to set this algorithm up in Python:

import random
import statistics

def track_exploitation(group):
    #This helper function keeps track of the means of the groups.
    #Usually, the higher the mean, the higher the reward. This
    #signifies what we want to exploit.
    return statistics.mean(group)

def epsilon_greedy(epsilon, groups):
    #This function takes in epsilon and the groups. It randomly
    #generates a number. If it is above the threshold, it chooses
    #a random group (ie. explore). If below, it chooses the
    #group with the highest mean (ie. exploitation).
    num = random.random()

    if num > epsilon:
        pick = random.choice(groups)
        return pick
    else:
        #look for the group with the highest mean
        means = []
        for n in groups:
            group_mean = track_exploitation(n)
            means.append(group_mean)

        for i, n in enumerate(means):
            if max(means) == n:
                return groups[i]

It really is as simple as that. Of course, the means, i.e. the metric tracking performance, will change over time, so the group that is best to exploit will also be updated.

But if A/B testing is great, why should we use this?

The choice to implement a Multi-armed Bandit test is situational. The best example I've heard is testing ads for a holiday sale. It is a time-sensitive case: companies need to find the most effective ads before the season ends. If they wait for the A/B test to conclude, half of their audience will have seen the less effective ad the whole time.
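To see the payoff, here is a hedged toy simulation of that scenario. The two click-through rates, the epsilon value, and the ad names are all invented, and the snippet follows the article's convention (exploit with probability epsilon). Over many rounds, the bandit ends up showing the better ad far more often than the fixed 50/50 split an A/B test would use:

```python
import random

random.seed(1)

# Hypothetical click-through rates for two holiday ads.
ctr = {"ad_A": 0.05, "ad_B": 0.11}
clicks = {"ad_A": [0], "ad_B": [0]}  # seed each arm with one observation
epsilon = 0.9  # exploit with probability epsilon, per the convention above

shown = {"ad_A": 0, "ad_B": 0}
for _ in range(10_000):
    if random.random() > epsilon:
        ad = random.choice(list(ctr))  # explore: pick an ad at random
    else:
        # exploit: pick the ad with the highest observed mean reward
        ad = max(clicks, key=lambda a: sum(clicks[a]) / len(clicks[a]))
    shown[ad] += 1
    # Simulate whether this impression produced a click.
    clicks[ad].append(1 if random.random() < ctr[ad] else 0)

print(shown)  # the higher-CTR ad should dominate the impressions
```

A pure A/B test would have shown each ad roughly 5,000 times regardless of performance; the bandit shifts impressions toward the winner while the season is still underway.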

In Conclusion

The world of experimental design is highly applicable across many fields, and this was just a brief taste of some of its wonders. A/B Testing and Multi-armed Bandit tests are two tools data scientists have in their arsenal to draw conclusions from data. Like many other techniques, the choice to apply them depends on what you are setting out to achieve, what kind of data you have at your disposal, and the situations and circumstances surrounding your work.

References:

A/B Testing — https://www.alexbirkett.com/ab-testing/

Diagram — https://persona.ly/glossary/performance-metrics/creative-testing-multi-armed-bandit-vs-a-b-testing/

Implementation of Bandit Testing — https://www.geeksforgeeks.org/epsilon-greedy-algorithm-in-reinforcement-learning/

Multi-armed Bandit Testing — https://cxl.com/blog/bandit-tests/

Scipy Documentation — https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
