Skip to main content

Command Palette

Search for a command to run...

Multi Armed Bandits - What, Why and How ?

Updated
3 min read
Multi Armed Bandits - What, Why and How ?

Bandit algorithms are a method of solving a typical tradeoff known as the Exploration-Exploitation tradeoff. In this system, a learning model needs to repeatedly make a set of decisions in a limited knowledge discrete environment and make sure that the model picks the best possible result. The dilemma here though, is that whether the model needs to continue picking the best solution to the system it has found so far (exploitation) or pick any of the other solutions available under the impression that they would yield far greater rewards (exploration). The main goal is to maximize the total gain from the system.

What are Multi-Armed bandits (MABs)

  1. MABs sometimes referred to as k-Armed Bandits arise from slot machines that typically have a single arm and we "pull" the arm to get a reward. Our goal is to get a maximum reward each time we pull the slot machine.
  2. MABs use a set of probability distributions (\( \beta \) distributions) based on the reward received on an arm pull and decides to whether explore or exploit in the next arm pull based on the updated distributions.
  3. Just like in a real-world situation a gambler will gain an understanding of the right time to pull the lever in order to get the best reward, a MAB will cumulatively gain an understanding for the right arm pull based on the updated distributions in order to maximize its rewards.

Why MABs

MABs are used in scenarios where there is an uncertainty in reward expected amongst multiple possible solutions. Ex: Recommendation systems and Online advertisements, where we need to decide on a certain image/font/color that needs to be used on the landing page of a website.

  1. Unlike a typical A/B test where we discretely jump between exploration and exploitation, a MAB allows for a smooth transition between the two experiments.
  2. They smoothly decrease the amount of exploring they do over time instead of requiring you to make a sudden switch between arms.
  3. MABs don't tend to explore the solutions completely, in a way that they never tend to choose the worst-performing arm during the exploitation phase.
  4. Generally, when we have multiple experiments to run, it is also impossible to choose between A/B/C/x/x/x tests as they don't provide enough conclusive evidence against a certain choice. All good bandit algorithms eventually converge at a good solution provided you run enough experiments.

How MABs are used

There are multiple strategies in implementing MABs to solve the exploration-exploitation problem.

  1. The strategies revolve around finding the optimal arm selection strategies, or policies. Ex: Epsilon Greedy Algorithm, SoftMax algorithm, Upper Confidence Bound, and Thompson Sampling.
  2. The Epsilon Greedy Algorithm exploits the current best action all the time, even if the actions leads to a bad outcome this time around. This is a greedy approach to solving this problem. However, the term epsilon comes in for the time the algorithm actually explores rather than exploits. So say if the epsilon parameter is set to \( \epsilon = 0.5 \) the model explores 5% of the total time and 95% of the time it will exploit.
  3. Sometimes, it will be useful for the algorithm to consider a set of \( d-dimensional \) vectors which contain some information that might dictate the action before choosing the right arm. This strategy of utilising a set of control variables before arriving at a solution is called the Contextual MABs. The learner uses these context vectors along with the rewards of the arms played in the past to make the choice of the arm to play in the current iteration.

image.png

In a future post, I will talk about another self-regulating algorithm that explore and exploits using a bayesian formulation - Thompson Sampling - by walking through a code example.