make good decisions when it starts out knowing nothing and can only learn through trial and error?
That is exactly what one of the simplest but most important models in reinforcement learning is all about:
A multi-armed bandit is a simple model for learning by trial and error.
Just like we do.
We will explore why the choice between trying something new (exploration) and sticking with what works (exploitation) is trickier than it seems, and what this has to do with AI, online ads and A/B testing.
Why is it important to understand this concept?
The multi-armed bandit introduces one of the core dilemmas of reinforcement learning: How to make good decisions under uncertainty.
It is relevant not only for AI, data science and behavioral models, but also because it reflects how we humans learn through trial and error.
What machines learn by trial and error is not so different from what we humans do intuitively.
The difference?
Machines do it in a mathematically optimized way.
Let's consider a simple example:
We are standing in front of a slot machine. This machine has 10 arms, and each of these arms has an unknown probability of winning.
Some levers give higher rewards, others lower ones.
We can pull the levers as often as we like, but our goal is to win as much as possible.
This means we have to find out which arm is the best one (= yields the most profit) without knowing from the start which one it is.
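Here is a minimal sketch of this setup in Python. The win probabilities are made up for illustration; in the real problem they stay hidden from the agent:

```python
import random

# A 10-armed bandit: each arm pays out 1 with an unknown probability, else 0.
class Bandit:
    def __init__(self, n_arms=10, seed=42):
        rng = random.Random(seed)
        # The true win probabilities are fixed but hidden from the agent.
        self.probs = [rng.random() for _ in range(n_arms)]

    def pull(self, arm):
        # Reward of 1 with the arm's hidden probability, otherwise 0.
        return 1 if random.random() < self.probs[arm] else 0

bandit = Bandit()
print(bandit.pull(3))  # 0 or 1 -- the agent only ever sees these rewards
```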
The model is very reminiscent of what we often experience in everyday life:
We test out different strategies. At some point, we settle on the one that brings us the most pleasure, enjoyment, money, etc. Whatever it is we are aiming for.
In behavioral psychology, this is called trial-and-error learning.
Or think of reward learning in cognitive psychology: Animals in a laboratory experiment find out over time which lever delivers food, because that particular lever gives them the biggest payoff.
Now back to the concept of multi-armed bandits:
It serves as an introduction to decision-making under uncertainty and is a cornerstone for understanding reinforcement learning.
I wrote about reinforcement learning (RL) in detail in the last article "Reinforcement Learning Made Simple: Build a Q-Learning Agent in Python". But at its core, it is about an agent learning to make good decisions through trial and error. It is a subfield of machine learning. The agent finds itself in an environment, decides on certain actions and receives rewards or penalties for them. The agent's goal is to develop a strategy (policy) that maximizes the long-term overall return.
So, in the multi-armed bandit problem, we have to find out:
- Which levers are worthwhile in the long run?
- When should we keep pulling a lever we already know (exploitation)?
- When should we try out a new lever (exploration)?
The last two questions lead us straight to the central dilemma of reinforcement learning:
Central dilemma in reinforcement learning: Exploration vs. Exploitation
Have you ever held on to a good option, only to find out later that there was a better one? That's exploitation winning over exploration.
This is the core problem of learning through experience:
- Exploration: We try something new in order to learn more. Maybe we discover something better. Or maybe not.
- Exploitation: We use the best of what we have learned so far, with the aim of collecting as much reward as possible.
The problem with this?
We never know for sure whether we have already found the best option.
Choosing the arm with the highest reward so far means relying on what we know. This is called exploitation. However, if we commit too early to a seemingly good arm, we may overlook an even better option.
Trying a different or rarely used arm gives us new information. We gain knowledge. This is exploration. We might find a better option. But we might also find a worse one.
That is the dilemma at the heart of reinforcement learning.

What we can conclude from this:
If we only exploit too early, we may miss out on the better arms (for example, settling on arm 1 when arm 3 would actually have been better). However, too much exploration also leads to a lower overall return (if we already know that arm 1 is good).
Let me explain the same thing again in non-techy language (somewhat simplified):
Let's imagine we know a good restaurant. We have been going to the same restaurant for 10 years because we like it. But what if there is a better, cheaper place just around the corner, and we have never tried it? If we never try something new, we will never find out.
Interestingly, this is not just a problem in AI. It is well known in psychology and economics too:
The exploration vs. exploitation dilemma is a prime example of decision-making under uncertainty.
The psychologist and Nobel Prize winner Daniel Kahneman and his colleague Amos Tversky showed that people often do not make rational decisions when faced with uncertainty. Instead, we follow heuristics, i.e. mental shortcuts.
These shortcuts often reflect either habit (= exploitation) or curiosity (= exploration). It is exactly this dynamic that we also see in the multi-armed bandit:
- Do we play it safe (= known arm with high reward)?
- Or do we risk something new (= new arm with unknown reward)?
Why does this matter for reinforcement learning?
We face the dilemma between exploration and exploitation everywhere in reinforcement learning (RL).
An RL agent must constantly decide whether to stick with what has worked best so far (= exploitation) or to try something new in order to discover even better strategies (= exploration).
You can see this trade-off in action in recommendation systems: Should we keep showing users content they already like, or risk suggesting something new they might love?
And what strategies are there to select the best arm? Action selection strategies
Action selection strategies determine how an agent decides which arm to select in the next step. In other words, how an agent deals with the exploration vs. exploitation dilemma.
Each of the following strategies (also called policies or rules) answers one simple question: How do we choose the next action when we don't know for sure what is best?
Strategy 1 – Greedy
This is the simplest strategy: We always choose the arm with the highest estimated reward (= the highest Q(a)). In other words, we always go for what looks best right now.
The advantage of this strategy is that it maximizes reward in the short term and is very simple.
The disadvantage is that there is no exploration. No risk is taken to try something new, because the current best always wins. The agent might miss better options that it simply hasn't discovered yet.
The formal rule is as follows:
At = argmaxₐ Qt(a)
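In code, the greedy rule is just an argmax over the current estimates. A minimal sketch with made-up Q-values:

```python
# Greedy action selection: always pick the arm with the highest estimate Q(a).
Q = [0.2, 0.5, 0.1, 0.4]  # current estimates per arm (illustrative numbers)

def greedy(Q):
    # argmax over the estimates; ties go to the first best arm
    return max(range(len(Q)), key=lambda a: Q[a])

print(greedy(Q))  # -> 1, the arm that currently looks best
```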
Let's look at a simplified example:
Imagine we try two new pizzerias and the second one is quite good. From then on, we only go back to that one, even though there are six more we have never tried. Maybe we are missing out on the best pizza in town. But we will never know.
Strategy 2 – ε-Greedy
Instead of always picking the best-known option, this strategy allows for some randomness:
- With probability ε, we explore (try something new).
- With probability 1 − ε, we exploit (stick with the current best).
This strategy deliberately mixes chance into the decision and is therefore practical and often effective.
- The higher ε is chosen, the more exploration happens.
- The lower ε is chosen, the more we exploit what we already know.
For example, if ε = 0.1, exploration occurs in 10% of cases, while exploitation occurs in 90% of cases.
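A minimal sketch of ε-greedy selection, again with made-up estimates:

```python
import random

Q = [0.2, 0.5, 0.1, 0.4]  # current estimates per arm (illustrative numbers)
epsilon = 0.1             # exploration rate

def epsilon_greedy(Q, epsilon):
    if random.random() < epsilon:
        # explore: pick a random arm
        return random.randrange(len(Q))
    # exploit: pick the arm with the highest current estimate
    return max(range(len(Q)), key=lambda a: Q[a])

choices = [epsilon_greedy(Q, epsilon) for _ in range(1000)]
print(choices.count(1) / 1000)  # roughly 0.92: mostly arm 1, plus a few random picks
```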
The advantage of ε-greedy is that it is easy to implement and delivers good basic performance.
The disadvantage is that choosing the right ε is difficult: If ε is too large, a lot of exploration takes place and too much reward is lost. If ε is too small, there is little exploration.
To stay with the pizza example:
We roll a die before every restaurant visit. If we roll a 6, we try a new pizzeria. If not, we go to our regular one.
Strategy 3 – Optimistic Initial Values
The idea behind this strategy is that all initial estimates Q0(a) start with artificially high values (e.g. 5.0 instead of 0.0). At first, the agent assumes all options are great.
This encourages the agent to try everything (exploration). It wants to disprove the high initial value. As soon as an action has been tried, the agent sees that it is worth less and adjusts the estimate downwards.
The advantage of this strategy is that exploration happens automatically. It is particularly suitable in deterministic environments where rewards don't change.
The disadvantage is that the strategy works poorly if the real rewards are already high, because then the optimistic start barely triggers any extra exploration.
Looking at the restaurant example again, we would rate every new restaurant with 5 stars at the start. As we actually try them, we adjust the ratings based on real experience.
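A minimal sketch of this idea on a made-up 3-armed bandit, using a small constant step size so the optimism wears off gradually:

```python
import random

random.seed(1)

true_probs = [0.3, 0.6, 0.5]  # hidden win probabilities (made up)
Q = [5.0, 5.0, 5.0]           # optimistic start: every arm looks great
alpha = 0.1                   # small constant step size

for step in range(300):
    arm = max(range(len(Q)), key=lambda a: Q[a])         # purely greedy choice
    reward = 1 if random.random() < true_probs[arm] else 0
    Q[arm] += alpha * (reward - Q[arm])                  # nudge the estimate toward reality

print([round(q, 2) for q in Q])
print(max(range(len(Q)), key=lambda a: Q[a]))  # usually 1, the arm that actually pays best
# The optimistic start forces the agent to try every arm repeatedly
# before it commits to the one that actually pays best.
```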
To put it simply: Greedy is pure routine behavior. ε-Greedy is a mix of habit and curiosity. Optimistic Initial Values is comparable to a child who initially thinks every new toy is great, until it has tried it out.
On my Substack, Data Science Espresso, I regularly share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning and Tech, made for curious minds like yours. Take a look and subscribe if you want to stay in the loop.
How the agent learns which options are worthwhile: Estimating Q-values
For an agent to make good decisions, it must estimate how good each individual arm is. It needs to find out which arm will bring the highest reward in the long run.
However, the agent does not know the true reward distribution.
This means the agent has to estimate the average reward of each arm based on experience. The more often an arm is pulled, the more reliable this estimate becomes.
We use an estimated value Q(a) for this:
Q(a) ≈ expected reward if we choose arm a
Our goal is for the estimate Qt(a) to get better and better, until it comes as close as possible to the true value q∗(a):
Qt(a) → q∗(a)
The agent wants to learn from its experience in such a way that its estimate Qt(a) eventually matches the long-run average payoff of arm a.
Let's look again at our simple restaurant example:
Imagine we want to find out how good a particular café is. Every time we go there, we give it some feedback, for example 3, 4 or 5 stars. Our goal is that the perceived average eventually matches the real average we would get if we went infinitely often.
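A tiny simulation of that idea with made-up numbers: each visit we hand out a random 3, 4 or 5 stars, so the assumed true average is 4.0, and the running mean typically gets closer to it the more visits we average over:

```python
import random

random.seed(0)

# Made-up café feedback: 3, 4 or 5 stars per visit, true long-run average 4.0.
ratings = []
for visit in range(1, 1001):
    ratings.append(random.choice([3, 4, 5]))  # noisy single-visit feedback
    if visit in (1, 10, 100, 1000):
        estimate = sum(ratings) / len(ratings)
        print(f"after {visit:4d} visits: estimated average = {estimate:.2f}")
```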
There are two main ways in which an agent calculates this Q-value:

Method 1 – Sample-Average Method
This method calculates the average of the observed rewards and is exactly as simple as it sounds.
All previous rewards for this arm are taken and their average is calculated:
Qn(a) = (R1 + R2 + … + Rn) / n
- n: Number of times arm a was chosen
- Ri: Reward received the i-th time arm a was chosen
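In code, this is nothing more than a running mean per arm. A minimal sketch with illustrative rewards:

```python
# Sample-average estimate for one arm: the mean of all rewards seen so far.
rewards_for_arm = [0, 1, 1, 0, 1]  # R1 ... Rn (illustrative values)

Q = sum(rewards_for_arm) / len(rewards_for_arm)  # (R1 + ... + Rn) / n
print(Q)  # 0.6 -- the current estimate of this arm's value
```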
The advantage of this method is that it is simple and intuitive, and it is statistically correct for stable, stationary problems.
The disadvantage is that it reacts slowly to change, especially in non-stationary environments where conditions shift over time.
For example, think of a music recommendation system: A user might suddenly develop a new taste. They used to prefer rock, but now they listen to jazz. If the system keeps averaging over all past preferences, it reacts very slowly to this change.
Similarly, in the multi-armed bandit setting, if arm 3 suddenly starts giving much better rewards from round 100 onwards, the running average will be too slow to reflect that. The early data still dominates and hides the trend.
Method 2 – Incremental Implementation
Here, the Q-value is adjusted immediately with each new reward, without storing all previous data:
Qn+1(a) = Qn(a) + α · (Rn − Qn(a))
- α: Learning rate (0 < α ≤ 1)
- Rn: Newly observed reward
- Qn(a): Previous estimate
- Qn+1(a): Updated estimate
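A minimal sketch of this update on a single arm whose reward level changes halfway through (the numbers are made up), showing how a constant α tracks the change:

```python
import random

random.seed(2)

alpha = 0.1  # constant learning rate
Q = 0.0      # initial estimate

for step in range(200):
    # Assumed non-stationary reward: the arm pays off ~20% of the time at first, ~80% later.
    p = 0.2 if step < 100 else 0.8
    reward = 1 if random.random() < p else 0
    Q = Q + alpha * (reward - Q)  # Qn+1 = Qn + alpha * (Rn - Qn)

print(round(Q, 2))  # typically close to 0.8 -- the estimate has followed the change
```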
If the environment is stable and rewards don't change, the sample-average method works best. But if things change over time, the incremental method with a constant learning rate α adapts more quickly.

Final Thoughts: What do we need this for?
Multi-armed bandits are the basis for many real-world applications such as recommendation engines or online advertising.
At the same time, they are the perfect stepping stone into reinforcement learning. They teach us the mindset: learning through feedback, acting under uncertainty and balancing exploration and exploitation.
Technically, multi-armed bandits are a simplified form of reinforcement learning: There are no states and no planning for the future, only the immediate rewards. But the logic behind them shows up again and again in advanced methods like Q-learning, policy gradients, and deep reinforcement learning.
Curious to go further?
On my Substack Data Science Espresso, I share guides like this one, breaking down complex AI topics into digestible, practical steps. If you enjoyed this, subscribe here to stay in the loop.