Intro
In reinforcement learning, exploration vs. exploitation is a tradeoff in which an agent must decide between:
- Exploration: Trying new actions to discover better strategies
  - helps avoid local optima
- Exploitation: Using known information to maximize immediate rewards
  - often leads to immediate high performance
The exploration vs. exploitation dilemma is one of the most foundational challenges in reinforcement learning. It is not just an algorithmic decision but a philosophical one that shapes how an agent learns everything it knows.
Balancing Exploration and Exploitation
The agent must strike the right balance: exploiting alone can miss better strategies and lead to suboptimal behavior, while over-exploring wastes time and risks poor outcomes, reducing performance.
The Multi-Armed Bandit Problem
Scenario
A gambler is faced with a row of slot machines ("one-armed bandits") and doesn't know the payout rate of each.
Dilemma
At every step, the gambler must choose between:
- Exploitation: pick the option that has given the best reward so far
- Exploration: try something uncertain to gather more information
The time spent switching machines and spending money to estimate which has the highest payout (exploration) is time not spent optimizing winnings (exploitation).
Our goal is to start making money as quickly as possible, but in the long term, we want to make the most money possible.
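The setup can be made concrete with a small simulation. Below is a minimal sketch of a k-armed bandit with Bernoulli payouts; the class name, payout probabilities, and `pull` method are illustrative assumptions, not a standard API.

```python
import numpy as np

# Minimal sketch of a k-armed bandit with Bernoulli payouts.
# The payout rates are hidden from the agent; only rewards are observed.
class BernoulliBandit:
    def __init__(self, payout_probs, seed=0):
        self.payout_probs = payout_probs        # true payout rate of each arm (unknown to the agent)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        # Reward is 1 with the chosen arm's payout probability, else 0.
        return 1.0 if self.rng.random() < self.payout_probs[arm] else 0.0

# Three slot machines with different hidden payout rates.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
```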
Solutions
Epsilon Greedy
- Usually exploits the best arm but explores randomly with probability ϵ
- Allows escaping from early greedy mistakes!
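As a rough sketch of how epsilon-greedy might look in code (reusing the `BernoulliBandit` sketch above; the function name, default parameters, and incremental-mean update are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(bandit, n_arms, n_steps=1000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a bandit exposing pull(arm) -> reward (sketch)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)      # how many times each arm has been pulled
    values = np.zeros(n_arms)      # running mean reward estimate per arm
    total_reward = 0.0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))   # explore: pick a random arm
        else:
            arm = int(np.argmax(values))      # exploit: pick the best arm so far
        reward = bandit.pull(arm)
        counts[arm] += 1
        # Incremental mean update: Q <- Q + (r - Q) / n
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward

    return values, total_reward

estimates, winnings = epsilon_greedy(BernoulliBandit([0.2, 0.5, 0.7]), n_arms=3)
```

With a small ϵ, most pulls go to the arm with the best current estimate, while the occasional random pull keeps refining the estimates of the other arms, which is what lets the agent recover from early greedy mistakes.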



