Intro
Q-learning is a model-free Reinforcement Learning algorithm that learns an action-value function Q(s,a), which estimates the expected cumulative reward of taking action a in state s. Its objective is to find the optimal policy that maximizes long-term reward in a Markov Decision Process (MDP).
It enables an agent to learn optimal actions through trial-and-error by interacting with the environment, without requiring knowledge of the transition function or reward model. Q-values are stored in a Q-table (or approximated with a function) and are updated iteratively using the Bellman update rule until convergence to the optimal policy.
Instead of learning a model of the environment, Q-learning directly learns the answer to one question:
“How good is it to take action a in state s?”
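As a concrete picture, for a small discrete problem the Q-table is just a 2-D array with one row per state and one column per action. A minimal sketch (the sizes are made-up for illustration):

```python
import numpy as np

# Hypothetical sizes for illustration: 16 states, 4 actions (e.g. a 4x4 gridworld)
n_states, n_actions = 16, 4

# One row per state, one column per action; initializing to zeros is a common choice
Q = np.zeros((n_states, n_actions))

# Q[s, a] holds the current estimate of "how good is action a in state s"
```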
Q-Function
Q-learning learns the action-value function

Q(s,a) = E[ r₀ + γr₁ + γ²r₂ + … | s₀ = s, a₀ = a ]

which estimates the expected cumulative discounted reward of taking action a in state s.

The optimal policy is derived as:

π*(s) = argmax over a of Q(s,a)
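For example, with γ = 0.9 and a reward of 1 on each of three consecutive steps (and 0 afterwards), the discounted return is 1 + 0.9 + 0.9² = 2.71; the Q-value is the expectation of such returns.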
ε-greedy Policy
The ε-greedy policy is used during training to balance exploration and exploitation:
- with probability ε → choose a random action (explore)
- with probability 1 − ε → choose the best-known action (exploit)
```python
import numpy as np

def select_action(q_values, epsilon, n_actions):
    # With probability epsilon: explore (random action)
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    # With probability 1 - epsilon: exploit (best-known action)
    else:
        return np.argmax(q_values)
```
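For example, calling `select_action(Q[s], epsilon=0.1, n_actions=4)` with the Q-table row for the current state (names from the sketch above) picks the greedy action 90% of the time and a uniformly random one otherwise.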
Update Rule
Q-learning gradually shifts old estimates toward new evidence, balancing prior knowledge with newly observed outcomes:

Q(s,a) ← Q(s,a) + α [ r + γ max Q(s’,a’) − Q(s,a) ]

Where:
- α = learning rate (controls how much new information overrides the old Q-value)
- γ = discount factor (importance of future reward)
- r = immediate reward
- s’ = next state
- max Q(s’,a’) = best possible future value from the next state
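A minimal sketch of this update for a tabular Q stored as a NumPy array (function and variable names are assumptions for illustration):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.75, done=False):
    """One Q-learning step: move Q[s, a] toward r + gamma * max Q(s', a')."""
    # Best achievable value from the next state; zero if the episode has ended
    best_next = 0.0 if done else np.max(Q[s_next])
    # TD target: immediate reward plus discounted best future value
    target = r + gamma * best_next
    # Blend the old estimate with the new target at rate alpha
    Q[s, a] += alpha * (target - Q[s, a])
```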
---
Q-Learning Hyperparameters
Learning Rate (α)
Range: 0 < α ≤ 1
Controls how much new information overrides old Q-values.
Effect:
- High (α close to 1) → fast learning, unstable updates
- Low (α close to 0) → slow but stable learning
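For example, with α = 0.3 each update keeps 70% of the old estimate and takes 30% from the newly observed target.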
Discount Factor (γ)
Range: 0 ≤ γ ≤ 1
Controls importance of future rewards vs immediate rewards.
Effect:
- γ = 0 → only immediate reward matters (short-sighted)
- γ close to 1 → long-term rewards matter heavily (far-sighted)
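For example, with γ = 0.75 a reward received 10 steps in the future is weighted by 0.75¹⁰ ≈ 0.056, i.e. it counts for only about 6% of its face value.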
Exploration Rate (ε)
Range: 0 ≤ ε ≤ 1
Controls probability of choosing a random action instead of the best-known action.
Effect:
- High (ε close to 1) → more exploration
- Low (ε close to 0) → more exploitation
Typically, ε starts high and decays over time as the agent learns.
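A common decay scheme is multiplicative with a floor, so some exploration always remains (the constants here are illustrative assumptions):

```python
epsilon = 1.0        # start fully exploratory
epsilon_min = 0.05   # keep a little exploration forever
decay = 0.995        # per-episode multiplicative decay factor

for episode in range(1000):
    # ... run one episode with the current epsilon, then decay it ...
    epsilon = max(epsilon_min, epsilon * decay)
```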
Parameter Overview
| Parameter | What it Controls | Effect |
|---|---|---|
| Learning Rate (α) | How fast Q-values update | Speed vs. stability |
| Discount Factor (γ) | Importance of future rewards | Short-term vs. long-term planning |
| Exploration Rate (ε) | Random vs. greedy actions | Exploration vs. exploitation |
Tuning Recommendations (Millington)
- Learning Rate: 0.3 (can vary from 0.1 to 0.7)
- Discount Rate: 0.75
- Randomness for exploration:
  - Online: 0.1
  - Offline: 0.2
Algorithm
Q-learning iteratively updates the current Q-value by blending it with a target value made from the immediate reward plus the estimated best future value of the next state.
1. Initialize Q(s,a) arbitrarily
2. Observe current state s
3. Choose action a (ε-greedy)
4. Execute action, observe r and s’
5. Update Q(s,a)
6. Set s ← s’
7. Repeat from step 3
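Putting the steps together, here is a minimal end-to-end sketch assuming a Gymnasium-style discrete environment such as FrozenLake-v1 (the hyperparameters are illustrative, not prescriptive):

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))  # step 1: initialize Q

alpha, gamma, epsilon = 0.3, 0.75, 0.1

for episode in range(5000):
    s, _ = env.reset()                       # step 2: observe current state
    done = False
    while not done:
        # Step 3: epsilon-greedy action selection
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        # Step 4: execute action, observe reward and next state
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Step 5: update Q(s,a) toward r + gamma * max Q(s',a')
        best_next = 0.0 if terminated else np.max(Q[s_next])
        Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
        s = s_next                           # step 6: move to the next state
```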
Performance
- Time complexity: O(i), where i is the number of learning iterations (each update touches one Q-entry plus a max over the next state's actions)
- Space complexity: O(|S| × |A|), where |S| = number of states and |A| = number of actions
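As a worked example, a problem with 1,000 states and 4 actions needs a table of 4,000 Q-values, roughly 32 KB stored as 64-bit floats.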



