Intro
Q-learning is a model-free Reinforcement Learning algorithm that learns an action-value function Q(s,a), which estimates the expected cumulative reward of taking action a in state s. Its objective is to find the optimal policy that maximizes long-term reward in a Markov Decision Process (MDP).
It enables an agent to learn optimal actions through trial-and-error by interacting with the environment, without requiring knowledge of the transition function or reward model. Q-values are stored in a Q-table (or approximated with a function) and are updated iteratively using the Bellman update rule until convergence to the optimal policy.
Instead of learning a model of the environment, Q-learning directly learns the answer to one question:
“How good is it to take action a in state s?”
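As a concrete picture, for a small discrete problem the Q-table is just a 2-D array with one row per state and one column per action. A minimal sketch (the sizes are made-up for illustration):

```python
import numpy as np

# Hypothetical sizes for illustration: 16 states, 4 actions (e.g. a 4x4 gridworld)
n_states, n_actions = 16, 4

# One row per state, one column per action; initializing to zeros is a common choice
Q = np.zeros((n_states, n_actions))

# Q[s, a] holds the current estimate of "how good is action a in state s"
```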
Q-Function
Q-learning learns the action-value function

Q(s,a) = E[ r₀ + γr₁ + γ²r₂ + … | s₀ = s, a₀ = a ]

which estimates the expected cumulative discounted reward of taking action a in state s.

The optimal policy is derived as:

π*(s) = argmax over a of Q(s,a)
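For example, with γ = 0.9 and a reward of 1 on each of three consecutive steps (and 0 afterwards), the discounted return is 1 + 0.9 + 0.9² = 2.71; the Q-value is the expectation of such returns.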
ε-greedy Policy
The ε-greedy policy is used during training to balance exploration and exploitation:
- with probability ε → choose a random action (explore)
- with probability 1 − ε → choose the best-known action (exploit)
```python
import numpy as np

def select_action(q_values, epsilon, n_actions):
    # With probability epsilon: explore (random action)
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    # With probability 1 - epsilon: exploit (best-known action)
    else:
        return np.argmax(q_values)
```
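For example, calling `select_action(Q[s], epsilon=0.1, n_actions=4)` with the Q-table row for the current state (names from the sketch above) picks the greedy action 90% of the time and a uniformly random one otherwise.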
Update Rule
Q-learning gradually shifts old estimates toward new evidence, balancing prior knowledge with newly observed outcomes:

Q(s,a) ← Q(s,a) + α [ r + γ max Q(s’,a’) − Q(s,a) ]

Where:
- α = learning rate (controls how much new information overrides the old Q-value)
- γ = discount factor (importance of future reward)
- r = immediate reward
- s’ = next state
- max Q(s’,a’) = best possible future value from the next state
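A minimal sketch of this update for a tabular Q stored as a NumPy array (function and variable names are assumptions for illustration):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.75, done=False):
    """One Q-learning step: move Q[s, a] toward r + gamma * max Q(s', a')."""
    # Best achievable value from the next state; zero if the episode has ended
    best_next = 0.0 if done else np.max(Q[s_next])
    # TD target: immediate reward plus discounted best future value
    target = r + gamma * best_next
    # Blend the old estimate with the new target at rate alpha
    Q[s, a] += alpha * (target - Q[s, a])
```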
---
Q-Learning Hyperparameters
Learning Rate (α)
Range: 0 < α ≤ 1
Controls how much new information overrides old Q-values.
Effect:
- High (α close to 1) → fast learning, unstable updates
- Low (α close to 0) → slow but stable learning
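For example, with α = 0.3 each update keeps 70% of the old estimate and takes 30% from the newly observed target.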
Discount Factor (γ)
Range: 0 ≤ γ ≤ 1
Controls importance of future rewards vs immediate rewards.
Effect:
- γ = 0 → only immediate reward matters (short-sighted)
- γ close to 1 → long-term rewards matter heavily (far-sighted)
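For example, with γ = 0.75 a reward received 10 steps in the future is weighted by 0.75¹⁰ ≈ 0.056, i.e. it counts for only about 6% of its face value.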
Exploration Rate (ε)
Range: 0 ≤ ε ≤ 1
Controls probability of choosing a random action instead of the best-known action.
Effect:
- High (ε close to 1) → more exploration
- Low (ε close to 0) → more exploitation
Typically, ε starts high and decays over time as the agent learns.
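A common decay scheme is multiplicative with a floor, so some exploration always remains (the constants here are illustrative assumptions):

```python
epsilon = 1.0        # start fully exploratory
epsilon_min = 0.05   # keep a little exploration forever
decay = 0.995        # per-episode multiplicative decay factor

for episode in range(1000):
    # ... run one episode with the current epsilon, then decay it ...
    epsilon = max(epsilon_min, epsilon * decay)
```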
Parameter Overview
| Parameter | What it Controls | Effect |
|---|---|---|
| Learning Rate (α) | How fast Q-values update | Speed vs. stability |
| Discount Factor (γ) | Importance of future rewards | Short-term vs. long-term planning |
| Exploration Rate (ε) | Random vs. greedy actions | Exploration vs. exploitation |
Tuning Recommendations (Millington)
- Learning Rate: 0.3 (can vary from 0.1 to 0.7)
- Discount Rate: 0.75
- Randomness for exploration:
  - Online: 0.1
  - Offline: 0.2
Algorithm
Q-learning iteratively updates the current Q-value by blending it with a target value made from the immediate reward plus the estimated best future value of the next state.
1. Initialize Q(s,a) arbitrarily
2. Observe current state s
3. Choose action a (ε-greedy)
4. Execute action, observe r and s’
5. Update Q(s,a)
6. Set s ← s’
7. Repeat from step 3
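Putting the steps together, here is a minimal end-to-end sketch assuming a Gymnasium-style discrete environment such as FrozenLake-v1 (the hyperparameters are illustrative, not prescriptive):

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))  # step 1: initialize Q

alpha, gamma, epsilon = 0.3, 0.75, 0.1

for episode in range(5000):
    s, _ = env.reset()                       # step 2: observe current state
    done = False
    while not done:
        # Step 3: epsilon-greedy action selection
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        # Step 4: execute action, observe reward and next state
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Step 5: update Q(s,a) toward r + gamma * max Q(s',a')
        best_next = 0.0 if terminated else np.max(Q[s_next])
        Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
        s = s_next                           # step 6: move to the next state
```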
Performance
- Time complexity: O(i), where i is the number of learning iterations (each update touches one Q-entry plus a max over the next state's actions)
- Space complexity: O(|S| × |A|), where |S| = number of states and |A| = number of actions
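As a worked example, a problem with 1,000 states and 4 actions needs a table of 4,000 Q-values, roughly 32 KB stored as 64-bit floats.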



