Intro

Q-learning is a model-free reinforcement learning algorithm that learns an action-value function Q(s, a), estimating the expected cumulative reward of taking action a in state s. Its objective is to find the optimal policy that maximizes long-term reward in a Markov Decision Process (MDP).

It enables an agent to learn optimal actions through trial and error by interacting with the environment, without requiring knowledge of the transition function or reward model. Q-values are stored in a Q-table (or approximated with a function) and are updated iteratively using the Bellman update rule until they converge to the optimal action values.
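
As a minimal sketch of the tabular case (the state and action counts are assumed placeholder values, not from the source), the Q-table is just a 2-D array indexed by state and action:

import numpy as np

n_states, n_actions = 16, 4          # assumed sizes for illustration
Q = np.zeros((n_states, n_actions))  # Q[s, a] = current estimate of Q(s, a)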


Instead of learning a model of the environment, Q-learning learns:

“How good is it to take action a in state s?”


Q-Function

Q-learning learns the action-value function¹

$$Q(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\; a_{0} = a\right]$$

which estimates the expected cumulative discounted reward of taking action a in state s.

The optimal policy is derived as:

$$\pi^{*}(s) = \arg\max_{a} Q(s, a)$$
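
As a quick illustration (reusing the tabular Q array sketched above), extracting the greedy policy is a single argmax over the action axis:

import numpy as np

Q = np.zeros((16, 4))           # assumed 16 states x 4 actions, as above
policy = np.argmax(Q, axis=1)   # policy[s] = action with the highest Q(s, a)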

ε-greedy Policy

The ε-greedy policy is used during training to balance exploration and exploitation:

  • with probability ε → choose a random action (explore)
  • with probability 1 − ε → choose the best known action (exploit)
import numpy as np
 
def select_action(q_values, epsilon, n_actions):
    # With probability epsilon - explore
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    # With probability 1 - epsilon - exploit
    else:
        return np.argmax(q_values)
 

Update Rule

Q-learning gradually shifts old estimates toward new evidence, balancing prior knowledge with newly observed outcomes, using the update rule (implemented in the sketch after the variable list):

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$

Where:

  • α = learning rate (controls how much new information overrides the old Q-value)
  • γ = discount factor (importance of future reward)
  • r = immediate reward
  • s' = next state
  • max Q(s', a') = best possible future value from the next state
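
A minimal sketch of this update for the tabular case (the function and argument names are illustrative, not from the source):

import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    # TD target: immediate reward plus discounted best future value
    target = r + gamma * np.max(Q[s_next])
    # Shift the old estimate toward the target by a fraction alpha
    Q[s, a] += alpha * (target - Q[s, a])
    return Q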

Q-Learning Hyperparameters

Learning Rate (α)

Range: 0 < α ≤ 1

Controls how much new information overrides old Q-values.

Effect:

  • High α (close to 1) → fast learning, unstable updates
  • Low α (close to 0) → slow but stable learning
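
As a quick worked example with illustrative numbers (old estimate 0, TD target 10):

$$\alpha = 0.9:\; Q_{\text{new}} = 0 + 0.9\,(10 - 0) = 9 \qquad \alpha = 0.1:\; Q_{\text{new}} = 0 + 0.1\,(10 - 0) = 1$$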

Discount Factor (γ)

Range: 0 ≤ γ ≤ 1

Controls importance of future rewards vs immediate rewards.

Effect:

  • γ = 0 → only immediate reward matters (short-sighted)
  • γ close to 1 → long-term rewards matter heavily (far-sighted)
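
As a worked example (assuming a constant reward of 1 per step): with γ = 0.9 the discounted return is

$$\sum_{t=0}^{\infty} 0.9^{t} \cdot 1 = \frac{1}{1 - 0.9} = 10,$$

while with γ = 0 it is just the immediate reward, 1.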

Exploration Rate (ε)

Range: 0 ≤ ε ≤ 1

Controls probability of choosing a random action instead of the best-known action.

Effect:

  • High ε (close to 1) → more exploration
  • Low ε (close to 0) → more exploitation

Typically, ε starts high and decays over time as the agent learns, as in the sketch below.
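
A minimal sketch of one common scheme, multiplicative decay per episode (the specific constants are assumed for illustration):

epsilon = 1.0        # start fully exploratory
epsilon_min = 0.05   # floor so the agent never stops exploring entirely
decay = 0.995        # per-episode decay factor

for episode in range(10_000):
    # ... run one training episode using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay)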


Parameter Overview

| Parameter | Name | What it Controls | Effect |
| --- | --- | --- | --- |
| α | Learning Rate | How fast Q-values update | Speed vs stability |
| γ | Discount Factor | Importance of future rewards | Short-term vs long-term planning |
| ε | Exploration Rate | Random vs greedy actions | Exploration vs exploitation |

Tuning Recommendations (Millington)

  • Learning Rate: 0.3 (Can vary from 0.1 to 0.7)
  • Discount Rate: 0.75
  • Randomness for exploration
    • Online: 0.1
    • Offline: 0.2
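
Collected as a minimal sketch (the dictionary name and keys are illustrative; the values are Millington's):

millington_defaults = {
    "alpha": 0.3,            # learning rate (can vary from 0.1 to 0.7)
    "gamma": 0.75,           # discount rate
    "epsilon_online": 0.1,   # exploration randomness, online
    "epsilon_offline": 0.2,  # exploration randomness, offline
}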

Algorithm

Q-learning iteratively updates the current Q-value by blending it with a target value built from the immediate reward plus the estimated best future value of the next state; the full loop is sketched in code after the steps.

  1. Initialize Q(s,a) arbitrarily
  2. Observe current state s
  3. Choose action a (ε-greedy)
  4. Execute action, observe r and s’
  5. Update Q(s,a)
  6. Set s ← s’
  7. Repeat
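
A minimal end-to-end sketch of these steps, assuming a Gymnasium-style environment with discrete states and actions (the environment API and hyperparameter values are assumptions, not from the source):

import numpy as np

def train(env, n_episodes=5000, alpha=0.3, gamma=0.75, epsilon=0.1):
    # Step 1: initialize Q(s, a) arbitrarily (zeros here)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, _ = env.reset()  # Step 2: observe current state
        done = False
        while not done:
            # Step 3: choose action a (epsilon-greedy)
            if np.random.rand() < epsilon:
                a = np.random.randint(env.action_space.n)
            else:
                a = np.argmax(Q[s])
            # Step 4: execute action, observe r and s'
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Step 5: update Q(s, a) toward the TD target
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next  # Step 6: set s <- s'; Step 7: repeat
    return Q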

Performance

  • Time complexity: O(i), where i is the number of learning iterations
  • Space complexity: O(n × m), where n = number of states and m = number of actions

Footnotes

  1. https://cs50.harvard.edu/ai/notes/4/#q-learning