Intro
A Markov decision process (MDP) is a mathematical framework for sequential decision-making and forms the foundation of Reinforcement Learning. MDPs are closely related to Markov Chains, but extend them by adding actions and rewards.
The Multi-Armed Bandit Problem is a “single state” MDP.
Definition
A Markov Decision Process has the following properties:
- S: finite set of states
- A: finite set of actions
- P(s' | s, a): Transition model
- R(s, a, s'): Reward function
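The four components above can be written out concretely. Below is a minimal sketch using a hypothetical 2-state, 2-action MDP (the state names "cool"/"hot", action names "slow"/"fast", and all reward numbers are invented for illustration):

```python
states = ["cool", "hot"]
actions = ["slow", "fast"]

# Transition model: P[(s, a)] -> list of (s', probability) pairs
P = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"):  [("cool", 0.5), ("hot", 0.5)],
    ("hot", "fast"):  [("hot", 1.0)],
}

# Reward function: R[(s, a, s')] -> immediate reward
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "hot"):  2.0,
    ("hot", "slow", "cool"):  1.0,
    ("hot", "slow", "hot"):   1.0,
    ("hot", "fast", "hot"):  -10.0,
}

# Sanity check: each transition distribution must sum to 1
for (s, a), dist in P.items():
    assert abs(sum(p for _, p in dist) - 1.0) < 1e-9
```

Note that P maps each (state, action) pair to a probability distribution over next states, which is exactly what makes the process "Markov": the distribution depends only on the current state and action.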
Policy
A Policy, π, determines which action to take in each state.
Reward Function AKA Feedback
R(s, a, s’): Reward function
Transition Function AKA Forward Model
P(s' | s, a): Transition model
- How the current state and chosen action determine the probability distribution over next states
- Example: A robot trying to move to the right has a 90% chance of moving right and a 10% chance of staying in the same position.
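The robot example can be sketched as a stochastic step function. This is an assumed setup (positions as integers on a line, a single "right" action), just to make the 90/10 split concrete:

```python
import random

def step(pos, action, rng):
    # "right" succeeds with probability 0.9; otherwise the robot stays put
    if action == "right" and rng.random() < 0.9:
        return pos + 1
    return pos

rng = random.Random(0)
moves = sum(step(0, "right", rng) for _ in range(10_000))
success_rate = moves / 10_000  # empirical estimate of P(move right), ~0.9
```

Sampling the step function many times recovers the transition probabilities, which is the key idea behind the model-free methods discussed later.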
Solving MDPs
A Markov Decision Process is solved by finding the optimal policy, which tells you the best action in every state.
All methods solve some form of the Bellman optimality equation:

V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]
Dynamic Programming
When the transition probabilities and reward functions are known, MDPs can be solved with the following methods:
Value Iteration
- Iteratively update state values using the Bellman optimality update
- Converges to the optimal values V*
- Then derive the policy from the converged values
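The steps above can be sketched in code. This reuses the hypothetical 2-state MDP from earlier (all names and numbers are invented for illustration):

```python
gamma = 0.9
states = ["cool", "hot"]
actions = ["slow", "fast"]
P = {  # P[(s, a)] -> list of (s', probability)
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"):  [("cool", 0.5), ("hot", 0.5)],
    ("hot", "fast"):  [("hot", 1.0)],
}
R = {  # R[(s, a, s')] -> reward
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0, ("cool", "fast", "hot"): 2.0,
    ("hot", "slow", "cool"): 1.0, ("hot", "slow", "hot"): 1.0,
    ("hot", "fast", "hot"): -10.0,
}

def q(s, a, V):
    # One-step lookahead: sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)])

# Iteratively apply the Bellman optimality update until values stop changing
V = {s: 0.0 for s in states}
while True:
    new_V = {s: max(q(s, a, V) for a in actions) for s in states}
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-10:
        break
    V = new_V

# Derive the policy: act greedily with respect to the converged values
policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
```

Because the Bellman update is a contraction (for γ < 1), the loop is guaranteed to converge regardless of how V is initialized.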
Policy Iteration
Alternate between:
Policy Evaluation: compute Vπ, the value of every state under the current policy π
Policy Improvement: update π to act greedily with respect to Vπ
Repeat until policy stops changing.
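The evaluate/improve loop can be sketched on the same hypothetical 2-state MDP used above (all names and numbers are invented for illustration):

```python
gamma = 0.9
states = ["cool", "hot"]
actions = ["slow", "fast"]
P = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"):  [("cool", 0.5), ("hot", 0.5)],
    ("hot", "fast"):  [("hot", 1.0)],
}
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0, ("cool", "fast", "hot"): 2.0,
    ("hot", "slow", "cool"): 1.0, ("hot", "slow", "hot"): 1.0,
    ("hot", "fast", "hot"): -10.0,
}

def q(s, a, V):
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)])

policy = {s: "slow" for s in states}  # arbitrary starting policy
while True:
    # Policy Evaluation: compute V^pi by repeated sweeps under the fixed policy
    V = {s: 0.0 for s in states}
    for _ in range(500):
        V = {s: q(s, policy[s], V) for s in states}
    # Policy Improvement: act greedily with respect to V^pi
    new_policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
    if new_policy == policy:  # policy stopped changing -> optimal
        break
    policy = new_policy
```

With finitely many states and actions there are finitely many policies, and each improvement step is strict until the optimum is reached, so the outer loop terminates.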
Reinforcement Learning
When the model is not known, Reinforcement Learning techniques are used.
Q-learning
- In Q-Learning, the agent learns action values Q(s, a) directly from experience
- No need for P(s' | s, a) or R(s, a, s')
- Eventually approximates the optimal policy
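A Q-learning sketch on the same hypothetical 2-state MDP used in the earlier examples: here P and R exist only inside an environment simulator that the agent samples from, which it never reads directly.

```python
import random

gamma, eps = 0.9, 0.1
states = ["cool", "hot"]
actions = ["slow", "fast"]
P = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"):  [("cool", 0.5), ("hot", 0.5)],
    ("hot", "fast"):  [("hot", 1.0)],
}
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0, ("cool", "fast", "hot"): 2.0,
    ("hot", "slow", "cool"): 1.0, ("hot", "slow", "hot"): 1.0,
    ("hot", "fast", "hot"): -10.0,
}

def env_step(s, a, rng):
    # Environment simulator: draw s' ~ P(. | s, a) and return (s', reward)
    u, cum = rng.random(), 0.0
    for s2, p in P[(s, a)]:
        cum += p
        if u < cum:
            break
    return s2, R[(s, a, s2)]

rng = random.Random(0)
Q = {(s, a): 0.0 for s in states for a in actions}
N = {(s, a): 0 for s in states for a in actions}  # visit counts
s = "cool"
for _ in range(50_000):
    # epsilon-greedy: explore with probability eps, otherwise act greedily
    if rng.random() < eps:
        a = rng.choice(actions)
    else:
        a = max(actions, key=lambda x: Q[(s, x)])
    s2, r = env_step(s, a, rng)
    # TD update toward r + gamma * max_a' Q(s', a'), with a decaying step size
    N[(s, a)] += 1
    Q[(s, a)] += (1.0 / N[(s, a)]) * (
        r + gamma * max(Q[(s2, x)] for x in actions) - Q[(s, a)]
    )
    s = s2

# Greedy policy read off the learned action values
greedy = {st: max(actions, key=lambda a: Q[(st, a)]) for st in states}
```

After training, acting greedily with respect to Q recovers the same policy that value iteration and policy iteration compute with full knowledge of the model.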