Intro
A Markov decision process (MDP) is a mathematical framework for sequential decision-making and forms the foundation of Reinforcement Learning. MDPs are closely related to Markov Chains, but extend them by adding actions and rewards.
The Multi-Armed Bandit Problem is a “single state” MDP.
Definition
A Markov Decision Process has the following properties:
- S: finite set of states
- A: finite set of actions
- P(s' | s, a): Transition model
- R(s, a, s'): Reward function
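The four components above can be written out concretely. Below is a minimal sketch using a hypothetical 2-state, 2-action MDP (the state names "cool"/"hot", action names "slow"/"fast", and all reward numbers are invented for illustration):

```python
states = ["cool", "hot"]
actions = ["slow", "fast"]

# Transition model: P[(s, a)] -> list of (s', probability) pairs
P = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"):  [("cool", 0.5), ("hot", 0.5)],
    ("hot", "fast"):  [("hot", 1.0)],
}

# Reward function: R[(s, a, s')] -> immediate reward
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "hot"):  2.0,
    ("hot", "slow", "cool"):  1.0,
    ("hot", "slow", "hot"):   1.0,
    ("hot", "fast", "hot"):  -10.0,
}

# Sanity check: each transition distribution must sum to 1
for (s, a), dist in P.items():
    assert abs(sum(p for _, p in dist) - 1.0) < 1e-9
```

Note that P maps each (state, action) pair to a probability distribution over next states, which is exactly what makes the process "Markov": the distribution depends only on the current state and action.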
Policy
A Policy, π, determines which action to take in each state.
Reward Function AKA Feedback
R(s, a, s’): Reward function
Transition Function AKA Forward Model
P(s' | s, a): Transition model
- How the current state and chosen action determine the probability distribution over next states
- Example: A robot trying to move to the right has a 90% chance of moving right and a 10% chance of staying in the same position.
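The robot example can be sketched as a stochastic step function. This is an assumed setup (positions as integers on a line, a single "right" action), just to make the 90/10 split concrete:

```python
import random

def step(pos, action, rng):
    # "right" succeeds with probability 0.9; otherwise the robot stays put
    if action == "right" and rng.random() < 0.9:
        return pos + 1
    return pos

rng = random.Random(0)
moves = sum(step(0, "right", rng) for _ in range(10_000))
success_rate = moves / 10_000  # empirical estimate of P(move right), ~0.9
```

Sampling the step function many times recovers the transition probabilities, which is the key idea behind the model-free methods discussed later.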
Solving MDPs
A Markov Decision Process is solved by finding the optimal policy, which tells you the best action in every state.
All methods solve some form of the Bellman optimality equation:

V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]
Dynamic Programming
When the transition probabilities and reward functions are known, MDPs can be solved with the following methods:
Value Iteration
- Iteratively update state values using the Bellman optimality update
- Converges to the optimal values V*
- Then derive the policy from the converged values
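The steps above can be sketched in code. This reuses the hypothetical 2-state MDP from earlier (all names and numbers are invented for illustration):

```python
gamma = 0.9
states = ["cool", "hot"]
actions = ["slow", "fast"]
P = {  # P[(s, a)] -> list of (s', probability)
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"):  [("cool", 0.5), ("hot", 0.5)],
    ("hot", "fast"):  [("hot", 1.0)],
}
R = {  # R[(s, a, s')] -> reward
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0, ("cool", "fast", "hot"): 2.0,
    ("hot", "slow", "cool"): 1.0, ("hot", "slow", "hot"): 1.0,
    ("hot", "fast", "hot"): -10.0,
}

def q(s, a, V):
    # One-step lookahead: sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)])

# Iteratively apply the Bellman optimality update until values stop changing
V = {s: 0.0 for s in states}
while True:
    new_V = {s: max(q(s, a, V) for a in actions) for s in states}
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-10:
        break
    V = new_V

# Derive the policy: act greedily with respect to the converged values
policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
```

Because the Bellman update is a contraction (for γ < 1), the loop is guaranteed to converge regardless of how V is initialized.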
Policy Iteration
Alternate between:
Policy Evaluation: compute Vπ, the value of every state under the current policy π
Policy Improvement: update π to act greedily with respect to Vπ
Repeat until policy stops changing.
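The evaluate/improve loop can be sketched on the same hypothetical 2-state MDP used above (all names and numbers are invented for illustration):

```python
gamma = 0.9
states = ["cool", "hot"]
actions = ["slow", "fast"]
P = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"):  [("cool", 0.5), ("hot", 0.5)],
    ("hot", "fast"):  [("hot", 1.0)],
}
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0, ("cool", "fast", "hot"): 2.0,
    ("hot", "slow", "cool"): 1.0, ("hot", "slow", "hot"): 1.0,
    ("hot", "fast", "hot"): -10.0,
}

def q(s, a, V):
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)])

policy = {s: "slow" for s in states}  # arbitrary starting policy
while True:
    # Policy Evaluation: compute V^pi by repeated sweeps under the fixed policy
    V = {s: 0.0 for s in states}
    for _ in range(500):
        V = {s: q(s, policy[s], V) for s in states}
    # Policy Improvement: act greedily with respect to V^pi
    new_policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
    if new_policy == policy:  # policy stopped changing -> optimal
        break
    policy = new_policy
```

With finitely many states and actions there are finitely many policies, and each improvement step is strict until the optimum is reached, so the outer loop terminates.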
Reinforcement Learning
When the model is not known, Reinforcement Learning techniques are used.
Q-learning
- In Q-Learning, the agent learns action values Q(s, a) directly from experience
- No need for P(s' | s, a) or R(s, a, s')
- Eventually approximates the optimal policy
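A Q-learning sketch on the same hypothetical 2-state MDP used in the earlier examples: here P and R exist only inside an environment simulator that the agent samples from, which it never reads directly.

```python
import random

gamma, eps = 0.9, 0.1
states = ["cool", "hot"]
actions = ["slow", "fast"]
P = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"):  [("cool", 0.5), ("hot", 0.5)],
    ("hot", "fast"):  [("hot", 1.0)],
}
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0, ("cool", "fast", "hot"): 2.0,
    ("hot", "slow", "cool"): 1.0, ("hot", "slow", "hot"): 1.0,
    ("hot", "fast", "hot"): -10.0,
}

def env_step(s, a, rng):
    # Environment simulator: draw s' ~ P(. | s, a) and return (s', reward)
    u, cum = rng.random(), 0.0
    for s2, p in P[(s, a)]:
        cum += p
        if u < cum:
            break
    return s2, R[(s, a, s2)]

rng = random.Random(0)
Q = {(s, a): 0.0 for s in states for a in actions}
N = {(s, a): 0 for s in states for a in actions}  # visit counts
s = "cool"
for _ in range(50_000):
    # epsilon-greedy: explore with probability eps, otherwise act greedily
    if rng.random() < eps:
        a = rng.choice(actions)
    else:
        a = max(actions, key=lambda x: Q[(s, x)])
    s2, r = env_step(s, a, rng)
    # TD update toward r + gamma * max_a' Q(s', a'), with a decaying step size
    N[(s, a)] += 1
    Q[(s, a)] += (1.0 / N[(s, a)]) * (
        r + gamma * max(Q[(s2, x)] for x in actions) - Q[(s, a)]
    )
    s = s2

# Greedy policy read off the learned action values
greedy = {st: max(actions, key=lambda a: Q[(st, a)]) for st in states}
```

After training, acting greedily with respect to Q recovers the same policy that value iteration and policy iteration compute with full knowledge of the model.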