Intro

A Markov decision process (MDP) is a mathematical framework for sequential decision-making and forms the foundation of Reinforcement Learning. MDPs are closely related to Markov Chains, but extend them by adding actions and rewards.

Description

The Multi-Armed Bandit Problem is a “single state” MDP.

Definition

A Markov Decision Process is defined by the following components:

  • S: finite set of states
  • A: finite set of actions
  • P(s' | s, a): transition model
  • R(s, a, s'): reward function

Policy

A Policy, π, determines which actions to take in any given state.
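A deterministic policy can be sketched as a plain state-to-action lookup table (state and action names below are purely illustrative):

```python
# A deterministic policy as a simple state -> action mapping
# (state and action names here are hypothetical, for illustration only).
policy = {"s0": "right", "s1": "right", "s2": "stay"}

def act(state):
    """Look up the action the policy prescribes for this state."""
    return policy[state]

print(act("s0"))   # right
```

A stochastic policy would instead map each state to a probability distribution over actions.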

Reward Function AKA Feedback

  • R(s, a, s'): reward function
  • The immediate feedback received after taking action a in state s and landing in state s'

Transition Function AKA Forward Model

  • P(s' | s, a): transition model
  • Gives the probability distribution over next states, determined by the current state and the chosen action
  • Example: a robot trying to move to the right has a 90% chance of moving right and a 10% chance of staying in the same position.
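The robot example can be sketched as a transition table plus a sampling function (the grid positions and probabilities below just mirror the example above):

```python
import random

# A sketch of the robot example: P(s' | s, a) stored as a table of
# (next_state, probability) pairs. Positions are illustrative integers.
P = {
    (0, "right"): [(1, 0.9), (0, 0.1)],   # move succeeds 90% of the time
}

def sample_next_state(s, a, rng):
    """Draw a successor state from the distribution P(s' | s, a)."""
    next_states = [s2 for s2, _ in P[(s, a)]]
    probs = [p for _, p in P[(s, a)]]
    return rng.choices(next_states, weights=probs, k=1)[0]

rng = random.Random(0)
samples = [sample_next_state(0, "right", rng) for _ in range(10_000)]
print(samples.count(1) / len(samples))   # roughly 0.9
```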

Solving MDPs

A Markov Decision Process is solved by finding the optimal policy, which tells you the best action to take in every state.

All of these methods solve some form of the Bellman optimality equation:

  V*(s) = max_a Σ_s' P(s' | s, a) [ R(s, a, s') + γ V*(s') ]

Dynamic Programming

When the transition probabilities and reward functions are known, MDPs can be solved with the following methods:

Value Iteration

  • Iteratively update state values
  • Converges to optimal values
  • Then derive the policy by acting greedily with respect to the converged values
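The steps above can be sketched on a tiny hypothetical 2-state MDP (state/action names and the transition table are made up for illustration):

```python
# A minimal value-iteration sketch on a hypothetical 2-state MDP.
GAMMA = 0.9
STATES = ["s0", "s1"]          # s1 is terminal
ACTIONS = ["stay", "go"]

# P[(s, a)] -> list of (next_state, probability, reward)
P = {
    ("s0", "stay"): [("s0", 1.0, 0.0)],
    ("s0", "go"):   [("s1", 0.9, 1.0), ("s0", 0.1, 0.0)],
}

def q_value(s, a, V):
    """One-step lookahead: expected reward plus discounted next value."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in P[(s, a)])

def value_iteration(theta=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s == "s1":               # terminal state keeps value 0
                continue
            best = max(q_value(s, a, V) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                 # iterative value update
        if delta < theta:               # stop once values have converged
            return V

def greedy_policy(V):
    """Derive the policy by acting greedily on the converged values."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, V))
            for s in STATES if s != "s1"}

V = value_iteration()
print(greedy_policy(V))   # the optimal action in s0 is "go"
```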

Policy Iteration

Alternate between:

Policy Evaluation: compute the value function V^π of the current policy π

Policy Improvement: update π to act greedily with respect to V^π

Repeat until policy stops changing.
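The evaluate/improve loop can be sketched on the same kind of hypothetical 2-state MDP (all names and probabilities below are illustrative):

```python
# A minimal policy-iteration sketch on a hypothetical 2-state MDP.
GAMMA = 0.9
STATES = ["s0", "s1"]                  # s1 is terminal
ACTIONS = ["stay", "go"]
P = {                                   # P[(s, a)] -> [(s', prob, reward)]
    ("s0", "stay"): [("s0", 1.0, 0.0)],
    ("s0", "go"):   [("s1", 0.9, 1.0), ("s0", 0.1, 0.0)],
}

def q_value(s, a, V):
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in P[(s, a)])

def policy_evaluation(pi, theta=1e-8):
    """Compute V^pi for a fixed policy by iterative sweeps."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s == "s1":
                continue
            v = q_value(s, pi[s], V)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration():
    pi = {"s0": "stay"}                # arbitrary initial policy
    while True:
        V = policy_evaluation(pi)      # evaluation step
        stable = True
        for s in pi:                   # improvement step: act greedily
            best = max(ACTIONS, key=lambda a: q_value(s, a, V))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:                     # repeat until policy stops changing
            return pi, V

pi, V = policy_iteration()
print(pi)   # the improved policy chooses "go" in s0
```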

Reinforcement Learning

When the model is not known, Reinforcement Learning techniques are used instead.

Q-learning

  • In Q-Learning, the agent learns directly from experience
  • No need for the transition model P(s' | s, a) or the reward function R(s, a, s')
  • Eventually approximates the optimal policy
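A tabular Q-learning sketch: the agent only interacts with a simulated environment through a `step` function and never reads P or R directly. The 2-state environment, hyperparameters, and episode counts below are all made up for illustration:

```python
import random

# A minimal tabular Q-learning sketch on a hypothetical 2-state task
# ("s1" is terminal; all names and hyperparameters are illustrative).
rng = random.Random(0)
ACTIONS = ["stay", "go"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def step(s, a):
    """Simulated environment: the agent never sees P or R directly."""
    if a == "go" and rng.random() < 0.9:
        return "s1", 1.0            # reached the terminal state, reward 1
    return "s0", 0.0                # stayed put, no reward

Q = {("s0", a): 0.0 for a in ACTIONS}

for _ in range(2000):               # episodes
    s = "s0"
    for _ in range(50):             # step cap per episode
        # epsilon-greedy action selection
        if rng.random() < EPSILON:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        # Q-learning update: bootstrap from the best next action
        target = r if s2 == "s1" else r + GAMMA * max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        if s2 == "s1":
            break
        s = s2

print(max(ACTIONS, key=lambda a: Q[("s0", a)]))   # learned greedy action
```

After enough episodes the greedy action in s0 matches what the model-based methods above would compute, without ever using the transition probabilities.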