Markov Decision Process (MDP)
What is a Markov Decision Process?
A Markov Decision Process (MDP) provides a mathematical framework for modeling sequential decision-making problems where outcomes are partly random and partly controlled by the agent.
Think of it like chess, but with dice: In regular chess, the outcome of each move is deterministic. In an MDP, when you move your piece, there’s some randomness in where it actually ends up - but you still have control over which piece to move.
Formal Definition
An MDP is defined by a 5-tuple: (\mathcal{S}, \mathcal{A}, P, R, \gamma)
- \mathcal{S}: Set of all possible states
- \mathcal{A}: Set of all possible actions
- P: Transition probability function P(s'|s,a) = \Pr(S_{t+1} = s' | S_t = s, A_t = a)
- R: Reward function R(s,a,s') or R(s,a)
- \gamma: Discount factor \gamma \in [0,1]
The Markov Property
Key assumption: The future depends only on the present state, not the entire history.
\Pr(S_{t+1} = s' | S_t = s, A_t = a, S_{t-1}, A_{t-1}, ..., S_0, A_0) = \Pr(S_{t+1} = s' | S_t = s, A_t = a)
Intuition: To predict where you’ll be tomorrow, you only need to know where you are today and what you do today - not your entire life history.
This makes problems tractable! Without this assumption, we’d need infinite memory.
MDP Components in Detail
States (\mathcal{S})
The state captures all relevant information needed to make optimal decisions.
Examples: - Grid World: (x, y) coordinates - Blackjack: (player sum, dealer showing card, usable ace?) - Robot Navigation: (position, velocity, battery level, sensor readings) - Stock Trading: (prices, portfolio, market indicators)
Complete vs. Incomplete Information: - Complete: Agent observes true state s_t - Incomplete: Agent observes o_t (observation), true state s_t may be hidden
Actions (\mathcal{A})
Available choices the agent can make. May depend on state: \mathcal{A}(s).
Examples: - Grid World: {North, South, East, West} - Blackjack: {Hit, Stand, Double Down, Split} - Stock Trading: {Buy, Sell, Hold}
Transition Probabilities (P)
P(s'|s,a) defines the probability of reaching state s' after taking action a in state s.
Properties: - P(s'|s,a) \geq 0 for all s, a, s' - \sum_{s'} P(s'|s,a) = 1 for all s, a (probability distribution)
Examples: - Deterministic: Robot moves exactly where commanded - Stochastic: Robot moves in intended direction 80% of time, slips left/right 10% each
Rewards (R)
Immediate feedback signal that guides learning.
Forms: - R(s,a,s'): Reward depends on state, action, and next state - R(s,a): Reward depends only on state and action - R(s): Reward depends only on state
Design principles: - Positive rewards for good outcomes - Negative rewards (penalties) for bad outcomes - Small negative rewards for each step (encourages efficiency)
The Agent-Environment Loop
t=0: Agent observes initial state s₀
↓
t=1: Agent chooses action a₀
Environment returns reward r₁ and new state s₁
↓
t=2: Agent chooses action a₁
Environment returns reward r₂ and new state s₂
↓
... continues forever (or until terminal state)
Trajectory: s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, ...
Example: Grid World MDP
Setup: 4×4 grid, agent starts at bottom-left, goal at top-right
States: \mathcal{S} = \{(i,j) : i,j \in \{1,2,3,4\}\}
Actions: \mathcal{A} = \{\text{North, South, East, West}\}
Transitions: - 80% chance of moving in intended direction - 10% chance of moving perpendicular to intended direction - Can’t move outside grid (stay in place)
Rewards: - R = +10 for reaching goal state (4,4) - R = -1 for each step (encourages efficiency) - R = -10 for hitting obstacles
Types of MDPs
Episodic vs. Continuing Tasks
Episodic: Natural stopping points (episodes) - Episodes have terminal states - Agent resets after each episode - Examples: Games, maze navigation
Continuing: No natural stopping points - Runs indefinitely - Need discounting to prevent infinite returns - Examples: Stock trading, server management
Finite vs. Infinite MDPs
Finite: Finite state and action spaces Infinite: Continuous states and/or actions (requires function approximation)
Value Functions in MDPs
Return (Cumulative Reward)
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
State Value Function
V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]
Action-Value Function
Q^\pi(s,a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]
Bellman Equations
Bellman Equation for V^\pi
V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]
Bellman Equation for Q^\pi
Q^\pi(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')]
Intuition: The value of a state equals the immediate reward plus the discounted value of future states, weighted by their probabilities.
Optimal Policies and Value Functions
Optimal state-value function: V^*(s) = \max_\pi V^\pi(s)
Optimal action-value function: Q^*(s,a) = \max_\pi Q^\pi(s,a)
Bellman Optimality Equations: V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]
Q^*(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')]
Key Properties
- Existence: Every finite MDP has at least one optimal policy
- Deterministic: There exists an optimal policy that is deterministic
- Uniqueness: V^* and Q^* are unique
- Policy Extraction: \pi^*(s) = \arg\max_a Q^*(s,a)
Common Challenges
- Curse of Dimensionality: State space grows exponentially with problem complexity
- Exploration vs. Exploitation: Need to balance trying new actions vs. using known good ones
- Partial Observability: Often can’t observe true state
- Continuous Spaces: Real-world problems often have continuous state/action spaces
- Non-stationarity: Environment may change over time
Real-World Applications
- Autonomous Driving: States = traffic situations, Actions = driving maneuvers
- Game Playing: States = board positions, Actions = legal moves
- Resource Allocation: States = resource levels, Actions = allocation decisions
- Medical Treatment: States = patient conditions, Actions = treatment options
The MDP framework provides the theoretical foundation for most reinforcement learning algorithms. Understanding MDPs is crucial for designing effective RL solutions!