Markov Decision Process (MDP)

What is a Markov Decision Process?

A Markov Decision Process (MDP) provides a mathematical framework for modeling sequential decision-making problems where outcomes are partly random and partly controlled by the agent.

Think of it like chess, but with dice: In regular chess, the outcome of each move is deterministic. In an MDP, when you move your piece, there’s some randomness in where it actually ends up - but you still have control over which piece to move.

Formal Definition

An MDP is defined by a 5-tuple: (\mathcal{S}, \mathcal{A}, P, R, \gamma)

\mathcal{S}: Set of all possible states
\mathcal{A}: Set of all possible actions
P: Transition probability function P(s'|s,a) = \Pr(S_{t+1} = s' | S_t = s, A_t = a)
R: Reward function R(s,a,s') or R(s,a)
\gamma: Discount factor \gamma \in [0,1]

The Markov Property

Key assumption: The future depends only on the present state, not the entire history.

\Pr(S_{t+1} = s' | S_t = s, A_t = a, S_{t-1}, A_{t-1}, ..., S_0, A_0) = \Pr(S_{t+1} = s' | S_t = s, A_t = a)

Intuition: To predict where you’ll be tomorrow, you only need to know where you are today and what you do today - not your entire life history.

This makes problems tractable! Without this assumption, we’d need infinite memory.

MDP Components in Detail

States (\mathcal{S})

The state captures all relevant information needed to make optimal decisions.

Examples: - Grid World: (x, y) coordinates - Blackjack: (player sum, dealer showing card, usable ace?) - Robot Navigation: (position, velocity, battery level, sensor readings) - Stock Trading: (prices, portfolio, market indicators)

Complete vs. Incomplete Information: - Complete: Agent observes true state s_t - Incomplete: Agent observes o_t (observation), true state s_t may be hidden

Actions (\mathcal{A})

Available choices the agent can make. May depend on state: \mathcal{A}(s).

Examples: - Grid World: {North, South, East, West} - Blackjack: {Hit, Stand, Double Down, Split} - Stock Trading: {Buy, Sell, Hold}

Transition Probabilities (P)

P(s'|s,a) defines the probability of reaching state s' after taking action a in state s.

Properties: - P(s'|s,a) \geq 0 for all s, a, s' - \sum_{s'} P(s'|s,a) = 1 for all s, a (probability distribution)

Examples: - Deterministic: Robot moves exactly where commanded - Stochastic: Robot moves in intended direction 80% of time, slips left/right 10% each

Rewards (R)

Immediate feedback signal that guides learning.

Forms: - R(s,a,s'): Reward depends on state, action, and next state - R(s,a): Reward depends only on state and action - R(s): Reward depends only on state

Design principles: - Positive rewards for good outcomes - Negative rewards (penalties) for bad outcomes - Small negative rewards for each step (encourages efficiency)

The Agent-Environment Loop

t=0: Agent observes initial state s₀
     ↓
t=1: Agent chooses action a₀
     Environment returns reward r₁ and new state s₁
     ↓
t=2: Agent chooses action a₁
     Environment returns reward r₂ and new state s₂
     ↓
     ... continues forever (or until terminal state)

Trajectory: s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, ...

Example: Grid World MDP

Setup: 4×4 grid, agent starts at bottom-left, goal at top-right

States: \mathcal{S} = \{(i,j) : i,j \in \{1,2,3,4\}\}

Actions: \mathcal{A} = \{\text{North, South, East, West}\}

Transitions: - 80% chance of moving in intended direction - 10% chance of moving perpendicular to intended direction - Can’t move outside grid (stay in place)

Rewards: - R = +10 for reaching goal state (4,4) - R = -1 for each step (encourages efficiency) - R = -10 for hitting obstacles

Types of MDPs

Episodic vs. Continuing Tasks

Episodic: Natural stopping points (episodes) - Episodes have terminal states - Agent resets after each episode - Examples: Games, maze navigation

Continuing: No natural stopping points - Runs indefinitely - Need discounting to prevent infinite returns - Examples: Stock trading, server management

Finite vs. Infinite MDPs

Finite: Finite state and action spaces Infinite: Continuous states and/or actions (requires function approximation)

Value Functions in MDPs

Return (Cumulative Reward)

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

State Value Function

V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]

Action-Value Function

Q^\pi(s,a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]

Bellman Equations

Bellman Equation for V^\pi

V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]

Bellman Equation for Q^\pi

Q^\pi(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')]

Intuition: The value of a state equals the immediate reward plus the discounted value of future states, weighted by their probabilities.

Optimal Policies and Value Functions

Optimal state-value function: V^*(s) = \max_\pi V^\pi(s)

Optimal action-value function: Q^*(s,a) = \max_\pi Q^\pi(s,a)

Bellman Optimality Equations: V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]

Q^*(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')]

Key Properties

Existence: Every finite MDP has at least one optimal policy
Deterministic: There exists an optimal policy that is deterministic
Uniqueness: V^* and Q^* are unique
Policy Extraction: \pi^*(s) = \arg\max_a Q^*(s,a)

Common Challenges

Curse of Dimensionality: State space grows exponentially with problem complexity
Exploration vs. Exploitation: Need to balance trying new actions vs. using known good ones
Partial Observability: Often can’t observe true state
Continuous Spaces: Real-world problems often have continuous state/action spaces
Non-stationarity: Environment may change over time

Real-World Applications

Autonomous Driving: States = traffic situations, Actions = driving maneuvers
Game Playing: States = board positions, Actions = legal moves
Resource Allocation: States = resource levels, Actions = allocation decisions
Medical Treatment: States = patient conditions, Actions = treatment options

The MDP framework provides the theoretical foundation for most reinforcement learning algorithms. Understanding MDPs is crucial for designing effective RL solutions!