Temporal Difference (TD) learning is one of the most important ideas in reinforcement learning. It combines the best aspects of Monte Carlo methods and Dynamic Programming:
Like Monte Carlo: Model-free, learns from experience
Like Dynamic Programming: Bootstraps from current estimates, doesn’t wait for episodes to end
Key insight: You can learn from incomplete episodes by using current estimates of future value.
The TD Learning Idea
Traditional approach (MC): Wait until episode ends, then update: V(S_t) \leftarrow V(S_t) + \alpha[G_t - V(S_t)]
TD approach: Update immediately after each step: V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]
This is the difference between the estimated value and the “better” estimate using the immediate reward plus the discounted next-state value.
TD(0) Algorithm
The simplest TD algorithm updates the value function after every step:
Initialize V(s) arbitrarily for all s ∈ S
For each episode:
Initialize S
For each step of episode:
A = action given by policy for S
Take action A, observe R, S'
V(S) = V(S) + α[R + γV(S') - V(S)]
S = S'
Until S is terminal
Key differences from MC: 1. Online learning: Updates happen during the episode 2. Bootstrapping: Uses current estimate V(S’) rather than actual return 3. Lower variance: Single-step updates reduce variance 4. Biased initially: Estimates are biased until V converges
Why TD Learning Works
The Driving Example
Scenario: You’re driving to work and estimating arrival time.
States: Home → Highway → Office Value function: Estimated time to reach office
Monte Carlo approach: - Only update your estimate after you arrive at work - If you get stuck in traffic, you learn nothing until the end
TD approach: - Update your estimate when you reach the highway - If highway is clear, immediately revise your time estimate - Learn from partial experience
Initialize Q(s,a) arbitrarily for all s ∈ S, a ∈ A
For each episode:
Initialize S
Choose A from S using policy derived from Q (e.g., ε-greedy)
For each step of episode:
Take action A, observe R, S'
Choose A' from S' using policy derived from Q
Q(S,A) = Q(S,A) + α[R + γQ(S',A') - Q(S,A)]
S = S'; A = A'
Until S is terminal
On-policy: The policy being evaluated and improved is the same as the policy being used to generate behavior.
SARSA Characteristics
Conservative: Learns the value of the policy it’s actually following
Safe: Takes exploration into account when learning
Converges: Guaranteed to converge to optimal policy under certain conditions
Q-Learning: Off-Policy TD Control
Q-learning learns the optimal action-value function regardless of the policy being followed.
Initialize Q(s,a) arbitrarily for all s ∈ S, a ∈ A
For each episode:
Initialize S
For each step of episode:
Choose A from S using policy derived from Q (e.g., ε-greedy)
Take action A, observe R, S'
Q(S,A) = Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]
S = S'
Until S is terminal
Off-policy: The policy being learned (greedy w.r.t. Q) is different from the policy generating behavior (ε-greedy).
Q-Learning Characteristics
Aggressive: Learns optimal policy even while exploring
Robust: Can learn from suboptimal behavior
Converges: Guaranteed to converge to Q* under certain conditions
SARSA vs. Q-Learning: The Cliff Walking Problem
Setup: Agent must navigate from start to goal, avoiding a cliff.
Reward structure: - Normal steps: -1 - Cliff: -100 (return to start) - Goal: 0
SARSA behavior: - Learns a “safe” path away from cliff - Takes exploration into account - Finds suboptimal but safe policy
Q-learning behavior: - Learns optimal path close to cliff - Ignores exploration in learning - Finds optimal policy
Cliff Walking Comparison
Expected SARSA
Idea: Instead of sampling the next action, use the expected value over all possible actions.
For ε-greedy policy:\mathbb{E}_\pi[Q(S_{t+1}, A_{t+1})] = \frac{\epsilon}{|\mathcal{A}|} \sum_a Q(S_{t+1}, a) + (1-\epsilon) \max_a Q(S_{t+1}, a)
Advantages: - Reduces variance compared to SARSA - More stable learning - Can be used in both on-policy and off-policy settings
Double Q-Learning
Problem: Q-learning can overestimate action values due to the max operation.
Solution: Use two Q-functions and update them alternately.
Algorithm:
Initialize Q₁(s,a) and Q₂(s,a) arbitrarily for all s ∈ S, a ∈ A
For each episode:
Initialize S
For each step of episode:
Choose A using policy derived from Q₁ + Q₂
Take action A, observe R, S'
With probability 0.5:
Q₁(S,A) = Q₁(S,A) + α[R + γQ₂(S', argmax_a Q₁(S',a)) - Q₁(S,A)]
Else:
Q₂(S,A) = Q₂(S,A) + α[R + γQ₁(S', argmax_a Q₂(S',a)) - Q₂(S,A)]
S = S'
Until S is terminal
Final Performance Comparison:
----------------------------------------
Q-Learning - Last 50 episodes:
Average reward: 1.76
Average length: 9.2
SARSA - Last 50 episodes:
Average reward: 2.34
Average length: 8.7
Key Insights:
• Q-Learning learns the optimal policy faster
• SARSA is more conservative, considering exploration during learning
• Both converge to good policies, but through different paths
Applications
Game playing: Learning to play Atari games, board games
Healthcare: Treatment recommendation, drug discovery
Advantages of TD Learning
Online learning: Can learn during episodes
Model-free: No need for environment model
Efficient: Lower variance than Monte Carlo
Flexible: Works with continuing tasks
Bootstrapping: Can learn from partial experience
Disadvantages of TD Learning
Initially biased: Estimates are biased until convergence
Hyperparameter sensitive: Requires tuning of step size
Exploration required: Need adequate exploration for convergence
Function approximation challenges: Convergence issues with approximation
Key Takeaways
TD learning is fundamental: Combines best of MC and DP
Bootstrapping is powerful: Can learn from incomplete episodes
On-policy vs. off-policy: SARSA vs. Q-learning serve different purposes
Exploration matters: Both algorithms need adequate exploration
Foundation for modern RL: TD methods underpin most current algorithms
Temporal difference learning represents a breakthrough in reinforcement learning, enabling agents to learn continuously from experience without requiring models or complete episodes. It forms the foundation for many modern RL algorithms!