So far, we’ve focused on value-based methods that learn value functions and derive policies from them. Policy gradient methods take a different approach: they directly optimize the policy without explicitly computing value functions.
Key insight: Instead of learning “how good is this state?” (value), learn “what should I do in this state?” (policy).
Why Policy Gradient Methods?
Advantages over Value-Based Methods
Natural handling of continuous action spaces
Value-based: Need to find \max_a Q(s,a) (difficult in continuous spaces)
Policy-based: Directly sample from \pi(a|s)
Stochastic policies
Can naturally represent stochastic optimal policies
Built-in exploration through policy stochasticity
Smoother convergence
Small parameter changes lead to small policy changes
Often more stable than value-based methods, where a small change in estimated action values can flip the greedy action
Works with function approximation
Can directly optimize parameterized policies
No separate value function approximation is required (although one is often added as a baseline or critic)
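To make the continuous-action point concrete, here is a minimal sketch of a Gaussian policy in plain Python. The linear mean, the fixed standard deviation `sigma`, and the function names are illustrative assumptions, not something the text prescribes. The point is that acting only requires drawing a sample, with no maximization over actions.

```python
import math
import random

def _mean(state, theta):
    # Assumed form: the policy mean is a linear function of the state features.
    return sum(t * s for t, s in zip(theta, state))

def gaussian_policy_sample(state, theta, sigma=0.5, rng=random):
    """Sample a continuous action a ~ N(mu(s), sigma^2)."""
    return rng.gauss(_mean(state, theta), sigma)

def gaussian_log_prob(action, state, theta, sigma=0.5):
    """Log-density of `action` under the Gaussian policy."""
    mu = _mean(state, theta)
    return (-0.5 * ((action - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2.0 * math.pi)))

def gaussian_score(action, state, theta, sigma=0.5):
    """Gradient of log pi(a|s, theta) with respect to theta."""
    mu = _mean(state, theta)
    return [(action - mu) / sigma ** 2 * s for s in state]
```

The score function returned by `gaussian_score` is the quantity a policy gradient update multiplies by the return, so the same code path serves both acting and learning.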
Disadvantages
High variance
Gradient estimates can be very noisy
Requires many samples
Slow convergence
Generally slower than value-based methods
Can get stuck in local optima
Sample efficiency
Typically less sample efficient than value-based methods: on-policy variants such as REINFORCE use each episode for a single update and then discard it
Policy Parameterization
We parameterize the policy as \pi(a|s, \boldsymbol{\theta}) where \boldsymbol{\theta} are the parameters.
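For discrete actions, one common choice of parameterization is a softmax over linear action preferences. The sketch below assumes that form; the layout of `theta` as one weight vector per action and the function names are hypothetical, chosen only for illustration.

```python
import math

def softmax_policy(state, theta):
    """Return pi(.|s, theta) for a linear-softmax policy.

    theta[a] is the weight vector of action a; its preference is theta[a] . state.
    """
    prefs = [sum(w * x for w, x in zip(theta_a, state)) for theta_a in theta]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_softmax(state, theta, action):
    """Gradient of log pi(action|s, theta), with the same shape as theta.

    For linear preferences: d/d theta[b] = (1[b == action] - pi(b|s)) * state.
    """
    probs = softmax_policy(state, theta)
    return [[((1.0 if b == action else 0.0) - probs[b]) * x for x in state]
            for b in range(len(theta))]
```

With all weights at zero the policy is uniform, and the gradient of the log-probability raises the preference of the chosen action while lowering the others.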
Intuition:
If return G_t is high, increase the probability of action A_t in state S_t
If return G_t is low, decrease the probability of action A_t in state S_t
REINFORCE Algorithm
Monte Carlo Policy Gradient
Initialize policy parameters θ
For each episode:
    Generate episode: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T
    For t = 0, 1, ..., T-1:
        G = R_{t+1} + γ R_{t+2} + ... + γ^{T-t-1} R_T
        θ = θ + α * G * ∇_θ log π(A_t|S_t, θ)
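The pseudocode above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: states are discrete indices, the policy is a tabular softmax with preferences `theta[a][s]`, and an episode is a list of `(state, action, reward)` tuples. None of these choices come from the text.

```python
import math

def discounted_returns(rewards, gamma):
    """Return G_t for every t, computed backwards in a single pass."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, episode, alpha=0.1, gamma=0.99):
    """Apply the REINFORCE update for each step of one finished episode.

    theta[a][s] is the preference of action a in discrete state s (assumed layout).
    episode is a list of (state, action, reward) tuples.
    """
    returns = discounted_returns([r for _, _, r in episode], gamma)
    for (s, a, _), g in zip(episode, returns):
        probs = softmax([theta[b][s] for b in range(len(theta))])
        for b in range(len(theta)):
            # d log pi(a|s) / d theta[b][s] for a tabular softmax
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[b][s] += alpha * g * grad
    return theta
```

Computing the returns backwards avoids re-summing the reward tail at every step, which keeps the pass over an episode linear in its length.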
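In actor-critic methods, a learned value function V(S, \boldsymbol{w}) (the critic) replaces the Monte Carlo return, and the policy (the actor) is updated using the TD error \delta_t = R_{t+1} + \gamma V(S_{t+1}, \boldsymbol{w}_t) - V(S_t, \boldsymbol{w}_t).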
Algorithm:
Initialize actor θ and critic w
For each episode:
    Initialize S
    For each step:
        Choose A ~ π(·|S, θ)
        Take action A, observe R, S'
        δ = R + γV(S', w) - V(S, w)
        w = w + α_w * δ * ∇_w V(S, w)
        θ = θ + α_θ * δ * ∇_θ log π(A|S, θ)
        S = S'
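A single step of the loop above might look like this in plain Python. The sketch assumes a tabular critic `v[s]` and a tabular softmax actor `theta[a][s]`, both hypothetical layouts. It also adds a `done` flag that treats the value of a terminal state as zero, which the pseudocode leaves implicit.

```python
import math

def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(theta, v, s, a, r, s_next, done,
                      alpha_theta=0.1, alpha_w=0.1, gamma=0.99):
    """One-step actor-critic update; returns the TD error delta.

    theta[a][s]: actor preferences, v[s]: critic values (assumed tabular).
    """
    # Assumption: V(terminal) = 0, so the bootstrap term is dropped when done.
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]

    # Critic: for a tabular V, the gradient with respect to v[s] is 1.
    v[s] += alpha_w * delta

    # Actor: move preferences along delta * grad log pi(a|s).
    probs = softmax([theta[b][s] for b in range(len(theta))])
    for b in range(len(theta)):
        grad = (1.0 if b == a else 0.0) - probs[b]
        theta[b][s] += alpha_theta * delta * grad
    return delta
```

Unlike REINFORCE, this update can run after every transition, because the TD error only needs the current reward and the critic's estimate of the next state.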
Advanced Policy Gradient Methods
Natural Policy Gradients
Problem: Parameter space doesn’t match policy space.
Solution: Use natural gradients that account for the geometry of the policy space.
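A small example may help show what "geometry of the policy space" means. The natural gradient rescales the ordinary gradient by the inverse Fisher information of the policy. The sketch below works this out for a two-action bandit with \pi(a=1) = sigmoid(\theta); the bandit setting, the scalar parameter, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vanilla_and_natural_gradient(theta, reward_0, reward_1):
    """Two-action bandit with pi(a=1) = sigmoid(theta).

    Returns (vanilla gradient of expected reward, natural gradient).
    """
    p = sigmoid(theta)
    # J(theta) = p * r1 + (1 - p) * r0, so dJ/dtheta = p (1 - p) (r1 - r0).
    vanilla = p * (1.0 - p) * (reward_1 - reward_0)
    # Fisher information of the policy: E[(d log pi / d theta)^2] = p (1 - p).
    fisher = p * (1.0 - p)
    natural = vanilla / fisher
    return vanilla, natural
```

When the policy is nearly deterministic (large |\theta|), the vanilla gradient shrinks toward zero even though the better action has not changed, while the natural gradient stays at r1 - r0. That is the mismatch between parameter space and policy space in its simplest form.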
Training REINFORCE (vanilla)...
Episode 0, Average Reward: 34.00
Episode 100, Average Reward: 16.28
Episode 200, Average Reward: 16.03
Episode 300, Average Reward: 16.04
Training REINFORCE with Baseline...
Episode 0, Average Reward: 30.00
Episode 100, Average Reward: 42.02
Episode 200, Average Reward: 16.44
Episode 300, Average Reward: 29.41
REINFORCE Learning on CartPole Environment
Final Performance Summary:
REINFORCE (vanilla): 15.90
REINFORCE + Baseline: 82.18
Improvement with baseline: 416.9%
Key Insights:
• Baseline reduces variance and improves learning stability
• Policy gradients learn stochastic policies directly
• Higher rewards lead to increased action probabilities
• Variance reduction is crucial for policy gradient methods
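The variance-reduction effect of a baseline can be checked on a toy problem. The sketch below is not the CartPole experiment reported above: it assumes a two-armed bandit with a fixed uniform policy, Gaussian reward noise, and hypothetical function names. It compares gradient estimates of the form (R - b) * d log \pi / d\theta with and without a baseline b.

```python
import random
import statistics

def gradient_samples(baseline, n=5000, seed=0):
    """Sample score-function gradient estimates for a two-armed bandit.

    Policy: pi(a=1) = sigmoid(theta) evaluated at theta = 0, so both arms
    are equally likely. Arm 1 pays 11 on average, arm 0 pays 10 (assumed values).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        action = 1 if rng.random() < 0.5 else 0
        reward = (11.0 if action == 1 else 10.0) + rng.gauss(0.0, 1.0)
        # d log pi(a) / d theta at theta = 0: +0.5 for arm 1, -0.5 for arm 0
        score = 0.5 if action == 1 else -0.5
        samples.append((reward - baseline) * score)
    return samples

def summarize(samples):
    """Return (sample mean, sample variance) of the gradient estimates."""
    return statistics.mean(samples), statistics.variance(samples)

mean_plain, var_plain = summarize(gradient_samples(baseline=0.0))
mean_base, var_base = summarize(gradient_samples(baseline=10.5))
```

In this setup the expected gradient is 0.25 with or without the baseline, so subtracting b does not bias the estimate. The variance, however, is far smaller with b close to the average reward, which is why fewer samples are needed for a usable gradient.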
Autonomous driving: Path planning, decision making
Key Takeaways
Direct policy optimization: Learn policy directly without value functions
Natural for continuous actions: Actions are sampled directly from \pi(a|s), with no \max_a Q(s,a) step
High variance challenge: Requires variance reduction techniques
Actor-critic combination: Combines benefits of policy and value methods
Modern deep RL: Foundation for state-of-the-art algorithms
Policy gradient methods provide a powerful framework for reinforcement learning, especially in continuous control and when stochastic policies are beneficial. They form the foundation for many modern deep RL algorithms!