Actor-Critic Methods

Introduction to Actor-Critic Methods

Actor-Critic methods combine the best of both worlds:
- Value-based methods: Learn value functions (low variance, biased)
- Policy-based methods: Learn policies directly (high variance, unbiased)

Key insight: Use the value function to reduce variance of policy gradient estimates.

The Actor-Critic Architecture

Actor: Policy \pi(a|s, \boldsymbol{\theta})
- Role: Decides what action to take
- Learning: Updates policy parameters to maximize expected return

Critic: Value function V(s, \boldsymbol{w}) or Q(s,a, \boldsymbol{w})
- Role: Evaluates how good the actor's actions are
- Learning: Updates value function parameters to minimize prediction error

Interaction:
1. Actor chooses action based on current policy
2. Critic evaluates the action
3. Actor updates policy based on critic's feedback
4. Critic updates value estimates based on observed rewards

Basic Actor-Critic Algorithm

Update equations:
- Critic: \boldsymbol{w}_{t+1} = \boldsymbol{w}_t + \alpha_w \delta_t \nabla_{\boldsymbol{w}} V(S_t, \boldsymbol{w}_t)
- Actor: \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha_\theta \delta_t \nabla_{\boldsymbol{\theta}} \log \pi(A_t|S_t, \boldsymbol{\theta}_t)

Where \delta_t = R_{t+1} + \gamma V(S_{t+1}, \boldsymbol{w}_t) - V(S_t, \boldsymbol{w}_t) is the TD error.

Algorithm:

Initialize actor parameters θ and critic parameters w
For each episode:
    Initialize state S
    For each step:
        Choose action A ~ π(·|S, θ)
        Take action A, observe reward R and next state S'
        
        # Critic update
        δ = R + γV(S', w) - V(S, w)    # take V(S', w) = 0 if S' is terminal
        w = w + α_w * δ * ∇_w V(S, w)
        
        # Actor update  
        θ = θ + α_θ * δ * ∇_θ log π(A|S, θ)
        
        S = S'
    Until S is terminal
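
The update step above can be sketched in plain Python. This is a minimal illustration under assumptions that are not in the text: a tabular critic (one value per state), a tabular softmax actor (one preference per state-action pair), and hypothetical step sizes. The function name `actor_critic_step` is made up for this example.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      alpha_theta=0.1, alpha_w=0.5, gamma=0.9):
    """Apply one actor-critic update in place and return the TD error.

    theta[s][a] are tabular action preferences (softmax policy);
    w[s] is the tabular state-value estimate V(s, w).
    """
    v_next = 0.0 if done else w[s_next]      # V of a terminal state is 0
    delta = r + gamma * v_next - w[s]        # TD error

    # Actor gradient uses the policy before the update:
    # d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s)
    pi = softmax(theta[s])

    # Critic: for a tabular V, the gradient with respect to w[s] is 1
    w[s] += alpha_w * delta

    # Actor: move preferences in the direction delta * grad log pi
    for b in range(len(pi)):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_theta * delta * grad_log_pi
    return delta

# Two states, two actions, everything initialised to zero
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [0.0, 0.0]
delta = actor_critic_step(theta, w, s=0, a=1, r=1.0, s_next=1, done=False)
print(delta, w, theta[0])
```

With these numbers the TD error is 1.0, so the value of state 0 moves up to 0.5 and the preference for the chosen action rises by 0.05 while the other action's preference falls by the same amount.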

Advantage Actor-Critic (A2C)

Key improvement: Weight the policy gradient by an estimate of the advantage function rather than by the raw return.

Advantage function: A(s,a) = Q(s,a) - V(s)

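Interpretation: How much better is action a than the policy's average behavior in state s?

TD error as advantage estimate: \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \approx A(S_t, A_t). The approximation holds in expectation when V is the true value function of the current policy; with a learned critic the estimate is biased.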

Benefits:
- Reduces variance of policy gradient estimates
- Centers the gradient updates around zero
- More stable learning
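
A small numeric check may make the "centered around zero" point concrete. The Q-values and policy probabilities below are hypothetical numbers for a single state, chosen only for illustration.

```python
def advantages(q_values, policy):
    """Return A(s, a) = Q(s, a) - V(s) for every action in one state,
    where V(s) = sum_a pi(a|s) * Q(s, a)."""
    v = sum(p * q for p, q in zip(policy, q_values))
    return [q - v for q in q_values]

# Hypothetical numbers for a single state with three actions
q_values = [1.0, 2.0, 4.0]
policy = [0.5, 0.25, 0.25]

adv = advantages(q_values, policy)
# The policy-weighted mean of the advantages is zero by construction
expected_adv = sum(p * a for p, a in zip(policy, adv))
print(adv, expected_adv)
```

Here V(s) = 2.0, so the advantages are [-1.0, 0.0, 2.0] and their expectation under the policy is 0: actions better than average are reinforced and worse-than-average actions are discouraged.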

Asynchronous Advantage Actor-Critic (A3C)

Problem: Single-threaded learning can be slow and unstable.

Solution: Run multiple parallel actors collecting experience asynchronously.

Key innovations:
1. Parallel actors: Multiple agents exploring different parts of environment
2. Asynchronous updates: Each actor updates global parameters independently
3. Decorrelated experience: Parallel exploration reduces correlation

Algorithm:

Global parameters: θ (actor), w (critic)
For each parallel actor:
    Initialize local parameters: θ' = θ, w' = w
    For each episode:
        Collect trajectory of length T
        For each step in trajectory:
            Calculate advantages using local critic
        
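        # Calculate gradients with respect to the local parameters
        dθ = sum of ∇_θ' log π(a|s, θ') weighted by advantages
        dw = sum of ∇_w' (return estimate - V(s, w'))²
        
        # Update global parameters (ascent on the policy objective, descent on the value error)
        θ = θ + α_θ * dθ
        w = w - α_w * dw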
        
        # Sync local parameters
        θ' = θ, w' = w

Advantages:
- Faster learning through parallelization
- Better exploration through diversity
- More stable due to decorrelated updates
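
The sync / compute / apply pattern in the pseudocode can be illustrated with a toy example. This is not A3C itself: the "environment" is replaced by a hypothetical quadratic loss (theta - TARGET)^2, there is a single scalar parameter, and a lock guards the write for simplicity, whereas A3C applies its updates without locking.

```python
import threading

TARGET = 3.0                      # minimiser of the toy loss (theta - TARGET)^2
shared = {"theta": 0.0}           # stands in for the global parameters
lock = threading.Lock()           # simplification; A3C updates are lock-free

def worker(num_updates, lr=0.05):
    for _ in range(num_updates):
        local_theta = shared["theta"]            # sync local copy from global
        grad = 2.0 * (local_theta - TARGET)      # gradient computed on the local copy
        with lock:                               # apply the (possibly stale) gradient
            shared["theta"] -= lr * grad

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["theta"])
```

Each worker may apply a gradient computed from parameters that another worker has since changed. With a small step size the shared parameter should still settle close to the target, which is the property asynchronous training relies on.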

Generalized Advantage Estimation (GAE)

Problem: Bias-variance tradeoff in advantage estimation.

Solution: Use exponentially weighted average of n-step advantages.

GAE formula: A_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}

Where \delta_{t+l} = R_{t+l+1} + \gamma V(S_{t+l+1}) - V(S_{t+l})

Intuition:
- \lambda = 0: Use only the 1-step TD error (low variance, high bias)
- \lambda = 1: Use the full Monte Carlo return minus the baseline V(S_t) (high variance, low bias)
- \lambda \in (0,1): Balance between bias and variance
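
For a finite trajectory the sum is usually computed with a backward recursion, A_t = \delta_t + \gamma\lambda A_{t+1}. The following is a minimal sketch; the function name `compute_gae` and its argument layout are assumptions made for this example, and the infinite sum is truncated at the end of the collected trajectory.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] is R_{t+1}, values[t] is V(S_t), and last_value is the
    value of the state after the final step (0.0 if it is terminal).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD error
        gae = delta + gamma * lam * gae                       # backward recursion
        advantages[t] = gae
        next_value = values[t]
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]

td_only = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
monte_carlo = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
print(td_only, monte_carlo)
```

With \lambda = 0 the result is the list of 1-step TD errors, [1.0, 1.0, 0.5]. With \lambda = 1 it is the undiscounted return minus the value estimate, [2.5, 1.5, 0.5].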

Proximal Policy Optimization (PPO)

Problem: Large policy updates can be harmful.

Solution: Constrain policy updates to stay close to old policy.

PPO-Clip objective: L^{CLIP}(\boldsymbol{\theta}) = \mathbb{E}_t\left[\min\left(r_t(\boldsymbol{\theta}) \hat{A}_t, \text{clip}(r_t(\boldsymbol{\theta}), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]

Where:
- r_t(\boldsymbol{\theta}) = \frac{\pi_{\boldsymbol{\theta}}(A_t|S_t)}{\pi_{\boldsymbol{\theta}_{old}}(A_t|S_t)} (probability ratio)
- \hat{A}_t is the advantage estimate
- \epsilon is the clipping parameter (typically 0.2)
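
To see what the min and clip do, it can help to evaluate the objective on a few hand-picked cases. The sketch below uses plain Python floats instead of tensors; the function names are made up for this example.

```python
import math

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Average the clipped surrogate over a batch (to be maximized)."""
    terms = [
        clipped_surrogate(math.exp(new_lp - old_lp), adv, epsilon)
        for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages)
    ]
    return sum(terms) / len(terms)

# Ratio above 1 + eps with a positive advantage: the gain is capped
print(clipped_surrogate(1.5, 1.0))
# Ratio above 1 + eps with a negative advantage: the full penalty is kept
print(clipped_surrogate(1.5, -1.0))
# Ratio below 1 - eps with a negative advantage: the term is capped at (1 - eps) * A
print(clipped_surrogate(0.5, -1.0))
```

The three cases evaluate to 1.2, -1.5, and -0.8. Clipping removes the incentive to push the ratio further once it has left the interval in the direction that improves the objective, but never hides a change that makes the objective worse.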

Algorithm:

For each iteration:
    Collect trajectories using current policy
    Compute advantages using GAE
    
    For multiple epochs:
        For each minibatch:
            Update policy using PPO-Clip loss
            Update value function using MSE loss

Soft Actor-Critic (SAC)

For continuous control: Combines actor-critic with maximum entropy RL.

Key idea: Maximize both reward and policy entropy.

Objective: J(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \left( R(s_t, a_t) + \alpha \mathcal{H}(\pi_\theta(\cdot|s_t)) \right)\right]

Where \mathcal{H} is entropy and \alpha is temperature parameter.

Components:
- Actor: Stochastic policy \pi_\theta(a|s)
- Critic: Twin Q-functions Q_{\phi_1}(s,a), Q_{\phi_2}(s,a)
- Target networks: For stable learning
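
The effect of the entropy term can be shown with a small calculation. The sketch below assumes a discrete action set purely for illustration (SAC is normally applied to continuous actions, where the expectation is estimated from sampled actions), and all numbers are hypothetical. It computes the entropy-regularized state value E_a[min(Q_1, Q_2)(s, a)] + \alpha \mathcal{H}(\pi(\cdot|s)).

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_state_value(q1, q2, probs, alpha):
    """Expected minimum of the twin Q-values plus the entropy bonus."""
    expected_q = sum(p * min(a, b) for p, a, b in zip(probs, q1, q2))
    return expected_q + alpha * entropy(probs)

q1 = [1.0, 2.0]        # hypothetical outputs of the first critic
q2 = [1.5, 1.0]        # hypothetical outputs of the second critic
alpha = 0.2            # temperature

uniform = soft_state_value(q1, q2, [0.5, 0.5], alpha)
greedy = soft_state_value(q1, q2, [1.0, 0.0], alpha)
print(uniform, greedy)
```

Both actions have the same pessimistic value min(Q_1, Q_2) = 1.0, so the uniform policy scores higher than the deterministic one by exactly \alpha \ln 2: when rewards are tied, the objective prefers the more random policy.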

Twin Delayed Deep Deterministic Policy Gradient (TD3)

Improvements over DDPG:
1. Twin critics: Use two Q-functions, take minimum
2. Delayed policy updates: Update policy less frequently than critics
3. Target policy smoothing: Add noise to target policy

Algorithm:

For each step:
    # Collect experience
    a = μ(s) + noise
    Execute a, observe r, s'
    Store (s, a, r, s') in replay buffer
    
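    # Update critics
    Sample batch from replay buffer
    ã = clip(μ_target(s') + clip(noise, -c, c), a_low, a_high)
    y = r + γ * min(Q1_target(s', ã), Q2_target(s', ã))
    Update both Q-functions toward y (MSE loss)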
    
    # Update actor (delayed)
    If step % d == 0:
        Update policy using deterministic policy gradient
        Update target networks
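
The critic target combines two of the three improvements: target policy smoothing and the minimum over twin critics. Below is a minimal sketch for a one-dimensional action. The function name, the callable critics, and the default noise clip and action bounds are assumptions made for this example.

```python
def clip(x, low, high):
    return max(low, min(x, high))

def td3_target(reward, done, next_state, next_action_mean, noise,
               q1_target, q2_target, gamma=0.99, noise_clip=0.5,
               action_low=-1.0, action_high=1.0):
    """Bellman target shared by both critics in TD3 (scalar action)."""
    # Target policy smoothing: add clipped noise, then respect action bounds
    smoothed = next_action_mean + clip(noise, -noise_clip, noise_clip)
    next_action = clip(smoothed, action_low, action_high)
    # Twin critics: take the smaller of the two target Q-values
    min_q = min(q1_target(next_state, next_action),
                q2_target(next_state, next_action))
    # No bootstrapping from terminal states
    return reward + (0.0 if done else gamma * min_q)

# Hypothetical target critics that depend only on the action
q1 = lambda s, a: 2.0 + a
q2 = lambda s, a: 3.0 - a

y = td3_target(reward=1.0, done=False, next_state=None,
               next_action_mean=0.9, noise=0.8, q1_target=q1, q2_target=q2)
y_terminal = td3_target(reward=1.0, done=True, next_state=None,
                        next_action_mean=0.9, noise=0.8,
                        q1_target=q1, q2_target=q2)
print(y, y_terminal)
```

In this example the noise 0.8 is clipped to 0.5, the resulting action 1.4 is clipped to the bound 1.0, and the critics give 3.0 and 2.0, so the target is 1.0 + 0.99 * 2.0 = 2.98. For the terminal transition the target is just the reward, 1.0.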

Comparison of Actor-Critic Methods

Method	Type	Key Features	Best For
A2C	On-policy	Advantage estimation	Discrete actions
A3C	On-policy	Asynchronous parallel learning	Fast learning
PPO	On-policy	Clipped updates, stable	General purpose
SAC	Off-policy	Maximum entropy, stochastic	Continuous control
TD3	Off-policy	Deterministic, twin critics	Continuous control

Code Example: A2C

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class ActorCritic(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=128):
        super().__init__()
        
        # Shared layers
        self.shared = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU()
        )
        
        # Actor head
        self.actor = nn.Sequential(
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1)
        )
        
        # Critic head
        self.critic = nn.Linear(hidden_size, 1)
    
    def forward(self, state):
        shared_features = self.shared(state)
        policy = self.actor(shared_features)
        value = self.critic(shared_features)
        return policy, value

class A2C:
    def __init__(self, state_size, action_size, lr=0.001):
        self.model = ActorCritic(state_size, action_size)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        self.gamma = 0.99
        
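    def select_action(self, state):
        # Add a batch dimension: (state_size,) -> (1, state_size)
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        policy, value = self.model(state)
        # Sample from the categorical policy; keep the log-probability
        # (with its graph) for the policy-gradient update
        dist = torch.distributions.Categorical(probs=policy)
        action = dist.sample()
        return action.item(), dist.log_prob(action), value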
    
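    def update(self, trajectories):
        policy_losses = []
        value_losses = []
        
        for trajectory in trajectories:
            states, actions, rewards, log_probs, values = trajectory
            
            # Stack the per-step tensors from select_action into 1-D tensors
            log_probs = torch.cat([lp.reshape(-1) for lp in log_probs])
            values = torch.cat([v.reshape(-1) for v in values])
            
            # Calculate discounted Monte Carlo returns, working backwards
            # (assumes each trajectory runs to the end of an episode)
            returns = []
            G = 0.0
            for reward in reversed(rewards):
                G = reward + self.gamma * G
                returns.append(G)
            returns.reverse()
            returns = torch.tensor(returns, dtype=torch.float32)
            
            # Calculate advantages (return minus the critic's baseline)
            advantages = returns - values
            
            # Policy loss: detach advantages so this term does not train the critic
            policy_loss = -(log_probs * advantages.detach()).mean()
            policy_losses.append(policy_loss)
            
            # Value loss: mean squared error between returns and value estimates
            value_loss = advantages.pow(2).mean()
            value_losses.append(value_loss)
        
        # Update parameters with a single optimizer step on the combined loss
        total_loss = sum(policy_losses) + sum(value_losses)
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()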

Practical Considerations

Hyperparameter Tuning

Learning rates:
- Actor learning rate typically smaller than critic
- Common ratio: \alpha_\theta = 0.1 \times \alpha_w

Network architecture:
- Shared layers often beneficial
- Separate networks sometimes better for complex tasks

Batch size:
- Larger batches more stable
- Smaller batches faster updates

Common Issues

Instability: Actor and critic learning can interfere
Catastrophic forgetting: Network can forget previous knowledge
Hyperparameter sensitivity: Requires careful tuning
Sample efficiency: Often requires many samples

Debugging Tips

Monitor value function: Should track true returns
Check policy entropy: Should not collapse too quickly
Gradient norms: Should be reasonable magnitude
Advantage distribution: Should be centered around zero
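
Two of these checks are easy to compute from logged data. The helpers below are a sketch with made-up names: `policy_entropy` measures how spread out the action distribution is, and `explained_variance` is one common way to quantify how well the critic tracks observed returns (1.0 is a perfect fit, 0.0 is no better than predicting a constant).

```python
import math

def policy_entropy(probs):
    """Entropy (in nats) of one action distribution; 0 means deterministic."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def explained_variance(values, returns):
    """1 - Var(returns - values) / Var(returns)."""
    var_returns = variance(returns)
    if var_returns == 0.0:
        return float("nan")      # undefined when all returns are identical
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - variance(residuals) / var_returns

returns = [1.0, 2.0, 3.0, 4.0]
perfect = explained_variance([1.0, 2.0, 3.0, 4.0], returns)
constant = explained_variance([2.5, 2.5, 2.5, 2.5], returns)
print(perfect, constant)
print(policy_entropy([0.25, 0.25, 0.25, 0.25]), policy_entropy([1.0, 0.0, 0.0, 0.0]))
```

A critic that matches the returns exactly scores 1.0 and a constant prediction scores 0.0. A uniform policy over four actions has entropy ln 4 (about 1.386), while a deterministic policy has entropy 0, so a rapid drop toward 0 early in training is a sign of premature collapse.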

Applications

Continuous control: Robotics, autonomous vehicles
Game playing: Real-time strategy games
Resource allocation: Cloud computing, power grids
Natural language processing: Dialogue systems, text generation
Finance: Trading, portfolio management

Key Takeaways

Best of both worlds: Combines value and policy methods
Variance reduction: Critic reduces policy gradient variance
Parallel learning: A3C-style parallelism improves efficiency
Stability matters: Methods like PPO provide stable updates
Continuous control: Methods such as SAC and TD3 are well suited to continuous action spaces

Actor-critic methods represent a major advance in reinforcement learning, providing a principled way to combine the benefits of both value-based and policy-based approaches. They form the foundation for many state-of-the-art deep RL algorithms!