📐 Concept diagram

23-10 — Advanced RL: TRPO, SAC, Model-Based & Multi-Agent

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-10 Prerequisites: 23-09 — PPO Next subject: 24-01 — Information Geometry

Learning Objectives

By the end of this subject, you will be able to:

Derive TRPO's KL-constrained policy update and the natural gradient approximation
Formulate Soft Actor-Critic's maximum entropy objective: $J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)}[r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))]$
Explain how learned world models enable planning and sample-efficient model-based RL
Contrast cooperative, competitive, and mixed multi-agent RL settings
Understand MADDPG's centralized critic with decentralized actor architecture

Core Content

Trust Region Policy Optimization (TRPO)

TRPO (Schulman et al., 2015) is PPO's principled predecessor. While PPO uses clipping as a heuristic trust region, TRPO solves a constrained optimization problem at each iteration:

$$\max_\theta \; \mathbb{E}{s \sim \rho^{\theta{\text{old}}}, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \hat{A}^{\theta_{\text{old}}}(s,a)\right]$$

$$\text{subject to} \quad \mathbb{E}{s \sim \rho^{\theta{\text{old}}}}\left[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s) | \pi_\theta(\cdot|s))\right] \leq \delta$$

where $\delta$ is the KL-divergence budget (typically $\delta = 0.01$).

Why the KL constraint? The surrogate objective (importance-sampled policy gradient) is a valid local approximation to the true expected return. The approximation error is bounded by:

$$\left|J(\theta) - L_{\theta_{\text{old}}}(\theta)\right| \leq C \cdot \max_s D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s) | \pi_\theta(\cdot|s))$$

So keeping the KL small guarantees the surrogate is faithful. TRPO enforces this as a hard constraint.

TRPO's Solution Method

TRPO approximates the constrained problem using a linear-quadratic expansion:

1. Linearize the objective: Let $g = \nabla_\theta L_{\theta_{\text{old}}}(\theta)|{\theta=\theta{\text{old}}}$ be the policy gradient. The linear approximation:

$$L \approx g^T (\theta - \theta_{\text{old}})$$

2. Quadratic-ize the constraint: The KL divergence has zero gradient and Hessian $F$ (the Fisher Information Matrix) at $\theta = \theta_{\text{old}}$:

$$D_{\text{KL}}(\pi_{\theta_{\text{old}}} | \pi_\theta) \approx \frac{1}{2}(\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}})$$

where $F = \mathbb{E}{s,a \sim \pi{\theta_{\text{old}}}}[\nabla_\theta \log \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)^T]$.

3. Solve the constrained quadratic program:

$$\max_{\Delta\theta} g^T \Delta\theta \quad \text{s.t.} \quad \frac{1}{2}\Delta\theta^T F \Delta\theta \leq \delta$$

The solution (by KKT conditions) is the natural gradient:

$$\Delta\theta = \sqrt{\frac{2\delta}{g^T F^{-1} g}} \cdot F^{-1} g$$

⚠️ CRITICAL: Computing $F^{-1}g$ exactly is $O(d^3)$ — infeasible for neural networks with millions of parameters. TRPO uses conjugate gradient (CG) to solve $Fx = g$ approximately in $O(kd)$ time, where $k \ll d$ (typically $k=10$). A line search along the computed direction ensures the KL constraint is satisfied.

TRPO vs. PPO

Aspect	TRPO	PPO
Constraint	Hard KL bound $\leq \delta$	Soft clipping $[1-\varepsilon, 1+\varepsilon]$
Optimization	Second-order (CG + line search)	First-order (SGD/Adam)
Computation	$O(kd)$ per update, complex	$O(d)$ per update, simple
Implementation	Difficult (CG, Fisher-vector products, line search)	Trivial (few lines of code)
Performance	Similar to PPO	Similar to TRPO

PPO "won" in practice due to simplicity, but understanding TRPO illuminates why PPO works.

Soft Actor-Critic (SAC)

SAC (Haarnoja et al., 2018) is a maximum entropy RL algorithm. Instead of maximizing only expected return, SAC maximizes return plus entropy:

$$J(\pi) = \sum_{t=0}^\infty \mathbb{E}{(s_t, a_t) \sim \rho\pi}\left[r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right]$$

where $\mathcal{H}(\pi(\cdot|s)) = -\mathbb{E}_{a \sim \pi(\cdot|s)}[\log \pi(a|s)]$ is the policy entropy and $\alpha$ is the temperature parameter.

Intuition: The entropy term $\alpha\mathcal{H}(\pi)$ encourages the policy to remain stochastic and explore. It acts as a built-in exploration bonus proportional to the policy's uncertainty. This provides: - Better exploration (policy doesn't collapse to deterministic too early) - Robustness to model errors (stochastic policies are smoother) - Automatic adjustment via learnable $\alpha$

Modified Bellman Equations for SAC

The entropy-augmented reward modifies the Q-function:

$$Q^\pi(s,a) = \mathbb{E}\left[r(s,a) + \gamma\left(Q^\pi(s', a') - \alpha \log \pi(a'|s')\right)\right]$$

where $a' \sim \pi(\cdot|s')$. The entropy term $\alpha \log \pi(a'|s')$ is subtracted from the next-state Q-value (or equivalently, added to the reward).

The soft value function:

$$V(s) = \mathbb{E}_{a \sim \pi}[Q(s,a) - \alpha \log \pi(a|s)]$$

SAC Architecture: - 2 Q-functions ($Q_{\phi_1}, Q_{\phi_2}$): Double Q-learning to reduce overestimation - 1 policy ($\pi_\theta$): Gaussian policy with reparameterization trick - 1 value function or target Q for stability - Temperature $\alpha$: Learned to maintain target entropy

Policy update (reparameterized):

$$J_\pi(\theta) = \mathbb{E}{s \sim \mathcal{D}, \epsilon \sim \mathcal{N}}\left[\alpha \log \pi\theta(f_\theta(s, \epsilon)|s) - \min_{i=1,2} Q_{\phi_i}(s, f_\theta(s, \epsilon))\right]$$

where $a = f_\theta(s, \epsilon) = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon$ is the reparameterized action. This enables gradient flow through the sampling operation.

Temperature learning: SAC automatically tunes $\alpha$ by minimizing:

$$J(\alpha) = \mathbb{E}_{a \sim \pi}\left[-\alpha \log \pi(a|s) - \alpha \bar{\mathcal{H}}\right]$$

where $\bar{\mathcal{H}}$ is a target entropy (typically $-\dim(\mathcal{A})$).

Model-Based RL

In model-free RL (everything we've covered so far), the agent learns directly from environment interactions without modeling the dynamics. In model-based RL, the agent learns (or is given) a world model — an approximation of the transition dynamics $P(s'|s,a)$ and reward function $R(s,a)$.

The Model-Based RL Loop

Collect data using current policy
Learn the model: $\hat{P}(s'|s,a)$ and $\hat{R}(s,a)$ from data
Plan using the learned model to improve policy
Execute improved policy in the real environment
Repeat

Types of World Models

1. Deterministic models: $$\hat{s}{t+1} = f\phi(s_t, a_t)$$

Simple but can't capture stochastic environments. Train with MSE: $\mathcal{L} = |s_{t+1} - f_\phi(s_t, a_t)|^2$.

2. Probabilistic models: $$s_{t+1} \sim \mathcal{N}(\mu_\phi(s_t, a_t), \Sigma_\phi(s_t, a_t))$$

Better for stochastic environments. Train with negative log-likelihood.

3. Latent dynamics models (e.g., Dreamer, PlaNet): Instead of modeling raw observations, learn a compact latent representation: $$z_t = \text{enc}\phi(o_t)$$ $$\hat{z}{t+1} = f_\phi(z_t, a_t)$$

This is far more efficient for high-dimensional observations (images, sensory data).

Planning with Learned Models

Model Predictive Control (MPC): At each step, use the learned model to simulate multiple action sequences, evaluate them, and execute the first action of the best sequence:

$$a_t^* = \arg\max_{a_{t:t+H}} \mathbb{E}{\hat{P}}\left[\sum{h=0}^{H-1} \gamma^h \hat{R}(\hat{s}{t+h}, a{t+h})\right]$$

Dyna-style: Use the model to generate synthetic experience for model-free RL algorithms. The model acts as a data augmenter.

Backpropagation through the model: For differentiable models, compute $\frac{\partial R}{\partial a}$ and optimize actions via gradient descent.

Model Bias and Corrective Strategies

⚠️ CRITICAL: Model bias. Errors in the learned model compound over multi-step rollouts. A model with 1% per-step error after $H=100$ steps can have $1 - 0.99^{100} \approx 63\%$ cumulative error. Strategies: - Short horizons: Limit planning horizon to reduce compounding - Ensemble models: Average predictions from multiple models to reduce variance - Uncertainty-aware planning: Use model uncertainty to penalize exploration of poorly-modeled regions - Hybrid model-free + model-based: Use the model for planning but refine with model-free value functions (e.g., MBPO)

Multi-Agent RL (MARL)

In single-agent RL, the environment is stationary from the agent's perspective. In multi-agent RL, each agent's policy changes create a non-stationary environment for other agents.

Settings

1. Cooperative: All agents share a common reward function. Goal: maximize joint return. 2. Competitive (zero-sum): One agent's gain is another's loss: $R_1 + R_2 = 0$. 3. Mixed (general-sum): Agents have individual reward functions that may partially align or conflict.

Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

MADDPG (Lowe et al., 2017) extends DDPG to multi-agent settings using centralized training with decentralized execution:

During training (centralized critic): Each agent $i$'s critic has access to ALL agents' observations and actions: $$Q_i^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a})$$

where $\mathbf{o} = (o_1, \dots, o_N)$ and $\mathbf{a} = (a_1, \dots, a_N)$ are the joint observations and actions.

During execution (decentralized actor): Each agent $i$'s policy $\pi_i(a_i | o_i)$ uses only its own observation $o_i$.

Policy gradient for agent $i$: $$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}{\mathbf{o}, \mathbf{a} \sim \mathcal{D}}\left[\nabla{\theta_i} \pi_i(a_i|o_i) \nabla_{a_i} Q_i^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a})|_{a_i=\pi_i(o_i)}\right]$$

Critic update: Standard TD learning with the joint action: $$y_i = r_i + \gamma Q_i'(\mathbf{o}', \mathbf{a}')|_{\mathbf{a}' = \boldsymbol{\pi}'(\mathbf{o}')}$$

where $\boldsymbol{\pi}'$ are target policies.

Key insight: Centralized critics can learn to account for other agents' strategies during training. But at test time, agents act independently using only local observations — no communication needed.

Challenges in MARL

Non-stationarity: Other agents' policy changes make each agent's environment non-Markovian from its perspective
Credit assignment: Who contributed to the joint outcome?
Scalability: Joint action space grows exponentially: $|\mathcal{A}_1| \times \dots \times |\mathcal{A}_N|$
Equilibrium selection: Multiple Nash equilibria may exist — which one to converge to?

Key Terms

Collect data
Execute
Learn the model
Model-Based RL
Multi-Agent RL
Plan
TRPO

Worked Examples

Example 1: TRPO Natural Gradient vs. Standard Gradient

A 2D policy parameter $\theta = [\theta_1, \theta_2]$ has policy gradient $g = [1, 10]$ and Fisher matrix $F = \begin{bmatrix} 100 & 0 \ 0 & 1 \end{bmatrix}$. Compare the standard gradient step and the natural gradient step with $\delta = 0.01$.

Solution:

Standard gradient: $\Delta\theta \propto g = [1, 10]$. The step is dominated by $\theta_2$.

Natural gradient: $F^{-1}g = \begin{bmatrix} 1/100 & 0 \ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \ 10 \end{bmatrix} = \begin{bmatrix} 0.01 \ 10 \end{bmatrix}$

Step size: $\beta = \sqrt{2\delta / (g^T F^{-1}g)} = \sqrt{0.02 / (1 \cdot 0.01 + 10 \cdot 10)} = \sqrt{0.02 / 100.01} \approx 0.01414$

$\Delta\theta_{\text{NG}} = 0.01414 \cdot [0.01, 10] = [0.000141, 0.1414]$

$\Delta\theta_{\text{SG}}$ (with same overall step size on $F^{-1}g$ norm): much more biased toward $\theta_2$.

Click for answer

The standard gradient is heavily dominated by $\theta_2$ (ratio 1:10). The natural gradient corrects for the curvature: $\theta_1$ has high Fisher information (steep curvature), requiring smaller steps; $\theta_2$ has low curvature, allowing larger steps. The natural gradient respects the local geometry of the parameter space.

Example 2: SAC Entropy Augmentation

At state $s$, the policy outputs $[\mu, \sigma] = [1.5, 0.3]$. A sampled action $a = 1.2$ with $\log \pi(a|s) = -0.5$. $Q(s,a) = 10$, $\alpha = 0.2$. Compute the entropy-augmented Q-value used in SAC's Bellman backup.

Solution:

Regular Q: $Q(s,a) = 10$

Entropy bonus: $-\alpha \log \pi(a|s) = -0.2(-0.5) = 0.1$

Soft Q (used in target): $Q_{\text{soft}}(s,a) = 10 + 0.1 = 10.1$

For the policy update, the objective is $\alpha \log \pi(a|s) - Q(s,a) = 0.2(-0.5) - 10 = -0.1 - 10 = -10.1$. Minimizing this (or maximizing the negative) encourages actions with high Q and high entropy.

Click for answer

The entropy term adds 0.1 to the effective Q-value, making the action slightly more attractive than its raw Q-value alone. This promotes exploration: actions from high-entropy regions get a bonus.

Example 3: Model-Based Planning Horizon

A learned dynamics model has per-step prediction error $\sigma = 0.05$. The reward at each step is $\sim \mathcal{N}(\mu, 0.1)$. If planning horizon $H=10$ yields a true expected return of $J^* = 50$, what is the approximate expected return from the model's perspective at $H=10$ vs. $H=50$?

Solution:

After $H$ steps, cumulative dynamics error: $\approx H \cdot \sigma = 0.05H$

$H=10$: cumulative error $\approx 0.5$ (states are roughly 0.5 units away from true states) Estimated return: approximately $50 \pm$ (error propagation through reward)

$H=50$: cumulative error $\approx 2.5$. The model's predictions are essentially noise — planning with $H=50$ is counterproductive.

Click for answer

Long horizons amplify model error. At $H=50$, the predicted trajectory bears little resemblance to reality. This is why model-based methods typically use short horizons (5–20 steps) for planning and rely on a learned value function for long-term credit assignment (the "Dyna" approach) or train model-free components on model-generated data (MBPO).

Practice Problems

Problem 1: Derive the natural gradient solution $\Delta\theta = \beta F^{-1}g$ from the constrained optimization $\max g^T \Delta\theta$ s.t. $\frac{1}{2}\Delta\theta^T F \Delta\theta \leq \delta$.

Answer (click to expand)

Form the Lagrangian: $\mathcal{L} = g^T \Delta\theta - \lambda(\frac{1}{2}\Delta\theta^T F \Delta\theta - \delta)$ Stationarity: $\nabla_{\Delta\theta} \mathcal{L} = g - \lambda F \Delta\theta = 0 \implies \Delta\theta = \frac{1}{\lambda} F^{-1}g$ Complementary slackness: either $\lambda = 0$ or $\frac{1}{2}\Delta\theta^T F \Delta\theta = \delta$ If $g \neq 0$, constraint is binding. Substitute $\Delta\theta$: $\frac{1}{2}(\frac{1}{\lambda} g^T F^{-1}) F (\frac{1}{\lambda} F^{-1}g) = \frac{1}{2\lambda^2} g^T F^{-1} g = \delta$ Solve: $\lambda = \sqrt{\frac{g^T F^{-1} g}{2\delta}}$ Therefore: $\Delta\theta = \sqrt{\frac{2\delta}{g^T F^{-1}g}} F^{-1}g$ This is the natural gradient scaled to satisfy the KL constraint.

Problem 2: Show that SAC's soft Bellman equation reduces to the standard Bellman equation when $\alpha \to 0$.

Answer (click to expand)

Soft Bellman: $Q^\pi(s,a) = \mathbb{E}[r + \gamma(Q^\pi(s',a') - \alpha \log \pi(a'|s'))]$ As $\alpha \to 0$: $Q^\pi(s,a) \to \mathbb{E}[r + \gamma Q^\pi(s',a')]$ This is the standard Bellman equation. The entropy term vanishes, and SAC reduces to a standard actor-critic with double Q-learning. When $\alpha = 0$, the policy converges to the deterministic optimal policy (if it exists).

Problem 3: In MADDPG, why can't we just train each agent independently with single-agent RL?

Answer (click to expand)

From agent $i$'s perspective, the environment includes the other agents' policies. As other agents learn and change their behavior, the transition dynamics $P(s'|s, a_i; \pi_{-i})$ become **non-stationary**. This violates the Markov assumption underlying standard RL algorithms. Independent Q-learning may never converge because the TD target $r + \gamma \max Q(s', a')$ changes even when agent $i$'s own policy is fixed. MADDPG addresses this by conditioning the centralized critic on all agents' actions, making the environment effectively stationary during training (the critic sees the full state including other agents' actions).

Problem 4: A model-based RL agent learns transition dynamics with ensemble of 5 models. At a candidate state-action pair, the 5 models predict $\hat{s}' = [10.1, 10.2, 9.9, 15.0, 10.05]$. How should the agent use this uncertainty?

Answer (click to expand)

The ensemble reveals one outlier (15.0). The agent should: 1. Use the mean (excluding the outlier, or with it): $\approx 11.05$ (with) or $10.06$ (without) 2. Use the variance as an uncertainty estimate: $\sigma^2 \approx 4.7$ (with outlier) — high uncertainty means this region is poorly modeled 3. Penalize actions in uncertain regions: modify the planning objective to $R(s,a) - \beta \cdot \text{Var}[\hat{s}']$ where $\beta$ controls risk aversion This is **uncertainty-aware planning**: avoid actions where the model doesn't know what will happen. The outlier model might have seen an unusual transition — valuable for exploration but dangerous for exploitation.

Problem 5: Compare sample efficiency: model-free SAC vs. model-based Dyna-style agent on a task requiring 1M environment steps for SAC to converge. If the learned model is 90% accurate, what's the minimum improvement in sample efficiency from using the model?

Answer (click to expand)

Dyna-style: after each real environment step, generate $K$ synthetic transitions from the model and train the model-free algorithm on them. Effective training steps = $N_{\text{real}} \times (1 + K)$. If $K=10$, the agent gets 11x more training data per environment step. With 90% accuracy, the model bias reduces effective learning rate (some synthetic transitions are misleading). Assuming linear degradation: effective multiplier $\approx 1 + 0.9 \times 10 = 10$x. Convergence at: $1\text{M} / 10 = 100\text{K}$ environment steps — a 10x improvement. In practice, model-based methods often achieve 10–100x sample efficiency gains, but at the cost of computational overhead (training the model + generating synthetic data).

Summary

TRPO enforces a KL constraint on policy updates via second-order optimization (natural gradient + conjugate gradient); PPO's clipping approximates this with first-order methods
SAC maximizes return + entropy: $J = \sum \mathbb{E}[r + \alpha\mathcal{H}(\pi)]$, achieving robust exploration and state-of-the-art sample efficiency
Model-Based RL learns world models $\hat{P}(s'|s,a)$ for planning; suffers from compounding model error — mitigated by short horizons, ensembles, uncertainty penalties, and hybrid approaches
Multi-Agent RL deals with non-stationary environments; MADDPG uses centralized critics (see all agents) during training but decentralized actors (local observations only) during execution
MARL settings: cooperative (shared reward), competitive (zero-sum), mixed (general-sum)

Quiz

Question 1: TRPO's constraint is formulated in terms of:

A. $\ell_2$ norm of parameter change B. KL divergence between old and new policies C. Clipping of the probability ratio D. Entropy of the policy

Correct Answer: B