Math graphic
📐 Concept diagram

23-10 — Advanced RL: TRPO, SAC, Model-Based & Multi-Agent

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-10 Prerequisites: 23-09 — PPO Next subject: 24-01 — Information Geometry


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive TRPO's KL-constrained policy update and the natural gradient approximation
  2. Formulate Soft Actor-Critic's maximum entropy objective: $J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)}[r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))]$
  3. Explain how learned world models enable planning and sample-efficient model-based RL
  4. Contrast cooperative, competitive, and mixed multi-agent RL settings
  5. Understand MADDPG's centralized critic with decentralized actor architecture

Core Content

Trust Region Policy Optimization (TRPO)

TRPO (Schulman et al., 2015) is PPO's principled predecessor. While PPO uses clipping as a heuristic trust region, TRPO solves a constrained optimization problem at each iteration:

$$\max_\theta \; \mathbb{E}{s \sim \rho^{\theta{\text{old}}}, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \hat{A}^{\theta_{\text{old}}}(s,a)\right]$$

$$\text{subject to} \quad \mathbb{E}{s \sim \rho^{\theta{\text{old}}}}\left[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s) | \pi_\theta(\cdot|s))\right] \leq \delta$$

where $\delta$ is the KL-divergence budget (typically $\delta = 0.01$).

Why the KL constraint? The surrogate objective (importance-sampled policy gradient) is a valid local approximation to the true expected return. The approximation error is bounded by:

$$\left|J(\theta) - L_{\theta_{\text{old}}}(\theta)\right| \leq C \cdot \max_s D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s) | \pi_\theta(\cdot|s))$$

So keeping the KL small guarantees the surrogate is faithful. TRPO enforces this as a hard constraint.

TRPO's Solution Method

TRPO approximates the constrained problem using a linear-quadratic expansion:

1. Linearize the objective: Let $g = \nabla_\theta L_{\theta_{\text{old}}}(\theta)|{\theta=\theta{\text{old}}}$ be the policy gradient. The linear approximation:

$$L \approx g^T (\theta - \theta_{\text{old}})$$

2. Quadratic-ize the constraint: The KL divergence has zero gradient and Hessian $F$ (the Fisher Information Matrix) at $\theta = \theta_{\text{old}}$:

$$D_{\text{KL}}(\pi_{\theta_{\text{old}}} | \pi_\theta) \approx \frac{1}{2}(\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}})$$

where $F = \mathbb{E}{s,a \sim \pi{\theta_{\text{old}}}}[\nabla_\theta \log \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)^T]$.

3. Solve the constrained quadratic program:

$$\max_{\Delta\theta} g^T \Delta\theta \quad \text{s.t.} \quad \frac{1}{2}\Delta\theta^T F \Delta\theta \leq \delta$$

The solution (by KKT conditions) is the natural gradient:

$$\Delta\theta = \sqrt{\frac{2\delta}{g^T F^{-1} g}} \cdot F^{-1} g$$

⚠️ CRITICAL: Computing $F^{-1}g$ exactly is $O(d^3)$ — infeasible for neural networks with millions of parameters. TRPO uses conjugate gradient (CG) to solve $Fx = g$ approximately in $O(kd)$ time, where $k \ll d$ (typically $k=10$). A line search along the computed direction ensures the KL constraint is satisfied.

TRPO vs. PPO

Aspect TRPO PPO
Constraint Hard KL bound $\leq \delta$ Soft clipping $[1-\varepsilon, 1+\varepsilon]$
Optimization Second-order (CG + line search) First-order (SGD/Adam)
Computation $O(kd)$ per update, complex $O(d)$ per update, simple
Implementation Difficult (CG, Fisher-vector products, line search) Trivial (few lines of code)
Performance Similar to PPO Similar to TRPO

PPO "won" in practice due to simplicity, but understanding TRPO illuminates why PPO works.

Soft Actor-Critic (SAC)

SAC (Haarnoja et al., 2018) is a maximum entropy RL algorithm. Instead of maximizing only expected return, SAC maximizes return plus entropy:

$$J(\pi) = \sum_{t=0}^\infty \mathbb{E}{(s_t, a_t) \sim \rho\pi}\left[r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right]$$

where $\mathcal{H}(\pi(\cdot|s)) = -\mathbb{E}_{a \sim \pi(\cdot|s)}[\log \pi(a|s)]$ is the policy entropy and $\alpha$ is the temperature parameter.

Intuition: The entropy term $\alpha\mathcal{H}(\pi)$ encourages the policy to remain stochastic and explore. It acts as a built-in exploration bonus proportional to the policy's uncertainty. This provides: - Better exploration (policy doesn't collapse to deterministic too early) - Robustness to model errors (stochastic policies are smoother) - Automatic adjustment via learnable $\alpha$

Modified Bellman Equations for SAC

The entropy-augmented reward modifies the Q-function:

$$Q^\pi(s,a) = \mathbb{E}\left[r(s,a) + \gamma\left(Q^\pi(s', a') - \alpha \log \pi(a'|s')\right)\right]$$

where $a' \sim \pi(\cdot|s')$. The entropy term $\alpha \log \pi(a'|s')$ is subtracted from the next-state Q-value (or equivalently, added to the reward).

The soft value function:

$$V(s) = \mathbb{E}_{a \sim \pi}[Q(s,a) - \alpha \log \pi(a|s)]$$

SAC Architecture: - 2 Q-functions ($Q_{\phi_1}, Q_{\phi_2}$): Double Q-learning to reduce overestimation - 1 policy ($\pi_\theta$): Gaussian policy with reparameterization trick - 1 value function or target Q for stability - Temperature $\alpha$: Learned to maintain target entropy

Policy update (reparameterized):

$$J_\pi(\theta) = \mathbb{E}{s \sim \mathcal{D}, \epsilon \sim \mathcal{N}}\left[\alpha \log \pi\theta(f_\theta(s, \epsilon)|s) - \min_{i=1,2} Q_{\phi_i}(s, f_\theta(s, \epsilon))\right]$$

where $a = f_\theta(s, \epsilon) = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon$ is the reparameterized action. This enables gradient flow through the sampling operation.

Temperature learning: SAC automatically tunes $\alpha$ by minimizing:

$$J(\alpha) = \mathbb{E}_{a \sim \pi}\left[-\alpha \log \pi(a|s) - \alpha \bar{\mathcal{H}}\right]$$

where $\bar{\mathcal{H}}$ is a target entropy (typically $-\dim(\mathcal{A})$).

Model-Based RL

In model-free RL (everything we've covered so far), the agent learns directly from environment interactions without modeling the dynamics. In model-based RL, the agent learns (or is given) a world model — an approximation of the transition dynamics $P(s'|s,a)$ and reward function $R(s,a)$.

The Model-Based RL Loop

  1. Collect data using current policy
  2. Learn the model: $\hat{P}(s'|s,a)$ and $\hat{R}(s,a)$ from data
  3. Plan using the learned model to improve policy
  4. Execute improved policy in the real environment
  5. Repeat

Types of World Models

1. Deterministic models: $$\hat{s}{t+1} = f\phi(s_t, a_t)$$

Simple but can't capture stochastic environments. Train with MSE: $\mathcal{L} = |s_{t+1} - f_\phi(s_t, a_t)|^2$.

2. Probabilistic models: $$s_{t+1} \sim \mathcal{N}(\mu_\phi(s_t, a_t), \Sigma_\phi(s_t, a_t))$$

Better for stochastic environments. Train with negative log-likelihood.

3. Latent dynamics models (e.g., Dreamer, PlaNet): Instead of modeling raw observations, learn a compact latent representation: $$z_t = \text{enc}\phi(o_t)$$ $$\hat{z}{t+1} = f_\phi(z_t, a_t)$$

This is far more efficient for high-dimensional observations (images, sensory data).

Planning with Learned Models

Model Predictive Control (MPC): At each step, use the learned model to simulate multiple action sequences, evaluate them, and execute the first action of the best sequence:

$$a_t^* = \arg\max_{a_{t:t+H}} \mathbb{E}{\hat{P}}\left[\sum{h=0}^{H-1} \gamma^h \hat{R}(\hat{s}{t+h}, a{t+h})\right]$$

Dyna-style: Use the model to generate synthetic experience for model-free RL algorithms. The model acts as a data augmenter.

Backpropagation through the model: For differentiable models, compute $\frac{\partial R}{\partial a}$ and optimize actions via gradient descent.

Model Bias and Corrective Strategies

⚠️ CRITICAL: Model bias. Errors in the learned model compound over multi-step rollouts. A model with 1% per-step error after $H=100$ steps can have $1 - 0.99^{100} \approx 63\%$ cumulative error. Strategies: - Short horizons: Limit planning horizon to reduce compounding - Ensemble models: Average predictions from multiple models to reduce variance - Uncertainty-aware planning: Use model uncertainty to penalize exploration of poorly-modeled regions - Hybrid model-free + model-based: Use the model for planning but refine with model-free value functions (e.g., MBPO)

Multi-Agent RL (MARL)

In single-agent RL, the environment is stationary from the agent's perspective. In multi-agent RL, each agent's policy changes create a non-stationary environment for other agents.

Settings

1. Cooperative: All agents share a common reward function. Goal: maximize joint return. 2. Competitive (zero-sum): One agent's gain is another's loss: $R_1 + R_2 = 0$. 3. Mixed (general-sum): Agents have individual reward functions that may partially align or conflict.

Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

MADDPG (Lowe et al., 2017) extends DDPG to multi-agent settings using centralized training with decentralized execution:

During training (centralized critic): Each agent $i$'s critic has access to ALL agents' observations and actions: $$Q_i^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a})$$

where $\mathbf{o} = (o_1, \dots, o_N)$ and $\mathbf{a} = (a_1, \dots, a_N)$ are the joint observations and actions.

During execution (decentralized actor): Each agent $i$'s policy $\pi_i(a_i | o_i)$ uses only its own observation $o_i$.

Policy gradient for agent $i$: $$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}{\mathbf{o}, \mathbf{a} \sim \mathcal{D}}\left[\nabla{\theta_i} \pi_i(a_i|o_i) \nabla_{a_i} Q_i^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a})|_{a_i=\pi_i(o_i)}\right]$$

Critic update: Standard TD learning with the joint action: $$y_i = r_i + \gamma Q_i'(\mathbf{o}', \mathbf{a}')|_{\mathbf{a}' = \boldsymbol{\pi}'(\mathbf{o}')}$$

where $\boldsymbol{\pi}'$ are target policies.

Key insight: Centralized critics can learn to account for other agents' strategies during training. But at test time, agents act independently using only local observations — no communication needed.

Challenges in MARL



Key Terms

Worked Examples

Example 1: TRPO Natural Gradient vs. Standard Gradient

A 2D policy parameter $\theta = [\theta_1, \theta_2]$ has policy gradient $g = [1, 10]$ and Fisher matrix $F = \begin{bmatrix} 100 & 0 \ 0 & 1 \end{bmatrix}$. Compare the standard gradient step and the natural gradient step with $\delta = 0.01$.

Solution:

Standard gradient: $\Delta\theta \propto g = [1, 10]$. The step is dominated by $\theta_2$.

Natural gradient: $F^{-1}g = \begin{bmatrix} 1/100 & 0 \ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \ 10 \end{bmatrix} = \begin{bmatrix} 0.01 \ 10 \end{bmatrix}$

Step size: $\beta = \sqrt{2\delta / (g^T F^{-1}g)} = \sqrt{0.02 / (1 \cdot 0.01 + 10 \cdot 10)} = \sqrt{0.02 / 100.01} \approx 0.01414$

$\Delta\theta_{\text{NG}} = 0.01414 \cdot [0.01, 10] = [0.000141, 0.1414]$

$\Delta\theta_{\text{SG}}$ (with same overall step size on $F^{-1}g$ norm): much more biased toward $\theta_2$.

Click for answer The standard gradient is heavily dominated by $\theta_2$ (ratio 1:10). The natural gradient corrects for the curvature: $\theta_1$ has high Fisher information (steep curvature), requiring smaller steps; $\theta_2$ has low curvature, allowing larger steps. The natural gradient respects the local geometry of the parameter space.

Example 2: SAC Entropy Augmentation

At state $s$, the policy outputs $[\mu, \sigma] = [1.5, 0.3]$. A sampled action $a = 1.2$ with $\log \pi(a|s) = -0.5$. $Q(s,a) = 10$, $\alpha = 0.2$. Compute the entropy-augmented Q-value used in SAC's Bellman backup.

Solution:

Regular Q: $Q(s,a) = 10$

Entropy bonus: $-\alpha \log \pi(a|s) = -0.2(-0.5) = 0.1$

Soft Q (used in target): $Q_{\text{soft}}(s,a) = 10 + 0.1 = 10.1$

For the policy update, the objective is $\alpha \log \pi(a|s) - Q(s,a) = 0.2(-0.5) - 10 = -0.1 - 10 = -10.1$. Minimizing this (or maximizing the negative) encourages actions with high Q and high entropy.

Click for answer The entropy term adds 0.1 to the effective Q-value, making the action slightly more attractive than its raw Q-value alone. This promotes exploration: actions from high-entropy regions get a bonus.

Example 3: Model-Based Planning Horizon

A learned dynamics model has per-step prediction error $\sigma = 0.05$. The reward at each step is $\sim \mathcal{N}(\mu, 0.1)$. If planning horizon $H=10$ yields a true expected return of $J^* = 50$, what is the approximate expected return from the model's perspective at $H=10$ vs. $H=50$?

Solution:

After $H$ steps, cumulative dynamics error: $\approx H \cdot \sigma = 0.05H$

$H=10$: cumulative error $\approx 0.5$ (states are roughly 0.5 units away from true states) Estimated return: approximately $50 \pm$ (error propagation through reward)

$H=50$: cumulative error $\approx 2.5$. The model's predictions are essentially noise — planning with $H=50$ is counterproductive.

Click for answer Long horizons amplify model error. At $H=50$, the predicted trajectory bears little resemblance to reality. This is why model-based methods typically use short horizons (5–20 steps) for planning and rely on a learned value function for long-term credit assignment (the "Dyna" approach) or train model-free components on model-generated data (MBPO).

Practice Problems

Problem 1: Derive the natural gradient solution $\Delta\theta = \beta F^{-1}g$ from the constrained optimization $\max g^T \Delta\theta$ s.t. $\frac{1}{2}\Delta\theta^T F \Delta\theta \leq \delta$.

Answer (click to expand) Form the Lagrangian: $\mathcal{L} = g^T \Delta\theta - \lambda(\frac{1}{2}\Delta\theta^T F \Delta\theta - \delta)$ Stationarity: $\nabla_{\Delta\theta} \mathcal{L} = g - \lambda F \Delta\theta = 0 \implies \Delta\theta = \frac{1}{\lambda} F^{-1}g$ Complementary slackness: either $\lambda = 0$ or $\frac{1}{2}\Delta\theta^T F \Delta\theta = \delta$ If $g \neq 0$, constraint is binding. Substitute $\Delta\theta$: $\frac{1}{2}(\frac{1}{\lambda} g^T F^{-1}) F (\frac{1}{\lambda} F^{-1}g) = \frac{1}{2\lambda^2} g^T F^{-1} g = \delta$ Solve: $\lambda = \sqrt{\frac{g^T F^{-1} g}{2\delta}}$ Therefore: $\Delta\theta = \sqrt{\frac{2\delta}{g^T F^{-1}g}} F^{-1}g$ This is the natural gradient scaled to satisfy the KL constraint.

Problem 2: Show that SAC's soft Bellman equation reduces to the standard Bellman equation when $\alpha \to 0$.

Answer (click to expand) Soft Bellman: $Q^\pi(s,a) = \mathbb{E}[r + \gamma(Q^\pi(s',a') - \alpha \log \pi(a'|s'))]$ As $\alpha \to 0$: $Q^\pi(s,a) \to \mathbb{E}[r + \gamma Q^\pi(s',a')]$ This is the standard Bellman equation. The entropy term vanishes, and SAC reduces to a standard actor-critic with double Q-learning. When $\alpha = 0$, the policy converges to the deterministic optimal policy (if it exists).

Problem 3: In MADDPG, why can't we just train each agent independently with single-agent RL?

Answer (click to expand) From agent $i$'s perspective, the environment includes the other agents' policies. As other agents learn and change their behavior, the transition dynamics $P(s'|s, a_i; \pi_{-i})$ become **non-stationary**. This violates the Markov assumption underlying standard RL algorithms. Independent Q-learning may never converge because the TD target $r + \gamma \max Q(s', a')$ changes even when agent $i$'s own policy is fixed. MADDPG addresses this by conditioning the centralized critic on all agents' actions, making the environment effectively stationary during training (the critic sees the full state including other agents' actions).

Problem 4: A model-based RL agent learns transition dynamics with ensemble of 5 models. At a candidate state-action pair, the 5 models predict $\hat{s}' = [10.1, 10.2, 9.9, 15.0, 10.05]$. How should the agent use this uncertainty?

Answer (click to expand) The ensemble reveals one outlier (15.0). The agent should: 1. Use the mean (excluding the outlier, or with it): $\approx 11.05$ (with) or $10.06$ (without) 2. Use the variance as an uncertainty estimate: $\sigma^2 \approx 4.7$ (with outlier) — high uncertainty means this region is poorly modeled 3. Penalize actions in uncertain regions: modify the planning objective to $R(s,a) - \beta \cdot \text{Var}[\hat{s}']$ where $\beta$ controls risk aversion This is **uncertainty-aware planning**: avoid actions where the model doesn't know what will happen. The outlier model might have seen an unusual transition — valuable for exploration but dangerous for exploitation.

Problem 5: Compare sample efficiency: model-free SAC vs. model-based Dyna-style agent on a task requiring 1M environment steps for SAC to converge. If the learned model is 90% accurate, what's the minimum improvement in sample efficiency from using the model?

Answer (click to expand) Dyna-style: after each real environment step, generate $K$ synthetic transitions from the model and train the model-free algorithm on them. Effective training steps = $N_{\text{real}} \times (1 + K)$. If $K=10$, the agent gets 11x more training data per environment step. With 90% accuracy, the model bias reduces effective learning rate (some synthetic transitions are misleading). Assuming linear degradation: effective multiplier $\approx 1 + 0.9 \times 10 = 10$x. Convergence at: $1\text{M} / 10 = 100\text{K}$ environment steps — a 10x improvement. In practice, model-based methods often achieve 10–100x sample efficiency gains, but at the cost of computational overhead (training the model + generating synthetic data).

Summary


Quiz

Question 1: TRPO's constraint is formulated in terms of:

A. $\ell_2$ norm of parameter change B. KL divergence between old and new policies C. Clipping of the probability ratio D. Entropy of the policy

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. TRPO constrains $\mathbb{E}_s[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s) \| \pi_\theta(\cdot|s))] \leq \delta$. - If you chose A: Parameter-space distance is a poor proxy for policy change — a small parameter change can cause a large policy change. - If you chose C: Clipping is PPO's approach, not TRPO's. - If you chose D: Entropy regularization is used in SAC and PPO (as a bonus), not as the TRPO constraint.

Question 2: The "maximum entropy" in SAC refers to:

A. Maximizing the entropy of the environment dynamics B. Adding policy entropy to the reward: $r + \alpha \mathcal{H}(\pi)$ C. Maximizing the entropy of the state distribution D. Using a high-entropy neural network initialization

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. SAC augments the reward with $\alpha \mathcal{H}(\pi(\cdot|s))$, encouraging the policy to remain stochastic. - If you chose A: Environment dynamics are fixed; SAC doesn't modify them. - If you chose C: The state distribution emerges from the policy + dynamics; SAC optimizes the policy's entropy. - If you chose D: Initialization is unrelated to the maximum entropy objective.

Question 3: What is the main challenge of model-based RL with long planning horizons?

A. High computational cost of simulation B. Model errors compound exponentially over time C. The discount factor makes distant rewards irrelevant D. The replay buffer overflows

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. Each step of model rollout introduces error; after $H$ steps, cumulative error can make the predicted trajectory entirely fictitious. - If you chose A: Simulation is cheap compared to real environment interaction — that's the whole point. - If you chose C: Discount factor is a design choice; long-horizon planning can use $\gamma \approx 1$. - If you chose D: Replay buffer management is orthogonal to planning.

Question 4: In MADDPG, the centralized critic has access to:

A. Only the agent's own observations and actions B. All agents' observations and actions C. Only the reward signal D. The true environment model

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. During training, the critic receives joint observations $\mathbf{o}$ and actions $\mathbf{a}$ — it "sees" what all agents see and do. This makes the environment stationary from the critic's perspective. - If you chose A: That would be a decentralized critic, which suffers from non-stationarity. - If you chose C: The critic needs state-action information, not just rewards. - If you chose D: MADDPG is model-free; it doesn't require (or use) an environment model.

Question 5: Why is MARL fundamentally harder than single-agent RL?

A. There are more parameters to optimize B. The environment is non-stationary from each agent's perspective because other agents are also learning C. Rewards are always lower in multi-agent settings D. Neural networks can't handle multiple agents

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. From agent $i$'s view, the transition function $P(s'|s, a_i)$ changes as other agents update their policies — it's a moving-target problem amplified across all agents. - If you chose A: More parameters is an engineering challenge, not the fundamental difficulty. - If you chose C: Rewards depend on the task, not the number of agents. - If you chose D: Neural networks can handle multi-agent settings (MADDPG, QMIX, etc.).

Pitfalls

  1. TRPO without line search: The natural gradient direction alone may violate the KL constraint. TRPO always backtracks with a line search to respect $\delta$.
  2. SAC temperature too low: Small $\alpha$ causes entropy to collapse, losing the benefits of maximum entropy. Use automatic temperature tuning.
  3. Model exploitation: A learned world model has blind spots. The agent may plan actions that exploit model errors, achieving high simulated rewards but poor real-world performance. Always periodically validate on real environment.
  4. MADDPG with shared critics: If all agents share one critic network, the critic sees joint actions but the policy gradients for each agent must still be separate.
  5. Forgetting the non-stationarity problem in MARL: Simply running independent Q-learners in a multi-agent setting often diverges — you need centralized training or opponent modeling.

Next Steps

Congratulations! You've completed Phase 23 — the RL sequence. This feeds into many later topics:




Q8: What is the "centralized training with decentralized execution" paradigm in MARL?

A) All agents share the same neural network during both training and execution B) During training, agents have access to global information (all observations/actions), but during execution, each agent acts using only its local observations C) Training happens on a central server; execution happens on edge devices D) All agents communicate with each other at every time step

Answer and Explanations **Correct: B)** This paradigm (used by MADDPG, QMIX, etc.) allows critics to use joint information during training to learn coordinated strategies while actors are restricted to local observations at test time. This enables deployment in settings where communication is impossible or expensive. - A) Shared-network agents are a different approach (parameter sharing), not the CTDE paradigm. - C) This describes distributed computing, not the CTDE algorithmic paradigm. - D) Full communication at execution time defeats the purpose of decentralized execution — the point is to work without communication.