📐 Concept diagram

23-09 — Proximal Policy Optimization (PPO)

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-09 Prerequisites: 23-08 — Policy Gradient Methods Next subject: 23-10 — Advanced RL

Learning Objectives

By the end of this subject, you will be able to:

Derive the clipped surrogate objective $L^{\text{CLIP}}$ and explain why clipping prevents destructive policy updates
Compute the probability ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ and use it to measure policy change
Implement GAE (Generalized Advantage Estimation) as an exponentially weighted sum of $n$-step advantage estimates
Analyze the connection between TRPO's KL constraint and PPO's clipping mechanism
Connect PPO's objective to the RLHF reward model training from Phase 20

Core Content

The Problem PPO Solves

Standard policy gradient (23-08) suffers from step-size sensitivity: too small a step → slow learning; too large a step → the policy collapses and never recovers because the new policy collects data in a different region of state space.

Trust Region Policy Optimization (TRPO, Schulman et al., 2015) solved this with a KL-divergence constraint on policy updates:

$$\max_\theta \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \hat{A}(s,a)\right] \quad \text{s.t.} \quad \mathbb{E}[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s) | \pi_\theta(\cdot|s))] \leq \delta$$

TRPO works well but requires computing the Hessian of the KL divergence and solving a constrained optimization problem with conjugate gradient — computationally expensive and hard to implement.

PPO (Schulman et al., 2017) achieves the same effect with a simple clipped objective — no KL constraints, no conjugate gradient, just gradient ascent with clipping.

The Clipped Surrogate Objective

Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$

This measures how much the new policy has changed relative to the old one at a particular $(s_t, a_t)$ pair. When $\theta = \theta_{\text{old}}$, $r_t(\theta) = 1$.

The unclipped surrogate objective is:

$$L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[r_t(\theta) \hat{A}_t\right]$$

where $\hat{A}t$ is an advantage estimate at time $t$. This is the standard policy gradient objective written with importance sampling from $\theta{\text{old}}$. Without constraints, maximizing this can lead to excessively large policy updates.

⚠️ CRITICAL: The CPI objective uses $\hat{A}t$ computed from trajectories collected under $\theta{\text{old}}$. These advantage estimates are only valid for policies close to $\theta_{\text{old}}$ — importance sampling breaks down when the policy changes too much.

PPO's clipped objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t\right)\right]$$

where $\varepsilon$ is a hyperparameter (typically $\varepsilon = 0.1$ or $0.2$).

How clipping works:

If $\hat{A}_t > 0$ (action was good): We want to increase $r_t(\theta)$ (increase action probability). But we clip at $1+\varepsilon$, preventing the ratio from exceeding $1+\varepsilon$. The $min$ ensures we don't gain from pushing $r_t(\theta)$ beyond $1+\varepsilon$.
If $\hat{A}_t < 0$ (action was bad): We want to decrease $r_t(\theta)$ (decrease action probability). But we clip at $1-\varepsilon$, preventing the ratio from falling below $1-\varepsilon$. Again, the $min$ picks the clipped value if $r_t$ goes too low.

Visual interpretation of $L^{\text{CLIP}}$:

For $\hat{A} > 0$: $$ L = \begin{cases} r\hat{A} & \text{if } r < 1+\varepsilon \ (1+\varepsilon)\hat{A} & \text{if } r \geq 1+\varepsilon \end{cases} $$

For $\hat{A} < 0$: $$ L = \begin{cases} r\hat{A} & \text{if } r > 1-\varepsilon \ (1-\varepsilon)\hat{A} & \text{if } r \leq 1-\varepsilon \end{cases} $$

In both cases, the objective is "clipped" — there's no incentive to move $r_t$ beyond the $[1-\varepsilon, 1+\varepsilon]$ range. This prevents the policy from changing too much in a single update, mimicking TRPO's KL constraint.

Why Clipping = Trust Region

TRPO constrains $\mathbb{E}[D_{\text{KL}}(\pi_{\text{old}} | \pi_{\text{new}})] \leq \delta$. PPO clips $r_t = \pi_{\text{new}}/\pi_{\text{old}}$ to $[1-\varepsilon, 1+\varepsilon]$. These are related:

For small changes, $D_{\text{KL}}(\pi_{\text{old}} | \pi) \approx \frac{1}{2}\sum_a \pi_{\text{old}}(a|s) (r-1)^2$.

So keeping $|r_t - 1| \leq \varepsilon$ roughly bounds the per-sample KL divergence by $\varepsilon^2/2$. PPO's clipping is a KL penalty in disguise — simpler to implement but equally effective.

Generalized Advantage Estimation (GAE)

PPO needs good advantage estimates. GAE (Schulman et al., 2016) provides a low-variance, low-bias estimate by exponentially blending $n$-step advantage estimates:

$$\hat{A}t^{\text{GAE}(\gamma,\lambda)} = \sum{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

where $\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.

The parameter $\lambda \in [0,1]$ controls the bias-variance tradeoff:

$\lambda = 0$: $\hat{A}_t = \delta_t$ (TD(0), high bias, low variance)
$\lambda = 1$: $\hat{A}t = \sum{l=0}^{\infty} \gamma^l \delta_{t+l} = G_t - V(s_t)$ (Monte Carlo, low bias, high variance)

GAE computation (recursive form):

$$\hat{A}t^{\text{GAE}} = \delta_t + \gamma\lambda \hat{A}{t+1}^{\text{GAE}}$$

This can be computed efficiently in a single backward pass over the collected trajectory, making it practical for on-policy training.

Derivation: Each $\delta_{t+l}$ is an unbiased estimate of $A^\pi(s_{t+l}, a_{t+l})$. GAE forms a weighted sum with geometrically decaying weights:

$$\hat{A}t^{\text{GAE}} = \sum{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l} = \sum_{l=0}^{\infty} (\gamma\lambda)^l (R_{t+l+1} + \gamma V(s_{t+l+1}) - V(s_{t+l}))$$

This telescopes to:

$$\hat{A}t^{\text{GAE}} = -V(s_t) + R{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$ $$+ (\gamma\lambda)(-V(s_{t+1}) + R_{t+2} + \gamma R_{t+3} + \dots)$$ $$\dots$$

The effective horizon is controlled by $\lambda$: $(\gamma\lambda)^l$ decays, giving more weight to short-horizon (lower variance) estimates.

Full PPO Algorithm

Initialize policy $\pi_\theta$ and value function $V_\phi$
For $k = 1, 2, \dots$ iterations:
Collect $N$ timesteps of data under $\pi_{\theta_{\text{old}}}$ (set $\theta_{\text{old}} \leftarrow \theta$)
Compute advantages $\hat{A}_1, \dots, \hat{A}_T$ using GAE
Compute value targets $\hat{V}t = \hat{A}_t + V{\phi_{\text{old}}}(s_t)$
For $K$ epochs (typically 4–10):
- Shuffle data into mini-batches
- For each mini-batch:
- Compute $r_t(\theta)$ and $L^{\text{CLIP}}(\theta)$
- Compute value loss $L^{\text{VF}}(\phi) = \frac{1}{2}(V_\phi(s_t) - \hat{V}_t)^2$
- Total loss: $L(\theta, \phi) = L^{\text{CLIP}} - c_1 L^{\text{VF}} + c_2 S\pi_\theta$
- Update $\theta, \phi$ via gradient descent
Repeat

The entropy bonus $S\pi_\theta = -\sum_a \pi_\theta(a|s_t)\log \pi_\theta(a|s_t)$ encourages exploration by penalizing deterministic policies.

PPO in RLHF (Connection to Phase 20)

In RLHF (Reinforcement Learning from Human Feedback, Phase 20), PPO is used to fine-tune language model policies. The setup:

Policy $\pi_\theta$: the language model being fine-tuned
Reference policy $\pi_{\text{ref}}$: the initial supervised-fine-tuned model (used as $\pi_{\theta_{\text{old}}}$)
Reward model $R_\phi(x, y)$: trained on human preference data
KL penalty: Prevents the model from deviating too far from $\pi_{\text{ref}}$

The RLHF objective (before PPO clipping):

$$J(\theta) = \mathbb{E}{x \sim \mathcal{D}, y \sim \pi\theta(\cdot|x)}\left[R_\phi(x, y) - \beta \cdot D_{\text{KL}}(\pi_\theta(\cdot|x) | \pi_{\text{ref}}(\cdot|x))\right]$$

PPO applies clipping to the token-level policy updates, with the advantage computed from the reward model scores. The KL penalty acts as an additional trust region — keeping the model close to the reference to prevent reward hacking.

Key Terms

KL penalty in disguise
KL-divergence constraint
Policy
Reference policy
Reward model

Worked Examples

Example 1: PPO Clipping Mechanics

At a particular $(s_t, a_t)$: $\pi_{\theta_{\text{old}}}(a_t|s_t) = 0.2$. After one gradient step, $\pi_\theta(a_t|s_t) = 0.5$. The advantage $\hat{A}_t = +2.0$ and $\varepsilon = 0.2$. Compute $L^{\text{CPI}}$ and $L^{\text{CLIP}}$.

Solution:

$r_t(\theta) = 0.5 / 0.2 = 2.5$

CPI objective: $L^{\text{CPI}} = r_t \cdot \hat{A}_t = 2.5 \cdot 2.0 = 5.0$

CLIP objective: Since $\hat{A}_t > 0$: $\text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) = \text{clip}(2.5, 0.8, 1.2) = 1.2$ $L^{\text{CLIP}} = \min(5.0, 1.2 \cdot 2.0) = \min(5.0, 2.4) = 2.4$

The clipped objective is much lower — PPO "caps" the benefit of making the action too much more probable.

Click for answer

$L^{\text{CPI}} = 5.0$, $L^{\text{CLIP}} = 2.4$. The clipped objective is less than half the unclipped one, preventing the gradient from pushing $r_t$ even higher. If $\pi_\theta$ were 0.22 ($r_t = 1.1$), $L^{\text{CLIP}} = 2.2$, and PPO would continue increasing the probability.

Example 2: GAE Computation

A trajectory has rewards $[1, 2, -1, 3]$ and value predictions $V = [0, 1.5, 2.0, 1.0, 0]$ (terminal state has $V=0$). $\gamma = 0.9$, $\lambda = 0.8$. Compute $\delta_t$ and $\hat{A}_t^{\text{GAE}}$ for $t = 0, 1, 2$.

Solution:

TD errors: $\delta_0 = 1 + 0.9(1.5) - 0 = 1 + 1.35 = 2.35$ $\delta_1 = 2 + 0.9(2.0) - 1.5 = 2 + 1.8 - 1.5 = 2.3$ $\delta_2 = -1 + 0.9(1.0) - 2.0 = -1 + 0.9 - 2.0 = -2.1$ $\delta_3 = 3 + 0.9(0) - 1.0 = 2.0$

GAE (backward recursion, $\hat{A}_3 = \delta_3 = 2.0$): $\hat{A}_2 = \delta_2 + \gamma\lambda \hat{A}_3 = -2.1 + 0.9(0.8)(2.0) = -2.1 + 1.44 = -0.66$ $\hat{A}_1 = \delta_1 + \gamma\lambda \hat{A}_2 = 2.3 + 0.72(-0.66) = 2.3 - 0.475 = 1.825$ $\hat{A}_0 = \delta_0 + \gamma\lambda \hat{A}_1 = 2.35 + 0.72(1.825) = 2.35 + 1.314 = 3.664$

Click for answer

$\hat{A} = [3.664, 1.825, -0.66, 2.0]$. Step 0 has the highest advantage (the trajectory starts well). Step 2 has negative advantage (the reward was worse than predicted by $V$).

Example 3: Entropy Bonus Effect

A policy over 4 actions has probabilities $[0.7, 0.1, 0.1, 0.1]$. Compute the entropy $S[\pi]$ and the gradient of the entropy bonus with $c_2 = 0.01$.

Solution:

$S[\pi] = -\sum_a \pi(a) \log \pi(a) = -(0.7 \log 0.7 + 3 \times 0.1 \log 0.1)$ $= -(0.7(-0.357) + 3(-0.23)) = -(-0.25 - 0.69) = 0.94$

Entropy bonus loss term: $-c_2 S[\pi] = -0.01 \times 0.94 = -0.0094$ (negative because it's subtracted in the total loss, or equivalently, maximizing entropy means adding $+c_2 S$ to the objective).

Gradient: $\nabla_\theta S[\pi] = -\sum_a (\log \pi(a) + 1) \nabla_\theta \pi(a)$ For each action: $-\nabla_\theta \pi(a) \cdot (\log \pi(a) + 1)$

The entropy gradient pushes probabilities toward uniformity.

Click for answer

$S[\pi] = 0.94$. The entropy bonus encourages exploration by penalizing deterministic policies. Without it, PPO can collapse to a near-deterministic policy too early.

Practice Problems

Problem 1: Prove that for $\hat{A}_t > 0$, the gradient of $L^{\text{CLIP}}$ is zero when $r_t(\theta) > 1 + \varepsilon$. Why is this desirable?

Answer (click to expand)

When $\hat{A}_t > 0$ and $r_t > 1+\varepsilon$: $L^{\text{CLIP}} = \min(r_t \hat{A}_t, (1+\varepsilon)\hat{A}_t) = (1+\varepsilon)\hat{A}_t$ (constant w.r.t $\theta$) $\nabla_\theta L^{\text{CLIP}} = 0$ This is desirable because it stops the policy from changing further in this direction — the policy is already much more likely to take this action than before. Further increases in $\pi(a|s)$ don't improve the objective, so gradient descent stops. This is the trust region mechanism.

Problem 2: Derive the recursive GAE formula $\hat{A}t^{\text{GAE}} = \delta_t + \gamma\lambda \hat{A}{t+1}^{\text{GAE}}$ from the definition $\hat{A}t^{\text{GAE}} = \sum{l=0}^\infty (\gamma\lambda)^l \delta_{t+l}$.

Answer (click to expand)

$$\hat{A}_t^{\text{GAE}} = \sum_{l=0}^\infty (\gamma\lambda)^l \delta_{t+l} = \delta_t + \sum_{l=1}^\infty (\gamma\lambda)^l \delta_{t+l}$$ $$= \delta_t + \gamma\lambda \sum_{l=0}^\infty (\gamma\lambda)^l \delta_{t+1+l} = \delta_t + \gamma\lambda \hat{A}_{t+1}^{\text{GAE}}$$ This recursive form enables $O(T)$ computation: compute TD errors forward, then GAE backward in one pass.

Problem 3: Compare PPO's clipping to TRPO's KL constraint. If $\varepsilon = 0.2$, approximately what KL divergence does PPO's clipping tolerate? (Use the second-order Taylor expansion $D_{\text{KL}}(\pi_{\text{old}} | \pi) \approx \frac{1}{2}(r-1)^2$ for small changes.)

Answer (click to expand)

For an individual $(s,a)$ pair with $r = \pi/\pi_{\text{old}}$: $D_{\text{KL}} \approx \frac{1}{2}(r-1)^2$ Maximum tolerated: $|r-1| = \varepsilon = 0.2$ $D_{\text{KL}}^{\max} \approx \frac{1}{2}(0.2)^2 = 0.02$ In expectation over the data distribution, PPO tolerates roughly $D_{\text{KL}} \approx 0.02$ per update. TRPO typically uses $\delta = 0.01$, so PPO with $\varepsilon = 0.2$ is about twice as permissive as typical TRPO settings. However, PPO uses multiple epochs of updates on the same data, so the effective KL per outer iteration may be comparable.

Problem 4: PPO trains for $K = 4$ epochs on the same batch of data. Why doesn't this cause overfitting like in supervised learning? What would happen with $K = 100$?

Answer (click to expand)

PPO avoids overfitting through clipping: once the policy ratio $r_t$ hits the clipping boundary ($1 \pm \varepsilon$), the gradient for that sample becomes zero and the policy stops changing in that direction. Additional epochs have no effect on clipped samples. With $K=100$, the policy would saturate the clipping boundary on nearly all samples after the first few epochs, and subsequent epochs would do nothing — wasted computation. But the policy would also overfit to the advantage estimates (which contain noise), potentially moving it into a bad region of parameter space before hitting the clipper. Typical $K = 3$–$10$ works well.

Problem 5: In RLHF with PPO, the reward model gives scores $R_\phi(x, y)$ for generated responses. If the reference policy has probability 0.01 for a token and the updated policy has 0.03 ($r_t = 3.0$), and $\varepsilon = 0.2$, what happens to the gradient for this token?

Answer (click to expand)

$r_t = 3.0 > 1 + \varepsilon = 1.2$. The ratio is far outside the clipping range. If the token has positive advantage (the reward model liked the response): $L^{\text{CLIP}}$ clips at $(1+\varepsilon)\hat{A}$, gradient is zero — PPO discards this update. The policy has already increased this token's probability too much relative to the reference. If negative advantage: $L^{\text{CLIP}} = r_t \hat{A}_t$ (unclipped, since $\hat{A}_t < 0$ and $r_t > 1$, the min picks the smaller value). But this means the bad token's probability increased — the unclipped loss penalizes this heavily. Gradient pushes $\theta$ to reduce $\pi_\theta$. In RLHF, the additional KL penalty $\beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ also pushes back toward the reference. The combination of PPO clipping and KL penalty provides a double safety net against reward hacking.

Summary

PPO simplifies TRPO by replacing the KL constraint with a clipped surrogate objective $L^{\text{CLIP}}$
The probability ratio $r_t = \pi_\theta/\pi_{\theta_{\text{old}}}$ measures policy change; clipping to $[1-\varepsilon, 1+\varepsilon]$ prevents destructive updates
GAE blends $n$-step advantage estimates using $\lambda$ to trade off bias and variance; computed efficiently via backward recursion
PPO's total loss: $L = L^{\text{CLIP}} - c_1 L^{\text{VF}} + c_2 S[\pi]$ (clipped policy loss + value loss + entropy bonus)
Multiple epochs ($K=3$–$10$) on the same data are safe because clipping zeroes gradients on saturated samples
PPO is the workhorse of RLHF: it fine-tunes language models using reward model scores as advantage signals

Quiz

Question 1: What does PPO's clipped objective prevent?

A. The value function from overfitting B. The policy from changing too much in a single update C. The reward function from diverging D. The learning rate from decaying

Correct Answer: B