Math graphic
📐 Concept diagram

23-09 — Proximal Policy Optimization (PPO)

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-09 Prerequisites: 23-08 — Policy Gradient Methods Next subject: 23-10 — Advanced RL


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the clipped surrogate objective $L^{\text{CLIP}}$ and explain why clipping prevents destructive policy updates
  2. Compute the probability ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ and use it to measure policy change
  3. Implement GAE (Generalized Advantage Estimation) as an exponentially weighted sum of $n$-step advantage estimates
  4. Analyze the connection between TRPO's KL constraint and PPO's clipping mechanism
  5. Connect PPO's objective to the RLHF reward model training from Phase 20

Core Content

The Problem PPO Solves

Standard policy gradient (23-08) suffers from step-size sensitivity: too small a step → slow learning; too large a step → the policy collapses and never recovers because the new policy collects data in a different region of state space.

Trust Region Policy Optimization (TRPO, Schulman et al., 2015) solved this with a KL-divergence constraint on policy updates:

$$\max_\theta \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \hat{A}(s,a)\right] \quad \text{s.t.} \quad \mathbb{E}[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s) | \pi_\theta(\cdot|s))] \leq \delta$$

TRPO works well but requires computing the Hessian of the KL divergence and solving a constrained optimization problem with conjugate gradient — computationally expensive and hard to implement.

PPO (Schulman et al., 2017) achieves the same effect with a simple clipped objective — no KL constraints, no conjugate gradient, just gradient ascent with clipping.

The Clipped Surrogate Objective

Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$

This measures how much the new policy has changed relative to the old one at a particular $(s_t, a_t)$ pair. When $\theta = \theta_{\text{old}}$, $r_t(\theta) = 1$.

The unclipped surrogate objective is:

$$L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[r_t(\theta) \hat{A}_t\right]$$

where $\hat{A}t$ is an advantage estimate at time $t$. This is the standard policy gradient objective written with importance sampling from $\theta{\text{old}}$. Without constraints, maximizing this can lead to excessively large policy updates.

⚠️ CRITICAL: The CPI objective uses $\hat{A}t$ computed from trajectories collected under $\theta{\text{old}}$. These advantage estimates are only valid for policies close to $\theta_{\text{old}}$ — importance sampling breaks down when the policy changes too much.

PPO's clipped objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t\right)\right]$$

where $\varepsilon$ is a hyperparameter (typically $\varepsilon = 0.1$ or $0.2$).

How clipping works:

Visual interpretation of $L^{\text{CLIP}}$:

For $\hat{A} > 0$: $$ L = \begin{cases} r\hat{A} & \text{if } r < 1+\varepsilon \ (1+\varepsilon)\hat{A} & \text{if } r \geq 1+\varepsilon \end{cases} $$

For $\hat{A} < 0$: $$ L = \begin{cases} r\hat{A} & \text{if } r > 1-\varepsilon \ (1-\varepsilon)\hat{A} & \text{if } r \leq 1-\varepsilon \end{cases} $$

In both cases, the objective is "clipped" — there's no incentive to move $r_t$ beyond the $[1-\varepsilon, 1+\varepsilon]$ range. This prevents the policy from changing too much in a single update, mimicking TRPO's KL constraint.

Why Clipping = Trust Region

TRPO constrains $\mathbb{E}[D_{\text{KL}}(\pi_{\text{old}} | \pi_{\text{new}})] \leq \delta$. PPO clips $r_t = \pi_{\text{new}}/\pi_{\text{old}}$ to $[1-\varepsilon, 1+\varepsilon]$. These are related:

For small changes, $D_{\text{KL}}(\pi_{\text{old}} | \pi) \approx \frac{1}{2}\sum_a \pi_{\text{old}}(a|s) (r-1)^2$.

So keeping $|r_t - 1| \leq \varepsilon$ roughly bounds the per-sample KL divergence by $\varepsilon^2/2$. PPO's clipping is a KL penalty in disguise — simpler to implement but equally effective.

Generalized Advantage Estimation (GAE)

PPO needs good advantage estimates. GAE (Schulman et al., 2016) provides a low-variance, low-bias estimate by exponentially blending $n$-step advantage estimates:

$$\hat{A}t^{\text{GAE}(\gamma,\lambda)} = \sum{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

where $\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.

The parameter $\lambda \in [0,1]$ controls the bias-variance tradeoff:

GAE computation (recursive form):

$$\hat{A}t^{\text{GAE}} = \delta_t + \gamma\lambda \hat{A}{t+1}^{\text{GAE}}$$

This can be computed efficiently in a single backward pass over the collected trajectory, making it practical for on-policy training.

Derivation: Each $\delta_{t+l}$ is an unbiased estimate of $A^\pi(s_{t+l}, a_{t+l})$. GAE forms a weighted sum with geometrically decaying weights:

$$\hat{A}t^{\text{GAE}} = \sum{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l} = \sum_{l=0}^{\infty} (\gamma\lambda)^l (R_{t+l+1} + \gamma V(s_{t+l+1}) - V(s_{t+l}))$$

This telescopes to:

$$\hat{A}t^{\text{GAE}} = -V(s_t) + R{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$ $$+ (\gamma\lambda)(-V(s_{t+1}) + R_{t+2} + \gamma R_{t+3} + \dots)$$ $$\dots$$

The effective horizon is controlled by $\lambda$: $(\gamma\lambda)^l$ decays, giving more weight to short-horizon (lower variance) estimates.

Full PPO Algorithm

  1. Initialize policy $\pi_\theta$ and value function $V_\phi$
  2. For $k = 1, 2, \dots$ iterations:
  3. Collect $N$ timesteps of data under $\pi_{\theta_{\text{old}}}$ (set $\theta_{\text{old}} \leftarrow \theta$)
  4. Compute advantages $\hat{A}_1, \dots, \hat{A}_T$ using GAE
  5. Compute value targets $\hat{V}t = \hat{A}_t + V{\phi_{\text{old}}}(s_t)$
  6. For $K$ epochs (typically 4–10):
    • Shuffle data into mini-batches
    • For each mini-batch:
    • Compute $r_t(\theta)$ and $L^{\text{CLIP}}(\theta)$
    • Compute value loss $L^{\text{VF}}(\phi) = \frac{1}{2}(V_\phi(s_t) - \hat{V}_t)^2$
    • Total loss: $L(\theta, \phi) = L^{\text{CLIP}} - c_1 L^{\text{VF}} + c_2 S\pi_\theta$
    • Update $\theta, \phi$ via gradient descent
  7. Repeat

The entropy bonus $S\pi_\theta = -\sum_a \pi_\theta(a|s_t)\log \pi_\theta(a|s_t)$ encourages exploration by penalizing deterministic policies.

PPO in RLHF (Connection to Phase 20)

In RLHF (Reinforcement Learning from Human Feedback, Phase 20), PPO is used to fine-tune language model policies. The setup:

The RLHF objective (before PPO clipping):

$$J(\theta) = \mathbb{E}{x \sim \mathcal{D}, y \sim \pi\theta(\cdot|x)}\left[R_\phi(x, y) - \beta \cdot D_{\text{KL}}(\pi_\theta(\cdot|x) | \pi_{\text{ref}}(\cdot|x))\right]$$

PPO applies clipping to the token-level policy updates, with the advantage computed from the reward model scores. The KL penalty acts as an additional trust region — keeping the model close to the reference to prevent reward hacking.



Key Terms

Worked Examples

Example 1: PPO Clipping Mechanics

At a particular $(s_t, a_t)$: $\pi_{\theta_{\text{old}}}(a_t|s_t) = 0.2$. After one gradient step, $\pi_\theta(a_t|s_t) = 0.5$. The advantage $\hat{A}_t = +2.0$ and $\varepsilon = 0.2$. Compute $L^{\text{CPI}}$ and $L^{\text{CLIP}}$.

Solution:

$r_t(\theta) = 0.5 / 0.2 = 2.5$

CPI objective: $L^{\text{CPI}} = r_t \cdot \hat{A}_t = 2.5 \cdot 2.0 = 5.0$

CLIP objective: Since $\hat{A}_t > 0$: $\text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) = \text{clip}(2.5, 0.8, 1.2) = 1.2$ $L^{\text{CLIP}} = \min(5.0, 1.2 \cdot 2.0) = \min(5.0, 2.4) = 2.4$

The clipped objective is much lower — PPO "caps" the benefit of making the action too much more probable.

Click for answer $L^{\text{CPI}} = 5.0$, $L^{\text{CLIP}} = 2.4$. The clipped objective is less than half the unclipped one, preventing the gradient from pushing $r_t$ even higher. If $\pi_\theta$ were 0.22 ($r_t = 1.1$), $L^{\text{CLIP}} = 2.2$, and PPO would continue increasing the probability.

Example 2: GAE Computation

A trajectory has rewards $[1, 2, -1, 3]$ and value predictions $V = [0, 1.5, 2.0, 1.0, 0]$ (terminal state has $V=0$). $\gamma = 0.9$, $\lambda = 0.8$. Compute $\delta_t$ and $\hat{A}_t^{\text{GAE}}$ for $t = 0, 1, 2$.

Solution:

TD errors: $\delta_0 = 1 + 0.9(1.5) - 0 = 1 + 1.35 = 2.35$ $\delta_1 = 2 + 0.9(2.0) - 1.5 = 2 + 1.8 - 1.5 = 2.3$ $\delta_2 = -1 + 0.9(1.0) - 2.0 = -1 + 0.9 - 2.0 = -2.1$ $\delta_3 = 3 + 0.9(0) - 1.0 = 2.0$

GAE (backward recursion, $\hat{A}_3 = \delta_3 = 2.0$): $\hat{A}_2 = \delta_2 + \gamma\lambda \hat{A}_3 = -2.1 + 0.9(0.8)(2.0) = -2.1 + 1.44 = -0.66$ $\hat{A}_1 = \delta_1 + \gamma\lambda \hat{A}_2 = 2.3 + 0.72(-0.66) = 2.3 - 0.475 = 1.825$ $\hat{A}_0 = \delta_0 + \gamma\lambda \hat{A}_1 = 2.35 + 0.72(1.825) = 2.35 + 1.314 = 3.664$

Click for answer $\hat{A} = [3.664, 1.825, -0.66, 2.0]$. Step 0 has the highest advantage (the trajectory starts well). Step 2 has negative advantage (the reward was worse than predicted by $V$).

Example 3: Entropy Bonus Effect

A policy over 4 actions has probabilities $[0.7, 0.1, 0.1, 0.1]$. Compute the entropy $S[\pi]$ and the gradient of the entropy bonus with $c_2 = 0.01$.

Solution:

$S[\pi] = -\sum_a \pi(a) \log \pi(a) = -(0.7 \log 0.7 + 3 \times 0.1 \log 0.1)$ $= -(0.7(-0.357) + 3(-0.23)) = -(-0.25 - 0.69) = 0.94$

Entropy bonus loss term: $-c_2 S[\pi] = -0.01 \times 0.94 = -0.0094$ (negative because it's subtracted in the total loss, or equivalently, maximizing entropy means adding $+c_2 S$ to the objective).

Gradient: $\nabla_\theta S[\pi] = -\sum_a (\log \pi(a) + 1) \nabla_\theta \pi(a)$ For each action: $-\nabla_\theta \pi(a) \cdot (\log \pi(a) + 1)$

The entropy gradient pushes probabilities toward uniformity.

Click for answer $S[\pi] = 0.94$. The entropy bonus encourages exploration by penalizing deterministic policies. Without it, PPO can collapse to a near-deterministic policy too early.

Practice Problems

Problem 1: Prove that for $\hat{A}_t > 0$, the gradient of $L^{\text{CLIP}}$ is zero when $r_t(\theta) > 1 + \varepsilon$. Why is this desirable?

Answer (click to expand) When $\hat{A}_t > 0$ and $r_t > 1+\varepsilon$: $L^{\text{CLIP}} = \min(r_t \hat{A}_t, (1+\varepsilon)\hat{A}_t) = (1+\varepsilon)\hat{A}_t$ (constant w.r.t $\theta$) $\nabla_\theta L^{\text{CLIP}} = 0$ This is desirable because it stops the policy from changing further in this direction — the policy is already much more likely to take this action than before. Further increases in $\pi(a|s)$ don't improve the objective, so gradient descent stops. This is the trust region mechanism.

Problem 2: Derive the recursive GAE formula $\hat{A}t^{\text{GAE}} = \delta_t + \gamma\lambda \hat{A}{t+1}^{\text{GAE}}$ from the definition $\hat{A}t^{\text{GAE}} = \sum{l=0}^\infty (\gamma\lambda)^l \delta_{t+l}$.

Answer (click to expand) $$\hat{A}_t^{\text{GAE}} = \sum_{l=0}^\infty (\gamma\lambda)^l \delta_{t+l} = \delta_t + \sum_{l=1}^\infty (\gamma\lambda)^l \delta_{t+l}$$ $$= \delta_t + \gamma\lambda \sum_{l=0}^\infty (\gamma\lambda)^l \delta_{t+1+l} = \delta_t + \gamma\lambda \hat{A}_{t+1}^{\text{GAE}}$$ This recursive form enables $O(T)$ computation: compute TD errors forward, then GAE backward in one pass.

Problem 3: Compare PPO's clipping to TRPO's KL constraint. If $\varepsilon = 0.2$, approximately what KL divergence does PPO's clipping tolerate? (Use the second-order Taylor expansion $D_{\text{KL}}(\pi_{\text{old}} | \pi) \approx \frac{1}{2}(r-1)^2$ for small changes.)

Answer (click to expand) For an individual $(s,a)$ pair with $r = \pi/\pi_{\text{old}}$: $D_{\text{KL}} \approx \frac{1}{2}(r-1)^2$ Maximum tolerated: $|r-1| = \varepsilon = 0.2$ $D_{\text{KL}}^{\max} \approx \frac{1}{2}(0.2)^2 = 0.02$ In expectation over the data distribution, PPO tolerates roughly $D_{\text{KL}} \approx 0.02$ per update. TRPO typically uses $\delta = 0.01$, so PPO with $\varepsilon = 0.2$ is about twice as permissive as typical TRPO settings. However, PPO uses multiple epochs of updates on the same data, so the effective KL per outer iteration may be comparable.

Problem 4: PPO trains for $K = 4$ epochs on the same batch of data. Why doesn't this cause overfitting like in supervised learning? What would happen with $K = 100$?

Answer (click to expand) PPO avoids overfitting through clipping: once the policy ratio $r_t$ hits the clipping boundary ($1 \pm \varepsilon$), the gradient for that sample becomes zero and the policy stops changing in that direction. Additional epochs have no effect on clipped samples. With $K=100$, the policy would saturate the clipping boundary on nearly all samples after the first few epochs, and subsequent epochs would do nothing — wasted computation. But the policy would also overfit to the advantage estimates (which contain noise), potentially moving it into a bad region of parameter space before hitting the clipper. Typical $K = 3$–$10$ works well.

Problem 5: In RLHF with PPO, the reward model gives scores $R_\phi(x, y)$ for generated responses. If the reference policy has probability 0.01 for a token and the updated policy has 0.03 ($r_t = 3.0$), and $\varepsilon = 0.2$, what happens to the gradient for this token?

Answer (click to expand) $r_t = 3.0 > 1 + \varepsilon = 1.2$. The ratio is far outside the clipping range. If the token has positive advantage (the reward model liked the response): $L^{\text{CLIP}}$ clips at $(1+\varepsilon)\hat{A}$, gradient is zero — PPO discards this update. The policy has already increased this token's probability too much relative to the reference. If negative advantage: $L^{\text{CLIP}} = r_t \hat{A}_t$ (unclipped, since $\hat{A}_t < 0$ and $r_t > 1$, the min picks the smaller value). But this means the bad token's probability increased — the unclipped loss penalizes this heavily. Gradient pushes $\theta$ to reduce $\pi_\theta$. In RLHF, the additional KL penalty $\beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ also pushes back toward the reference. The combination of PPO clipping and KL penalty provides a double safety net against reward hacking.

Summary


Quiz

Question 1: What does PPO's clipped objective prevent?

A. The value function from overfitting B. The policy from changing too much in a single update C. The reward function from diverging D. The learning rate from decaying

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. Clipping $r_t$ to $[1-\varepsilon, 1+\varepsilon]$ ensures the new policy doesn't deviate too far from the old one. - If you chose A: Value function regularization is handled by the value loss term, not clipping. - If you chose C: PPO doesn't modify the reward function. - If you chose D: The learning rate is a hyperparameter, not directly controlled by clipping.

Question 2: In $L^{\text{CLIP}} = \mathbb{E}[\min(r_t \hat{A}_t, \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon)\hat{A}_t)]$, what does the $min$ operation do when $\hat{A}_t > 0$?

A. It minimizes the loss B. It takes the smaller of $r_t\hat{A}_t$ and $(1+\varepsilon)\hat{A}_t$, preventing gains from overly large $r_t$ C. It ignores the advantage entirely D. It clips the advantage instead of the ratio

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. When $\hat{A}_t > 0$, increasing $r_t$ (making the action more probable) is good, but only up to $1+\varepsilon$. Beyond that, the clipped value $(1+\varepsilon)\hat{A}_t$ is smaller and the `min` selects it, giving zero gradient. - If you chose A: The `min` selects between two specific expressions, not a generic minimization. - If you chose C: The advantage is always used. - If you chose D: The clipping is on $r_t$, not $\hat{A}_t$.

Question 3: GAE's parameter $\lambda$ controls:

A. The learning rate B. The bias-variance tradeoff in advantage estimation C. The discount factor for rewards D. The clipping threshold

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. $\lambda = 0$ gives TD(0) (high bias, low variance); $\lambda = 1$ gives Monte Carlo (low bias, high variance). Values in between blend $n$-step estimates. - If you chose A: The learning rate is a separate hyperparameter. - If you chose C: $\gamma$ controls reward discounting, not $\lambda$. - If you chose D: The clipping threshold is $\varepsilon$.

Question 4: PPO trains for multiple epochs ($K > 1$) on the same batch of data. Why is this safe?

A. The data doesn't change B. Clipping zeroes out gradients for samples that hit the clipping boundary, preventing over-optimization C. Multiple epochs always improve performance D. The value function prevents it

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. Once $r_t$ hits the $1 \pm \varepsilon$ boundary, the loss becomes constant w.r.t. $\theta$ for that sample and further epochs don't move the policy further in that direction. - If you chose A: Data not changing makes overfitting MORE likely in supervised learning, not less. - If you chose C: Without clipping, multiple epochs would overfit to advantage noise. - If you chose D: The value function helps with advantage estimation but doesn't prevent policy overfitting.

Question 5: In RLHF, what additional regularization does PPO receive beyond the standard clipping?

A. L2 weight decay B. A KL-divergence penalty between the updated policy and the reference model C. Dropout D. Gradient clipping

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. The RLHF objective includes $-\beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$, which prevents the fine-tuned model from drifting too far from the supervised-fine-tuned reference. This is critical to prevent reward hacking. - If you chose A/C/D: These are common regularization techniques but not the key RLHF-specific addition.

Pitfalls

  1. Confusing $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$: The advantage $\hat{A}t$ must be computed from trajectories generated under $\pi{\theta_{\text{old}}}$, not $\pi_\theta$. Using the current policy to re-weight advantages from old trajectories introduces bias.
  2. $\lambda$ mixing with $\gamma$: In GAE, the effective horizon is $\gamma\lambda$, not just $\lambda$. Setting $\lambda = 0.95$ with $\gamma = 0.99$ gives effective $\gamma\lambda = 0.9405$ — still quite long-horizon.
  3. Multiple epochs with no clipping: Running many epochs without clipping on the same data causes catastrophic overfitting. Always use clipping or early stopping based on approximate KL.
  4. Forgetting the entropy bonus: Without entropy regularization, PPO can converge to deterministic policies prematurely, missing better stochastic strategies.
  5. Advantage normalization: Always normalize advantages to zero mean and unit variance within each batch — this stabilizes training dramatically.

Next Steps

Next up: 23-10 — Advanced RL — where we explore TRPO's KL constraint approach, maximum entropy RL (Soft Actor-Critic), model-based methods, and multi-agent RL.




Q8: PPO's total loss is typically $$L = L^{\text{CLIP}} - c_1 L^{\text{VF}} + c_2 S[\pi]$$. The entropy bonus $$S[\pi]$$ is added (with positive coefficient) to:

A) Penalize stochastic policies B) Encourage exploration by rewarding policies with higher entropy C) Reduce memory usage D) Stabilize the value function

Answer and Explanations **Correct: B)** The entropy $$S[\pi] = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$$ measures how spread-out the action distribution is. Adding $$+c_2 S[\pi]$$ to the maximization objective encourages the policy to remain stochastic, preventing premature convergence to a deterministic (and potentially suboptimal) policy. - A) The entropy bonus *rewards* stochasticity, not penalizes it. - C) Entropy has no direct connection to memory usage. - D) The value function is stabilized by the value loss term $$L^{\text{VF}}$$, not the entropy bonus.