23-09 — Proximal Policy Optimization (PPO)
Phase: 23 — Reinforcement Learning Mathematics Subject: 23-09 Prerequisites: 23-08 — Policy Gradient Methods Next subject: 23-10 — Advanced RL
Learning Objectives
By the end of this subject, you will be able to:
- Derive the clipped surrogate objective $L^{\text{CLIP}}$ and explain why clipping prevents destructive policy updates
- Compute the probability ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ and use it to measure policy change
- Implement GAE (Generalized Advantage Estimation) as an exponentially weighted sum of $n$-step advantage estimates
- Analyze the connection between TRPO's KL constraint and PPO's clipping mechanism
- Connect PPO's objective to the RLHF reward model training from Phase 20
Core Content
The Problem PPO Solves
Standard policy gradient (23-08) suffers from step-size sensitivity: too small a step → slow learning; too large a step → the policy collapses and never recovers because the new policy collects data in a different region of state space.
Trust Region Policy Optimization (TRPO, Schulman et al., 2015) solved this with a KL-divergence constraint on policy updates:
$$\max_\theta \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \hat{A}(s,a)\right] \quad \text{s.t.} \quad \mathbb{E}[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s) | \pi_\theta(\cdot|s))] \leq \delta$$
TRPO works well but requires computing the Hessian of the KL divergence and solving a constrained optimization problem with conjugate gradient — computationally expensive and hard to implement.
PPO (Schulman et al., 2017) achieves the same effect with a simple clipped objective — no KL constraints, no conjugate gradient, just gradient ascent with clipping.
The Clipped Surrogate Objective
Define the probability ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$
This measures how much the new policy has changed relative to the old one at a particular $(s_t, a_t)$ pair. When $\theta = \theta_{\text{old}}$, $r_t(\theta) = 1$.
The unclipped surrogate objective is:
$$L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[r_t(\theta) \hat{A}_t\right]$$
where $\hat{A}t$ is an advantage estimate at time $t$. This is the standard policy gradient objective written with importance sampling from $\theta{\text{old}}$. Without constraints, maximizing this can lead to excessively large policy updates.
⚠️ CRITICAL: The CPI objective uses $\hat{A}t$ computed from trajectories collected under $\theta{\text{old}}$. These advantage estimates are only valid for policies close to $\theta_{\text{old}}$ — importance sampling breaks down when the policy changes too much.
PPO's clipped objective:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t\right)\right]$$
where $\varepsilon$ is a hyperparameter (typically $\varepsilon = 0.1$ or $0.2$).
How clipping works:
-
If $\hat{A}_t > 0$ (action was good): We want to increase $r_t(\theta)$ (increase action probability). But we clip at $1+\varepsilon$, preventing the ratio from exceeding $1+\varepsilon$. The $min$ ensures we don't gain from pushing $r_t(\theta)$ beyond $1+\varepsilon$.
-
If $\hat{A}_t < 0$ (action was bad): We want to decrease $r_t(\theta)$ (decrease action probability). But we clip at $1-\varepsilon$, preventing the ratio from falling below $1-\varepsilon$. Again, the $min$ picks the clipped value if $r_t$ goes too low.
Visual interpretation of $L^{\text{CLIP}}$:
For $\hat{A} > 0$: $$ L = \begin{cases} r\hat{A} & \text{if } r < 1+\varepsilon \ (1+\varepsilon)\hat{A} & \text{if } r \geq 1+\varepsilon \end{cases} $$
For $\hat{A} < 0$: $$ L = \begin{cases} r\hat{A} & \text{if } r > 1-\varepsilon \ (1-\varepsilon)\hat{A} & \text{if } r \leq 1-\varepsilon \end{cases} $$
In both cases, the objective is "clipped" — there's no incentive to move $r_t$ beyond the $[1-\varepsilon, 1+\varepsilon]$ range. This prevents the policy from changing too much in a single update, mimicking TRPO's KL constraint.
Why Clipping = Trust Region
TRPO constrains $\mathbb{E}[D_{\text{KL}}(\pi_{\text{old}} | \pi_{\text{new}})] \leq \delta$. PPO clips $r_t = \pi_{\text{new}}/\pi_{\text{old}}$ to $[1-\varepsilon, 1+\varepsilon]$. These are related:
For small changes, $D_{\text{KL}}(\pi_{\text{old}} | \pi) \approx \frac{1}{2}\sum_a \pi_{\text{old}}(a|s) (r-1)^2$.
So keeping $|r_t - 1| \leq \varepsilon$ roughly bounds the per-sample KL divergence by $\varepsilon^2/2$. PPO's clipping is a KL penalty in disguise — simpler to implement but equally effective.
Generalized Advantage Estimation (GAE)
PPO needs good advantage estimates. GAE (Schulman et al., 2016) provides a low-variance, low-bias estimate by exponentially blending $n$-step advantage estimates:
$$\hat{A}t^{\text{GAE}(\gamma,\lambda)} = \sum{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$
where $\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.
The parameter $\lambda \in [0,1]$ controls the bias-variance tradeoff:
- $\lambda = 0$: $\hat{A}_t = \delta_t$ (TD(0), high bias, low variance)
- $\lambda = 1$: $\hat{A}t = \sum{l=0}^{\infty} \gamma^l \delta_{t+l} = G_t - V(s_t)$ (Monte Carlo, low bias, high variance)
GAE computation (recursive form):
$$\hat{A}t^{\text{GAE}} = \delta_t + \gamma\lambda \hat{A}{t+1}^{\text{GAE}}$$
This can be computed efficiently in a single backward pass over the collected trajectory, making it practical for on-policy training.
Derivation: Each $\delta_{t+l}$ is an unbiased estimate of $A^\pi(s_{t+l}, a_{t+l})$. GAE forms a weighted sum with geometrically decaying weights:
$$\hat{A}t^{\text{GAE}} = \sum{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l} = \sum_{l=0}^{\infty} (\gamma\lambda)^l (R_{t+l+1} + \gamma V(s_{t+l+1}) - V(s_{t+l}))$$
This telescopes to:
$$\hat{A}t^{\text{GAE}} = -V(s_t) + R{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$ $$+ (\gamma\lambda)(-V(s_{t+1}) + R_{t+2} + \gamma R_{t+3} + \dots)$$ $$\dots$$
The effective horizon is controlled by $\lambda$: $(\gamma\lambda)^l$ decays, giving more weight to short-horizon (lower variance) estimates.
Full PPO Algorithm
- Initialize policy $\pi_\theta$ and value function $V_\phi$
- For $k = 1, 2, \dots$ iterations:
- Collect $N$ timesteps of data under $\pi_{\theta_{\text{old}}}$ (set $\theta_{\text{old}} \leftarrow \theta$)
- Compute advantages $\hat{A}_1, \dots, \hat{A}_T$ using GAE
- Compute value targets $\hat{V}t = \hat{A}_t + V{\phi_{\text{old}}}(s_t)$
- For $K$ epochs (typically 4–10):
- Shuffle data into mini-batches
- For each mini-batch:
- Compute $r_t(\theta)$ and $L^{\text{CLIP}}(\theta)$
- Compute value loss $L^{\text{VF}}(\phi) = \frac{1}{2}(V_\phi(s_t) - \hat{V}_t)^2$
- Total loss: $L(\theta, \phi) = L^{\text{CLIP}} - c_1 L^{\text{VF}} + c_2 S\pi_\theta$
- Update $\theta, \phi$ via gradient descent
- Repeat
The entropy bonus $S\pi_\theta = -\sum_a \pi_\theta(a|s_t)\log \pi_\theta(a|s_t)$ encourages exploration by penalizing deterministic policies.
PPO in RLHF (Connection to Phase 20)
In RLHF (Reinforcement Learning from Human Feedback, Phase 20), PPO is used to fine-tune language model policies. The setup:
- Policy $\pi_\theta$: the language model being fine-tuned
- Reference policy $\pi_{\text{ref}}$: the initial supervised-fine-tuned model (used as $\pi_{\theta_{\text{old}}}$)
- Reward model $R_\phi(x, y)$: trained on human preference data
- KL penalty: Prevents the model from deviating too far from $\pi_{\text{ref}}$
The RLHF objective (before PPO clipping):
$$J(\theta) = \mathbb{E}{x \sim \mathcal{D}, y \sim \pi\theta(\cdot|x)}\left[R_\phi(x, y) - \beta \cdot D_{\text{KL}}(\pi_\theta(\cdot|x) | \pi_{\text{ref}}(\cdot|x))\right]$$
PPO applies clipping to the token-level policy updates, with the advantage computed from the reward model scores. The KL penalty acts as an additional trust region — keeping the model close to the reference to prevent reward hacking.
Key Terms
- KL penalty in disguise
- KL-divergence constraint
- Policy
- Reference policy
- Reward model
Worked Examples
Example 1: PPO Clipping Mechanics
At a particular $(s_t, a_t)$: $\pi_{\theta_{\text{old}}}(a_t|s_t) = 0.2$. After one gradient step, $\pi_\theta(a_t|s_t) = 0.5$. The advantage $\hat{A}_t = +2.0$ and $\varepsilon = 0.2$. Compute $L^{\text{CPI}}$ and $L^{\text{CLIP}}$.
Solution:
$r_t(\theta) = 0.5 / 0.2 = 2.5$
CPI objective: $L^{\text{CPI}} = r_t \cdot \hat{A}_t = 2.5 \cdot 2.0 = 5.0$
CLIP objective: Since $\hat{A}_t > 0$: $\text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) = \text{clip}(2.5, 0.8, 1.2) = 1.2$ $L^{\text{CLIP}} = \min(5.0, 1.2 \cdot 2.0) = \min(5.0, 2.4) = 2.4$
The clipped objective is much lower — PPO "caps" the benefit of making the action too much more probable.
Click for answer
$L^{\text{CPI}} = 5.0$, $L^{\text{CLIP}} = 2.4$. The clipped objective is less than half the unclipped one, preventing the gradient from pushing $r_t$ even higher. If $\pi_\theta$ were 0.22 ($r_t = 1.1$), $L^{\text{CLIP}} = 2.2$, and PPO would continue increasing the probability.Example 2: GAE Computation
A trajectory has rewards $[1, 2, -1, 3]$ and value predictions $V = [0, 1.5, 2.0, 1.0, 0]$ (terminal state has $V=0$). $\gamma = 0.9$, $\lambda = 0.8$. Compute $\delta_t$ and $\hat{A}_t^{\text{GAE}}$ for $t = 0, 1, 2$.
Solution:
TD errors: $\delta_0 = 1 + 0.9(1.5) - 0 = 1 + 1.35 = 2.35$ $\delta_1 = 2 + 0.9(2.0) - 1.5 = 2 + 1.8 - 1.5 = 2.3$ $\delta_2 = -1 + 0.9(1.0) - 2.0 = -1 + 0.9 - 2.0 = -2.1$ $\delta_3 = 3 + 0.9(0) - 1.0 = 2.0$
GAE (backward recursion, $\hat{A}_3 = \delta_3 = 2.0$): $\hat{A}_2 = \delta_2 + \gamma\lambda \hat{A}_3 = -2.1 + 0.9(0.8)(2.0) = -2.1 + 1.44 = -0.66$ $\hat{A}_1 = \delta_1 + \gamma\lambda \hat{A}_2 = 2.3 + 0.72(-0.66) = 2.3 - 0.475 = 1.825$ $\hat{A}_0 = \delta_0 + \gamma\lambda \hat{A}_1 = 2.35 + 0.72(1.825) = 2.35 + 1.314 = 3.664$
Click for answer
$\hat{A} = [3.664, 1.825, -0.66, 2.0]$. Step 0 has the highest advantage (the trajectory starts well). Step 2 has negative advantage (the reward was worse than predicted by $V$).Example 3: Entropy Bonus Effect
A policy over 4 actions has probabilities $[0.7, 0.1, 0.1, 0.1]$. Compute the entropy $S[\pi]$ and the gradient of the entropy bonus with $c_2 = 0.01$.
Solution:
$S[\pi] = -\sum_a \pi(a) \log \pi(a) = -(0.7 \log 0.7 + 3 \times 0.1 \log 0.1)$ $= -(0.7(-0.357) + 3(-0.23)) = -(-0.25 - 0.69) = 0.94$
Entropy bonus loss term: $-c_2 S[\pi] = -0.01 \times 0.94 = -0.0094$ (negative because it's subtracted in the total loss, or equivalently, maximizing entropy means adding $+c_2 S$ to the objective).
Gradient: $\nabla_\theta S[\pi] = -\sum_a (\log \pi(a) + 1) \nabla_\theta \pi(a)$ For each action: $-\nabla_\theta \pi(a) \cdot (\log \pi(a) + 1)$
The entropy gradient pushes probabilities toward uniformity.
Click for answer
$S[\pi] = 0.94$. The entropy bonus encourages exploration by penalizing deterministic policies. Without it, PPO can collapse to a near-deterministic policy too early.Practice Problems
Problem 1: Prove that for $\hat{A}_t > 0$, the gradient of $L^{\text{CLIP}}$ is zero when $r_t(\theta) > 1 + \varepsilon$. Why is this desirable?
Answer (click to expand)
When $\hat{A}_t > 0$ and $r_t > 1+\varepsilon$: $L^{\text{CLIP}} = \min(r_t \hat{A}_t, (1+\varepsilon)\hat{A}_t) = (1+\varepsilon)\hat{A}_t$ (constant w.r.t $\theta$) $\nabla_\theta L^{\text{CLIP}} = 0$ This is desirable because it stops the policy from changing further in this direction — the policy is already much more likely to take this action than before. Further increases in $\pi(a|s)$ don't improve the objective, so gradient descent stops. This is the trust region mechanism.Problem 2: Derive the recursive GAE formula $\hat{A}t^{\text{GAE}} = \delta_t + \gamma\lambda \hat{A}{t+1}^{\text{GAE}}$ from the definition $\hat{A}t^{\text{GAE}} = \sum{l=0}^\infty (\gamma\lambda)^l \delta_{t+l}$.
Answer (click to expand)
$$\hat{A}_t^{\text{GAE}} = \sum_{l=0}^\infty (\gamma\lambda)^l \delta_{t+l} = \delta_t + \sum_{l=1}^\infty (\gamma\lambda)^l \delta_{t+l}$$ $$= \delta_t + \gamma\lambda \sum_{l=0}^\infty (\gamma\lambda)^l \delta_{t+1+l} = \delta_t + \gamma\lambda \hat{A}_{t+1}^{\text{GAE}}$$ This recursive form enables $O(T)$ computation: compute TD errors forward, then GAE backward in one pass.Problem 3: Compare PPO's clipping to TRPO's KL constraint. If $\varepsilon = 0.2$, approximately what KL divergence does PPO's clipping tolerate? (Use the second-order Taylor expansion $D_{\text{KL}}(\pi_{\text{old}} | \pi) \approx \frac{1}{2}(r-1)^2$ for small changes.)
Answer (click to expand)
For an individual $(s,a)$ pair with $r = \pi/\pi_{\text{old}}$: $D_{\text{KL}} \approx \frac{1}{2}(r-1)^2$ Maximum tolerated: $|r-1| = \varepsilon = 0.2$ $D_{\text{KL}}^{\max} \approx \frac{1}{2}(0.2)^2 = 0.02$ In expectation over the data distribution, PPO tolerates roughly $D_{\text{KL}} \approx 0.02$ per update. TRPO typically uses $\delta = 0.01$, so PPO with $\varepsilon = 0.2$ is about twice as permissive as typical TRPO settings. However, PPO uses multiple epochs of updates on the same data, so the effective KL per outer iteration may be comparable.Problem 4: PPO trains for $K = 4$ epochs on the same batch of data. Why doesn't this cause overfitting like in supervised learning? What would happen with $K = 100$?
Answer (click to expand)
PPO avoids overfitting through clipping: once the policy ratio $r_t$ hits the clipping boundary ($1 \pm \varepsilon$), the gradient for that sample becomes zero and the policy stops changing in that direction. Additional epochs have no effect on clipped samples. With $K=100$, the policy would saturate the clipping boundary on nearly all samples after the first few epochs, and subsequent epochs would do nothing — wasted computation. But the policy would also overfit to the advantage estimates (which contain noise), potentially moving it into a bad region of parameter space before hitting the clipper. Typical $K = 3$–$10$ works well.Problem 5: In RLHF with PPO, the reward model gives scores $R_\phi(x, y)$ for generated responses. If the reference policy has probability 0.01 for a token and the updated policy has 0.03 ($r_t = 3.0$), and $\varepsilon = 0.2$, what happens to the gradient for this token?
Answer (click to expand)
$r_t = 3.0 > 1 + \varepsilon = 1.2$. The ratio is far outside the clipping range. If the token has positive advantage (the reward model liked the response): $L^{\text{CLIP}}$ clips at $(1+\varepsilon)\hat{A}$, gradient is zero — PPO discards this update. The policy has already increased this token's probability too much relative to the reference. If negative advantage: $L^{\text{CLIP}} = r_t \hat{A}_t$ (unclipped, since $\hat{A}_t < 0$ and $r_t > 1$, the min picks the smaller value). But this means the bad token's probability increased — the unclipped loss penalizes this heavily. Gradient pushes $\theta$ to reduce $\pi_\theta$. In RLHF, the additional KL penalty $\beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ also pushes back toward the reference. The combination of PPO clipping and KL penalty provides a double safety net against reward hacking.Summary
- PPO simplifies TRPO by replacing the KL constraint with a clipped surrogate objective $L^{\text{CLIP}}$
- The probability ratio $r_t = \pi_\theta/\pi_{\theta_{\text{old}}}$ measures policy change; clipping to $[1-\varepsilon, 1+\varepsilon]$ prevents destructive updates
- GAE blends $n$-step advantage estimates using $\lambda$ to trade off bias and variance; computed efficiently via backward recursion
- PPO's total loss: $L = L^{\text{CLIP}} - c_1 L^{\text{VF}} + c_2 S[\pi]$ (clipped policy loss + value loss + entropy bonus)
- Multiple epochs ($K=3$–$10$) on the same data are safe because clipping zeroes gradients on saturated samples
- PPO is the workhorse of RLHF: it fine-tunes language models using reward model scores as advantage signals
Quiz
Question 1: What does PPO's clipped objective prevent?
A. The value function from overfitting B. The policy from changing too much in a single update C. The reward function from diverging D. The learning rate from decaying
Correct Answer: B
Explanation (click to expand)
- **If you chose B:** Correct. Clipping $r_t$ to $[1-\varepsilon, 1+\varepsilon]$ ensures the new policy doesn't deviate too far from the old one. - If you chose A: Value function regularization is handled by the value loss term, not clipping. - If you chose C: PPO doesn't modify the reward function. - If you chose D: The learning rate is a hyperparameter, not directly controlled by clipping.Question 2: In $L^{\text{CLIP}} = \mathbb{E}[\min(r_t \hat{A}_t, \text{clip}(r_t, 1-\varepsilon, 1+\varepsilon)\hat{A}_t)]$, what does the $min$ operation do when $\hat{A}_t > 0$?
A. It minimizes the loss B. It takes the smaller of $r_t\hat{A}_t$ and $(1+\varepsilon)\hat{A}_t$, preventing gains from overly large $r_t$ C. It ignores the advantage entirely D. It clips the advantage instead of the ratio
Correct Answer: B
Explanation (click to expand)
- **If you chose B:** Correct. When $\hat{A}_t > 0$, increasing $r_t$ (making the action more probable) is good, but only up to $1+\varepsilon$. Beyond that, the clipped value $(1+\varepsilon)\hat{A}_t$ is smaller and the `min` selects it, giving zero gradient. - If you chose A: The `min` selects between two specific expressions, not a generic minimization. - If you chose C: The advantage is always used. - If you chose D: The clipping is on $r_t$, not $\hat{A}_t$.Question 3: GAE's parameter $\lambda$ controls:
A. The learning rate B. The bias-variance tradeoff in advantage estimation C. The discount factor for rewards D. The clipping threshold
Correct Answer: B
Explanation (click to expand)
- **If you chose B:** Correct. $\lambda = 0$ gives TD(0) (high bias, low variance); $\lambda = 1$ gives Monte Carlo (low bias, high variance). Values in between blend $n$-step estimates. - If you chose A: The learning rate is a separate hyperparameter. - If you chose C: $\gamma$ controls reward discounting, not $\lambda$. - If you chose D: The clipping threshold is $\varepsilon$.Question 4: PPO trains for multiple epochs ($K > 1$) on the same batch of data. Why is this safe?
A. The data doesn't change B. Clipping zeroes out gradients for samples that hit the clipping boundary, preventing over-optimization C. Multiple epochs always improve performance D. The value function prevents it
Correct Answer: B
Explanation (click to expand)
- **If you chose B:** Correct. Once $r_t$ hits the $1 \pm \varepsilon$ boundary, the loss becomes constant w.r.t. $\theta$ for that sample and further epochs don't move the policy further in that direction. - If you chose A: Data not changing makes overfitting MORE likely in supervised learning, not less. - If you chose C: Without clipping, multiple epochs would overfit to advantage noise. - If you chose D: The value function helps with advantage estimation but doesn't prevent policy overfitting.Question 5: In RLHF, what additional regularization does PPO receive beyond the standard clipping?
A. L2 weight decay B. A KL-divergence penalty between the updated policy and the reference model C. Dropout D. Gradient clipping
Correct Answer: B
Explanation (click to expand)
- **If you chose B:** Correct. The RLHF objective includes $-\beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$, which prevents the fine-tuned model from drifting too far from the supervised-fine-tuned reference. This is critical to prevent reward hacking. - If you chose A/C/D: These are common regularization techniques but not the key RLHF-specific addition.Pitfalls
- Confusing $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$: The advantage $\hat{A}t$ must be computed from trajectories generated under $\pi{\theta_{\text{old}}}$, not $\pi_\theta$. Using the current policy to re-weight advantages from old trajectories introduces bias.
- $\lambda$ mixing with $\gamma$: In GAE, the effective horizon is $\gamma\lambda$, not just $\lambda$. Setting $\lambda = 0.95$ with $\gamma = 0.99$ gives effective $\gamma\lambda = 0.9405$ — still quite long-horizon.
- Multiple epochs with no clipping: Running many epochs without clipping on the same data causes catastrophic overfitting. Always use clipping or early stopping based on approximate KL.
- Forgetting the entropy bonus: Without entropy regularization, PPO can converge to deterministic policies prematurely, missing better stochastic strategies.
- Advantage normalization: Always normalize advantages to zero mean and unit variance within each batch — this stabilizes training dramatically.
Next Steps
Next up: 23-10 — Advanced RL — where we explore TRPO's KL constraint approach, maximum entropy RL (Soft Actor-Critic), model-based methods, and multi-agent RL.
Q8: PPO's total loss is typically $$L = L^{\text{CLIP}} - c_1 L^{\text{VF}} + c_2 S[\pi]$$. The entropy bonus $$S[\pi]$$ is added (with positive coefficient) to:
A) Penalize stochastic policies B) Encourage exploration by rewarding policies with higher entropy C) Reduce memory usage D) Stabilize the value function