20-06 — RLHF Mathematics
Phase: 20 — Training & Fine-tuning Mathematics
Subject: 20-06
Prerequisites: 20-05 (Instruction Tuning SFT), 13-04 (KL Divergence), 10-02 (Conditional Probability — Bayes' rule, Bradley-Terry model), 14-02 (Gradient Descent), 23-08 (PPO Algorithm — conceptual), 05-05 (Integration by Parts — for KL integral)
Next subject: 20-07 — DPO (Direct Preference Optimization)
Learning Objectives
By the end of this subject, you will be able to:
- Derive the Bradley-Terry preference model and prove the reward model loss: L_RM = −E[log σ(r(x, y_w) − r(x, y_l))]
- Formulate the KL-constrained RL objective for RLHF: max_π E[r(x,y)] − β·D_KL(π || π_ref) and derive its closed-form optimal policy
- Explain the three-stage RLHF pipeline (SFT → Reward Model → PPO) and the mathematical role each stage plays
- Compute the gradient of the PPO-clip objective used in RLHF and analyze how clipping prevents destructive policy updates
- Derive why the KL penalty coefficient β controls the tradeoff between reward optimization and staying close to the reference policy
Core Content
1. The RLHF Pipeline — Three Stages
RLHF (Reinforcement Learning from Human Feedback) extends instruction tuning by incorporating human preferences:
Stage 1: SFT (Supervised Fine-Tuning)
→ Train a base model π_SFT on (instruction, response) pairs
→ Produces a model that can follow instructions but may not align with human preferences
Stage 2: Reward Model Training
→ Collect human preference data: for instruction x, humans prefer y_w over y_l
→ Train a reward model r_φ(x, y) to predict human preferences
→ The reward model scores how "good" a response is
Stage 3: PPO Fine-Tuning
→ Use PPO to optimize π_θ against the reward model r_φ
→ Add a KL penalty to keep π_θ close to π_SFT (prevent reward hacking)
2. Stage 2: The Bradley-Terry Preference Model
Human preferences are modeled as a pairwise comparison. Given instruction x and two responses y₁, y₂, the probability that a human prefers y₁ over y₂ is:
$P(y₁ ≻ y₂ | x) = exp(r*(x, y₁)) / (exp(r*(x, y₁)) + exp(r*(x, y₂))) $
where r*(x, y) is the "true" (latent) reward of response y given instruction x.
Derivation from Bradley-Terry: The Bradley-Terry model for pairwise comparisons states that each item i has a "strength" parameter λ_i, and:
$P(i beats j) = λ_i / (λ_i + λ_j) $
Set λ_i = exp(r(x, y_i)). Dividing numerator and denominator by exp(r₁) gives exp(r₁)/(exp(r₁)+exp(r₂)) = 1/(1+exp(−(r₁ − r₂))) = σ(r₁ − r₂), where σ is the sigmoid function.
Key simplification:
$P(y₁ ≻ y₂ | x) = σ(r(x, y₁) − r(x, y₂)) $
where σ(z) = 1/(1+e^{−z}) is the sigmoid.
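The identity between the Bradley-Terry ratio and the sigmoid of the reward difference can be checked numerically. The sketch below is only an illustration: the function names and the reward values are made up for the check.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def bradley_terry(r1: float, r2: float) -> float:
    """P(y1 beats y2) with strengths lambda_i = exp(r_i)."""
    lam1, lam2 = math.exp(r1), math.exp(r2)
    return lam1 / (lam1 + lam2)

# Illustrative reward pairs (not taken from any dataset).
for r1, r2 in [(2.0, -1.0), (0.3, 0.3), (-1.5, 4.0)]:
    p_bt = bradley_terry(r1, r2)
    p_sig = sigmoid(r1 - r2)
    print(f"r1={r1:+.1f} r2={r2:+.1f}  BT={p_bt:.6f}  sigmoid={p_sig:.6f}")
```

Both columns agree, and adding the same constant to r1 and r2 leaves the probability unchanged, since only the difference enters.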
3. Reward Model Loss Function
Given a dataset of human preferences D = {(x, y_w, y_l)} where y_w is the preferred (winning) response and y_l is the losing response:
Maximum likelihood estimation:
The reward model r_φ should maximize the probability of the observed preferences:
$L_RM(φ) = −E_{(x, y_w, y_l)~D}[log P(y_w ≻ y_l | x)]
= −E[log σ(r_φ(x, y_w) − r_φ(x, y_l))]
$
Full derivation:
$P(y_w ≻ y_l | x; φ) = σ(r_φ(x, y_w) − r_φ(x, y_l))
= 1 / (1 + exp(−(r_φ(x, y_w) − r_φ(x, y_l))))
log P = −log(1 + exp(−(r_w − r_l)))
L_RM = E[log(1 + exp(−(r_w − r_l)))]
$
This is the logistic loss on the difference r_w − r_l. If r_w ≫ r_l, the loss is near zero. If r_w ≪ r_l, the loss is large.
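A minimal sketch of this loss, assuming the reward scores are already available as plain floats. It uses the form log(1 + exp(−Δr)) evaluated through a numerically stable softplus; the helper names are hypothetical.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def reward_model_loss(pairs):
    """Mean of -log sigma(r_w - r_l) over a list of (r_w, r_l) score pairs."""
    return sum(softplus(-(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

# Illustrative scores: a confident correct pair, a tie, and a wrong ordering.
print(reward_model_loss([(2.0, -1.0)]))   # small loss
print(reward_model_loss([(1.0, 1.0)]))    # log 2 at a tie
print(reward_model_loss([(-1.0, 2.0)]))   # roughly |delta| for a wrong ordering
```

Writing the loss as a softplus avoids overflow when the model is badly wrong and r_w − r_l is a large negative number.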
Gradient of the reward model loss:
$∂L_RM/∂φ = −E[σ(−(r_w − r_l)) · ∇_φ(r_w − r_l)]
= −E[(1 − σ(r_w − r_l)) · ∇_φ(r_w − r_l)]
$
The gradient pushes r_w UP and r_l DOWN, weighted by how "wrong" the current model is. If the model already strongly prefers y_w, σ(r_w−r_l) ≈ 1, and the gradient is near zero.
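The gradient formula can be sanity-checked against finite differences. The sketch below assumes a linear reward model r_φ(f) = φ · f over hypothetical feature vectors, chosen only because its gradient ∇_φ(r_w − r_l) = f_w − f_l is easy to write down; it is not how a real reward model is parameterized.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def reward_gap(phi, f_w, f_l):
    """r_w - r_l for the illustrative linear reward r_phi(f) = phi . f."""
    return sum(p * (a - b) for p, a, b in zip(phi, f_w, f_l))

def pair_loss(phi, f_w, f_l):
    """-log sigma(r_w - r_l) for one preference pair."""
    return math.log1p(math.exp(-reward_gap(phi, f_w, f_l)))

def analytic_grad(phi, f_w, f_l):
    """-(1 - sigma(r_w - r_l)) * (f_w - f_l), the formula from the text."""
    weight = 1.0 - sigmoid(reward_gap(phi, f_w, f_l))
    return [-weight * (a - b) for a, b in zip(f_w, f_l)]

def numeric_grad(phi, f_w, f_l, h=1e-6):
    """Central finite differences of pair_loss with respect to phi."""
    grads = []
    for i in range(len(phi)):
        up, down = list(phi), list(phi)
        up[i] += h
        down[i] -= h
        grads.append((pair_loss(up, f_w, f_l) - pair_loss(down, f_w, f_l)) / (2 * h))
    return grads

# Made-up parameters and features for the check.
phi = [0.5, -1.0, 2.0]
f_w = [1.0, 0.0, 0.5]
f_l = [0.2, 1.0, 0.1]
print(analytic_grad(phi, f_w, f_l))
print(numeric_grad(phi, f_w, f_l))
```

The two printed vectors should match to several decimal places. A gradient-descent step moves φ along f_w − f_l, which raises r_w relative to r_l.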
⚠️ THIS IS CRITICAL — The reward model is the interface between human preferences and optimization. Its accuracy determines the quality of the final RLHF-tuned model. Reward model overfitting or reward hacking can cause the PPO stage to optimize for "gaming" the reward rather than actual quality.
4. Stage 3: KL-Constrained RL Objective
Given a trained reward model r_φ (we'll just write r), the RL objective is:
$max_π E_{x~D, y~π(·|x)}[r(x, y)] − β · D_KL(π(·|x) || π_ref(·|x))
$
where:
- π is the policy (language model) we're optimizing
- π_ref is the reference policy (typically π_SFT from Stage 1)
- β is the KL penalty coefficient
- D_KL(π || π_ref) = E_{y~π}[log(π(y|x)/π_ref(y|x))] measures divergence from the reference
Why the KL penalty? Without it, the policy would exploit any quirks in the reward model — producing nonsensical text that happens to score high on r. The KL penalty ensures the model stays "close" to the well-behaved reference policy, preserving fluency and coherence.
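To see the tradeoff concretely, the objective can be evaluated for a toy prompt with only three candidate responses. This is a sketch under that simplifying assumption; the distributions, rewards, and function names are invented for illustration.

```python
import math

def kl_divergence(pi, pi_ref):
    """D_KL(pi || pi_ref) for two discrete distributions over the same responses."""
    return sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0.0)

def rlhf_objective(pi, pi_ref, rewards, beta):
    """J(pi) = E_{y~pi}[r(y)] - beta * D_KL(pi || pi_ref)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl_divergence(pi, pi_ref)

# Toy setup: the third response has an inflated reward (a reward-model quirk).
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 5.0]
greedy = [0.0, 0.0, 1.0]   # all mass on the highest-reward response

for beta in (0.0, 1.0, 5.0):
    j_greedy = rlhf_objective(greedy, pi_ref, rewards, beta)
    j_ref = rlhf_objective(pi_ref, pi_ref, rewards, beta)
    print(f"beta={beta}: J(greedy)={j_greedy:.3f}  J(pi_ref)={j_ref:.3f}")
```

With β = 0 the greedy policy scores highest. With β = 5 its KL cost outweighs the reward gain and staying at π_ref scores higher.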
5. Closed-Form Optimal Policy
The KL-constrained RL objective has a known analytic solution. Let's derive it.
Problem: For a fixed x, find π* that maximizes:
$J(π) = E_{y~π}[r(x,y)] − β · D_KL(π || π_ref)
$
With the constraint Σ_y π(y|x) = 1 (valid probability distribution).
Lagrangian:
$L = Σ_y π(y|x) r(x,y) − β Σ_y π(y|x) log(π(y|x)/π_ref(y|x)) − λ(Σ_y π(y|x) − 1) $
First-order condition (∂L/∂π(y|x) = 0):
$r(x,y) − β[log(π(y|x)/π_ref(y|x)) + 1] − λ = 0
r(x,y) − β log π(y|x) + β log π_ref(y|x) − β − λ = 0
β log π(y|x) = r(x,y) + β log π_ref(y|x) − β − λ
π(y|x) = π_ref(y|x) · exp(r(x,y)/β) · exp(−1 − λ/β)
$
Let Z(x) = exp(1 + λ/β) be the normalization constant. The constraint Σ_y π(y|x) = 1 fixes its value as Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β), so:
$π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β) $
Interpretation: The optimal policy is the reference policy re-weighted by exp(r/β). Responses with higher reward are up-weighted; responses with lower reward are down-weighted. β controls the temperature of this re-weighting:
- β → 0: π* puts all mass on argmax_y r(x,y) (greedy on reward)
- β → ∞: π* → π_ref (no deviation from reference)
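The closed form and its two limits can be explored on a small discrete example. The sketch below assumes a finite response set with π_ref(y) > 0 for every y; the numbers and the function name are illustrative.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, computed in log space for stability."""
    logits = [math.log(q) + r / beta for q, r in zip(pi_ref, rewards)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # shifting by max leaves pi* unchanged
    z = sum(weights)
    return [w / z for w in weights]

# Made-up reference policy and rewards over three responses.
pi_ref = [0.5, 0.2, 0.3]
rewards = [1.0, 3.0, 2.0]

for beta in (0.05, 0.5, 50.0):
    probs = optimal_policy(pi_ref, rewards, beta)
    print(f"beta={beta}: " + ", ".join(f"{p:.4f}" for p in probs))
```

Small β concentrates nearly all probability on the second response (the highest reward), while large β returns something close to π_ref.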
6. PPO for RLHF
In practice, we can't compute the optimal policy analytically (too many possible responses). Instead, we use PPO (Proximal Policy Optimization) to optimize the objective empirically.
PPO-CLIP Objective:
For each batch of prompts, the policy π_θ generates responses y. The PPO objective, which is maximized, is:
$L_PPO(θ) = E[min(ρ_t · A_t, clip(ρ_t, 1−ε, 1+ε) · A_t)] $
where:
- ρ_t = π_θ(y_t | x, y_{<t}) / π_old(y_t | x, y_{<t}) is the probability ratio
- A_t is the advantage at step t
- ε is the clip range (typically 0.2)
The advantage in RLHF:
$A_t = r(x, y) − β · log(π_θ(y|x)/π_ref(y|x)) $
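The advantage is the reward MINUS the KL penalty. This is a simplified, baseline-free form: full PPO implementations typically also subtract a learned value estimate (for example via GAE, whose λ appears in Section 8) from this KL-penalized reward. Each generated token "consumes" some of the KL budget. The PPO gradient pushes the policy toward tokens that increase reward while not deviating too far from π_ref.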
Token-level vs sequence-level: In RLHF, the reward r(x,y) is typically at the sequence level (a single scalar for the entire response). The advantage is distributed across tokens. A common approach: assign the same advantage to all tokens in the response.
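A minimal sketch of the "same advantage for every token" approach, using the simplified advantage A = r(x,y) − β·log(π_θ(y|x)/π_ref(y|x)) from above. It assumes per-token log-probabilities of the sampled response are available, so that log π(y|x) is their sum; the function names and numbers are hypothetical.

```python
import math

def sequence_advantage(reward, logp_policy, logp_ref, beta):
    """r(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)), with sequence
    log-probabilities obtained by summing per-token log-probabilities."""
    log_ratio = sum(logp_policy) - sum(logp_ref)
    return reward - beta * log_ratio

def broadcast_advantage(reward, logp_policy, logp_ref, beta):
    """Assign the single sequence-level advantage to every token position."""
    adv = sequence_advantage(reward, logp_policy, logp_ref, beta)
    return [adv] * len(logp_policy)

# Made-up per-token probabilities for a 3-token response.
logp_policy = [math.log(p) for p in (0.7, 0.4, 0.5)]
logp_ref = [math.log(p) for p in (0.5, 0.3, 0.4)]
print(broadcast_advantage(reward=2.0, logp_policy=logp_policy,
                          logp_ref=logp_ref, beta=0.1))
```

Every position receives the same number, which is simple but gives no per-token credit assignment.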
Gradient of the PPO-clip objective (when ρ is within clip range):
$∇_θ L_PPO = E[ρ_t · A_t · ∇_θ log π_θ(y_t | x, y_{<t})]
$
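This is a standard policy gradient weighted by the advantage and the ratio ρ_t. When ρ_t leaves the clip range in the direction the advantage favors (ρ_t > 1+ε with A_t > 0, or ρ_t < 1−ε with A_t < 0), the clipped term is selected and the gradient is zero, which prevents overly aggressive updates.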
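The behavior of the clipped objective for a single token can be sketched directly from its definition. The function names are illustrative, and the derivative is taken with respect to ρ_t only (the gradient with respect to θ multiplies it by ρ_t · ∇_θ log π_θ).

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one token."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def ppo_clip_grad_wrt_ratio(ratio, advantage, eps=0.2):
    """Derivative of the term with respect to rho: A when the unclipped
    branch is selected by the min, 0 when the clipped branch is selected."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    if ratio * advantage <= clipped * advantage:
        return advantage
    return 0.0

# Illustrative (ratio, advantage) combinations.
for ratio, adv in [(1.1, 1.0), (3.0, 1.0), (3.0, -1.0), (0.5, -1.0), (0.5, 1.0)]:
    print(ratio, adv, ppo_clip_term(ratio, adv), ppo_clip_grad_wrt_ratio(ratio, adv))
```

The gradient vanishes only when the ratio has already moved past the clip boundary in the direction the advantage rewards. Moves in the opposite direction are still penalized at full strength.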
7. The Reward Hacking Problem
Reward models are imperfect. They can be "hacked": the policy finds responses that score highly on the reward model but are actually low-quality. Examples:
- Overly verbose responses (reward models often correlate length with quality)
- Repetitive but "safe" text
- Nonsensical outputs that happen to trigger high reward
KL penalty as defense: By penalizing divergence from π_ref, the KL term prevents the policy from venturing too far into regions where the reward model is unreliable (since π_ref was trained on actual human text, not reward-optimized text). The reward model is only reliable near the distribution it was trained on — the KL penalty keeps us in that region.
8. Practical RLHF Configuration
Typical hyperparameters from published work (InstructGPT, Llama 2):
| Parameter | Value |
|---|---|
| β (KL coefficient) | 0.01–0.1 |
| ε (PPO clip) | 0.2 |
| γ (discount factor) | 1.0 (no discount in text) |
| λ (GAE) | 0.95 |
| Optimizer steps per batch | 4 |
| Learning rate | 1e-6 to 5e-6 |
| Batch size | 512–1024 prompts |
Worked Examples
Example 1: Reward Model Loss for a Single Pair
Problem: A preference pair has r_φ(x, y_w) = 2.0 and r_φ(x, y_l) = −1.0. Compute the reward model loss for this example.
Solution:
$Δr = r_w − r_l = 2.0 − (−1.0) = 3.0
L = −log σ(3.0)
σ(3.0) = 1/(1+e^{−3}) = 1/(1+0.0498) = 1/1.0498 ≈ 0.9526
L = −log(0.9526) = −(−0.0486) = 0.0486
$
The loss is small because the model correctly distinguishes the winning from the losing response. If Δr were negative (the model ranks the loser higher), the loss would be ln(1+e^{|Δr|}) ≈ |Δr| for large |Δr|, which is much larger.
Example 2: Computing the KL Penalty
Problem: For a single response y of length 3 tokens, the reference policy gives probabilities [0.5, 0.3, 0.4] and the current policy gives [0.6, 0.2, 0.5]. Compute D_KL(π || π_ref) and the KL penalty with β = 0.1.
Solution:
$D_KL = Σ_t π(y_t) log(π(y_t)/π_ref(y_t))
= 0.6·log(0.6/0.5) + 0.2·log(0.2/0.3) + 0.5·log(0.5/0.4)
= 0.6·log(1.2) + 0.2·log(0.667) + 0.5·log(1.25)
= 0.6·0.1823 + 0.2·(−0.4055) + 0.5·0.2231
= 0.1094 − 0.0811 + 0.1116
= 0.1399
$
KL penalty = β · D_KL = 0.1 · 0.1399 = 0.0140.
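The total reward for this response would be reduced by 0.014 to account for deviation from π_ref. Note that this probability-weighted sum covers only the three sampled tokens, so it is an illustration rather than an exact KL: the exact per-position KL sums over the whole vocabulary, and a common practical estimate is the unweighted sum of log-ratios Σ_t log(π(y_t)/π_ref(y_t)) over the sampled tokens.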
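The arithmetic in Example 2 can be reproduced in a few lines. The sketch below also computes the unweighted sum of log-ratios over the sampled tokens, which is a single-sample estimate of the sequence-level KL; the function names are illustrative.

```python
import math

pi_probs = [0.6, 0.2, 0.5]    # current policy, per sampled token
ref_probs = [0.5, 0.3, 0.4]   # reference policy, per sampled token
beta = 0.1

def weighted_sum(pi, ref):
    """Probability-weighted sum used in Example 2."""
    return sum(p * math.log(p / q) for p, q in zip(pi, ref))

def log_ratio_sum(pi, ref):
    """Single-sample estimate of the sequence-level KL: sum of token log-ratios."""
    return sum(math.log(p / q) for p, q in zip(pi, ref))

d_weighted = weighted_sum(pi_probs, ref_probs)
d_sample = log_ratio_sum(pi_probs, ref_probs)
print(f"weighted sum    = {d_weighted:.4f}, penalty = {beta * d_weighted:.4f}")
print(f"log-ratio sum   = {d_sample:.4f}, penalty = {beta * d_sample:.4f}")
```

For these particular numbers both policies assign the full sequence the same probability (0.6·0.2·0.5 = 0.5·0.3·0.4 = 0.06), so the log-ratio estimate is zero even though the weighted sum is not. The two quantities measure different things.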
Example 3: Optimal Policy Re-weighting
Problem: π_ref gives equal probability (0.5) to two responses: "Hello" and "Hi there". The reward model scores them as r("Hello") = 1.0, r("Hi there") = 3.0. With β = 0.5, what are the optimal policy probabilities?
Solution:
$π*(y) ∝ π_ref(y) · exp(r(y)/β)
For "Hello": π_ref · exp(1.0/0.5) = 0.5 · exp(2.0) = 0.5 · 7.389 = 3.695
For "Hi there": π_ref · exp(3.0/0.5) = 0.5 · exp(6.0) = 0.5 · 403.4 = 201.7
Z = 3.695 + 201.7 = 205.4
π*("Hello") = 3.695 / 205.4 = 0.018 = 1.8%
π*("Hi there") = 201.7 / 205.4 = 0.982 = 98.2%
$
The optimal policy strongly prefers the higher-reward response. If we set β = 10.0:
$π*("Hello") ∝ 0.5·exp(0.1) = 0.5526
π*("Hi there") ∝ 0.5·exp(0.3) = 0.6749
Z = 1.2275
π*("Hello") = 0.45, π*("Hi there") = 0.55
$
High β keeps the policy close to π_ref (uniform in this example), which corresponds to conservative updates.
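Example 3 can be checked with a short script. This is a sketch that keys the distributions by response string; the function name is illustrative.

```python
import math

def reweight(pi_ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalized over the given responses."""
    weights = {y: pi_ref[y] * math.exp(rewards[y] / beta) for y in pi_ref}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

pi_ref = {"Hello": 0.5, "Hi there": 0.5}
rewards = {"Hello": 1.0, "Hi there": 3.0}

for beta in (0.5, 10.0):
    probs = reweight(pi_ref, rewards, beta)
    print(beta, {y: round(p, 3) for y, p in probs.items()})
```

With only two equally likely responses under π_ref, the result reduces to a sigmoid of the reward difference divided by β.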
Quiz
Q1: What does the Bradley-Terry model specify in this subject?
A) The gradient of the PPO-clip objective B) The probability that one response is preferred over another, P(y₁ ≻ y₂ | x) = σ(r(x, y₁) − r(x, y₂)) C) The normalization constant Z(x) of the optimal policy D) The KL divergence between π and π_ref
Correct: B)
- If you chose A: This is incorrect. The PPO-clip gradient belongs to Stage 3; the Bradley-Terry model is the preference model behind the Stage 2 reward model.
- If you chose B: Setting each response's strength to exp(r(x, y)) turns the Bradley-Terry ratio into a sigmoid of the reward difference. Correct!
- If you chose C: This is incorrect. Z(x) normalizes the closed-form optimal policy and is unrelated to how preferences are modeled.
- If you chose D: This is incorrect. The KL divergence measures how far the policy moves from π_ref; it does not model pairwise preferences.
Q2: What is the primary purpose of the KL term in the KL-constrained objective?
A) To keep the optimized policy close to π_ref while reward is maximized, which limits reward hacking B) To replace the reward model during PPO C) To make the reward model loss convex D) To normalize rewards to zero mean and unit variance
Correct: A)
- If you chose A: The term β·D_KL(π || π_ref) penalizes drift into regions where the reward model is unreliable. Correct!
- If you chose B: This is incorrect. The reward model still supplies r(x, y); the KL term is subtracted from it.
- If you chose C: This is incorrect. The KL term appears in the Stage 3 policy objective, not in the reward model loss.
- If you chose D: This is incorrect. Reward normalization is a separate preprocessing step discussed under Pitfalls.
Q3: Which statement about the optimal policy is TRUE?
A) It is independent of the reference policy π_ref B) It is π_ref re-weighted by exp(r(x,y)/β) and normalized by Z(x) C) As β → ∞ it concentrates on the highest-reward response D) It requires no normalization constant
Correct: B)
- If you chose A: This is incorrect. π_ref appears as a multiplicative factor in π*(y|x).
- If you chose B: This is the closed form π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β) derived in Section 5. Correct!
- If you chose C: This is incorrect. β → ∞ gives π* → π_ref; it is β → 0 that concentrates mass on the highest-reward response.
- If you chose D: This is incorrect. Z(x) is needed so that π*(·|x) sums to 1.
Q4: For a preference pair with r_w = 1.5 and r_l = 1.2, what is the reward model loss −log σ(r_w − r_l)?
A) 0.3 B) 0.574 C) ≈ 0.554 D) ≈ 0.011
Correct: C)
- If you chose A: This is incorrect. 0.3 is the reward difference Δr, not the loss.
- If you chose B: This is incorrect. 0.574 is σ(0.3), the predicted preference probability; the loss is its negative log.
- If you chose C: Δr = 0.3, σ(0.3) ≈ 0.574, and −log(0.574) ≈ 0.554. Correct!
- If you chose D: This is incorrect. A loss near 0.011 corresponds to a much larger gap (Δr = 4.5).
Q5: How are the optimal policy and PPO-clip related?
A) PPO-clip is a practical, sample-based way to optimize the KL-constrained objective whose exact maximizer is the closed-form optimal policy B) They are unrelated: PPO-clip optimizes the reward model loss C) PPO-clip computes Z(x) exactly and then samples from π* D) The optimal policy is obtained by clipping π_ref to [1−ε, 1+ε]
Correct: A)
- If you chose A: The closed form cannot be evaluated over all possible responses, so PPO optimizes the same objective empirically. Correct!
- If you chose B: This is incorrect. The reward model loss is minimized in Stage 2; PPO-clip runs in Stage 3 on the policy.
- If you chose C: This is incorrect. Z(x) sums over all responses, which is intractable; PPO never computes it.
- If you chose D: This is incorrect. Clipping applies to the ratio π_θ/π_old during updates, not to π_ref.
Q6: What is a common pitfall when working with a Bradley-Terry reward model?
A) Training it on pairwise comparisons instead of absolute scores B) Reading raw reward values as absolute quality scores, when only reward differences are meaningful C) Using a sigmoid of the reward difference D) Using the SFT model as π_ref
Correct: B)
- If you chose A: This is incorrect. Pairwise comparisons are exactly what the Bradley-Terry model is designed for.
- If you chose B: The loss −log σ(r_w − r_l) is unchanged when all rewards shift by a constant, so absolute values carry no meaning. Correct!
- If you chose C: This is incorrect. The sigmoid of the difference is the model itself, not a mistake.
- If you chose D: This is incorrect. Using π_SFT as the reference policy is the typical choice in Stage 3.
Q7: At which point in the RLHF pipeline is the trained reward model applied?
A) In Stage 3, to score responses sampled from the policy during PPO B) In Stage 1, to generate the (instruction, response) pairs for SFT C) Only after training, as a replacement for human evaluation D) Before SFT, to filter the base model's training data
Correct: A)
- If you chose A: PPO optimizes π_θ against r_φ, so the reward model scores each sampled response. Correct!
- If you chose B: This is incorrect. Stage 1 uses (instruction, response) pairs directly and involves no reward model.
- If you chose C: This is incorrect. Reward model scores alone are not a reliable evaluation, because the policy is optimized against them.
- If you chose D: This is incorrect. The reward model is trained in Stage 2, after SFT.
Practice Problems
Problem 1
For a preference pair, r_w = 1.5 and r_l = 1.2. Compute the reward model loss. Then compute it for r_w = 1.5, r_l = −3.0. Why does the larger gap give the smaller loss?
Answer
Case 1: Δr = 0.3. L = −log σ(0.3) = −log(0.5744) ≈ 0.5544. Case 2: Δr = 4.5. L = −log σ(4.5) = −log(0.98901) ≈ 0.0110.
The second loss is smaller because the model is more confident about the correct ordering. The loss is a decreasing function of Δr: once Δr ≫ 1, the sigmoid saturates and the loss approaches 0. At Δr = 0 (model unsure) the loss is log 2 ≈ 0.693, and for Δr < 0 (wrong ordering) it grows roughly linearly in |Δr|.
Problem 2
Derive that for β → 0, the KL-constrained optimal policy assigns probability 1 to argmax_y r(x,y). (Assume a unique maximum.)
Answer
π*(y) ∝ π_ref(y) · exp(r(y)/β). As β → 0, exp(r(y)/β) grows fastest for the y with the largest r(y). Let r* = max_y r(y) and y* = argmax_y r(y), and assume π_ref(y*) > 0. Then:
π*(y*) ∝ π_ref(y*) · exp(r*/β)
π*(y) ∝ π_ref(y) · exp(r(y)/β) for y ≠ y*
The ratio is π*(y)/π*(y*) = (π_ref(y)/π_ref(y*)) · exp((r(y)−r*)/β) → 0 as β → 0, since r(y)−r* < 0 and dividing by β → 0 sends the exponent to −∞. So π*(y*) → 1 and every other π*(y) → 0. The policy becomes deterministic: it always picks the highest-reward response.
Problem 3
Show that the KL-constrained objective max_π E[r] − β·D_KL(π||π_ref) is equivalent to maximizing the expected shaped reward r + β·log π_ref plus an entropy bonus β·H(π) on the policy. (Hint: expand D_KL.)
Answer
D_KL(π||π_ref) = E_{y~π}[log π(y) − log π_ref(y)] = −H(π) − E_{y~π}[log π_ref(y)]
So J(π) = E[r] − β·(−H(π) − E[log π_ref]) = E[r] + β·H(π) + β·E[log π_ref] = E[r + β·log π_ref] + β·H(π)
The first term rewards responses that π_ref assigns high probability to (like a shaped reward). The second term is an entropy bonus encouraging exploration. This shows RLHF naturally balances reward optimization, staying close to the reference, and maintaining diversity.
Problem 4
The PPO-clip objective clips ρ_t to [1−ε, 1+ε]. If ε=0.2, π_old gives probability 0.1 to a token, and π_θ gives probability 0.3, what is the effective probability ratio used in the loss? What if π_θ gives 0.05?
Answer
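Case 1: ρ = 0.3/0.1 = 3.0, which is above 1+ε = 1.2, so clip(ρ) = 1.2. Case 2: ρ = 0.05/0.1 = 0.5, which is below 1−ε = 0.8, so clip(ρ) = 0.8.
Because the objective takes min(ρ·A, clip(ρ)·A), the clipped value is the one used only when the advantage pushes further in the same direction: in Case 1 the effective ratio is 1.2 when A > 0 (for A < 0 the unclipped 3.0 is used), and in Case 2 it is 0.8 when A < 0 (for A > 0 the unclipped 0.5 is used).
Clipping prevents the policy from changing TOO much from the old policy: even if the advantage says "increase this token's probability dramatically," the update is bounded. This prevents destructive policy collapse where one bad update ruins the model.
Problem 5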
Explain why the reward model is trained on PREFERENCES rather than direct scores. Derive what would happen if humans gave absolute scores (0-10) instead of pairwise preferences.
Answer
Preferences are more RELIABLE than absolute scores. Humans disagree on absolute scales (calibration varies) but agree on pairwise comparisons more consistently. Mathematically:
**With preferences (Bradley-Terry):** Only the DIFFERENCE r_w − r_l matters. The absolute scale of r is irrelevant: shifting all rewards by a constant doesn't change P(y₁≻y₂). This makes the training robust to calibration shifts.
**With absolute scores:** We'd need to minimize ||r(x,y) − score||². This is harder because:
1. Different annotators have different baselines (one's "7" is another's "5")
2. The model must learn absolute magnitudes, which is more complex
3. Small errors in absolute prediction cause large losses, making training unstable
Preferences are mathematically elegant: they reduce the problem to ORDERING, which is both more reliable and simpler to optimize.
Summary
- RLHF is a three-stage process: SFT teaches instruction following → Reward Model learns human preferences → PPO optimizes the policy against the reward model with a KL penalty
- The Bradley-Terry model converts pairwise preferences into a probability: P(y₁ ≻ y₂) = σ(r(x,y₁) − r(x,y₂)), leading to a simple logistic loss for the reward model
- The KL-constrained objective max E[r] − β·D_KL(π||π_ref) balances reward optimization against staying close to the reference policy, preventing reward hacking
- The optimal policy has closed form π*(y) ∝ π_ref(y)·exp(r(y)/β) — the reference policy re-weighted by exponentiated reward
- PPO-clip provides a practical algorithm for optimizing this objective, with probability ratio clipping preventing destructive updates
Pitfalls
- Using the reward model loss as if it were a regression loss. The reward model minimizes L = −log σ(r_w − r_l), which only cares about the ORDERING of rewards, not their absolute magnitudes. Shifting all rewards by a constant doesn't change the loss. Don't interpret the raw reward values as absolute quality scores — they're only meaningful in relative comparisons.
- Setting β too low and causing reward hacking. A small KL penalty coefficient (β < 0.01) allows the policy to drift far from π_ref, into regions where the reward model is unreliable. The policy learns to exploit reward model quirks (verbosity bias, pattern matching) rather than genuinely improving. Monitor reward-KL tradeoff curves and validate with actual human evaluations, not just reward model scores.
- Distributing sequence-level rewards uniformly across all tokens. In RLHF, the reward r(x,y) is a single scalar for the entire response. Assigning the same advantage to every token ignores the fact that some tokens contribute more to quality than others. This can cause the policy to reinforce filler tokens and stylistic tics. Consider token-level reward decomposition or learned value functions for finer-grained credit assignment.
- Not normalizing reward scores before PPO training. Raw reward model outputs can have arbitrary scale and bias. If rewards are very large, the KL penalty becomes negligible and reward hacking occurs. If rewards are very small, the KL penalty dominates and the policy never moves. Standard practice: normalize rewards to zero mean and unit variance across the batch before computing advantages.
- Evaluating RLHF progress using only reward model scores. Since the policy is optimized AGAINST the reward model, improving reward scores can mean either genuine improvement OR increased exploitation of the reward model's blind spots. Always include human evaluations, benchmark task performance, and diversity metrics in your evaluation suite. A policy that scores +10 on the reward model but produces repetitive, unhelpful text has not actually improved.
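The batch-level reward normalization mentioned in the pitfalls above can be sketched as follows. This is a minimal illustration on plain Python floats; the function name and the small epsilon guard against a zero standard deviation are assumptions, not taken from a specific library.

```python
import math

def normalize_rewards(rewards, eps=1e-8):
    """Shift and scale a batch of rewards to zero mean and (approximately) unit variance."""
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up raw reward-model scores with an arbitrary scale and offset.
raw = [12.0, 15.0, 9.0, 30.0]
print(normalize_rewards(raw))
```

Because the Bradley-Terry loss only pins down reward differences, removing the batch mean discards no information the reward model actually learned.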
Key Terms
| Term | Definition |
|---|---|
| RLHF | Three-stage pipeline: SFT → Reward Model → PPO — aligns LLMs with human preferences |
| Bradley-Terry model | P(y₁ ≻ y₂) = σ(r(x,y₁) − r(x,y₂)) — models pairwise preferences via latent rewards |
| Reward model | r_φ(x, y) — trained on human preferences; scores how "good" a response is |
| KL-constrained RL | max_π E[r(x,y)] − β·D_KL(π ∥ π_ref); maximizes reward while penalizing divergence from the reference policy |
| Optimal policy (closed form) | π*(y) ∝ π_ref(y)·exp(r(y)/β) — reference policy re-weighted by exponentiated reward |
| β (KL coefficient) | Controls the reward-vs-stay-close tradeoff: β → 0 = greedy on reward, β → ∞ = no change from π_ref |
| PPO-clip | Clips probability ratio to [1−ε, 1+ε] — prevents destructive single-step policy changes |
| Reward hacking | Policy exploits reward model weaknesses (e.g., verbosity bias) to game high scores without genuine quality |
| Advantage | A_t = r(x,y) − β·log(π_θ(y∣x)/π_ref(y∣x)); the reward minus the KL penalty |
Next Steps
Continue to 20-07 — DPO (Direct Preference Optimization) to learn how to optimize human preferences DIRECTLY from preference data, without training a separate reward model — a simpler and often more effective alternative to RLHF.