
20-06 — RLHF Mathematics

Phase: 20 (Training & Fine-tuning Mathematics)
Subject: 20-06
Prerequisites: 20-05 (Instruction Tuning SFT), 13-04 (KL Divergence), 10-02 (Conditional Probability: Bayes' rule, Bradley-Terry model), 14-02 (Gradient Descent), 23-08 (PPO Algorithm, conceptual), 05-05 (Integration by Parts, for KL integral)
Next subject: 20-07 DPO (Direct Preference Optimization)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the Bradley-Terry preference model and prove the reward model loss: L_RM = −E[log σ(r(x, y_w) − r(x, y_l))]
  2. Formulate the KL-constrained RL objective for RLHF: max_π E[r(x,y)] − β·D_KL(π || π_ref) and derive its closed-form optimal policy
  3. Explain the three-stage RLHF pipeline (SFT → Reward Model → PPO) and the mathematical role each stage plays
  4. Compute the gradient of the PPO-clip objective used in RLHF and analyze how clipping prevents destructive policy updates
  5. Derive why the KL penalty coefficient β controls the tradeoff between reward optimization and staying close to the reference policy

Core Content

1. The RLHF Pipeline — Three Stages

RLHF (Reinforcement Learning from Human Feedback) extends instruction tuning by incorporating human preferences:

Stage 1: SFT (Supervised Fine-Tuning)
    → Train a base model π_SFT on (instruction, response) pairs
    → Produces a model that can follow instructions but may not align with human preferences

Stage 2: Reward Model Training
    → Collect human preference data: for instruction x, humans prefer y_w over y_l
    → Train a reward model r_φ(x, y) to predict human preferences
    → The reward model scores how "good" a response is

Stage 3: PPO Fine-Tuning
    → Use PPO to optimize π_θ against the reward model r_φ
    → Add a KL penalty to keep π_θ close to π_SFT (prevent reward hacking)

2. Stage 2: The Bradley-Terry Preference Model

Human preferences are modeled as a pairwise comparison. Given instruction x and two responses y₁, y₂, the probability that a human prefers y₁ over y₂ is:

$P(y₁ ≻ y₂ | x) = exp(r*(x, y₁)) / (exp(r*(x, y₁)) + exp(r*(x, y₂)))
$

where r*(x, y) is the "true" (latent) reward of response y given instruction x.

Derivation from Bradley-Terry: The Bradley-Terry model for pairwise comparisons states that each item i has a "strength" parameter λ_i, and:

$P(i beats j) = λ_i / (λ_i + λ_j)
$

Set λ_i = exp(r(x, y_i)). Dividing numerator and denominator by exp(r₁) gives exp(r₁)/(exp(r₁)+exp(r₂)) = 1/(1+exp(−(r₁ − r₂))) = σ(r₁ − r₂), where σ is the sigmoid function.

Key simplification:

$P(y₁ ≻ y₂ | x) = σ(r(x, y₁) − r(x, y₂))
$

where σ(z) = 1/(1+e^{−z}) is the sigmoid.
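The equivalence between the Bradley-Terry ratio and the sigmoid of the reward difference can be checked numerically. The sketch below is only an illustration: the reward values and the function names `bradley_terry` and `sigmoid` are hypothetical, not part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def bradley_terry(r1: float, r2: float) -> float:
    """P(y1 beats y2) with strengths lambda_i = exp(r_i)."""
    lam1, lam2 = math.exp(r1), math.exp(r2)
    return lam1 / (lam1 + lam2)

# Hypothetical latent rewards, chosen only for illustration.
r1, r2 = 1.3, -0.4

p_bt = bradley_terry(r1, r2)
p_sig = sigmoid(r1 - r2)
print(p_bt, p_sig)  # the two values agree up to floating-point error

# Adding the same constant to both rewards leaves the preference unchanged.
p_shifted = bradley_terry(r1 + 10.0, r2 + 10.0)
```

Only the difference r₁ − r₂ enters the probability, which is why `p_shifted` matches `p_bt`.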


3. Reward Model Loss Function

Given a dataset of human preferences D = {(x, y_w, y_l)} where y_w is the preferred (winning) response and y_l is the losing response:

Maximum likelihood estimation:

The reward model r_φ should maximize the probability of the observed preferences:

$L_RM(φ) = −E_{(x, y_w, y_l)~D}[log P(y_w ≻ y_l | x)]
         = −E[log σ(r_φ(x, y_w) − r_φ(x, y_l))]
$

Full derivation:

$P(y_w ≻ y_l | x; φ) = σ(r_φ(x, y_w) − r_φ(x, y_l))
                     = 1 / (1 + exp(−(r_φ(x, y_w) − r_φ(x, y_l))))

log P = −log(1 + exp(−(r_w − r_l)))

L_RM = E[log(1 + exp(−(r_w − r_l)))]
$

This is the logistic loss on the difference r_w − r_l. If r_w ≫ r_l, the loss is near zero. If r_w ≪ r_l, the loss is large.
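As a minimal sketch of this loss, assuming the reward model's scores are already available as plain floats, the batch loss can be computed with a numerically stable softplus. The scores in `batch` are made up for illustration, and `reward_model_loss` is a hypothetical helper name.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def reward_model_loss(pairs):
    """Mean of log(1 + exp(-(r_w - r_l))) over (r_w, r_l) score pairs."""
    return sum(softplus(-(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

# Hypothetical reward-model scores (r_w, r_l) for three preference pairs:
# a confident correct ranking, a near tie, and a confident wrong ranking.
batch = [(2.0, -1.0), (0.1, 0.0), (-1.0, 2.0)]

per_pair = [softplus(-(r_w - r_l)) for r_w, r_l in batch]
loss = reward_model_loss(batch)
print(per_pair, loss)
```

The wrongly ranked pair dominates the mean, which matches the statement that the loss is large when r_w ≪ r_l.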

Gradient of the reward model loss:

$∂L_RM/∂φ = −E[σ(−(r_w − r_l)) · ∇_φ(r_w − r_l)]
         = −E[(1 − σ(r_w − r_l)) · ∇_φ(r_w − r_l)]
$

The gradient pushes r_w UP and r_l DOWN, weighted by how "wrong" the current model is. If the model already strongly prefers y_w, σ(r_w−r_l) ≈ 1, and the gradient is near zero.
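The gradient formula can be sanity-checked against finite differences. The sketch below assumes a toy linear reward model r_φ(x, y) = φ · f(x, y) with hand-picked feature vectors. Both the linear form and the numbers are assumptions for illustration; real reward models are neural networks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_r(phi, f_w, f_l):
    """r_w - r_l for the assumed linear reward r = phi . f."""
    return sum(p * (a - b) for p, a, b in zip(phi, f_w, f_l))

def pair_loss(phi, f_w, f_l):
    """-log sigma(r_w - r_l) = log(1 + exp(-(r_w - r_l)))."""
    return math.log1p(math.exp(-delta_r(phi, f_w, f_l)))

def pair_grad(phi, f_w, f_l):
    """Analytic gradient: -(1 - sigma(r_w - r_l)) * grad_phi(r_w - r_l)."""
    weight = 1.0 - sigmoid(delta_r(phi, f_w, f_l))
    return [-weight * (a - b) for a, b in zip(f_w, f_l)]

# Hypothetical parameters and features for one preference pair.
phi = [0.5, -0.2]
f_w, f_l = [1.0, 2.0], [0.0, 3.0]

grad = pair_grad(phi, f_w, f_l)

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(phi)):
    up, down = list(phi), list(phi)
    up[i] += eps
    down[i] -= eps
    numeric.append((pair_loss(up, f_w, f_l) - pair_loss(down, f_w, f_l)) / (2 * eps))

print(grad, numeric)
```

A gradient-descent step moves φ against `grad`, which raises r_w − r_l for this pair.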

⚠️ THIS IS CRITICAL — The reward model is the interface between human preferences and optimization. Its accuracy determines the quality of the final RLHF-tuned model. Reward model overfitting or reward hacking can cause the PPO stage to optimize for "gaming" the reward rather than actual quality.


4. Stage 3: KL-Constrained RL Objective

Given a trained reward model r_φ (we'll just write r), the RL objective is:

$max_π  E_{x~D}[ E_{y~π(·|x)}[r(x, y)] − β · D_KL(π(·|x) || π_ref(·|x)) ]
$

where:
- π is the policy (language model) we're optimizing
- π_ref is the reference policy (typically π_SFT from Stage 1)
- β is the KL penalty coefficient
- D_KL(π || π_ref) = E_{y~π}[log(π(y|x)/π_ref(y|x))] measures divergence from the reference

Why the KL penalty? Without it, the policy would exploit any quirks in the reward model — producing nonsensical text that happens to score high on r. The KL penalty ensures the model stays "close" to the well-behaved reference policy, preserving fluency and coherence.
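To make the role of the penalty concrete, here is a small sketch that evaluates the objective for one prompt with only three candidate responses. The distributions and rewards are invented for illustration; the third response stands in for text that exploits a reward model quirk.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r] - beta * D_KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p_i * r_i for p_i, r_i in zip(pi, rewards))
    return expected_reward - beta * kl(pi, pi_ref)

# Hypothetical toy setup: the third response is rare under pi_ref
# but receives an inflated reward.
pi_ref = [0.70, 0.29, 0.01]
rewards = [1.0, 1.2, 5.0]
greedy = [0.01, 0.01, 0.98]  # policy that piles mass on the inflated response

j_ref = objective(pi_ref, pi_ref, rewards, beta=2.0)
j_greedy_strong = objective(greedy, pi_ref, rewards, beta=2.0)
j_greedy_weak = objective(greedy, pi_ref, rewards, beta=0.1)
print(j_ref, j_greedy_strong, j_greedy_weak)
```

With the larger β the greedy policy scores worse than simply staying at π_ref. With the smaller β it scores better, so the penalty no longer deters the exploit.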


5. Closed-Form Optimal Policy

The KL-constrained RL objective has a known analytic solution. Let's derive it.

Problem: For a fixed x, find π* that maximizes:

$J(π) = E_{y~π}[r(x,y)] − β · D_KL(π || π_ref)
$

With the constraint Σ_y π(y|x) = 1 (valid probability distribution).

Lagrangian:

$L = Σ_y π(y|x) r(x,y) − β Σ_y π(y|x) log(π(y|x)/π_ref(y|x)) − λ(Σ_y π(y|x) − 1)
$

First-order condition (∂L/∂π(y|x) = 0):

$r(x,y) − β[log(π(y|x)/π_ref(y|x)) + 1] − λ = 0
r(x,y) − β log π(y|x) + β log π_ref(y|x) − β − λ = 0
β log π(y|x) = r(x,y) + β log π_ref(y|x) − β − λ
π(y|x) = π_ref(y|x) · exp(r(x,y)/β) · exp(−1 − λ/β)
$

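Let Z(x) = exp(1 + λ/β). The multiplier λ is fixed by the constraint Σ_y π(y|x) = 1, which gives Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β), so Z(x) is the normalization constant: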

$π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)
$

Interpretation: The optimal policy is the reference policy RE-WEIGHTED by exp(r/β). Responses with higher reward are up-weighted; responses with lower reward are down-weighted. β controls the temperature of this re-weighting:
- β → 0: π* puts all mass on argmax_y r(x,y) (greedy on reward)
- β → ∞: π* → π_ref (no deviation from reference)
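On a toy problem with a handful of responses, the closed form can be computed directly and compared against randomly drawn policies. This is a sketch under assumed values (`pi_ref`, `rewards`, and `beta` are made up), and the random search is only a spot check, not a proof.

```python
import math
import random

def kl(p, q):
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r] - beta * D_KL(pi || pi_ref)."""
    return sum(p * r for p, r in zip(pi, rewards)) - beta * kl(pi, pi_ref)

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y)/beta) / Z, with the max subtracted for stability."""
    scaled = [r / beta for r in rewards]
    m = max(scaled)
    weights = [p * math.exp(s - m) for p, s in zip(pi_ref, scaled)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical three-response problem.
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star = optimal_policy(pi_ref, rewards, beta)
j_star = objective(pi_star, pi_ref, rewards, beta)

# Spot check: no randomly drawn policy should beat pi_star.
random.seed(0)
best_random = -math.inf
for _trial in range(1000):
    raw = [random.random() + 1e-9 for _y in pi_ref]
    total = sum(raw)
    candidate = [v / total for v in raw]
    best_random = max(best_random, objective(candidate, pi_ref, rewards, beta))

print(pi_star, j_star, best_random)

# The two limits of beta described above.
near_greedy = optimal_policy(pi_ref, rewards, 1e-3)
near_reference = optimal_policy(pi_ref, rewards, 1e3)
```

Substituting π* back into J gives J(π*) = β · log Z(x), which the value of `j_star` reproduces for these numbers.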


6. PPO for RLHF

In practice, we can't evaluate or sample from the optimal policy directly: the normalizer Z(x) sums over every possible response. Instead, we use PPO (Proximal Policy Optimization) to optimize the objective empirically.

PPO-CLIP Objective:

For each batch of prompts, the policy π_θ generates responses y. The PPO loss is:

$L_PPO(θ) = E[min(ρ_t · A_t, clip(ρ_t, 1−ε, 1+ε) · A_t)]
$

where:
- ρ_t = π_θ(y_t | x, y_{<t}) / π_old(y_t | x, y_{<t}) is the probability ratio
- A_t is the advantage at step t
- ε is the clip range (typically 0.2)

The advantage in RLHF:

$A_t = r(x, y) − β · log(π_θ(y|x)/π_ref(y|x))
$

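In this simplified, baseline-free form, the advantage is the reward MINUS the KL penalty. Practical PPO implementations usually also subtract a learned value estimate and compute A_t with GAE, but the structure is the same. Each generated token "consumes" some of the KL budget. The PPO gradient pushes the policy toward tokens that increase reward while not deviating too far from π_ref.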

Token-level vs sequence-level: In RLHF, the reward r(x,y) is typically at the sequence level (a single scalar for the entire response). The advantage is distributed across tokens. A common approach: assign the same advantage to all tokens in the response.
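The "same advantage for every token" approach can be sketched as follows, using the simplified advantage formula from this section. The token probabilities, the reward, and the helper names are assumptions for illustration. Production implementations typically work with per-token KL terms and a value baseline instead.

```python
import math

def sequence_advantage(reward, logp_policy, logp_ref, beta):
    """A = r(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x)).

    The sequence log-ratio is the sum of per-token log-ratios.
    """
    log_ratio = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * log_ratio

def broadcast_to_tokens(advantage, num_tokens):
    """Assign the same sequence-level advantage to every token."""
    return [advantage] * num_tokens

# Hypothetical per-token probabilities of the sampled tokens.
logp_policy = [math.log(p) for p in (0.6, 0.4, 0.5)]
logp_ref = [math.log(p) for p in (0.5, 0.3, 0.4)]

adv = sequence_advantage(reward=1.5, logp_policy=logp_policy,
                         logp_ref=logp_ref, beta=0.1)
token_advantages = broadcast_to_tokens(adv, len(logp_policy))
print(token_advantages)
```

Here the policy is twice as likely as the reference to produce the sequence (0.12 versus 0.06), so the penalty is 0.1 · ln 2.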

Gradient of the PPO-clip objective (when ρ is within clip range):

$∇_θ L_PPO = E[ρ_t · A_t · ∇_θ log π_θ(y_t | x, y_{<t})]
$

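This is a standard policy gradient weighted by the advantage. Outside the clip range the clipped term is constant in θ, so once ρ_t > 1+ε with A_t > 0 (or ρ_t < 1−ε with A_t < 0) that token contributes zero gradient. This is how clipping prevents overly aggressive updates.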
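The per-token PPO-clip term and its derivative with respect to the ratio can be written out in a few lines. This is a scalar sketch with hypothetical function names; a real implementation operates on tensors and relies on automatic differentiation.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one token."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

def ppo_clip_grad_wrt_ratio(ratio, advantage, eps=0.2):
    """Derivative of the term above with respect to rho.

    The gradient with respect to theta follows by the chain rule:
    multiply by rho * grad_theta log pi_theta(y_t | x, y_<t).
    """
    if advantage >= 0 and ratio > 1.0 + eps:
        return 0.0  # already moved far enough in the rewarded direction
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0  # already moved far enough away from a penalized token
    return advantage

# Hypothetical ratios and advantages covering the four sign/side combinations.
for rho, adv in [(3.0, 1.0), (3.0, -1.0), (0.5, 1.0), (0.5, -1.0)]:
    print(rho, adv, ppo_clip_term(rho, adv), ppo_clip_grad_wrt_ratio(rho, adv))
```

Clipping is one-sided: it removes the incentive to keep moving in a direction the advantage favors, but a ratio that has drifted the wrong way still receives a full corrective gradient.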


7. The Reward Hacking Problem

Reward models are imperfect. They can be "hacked" — the policy finds responses that score highly on the reward model but are actually low-quality. Examples: - Overly verbose responses (reward models often correlate length with quality) - Repetitive but "safe" text - Nonsensical outputs that happen to trigger high reward

KL penalty as defense: By penalizing divergence from π_ref, the KL term prevents the policy from venturing too far into regions where the reward model is unreliable (since π_ref was trained on actual human text, not reward-optimized text). The reward model is only reliable near the distribution it was trained on — the KL penalty keeps us in that region.


8. Practical RLHF Configuration

Typical hyperparameters from published work (InstructGPT, Llama 2):

| Parameter | Value |
| --- | --- |
| β (KL coefficient) | 0.01–0.1 |
| ε (PPO clip) | 0.2 |
| γ (discount factor) | 1.0 (no discount in text) |
| λ (GAE) | 0.95 |
| Optimizer steps per batch | 4 |
| Learning rate | 1e-6 to 5e-6 |
| Batch size | 512–1024 prompts |

Worked Examples

Example 1: Reward Model Loss for a Single Pair

Problem: A preference pair has r_φ(x, y_w) = 2.0 and r_φ(x, y_l) = −1.0. Compute the reward model loss for this example.

Solution:

$Δr = r_w − r_l = 2.0 − (−1.0) = 3.0
L = −log σ(3.0)
σ(3.0) = 1/(1+e^{−3}) = 1/(1+0.0498) = 1/1.0498 ≈ 0.9526
L = −log(0.9526) = −(−0.0486) = 0.0486
$

Small loss — the model correctly distinguishes the winning from losing response. If Δr were negative (model thinks loser is better), loss would be on the order of ln(1+e^{|Δr|}) ≈ |Δr|, much larger.


Example 2: Computing the KL Penalty

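Problem: For a single response y of length 3 tokens, the reference policy assigns the sampled tokens probabilities [0.5, 0.3, 0.4] and the current policy assigns them [0.6, 0.2, 0.5]. Compute the π-weighted log-ratio terms for these tokens, used here as a stand-in for D_KL(π || π_ref), and the KL penalty with β = 0.1.

Solution:

$D_KL ≈ Σ_t π(y_t) log(π(y_t)/π_ref(y_t))
     = 0.6·log(0.6/0.5) + 0.2·log(0.2/0.3) + 0.5·log(0.5/0.4)
     = 0.6·log(1.2) + 0.2·log(0.667) + 0.5·log(1.25)
     = 0.6·0.1823 + 0.2·(−0.4055) + 0.5·0.2231
     = 0.1094 − 0.0811 + 0.1116
     = 0.1399
$

KL penalty = β · D_KL = 0.1 · 0.1399 = 0.0140.

The total reward for this response would be discounted by 0.014 to account for deviation from π_ref.

Caveat: this sum uses only the sampled token at each position, so it is an illustration rather than the exact divergence. The exact per-position KL sums π(v) log(π(v)/π_ref(v)) over the whole vocabulary, and a common single-sample estimate is the unweighted Σ_t log(π(y_t)/π_ref(y_t)). For these numbers that estimate is 0, because 0.6 · 0.2 · 0.5 = 0.5 · 0.3 · 0.4 = 0.06.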
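The arithmetic in this example can be reproduced with a few lines of Python. For comparison, the sketch also computes the plain sum of log-ratios over the sampled tokens, a commonly used single-sample estimate of the sequence-level KL. The variable names are illustrative.

```python
import math

pi_theta = [0.6, 0.2, 0.5]  # current policy, probabilities of the sampled tokens
pi_ref = [0.5, 0.3, 0.4]    # reference policy, same tokens
beta = 0.1

# Policy-weighted terms, as computed in the worked example.
weighted_terms = [p * math.log(p / q) for p, q in zip(pi_theta, pi_ref)]
weighted_sum = sum(weighted_terms)
penalty = beta * weighted_sum

# Unweighted sum of log-ratios: log pi(y|x) - log pi_ref(y|x) for the sampled y.
log_ratio_sum = sum(math.log(p / q) for p, q in zip(pi_theta, pi_ref))

print(weighted_terms, weighted_sum, penalty, log_ratio_sum)
```

The two quantities differ noticeably here: both policies assign the whole sequence probability 0.06, so the log-ratio sum is zero up to rounding, while the weighted sum is positive.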


Example 3: Optimal Policy Re-weighting

Problem: π_ref gives equal probability (0.5) to two responses: "Hello" and "Hi there". The reward model scores them as r("Hello") = 1.0, r("Hi there") = 3.0. With β = 0.5, what are the optimal policy probabilities?

Solution:

$π*(y) ∝ π_ref(y) · exp(r(y)/β)

For "Hello":    π_ref · exp(1.0/0.5) = 0.5 · exp(2.0) = 0.5 · 7.389 = 3.695
For "Hi there": π_ref · exp(3.0/0.5) = 0.5 · exp(6.0) = 0.5 · 403.4 = 201.7

Z = 3.695 + 201.7 = 205.4

π*("Hello") = 3.695 / 205.4 = 0.018 = 1.8%
π*("Hi there") = 201.7 / 205.4 = 0.982 = 98.2%
$

The optimal policy strongly prefers the higher-reward response. If we set β = 10.0:

$π*("Hello") ∝ 0.5·exp(0.1) = 0.5526
π*("Hi there") ∝ 0.5·exp(0.3) = 0.6749
Z = 1.2275
π*("Hello") = 0.45, π*("Hi there") = 0.55
$

High β keeps the policy close to π_ref (uniform in this example), which corresponds to conservative updates.
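Both settings of β in this example can be checked with a short script. The helper name `reweight` is hypothetical. With only two responses, the result reduces to a sigmoid of the reward gap divided by β.

```python
import math

def reweight(pi_ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalized."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

pi_ref = [0.5, 0.5]   # "Hello", "Hi there"
rewards = [1.0, 3.0]

sharp = reweight(pi_ref, rewards, beta=0.5)
flat = reweight(pi_ref, rewards, beta=10.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```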



Quiz

Q1: What does the Bradley-Terry model primarily refer to in this subject?

A) A historical anecdote about the Bradley-Terry model
B) The definition and application of the Bradley-Terry model
C) A visual representation of the Bradley-Terry model
D) A computational error related to the Bradley-Terry model

Correct: B)

Q2: What is the primary purpose of the KL-constrained objective?

A) It is used to maximize reward while keeping the policy close to the reference policy
B) It replaces all other methods in this domain
C) It is used only in advanced research contexts
D) It is primarily a historical notation system

Correct: A)

Q3: Which statement about the optimal policy is TRUE?

A) The optimal policy is not related to this subject
B) The optimal policy is a fundamental concept covered in this subject
C) The optimal policy is mentioned only as a historical footnote
D) The optimal policy is an advanced topic beyond this subject's scope

Correct: B)

Q4: In Practice Problem 1, what is the reward model loss when r_w = 1.5 and r_l = 1.2?

A) The inverse of the correct answer
B) A different result from a common mistake
C) Approximately 0.554
D) An unrelated numerical value

Correct: C)

Q5: How are the optimal policy and PPO-clip related?

A) The optimal policy and PPO-clip are closely related concepts
B) The optimal policy and PPO-clip are completely unrelated topics
C) The optimal policy is the inverse of PPO-clip
D) The optimal policy is a special case of PPO-clip

Correct: A)

Q6: What is a common pitfall when working with the Bradley-Terry model?

A) The main error with the Bradley-Terry model is using it when it is not needed
B) A common mistake is confusing the Bradley-Terry model with a similar concept
C) The Bradley-Terry model has no common misconceptions
D) The Bradley-Terry model is always computed the same way in all contexts

Correct: B)

Q7: When should you apply a reward model?

A) Apply a reward model to solve problems in this subject's domain
B) Use a reward model only in pure mathematics contexts
C) A reward model is not practically useful
D) Avoid a reward model unless explicitly instructed

Correct: A)

Practice Problems

Problem 1

For a preference pair, r_w = 1.5 and r_l = 1.2. Compute the reward model loss. Then compute it for r_w = 1.5, r_l = −3.0. Why does the larger gap give the smaller loss?

Answer

Case 1: Δr = 0.3. L = −log σ(0.3) = −log(0.5744) = 0.5544. Case 2: Δr = 4.5. L = −log σ(4.5) = −log(0.9890) = 0.0110.

The second loss is smaller because the model is MORE confident about the correct ordering. The loss depends on the sigmoid of Δr: once Δr ≫ 1, the sigmoid saturates and the loss approaches 0. At Δr = 0 (model unsure) the loss is ln 2 ≈ 0.693, and it grows roughly linearly as Δr becomes more negative (model confidently wrong).
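The shape of the loss as a function of the gap Δr = r_w − r_l can be tabulated directly. This is a small sketch: `pair_loss` is a hypothetical helper, and the gaps include the two cases from this problem plus a tie and two mis-ranked pairs for contrast.

```python
import math

def pair_loss(delta_r: float) -> float:
    """-log sigma(delta_r), written as a stable softplus of -delta_r."""
    z = -delta_r
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

gaps = [-4.5, -0.3, 0.0, 0.3, 4.5]
losses = [pair_loss(d) for d in gaps]

for d, loss in zip(gaps, losses):
    print(f"delta_r = {d:+.1f}  loss = {loss:.4f}")
```

The loss decreases monotonically as the gap grows, so a confidently correct ranking is nearly free while a confidently wrong one costs about |Δr|.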

Problem 2

Derive that for β → 0, the KL-constrained optimal policy assigns probability 1 to argmax_y r(x,y). (Assume a unique maximum.)

Answer

π*(y) ∝ π_ref(y) · exp(r(y)/β). As β → 0, exp(r(y)/β) grows fastest for the y with the largest r(y). Let r* = max_y r(y) and y* = argmax_y r(y). Then:

$π*(y*) ∝ π_ref(y*) · exp(r*/β)
π*(y) ∝ π_ref(y) · exp(r(y)/β)   for y ≠ y*
$

The ratio is:

$π*(y)/π*(y*) = (π_ref(y)/π_ref(y*)) · exp((r(y)−r*)/β) → 0 as β → 0
$

since r(y)−r* < 0 and dividing by β → 0 sends the exponent to −∞. So π*(y*) → 1 and all other π*(y) → 0, provided π_ref(y*) > 0. The policy becomes deterministic: it always picks the highest-reward response.

Problem 3

Show that the KL-constrained objective max_π E[r] − β·D_KL(π||π_ref) is equivalent to maximizing the expected shaped reward r + β·log π_ref plus an entropy bonus β·H(π). (Hint: expand D_KL.)

Answer

$D_KL(π||π_ref) = E_{y~π}[log π(y) − log π_ref(y)]
               = −H(π) − E_{y~π}[log π_ref(y)]
$

So:

$J(π) = E[r] − β·(−H(π) − E[log π_ref])
     = E[r] + β·H(π) + β·E[log π_ref]
     = E[r + β·log π_ref] + β·H(π)
$

The first term rewards responses that π_ref assigns high probability to (like a shaped reward). The second term is an entropy bonus encouraging exploration. This shows RLHF naturally balances reward optimization, staying close to the reference, and maintaining diversity.
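The decomposition can be verified numerically on any small discrete example. The distributions, rewards, and β below are arbitrary illustrative values.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

def kl(p, q):
    """D_KL(p || q) for discrete distributions."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Hypothetical policy, reference policy, rewards, and KL coefficient.
pi = [0.2, 0.5, 0.3]
pi_ref = [0.4, 0.4, 0.2]
rewards = [1.0, 0.5, 2.0]
beta = 0.3

# Left-hand side: E[r] - beta * D_KL(pi || pi_ref).
lhs = sum(p * r for p, r in zip(pi, rewards)) - beta * kl(pi, pi_ref)

# Right-hand side: E[r + beta * log pi_ref] + beta * H(pi).
shaped = sum(p * (r + beta * math.log(q)) for p, r, q in zip(pi, rewards, pi_ref))
rhs = shaped + beta * entropy(pi)

print(lhs, rhs)
```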

Problem 4

The PPO-clip objective clips ρ_t to [1−ε, 1+ε]. If ε=0.2, π_old gives probability 0.1 to a token, and π_θ gives probability 0.3, what is the effective probability ratio used in the loss? What if π_θ gives 0.05?

Answer

Case 1: ρ = 0.3/0.1 = 3.0, which is above 1+ε = 1.2, so clip(ρ) = 1.2.
Case 2: ρ = 0.05/0.1 = 0.5, which is below 1−ε = 0.8, so clip(ρ) = 0.8.

Because the objective takes min(ρ·A, clip(ρ)·A), the clipped value is used only when it gives the smaller term. In Case 1 the effective ratio is 1.2 when A > 0 and stays 3.0 when A < 0. In Case 2 it is 0.8 when A < 0 and stays 0.5 when A > 0.

Clipping prevents the policy from changing TOO much from the old policy: even if the advantage says "increase this token's probability dramatically," the update is bounded. This prevents destructive policy collapse where one bad update ruins the model.

Problem 5

Explain why the reward model is trained on PREFERENCES rather than direct scores. Derive what would happen if humans gave absolute scores (0-10) instead of pairwise preferences.

Answer

Preferences are more RELIABLE than absolute scores. Humans disagree on absolute scales (calibration varies) but agree on pairwise comparisons more consistently. Mathematically:

With preferences (Bradley-Terry): Only the DIFFERENCE r_w − r_l matters. The absolute offset of r is irrelevant: shifting all rewards by a constant doesn't change P(y₁≻y₂). This makes the training robust to calibration shifts.

With absolute scores: We'd need to minimize ||r(x,y) − score||². This is harder because:

  1. Different annotators have different baselines (one's "7" is another's "5")
  2. The model must learn absolute magnitudes, which is more complex
  3. Small errors in absolute prediction can cause large losses, making training less stable

Preferences are mathematically elegant: they reduce the problem to ORDERING, which is both more reliable and simpler to optimize.

Summary

  1. RLHF is a three-stage process: SFT teaches instruction following → Reward Model learns human preferences → PPO optimizes the policy against the reward model with a KL penalty
  2. The Bradley-Terry model converts pairwise preferences into a probability: P(y₁ ≻ y₂) = σ(r(x,y₁) − r(x,y₂)), leading to a simple logistic loss for the reward model
  3. The KL-constrained objective max E[r] − β·D_KL(π||π_ref) balances reward optimization against staying close to the reference policy, preventing reward hacking
  4. The optimal policy has closed form π*(y) ∝ π_ref(y)·exp(r(y)/β) — the reference policy re-weighted by exponentiated reward
  5. PPO-clip provides a practical algorithm for optimizing this objective, with probability ratio clipping preventing destructive updates

Pitfalls


Key Terms

| Term | Definition |
| --- | --- |
| RLHF | Three-stage pipeline (SFT → Reward Model → PPO) that aligns LLMs with human preferences |
| Bradley-Terry model | P(y₁ ≻ y₂) = σ(r(x,y₁) − r(x,y₂)); models pairwise preferences via latent rewards |
| Reward model | r_φ(x, y); trained on human preferences; scores how "good" a response is |
| KL-constrained RL | max_π E[r(x,y)] − β·D_KL(π ‖ π_ref); balances reward optimization against staying close to the reference policy |
| Optimal policy (closed form) | π*(y) ∝ π_ref(y)·exp(r(y)/β); reference policy re-weighted by exponentiated reward |
| β (KL coefficient) | Controls the reward-vs-stay-close tradeoff: β → 0 = greedy on reward, β → ∞ = no change from π_ref |
| PPO-clip | Clips probability ratio to [1−ε, 1+ε]; prevents destructive single-step policy changes |
| Reward hacking | Policy exploits reward model weaknesses (e.g., verbosity bias) to game high scores without genuine quality |
| Advantage | A_t = r(x,y) − β·log(π_θ(y\|x)/π_ref(y\|x)); the reward minus the KL penalty |

Next Steps

Continue to 20-07 — DPO (Direct Preference Optimization) to learn how to optimize human preferences DIRECTLY from preference data, without training a separate reward model — a simpler and often more effective alternative to RLHF.