20-07 — DPO (Direct Preference Optimization)

Phase: 20 — Training & Fine-tuning Mathematics
Subject: 20-07
Prerequisites: 20-06 (RLHF Mathematics), 13-04 (KL Divergence), 10-02 (Conditional Probability — Bradley-Terry model), 14-06 (Convex Sets and Functions — for loss analysis), 05-05 (Integration by Parts — for integration in KL)
Next subject: 20-08 — Constitutional AI and RLAIF

Learning Objectives

By the end of this subject, you will be able to:

Derive the DPO loss function from the Bradley-Terry model and the closed-form optimal RLHF policy — showing how the reward is eliminated algebraically
Prove that DPO and RLHF optimize the SAME underlying objective, but DPO does it directly without training a separate reward model
Compute the gradient of the DPO loss and analyze how it up-weights preferred responses and down-weights dispreferred ones
Compare DPO and RLHF in terms of stability, computational cost, and susceptibility to reward over-optimization
Explain the role of β in DPO and derive why it controls the same KL tradeoff as in RLHF

Core Content

1. The Motivation: Why DPO?

RLHF (20-06) has a fundamental problem: it's complex. Three stages, two models to train (reward model + policy), PPO's instability, and the ever-present risk of reward model overfitting.

DPO asks: can we optimize human preferences DIRECTLY, without a separate reward model?

The answer is YES — and the derivation is mathematically beautiful.

2. The Key Insight: Reward Reparameterization

Recall from 20-06 that the KL-constrained RL objective has a closed-form optimal policy:

$π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)
$

where Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β) is the partition function.

The genius move: solve for r in terms of π:

$r(x,y) = β · log(π*(y|x) / π_ref(y|x)) + β · log Z(x)
$

This expresses the reward as a function of the policy and reference policy (plus an x-dependent constant β·log Z(x)).

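Implication: ANY policy π IMPLICITLY defines a reward function, β·log(π(y|x)/π_ref(y|x)), for which π is the optimal KL-constrained policy. This reward is determined only up to a term that depends on x alone (the β·log Z(x) above), and such a term does not change which response is preferred for a given prompt. So we don't need to model the reward separately: we can parameterize it through the policy itself.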
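The reparameterization can be checked numerically on a toy problem. The sketch below assumes a prompt with only three candidate responses, so Z(x) is a three-term sum; the values of `pi_ref`, `reward`, and `beta` are made up for illustration. It builds π* from the closed form and then recovers r from the log-ratio.

```python
import math

# Toy setup (all values are illustrative assumptions): three candidate
# responses to one prompt, with a reference policy and a reward for each.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
reward = [1.0, 0.0, -1.0]

# Closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r(x,y)/beta) / Z(x)
unnormalized = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(unnormalized)
pi_star = [u / Z for u in unnormalized]

# Invert it: r(x,y) = beta*log(pi*(y|x)/pi_ref(y|x)) + beta*log Z(x)
recovered = [
    beta * math.log(ps / pr) + beta * math.log(Z)
    for ps, pr in zip(pi_star, pi_ref)
]

for r, r_hat in zip(reward, recovered):
    print(f"reward={r:+.3f}  recovered={r_hat:+.3f}")
```

The recovered values match the original rewards up to floating-point error, which is all the algebra in this section claims.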

3. Plugging Into the Bradley-Terry Model

Under the Bradley-Terry preference model:

$P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l))
$

Substitute r(x,y) = β·log(π(y|x)/π_ref(y|x)) + β·log Z(x):

$r(x, y_w) − r(x, y_l) = β·log(π(y_w|x)/π_ref(y_w|x)) + β·log Z(x) − β·log(π(y_l|x)/π_ref(y_l|x)) − β·log Z(x)
$

The β·log Z(x) terms CANCEL! This is the critical cancellation that makes DPO possible:

$r(x, y_w) − r(x, y_l) = β·[log(π(y_w|x)/π_ref(y_w|x)) − log(π(y_l|x)/π_ref(y_l|x))]
$

Now the Bradley-Terry probability becomes:

$P(y_w ≻ y_l | x) = σ(β·log(π(y_w|x)/π_ref(y_w|x)) − β·log(π(y_l|x)/π_ref(y_l|x)))
$

⚠️ THIS IS CRITICAL — The partition function Z(x) cancels out. This is what makes DPO computationally feasible — we never need to compute the intractable sum over all possible responses.
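To make the cancellation concrete, the following sketch adds a prompt-level term β·log Z(x) to both implicit rewards and shows that the Bradley-Terry probability does not move. The log-ratios and the candidate `log_Z` values are arbitrary numbers chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1
rho_w, rho_l = 0.5, -0.5   # assumed log-ratios log(pi/pi_ref) for y_w and y_l

probs = []
for log_Z in (0.0, 3.0, -7.5):          # any prompt-dependent constant
    r_w = beta * rho_w + beta * log_Z   # full reward of the preferred response
    r_l = beta * rho_l + beta * log_Z   # full reward of the dispreferred response
    probs.append(sigmoid(r_w - r_l))    # Bradley-Terry preference probability
    print(f"log Z = {log_Z:+.1f}  P(y_w > y_l) = {probs[-1]:.6f}")
```

Every row prints the same probability, because only the difference r_w − r_l enters the sigmoid.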

4. The DPO Loss Function

Maximizing the log-likelihood of the observed preferences gives the DPO loss:

$L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_l)~D}[log σ(β·log(π_θ(y_w|x)/π_ref(y_w|x)) − β·log(π_θ(y_l|x)/π_ref(y_l|x)))]
$

Or more compactly, defining the implicit reward difference:

$ρ_θ(x, y) = log(π_θ(y|x) / π_ref(y|x))    (the "implicit reward" up to β)
Δρ = ρ_θ(x, y_w) − ρ_θ(x, y_l)

L_DPO = −E[log σ(β · Δρ)]
$
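A minimal sketch of this loss in plain Python, assuming the sequence-level log-probabilities log π_θ(y|x) and log π_ref(y|x) have already been computed for each pair. The function names `log_sigmoid` and `dpo_loss` are illustrative, not taken from any library.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is a list of sequence log-probabilities, one entry per pair.
    """
    losses = []
    for pw, pl, rw, rl in zip(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l):
        delta_rho = (pw - rw) - (pl - rl)          # rho(x, y_w) - rho(x, y_l)
        losses.append(-log_sigmoid(beta * delta_rho))
    return sum(losses) / len(losses)

# Sanity check: if the policy equals the reference, every delta_rho is 0
# and the loss is log 2.
loss_at_reference = dpo_loss([-5.0, -3.0], [-4.0, -6.0], [-5.0, -3.0], [-4.0, -6.0])
print(loss_at_reference)
```

The branch in `log_sigmoid` avoids overflow in `exp` for large |z|, which matters once log-ratios grow during training.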

5. Interpretation and Intuition

What the loss does:
- If π_θ assigns higher relative probability (vs π_ref) to y_w than to y_l, then Δρ > 0, σ(β·Δρ) > 0.5, and the loss is below log 2 ≈ 0.693; it approaches 0 as β·Δρ → ∞. GOOD.
- If π_θ assigns higher relative probability to y_l than to y_w, then Δρ < 0, σ(β·Δρ) < 0.5, and the loss exceeds 0.693; it grows without bound as β·Δρ → −∞. BAD.
- If π_θ assigns equal relative probability, Δρ = 0, σ(0) = 0.5, loss = −log(0.5) ≈ 0.693. The model is unsure.

The implicit reward:

$r_implicit(x,y) = β · log(π_θ(y|x) / π_ref(y|x))
$

This is the reward that π_θ is "acting as if" it's optimizing. A response that π_θ produces more often than π_ref has a positive implicit reward; a response produced less often has a negative implicit reward.

6. The DPO Gradient

The gradient of the DPO loss reveals what the optimization actually does:

$∇_θ L_DPO = −β · E[σ(−β·Δρ) · (∇_θ log π_θ(y_w|x) − ∇_θ log π_θ(y_l|x))]
$

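Breakdown:
- ∇_θ log π_θ(y|x) is the score (policy-gradient) direction: moving θ along it increases the log-probability of y
- The weight σ(−β·Δρ) = 1 − σ(β·Δρ) is the probability that the implicit reward assigns to the WRONG ordering (y_l ≻ y_w)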

So a gradient-descent step:
- INCREASES log π_θ(y_w|x): makes preferred responses MORE likely
- DECREASES log π_θ(y_l|x): makes dispreferred responses LESS likely
- Is weighted by how "wrong" the current policy is: the more the implicit reward currently favors y_l over y_w, the larger the gradient
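The gradient formula can be verified against finite differences on a small model. This sketch assumes a tabular softmax policy over three responses (logits `theta`), for which ∇_θ log π_θ(y) = onehot(y) − π_θ, so the difference of the two score vectors is onehot(y_w) − onehot(y_l). All numbers are illustrative.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(t - m) for t in logits))
    return [t - log_norm for t in logits]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_rho(theta, ref_logp, win, lose):
    logp = log_softmax(theta)
    return (logp[win] - ref_logp[win]) - (logp[lose] - ref_logp[lose])

def dpo_loss(theta, ref_logp, win, lose, beta):
    return -math.log(sigmoid(beta * delta_rho(theta, ref_logp, win, lose)))

def dpo_grad(theta, ref_logp, win, lose, beta):
    """Analytic gradient: -beta * sigma(-beta*delta_rho) * (onehot(win) - onehot(lose))."""
    weight = sigmoid(-beta * delta_rho(theta, ref_logp, win, lose))
    grad = [0.0] * len(theta)
    grad[win] = -beta * weight   # descent raises the winner's logit
    grad[lose] = beta * weight   # descent lowers the loser's logit
    return grad

theta = [0.2, -0.1, 0.4]                  # policy logits (assumed)
ref_logp = log_softmax([0.0, 0.0, 0.0])   # uniform reference (assumed)
win, lose, beta, eps = 0, 2, 0.1, 1e-6

analytic = dpo_grad(theta, ref_logp, win, lose, beta)
numeric = []
for i in range(len(theta)):
    plus = theta[:]
    plus[i] += eps
    minus = theta[:]
    minus[i] -= eps
    diff = dpo_loss(plus, ref_logp, win, lose, beta) - dpo_loss(minus, ref_logp, win, lose, beta)
    numeric.append(diff / (2 * eps))

for a, n in zip(analytic, numeric):
    print(f"analytic={a:+.6f}  numeric={n:+.6f}")
```

The two columns agree to within finite-difference error, and the logit of the response that appears in neither role has zero gradient.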

Compare this to RLHF with PPO, where:
- The reward model provides a scalar signal
- PPO uses that signal through advantage estimation, value functions, and clipping

DPO instead gets the gradient DIRECTLY from preference pairs, with no intermediate reward model.

7. The Role of β in DPO

β serves the same role as in RLHF — controlling the tradeoff between reward optimization and staying close to π_ref:

Small β (e.g., 0.01): The log-ratio has a small effect on the sigmoid input. The model is "forgiving" — it can deviate more from π_ref without being penalized. Higher reward optimization, more risk of overfitting.
Large β (e.g., 0.5): The log-ratio is amplified. Even small deviations from π_ref produce large changes in implicit reward. The model is forced to stay close to π_ref.

Relationship to KL divergence: under the Bradley-Terry assumption, the DPO loss targets the same optimal policy as RLHF's KL-constrained objective. β in DPO is exactly the same as β in the RLHF objective: it weights the KL penalty.
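One way to see the effect of β is to hold the log-ratio margin fixed and vary β. The sketch below uses an assumed margin Δρ = 1.0 and a handful of β values picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

delta_rho = 1.0   # fixed log-ratio margin between y_w and y_l (assumed)
rows = []
for beta in (0.01, 0.1, 0.5, 2.0):
    prob = sigmoid(beta * delta_rho)   # implied P(y_w > y_l)
    loss = -math.log(prob)             # per-pair DPO loss
    rows.append((beta, prob, loss))
    print(f"beta={beta:<5} P(y_w > y_l)={prob:.4f}  loss={loss:.4f}")
```

With small β the same margin leaves the loss close to log 2, so the optimizer keeps moving π_θ further from π_ref to reduce it. With large β the loss is already small at this margin, so there is little pressure to move further.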

8. DPO vs RLHF: Mathematical Comparison

Aspect	RLHF	DPO
Reward model	Separate model, trained on preferences	Implicit — no separate model
Training stages	3 (SFT, RM, PPO)	Potentially 1 (DPO directly on SFT model)
Loss function	RM: logistic loss on Δr; PPO: clipped policy gradient	Single logistic loss on preference pairs
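Stability	PPO can be unstable; requires careful tuning	Supervised-style gradient descent on a fixed dataset; typically more stable
Computational cost	High (RM training + PPO sampling + value model)	Lower (forward and backward passes of π_θ plus forward passes of the frozen π_ref on preference pairs; no sampling)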
Reward over-optimization	Policy can exploit RM weaknesses	Less susceptible (implicit reward tied to policy)
Sample efficiency	Higher (RM generalizes beyond seen pairs)	Lower (only learns from seen pairs)

When to use which:
- DPO: When you have preference data and want a simple, stable pipeline
- RLHF: When you have an online reward signal (e.g., from a learned reward model that can score arbitrary responses) and want to explore beyond the preference dataset
- Iterative DPO: DPO → generate new responses → collect preferences → DPO again; combines strengths of both

9. Practical DPO Configuration

Parameter	Typical Value
β	0.1 (tuned per dataset)
Learning rate	5e-7 to 1e-6
Epochs	1-3
Batch size	32-128 pairs
Reference model	SFT checkpoint (frozen)

The reference model π_ref is typically FROZEN during DPO training — it serves as the anchor. Only π_θ is updated.
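As a sketch, the table can be captured in a small configuration object. `DPOConfig` and its field names are hypothetical (they do not come from any particular library), and the defaults are single points picked from the ranges above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DPOConfig:
    """Hypothetical container for the hyperparameters in the table above."""
    beta: float = 0.1               # tuned per dataset
    learning_rate: float = 5e-7     # table gives 5e-7 to 1e-6
    num_epochs: int = 1             # table gives 1-3
    batch_size: int = 64            # preference pairs per batch; table gives 32-128
    freeze_reference: bool = True   # pi_ref is the frozen SFT checkpoint

config = DPOConfig()
print(config)
```

Making the dataclass frozen is a small guard against changing β or the reference-model setting partway through a run.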

Worked Examples

Example 1: Computing DPO Loss for a Single Pair

Problem: For a preference pair (x, y_w, y_l), π_ref assigns log-probabilities: log π_ref(y_w|x) = −5.0, log π_ref(y_l|x) = −4.0. The current policy π_θ assigns: log π_θ(y_w|x) = −4.5, log π_θ(y_l|x) = −4.5. β = 0.1. Compute the DPO loss.

Solution:

$ρ_w = log(π_θ(y_w)/π_ref(y_w)) = −4.5 − (−5.0) = 0.5
ρ_l = log(π_θ(y_l)/π_ref(y_l)) = −4.5 − (−4.0) = −0.5
Δρ = ρ_w − ρ_l = 0.5 − (−0.5) = 1.0

β·Δρ = 0.1 · 1.0 = 0.1
σ(0.1) = 1/(1+e^{−0.1}) = 1/(1+0.9048) = 0.5250

L_DPO = −log(0.5250) = 0.6444
$

Interpretation: π_θ is slightly better than π_ref at preferring y_w over y_l (it increased y_w's relative probability and decreased y_l's). β=0.1 is small, so the log-ratio difference of 1.0 only gives a modest confidence of 52.5% in the correct ordering.

Example 2: Analyzing When Loss is Zero

Problem: Under what conditions does L_DPO → 0? Derive the required relationship between π_θ and π_ref.

Full Solution

L_DPO = −log σ(β·Δρ) → 0 when σ(β·Δρ) → 1, i.e., when β·Δρ → ∞. β·Δρ → ∞ means:

$log(π_θ(y_w)/π_ref(y_w)) − log(π_θ(y_l)/π_ref(y_l)) → ∞
log(π_θ(y_w)·π_ref(y_l) / (π_θ(y_l)·π_ref(y_w))) → ∞
$

This requires π_θ(y_w)/π_θ(y_l) ≫ π_ref(y_w)/π_ref(y_l): the policy must assign MUCH higher relative probability to the winner vs the loser than the reference does. Because π_θ(y_w) ≤ 1, this ratio can only grow without bound if π_θ(y_l) → 0 (with π_ref fixed), so driving the loss to zero means pushing the loser's probability toward zero. In practice, the loss never reaches exactly zero because a softmax policy assigns strictly positive probability to every response.

Example 3: Gradient Direction Analysis

Problem: For the same example as above (ρ_w=0.5, ρ_l=−0.5, β=0.1), compute the DPO gradient weight σ(−β·Δρ) and explain which policy log-probabilities increase and which decrease.

Solution:

$σ(−β·Δρ) = σ(−0.1) = 1/(1+e^{0.1}) = 1/(1+1.1052) = 0.4750

∇_θ L_DPO = −β · 0.475 · (∇_θ log π_θ(y_w) − ∇_θ log π_θ(y_l))
         = −0.0475 · (∇log π_θ(y_w) − ∇log π_θ(y_l))
$

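Gradient descent moves θ along −∇_θ L_DPO = +0.0475 · (∇log π_θ(y_w) − ∇log π_θ(y_l)), so:
- ∇log π_θ(y_w) enters the update with coefficient +0.0475 → INCREASES log π_θ(y_w)
- ∇log π_θ(y_l) enters the update with coefficient −0.0475 → DECREASES log π_θ(y_l)

The weight 0.475 is the probability that the implicit reward currently assigns to the wrong ordering, so this pair still contributes a sizeable gradient. The weight shrinks toward 0 as the model becomes confident in the correct ordering.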
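The arithmetic in Examples 1 and 3 can be reproduced in a few lines. This is only a numerical check of the values given in the problem statements; the variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1
ref_logp_w, ref_logp_l = -5.0, -4.0   # log pi_ref(y_w|x), log pi_ref(y_l|x)
pol_logp_w, pol_logp_l = -4.5, -4.5   # log pi_theta(y_w|x), log pi_theta(y_l|x)

rho_w = pol_logp_w - ref_logp_w       # 0.5
rho_l = pol_logp_l - ref_logp_l       # -0.5
delta_rho = rho_w - rho_l             # 1.0

loss = -math.log(sigmoid(beta * delta_rho))   # Example 1
weight = sigmoid(-beta * delta_rho)           # Example 3 gradient weight
step_coeff = beta * weight                    # coefficient on the score difference

print(f"loss={loss:.4f}  weight={weight:.4f}  beta*weight={step_coeff:.4f}")
# prints: loss=0.6444  weight=0.4750  beta*weight=0.0475
```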

Quiz

Q1: Which quantity does a DPO gradient-descent step INCREASE?

A) The log-probability that π_θ assigns to the preferred response y_w
B) The log-probability that π_θ assigns to the dispreferred response y_l
C) The log-probability that π_ref assigns to y_w
D) The partition function Z(x)

Correct: A)

If you chose A: The update moves θ along +∇_θ log π_θ(y_w|x), which makes the preferred response more likely. Correct!
If you chose B: This is incorrect. The update moves θ along −∇_θ log π_θ(y_l|x), so y_l becomes less likely.
If you chose C: This is incorrect. π_ref is frozen during DPO training; only π_θ is updated.
If you chose D: This is incorrect. Z(x) cancels in the pairwise difference and never appears in the loss.

Q2: For a given pair, what determines how strongly a DPO step DECREASES log π_θ(y_l|x)?

A) The weight β·σ(−β·Δρ), which is larger when the policy currently favors y_l over y_w
B) The score from a separately trained reward model
C) The value of the partition function Z(x)
D) Nothing; every pair receives the same weight

Correct: A)

If you chose A: The gradient is scaled by β·σ(−β·Δρ), so pairs the policy currently orders wrongly get larger updates. Correct!
If you chose B: This is incorrect. DPO has no separate reward model; the weight comes from the implicit reward difference Δρ.
If you chose C: This is incorrect. Z(x) cancels and is never computed.
If you chose D: This is incorrect. The weight σ(−β·Δρ) varies from pair to pair with Δρ.

Q3: Which statement about the reward model in DPO is TRUE?

A) DPO trains a separate reward model before optimizing the policy
B) DPO has no notion of reward at all
C) The reward is implicit: r_implicit(x,y) = β·log(π_θ(y|x)/π_ref(y|x))
D) DPO requires an online reward model that scores sampled responses

Correct: C)

If you chose A: This is incorrect. Avoiding a separately trained reward model is the point of DPO.
If you chose B: This is incorrect. The policy still defines an implicit reward through its log-ratio against π_ref.
If you chose C: The reward is parameterized through the policy itself as β·log(π_θ(y|x)/π_ref(y|x)). Correct!
If you chose D: This is incorrect. DPO trains on a fixed set of preference pairs and does not score sampled responses.

Q4: In Example 1 (Δρ = 1.0, β = 0.1), what is the DPO loss?

A) 0.1000
B) 0.5250
C) 0.6444
D) 0.6931

Correct: C)

If you chose A: This is incorrect. 0.1 is β·Δρ, the input to the sigmoid, not the loss.
If you chose B: This is incorrect. 0.5250 is σ(0.1), the implied preference probability; the loss is its negative log.
If you chose C: L_DPO = −log σ(0.1) = −log(0.5250) = 0.6444. Correct!
If you chose D: This is incorrect. 0.6931 = log 2 is the loss when Δρ = 0, that is, when π_θ has the same relative preferences as π_ref.

Q5: How does DPO's implicit reward change the training stages compared with RLHF?

A) It adds a fourth stage after PPO
B) It replaces the SFT stage
C) It has no effect on the number of stages
D) It removes the separate reward-model stage, so DPO can run directly on the SFT model

Correct: D)

If you chose A: This is incorrect. DPO does not use PPO at all.
If you chose B: This is incorrect. The SFT checkpoint is still needed; it typically serves as the frozen reference π_ref.
If you chose C: This is incorrect. RLHF uses three stages (SFT, RM, PPO), while DPO needs no RM or PPO stage.
If you chose D: Because the reward is implicit in the policy, the reward-model stage disappears and DPO trains directly on preference pairs. Correct!

Q6: What is a common pitfall when implementing the DPO loss?

A) Freezing π_ref during training
B) Using log-probabilities of y_w and y_l instead of computing Z(x)
C) Letting π_ref receive gradient updates (for example, by sharing weights with π_θ)
D) Starting with β = 0.1

Correct: C)

If you chose A: This is incorrect. Freezing π_ref is required; it is the anchor that defines the implicit reward.
If you chose B: This is incorrect. The loss only needs the four log-probabilities; Z(x) cancels.
If you chose C: If π_ref changes during training, the implicit reward β·log(π_θ/π_ref) loses its meaning. Correct!
If you chose D: This is incorrect. β = 0.1 is the typical starting value listed in the configuration table.

Q7: Why is DPO training generally considered more stable than PPO-based RLHF?

A) It uses a larger reward model
B) It minimizes a single logistic loss on fixed preference pairs, with no sampling, value function, or clipping
C) It never uses a reference model
D) It sets β = 0

Correct: B)

If you chose A: This is incorrect. DPO has no separate reward model.
If you chose B: DPO is supervised-style gradient descent on a fixed dataset, which removes several sources of PPO instability. Correct!
If you chose C: This is incorrect. DPO depends on a frozen reference model π_ref.
If you chose D: This is incorrect. With β = 0 the sigmoid input is always 0 and the loss carries no preference signal.

Practice Problems

Problem 1

Derive the DPO loss starting from the RLHF optimal policy and the Bradley-Terry model. Show each algebraic step.

Answer

Step 1: RLHF optimal policy: π*(y|x) = (1/Z(x))·π_ref(y|x)·exp(r(x,y)/β)
Step 2: Solve for r: r(x,y) = β·log(π*(y|x)/π_ref(y|x)) + β·log Z(x)
Step 3: Substitute into Bradley-Terry:

$P(y_w ≻ y_l) = σ(r_w − r_l)
             = σ(β·log(π*(y_w)/π_ref(y_w)) + β·log Z − β·log(π*(y_l)/π_ref(y_l)) − β·log Z)
             = σ(β·log(π*(y_w)/π_ref(y_w)) − β·log(π*(y_l)/π_ref(y_l)))
$

Step 4: Parameterize π* by π_θ and maximize the likelihood of the observed preferences, which gives the DPO loss:

$L_DPO = −E[log P(y_w ≻ y_l)]
      = −E[log σ(β·log(π_θ(y_w)/π_ref(y_w)) − β·log(π_θ(y_l)/π_ref(y_l)))]
$

Z(x) cancels — the key insight.

Problem 2

For DPO with β = 0.5, π_ref(y_w) = 0.2, π_ref(y_l) = 0.3, π_θ(y_w) = 0.25, π_θ(y_l) = 0.35. Compute the DPO loss. What if π_θ(y_w) = 0.35, π_θ(y_l) = 0.05?

Answer

Case 1:
ρ_w = log(0.25/0.2) = log(1.25) = 0.2231
ρ_l = log(0.35/0.3) = log(1.167) = 0.1542
Δρ = 0.2231 − 0.1542 = 0.0689
β·Δρ = 0.5·0.0689 = 0.0345
σ(0.0345) ≈ 0.5086
L = −log(0.5086) ≈ 0.676
The model barely prefers y_w over y_l relative to π_ref, so the loss stays close to log 2 ≈ 0.693.

Case 2:
ρ_w = log(0.35/0.2) = log(1.75) = 0.5596
ρ_l = log(0.05/0.3) = log(0.1667) = −1.7918
Δρ = 0.5596 − (−1.7918) = 2.3514
β·Δρ = 0.5·2.3514 = 1.1757
σ(1.1757) ≈ 0.7642
L = −log(0.7642) ≈ 0.269
Much lower loss: the model strongly prefers the winner over the loser.

Problem 3

Prove that DPO with β → ∞ is equivalent to keeping π_θ = π_ref (no change). Hint: examine the gradient.

Answer

DPO gradient: ∇L = −β·E[σ(−β·Δρ)·(∇log π(y_w) − ∇log π(y_l))]. The per-pair weight is β·σ(−β·Δρ). As β → ∞:
1. If Δρ > 0: σ(−β·Δρ) ≈ e^{−β·Δρ}, so β·σ(−β·Δρ) → 0 (exponential decay beats linear growth). Pairs that are already ordered correctly by any fixed margin stop contributing.
2. If Δρ ≤ 0: σ(−β·Δρ) ≥ 0.5, so β·σ(−β·Δρ) ≥ β/2 → ∞. The gradient pushes Δρ upward very strongly until it becomes positive.
So training drives each Δρ to a small positive value and then stalls. Reaching a loss level ε needs only Δρ = −log(e^ε − 1)/β, which shrinks like 1/β, so in the limit the log-ratios of π_θ need not differ from those of π_ref at all. The same conclusion follows from the closed form π*(y|x) ∝ π_ref(y|x)·exp(r(x,y)/β): as β → ∞, exp(r/β) → 1 and π* → π_ref. Infinite β means the KL penalty dominates and no preference optimization occurs.
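A quick numerical look at the large-β limit, offered as a sketch rather than a proof. Setting −log σ(β·Δρ) = ε and solving gives Δρ = −log(e^ε − 1)/β, the margin needed to reach a per-pair loss of ε. The target loss and the β values below are arbitrary choices.

```python
import math

target_loss = 0.1   # assumed per-pair loss level to reach

# Sigmoid input needed for that loss: beta * delta_rho = -log(exp(eps) - 1)
logit_needed = -math.log(math.expm1(target_loss))

margins = []
for beta in (0.1, 1.0, 10.0, 100.0):
    margin = logit_needed / beta   # required log-ratio margin delta_rho
    margins.append(margin)
    achieved = math.log1p(math.exp(-beta * margin))   # -log sigmoid(beta * margin)
    print(f"beta={beta:<6} required delta_rho={margin:.4f}  loss={achieved:.4f}")
```

The required margin falls in proportion to 1/β, so for very large β the policy can satisfy the preference data while staying arbitrarily close to π_ref.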

Problem 4

Explain why DPO does NOT need to compute Z(x) = Σ_y π_ref(y|x)·exp(r(y)/β). What makes this sum computationally intractable for language models?

Answer

Z(x) sums over ALL possible responses y for a given instruction x. For a language model with vocabulary V and maximum response length L, there are V^L possible responses. For V≈32K tokens and L≈100, that's 32000^100 ≈ 10^450 possibilities — completely intractable. DPO avoids this by only comparing PAIRS of responses. The Z(x) term cancels in the difference r(y_w)−r(y_l), so we never need to compute the intractable sum. This cancellation is what makes DPO computationally practical.
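The size of that sum can be estimated in log space, since the count itself is far too large to enumerate. The vocabulary size and length below are the ones quoted in the answer.

```python
import math

vocab_size = 32_000
max_len = 100

# Number of length-100 token sequences is vocab_size ** max_len;
# work with its base-10 logarithm to keep the number readable.
log10_count = max_len * math.log10(vocab_size)
print(f"about 10^{log10_count:.1f} sequences of length {max_len}")
# prints: about 10^450.5 sequences of length 100
```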

Problem 5

A common criticism: "DPO can't explore beyond the preference dataset since it has no separate reward model." Is this true? What would happen if you ran DPO, then used the resulting policy to generate new responses, collected preferences on those, and ran DPO again?

Answer

The criticism is partially true for single-round DPO: it only learns from the existing preference pairs. However, iterative DPO addresses this:
Round 1: DPO on dataset D₁ → policy π₁
Round 2: Use π₁ to generate new responses for prompts, collect preferences → D₂
Round 3: DPO on D₂ → policy π₂
...
This "online" or "iterative" DPO lets the policy explore its own generation space and get feedback on it. Each round expands the preference data distribution, combining the simplicity of DPO with the exploration capability of RLHF. Iterative DPO has been reported to be competitive with RLHF while remaining simpler to implement.
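The loop can be sketched on a toy problem. Everything here is an assumption made for illustration: a tabular softmax policy over three responses, a stand-in labeler that always prefers the higher `true_reward`, and a reference policy kept fixed at its initial value across rounds (the text does not specify whether π_ref is refreshed between rounds).

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def collect_preferences(policy_probs, true_reward, n_pairs, rng):
    """Sample response pairs from the current policy and label them."""
    responses = list(range(len(policy_probs)))
    pairs = []
    while len(pairs) < n_pairs:
        a, b = rng.choices(responses, weights=policy_probs, k=2)
        if a == b:
            continue  # need two distinct responses to form a preference
        pairs.append((a, b) if true_reward[a] > true_reward[b] else (b, a))
    return pairs

def dpo_round(theta, ref_logp, pairs, beta, lr):
    """One pass of per-pair gradient descent on the DPO loss."""
    theta = theta[:]
    for win, lose in pairs:
        probs = softmax(theta)
        delta_rho = (math.log(probs[win]) - ref_logp[win]) - (
            math.log(probs[lose]) - ref_logp[lose]
        )
        step = lr * beta * sigmoid(-beta * delta_rho)
        theta[win] += step    # raise the preferred response's logit
        theta[lose] -= step   # lower the dispreferred response's logit
    return theta

rng = random.Random(0)
true_reward = [1.0, 0.0, -1.0]   # known only to the stand-in labeler
theta = [0.0, 0.0, 0.0]          # policy starts at the reference
ref_logp = [math.log(p) for p in softmax(theta)]

for round_idx in range(3):
    pairs = collect_preferences(softmax(theta), true_reward, n_pairs=50, rng=rng)
    theta = dpo_round(theta, ref_logp, pairs, beta=0.1, lr=1.0)
    print(round_idx + 1, [round(p, 3) for p in softmax(theta)])
```

Each round samples pairs from the policy produced by the previous round, which is what distinguishes this from a single pass over a static dataset.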

Summary

DPO eliminates the reward model by reparameterizing the reward as r(x,y) = β·log(π(y|x)/π_ref(y|x)) + β·log Z(x) and plugging this into the Bradley-Terry preference model
The partition function cancels in the pairwise comparison, making DPO computationally tractable — no need to sum over all possible responses
The DPO loss L = −E[log σ(β·log(π(y_w)/π_ref(y_w)) − β·log(π(y_l)/π_ref(y_l)))] directly optimizes the policy from preference pairs
β controls the same KL tradeoff as in RLHF: small β allows more deviation from π_ref; large β keeps the policy close to the reference
DPO is simpler and more stable than RLHF, with fewer moving parts, though RLHF's reward model can generalize beyond seen preference pairs

Pitfalls

Training with π_ref unfrozen. In DPO, the reference model π_ref must remain FROZEN throughout training — it's the anchor that defines the implicit reward r = β·log(π_θ/π_ref). If π_ref is also updated (e.g., by accidentally sharing weights or using the same model with gradients enabled), the implicit reward signal becomes meaningless and the optimization collapses. Always detach or freeze the reference model.
Using too small a β and causing policy collapse. With very small β (for example, below about 0.05), DPO tolerates large deviations from π_ref, so the policy can push probabilities to extremes (near 0 or 1) quickly. Once the dispreferred response's probability under π_θ approaches 0, the log-ratio becomes numerically unstable and the gradient weight for that pair vanishes, which makes the drift hard to undo. Start with β = 0.1 and tune downward only if needed.
Assuming single-pass DPO is sufficient. Training on a static preference dataset limits the policy to the response pairs it has seen; the policy never explores its own generation space. Iterative DPO (generate → collect preferences → retrain) lets the policy encounter and learn from its own mistakes, and it often outperforms a single pass. If budget allows, plan for 2-3 rounds.
Not aligning the reference model with the preference data distribution. DPO works best when π_ref already generates responses in the same style as the preference pairs. If π_ref is a base pre-trained model and the preferences are about chat formatting, the implicit reward signal will be dominated by format differences rather than content quality. Always SFT the reference model on the preference data's prompt-distribution first.
Comparing DPO and RLHF using only final reward scores. DPO optimizes an implicit reward, not an explicit one — so comparing DPO's implicit reward to RLHF's explicit reward model score is apples-to-oranges. DPO policies can score LOWER on an external reward model while being rated HIGHER by humans. Evaluate both methods with the same human preference benchmark, not just reward model metrics.

Key Terms

Term	Definition
DPO	Direct Preference Optimization — eliminates the reward model by reparameterizing reward through policy probabilities
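Reward reparameterization	r(x,y) = β·log(π(y|x)/π_ref(y|x)) + β·log Z(x): the reward written in terms of the policy and the reference policy
Partition function Z(x)	Σ_y π_ref(y|x)·exp(r(x,y)/β): normalizer of the optimal policy; cancels in the pairwise difference
Implicit reward	r_implicit = β·log(π_θ(y|x)/π_ref(y|x)): the reward that π_θ acts as if it is optimizing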
DPO loss	L = −E[log σ(β·log(π(y_w)/π_ref(y_w)) − β·log(π(y_l)/π_ref(y_l)))] — logistic loss on preference pairs
DPO gradient	Up-weights y_w log-probability, down-weights y_l; weighted by σ(−β·Δρ) = how "wrong" the model is
β in DPO	Same as RLHF β — controls KL tradeoff: small β allows more deviation, large β forces staying close to π_ref
Iterative DPO	DPO → generate new responses → collect new preferences → DPO again — combines DPO stability with RLHF exploration

Next Steps

Continue to 20-08 — Constitutional AI and RLAIF to learn how AI systems can align themselves using AI-generated feedback instead of human preferences, including constitutional principles and the RLAIF training loop.
