20-07 — DPO (Direct Preference Optimization)

Phase: 20 (Training & Fine-tuning Mathematics)
Subject: 20-07
Prerequisites: 20-06 (RLHF Mathematics), 13-04 (KL Divergence), 10-02 (Conditional Probability, for the Bradley-Terry model), 14-06 (Convex Sets and Functions, for loss analysis), 05-05 (Integration by Parts, for integration in KL)
Next subject: 20-08 (Constitutional AI and RLAIF)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the DPO loss function from the Bradley-Terry model and the closed-form optimal RLHF policy — showing how the reward is eliminated algebraically
  2. Prove that DPO and RLHF optimize the SAME underlying objective, but DPO does it directly without training a separate reward model
  3. Compute the gradient of the DPO loss and analyze how it up-weights preferred responses and down-weights dispreferred ones
  4. Compare DPO and RLHF in terms of stability, computational cost, and susceptibility to reward over-optimization
  5. Explain the role of β in DPO and derive why it controls the same KL tradeoff as in RLHF

Core Content

1. The Motivation: Why DPO?

RLHF (20-06) has a fundamental problem: it's complex. Three stages, two models to train (reward model + policy), PPO's instability, and the ever-present risk of reward model overfitting.

DPO asks: can we optimize human preferences DIRECTLY, without a separate reward model?

The answer is YES — and the derivation is mathematically beautiful.


2. The Key Insight: Reward Reparameterization

Recall from 20-06 that the KL-constrained RL objective has a closed-form optimal policy:

$π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)
$

where Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β) is the partition function.

The genius move: solve for r in terms of π:

$r(x,y) = β · log(π*(y|x) / π_ref(y|x)) + β · log Z(x)
$

This expresses the reward as a function of the policy and reference policy (plus an x-dependent constant β·log Z(x)).

Implication: ANY policy π IMPLICITLY defines a reward function, namely r(x,y) = β·log(π(y|x)/π_ref(y|x)), for which π is the optimal policy. This reward is the log-ratio of policy to reference probabilities scaled by β, and it is determined only up to a term that depends on x alone. So we don't need to model the reward separately: we can parameterize it implicitly through the policy itself.
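The inversion can be checked numerically on a toy problem. The sketch below assumes a hypothetical prompt with only three candidate responses, so that Z(x) is a three-term sum; the probabilities and rewards are made-up illustration values, not data from this subject.

```python
import math

beta = 0.5
# Hypothetical toy setup: three candidate responses for one prompt x.
pi_ref = [0.5, 0.3, 0.2]     # reference policy pi_ref(y|x)
reward = [1.0, 0.0, -1.0]    # an arbitrary reward r(x, y)

# Closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r(x,y)/beta) / Z(x)
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

# Invert the formula: r(x,y) = beta*log(pi*/pi_ref) + beta*log Z(x)
recovered = [
    beta * math.log(ps / pr) + beta * math.log(Z)
    for ps, pr in zip(pi_star, pi_ref)
]

print("pi*      :", [round(p, 4) for p in pi_star])
print("recovered:", [round(r, 6) for r in recovered])
```

The recovered values should match `reward` up to floating-point error, which is the reparameterization in action: the policy and the reference together carry all the information in the reward, apart from the β·log Z(x) term.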


3. Plugging Into the Bradley-Terry Model

Under the Bradley-Terry preference model:

$P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l))
$

Substitute r(x,y) = β·log(π(y|x)/π_ref(y|x)) + β·log Z(x):

$r(x, y_w) − r(x, y_l) = [β·log(π(y_w|x)/π_ref(y_w|x)) + β·log Z(x)] − [β·log(π(y_l|x)/π_ref(y_l|x)) + β·log Z(x)]
$

The β·log Z(x) terms CANCEL! This is the critical cancellation that makes DPO possible:

$r(x, y_w) − r(x, y_l) = β·[log(π(y_w|x)/π_ref(y_w|x)) − log(π(y_l|x)/π_ref(y_l|x))]
$

Now the Bradley-Terry probability becomes:

$P(y_w ≻ y_l | x) = σ(β·log(π(y_w|x)/π_ref(y_w|x)) − β·log(π(y_l|x)/π_ref(y_l|x)))
$

⚠️ THIS IS CRITICAL — The partition function Z(x) cancels out. This is what makes DPO computationally feasible — we never need to compute the intractable sum over all possible responses.
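A small numerical check of the cancellation, as a sketch under toy assumptions (three candidate responses, made-up rewards): the preference probability computed from rewards agrees with the one computed from policy log-ratios, and adding an arbitrary prompt-level constant to every reward changes neither.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def optimal_policy(pi_ref, reward, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y)/beta)."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def pref_prob_from_policy(pi, pi_ref, beta, w, l):
    """Bradley-Terry probability written with log-ratios only (no Z(x))."""
    margin = math.log(pi[w] / pi_ref[w]) - math.log(pi[l] / pi_ref[l])
    return sigmoid(beta * margin)

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]            # hypothetical reference policy
reward = [1.0, 0.0, -1.0]           # hypothetical rewards
shifted = [r + 7.0 for r in reward]  # same rewards plus a constant c(x)
w, l = 0, 2                          # indices of y_w and y_l

p_reward = sigmoid(reward[w] - reward[l])
p_policy = pref_prob_from_policy(optimal_policy(pi_ref, reward, beta), pi_ref, beta, w, l)
p_shifted = pref_prob_from_policy(optimal_policy(pi_ref, shifted, beta), pi_ref, beta, w, l)

print(p_reward, p_policy, p_shifted)
```

All three printed values should agree up to rounding. The log-ratio form never refers to the partition function, which is why it remains usable when the response space is far too large to enumerate.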


4. The DPO Loss Function

Replacing π with a trainable policy π_θ and maximizing the log-likelihood of the observed preferences (equivalently, minimizing the negative log-likelihood) gives the DPO loss:

$L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_l)~D}[log σ(β·log(π_θ(y_w|x)/π_ref(y_w|x)) − β·log(π_θ(y_l|x)/π_ref(y_l|x)))]
$

Or more compactly, defining the implicit reward difference:

$ρ_θ(x, y) = log(π_θ(y|x) / π_ref(y|x))    (the "implicit reward" up to β)
Δρ = ρ_θ(x, y_w) − ρ_θ(x, y_l)

L_DPO = −E[log σ(β · Δρ)]
$
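As a minimal sketch, the compact form of the loss can be written in a few lines of plain Python. The function names and the list-based batch layout are assumptions made for illustration; a real training loop would compute the sequence log-probabilities with a deep-learning framework and differentiate through them.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is a list of sequence-level log-probabilities,
    one entry per (x, y_w, y_l) pair.
    """
    losses = []
    for pw, pl, rw, rl in zip(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l):
        delta_rho = (pw - rw) - (pl - rl)  # rho(x, y_w) - rho(x, y_l)
        losses.append(-log_sigmoid(beta * delta_rho))
    return sum(losses) / len(losses)

# Policy identical to the reference: delta_rho = 0, so the loss is log 2.
loss_at_ref = dpo_loss([-5.0], [-4.0], [-5.0], [-4.0], beta=0.1)

# The pair from Worked Example 1 in this subject.
loss_example_1 = dpo_loss([-4.5], [-4.5], [-5.0], [-4.0], beta=0.1)

print(round(loss_at_ref, 4), round(loss_example_1, 4))
```

Only four log-probabilities per pair enter the computation: two from the trainable policy and two from the reference.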

5. Interpretation and Intuition

What the loss does:
- If π_θ assigns higher relative probability (vs π_ref) to y_w than to y_l, then Δρ > 0 and σ(β·Δρ) > 0.5; as Δρ grows, σ(β·Δρ) → 1 and the loss → 0. GOOD.
- If π_θ assigns higher relative probability to y_l than to y_w, then Δρ < 0 and σ(β·Δρ) < 0.5; as Δρ → −∞, σ(β·Δρ) → 0 and the loss → ∞. BAD.
- If π_θ assigns equal relative probability, then Δρ = 0, σ(0) = 0.5, and the loss is −log(0.5) ≈ 0.693. The model is unsure.

The implicit reward:

$r_implicit(x,y) = β · log(π_θ(y|x) / π_ref(y|x))
$

This is the reward that π_θ is "acting as if" it's optimizing. A response that π_θ produces more often than π_ref has a positive implicit reward; a response produced less often has a negative implicit reward.


6. The DPO Gradient

The gradient of the DPO loss reveals what the optimization actually does:

$∇_θ L_DPO = −β · E[σ(−β·Δρ) · (∇_θ log π_θ(y_w|x) − ∇_θ log π_θ(y_l|x))]
$

Breakdown:
- ∇_θ log π_θ(y|x) is the score function of the policy: moving θ along it increases the log-probability of y.
- The weight σ(−β·Δρ) = 1 − σ(β·Δρ) is the probability that the implicit reward model assigns to the wrong ordering, y_l ≻ y_w.

So a gradient-descent step θ ← θ − η·∇_θ L_DPO (with learning rate η):
- INCREASES log π_θ(y_w|x), making preferred responses MORE likely
- DECREASES log π_θ(y_l|x), making dispreferred responses LESS likely
- is weighted by how "wrong" the current policy is: the more the implicit reward currently favors y_l over y_w, the larger the update
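The gradient formula can be sanity-checked against finite differences. The sketch below assumes a toy softmax policy over three responses whose parameters θ are the logits themselves; for that policy ∇_θ log π_θ(y) = onehot(y) − π_θ, so the difference of the two score functions reduces to onehot(y_w) − onehot(y_l). All numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(t - m) for t in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(theta, ref_logp, w, l, beta):
    """Single-pair DPO loss for a softmax policy; also returns delta_rho."""
    logp = [math.log(p) for p in softmax(theta)]
    delta_rho = (logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l])
    return -math.log(sigmoid(beta * delta_rho)), delta_rho

beta, w, l = 0.1, 0, 1
theta = [0.2, -0.1, 0.4]                                    # policy logits
ref_logp = [math.log(p) for p in softmax([0.0, 0.3, 0.1])]  # frozen reference

_, delta_rho = dpo_loss(theta, ref_logp, w, l, beta)
weight = sigmoid(-beta * delta_rho)

# Analytic gradient: -beta * weight * (onehot(w) - onehot(l))
analytic = [0.0] * len(theta)
analytic[w] = -beta * weight
analytic[l] = beta * weight

# Central finite differences for comparison.
eps = 1e-6
numeric = []
for k in range(len(theta)):
    up, dn = list(theta), list(theta)
    up[k] += eps
    dn[k] -= eps
    diff = dpo_loss(up, ref_logp, w, l, beta)[0] - dpo_loss(dn, ref_logp, w, l, beta)[0]
    numeric.append(diff / (2 * eps))

print("analytic:", analytic)
print("numeric :", numeric)
```

The loss gradient is negative on the logit of y_w and positive on the logit of y_l, so a descent step raises the first and lowers the second, each scaled by β·σ(−β·Δρ).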

Compare this to RLHF with PPO:
- In RLHF, the reward model provides a scalar signal
- PPO uses that signal through advantage estimation, value functions, and clipping
- DPO gets the gradient DIRECTLY from preference pairs, with no intermediate reward model


7. The Role of β in DPO

β serves the same role as in RLHF: it controls the tradeoff between reward optimization and staying close to π_ref. A small β allows more deviation from π_ref; a large β keeps the policy close to the reference.

Relationship to KL divergence: the DPO loss is derived from the optimal policy of RLHF's KL-constrained objective, so β in DPO is exactly the β that weights the KL penalty there.
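One way to see the effect of β concretely is to ask how large the log-ratio margin Δρ must be before the implicit preference probability σ(β·Δρ) reaches a given level. The helper below is a hypothetical illustration, not part of any library; the 90% target is an arbitrary choice.

```python
import math

def required_margin(target_prob, beta):
    """Margin delta_rho at which sigmoid(beta * delta_rho) equals target_prob."""
    logit = math.log(target_prob / (1.0 - target_prob))
    return logit / beta

for beta in (0.05, 0.1, 0.5, 1.0):
    margin = required_margin(0.9, beta)
    print(f"beta={beta:<4}  delta_rho needed for 90% confidence: {margin:.2f}")
```

The required margin scales as 1/β: roughly 22 nats at β = 0.1 versus about 4.4 at β = 0.5. With a small β the policy has to move its log-ratios far from those of π_ref before the loss is satisfied, while a large β is satisfied by a small deviation.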


8. DPO vs RLHF: Mathematical Comparison

| Aspect | RLHF | DPO |
| --- | --- | --- |
| Reward model | Separate model, trained on preferences | Implicit (no separate model) |
| Training stages | 3 (SFT, RM, PPO) | Potentially 1 (DPO directly on SFT model) |
| Loss function | RM: logistic loss on Δr; PPO: clipped policy gradient | Single logistic loss on preference pairs |
| Stability | PPO can be unstable; requires careful tuning | Simple gradient descent; generally more stable |
| Computational cost | High (RM training + PPO sampling + value model) | Lower (forward and backward passes on preference pairs, plus forward passes through the frozen π_ref) |
| Reward over-optimization | Policy can exploit RM weaknesses | Less susceptible (implicit reward tied to policy) |
| Sample efficiency | Higher (RM generalizes beyond seen pairs) | Lower (only learns from seen pairs) |

When to use which:
- DPO: when you have preference data and want a simple, stable pipeline
- RLHF: when you have an online reward signal (e.g., from a learned reward model that can score arbitrary responses) and want to explore beyond the preference dataset
- Iterative DPO: DPO → generate new responses → collect preferences → DPO again; combines strengths of both


9. Practical DPO Configuration

| Parameter | Typical Value |
| --- | --- |
| β | 0.1 (tuned per dataset) |
| Learning rate | 5e-7 to 1e-6 |
| Epochs | 1-3 |
| Batch size | 32-128 pairs |
| Reference model | SFT checkpoint (frozen) |

The reference model π_ref is typically FROZEN during DPO training — it serves as the anchor. Only π_θ is updated.


Worked Examples

Example 1: Computing DPO Loss for a Single Pair

Problem: For a preference pair (x, y_w, y_l), π_ref assigns log-probabilities: log π_ref(y_w|x) = −5.0, log π_ref(y_l|x) = −4.0. The current policy π_θ assigns: log π_θ(y_w|x) = −4.5, log π_θ(y_l|x) = −4.5. β = 0.1. Compute the DPO loss.

Solution:

$ρ_w = log(π_θ(y_w)/π_ref(y_w)) = −4.5 − (−5.0) = 0.5
ρ_l = log(π_θ(y_l)/π_ref(y_l)) = −4.5 − (−4.0) = −0.5
Δρ = ρ_w − ρ_l = 0.5 − (−0.5) = 1.0

β·Δρ = 0.1 · 1.0 = 0.1
σ(0.1) = 1/(1+e^{−0.1}) = 1/(1+0.9048) = 0.5250

L_DPO = −log(0.5250) = 0.6444
$

Interpretation: π_θ is slightly better than π_ref at preferring y_w over y_l (it increased y_w's relative probability and decreased y_l's). β=0.1 is small, so the log-ratio difference of 1.0 only gives a modest confidence of 52.5% in the correct ordering.


Example 2: Analyzing When Loss is Zero

Problem: Under what conditions does L_DPO → 0? Derive the required relationship between π_θ and π_ref.

Solution:

L_DPO = −log σ(β·Δρ) → 0 when σ(β·Δρ) → 1, i.e., when β·Δρ → ∞. For fixed β > 0, this means:
$log(π_θ(y_w)/π_ref(y_w)) − log(π_θ(y_l)/π_ref(y_l)) → ∞
log(π_θ(y_w)·π_ref(y_l) / (π_θ(y_l)·π_ref(y_w))) → ∞
$
This requires π_θ(y_w)/π_θ(y_l) ≫ π_ref(y_w)/π_ref(y_l): the policy must assign MUCH higher relative probability to the winner vs the loser than the reference does. Only the ratio matters. It is enough that π_θ(y_l) → 0 while π_θ(y_w) stays bounded away from 0 and π_ref is fixed; π_θ(y_w) does not have to approach 1. In practice the loss never reaches exactly zero, because σ(β·Δρ) < 1 for any finite Δρ.
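A quick numerical illustration, with made-up probabilities, of the fact that the loss depends only on the ratio π_θ(y_w)/π_θ(y_l) relative to the reference. Here π_θ(y_w) is held fixed at a value below π_ref(y_w), and the loss still goes to zero as π_θ(y_l) shrinks (the remaining probability mass is assumed to sit on other responses).

```python
import math

def dpo_pair_loss(p_w, p_l, ref_w, ref_l, beta):
    """DPO loss for one pair, given probabilities rather than log-probs."""
    delta_rho = math.log(p_w / ref_w) - math.log(p_l / ref_l)
    # log(1 + exp(-z)) is the same as -log(sigmoid(z))
    return math.log1p(math.exp(-beta * delta_rho))

beta, ref_w, ref_l = 1.0, 0.2, 0.3
p_w = 0.1  # fixed, and LOWER than the reference probability 0.2

loser_probs = (1e-1, 1e-3, 1e-6, 1e-9)
losses = [dpo_pair_loss(p_w, p_l, ref_w, ref_l, beta) for p_l in loser_probs]

for p_l, loss in zip(loser_probs, losses):
    print(f"pi_theta(y_l)={p_l:.0e}  loss={loss:.3e}")
```

So a low DPO loss on a pair does not by itself show that the preferred response has become likely in absolute terms; it only shows that the loser has become much less likely relative to the winner.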

Example 3: Gradient Direction Analysis

Problem: For the same example as above (ρ_w=0.5, ρ_l=−0.5, β=0.1), compute the DPO gradient weight σ(−β·Δρ) and explain which policy log-probabilities increase and which decrease.

Solution:

$σ(−β·Δρ) = σ(−0.1) = 1/(1+e^{0.1}) = 1/(1+1.1052) = 0.4750

∇_θ L_DPO = −β · 0.475 · (∇_θ log π_θ(y_w) − ∇_θ log π_θ(y_l))
         = −0.0475 · (∇log π_θ(y_w) − ∇log π_θ(y_l))
$

The gradient of the loss has coefficient −0.0475 on ∇log π_θ(y_w) and +0.0475 on ∇log π_θ(y_l). Gradient descent moves θ against the gradient, so the update direction is:
- +0.0475 · ∇log π_θ(y_w) → INCREASES π_θ(y_w)
- −0.0475 · ∇log π_θ(y_l) → DECREASES π_θ(y_l)

The weight 0.475 means the implicit reward model still assigns about 47.5% probability to the wrong ordering (y_l ≻ y_w), so there is significant room for improvement.



Quiz

Q1: Which quantity does a gradient-descent step on the DPO loss INCREASE?

A) The log-probability π_θ assigns to the preferred response y_w
B) The log-probability π_ref assigns to y_w
C) The partition function Z(x)
D) The coefficient β

Correct: A)

Q2: Which quantity does a gradient-descent step on the DPO loss DECREASE?

A) The log-probability π_θ assigns to the dispreferred response y_l
B) The log-probability π_ref assigns to y_l
C) The coefficient β
D) The number of preference pairs in the dataset

Correct: A)

Q3: Which statement about the reward model in DPO is TRUE?

A) DPO trains a separate reward model before updating the policy
B) DPO needs a reward model in order to compute Z(x)
C) DPO represents the reward implicitly as β·log(π_θ(y|x)/π_ref(y|x)), so no separate reward model is trained
D) DPO uses a reward model only for advantage estimation

Correct: C)

Q4: In Worked Example 1 (β = 0.1, Δρ = 1.0), what is the DPO loss?

A) 0.5250
B) 0.4750
C) 0.6444
D) 0.6931

Correct: C)

Q5: How does the treatment of the reward model relate to the number of training stages?

A) It has no effect; DPO and RLHF both need three stages
B) DPO adds a reward-model stage on top of the RLHF pipeline
C) The two are unrelated
D) Because DPO needs no separate reward model, it can drop that stage and train directly on the SFT model

Correct: D)

Q6: Which of the following is a pitfall when working with the DPO loss function?

A) Keeping π_ref frozen during training
B) Using preference pairs as training data
C) Confusing the log-ratio log(π_θ/π_ref) with the raw log-probability log π_θ; the loss depends on the difference of log-ratios
D) Applying the sigmoid to β·Δρ

Correct: C)

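Q7: According to the comparison in this subject, when is DPO's stability the deciding factor?

A) When you need an online reward signal that can score arbitrary responses
B) When you have a preference dataset and want a simple, stable pipeline
C) When no reference model is available
D) When you want to explore beyond the preference dataset in a single round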

Correct: B)

Practice Problems

Problem 1

Derive the DPO loss starting from the RLHF optimal policy and the Bradley-Terry model. Show each algebraic step.

Answer

Step 1: RLHF optimal policy: π*(y|x) = (1/Z(x))·π_ref(y|x)·exp(r(x,y)/β)
Step 2: Solve for r: r(x,y) = β·log(π*(y|x)/π_ref(y|x)) + β·log Z(x)
Step 3: Substitute into Bradley-Terry:
$P(y_w ≻ y_l) = σ(r_w − r_l)
             = σ(β·log(π*(y_w)/π_ref(y_w)) + β·log Z − β·log(π*(y_l)/π_ref(y_l)) − β·log Z)
             = σ(β·log(π*(y_w)/π_ref(y_w)) − β·log(π*(y_l)/π_ref(y_l)))
$
Step 4: Parameterize π* by a trainable π_θ and fit it by maximum likelihood on the preference data, which gives the DPO loss:
$L_DPO = −E[log P(y_w ≻ y_l)]
      = −E[log σ(β·log(π_θ(y_w)/π_ref(y_w)) − β·log(π_θ(y_l)/π_ref(y_l)))]
$
Z(x) cancels — the key insight.

Problem 2

For DPO with β = 0.5, π_ref(y_w) = 0.2, π_ref(y_l) = 0.3, π_θ(y_w) = 0.25, π_θ(y_l) = 0.35. Compute the DPO loss. What if π_θ(y_w) = 0.35, π_θ(y_l) = 0.05?

Answer

Case 1:
ρ_w = log(0.25/0.2) = log(1.25) = 0.2231
ρ_l = log(0.35/0.3) = log(1.167) = 0.1542
Δρ = 0.2231 − 0.1542 = 0.0689
β·Δρ = 0.5·0.0689 = 0.0345
σ(0.0345) ≈ 0.5086
L = −log(0.5086) ≈ 0.676
The model barely prefers y_w over y_l relative to π_ref, so the loss stays close to log 2 ≈ 0.693.

Case 2:
ρ_w = log(0.35/0.2) = log(1.75) = 0.5596
ρ_l = log(0.05/0.3) = log(0.1667) = −1.7918
Δρ = 0.5596 − (−1.7918) = 2.3514
β·Δρ = 0.5·2.3514 = 1.1757
σ(1.1757) ≈ 0.7642
L = −log(0.7642) ≈ 0.269
Much lower loss: the model strongly prefers the winner over the loser.

Problem 3

Prove that DPO with β → ∞ is equivalent to keeping π_θ = π_ref (no change). Hint: examine the gradient.

Answer

DPO gradient: ∇L = −β·E[σ(−β·Δρ)·(∇log π(y_w) − ∇log π(y_l))]. The effective weight on each pair is β·σ(−β·Δρ). As β → ∞:
1. If Δρ > 0: σ(−β·Δρ) decays like e^{−β·Δρ}, so β·σ(−β·Δρ) → 0 (exponential decay beats linear growth). The pair stops contributing to the gradient.
2. If Δρ ≤ 0: σ(−β·Δρ) ≥ 0.5, so β·σ(−β·Δρ) ≥ β/2 → ∞. The gradient pushes Δρ upward until it becomes positive.

The weight is therefore non-negligible only while Δρ is below a small positive margin that shrinks to zero as β grows (roughly like 1/β). Starting from π_θ = π_ref (Δρ = 0), training stops moving a pair as soon as its margin exceeds that threshold. In the limit β → ∞ the log-ratios log(π_θ/π_ref) needed to fit the data vanish, so π_θ stays arbitrarily close to π_ref and essentially no preference optimization occurs.

Problem 4

Explain why DPO does NOT need to compute Z(x) = Σ_y π_ref(y|x)·exp(r(y)/β). What makes this sum computationally intractable for language models?

Answer

Z(x) sums over ALL possible responses y for a given prompt x. For a language model with vocabulary size V and response length L, there are on the order of V^L possible responses. For V≈32K tokens and L≈100, that's 32000^100 ≈ 10^450 possibilities, which is completely intractable. DPO avoids this by only comparing PAIRS of responses. The Z(x) term cancels in the difference r(y_w)−r(y_l), so we never need to compute the intractable sum. This cancellation is what makes DPO computationally practical.
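The size estimate can be reproduced with a few lines of arithmetic. This is only a sketch of the count for sequences of exactly L tokens, using the V and L quoted above.

```python
import math

V, L = 32_000, 100         # vocabulary size and response length from the text
num_sequences = V ** L     # exact, using Python's arbitrary-precision integers

log10_count = L * math.log10(V)
num_digits = len(str(num_sequences))

print(f"log10(V^L) = {log10_count:.1f}")
print(f"V^L has {num_digits} decimal digits")
```

The exponent comes out at about 450.5, so the count is a number with 451 decimal digits, consistent with the ≈ 10^450 order of magnitude.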

Problem 5

A common criticism: "DPO can't explore beyond the preference dataset since it has no separate reward model." Is this true? What would happen if you ran DPO, then used the resulting policy to generate new responses, collected preferences on those, and ran DPO again?

Answer

The criticism is partially true for single-round DPO: it only learns from the existing preference pairs. However, iterative DPO addresses this:
Round 1: DPO on dataset D₁ → policy π₁
Round 2: Use π₁ to generate new responses for prompts, collect preferences → D₂
Round 3: DPO on D₂ → policy π₂
...

This "online" or "iterative" DPO lets the policy explore its own generation space and get feedback on it. Each round expands the preference data distribution, combining the simplicity of DPO with the exploration capability of RLHF. Iterative DPO has been reported to be competitive with RLHF while remaining simpler to implement.

Summary

  1. DPO eliminates the reward model by reparameterizing the reward as r(x,y) = β·log(π(y|x)/π_ref(y|x)), up to an x-dependent term β·log Z(x), and plugging this into the Bradley-Terry preference model
  2. The partition function cancels in the pairwise comparison, making DPO computationally tractable — no need to sum over all possible responses
  3. The DPO loss L = −E[log σ(β·log(π(y_w)/π_ref(y_w)) − β·log(π(y_l)/π_ref(y_l)))] directly optimizes the policy from preference pairs
  4. β controls the same KL tradeoff as in RLHF: small β allows more deviation from π_ref; large β keeps the policy close to the reference
  5. DPO is simpler and more stable than RLHF, with fewer moving parts, though RLHF's reward model can generalize beyond seen preference pairs

Pitfalls


Key Terms

| Term | Definition |
| --- | --- |
| DPO | Direct Preference Optimization: eliminates the reward model by reparameterizing reward through policy probabilities |
| Reward reparameterization | r(x,y) = β·log(π(y\|x)/π_ref(y\|x)) + β·log Z(x) |
| Partition function Z(x) | Σ_y π_ref(y\|x)·exp(r(x,y)/β) |
| Implicit reward | r_implicit = β·log(π_θ(y\|x)/π_ref(y\|x)) |
| DPO loss | L = −E[log σ(β·log(π(y_w)/π_ref(y_w)) − β·log(π(y_l)/π_ref(y_l)))], a logistic loss on preference pairs |
| DPO gradient | Up-weights y_w log-probability, down-weights y_l; weighted by σ(−β·Δρ) = how "wrong" the model is |
| β in DPO | Same as RLHF β; controls the KL tradeoff: small β allows more deviation, large β forces staying close to π_ref |
| Iterative DPO | DPO → generate new responses → collect new preferences → DPO again; combines DPO stability with RLHF exploration |

Next Steps

Continue to 20-08 — Constitutional AI and RLAIF to learn how AI systems can align themselves using AI-generated feedback instead of human preferences, including constitutional principles and the RLAIF training loop.