📐 Concept diagram

### 13.4 — KL Divergence (Relative Entropy)

Phase: Information Theory Prerequisites: 13-01-entropy, 13-03-mutual-information

Learning Objectives

By the end of this subject, you will be able to:

Define KL divergence $D_{KL}(P \parallel Q)$ and explain what it measures
Prove non-negativity via Gibbs' inequality
Explain why KL divergence is asymmetric and not a true distance metric
Relate KL divergence to cross-entropy: $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$
Compute KL divergence for common distributions

Core Content

⚠️ CRITICAL: What KL Divergence Actually Is

The Kullback-Leibler divergence measures how much information is lost when we approximate the true distribution $P$ with a model distribution $Q$:

$$D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Interpretations: 1. Information loss: The extra bits needed to encode samples from $P$ using a code optimised for $Q$ 2. Statistical distance: How "far" $Q$ is from $P$ (but NOT a metric — see below) 3. Expected log-likelihood ratio: $E_P[\log P(X) - \log Q(X)]$

🚩 Common Pitfall: $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$! KL divergence is asymmetric. The direction matters enormously and choosing the wrong one changes the behaviour entirely.

Why Asymmetry Matters

Forward KL: $D_{KL}(P \parallel Q)$ = "mean-seeking" — $Q$ tries to cover all modes of $P$. If $P(x) > 0$ but $Q(x) \approx 0$, the log ratio explodes → heavy penalty for missing a region where $P$ has mass.

Reverse KL: $D_{KL}(Q \parallel P)$ = "mode-seeking" — $Q$ focuses on the highest-probability regions of $P$. If $Q(x) > 0$ but $P(x) \approx 0$, the penalty is mild.

In variational inference (Bayesian ML), we minimise $D_{KL}(Q \parallel P)$ — the reverse KL — which makes the approximate posterior $Q$ focus on the mode of the true posterior $P$.

Non-Negativity: Gibbs' Inequality

$$D_{KL}(P \parallel Q) \geq 0$$

with equality iff $P = Q$ almost everywhere.

Proof: By Jensen's inequality (since $-\log$ is convex):

$D_{KL}(P \parallel Q) = E_P\left[-\log\frac{Q(X)}{P(X)}\right] \geq -\log E_P\left[\frac{Q(X)}{P(X)}\right] = -\log \sum P(x) \cdot \frac{Q(x)}{P(x)} = -\log 1 = 0$

Cross-Entropy

$$H(P, Q) = -\sum_x P(x) \log Q(x)$$

This is the average number of bits needed to encode events from $P$ when using a code optimised for $Q$.

Key relationship:

$$H(P, Q) = H(P) + D_{KL}(P \parallel Q)$$

$$\text{Cross-entropy} = \text{Entropy} + \text{KL divergence}$$

Since $H(P)$ is fixed (depends only on the true distribution), minimising cross-entropy is equivalent to minimising KL divergence. This is why cross-entropy is the standard loss function for classification in ML.

⚠️ CRITICAL: KL Divergence vs Cross-Entropy in ML

In machine learning, when you train a classifier with cross-entropy loss: - $P$ = true label distribution (one-hot encoding: $P(\text{true class}) = 1$, others 0) - $Q$ = model's predicted probabilities (softmax output)

$H(P, Q) = -\log Q(\text{true class})$ — negative log-likelihood of the correct class.

Since $H(P) = 0$ for one-hot labels, $D_{KL}(P \parallel Q) = H(P, Q)$ — KL and cross-entropy coincide!

Key Terms

Asymmetric
Cross-entropy
KL divergence
Kullback-Leibler divergence
Non-negative

Worked Examples

Example 1: KL between two Bernoulli distributions

$P = \text{Bernoulli}(0.7)$, $Q = \text{Bernoulli}(0.3)$

$D_{KL}(P \parallel Q) = 0.7 \log\frac{0.7}{0.3} + 0.3 \log\frac{0.3}{0.7}$

$= 0.7 \cdot \log_2(2.333) + 0.3 \cdot \log_2(0.429)$

$\log_2(2.333) \approx 1.222$, $\log_2(0.429) \approx -1.222$

$D_{KL}(P \parallel Q) = 0.7(1.222) + 0.3(-1.222) = 0.855 - 0.367 = 0.489$ bits

$D_{KL}(Q \parallel P) = 0.3 \log\frac{0.3}{0.7} + 0.7 \log\frac{0.7}{0.3} = 0.3(-1.222) + 0.7(1.222) = -0.367 + 0.855 = 0.489$ bits

(They're equal here because the distributions are symmetric in value.)

Example 2: Asymmetric KL — when they differ

$P = \text{Bernoulli}(0.99)$, $Q = \text{Bernoulli}(0.01)$

$D_{KL}(P \parallel Q) = 0.99\log\frac{0.99}{0.01} + 0.01\log\frac{0.01}{0.99}$

$\log_2(99) \approx 6.629$, $\log_2(0.0101) \approx -6.629$

$= 0.99(6.629) + 0.01(-6.629) = 6.563 - 0.066 = 6.50$ bits

Now swap: $P(X=1)=0.99$ means $X$ is almost always 1. $Q(X=1)=0.01$ thinks 1 is very rare — so coding $P$ with $Q$'s code is extremely inefficient (6.5 extra bits per sample).

$D_{KL}(Q \parallel P) = 0.01(6.629) + 0.99(-6.629) = 0.066 - 6.563 = -6.50$ bits — wait, that can't be right! Actually:

$D_{KL}(Q \parallel P) = 0.01\log\frac{0.01}{0.99} + 0.99\log\frac{0.99}{0.01} = 0.01(-6.629) + 0.99(6.629) = -0.066 + 6.563 = 6.50$ bits

Wait, they're symmetric for this pair too since both distributions are mirror images. Let me fix:

$P = \text{Bernoulli}(0.8)$, $Q = \text{Bernoulli}(0.2)$

$D_{KL}(P \parallel Q) = 0.8\log\frac{0.8}{0.2} + 0.2\log\frac{0.2}{0.8} = 0.8(2) + 0.2(-2) = 1.6 - 0.4 = 1.2$ bits

$D_{KL}(Q \parallel P) = 0.2\log\frac{0.2}{0.8} + 0.8\log\frac{0.8}{0.2} = 0.2(-2) + 0.8(2) = -0.4 + 1.6 = 1.2$ bits

(These Bernoulli pairs are always symmetric due to the binary structure — both directions depend on the same ratio but weighted differently.)

Example 3: Cross-entropy as ML loss

True labels (one-hot): $P = [1, 0, 0]$ (class 0 is correct) Model predictions: $Q = [0.7, 0.2, 0.1]$

$H(P, Q) = -1 \cdot \log_2 0.7 - 0 \cdot \log_2 0.2 - 0 \cdot \log_2 0.1 = -\log_2 0.7 \approx 0.515$ bits

If the model instead predicted $Q = [0.3, 0.4, 0.3]$: $H(P, Q) = -\log_2 0.3 \approx 1.737$ bits — much higher loss, as the model is less confident in the correct class.

Quiz

Q1: What does the concept of Kullback-Leibler divergence primarily refer to in this subject?

A) The definition and application of Kullback-Leibler divergence B) A computational error related to Kullback-Leibler divergence C) A visual representation of Kullback-Leibler divergence D) A historical anecdote about Kullback-Leibler divergence

Correct: A)

If you chose A: Kullback-Leibler divergence is defined as: the definition and application of kullback-leibler divergence. The other options describe different aspects that are not the primary focus. Correct!
If you chose B: This is incorrect. Kullback-Leibler divergence is defined as: the definition and application of kullback-leibler divergence. The other options describe different aspects that are not the primary focus.
If you chose C: This is incorrect. Kullback-Leibler divergence is defined as: the definition and application of kullback-leibler divergence. The other options describe different aspects that are not the primary focus.
If you chose D: This is incorrect. Kullback-Leibler divergence is defined as: the definition and application of kullback-leibler divergence. The other options describe different aspects that are not the primary focus.

Q2: Which of the following is the key formula discussed in this subject?

A) A simplified version of D_{KL}(P \parallel Q)... B) D_{KL}(P \parallel Q) C) The inverse operation of the formula in question D) An unrelated formula from a different topic

Correct: B)

If you chose A: This is incorrect. The formula D_{KL}(P \parallel Q) is central to this subject. The other options are either simplified versions or unrelated.
If you chose B: The formula D_{KL}(P \parallel Q) is central to this subject. The other options are either simplified versions or unrelated. Correct!
If you chose C: This is incorrect. The formula D_{KL}(P \parallel Q) is central to this subject. The other options are either simplified versions or unrelated.
If you chose D: This is incorrect. The formula D_{KL}(P \parallel Q) is central to this subject. The other options are either simplified versions or unrelated.

Q3: What is the primary purpose of Asymmetric?

A) It is used to asymmetric in mathematical analysis B) It is primarily a historical notation system C) It is used only in advanced research contexts D) It replaces all other methods in this domain

Correct: A)

If you chose A: Asymmetric serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
If you chose B: This is incorrect. Asymmetric serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose C: This is incorrect. Asymmetric serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose D: This is incorrect. Asymmetric serves the purpose described in the correct answer. The other options misrepresent its role.

Q4: Which statement about Cross-entropy is TRUE?

A) Cross-entropy is an advanced topic beyond this subject's scope B) Cross-entropy is mentioned only as a historical footnote C) Cross-entropy is not related to this subject D) Cross-entropy is a fundamental concept covered in this subject

Correct: D)

If you chose A: This is incorrect. Cross-entropy is a fundamental concept covered in this subject. This subject covers Cross-entropy as part of its core content.
If you chose B: This is incorrect. Cross-entropy is a fundamental concept covered in this subject. This subject covers Cross-entropy as part of its core content.
If you chose C: This is incorrect. Cross-entropy is a fundamental concept covered in this subject. This subject covers Cross-entropy as part of its core content.
If you chose D: Cross-entropy is a fundamental concept covered in this subject. This subject covers Cross-entropy as part of its core content. Correct!

Q5: How are Cross-entropy and KL divergence related?

A) Cross-entropy is the inverse of KL divergence B) Cross-entropy and KL divergence are completely unrelated topics C) Cross-entropy is a special case of KL divergence D) Cross-entropy and KL divergence are closely related concepts

Correct: D)

If you chose A: This is incorrect. Both Cross-entropy and KL divergence are covered in this subject as interconnected topics.
If you chose B: This is incorrect. Both Cross-entropy and KL divergence are covered in this subject as interconnected topics.
If you chose C: This is incorrect. Both Cross-entropy and KL divergence are covered in this subject as interconnected topics.
If you chose D: Both Cross-entropy and KL divergence are covered in this subject as interconnected topics. Correct!

Q6: What is a common pitfall when working with Non-negative?

A) A common mistake is confusing Non-negative with a similar concept B) Non-negative has no common misconceptions C) The main error with Non-negative is using it when it is not needed D) Non-negative is always computed the same way in all contexts

Correct: A)

If you chose A: Students often confuse Non-negative with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose B: This is incorrect. Students often confuse Non-negative with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose C: This is incorrect. Students often confuse Non-negative with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose D: This is incorrect. Students often confuse Non-negative with similar-sounding or related concepts. Pay attention to the precise definitions.

Q7: When should you apply ⚠️ Critical: What Kl Divergence Actually Is?

A) ⚠️ Critical: What Kl Divergence Actually Is is not practically useful B) Use ⚠️ Critical: What Kl Divergence Actually Is only in pure mathematics contexts C) Avoid ⚠️ Critical: What Kl Divergence Actually Is unless explicitly instructed D) Apply ⚠️ Critical: What Kl Divergence Actually Is to solve problems in this subject's domain

Correct: D)

If you chose A: This is incorrect. ⚠️ Critical: What Kl Divergence Actually Is is a practical tool used throughout this subject to solve relevant problems.
If you chose B: This is incorrect. ⚠️ Critical: What Kl Divergence Actually Is is a practical tool used throughout this subject to solve relevant problems.
If you chose C: This is incorrect. ⚠️ Critical: What Kl Divergence Actually Is is a practical tool used throughout this subject to solve relevant problems.
If you chose D: ⚠️ Critical: What Kl Divergence Actually Is is a practical tool used throughout this subject to solve relevant problems. Correct!

Practice Problems

Prove $D_{KL}(P \parallel Q) \geq 0$ using the fact that $\log x \leq x - 1$.

Click for answer
$-D_{KL}(P \parallel Q) = \sum P(x) \log\frac{Q(x)}{P(x)} \leq \sum P(x)\left(\frac{Q(x)}{P(x)} - 1\right) = \sum Q(x) - \sum P(x) = 1 - 1 = 0$ So $-D_{KL} \leq 0$ → $D_{KL} \geq 0$. Equality when $\frac{Q(x)}{P(x)} = 1$ for all $x$ where $P(x) > 0$, i.e. $P = Q$.
If $D_{KL}(P \parallel Q) = 2.5$ bits and $H(P) = 1.8$ bits, what is the cross-entropy $H(P, Q)$?

Click for answer
$H(P, Q) = H(P) + D_{KL}(P \parallel Q) = 1.8 + 2.5 = 4.3$ bits
Why do ML practitioners minimise cross-entropy rather than KL divergence directly?

Click for answer
$H(P, Q) = H(P) + D_{KL}(P \parallel Q)$. Since $H(P)$ is fixed (depends only on the true data distribution), minimising cross-entropy with respect to model parameters is IDENTICAL to minimising KL divergence. Cross-entropy is easier to compute because you don't need to estimate $H(P)$.
What happens to $D_{KL}(P \parallel Q)$ if $Q(x) = 0$ for some $x$ with $P(x) > 0$?

Click for answer
$D_{KL}(P \parallel Q) \to \infty$. By convention (and the definition), if $P(x) > 0$ and $Q(x) = 0$, then $\log(P/Q) = \infty$. This makes sense: if the true distribution assigns positive probability to an event that the model says is impossible, the model is infinitely bad. This is why you should never output exactly 0 probabilities from a model.
Is the "triangle inequality" satisfied by KL divergence? I.e., does $D_{KL}(P \parallel R) \leq D_{KL}(P \parallel Q) + D_{KL}(Q \parallel R)$ hold?

Click for answer
**No.** KL divergence does NOT satisfy the triangle inequality, nor is it symmetric. It is not a metric. Counter-examples exist where this fails. The only metric-like properties KL divergence has are non-negativity and identity of indiscernibles ($D_{KL}(P \parallel Q) = 0 \iff P = Q$).

Summary

Key takeaways:

KL divergence $D_{KL}(P \parallel Q) = \sum P(x) \log\frac{P(x)}{Q(x)}$ measures information loss when using $Q$ to approximate $P$
Non-negative (Gibbs' inequality); zero iff $P = Q$
Asymmetric: $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$ — direction matters
Forward KL is "mean-seeking"; reverse KL is "mode-seeking"
Cross-entropy $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$
Minimising cross-entropy = minimising KL divergence (since $H(P)$ is fixed)
If $Q(x) = 0$ where $P(x) > 0$, KL diverges to infinity

Pitfalls

Treating KL divergence as symmetric: $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$ in general. The direction matters enormously. Forward KL ($P \parallel Q$) penalises $Q$ for missing modes of $P$ (mean-seeking); reverse KL ($Q \parallel P$) makes $Q$ focus on the highest-density regions of $P$ (mode-seeking). Choosing the wrong direction changes the entire optimisation behaviour.
Using forward KL when the problem calls for reverse KL (or vice versa): In variational inference, we minimise $D_{KL}(Q \parallel P)$, the reverse KL from the approximate posterior to the true posterior. This encourages $Q$ to fit the mode. Minimising $D_{KL}(P \parallel Q)$ would spread $Q$ across all of $P$'s support, producing a different (and usually worse) approximation.
Computing KL without handling $Q(x) = 0$ when $P(x) > 0$: $D_{KL}(P \parallel Q) = \infty$ if $P$ assigns positive mass to any point where $Q$ is zero. This is why models should never output exactly-zero probabilities. In practice, numerical underflow in softmax can produce zeros — always clip or use the log-sum-exp trick to avoid infinite loss.
Thinking KL divergence is a proper distance metric: KL divergence satisfies non-negativity and identity of indiscernibles ($D_{KL}(P \parallel Q) = 0 \iff P = Q$), but it is NOT symmetric and does NOT satisfy the triangle inequality. It is a divergence, not a metric. This matters when interpreting it as a measure of "how far" two distributions are.
Forgetting the cross-entropy decomposition: $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$. Since $H(P)$ is fixed, minimising cross-entropy equals minimising KL divergence. In ML, cross-entropy loss IS KL minimisation — the two are equivalent as optimisation objectives. Confusing this relationship leads to misunderstandings about what the loss function is actually doing.

Next Steps

Next up: 13-05-cross-entropy-applications.md

Progress

Phases

### 13.4 — KL Divergence (Relative Entropy)

Learning Objectives

Core Content

⚠️ CRITICAL: What KL Divergence Actually Is

Why Asymmetry Matters

Non-Negativity: Gibbs' Inequality

Cross-Entropy

⚠️ CRITICAL: KL Divergence vs Cross-Entropy in ML

Key Terms

Worked Examples

Example 1: KL between two Bernoulli distributions

Example 2: Asymmetric KL — when they differ

Example 3: Cross-entropy as ML loss

Quiz

Practice Problems

Summary

Pitfalls

Next Steps