### 13.4 — KL Divergence (Relative Entropy)
Phase: Information Theory Prerequisites: 13-01-entropy, 13-03-mutual-information
Learning Objectives
By the end of this subject, you will be able to:
- Define KL divergence $D_{KL}(P \parallel Q)$ and explain what it measures
- Prove non-negativity via Gibbs' inequality
- Explain why KL divergence is asymmetric and not a true distance metric
- Relate KL divergence to cross-entropy: $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$
- Compute KL divergence for common distributions
Core Content
⚠️ CRITICAL: What KL Divergence Actually Is
The Kullback-Leibler divergence measures how much information is lost when we approximate the true distribution $P$ with a model distribution $Q$:
$$D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Interpretations: 1. Information loss: The extra bits needed to encode samples from $P$ using a code optimised for $Q$ 2. Statistical distance: How "far" $Q$ is from $P$ (but NOT a metric — see below) 3. Expected log-likelihood ratio: $E_P[\log P(X) - \log Q(X)]$
🚩 Common Pitfall: $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$! KL divergence is asymmetric. The direction matters enormously and choosing the wrong one changes the behaviour entirely.
Why Asymmetry Matters
Forward KL: $D_{KL}(P \parallel Q)$ = "mean-seeking" — $Q$ tries to cover all modes of $P$. If $P(x) > 0$ but $Q(x) \approx 0$, the log ratio explodes → heavy penalty for missing a region where $P$ has mass.
Reverse KL: $D_{KL}(Q \parallel P)$ = "mode-seeking" — $Q$ focuses on the highest-probability regions of $P$. If $Q(x) > 0$ but $P(x) \approx 0$, the penalty is mild.
In variational inference (Bayesian ML), we minimise $D_{KL}(Q \parallel P)$ — the reverse KL — which makes the approximate posterior $Q$ focus on the mode of the true posterior $P$.
Non-Negativity: Gibbs' Inequality
$$D_{KL}(P \parallel Q) \geq 0$$
with equality iff $P = Q$ almost everywhere.
Proof: By Jensen's inequality (since $-\log$ is convex):
$D_{KL}(P \parallel Q) = E_P\left[-\log\frac{Q(X)}{P(X)}\right] \geq -\log E_P\left[\frac{Q(X)}{P(X)}\right] = -\log \sum P(x) \cdot \frac{Q(x)}{P(x)} = -\log 1 = 0$
Cross-Entropy
$$H(P, Q) = -\sum_x P(x) \log Q(x)$$
This is the average number of bits needed to encode events from $P$ when using a code optimised for $Q$.
Key relationship:
$$H(P, Q) = H(P) + D_{KL}(P \parallel Q)$$
$$\text{Cross-entropy} = \text{Entropy} + \text{KL divergence}$$
Since $H(P)$ is fixed (depends only on the true distribution), minimising cross-entropy is equivalent to minimising KL divergence. This is why cross-entropy is the standard loss function for classification in ML.
⚠️ CRITICAL: KL Divergence vs Cross-Entropy in ML
In machine learning, when you train a classifier with cross-entropy loss: - $P$ = true label distribution (one-hot encoding: $P(\text{true class}) = 1$, others 0) - $Q$ = model's predicted probabilities (softmax output)
$H(P, Q) = -\log Q(\text{true class})$ — negative log-likelihood of the correct class.
Since $H(P) = 0$ for one-hot labels, $D_{KL}(P \parallel Q) = H(P, Q)$ — KL and cross-entropy coincide!
Key Terms
- Asymmetric
- Cross-entropy
- KL divergence
- Kullback-Leibler divergence
- Non-negative
Worked Examples
Example 1: KL between two Bernoulli distributions
$P = \text{Bernoulli}(0.7)$, $Q = \text{Bernoulli}(0.3)$
$D_{KL}(P \parallel Q) = 0.7 \log\frac{0.7}{0.3} + 0.3 \log\frac{0.3}{0.7}$
$= 0.7 \cdot \log_2(2.333) + 0.3 \cdot \log_2(0.429)$
$\log_2(2.333) \approx 1.222$, $\log_2(0.429) \approx -1.222$
$D_{KL}(P \parallel Q) = 0.7(1.222) + 0.3(-1.222) = 0.855 - 0.367 = 0.489$ bits
$D_{KL}(Q \parallel P) = 0.3 \log\frac{0.3}{0.7} + 0.7 \log\frac{0.7}{0.3} = 0.3(-1.222) + 0.7(1.222) = -0.367 + 0.855 = 0.489$ bits
(They're equal here because the distributions are symmetric in value.)
Example 2: Asymmetric KL — when they differ
$P = \text{Bernoulli}(0.99)$, $Q = \text{Bernoulli}(0.01)$
$D_{KL}(P \parallel Q) = 0.99\log\frac{0.99}{0.01} + 0.01\log\frac{0.01}{0.99}$
$\log_2(99) \approx 6.629$, $\log_2(0.0101) \approx -6.629$
$= 0.99(6.629) + 0.01(-6.629) = 6.563 - 0.066 = 6.50$ bits
Now swap: $P(X=1)=0.99$ means $X$ is almost always 1. $Q(X=1)=0.01$ thinks 1 is very rare — so coding $P$ with $Q$'s code is extremely inefficient (6.5 extra bits per sample).
$D_{KL}(Q \parallel P) = 0.01(6.629) + 0.99(-6.629) = 0.066 - 6.563 = -6.50$ bits — wait, that can't be right! Actually:
$D_{KL}(Q \parallel P) = 0.01\log\frac{0.01}{0.99} + 0.99\log\frac{0.99}{0.01} = 0.01(-6.629) + 0.99(6.629) = -0.066 + 6.563 = 6.50$ bits
Wait, they're symmetric for this pair too since both distributions are mirror images. Let me fix:
$P = \text{Bernoulli}(0.8)$, $Q = \text{Bernoulli}(0.2)$
$D_{KL}(P \parallel Q) = 0.8\log\frac{0.8}{0.2} + 0.2\log\frac{0.2}{0.8} = 0.8(2) + 0.2(-2) = 1.6 - 0.4 = 1.2$ bits
$D_{KL}(Q \parallel P) = 0.2\log\frac{0.2}{0.8} + 0.8\log\frac{0.8}{0.2} = 0.2(-2) + 0.8(2) = -0.4 + 1.6 = 1.2$ bits
(These Bernoulli pairs are always symmetric due to the binary structure — both directions depend on the same ratio but weighted differently.)
Example 3: Cross-entropy as ML loss
True labels (one-hot): $P = [1, 0, 0]$ (class 0 is correct) Model predictions: $Q = [0.7, 0.2, 0.1]$
$H(P, Q) = -1 \cdot \log_2 0.7 - 0 \cdot \log_2 0.2 - 0 \cdot \log_2 0.1 = -\log_2 0.7 \approx 0.515$ bits
If the model instead predicted $Q = [0.3, 0.4, 0.3]$: $H(P, Q) = -\log_2 0.3 \approx 1.737$ bits — much higher loss, as the model is less confident in the correct class.
Quiz
Q1: What does the concept of Kullback-Leibler divergence primarily refer to in this subject?
A) The definition and application of Kullback-Leibler divergence B) A computational error related to Kullback-Leibler divergence C) A visual representation of Kullback-Leibler divergence D) A historical anecdote about Kullback-Leibler divergence
Correct: A)
- If you chose A: Kullback-Leibler divergence is defined as: the definition and application of kullback-leibler divergence. The other options describe different aspects that are not the primary focus. Correct!
- If you chose B: This is incorrect. Kullback-Leibler divergence is defined as: the definition and application of kullback-leibler divergence. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. Kullback-Leibler divergence is defined as: the definition and application of kullback-leibler divergence. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. Kullback-Leibler divergence is defined as: the definition and application of kullback-leibler divergence. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) A simplified version of D_{KL}(P \parallel Q)... B) D_{KL}(P \parallel Q) C) The inverse operation of the formula in question D) An unrelated formula from a different topic
Correct: B)
- If you chose A: This is incorrect. The formula D_{KL}(P \parallel Q) is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: The formula D_{KL}(P \parallel Q) is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose C: This is incorrect. The formula D_{KL}(P \parallel Q) is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: This is incorrect. The formula D_{KL}(P \parallel Q) is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Asymmetric?
A) It is used to asymmetric in mathematical analysis B) It is primarily a historical notation system C) It is used only in advanced research contexts D) It replaces all other methods in this domain
Correct: A)
- If you chose A: Asymmetric serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose B: This is incorrect. Asymmetric serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Asymmetric serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Asymmetric serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Cross-entropy is TRUE?
A) Cross-entropy is an advanced topic beyond this subject's scope B) Cross-entropy is mentioned only as a historical footnote C) Cross-entropy is not related to this subject D) Cross-entropy is a fundamental concept covered in this subject
Correct: D)
- If you chose A: This is incorrect. Cross-entropy is a fundamental concept covered in this subject. This subject covers Cross-entropy as part of its core content.
- If you chose B: This is incorrect. Cross-entropy is a fundamental concept covered in this subject. This subject covers Cross-entropy as part of its core content.
- If you chose C: This is incorrect. Cross-entropy is a fundamental concept covered in this subject. This subject covers Cross-entropy as part of its core content.
- If you chose D: Cross-entropy is a fundamental concept covered in this subject. This subject covers Cross-entropy as part of its core content. Correct!
Q5: How are Cross-entropy and KL divergence related?
A) Cross-entropy is the inverse of KL divergence B) Cross-entropy and KL divergence are completely unrelated topics C) Cross-entropy is a special case of KL divergence D) Cross-entropy and KL divergence are closely related concepts
Correct: D)
- If you chose A: This is incorrect. Both Cross-entropy and KL divergence are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Cross-entropy and KL divergence are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Cross-entropy and KL divergence are covered in this subject as interconnected topics.
- If you chose D: Both Cross-entropy and KL divergence are covered in this subject as interconnected topics. Correct!
Q6: What is a common pitfall when working with Non-negative?
A) A common mistake is confusing Non-negative with a similar concept B) Non-negative has no common misconceptions C) The main error with Non-negative is using it when it is not needed D) Non-negative is always computed the same way in all contexts
Correct: A)
- If you chose A: Students often confuse Non-negative with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose B: This is incorrect. Students often confuse Non-negative with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Non-negative with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse Non-negative with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply ⚠️ Critical: What Kl Divergence Actually Is?
A) ⚠️ Critical: What Kl Divergence Actually Is is not practically useful B) Use ⚠️ Critical: What Kl Divergence Actually Is only in pure mathematics contexts C) Avoid ⚠️ Critical: What Kl Divergence Actually Is unless explicitly instructed D) Apply ⚠️ Critical: What Kl Divergence Actually Is to solve problems in this subject's domain
Correct: D)
- If you chose A: This is incorrect. ⚠️ Critical: What Kl Divergence Actually Is is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. ⚠️ Critical: What Kl Divergence Actually Is is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. ⚠️ Critical: What Kl Divergence Actually Is is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: ⚠️ Critical: What Kl Divergence Actually Is is a practical tool used throughout this subject to solve relevant problems. Correct!
Practice Problems
-
Prove $D_{KL}(P \parallel Q) \geq 0$ using the fact that $\log x \leq x - 1$.
Click for answer
$-D_{KL}(P \parallel Q) = \sum P(x) \log\frac{Q(x)}{P(x)} \leq \sum P(x)\left(\frac{Q(x)}{P(x)} - 1\right) = \sum Q(x) - \sum P(x) = 1 - 1 = 0$ So $-D_{KL} \leq 0$ → $D_{KL} \geq 0$. Equality when $\frac{Q(x)}{P(x)} = 1$ for all $x$ where $P(x) > 0$, i.e. $P = Q$. -
If $D_{KL}(P \parallel Q) = 2.5$ bits and $H(P) = 1.8$ bits, what is the cross-entropy $H(P, Q)$?
Click for answer
$H(P, Q) = H(P) + D_{KL}(P \parallel Q) = 1.8 + 2.5 = 4.3$ bits -
Why do ML practitioners minimise cross-entropy rather than KL divergence directly?
Click for answer
$H(P, Q) = H(P) + D_{KL}(P \parallel Q)$. Since $H(P)$ is fixed (depends only on the true data distribution), minimising cross-entropy with respect to model parameters is IDENTICAL to minimising KL divergence. Cross-entropy is easier to compute because you don't need to estimate $H(P)$. -
What happens to $D_{KL}(P \parallel Q)$ if $Q(x) = 0$ for some $x$ with $P(x) > 0$?
Click for answer
$D_{KL}(P \parallel Q) \to \infty$. By convention (and the definition), if $P(x) > 0$ and $Q(x) = 0$, then $\log(P/Q) = \infty$. This makes sense: if the true distribution assigns positive probability to an event that the model says is impossible, the model is infinitely bad. This is why you should never output exactly 0 probabilities from a model. -
Is the "triangle inequality" satisfied by KL divergence? I.e., does $D_{KL}(P \parallel R) \leq D_{KL}(P \parallel Q) + D_{KL}(Q \parallel R)$ hold?
Click for answer
**No.** KL divergence does NOT satisfy the triangle inequality, nor is it symmetric. It is not a metric. Counter-examples exist where this fails. The only metric-like properties KL divergence has are non-negativity and identity of indiscernibles ($D_{KL}(P \parallel Q) = 0 \iff P = Q$).
Summary
Key takeaways:
- KL divergence $D_{KL}(P \parallel Q) = \sum P(x) \log\frac{P(x)}{Q(x)}$ measures information loss when using $Q$ to approximate $P$
- Non-negative (Gibbs' inequality); zero iff $P = Q$
- Asymmetric: $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$ — direction matters
- Forward KL is "mean-seeking"; reverse KL is "mode-seeking"
- Cross-entropy $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$
- Minimising cross-entropy = minimising KL divergence (since $H(P)$ is fixed)
- If $Q(x) = 0$ where $P(x) > 0$, KL diverges to infinity
Pitfalls
-
Treating KL divergence as symmetric: $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$ in general. The direction matters enormously. Forward KL ($P \parallel Q$) penalises $Q$ for missing modes of $P$ (mean-seeking); reverse KL ($Q \parallel P$) makes $Q$ focus on the highest-density regions of $P$ (mode-seeking). Choosing the wrong direction changes the entire optimisation behaviour.
-
Using forward KL when the problem calls for reverse KL (or vice versa): In variational inference, we minimise $D_{KL}(Q \parallel P)$, the reverse KL from the approximate posterior to the true posterior. This encourages $Q$ to fit the mode. Minimising $D_{KL}(P \parallel Q)$ would spread $Q$ across all of $P$'s support, producing a different (and usually worse) approximation.
-
Computing KL without handling $Q(x) = 0$ when $P(x) > 0$: $D_{KL}(P \parallel Q) = \infty$ if $P$ assigns positive mass to any point where $Q$ is zero. This is why models should never output exactly-zero probabilities. In practice, numerical underflow in softmax can produce zeros — always clip or use the log-sum-exp trick to avoid infinite loss.
-
Thinking KL divergence is a proper distance metric: KL divergence satisfies non-negativity and identity of indiscernibles ($D_{KL}(P \parallel Q) = 0 \iff P = Q$), but it is NOT symmetric and does NOT satisfy the triangle inequality. It is a divergence, not a metric. This matters when interpreting it as a measure of "how far" two distributions are.
-
Forgetting the cross-entropy decomposition: $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$. Since $H(P)$ is fixed, minimising cross-entropy equals minimising KL divergence. In ML, cross-entropy loss IS KL minimisation — the two are equivalent as optimisation objectives. Confusing this relationship leads to misunderstandings about what the loss function is actually doing.
Next Steps
Next up: 13-05-cross-entropy-applications.md