Math graphic
📐 Concept diagram

### 13.4 — KL Divergence (Relative Entropy)

Phase: Information Theory Prerequisites: 13-01-entropy, 13-03-mutual-information

Learning Objectives

By the end of this subject, you will be able to:

  1. Define KL divergence $D_{KL}(P \parallel Q)$ and explain what it measures
  2. Prove non-negativity via Gibbs' inequality
  3. Explain why KL divergence is asymmetric and not a true distance metric
  4. Relate KL divergence to cross-entropy: $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$
  5. Compute KL divergence for common distributions

Core Content

⚠️ CRITICAL: What KL Divergence Actually Is

The Kullback-Leibler divergence measures how much information is lost when we approximate the true distribution $P$ with a model distribution $Q$:

$$D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Interpretations: 1. Information loss: The extra bits needed to encode samples from $P$ using a code optimised for $Q$ 2. Statistical distance: How "far" $Q$ is from $P$ (but NOT a metric — see below) 3. Expected log-likelihood ratio: $E_P[\log P(X) - \log Q(X)]$

🚩 Common Pitfall: $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$! KL divergence is asymmetric. The direction matters enormously and choosing the wrong one changes the behaviour entirely.

Why Asymmetry Matters

Forward KL: $D_{KL}(P \parallel Q)$ = "mean-seeking" — $Q$ tries to cover all modes of $P$. If $P(x) > 0$ but $Q(x) \approx 0$, the log ratio explodes → heavy penalty for missing a region where $P$ has mass.

Reverse KL: $D_{KL}(Q \parallel P)$ = "mode-seeking" — $Q$ focuses on the highest-probability regions of $P$. If $Q(x) > 0$ but $P(x) \approx 0$, the penalty is mild.

In variational inference (Bayesian ML), we minimise $D_{KL}(Q \parallel P)$ — the reverse KL — which makes the approximate posterior $Q$ focus on the mode of the true posterior $P$.

Non-Negativity: Gibbs' Inequality

$$D_{KL}(P \parallel Q) \geq 0$$

with equality iff $P = Q$ almost everywhere.

Proof: By Jensen's inequality (since $-\log$ is convex):

$D_{KL}(P \parallel Q) = E_P\left[-\log\frac{Q(X)}{P(X)}\right] \geq -\log E_P\left[\frac{Q(X)}{P(X)}\right] = -\log \sum P(x) \cdot \frac{Q(x)}{P(x)} = -\log 1 = 0$

Cross-Entropy

$$H(P, Q) = -\sum_x P(x) \log Q(x)$$

This is the average number of bits needed to encode events from $P$ when using a code optimised for $Q$.

Key relationship:

$$H(P, Q) = H(P) + D_{KL}(P \parallel Q)$$

$$\text{Cross-entropy} = \text{Entropy} + \text{KL divergence}$$

Since $H(P)$ is fixed (depends only on the true distribution), minimising cross-entropy is equivalent to minimising KL divergence. This is why cross-entropy is the standard loss function for classification in ML.

⚠️ CRITICAL: KL Divergence vs Cross-Entropy in ML

In machine learning, when you train a classifier with cross-entropy loss: - $P$ = true label distribution (one-hot encoding: $P(\text{true class}) = 1$, others 0) - $Q$ = model's predicted probabilities (softmax output)

$H(P, Q) = -\log Q(\text{true class})$ — negative log-likelihood of the correct class.

Since $H(P) = 0$ for one-hot labels, $D_{KL}(P \parallel Q) = H(P, Q)$ — KL and cross-entropy coincide!



Key Terms

Worked Examples

Example 1: KL between two Bernoulli distributions

$P = \text{Bernoulli}(0.7)$, $Q = \text{Bernoulli}(0.3)$

$D_{KL}(P \parallel Q) = 0.7 \log\frac{0.7}{0.3} + 0.3 \log\frac{0.3}{0.7}$

$= 0.7 \cdot \log_2(2.333) + 0.3 \cdot \log_2(0.429)$

$\log_2(2.333) \approx 1.222$, $\log_2(0.429) \approx -1.222$

$D_{KL}(P \parallel Q) = 0.7(1.222) + 0.3(-1.222) = 0.855 - 0.367 = 0.489$ bits

$D_{KL}(Q \parallel P) = 0.3 \log\frac{0.3}{0.7} + 0.7 \log\frac{0.7}{0.3} = 0.3(-1.222) + 0.7(1.222) = -0.367 + 0.855 = 0.489$ bits

(They're equal here because the distributions are symmetric in value.)

Example 2: Asymmetric KL — when they differ

$P = \text{Bernoulli}(0.99)$, $Q = \text{Bernoulli}(0.01)$

$D_{KL}(P \parallel Q) = 0.99\log\frac{0.99}{0.01} + 0.01\log\frac{0.01}{0.99}$

$\log_2(99) \approx 6.629$, $\log_2(0.0101) \approx -6.629$

$= 0.99(6.629) + 0.01(-6.629) = 6.563 - 0.066 = 6.50$ bits

Now swap: $P(X=1)=0.99$ means $X$ is almost always 1. $Q(X=1)=0.01$ thinks 1 is very rare — so coding $P$ with $Q$'s code is extremely inefficient (6.5 extra bits per sample).

$D_{KL}(Q \parallel P) = 0.01(6.629) + 0.99(-6.629) = 0.066 - 6.563 = -6.50$ bits — wait, that can't be right! Actually:

$D_{KL}(Q \parallel P) = 0.01\log\frac{0.01}{0.99} + 0.99\log\frac{0.99}{0.01} = 0.01(-6.629) + 0.99(6.629) = -0.066 + 6.563 = 6.50$ bits

Wait, they're symmetric for this pair too since both distributions are mirror images. Let me fix:

$P = \text{Bernoulli}(0.8)$, $Q = \text{Bernoulli}(0.2)$

$D_{KL}(P \parallel Q) = 0.8\log\frac{0.8}{0.2} + 0.2\log\frac{0.2}{0.8} = 0.8(2) + 0.2(-2) = 1.6 - 0.4 = 1.2$ bits

$D_{KL}(Q \parallel P) = 0.2\log\frac{0.2}{0.8} + 0.8\log\frac{0.8}{0.2} = 0.2(-2) + 0.8(2) = -0.4 + 1.6 = 1.2$ bits

(These Bernoulli pairs are always symmetric due to the binary structure — both directions depend on the same ratio but weighted differently.)

Example 3: Cross-entropy as ML loss

True labels (one-hot): $P = [1, 0, 0]$ (class 0 is correct) Model predictions: $Q = [0.7, 0.2, 0.1]$

$H(P, Q) = -1 \cdot \log_2 0.7 - 0 \cdot \log_2 0.2 - 0 \cdot \log_2 0.1 = -\log_2 0.7 \approx 0.515$ bits

If the model instead predicted $Q = [0.3, 0.4, 0.3]$: $H(P, Q) = -\log_2 0.3 \approx 1.737$ bits — much higher loss, as the model is less confident in the correct class.



Quiz

Q1: What does the concept of Kullback-Leibler divergence primarily refer to in this subject?

A) The definition and application of Kullback-Leibler divergence B) A computational error related to Kullback-Leibler divergence C) A visual representation of Kullback-Leibler divergence D) A historical anecdote about Kullback-Leibler divergence

Correct: A)

Q2: Which of the following is the key formula discussed in this subject?

A) A simplified version of D_{KL}(P \parallel Q)... B) D_{KL}(P \parallel Q) C) The inverse operation of the formula in question D) An unrelated formula from a different topic

Correct: B)

Q3: What is the primary purpose of Asymmetric?

A) It is used to asymmetric in mathematical analysis B) It is primarily a historical notation system C) It is used only in advanced research contexts D) It replaces all other methods in this domain

Correct: A)

Q4: Which statement about Cross-entropy is TRUE?

A) Cross-entropy is an advanced topic beyond this subject's scope B) Cross-entropy is mentioned only as a historical footnote C) Cross-entropy is not related to this subject D) Cross-entropy is a fundamental concept covered in this subject

Correct: D)

Q5: How are Cross-entropy and KL divergence related?

A) Cross-entropy is the inverse of KL divergence B) Cross-entropy and KL divergence are completely unrelated topics C) Cross-entropy is a special case of KL divergence D) Cross-entropy and KL divergence are closely related concepts

Correct: D)

Q6: What is a common pitfall when working with Non-negative?

A) A common mistake is confusing Non-negative with a similar concept B) Non-negative has no common misconceptions C) The main error with Non-negative is using it when it is not needed D) Non-negative is always computed the same way in all contexts

Correct: A)

Q7: When should you apply ⚠️ Critical: What Kl Divergence Actually Is?

A) ⚠️ Critical: What Kl Divergence Actually Is is not practically useful B) Use ⚠️ Critical: What Kl Divergence Actually Is only in pure mathematics contexts C) Avoid ⚠️ Critical: What Kl Divergence Actually Is unless explicitly instructed D) Apply ⚠️ Critical: What Kl Divergence Actually Is to solve problems in this subject's domain

Correct: D)

Practice Problems

  1. Prove $D_{KL}(P \parallel Q) \geq 0$ using the fact that $\log x \leq x - 1$.

    Click for answer $-D_{KL}(P \parallel Q) = \sum P(x) \log\frac{Q(x)}{P(x)} \leq \sum P(x)\left(\frac{Q(x)}{P(x)} - 1\right) = \sum Q(x) - \sum P(x) = 1 - 1 = 0$ So $-D_{KL} \leq 0$ → $D_{KL} \geq 0$. Equality when $\frac{Q(x)}{P(x)} = 1$ for all $x$ where $P(x) > 0$, i.e. $P = Q$.

  2. If $D_{KL}(P \parallel Q) = 2.5$ bits and $H(P) = 1.8$ bits, what is the cross-entropy $H(P, Q)$?

    Click for answer $H(P, Q) = H(P) + D_{KL}(P \parallel Q) = 1.8 + 2.5 = 4.3$ bits

  3. Why do ML practitioners minimise cross-entropy rather than KL divergence directly?

    Click for answer $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$. Since $H(P)$ is fixed (depends only on the true data distribution), minimising cross-entropy with respect to model parameters is IDENTICAL to minimising KL divergence. Cross-entropy is easier to compute because you don't need to estimate $H(P)$.

  4. What happens to $D_{KL}(P \parallel Q)$ if $Q(x) = 0$ for some $x$ with $P(x) > 0$?

    Click for answer $D_{KL}(P \parallel Q) \to \infty$. By convention (and the definition), if $P(x) > 0$ and $Q(x) = 0$, then $\log(P/Q) = \infty$. This makes sense: if the true distribution assigns positive probability to an event that the model says is impossible, the model is infinitely bad. This is why you should never output exactly 0 probabilities from a model.

  5. Is the "triangle inequality" satisfied by KL divergence? I.e., does $D_{KL}(P \parallel R) \leq D_{KL}(P \parallel Q) + D_{KL}(Q \parallel R)$ hold?

    Click for answer **No.** KL divergence does NOT satisfy the triangle inequality, nor is it symmetric. It is not a metric. Counter-examples exist where this fails. The only metric-like properties KL divergence has are non-negativity and identity of indiscernibles ($D_{KL}(P \parallel Q) = 0 \iff P = Q$).


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 13-05-cross-entropy-applications.md