Phase 11: Probability Theory II
Subject 11-10: Information Theory Connection
Prerequisites: 10-04 (Discrete Random Variables), 10-10 (Joint Distributions), 11-01 (Expectation), 11-03 (Conditional Expectation)
Learning Objectives
- Define Shannon entropy H(X) and explain its interpretation as expected information content
- Define joint entropy H(X, Y) and conditional entropy H(Y | X), and prove the chain rule H(X, Y) = H(X) + H(Y | X)
- Define Kullback-Leibler divergence D_{KL}(P || Q) and prove it is non-negative (Gibbs' inequality)
- Define mutual information I(X; Y) and relate it to entropy and KL divergence
- Apply information-theoretic concepts to probability: maximum entropy distributions, Fisher information, and data processing inequality
Core Content
1. Entropy — Measuring Uncertainty
The Shannon entropy of a discrete random variable X with PMF p(x) is:
$H(X) = −Σ_x p(x) log₂ p(x) = E[−log₂ p(X)] $
Units: Bits (when using log base 2). Can also use nats (natural log) or dits (log base 10).
Interpretation: - H(X) measures the "expected surprise" or uncertainty about X - −log₂ p(x) is the information content (in bits) of observing outcome x — rare events carry more information - H(X) is the minimum expected number of bits needed to encode X (Shannon's source coding theorem)
Properties of entropy: 1. Non-negativity: H(X) ≥ 0, with equality iff X is constant (deterministic) 2. Maximum entropy: For a random variable with k possible values, H(X) ≤ log₂ k, with equality iff X is uniform. The uniform distribution is "maximally uncertain." 3. H(X) depends ONLY on the probabilities, not the actual values: H(X) = H(p) = H(p₁, ..., pₖ)
Examples: - Fair coin: H(X) = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit - Biased coin (p=0.9): H(X) = −(0.9 log₂ 0.9 + 0.1 log₂ 0.1) ≈ 0.469 bits - Certain event (p=1): H(X) = 0 bits — no uncertainty
⚠️ CRITICAL: 0 log₂ 0 is defined as 0 (by continuity: lim_{p→0⁺} p log₂ p = 0).
Binary entropy function: For a Bernoulli(p) random variable:
$h(p) = −p log₂ p − (1−p) log₂ (1−p) $
h(p) is concave, symmetric about p = 1/2, with maximum h(1/2) = 1 bit.
2. Joint and Conditional Entropy
Joint entropy:
$H(X, Y) = −Σ_x Σ_y p(x, y) log₂ p(x, y) = E[−log₂ p(X, Y)] $
Conditional entropy:
$H(Y | X) = Σ_x p(x) H(Y | X = x) = −Σ_x Σ_y p(x, y) log₂ p(y | x) = E[−log₂ p(Y | X)] $
H(Y | X) is the expected remaining uncertainty about Y after knowing X.
⚠️ Important: H(Y | X) ≤ H(Y) — conditioning reduces entropy on average. "Information never hurts." Equality iff X and Y are independent.
The Chain Rule:
$H(X, Y) = H(X) + H(Y | X) $
Proof: H(X,Y) = −E[log p(X,Y)] = −E[log(p(X) p(Y|X))] = −E[log p(X)] − E[log p(Y|X)] = H(X) + H(Y|X).
More generally:
$H(X₁, X₂, ..., Xₙ) = Σ_{i=1}^{n} H(X_i | X₁, ..., X_{i−1})
$
3. Kullback-Leibler (KL) Divergence
The KL divergence (relative entropy) between two probability distributions P and Q on the same space:
$D_{KL}(P || Q) = Σ_x p(x) log₂ (p(x) / q(x)) = E_P[log₂ (p(X) / q(X))]
$
Interpretation: The "information lost" when Q is used to approximate P. Not a true distance metric — asymmetric and doesn't satisfy triangle inequality.
Gibbs' Inequality: D_{KL}(P || Q) ≥ 0, with equality iff P = Q almost everywhere.
Proof: Using log(x) ≤ x − 1 (Jensen's inequality on −ln x):
$−D_{KL} = Σ p(x) log(q(x)/p(x)) ≤ Σ p(x)(q(x)/p(x) − 1) = Σ q(x) − Σ p(x) = 0
$
So D_{KL} ≥ 0. Equality when q(x)/p(x) = 1 for all x with p(x) > 0.
Common misconception: D_{KL}(P || Q) ≠ D_{KL}(Q || P). The "forward" and "reverse" KL are different. In variational inference, minimizing D_{KL}(Q || P) (reverse KL) gives mode-seeking behavior; minimizing D_{KL}(P || Q) (forward KL) gives mean-seeking behavior.
Relation to maximum likelihood: Minimizing D_{KL}(P_{true} || P_{model}) is equivalent to maximizing the expected log-likelihood under the true distribution.
4. Mutual Information
The mutual information between X and Y measures how much knowing one variable reduces uncertainty about the other:
$I(X; Y) = D_{KL}(p(x,y) || p(x)p(y))
= Σ_x Σ_y p(x, y) log₂ [p(x, y) / (p(x) p(y))]
$
Equivalent expressions:
$I(X; Y) = H(X) − H(X | Y)
= H(Y) − H(Y | X)
= H(X) + H(Y) − H(X, Y)
$
Properties: 1. Symmetry: I(X; Y) = I(Y; X) 2. Non-negativity: I(X; Y) ≥ 0 (follows from D_{KL} ≥ 0) 3. Zero iff independent: I(X; Y) = 0 ⇔ X ⟂ Y 4. Self-information: I(X; X) = H(X)
⚠️ Unlike correlation, mutual information captures ANY dependence — linear or nonlinear. I(X; Y) = 0 means genuine independence, not just zero correlation.
Conditional mutual information: I(X; Y | Z) = H(X | Z) − H(X | Y, Z). The information X and Y share beyond what Z already tells us.
Data Processing Inequality (DPI): If X → Y → Z forms a Markov chain (Z depends on X only through Y), then:
$I(X; Z) ≤ I(X; Y) $
"Processing data cannot increase information." No transformation of Y can create new information about X.
5. Probability-Meets-Information Deep Connections
Maximum Entropy Principle: Among all distributions satisfying given constraints, the one that maximizes entropy is the "least informative" or "most conservative" choice. This gives: - Given only support [a, b] → Uniform(a, b) maximizes entropy - Given mean μ → Exponential(μ) maximizes entropy on [0, ∞) - Given mean μ and variance σ² → Normal(μ, σ²) maximizes entropy on R
The normal distribution is special: it's the maximum entropy distribution for a given mean and variance — it makes the fewest assumptions beyond those two constraints.
Fisher Information:
$I(θ) = E[(∂/∂θ log f(X; θ))²] = −E[∂²/∂θ² log f(X; θ)] $
measures how much information a sample carries about a parameter θ. Related to KL divergence: D_{KL}(P_θ || P_{θ+dθ}) ≈ (1/2) I(θ) (dθ)².
Cramér-Rao bound: Var(θ̂) ≥ 1 / (n I(θ)) — Fisher information sets a lower bound on the variance of any unbiased estimator.
Connection to Large Deviations: Sanov's theorem: the probability of an empirical distribution deviating from the true distribution decays exponentially at a rate given by KL divergence:
$P(deviate to Q) ≈ exp(−n D_{KL}(Q || P))
$
Key Terms
- KL divergence
- Shannon entropy
Worked Examples
Example 1: Entropy of a Discrete Distribution
X has PMF: p(1) = 0.5, p(2) = 0.25, p(3) = 0.125, p(4) = 0.125. Compute H(X).
Solution:
H(X) = −[0.5 log₂ 0.5 + 0.25 log₂ 0.25 + 0.125 log₂ 0.125 + 0.125 log₂ 0.125] = −[0.5(−1) + 0.25(−2) + 0.125(−3) + 0.125(−3)] = −[−0.5 − 0.5 − 0.375 − 0.375] = −[−1.75] = 1.75 bits
On average, you need 1.75 bits to encode each outcome.
Example 2: Mutual Information Computation
Joint PMF: p(0,0)=0.1, p(0,1)=0.3, p(1,0)=0.4, p(1,1)=0.2. Find I(X; Y).
Solution:
Marginals: p_X(0)=0.4, p_X(1)=0.6; p_Y(0)=0.5, p_Y(1)=0.5.
Joint entropy: H(X,Y) = −[0.1 log₂ 0.1 + 0.3 log₂ 0.3 + 0.4 log₂ 0.4 + 0.2 log₂ 0.2] = −[0.1(−3.322) + 0.3(−1.737) + 0.4(−1.322) + 0.2(−2.322)] = −[−0.332 − 0.521 − 0.529 − 0.464] = 1.846 bits.
H(X) = −[0.4 log₂ 0.4 + 0.6 log₂ 0.6] = −[0.4(−1.322) + 0.6(−0.737)] = 0.971 bits. H(Y) = 1 bit (since p_Y is uniform).
I(X;Y) = 0.971 + 1 − 1.846 = 0.125 bits. Knowing X reduces uncertainty about Y by 0.125 bits (out of a possible 1 bit). Weak dependence.
Example 3: KL Divergence
P = (0.5, 0.3, 0.2), Q = (0.3, 0.5, 0.2). Compute D_{KL}(P || Q).
Solution:
D_{KL}(P || Q) = 0.5 log₂(0.5/0.3) + 0.3 log₂(0.3/0.5) + 0.2 log₂(0.2/0.2) = 0.5 log₂(1.667) + 0.3 log₂(0.6) + 0.2 log₂(1) = 0.5(0.737) + 0.3(−0.737) + 0.2(0) = 0.368 − 0.221 + 0 = 0.147 bits
If we used Q to encode samples from P, we'd waste about 0.147 extra bits per sample (on average).
Quiz
Q1: Shannon entropy H(X) for a discrete random variable measures:
A) The variance of X B) The expected information content or uncertainty of X C) The number of possible values X can take D) The probability of the most likely outcome
Correct: B)
- If you chose B: Correct! H(X) = −Σ p(x) log p(x) quantifies uncertainty in bits (log base 2) or nats (natural log). Higher entropy means more uncertainty.
- If you chose A: Variance measures spread in value space; entropy measures uncertainty in probability space.
- If you chose C: This is the size of the support, not entropy (though entropy is maximized at log|X| for uniform).
- If you chose D: This is max p(x), not a measure of overall uncertainty.
Q2: KL divergence D_{KL}(P || Q) is:
A) Symmetric: D_{KL}(P||Q) = D_{KL}(Q||P) B) Always non-negative: D_{KL}(P||Q) ≥ 0, with equality iff P = Q C) A true distance metric D) Measured in the same units as variance
Correct: B)
- If you chose B: Correct! By Gibbs' inequality, D_{KL}(P||Q) ≥ 0, with equality only when P = Q almost everywhere. It's always non-negative.
- If you chose A: KL divergence is NOT symmetric: D_{KL}(P||Q) ≠ D_{KL}(Q||P) in general.
- If you chose C: It fails symmetry and triangle inequality, so it's a divergence, not a metric.
- If you chose D: KL divergence is measured in bits or nats, not squared units.
Q3: Mutual information I(X; Y) equals zero if and only if:
A) X and Y have the same distribution B) X and Y are independent C) H(X) = H(Y) D) X = Y deterministically
Correct: B)
- If you chose B: Correct! I(X;Y) = D_{KL}(P_{XY} || P_X P_Y) = 0 iff the joint equals the product of marginals, which is the definition of independence. I(X;Y) measures dependence.
- If you chose A: Identical distributions don't imply independence.
- If you chose C: Equal entropies don't imply independence.
- If you chose D: Deterministic relationship gives MAXIMUM mutual information (= H(X)), not zero.
Q5: Cross-entropy H(P, Q) = −Σ p(x) log q(x) is related to KL divergence by:
A) H(P, Q) = H(P) + D_{KL}(P || Q) B) H(P, Q) = H(P) − D_{KL}(P || Q) C) H(P, Q) = D_{KL}(P || Q) D) H(P, Q) = H(P) · D_{KL}(P || Q)
Correct: A)
- If you chose A: Correct! H(P,Q) = H(P) + D_{KL}(P||Q). Cross-entropy = entropy + divergence. Minimizing cross-entropy (with P fixed) is equivalent to minimizing KL divergence.
- If you chose B: The relationship is addition, not subtraction.
- If you chose C: Cross-entropy includes the entropy of P plus the KL divergence.
- If you chose D: These are added, not multiplied.
Practice Problems
- Compute H(X) for X ~ Geometric(0.5) in closed form. You may use Σ p log₂ p.
- Show that H(X, Y) = H(X) + H(Y | X) and that H(Y | X) ≤ H(Y) with equality iff X ⟂ Y.
- Prove Gibbs' inequality: D_{KL}(P || Q) ≥ 0.
- Let X and Y be binary with p(0,0)=p(1,1)=0.3, p(0,1)=p(1,0)=0.2. Compute I(X; Y), H(X|Y), H(Y|X).
- For the normal distribution N(μ, σ²), the differential entropy is h(X) = (1/2) log₂(2πeσ²). Verify this formula.
- Show the data processing inequality: if X→Y→Z is a Markov chain, then I(X; Z) ≤ I(X; Y).
- Find the mutual information I(X; Y) for a bivariate normal with correlation ρ. What happens as ρ → 1?
Answers
1. p(k) = (1/2)^k. H(X) = −Σ_{k=1}^{∞} (1/2)^k log₂(1/2)^k = Σ k/2^k = 2 bits. 2. H(X,Y)=H(X)+H(Y|X): definitional from log p(x,y)=log p(x)+log p(y|x). H(Y|X)≤H(Y): I(X;Y)=H(Y)−H(Y|X)≥0 ⇒ H(Y|X)≤H(Y). Equality when I=0 ⇔ X⟂Y. 3. See proof in Core Content §3. Uses log x ≤ x−1. 4. p_X(0)=p_X(1)=0.5, p_Y(0)=p_Y(1)=0.5. H(X)=H(Y)=1. H(X,Y)=−[2·0.3 log₂ 0.3+2·0.2 log₂ 0.2]=−[0.6(−1.737)+0.4(−2.322)]=1.971. I=1+1−1.971=0.029 bits. Very weak dependence. H(X|Y)=1−0.029=0.971, H(Y|X)=0.971. 5. Differential entropy for continuous: h(X)=−∫f(x)ln f(x)dx (nats). For N(μ,σ²): ln f=−(x−μ)²/(2σ²)−ln(√(2πσ²)). h=1/2+ln(√(2πσ²))=(1/2)ln(2πeσ²) nats. In bits: divide by ln 2. 6. I(X;Y,Z) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z) (since MI is non-negative). By Markov property: I(X;Y|Z) = I(X;Y) − I(X;Z) ≥ 0 → I(X;Z) ≤ I(X;Y). 7. For bivariate normal, I(X;Y) = −(1/2)log₂(1−ρ²). As ρ→±1, I→∞ — perfect linear dependence gives infinite information (continuous case, differential entropy subtleties). As ρ→0, I→0.Summary
- Entropy H(X) = −Σ p(x) log₂ p(x) measures uncertainty in bits; maximum entropy for given support is achieved by the uniform distribution
- Joint entropy chains: H(X,Y) = H(X) + H(Y|X); conditioning reduces entropy, H(Y|X) ≤ H(Y), equality iff independent
- KL divergence D_{KL}(P||Q) = Σ p log(p/q) ≥ 0 measures information lost using Q to approximate P; asymmetric; gives maximum likelihood as a special case
- Mutual information I(X;Y) = H(X)−H(X|Y) = D_{KL}(p(x,y)||p(x)p(y)) captures all dependence (not just linear); zero iff independence
- Deep connections: maximum entropy → uniform/exponential/normal; Fisher information bounds estimator variance; Sanov's theorem ties KL to large deviations
Pitfalls
- Treating KL divergence as a true distance metric: D_{KL}(P||Q) is asymmetric — D_{KL}(P||Q) ≠ D_{KL}(Q||P) in general — and it does not satisfy the triangle inequality. It is a divergence, not a metric. Using it as if it were symmetric (e.g., averaging forward and reverse KL) changes the meaning entirely.
- Forgetting that 0 log 0 = 0: In entropy calculations, terms where p(x) = 0 contribute 0, not undefined. This follows from lim_{p→0⁺} p log p = 0. Omitting this convention leads to nonsensical results when some probabilities are zero.
- Confusing mutual information with correlation: I(X;Y) = 0 if and only if X and Y are independent (any kind of dependence). Correlation ρ = 0 does not imply independence — variables can be uncorrelated but dependent (e.g., X ~ N(0,1), Y = X²). Mutual information captures this; correlation does not.
- Applying discrete entropy formulas to continuous variables: H(X) = −Σ p(x) log p(x) is for discrete random variables only. For continuous variables, you need differential entropy h(X) = −∫ f(x) log f(x) dx, which can be negative and lacks some properties of discrete entropy (e.g., it's not invariant under nonlinear transformations).
- Misusing the chain rule: H(X,Y) = H(X) + H(Y|X) holds for any X,Y. Do not assume H(X,Y) = H(X) + H(Y) — that's only true when X and Y are independent. Always check dependence before simplifying joint entropy.
Quiz
-
Shannon entropy H(X) is measured in: a) Probabilities b) Bits (when using log base 2) c) Standard deviations d) Correlation units Answer: b. H(X) = −Σ p(x) log₂ p(x) gives bits. Use natural log for nats.
-
H(X) = 0 when: a) X is uniform b) X has many values c) X is deterministic (only one possible value) d) X has two equally likely outcomes Answer: c. No uncertainty → zero entropy. Uniform maximizes entropy for given support.
-
D_{KL}(P || Q) is: a) Symmetric: D_{KL}(P||Q) = D_{KL}(Q||P) b) Always negative c) Always ≥ 0, with equality iff P = Q d) A true distance metric Answer: c. Gibbs' inequality guarantees non-negativity. KL is NOT symmetric, so it's not a metric.
-
Mutual information I(X; Y) = 0 if and only if: a) Cov(X, Y) = 0 b) X and Y are independent c) ρ(X, Y) = 0 d) X and Y are normal Answer: b. Unlike correlation, zero mutual information means genuine statistical independence.
-
The chain rule for entropy states: a) H(X, Y) = H(X) + H(Y) b) H(X, Y) = H(X) H(Y) c) H(X, Y) = H(X) + H(Y | X) d) H(X, Y) = H(Y) + H(X | Y, X) Answer: c. Joint entropy = marginal + conditional. Generalizes to n variables.
-
The data processing inequality says that if X→Y→Z is a Markov chain: a) H(Z) ≥ H(Y) b) I(X; Z) ≤ I(X; Y) c) H(X) = H(Z) d) I(X; Z) = I(Y; Z) Answer: b. Processing Y into Z cannot increase the information about X.
-
Which distribution maximizes entropy on (−∞, ∞) given a fixed mean and variance? a) Uniform b) Exponential c) Normal d) Cauchy Answer: c. The normal is the maximum entropy distribution with prescribed mean and variance.
-
H(Y | X) is: a) Always ≥ H(Y) b) The expected remaining uncertainty about Y after knowing X c) Equal to H(Y) for all X, Y d) The same as I(X; Y) Answer: b. H(Y|X) = E[H(Y|X=x)] — the average conditional entropy. Always ≤ H(Y).
Next Steps
You have completed Phase 11. Continue to Phase 12 for an introduction to statistical inference: point estimation, confidence intervals, and hypothesis testing.