Math graphic
📐 Concept diagram

Phase 11: Probability Theory II

Subject 11-10: Information Theory Connection

Prerequisites: 10-04 (Discrete Random Variables), 10-10 (Joint Distributions), 11-01 (Expectation), 11-03 (Conditional Expectation)


Learning Objectives

  1. Define Shannon entropy H(X) and explain its interpretation as expected information content
  2. Define joint entropy H(X, Y) and conditional entropy H(Y | X), and prove the chain rule H(X, Y) = H(X) + H(Y | X)
  3. Define Kullback-Leibler divergence D_{KL}(P || Q) and prove it is non-negative (Gibbs' inequality)
  4. Define mutual information I(X; Y) and relate it to entropy and KL divergence
  5. Apply information-theoretic concepts to probability: maximum entropy distributions, Fisher information, and data processing inequality

Core Content

1. Entropy — Measuring Uncertainty

The Shannon entropy of a discrete random variable X with PMF p(x) is:

$H(X) = −Σ_x p(x) log₂ p(x) = E[−log₂ p(X)]
$

Units: Bits (when using log base 2). Can also use nats (natural log) or dits (log base 10).

Interpretation: - H(X) measures the "expected surprise" or uncertainty about X - −log₂ p(x) is the information content (in bits) of observing outcome x — rare events carry more information - H(X) is the minimum expected number of bits needed to encode X (Shannon's source coding theorem)

Properties of entropy: 1. Non-negativity: H(X) ≥ 0, with equality iff X is constant (deterministic) 2. Maximum entropy: For a random variable with k possible values, H(X) ≤ log₂ k, with equality iff X is uniform. The uniform distribution is "maximally uncertain." 3. H(X) depends ONLY on the probabilities, not the actual values: H(X) = H(p) = H(p₁, ..., pₖ)

Examples: - Fair coin: H(X) = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit - Biased coin (p=0.9): H(X) = −(0.9 log₂ 0.9 + 0.1 log₂ 0.1) ≈ 0.469 bits - Certain event (p=1): H(X) = 0 bits — no uncertainty

⚠️ CRITICAL: 0 log₂ 0 is defined as 0 (by continuity: lim_{p→0⁺} p log₂ p = 0).

Binary entropy function: For a Bernoulli(p) random variable:

$h(p) = −p log₂ p − (1−p) log₂ (1−p)
$

h(p) is concave, symmetric about p = 1/2, with maximum h(1/2) = 1 bit.

2. Joint and Conditional Entropy

Joint entropy:

$H(X, Y) = −Σ_x Σ_y p(x, y) log₂ p(x, y) = E[−log₂ p(X, Y)]
$

Conditional entropy:

$H(Y | X) = Σ_x p(x) H(Y | X = x) = −Σ_x Σ_y p(x, y) log₂ p(y | x) = E[−log₂ p(Y | X)]
$

H(Y | X) is the expected remaining uncertainty about Y after knowing X.

⚠️ Important: H(Y | X) ≤ H(Y) — conditioning reduces entropy on average. "Information never hurts." Equality iff X and Y are independent.

The Chain Rule:

$H(X, Y) = H(X) + H(Y | X)
$

Proof: H(X,Y) = −E[log p(X,Y)] = −E[log(p(X) p(Y|X))] = −E[log p(X)] − E[log p(Y|X)] = H(X) + H(Y|X).

More generally:

$H(X₁, X₂, ..., Xₙ) = Σ_{i=1}^{n} H(X_i | X₁, ..., X_{i−1})
$

3. Kullback-Leibler (KL) Divergence

The KL divergence (relative entropy) between two probability distributions P and Q on the same space:

$D_{KL}(P || Q) = Σ_x p(x) log₂ (p(x) / q(x)) = E_P[log₂ (p(X) / q(X))]
$

Interpretation: The "information lost" when Q is used to approximate P. Not a true distance metric — asymmetric and doesn't satisfy triangle inequality.

Gibbs' Inequality: D_{KL}(P || Q) ≥ 0, with equality iff P = Q almost everywhere.

Proof: Using log(x) ≤ x − 1 (Jensen's inequality on −ln x):

$−D_{KL} = Σ p(x) log(q(x)/p(x)) ≤ Σ p(x)(q(x)/p(x) − 1) = Σ q(x) − Σ p(x) = 0
$

So D_{KL} ≥ 0. Equality when q(x)/p(x) = 1 for all x with p(x) > 0.

Common misconception: D_{KL}(P || Q) ≠ D_{KL}(Q || P). The "forward" and "reverse" KL are different. In variational inference, minimizing D_{KL}(Q || P) (reverse KL) gives mode-seeking behavior; minimizing D_{KL}(P || Q) (forward KL) gives mean-seeking behavior.

Relation to maximum likelihood: Minimizing D_{KL}(P_{true} || P_{model}) is equivalent to maximizing the expected log-likelihood under the true distribution.

4. Mutual Information

The mutual information between X and Y measures how much knowing one variable reduces uncertainty about the other:

$I(X; Y) = D_{KL}(p(x,y) || p(x)p(y))
        = Σ_x Σ_y p(x, y) log₂ [p(x, y) / (p(x) p(y))]
$

Equivalent expressions:

$I(X; Y) = H(X) − H(X | Y)
        = H(Y) − H(Y | X)
        = H(X) + H(Y) − H(X, Y)
$

Properties: 1. Symmetry: I(X; Y) = I(Y; X) 2. Non-negativity: I(X; Y) ≥ 0 (follows from D_{KL} ≥ 0) 3. Zero iff independent: I(X; Y) = 0 ⇔ X ⟂ Y 4. Self-information: I(X; X) = H(X)

⚠️ Unlike correlation, mutual information captures ANY dependence — linear or nonlinear. I(X; Y) = 0 means genuine independence, not just zero correlation.

Conditional mutual information: I(X; Y | Z) = H(X | Z) − H(X | Y, Z). The information X and Y share beyond what Z already tells us.

Data Processing Inequality (DPI): If X → Y → Z forms a Markov chain (Z depends on X only through Y), then:

$I(X; Z) ≤ I(X; Y)
$

"Processing data cannot increase information." No transformation of Y can create new information about X.

5. Probability-Meets-Information Deep Connections

Maximum Entropy Principle: Among all distributions satisfying given constraints, the one that maximizes entropy is the "least informative" or "most conservative" choice. This gives: - Given only support [a, b] → Uniform(a, b) maximizes entropy - Given mean μ → Exponential(μ) maximizes entropy on [0, ∞) - Given mean μ and variance σ² → Normal(μ, σ²) maximizes entropy on R

The normal distribution is special: it's the maximum entropy distribution for a given mean and variance — it makes the fewest assumptions beyond those two constraints.

Fisher Information:

$I(θ) = E[(∂/∂θ log f(X; θ))²] = −E[∂²/∂θ² log f(X; θ)]
$

measures how much information a sample carries about a parameter θ. Related to KL divergence: D_{KL}(P_θ || P_{θ+dθ}) ≈ (1/2) I(θ) (dθ)².

Cramér-Rao bound: Var(θ̂) ≥ 1 / (n I(θ)) — Fisher information sets a lower bound on the variance of any unbiased estimator.

Connection to Large Deviations: Sanov's theorem: the probability of an empirical distribution deviating from the true distribution decays exponentially at a rate given by KL divergence:

$P(deviate to Q) ≈ exp(−n D_{KL}(Q || P))
$


Key Terms

Worked Examples

Example 1: Entropy of a Discrete Distribution

X has PMF: p(1) = 0.5, p(2) = 0.25, p(3) = 0.125, p(4) = 0.125. Compute H(X).

Solution:

H(X) = −[0.5 log₂ 0.5 + 0.25 log₂ 0.25 + 0.125 log₂ 0.125 + 0.125 log₂ 0.125] = −[0.5(−1) + 0.25(−2) + 0.125(−3) + 0.125(−3)] = −[−0.5 − 0.5 − 0.375 − 0.375] = −[−1.75] = 1.75 bits

On average, you need 1.75 bits to encode each outcome.


Example 2: Mutual Information Computation

Joint PMF: p(0,0)=0.1, p(0,1)=0.3, p(1,0)=0.4, p(1,1)=0.2. Find I(X; Y).

Solution:

Marginals: p_X(0)=0.4, p_X(1)=0.6; p_Y(0)=0.5, p_Y(1)=0.5.

Joint entropy: H(X,Y) = −[0.1 log₂ 0.1 + 0.3 log₂ 0.3 + 0.4 log₂ 0.4 + 0.2 log₂ 0.2] = −[0.1(−3.322) + 0.3(−1.737) + 0.4(−1.322) + 0.2(−2.322)] = −[−0.332 − 0.521 − 0.529 − 0.464] = 1.846 bits.

H(X) = −[0.4 log₂ 0.4 + 0.6 log₂ 0.6] = −[0.4(−1.322) + 0.6(−0.737)] = 0.971 bits. H(Y) = 1 bit (since p_Y is uniform).

I(X;Y) = 0.971 + 1 − 1.846 = 0.125 bits. Knowing X reduces uncertainty about Y by 0.125 bits (out of a possible 1 bit). Weak dependence.


Example 3: KL Divergence

P = (0.5, 0.3, 0.2), Q = (0.3, 0.5, 0.2). Compute D_{KL}(P || Q).

Solution:

D_{KL}(P || Q) = 0.5 log₂(0.5/0.3) + 0.3 log₂(0.3/0.5) + 0.2 log₂(0.2/0.2) = 0.5 log₂(1.667) + 0.3 log₂(0.6) + 0.2 log₂(1) = 0.5(0.737) + 0.3(−0.737) + 0.2(0) = 0.368 − 0.221 + 0 = 0.147 bits

If we used Q to encode samples from P, we'd waste about 0.147 extra bits per sample (on average).

Quiz

Q1: Shannon entropy H(X) for a discrete random variable measures:

A) The variance of X B) The expected information content or uncertainty of X C) The number of possible values X can take D) The probability of the most likely outcome

Correct: B)


Q2: KL divergence D_{KL}(P || Q) is:

A) Symmetric: D_{KL}(P||Q) = D_{KL}(Q||P) B) Always non-negative: D_{KL}(P||Q) ≥ 0, with equality iff P = Q C) A true distance metric D) Measured in the same units as variance

Correct: B)


Q3: Mutual information I(X; Y) equals zero if and only if:

A) X and Y have the same distribution B) X and Y are independent C) H(X) = H(Y) D) X = Y deterministically

Correct: B)


Q5: Cross-entropy H(P, Q) = −Σ p(x) log q(x) is related to KL divergence by:

A) H(P, Q) = H(P) + D_{KL}(P || Q) B) H(P, Q) = H(P) − D_{KL}(P || Q) C) H(P, Q) = D_{KL}(P || Q) D) H(P, Q) = H(P) · D_{KL}(P || Q)

Correct: A)


Practice Problems

  1. Compute H(X) for X ~ Geometric(0.5) in closed form. You may use Σ p log₂ p.
  2. Show that H(X, Y) = H(X) + H(Y | X) and that H(Y | X) ≤ H(Y) with equality iff X ⟂ Y.
  3. Prove Gibbs' inequality: D_{KL}(P || Q) ≥ 0.
  4. Let X and Y be binary with p(0,0)=p(1,1)=0.3, p(0,1)=p(1,0)=0.2. Compute I(X; Y), H(X|Y), H(Y|X).
  5. For the normal distribution N(μ, σ²), the differential entropy is h(X) = (1/2) log₂(2πeσ²). Verify this formula.
  6. Show the data processing inequality: if X→Y→Z is a Markov chain, then I(X; Z) ≤ I(X; Y).
  7. Find the mutual information I(X; Y) for a bivariate normal with correlation ρ. What happens as ρ → 1?
Answers 1. p(k) = (1/2)^k. H(X) = −Σ_{k=1}^{∞} (1/2)^k log₂(1/2)^k = Σ k/2^k = 2 bits. 2. H(X,Y)=H(X)+H(Y|X): definitional from log p(x,y)=log p(x)+log p(y|x). H(Y|X)≤H(Y): I(X;Y)=H(Y)−H(Y|X)≥0 ⇒ H(Y|X)≤H(Y). Equality when I=0 ⇔ X⟂Y. 3. See proof in Core Content §3. Uses log x ≤ x−1. 4. p_X(0)=p_X(1)=0.5, p_Y(0)=p_Y(1)=0.5. H(X)=H(Y)=1. H(X,Y)=−[2·0.3 log₂ 0.3+2·0.2 log₂ 0.2]=−[0.6(−1.737)+0.4(−2.322)]=1.971. I=1+1−1.971=0.029 bits. Very weak dependence. H(X|Y)=1−0.029=0.971, H(Y|X)=0.971. 5. Differential entropy for continuous: h(X)=−∫f(x)ln f(x)dx (nats). For N(μ,σ²): ln f=−(x−μ)²/(2σ²)−ln(√(2πσ²)). h=1/2+ln(√(2πσ²))=(1/2)ln(2πeσ²) nats. In bits: divide by ln 2. 6. I(X;Y,Z) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z) (since MI is non-negative). By Markov property: I(X;Y|Z) = I(X;Y) − I(X;Z) ≥ 0 → I(X;Z) ≤ I(X;Y). 7. For bivariate normal, I(X;Y) = −(1/2)log₂(1−ρ²). As ρ→±1, I→∞ — perfect linear dependence gives infinite information (continuous case, differential entropy subtleties). As ρ→0, I→0.

Summary


Pitfalls


Quiz

  1. Shannon entropy H(X) is measured in: a) Probabilities b) Bits (when using log base 2) c) Standard deviations d) Correlation units Answer: b. H(X) = −Σ p(x) log₂ p(x) gives bits. Use natural log for nats.

  2. H(X) = 0 when: a) X is uniform b) X has many values c) X is deterministic (only one possible value) d) X has two equally likely outcomes Answer: c. No uncertainty → zero entropy. Uniform maximizes entropy for given support.

  3. D_{KL}(P || Q) is: a) Symmetric: D_{KL}(P||Q) = D_{KL}(Q||P) b) Always negative c) Always ≥ 0, with equality iff P = Q d) A true distance metric Answer: c. Gibbs' inequality guarantees non-negativity. KL is NOT symmetric, so it's not a metric.

  4. Mutual information I(X; Y) = 0 if and only if: a) Cov(X, Y) = 0 b) X and Y are independent c) ρ(X, Y) = 0 d) X and Y are normal Answer: b. Unlike correlation, zero mutual information means genuine statistical independence.

  5. The chain rule for entropy states: a) H(X, Y) = H(X) + H(Y) b) H(X, Y) = H(X) H(Y) c) H(X, Y) = H(X) + H(Y | X) d) H(X, Y) = H(Y) + H(X | Y, X) Answer: c. Joint entropy = marginal + conditional. Generalizes to n variables.

  6. The data processing inequality says that if X→Y→Z is a Markov chain: a) H(Z) ≥ H(Y) b) I(X; Z) ≤ I(X; Y) c) H(X) = H(Z) d) I(X; Z) = I(Y; Z) Answer: b. Processing Y into Z cannot increase the information about X.

  7. Which distribution maximizes entropy on (−∞, ∞) given a fixed mean and variance? a) Uniform b) Exponential c) Normal d) Cauchy Answer: c. The normal is the maximum entropy distribution with prescribed mean and variance.

  8. H(Y | X) is: a) Always ≥ H(Y) b) The expected remaining uncertainty about Y after knowing X c) Equal to H(Y) for all X, Y d) The same as I(X; Y) Answer: b. H(Y|X) = E[H(Y|X=x)] — the average conditional entropy. Always ≤ H(Y).


Next Steps

You have completed Phase 11. Continue to Phase 12 for an introduction to statistical inference: point estimation, confidence intervals, and hypothesis testing.