### 13.3 — Mutual Information

Phase: Information Theory

Prerequisites: 13-02-conditional-entropy-chain-rule, 13-01-entropy

Learning Objectives

By the end of this subject, you will be able to:

  1. Define mutual information $I(X; Y)$ and relate it to entropy
  2. Prove that $I(X; Y) \geq 0$ with equality iff $X \perp\!\!\!\perp Y$
  3. Show that mutual information is symmetric: $I(X; Y) = I(Y; X)$
  4. Compute $I(X; Y)$ from joint and marginal distributions
  5. Compare mutual information to correlation and explain when one is preferable

Core Content

⚠️ CRITICAL: What Mutual Information Measures

Mutual information quantifies how much information one random variable contains about another. It is the reduction in uncertainty about $Y$ gained by observing $X$:

$$I(X; Y) = H(Y) - H(Y \mid X)$$

Equivalent formulations:

$$I(X; Y) = H(X) - H(X \mid Y)$$

$$I(X; Y) = H(X) + H(Y) - H(X, Y)$$

$$I(X; Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}$$

The last form makes it clear: $I(X; Y)$ is the KL divergence between the joint distribution $p(x, y)$ and the product of marginals $p(x)p(y)$ — it measures how far $(X, Y)$ is from independence.
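To see why the sum form agrees with the entropy difference, one can expand the logarithm using $p(x, y) = p(x)\,p(y \mid x)$. This is a short sketch for discrete variables, using only the definitions of $H(Y)$ and $H(Y \mid X)$ from the prerequisites:

$$\begin{aligned}
\sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}
&= \sum_{x,y} p(x, y) \log_2 \frac{p(y \mid x)}{p(y)} \\
&= -\sum_{y} p(y) \log_2 p(y) + \sum_{x,y} p(x, y) \log_2 p(y \mid x) \\
&= H(Y) - H(Y \mid X)
\end{aligned}$$

The second line uses $\sum_x p(x, y) = p(y)$ to collapse the first sum to the marginal of $Y$.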

Properties

  1. Non-negativity: $I(X; Y) \geq 0$, with equality iff $X \perp\!\!\!\perp Y$

     Proof: $I(X; Y)$ is a KL divergence, and KL divergence is always $\geq 0$ (Gibbs' inequality), with equality iff the two distributions coincide, i.e. $p(x, y) = p(x)p(y)$.

  2. Symmetry: $I(X; Y) = I(Y; X)$

     The information $X$ provides about $Y$ equals the information $Y$ provides about $X$.

  3. Upper bound: $I(X; Y) \leq \min(H(X), H(Y))$

     Mutual information can't exceed the total information in either variable.

  4. Self-information: $I(X; X) = H(X)$

     A variable contains exactly its own entropy as information about itself.
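The properties above can be spot-checked numerically. The sketch below is illustrative only: the joint table is a made-up distribution, and `entropy` and `mutual_information` are helper names introduced here, not part of any library.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X; Y) in bits from a joint table joint[x][y], using the sum (KL) form."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical joint distribution p(x, y), chosen only for illustration
joint = [[0.10, 0.25, 0.05],
         [0.30, 0.05, 0.25]]
transposed = [list(col) for col in zip(*joint)]  # swaps the roles of X and Y

px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
mi = mutual_information(joint)

non_negative = mi >= 0
symmetric = abs(mi - mutual_information(transposed)) < 1e-12
bounded = mi <= min(entropy(px), entropy(py))

# Self-information: the joint of (X, X) puts p(x) on the diagonal
diag = [[p if i == j else 0.0 for j in range(len(px))] for i, p in enumerate(px)]
self_info = abs(mutual_information(diag) - entropy(px)) < 1e-12

print(non_negative, symmetric, bounded, self_info)
```

A single table is not a proof, but a check like this is a quick way to catch a mis-implemented estimator.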

Relationship to Entropy (Venn Diagram Intuition)

    |<------- H(X) ------->|
               |<------- H(Y) ------->|
    +----------+-----------+----------+
    |  H(X|Y)  |  I(X; Y)  |  H(Y|X)  |
    +----------+-----------+----------+
    |<----------- H(X, Y) ----------->|

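🚩 Common Pitfall: The Venn diagram is exact for two variables, but for three or more variables the "overlap" intuition breaks down. The triple-overlap region can be negative, so it is not a genuine area: three variables can be pairwise independent (zero pairwise MI) and still be jointly dependent, as with $Z = X \oplus Y$ for independent fair bits $X$ and $Y$.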
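One concrete case where the overlap picture fails is XOR. The sketch below assumes $X$ and $Y$ are independent fair bits and $Z = X \oplus Y$; the helpers `entropy`, `marginal`, and `mi` are names introduced here for illustration.

```python
from math import log2
from itertools import product

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginalize a joint dict onto the coordinates listed in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def mi(joint, a, b):
    """I(A; B) = H(A) + H(B) - H(A, B); a and b are tuples of coordinate indices."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))

# Outcomes are (x, y, z) with z = x XOR y; each of the four (x, y) pairs has mass 1/4
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

pairwise = [mi(joint, (0,), (1,)), mi(joint, (0,), (2,)), mi(joint, (1,), (2,))]
joint_mi = mi(joint, (0, 1), (2,))  # I(X, Y; Z)

print(pairwise)   # every pairwise MI is 0 bits
print(joint_mi)   # yet (X, Y) together carry 1 bit about Z
```

No arrangement of three overlapping circles with non-negative areas reproduces these numbers, which is why the diagram should not be read literally beyond two variables.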

Mutual Information vs Correlation

| Property | Correlation $\rho$ | Mutual Information $I(X; Y)$ |
|---|---|---|
| Captures | Linear relationships only | Any statistical dependence |
| Range | $[-1, 1]$ | $[0, \infty)$ |
| Zero means | No linear relationship | Independence |
| Units | Dimensionless | Bits (or nats) |

Key advantage of MI: It detects non-linear dependencies that correlation misses.

Example: $Y = X^2$ with $X \sim \text{Uniform}(-1, 1)$. The correlation is $\rho = 0$ because $\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0$ (no linear relationship), but $I(X; Y) > 0$: $Y$ is a deterministic function of $X$, although $X$ can be recovered from $Y$ only up to sign.
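The same effect can be checked on a discrete analogue. The choice of $X$ uniform on $\{-1, 0, 1\}$ is an assumption made here so that every sum is finite; `expect` is a helper name introduced for this sketch.

```python
from math import log2, sqrt

# Assumed discrete analogue: X uniform on {-1, 0, 1}, Y = X^2
xs = [-1, 0, 1]
joint = {(x, x * x): 1 / len(xs) for x in xs}

def expect(f):
    """Expectation of f(x, y) under the joint distribution."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

# Pearson correlation from first and second moments
ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - ex * ey
sd_x = sqrt(expect(lambda x, y: x * x) - ex ** 2)
sd_y = sqrt(expect(lambda x, y: y * y) - ey ** 2)
rho = cov / (sd_x * sd_y)

# Marginals, then I(X; Y) via the sum (KL) form
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p
mi = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

print(f"rho = {rho:.3f}, I(X;Y) = {mi:.3f} bits")
```

Here the correlation vanishes while the mutual information equals $H(Y) = H_b(1/3) \approx 0.918$ bits, since $Y$ is fully determined by $X$.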

Conditional Mutual Information

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$$

The information $X$ and $Y$ share, AFTER accounting for $Z$. This is central to feature selection: does $X$ provide additional information about $Y$ beyond what $Z$ already tells us?
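A minimal sketch of the feature-selection reading, under a made-up setup: $Z$ is a fair bit, $X$ is an exact copy of $Z$, and $Y$ equals $Z$ with probability 0.9. The helpers are names introduced here, and the identity $I(X; Y \mid Z) = H(X, Z) - H(Z) - H(X, Y, Z) + H(Y, Z)$ follows from expanding both conditional entropies with the chain rule.

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginalize a joint dict onto the coordinates listed in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_mi(joint, x, y, z):
    """I(X; Y | Z) = H(X, Z) - H(Z) - H(X, Y, Z) + H(Y, Z)."""
    h = lambda idx: entropy(marginal(joint, idx))
    return h(x + z) - h(z) - h(x + y + z) + h(y + z)

# Outcomes are (x, y, z). Hypothetical setup: X = Z exactly, Y = Z flipped with prob. 0.1
joint = {}
for z in (0, 1):
    for y in (0, 1):
        joint[(z, y, z)] = 0.5 * (0.9 if y == z else 0.1)

X, Y, Z = (0,), (1,), (2,)
mi_xy = (entropy(marginal(joint, X)) + entropy(marginal(joint, Y))
         - entropy(marginal(joint, X + Y)))
cmi = cond_mi(joint, X, Y, Z)

print(f"I(X;Y) = {mi_xy:.3f} bits, I(X;Y|Z) = {cmi:.3f} bits")
```

In this setup $X$ looks informative about $Y$ on its own (about 0.531 bits), but it adds nothing once $Z$ is known, so a feature selector that already holds $Z$ would gain nothing from $X$.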



Key Terms

Worked Examples

Example 1: Computing mutual information

$X, Y \in \{0, 1\}$ with joint distribution:

| $p(x, y)$ | $Y=0$ | $Y=1$ | $p(x)$ |
|---|---|---|---|
| $X=0$ | 0.3 | 0.1 | 0.4 |
| $X=1$ | 0.2 | 0.4 | 0.6 |
| $p(y)$ | 0.5 | 0.5 | 1.0 |

$H(Y) = H_b(0.5) = 1$ bit

$H(Y \mid X=0)$: $p(y \mid X=0) = (0.3/0.4, 0.1/0.4) = (0.75, 0.25)$

$H(Y \mid X=0) = H_b(0.25) = -0.25\log_2 0.25 - 0.75\log_2 0.75 = 0.5 + 0.311 = 0.811$ bits

$H(Y \mid X=1)$: $p(y \mid X=1) = (0.2/0.6, 0.4/0.6) = (0.333, 0.667)$

$H(Y \mid X=1) = H_b(0.333) = -0.333\log_2 0.333 - 0.667\log_2 0.667 = 0.528 + 0.390 = 0.918$ bits

$H(Y \mid X) = 0.4 \cdot 0.811 + 0.6 \cdot 0.918 = 0.324 + 0.551 = 0.875$ bits

$I(X; Y) = H(Y) - H(Y \mid X) = 1 - 0.875 = 0.125$ bits

Knowing $X$ reduces uncertainty about $Y$ by 0.125 bits on average.
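The hand calculation can be double-checked in a few lines. This sketch recomputes $I(X; Y)$ for the table above in two ways, via $H(Y) - H(Y \mid X)$ and via the sum form; the variable names are chosen here for illustration.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Joint table from Example 1: rows are X = 0, 1 and columns are Y = 0, 1
joint = [[0.3, 0.1],
         [0.2, 0.4]]
px = [sum(row) for row in joint]         # marginal p(x)
py = [sum(col) for col in zip(*joint)]   # marginal p(y)

# Route 1: H(Y) - H(Y | X), weighting each conditional entropy by p(x)
h_y_given_x = sum(
    px[i] * entropy([p / px[i] for p in joint[i]]) for i in range(len(joint))
)
mi_entropy = entropy(py) - h_y_given_x

# Route 2: the sum (KL) form
mi_kl = sum(
    joint[i][j] * log2(joint[i][j] / (px[i] * py[j]))
    for i in range(len(px))
    for j in range(len(py))
    if joint[i][j] > 0
)

print(round(mi_entropy, 3), round(mi_kl, 3))  # 0.125 0.125
```

Both routes agree, which is a useful sanity check that the conditional distributions were normalized correctly.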

Example 2: MI for perfectly dependent variables

$Y = X$ (perfect dependence). Then $p(x, y) = p(x)$ when $x=y$, 0 otherwise.

$I(X; Y) = H(Y) - H(Y \mid X) = H(X) - 0 = H(X)$

Mutual information equals the entropy — $X$ completely determines $Y$.

Example 3: MI vs Correlation — non-linear case

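A first attempt that fails: let $X$ be uniform on $\{-1, 1\}$ and $Y = |X|$, so $Y = 1$ always.

Covariance: $\text{Cov}(X, Y) = E[XY] - E[X]E[Y] = E[X \cdot 1] - 0 \cdot 1 = 0$. (Strictly, $\rho$ is undefined here because $\text{Var}(Y) = 0$.)

Mutual information: $I(X; Y) = H(Y) - H(Y \mid X) = 0 - 0 = 0$. The two measures agree in this case: a constant $Y$ tells you nothing about $X$, so this example does not separate MI from correlation.

Better example: $X \sim N(0, 1)$, $Y = X^2$. Here $\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0$, so $\rho = 0$, but $I(X; Y) > 0$ because $Y$ tells you the magnitude of $X$ (though not the sign).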



Quiz

Q1: What does mutual information $I(X; Y)$ measure?

A) The reduction in uncertainty about one variable gained by observing the other
B) The strength of the linear relationship between $X$ and $Y$
C) The total uncertainty in the pair $(X, Y)$
D) The uncertainty remaining in $Y$ after observing $X$

Correct: A)

Q2: Which inequality holds for every pair of discrete random variables?

A) $I(X; Y) \geq 0$
B) $I(X; Y) \geq H(X)$
C) $I(X; Y) \geq \max(H(X), H(Y))$
D) $I(X; Y) > H(X, Y)$

Correct: A)

Q3: Why is mutual information non-negative?

A) Because entropy is measured in bits
B) Because it is a KL divergence, and KL divergence is always $\geq 0$
C) Because it is symmetric
D) Because it is bounded above by $\min(H(X), H(Y))$

Correct: B)

Q4: Which statement about the upper bound on mutual information is TRUE?

A) $I(X; Y)$ can exceed $H(X)$ when $Y$ has higher entropy
B) $I(X; Y)$ is always at most 1 bit
C) $I(X; Y) \leq \min(H(X), H(Y))$
D) $I(X; Y) \leq H(X \mid Y)$

Correct: C)

Q5: Which expression always equals the joint entropy $H(X, Y)$, by the chain rule used in the practice problems?

A) $H(X) - H(Y \mid X)$
B) $H(X) + H(Y)$
C) $H(X) + H(Y \mid X)$
D) $H(Y) - H(Y \mid X)$

Correct: C)

Q6: How does the upper bound $I(X; Y) \leq \min(H(X), H(Y))$ relate to the definition $I(X; Y) = H(Y) - H(Y \mid X)$?

A) The bound holds only when $X$ and $Y$ are independent
B) The bound contradicts the definition when $Y = X$
C) Because $H(Y \mid X) \geq 0$, the definition gives $I(X; Y) \leq H(Y)$; symmetry then gives $I(X; Y) \leq H(X)$
D) The bound and the definition are unrelated

Correct: C)

Q7: What is a common pitfall when using the Venn diagram picture of entropy and mutual information?

A) Extending the "overlap" intuition to three or more variables, where it breaks down
B) Treating $I(X; Y)$ as symmetric
C) Assuming $I(X; Y) \geq 0$
D) Assuming $I(X; X) = H(X)$

Correct: A)

Q8: What is the Venn diagram intuition useful for?

A) Recalling identities such as $I(X; Y) = H(X) + H(Y) - H(X, Y)$ for two variables
B) Computing the correlation coefficient $\rho$
C) Proving that KL divergence is non-negative
D) Determining the sign of $X$ from $Y = X^2$

Correct: A)

Practice Problems

  1. Show that $I(X; Y) = H(X) + H(Y) - H(X, Y)$.

    Answer: From the chain rule, $H(X, Y) = H(X) + H(Y \mid X)$, so $H(Y \mid X) = H(X, Y) - H(X)$. Then $I(X; Y) = H(Y) - H(Y \mid X) = H(Y) - [H(X, Y) - H(X)] = H(X) + H(Y) - H(X, Y)$ ✓

  2. If $X$ and $Y$ are independent, what is $I(X; Y)$? Why?

    Answer: $I(X; Y) = 0$. For independent variables, $H(Y \mid X) = H(Y)$, so $I(X; Y) = H(Y) - H(Y) = 0$. Alternatively, $p(x, y) = p(x)p(y)$, so $\log \frac{p(x,y)}{p(x)p(y)} = \log 1 = 0$ for all $(x, y)$.

  3. For a deterministic function $Y = f(X)$, prove $I(X; Y) = H(Y)$.

    Answer: If $Y = f(X)$, then $H(Y \mid X) = 0$ (deterministic given $X$), so $I(X; Y) = H(Y) - H(Y \mid X) = H(Y) - 0 = H(Y)$. Also, $I(X; Y) = H(X) - H(X \mid Y) \leq H(X)$. Combining the two gives $H(Y) \leq H(X)$: a deterministic function cannot increase entropy.

  4. If $H(X) = 5$, $H(Y) = 3$, and $I(X; Y) = 2$, find $H(X, Y)$ and $H(Y \mid X)$.

    Answer: $H(X, Y) = H(X) + H(Y) - I(X; Y) = 5 + 3 - 2 = 6$ bits, and $H(Y \mid X) = H(Y) - I(X; Y) = 3 - 2 = 1$ bit.

  5. Give an example where correlation is near zero but mutual information is high.

    Answer: $X \sim \text{Uniform}\{-2, -1, 0, 1, 2\}$, $Y = X^2$. $\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0 - 0 \cdot 2 = 0$, so $\rho = 0$. But $Y$ tells you the absolute value of $X$: $H(Y \mid X) = 0$ (deterministic), so $I(X; Y) = H(Y) > 0$. Here $Y \in \{0, 1, 4\}$ with probabilities $(0.2, 0.4, 0.4)$, so $H(Y) \approx 1.52$ bits. MI captures the functional dependence that correlation misses.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 13-04-kl-divergence.md