### 13.3 — Mutual Information
Phase: Information Theory
Prerequisites: 13-02-conditional-entropy-chain-rule, 13-01-entropy
Learning Objectives
By the end of this subject, you will be able to:
- Define mutual information $I(X; Y)$ and relate it to entropy
- Prove that $I(X; Y) \geq 0$ with equality iff $X \perp\!\!\!\perp Y$
- Show that mutual information is symmetric: $I(X; Y) = I(Y; X)$
- Compute $I(X; Y)$ from joint and marginal distributions
- Compare mutual information to correlation and explain when one is preferable
Core Content
⚠️ CRITICAL: What Mutual Information Measures
Mutual information quantifies how much information one random variable contains about another. It is the reduction in uncertainty about $Y$ gained by observing $X$:
$$I(X; Y) = H(Y) - H(Y \mid X)$$
Equivalent formulations:
$$I(X; Y) = H(X) - H(X \mid Y)$$
$$I(X; Y) = H(X) + H(Y) - H(X, Y)$$
$$I(X; Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}$$
The last form makes it clear: $I(X; Y)$ is the KL divergence between the joint distribution $p(x, y)$ and the product of marginals $p(x)p(y)$ — it measures how far $(X, Y)$ is from independence.
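As a rough illustration of the sum form, the sketch below computes $I(X; Y)$ from a joint probability table and compares it with $H(X) + H(Y) - H(X, Y)$. The function names (`entropy`, `mutual_information`) and the example table are illustrative assumptions, not notation from this subject.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint table, where joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]         # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]   # marginal of Y (column sums)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical joint distribution over two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
h_joint = entropy([p for row in joint for p in row])

mi_kl = mutual_information(joint)                 # KL / sum form
mi_entropy = entropy(px) + entropy(py) - h_joint  # entropy form

print(f"KL form:      {mi_kl:.4f} bits")
print(f"Entropy form: {mi_entropy:.4f} bits")
```

The two numbers should agree up to floating-point error, since both expressions are the same quantity written differently.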
Properties
- Non-negativity: $I(X; Y) \geq 0$, with equality iff $X \perp\!\!\!\perp Y$
  - Proof: $I(X; Y)$ is a KL divergence, and KL divergence is always $\geq 0$ (Gibbs' inequality), with equality iff the two distributions coincide, i.e. $p(x, y) = p(x)p(y)$.
- Symmetry: $I(X; Y) = I(Y; X)$
  - The information $X$ provides about $Y$ equals the information $Y$ provides about $X$.
- Upper bound: $I(X; Y) \leq \min(H(X), H(Y))$
  - Mutual information can't exceed the total information in either variable.
- Self-information: $I(X; X) = H(X)$
  - A variable contains exactly its own entropy as information about itself.
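The four properties can be spot-checked numerically. The sketch below does this for one joint table (the table and helper names are illustrative assumptions); a single example is a sanity check, not a proof.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a table joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_joint = entropy([p for row in joint for p in row])
    return entropy(px) + entropy(py) - h_joint

joint = [[0.3, 0.1],
         [0.2, 0.4]]
transposed = [list(col) for col in zip(*joint)]  # swaps the roles of X and Y

px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

i_xy = mutual_information(joint)
i_yx = mutual_information(transposed)

# Joint of (X, X): p(x) on the diagonal, zero elsewhere.
diag = [[px[i] if i == j else 0.0 for j in range(len(px))]
        for i in range(len(px))]
i_xx = mutual_information(diag)

print(f"I(X;Y) = {i_xy:.4f}, I(Y;X) = {i_yx:.4f}")
print(f"min(H(X), H(Y)) = {min(entropy(px), entropy(py)):.4f}")
print(f"I(X;X) = {i_xx:.4f}, H(X) = {entropy(px):.4f}")
```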
Relationship to Entropy (Venn Diagram Intuition)
[Diagram: two overlapping regions labelled $H(X)$ and $H(Y)$. The overlap is $I(X; Y)$; the part of $H(X)$ outside the overlap is $H(X \mid Y)$, and the part of $H(Y)$ outside the overlap is $H(Y \mid X)$.]
- $I(X; Y)$ = overlap between $H(X)$ and $H(Y)$
- $H(X \mid Y) = H(X) - I(X; Y)$
- $H(Y \mid X) = H(Y) - I(X; Y)$
- $H(X, Y) = H(X) + H(Y) - I(X; Y)$
🚩 Common Pitfall: The Venn diagram is exact for two variables only. For three or more variables, the "overlap" intuition breaks down: you can have pairwise MI without triple-wise MI, and vice versa, and the three-way "overlap" can even be negative.
Mutual Information vs Correlation
| Property | Correlation $\rho$ | Mutual Information $I(X; Y)$ |
|---|---|---|
| Captures | Linear relationships only | Any statistical dependence |
| Range | $[-1, 1]$ | $[0, \infty)$ |
| Zero means | No linear relationship | Independence |
| Units | Dimensionless | Bits (or nats) |
Key advantage of MI: It detects non-linear dependencies that correlation misses.
Example: $Y = X^2$ with $X \sim \text{Uniform}(-1, 1)$. Correlation $\rho = 0$ (no linear relationship), but $I(X; Y) > 0$: $Y$ is a deterministic function of $X$, although $Y$ determines $X$ only up to sign.
Conditional Mutual Information
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$$
This is the information $X$ and $Y$ share after accounting for $Z$. It is central to feature selection: does $X$ provide additional information about $Y$ beyond what $Z$ already tells us?
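A minimal sketch of the conditional definition, using the expansion $I(X; Y \mid Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)$, which follows from $H(X \mid Z) = H(X, Z) - H(Z)$ and $H(X \mid Y, Z) = H(X, Y, Z) - H(Y, Z)$. The XOR distribution and the helper names are illustrative assumptions; the example is chosen because conditioning reveals dependence that the unconditional MI does not show.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginalize a dict keyed by tuples onto the coordinate indices in keep."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[k] for k in keep)] += p
    return out

def mi(joint):
    """I(X;Y) for outcomes (x, y, z), ignoring z."""
    return (entropy(marginal(joint, (0,))) + entropy(marginal(joint, (1,)))
            - entropy(marginal(joint, (0, 1))))

def conditional_mi(joint):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(marginal(joint, (0, 2))) + entropy(marginal(joint, (1, 2)))
            - entropy(joint) - entropy(marginal(joint, (2,))))

# Illustrative joint: X and Y are independent fair bits, Z = X XOR Y.
xor_joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

i_xy = mi(xor_joint)
i_xy_given_z = conditional_mi(xor_joint)

print(f"I(X;Y)     = {i_xy:.3f} bits")
print(f"I(X;Y | Z) = {i_xy_given_z:.3f} bits")
```

In this construction $X$ and $Y$ are independent on their own, but once $Z$ is known, either one determines the other.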
Key Terms
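- Mutual information: the reduction in uncertainty about one variable gained by observing the other, $I(X; Y) = H(Y) - H(Y \mid X)$
- Non-negative: $I(X; Y) \geq 0$ for every pair of variables
- Upper bounded: $I(X; Y) \leq \min(H(X), H(Y))$
- Zero: $I(X; Y) = 0$ iff $X$ and $Y$ are independent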
Worked Examples
Example 1: Computing mutual information
$X, Y \in \{0, 1\}$ with joint distribution:
| $p(x, y)$ | $Y=0$ | $Y=1$ | $p(x)$ |
|---|---|---|---|
| $X=0$ | 0.3 | 0.1 | 0.4 |
| $X=1$ | 0.2 | 0.4 | 0.6 |
| $p(y)$ | 0.5 | 0.5 | 1.0 |
$H(Y) = H_b(0.5) = 1$ bit
$H(Y \mid X=0)$: $p(y \mid X=0) = (0.3/0.4, 0.1/0.4) = (0.75, 0.25)$, so $H(Y \mid X=0) = H_b(0.25) = -0.25\log_2 0.25 - 0.75\log_2 0.75 = 0.5 + 0.311 = 0.811$ bits
$H(Y \mid X=1)$: $p(y \mid X=1) = (0.2/0.6, 0.4/0.6) = (0.333, 0.667)$, so $H(Y \mid X=1) = H_b(0.333) = -0.333\log_2 0.333 - 0.667\log_2 0.667 = 0.528 + 0.390 = 0.918$ bits
$H(Y \mid X) = 0.4 \cdot 0.811 + 0.6 \cdot 0.918 = 0.324 + 0.551 = 0.875$ bits
$I(X; Y) = H(Y) - H(Y \mid X) = 1 - 0.875 = 0.125$ bits
Knowing $X$ reduces uncertainty about $Y$ by 0.125 bits on average.
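The arithmetic in Example 1 can be reproduced with a few lines of code. This is a sketch that follows the same route as the worked steps ($H(Y)$ minus the weighted conditional entropies); the variable names are illustrative. Without intermediate rounding the result is about 0.1245 bits, which the example rounds to 0.125.

```python
import math

def binary_entropy(p):
    """H_b(p) in bits, with H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Joint distribution from Example 1, keyed by (x, y).
joint = {(0, 0): 0.3, (0, 1): 0.1,
         (1, 0): 0.2, (1, 1): 0.4}

# Marginals p(x) and p(Y = 1).
p_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
p_y1 = joint[(0, 1)] + joint[(1, 1)]

h_y = binary_entropy(p_y1)

# H(Y|X) = sum_x p(x) * H_b(p(Y = 1 | X = x))
h_y_given_x = sum(p_x[x] * binary_entropy(joint[(x, 1)] / p_x[x])
                  for x in (0, 1))

mi = h_y - h_y_given_x
print(f"H(Y) = {h_y:.3f}, H(Y|X) = {h_y_given_x:.3f}, I(X;Y) = {mi:.4f} bits")
```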
Example 2: MI for perfectly dependent variables
$Y = X$ (perfect dependence). Then $p(x, y) = p(x)$ when $x=y$, 0 otherwise.
$I(X; Y) = H(Y) - H(Y \mid X) = H(X) - 0 = H(X)$
Mutual information equals the entropy — $X$ completely determines $Y$.
Example 3: MI vs Correlation — non-linear case
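$X$ uniform on $\{-1, 1\}$, $Y = |X| = 1$ (always).
Covariance: $E[XY] - E[X]E[Y] = E[X \cdot 1] - 0 \cdot 1 = 0$, which suggests no linear relationship. (Strictly, $\rho$ is undefined here because $Y$ has zero variance.)
In this case $I(X; Y) = H(Y) - H(Y \mid X) = 0 - 0 = 0$ as well: $Y$ is always 1, so knowing it tells you nothing about $X$. Both measures agree, so this example does not separate them.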
Better example: $X \sim N(0, 1)$, $Y = X^2$. $\rho = 0$, but $I(X; Y) > 0$ because $Y$ tells you the magnitude of $X$ (though not the sign).
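The Gaussian example can be explored with samples. The sketch below draws $X \sim N(0, 1)$, sets $Y = X^2$, and compares the sample correlation with a crude plug-in MI estimate from equal-width bins. The estimator, the bin count, the sample size, and the seed are all assumptions made for illustration; binned estimates depend on these choices and should be read qualitatively.

```python
import math
import random
from collections import Counter

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def binned_mi(xs, ys, bins=16):
    """Plug-in MI estimate in bits after equal-width binning (rough)."""
    def to_bins(vs):
        lo, hi = min(vs), max(vs)
        width = (hi - lo) / bins
        # The maximum value would land in bin index `bins`; clamp it.
        return [min(int((v - lo) / width), bins - 1) for v in vs]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    cx, cy = Counter(bx), Counter(by)
    # sum over occupied cells of p(i,j) * log2( p(i,j) / (p(i) p(j)) )
    return sum(c / n * math.log2(c * n / (cx[i] * cy[j]))
               for (i, j), c in joint.items())

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
ys = [x * x for x in xs]

rho = pearson(xs, ys)
mi_estimate = binned_mi(xs, ys)

print(f"sample correlation: {rho:+.3f}")
print(f"binned MI estimate: {mi_estimate:.3f} bits")
```

The sample correlation should sit near zero while the MI estimate is clearly positive, matching the qualitative claim in the example.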
Quiz
Q1: What does mutual information $I(X; Y)$ measure?
A) The reduction in uncertainty about one variable gained by observing the other B) The total uncertainty in the pair $(X, Y)$ C) The strength of the linear relationship between $X$ and $Y$ D) The uncertainty that remains in $Y$ after observing $X$
Correct: A)
- If you chose A: $I(X; Y) = H(Y) - H(Y \mid X)$ is the reduction in uncertainty about $Y$ gained by observing $X$. Correct!
- If you chose B: This is incorrect. The total uncertainty in the pair is the joint entropy $H(X, Y)$; mutual information is $H(X) + H(Y) - H(X, Y)$.
- If you chose C: This is incorrect. The strength of a linear relationship is what correlation measures; mutual information captures any statistical dependence.
- If you chose D: This is incorrect. The remaining uncertainty is the conditional entropy $H(Y \mid X)$; mutual information is $H(Y) - H(Y \mid X)$.
Q2: Which of the following holds for every pair of discrete random variables $X$ and $Y$?
A) $I(X; Y) \geq 0$ B) $I(X; Y) \geq \min(H(X), H(Y))$ C) $I(X; Y) = H(X, Y)$ D) $I(X; Y) = H(X) + H(Y)$
Correct: A)
- If you chose A: Mutual information is a KL divergence, so it is never negative. Correct!
- If you chose B: This is incorrect. The inequality points the other way: $I(X; Y) \leq \min(H(X), H(Y))$.
- If you chose C: This is incorrect. In general $I(X; Y) = H(X) + H(Y) - H(X, Y)$, which is not the same as $H(X, Y)$.
- If you chose D: This is incorrect. The joint entropy must be subtracted: $I(X; Y) = H(X) + H(Y) - H(X, Y)$.
Q3: When does $I(X; Y)$ attain its minimum value of zero?
A) Whenever $X$ and $Y$ are uncorrelated B) Exactly when $X$ and $Y$ are independent C) When $Y = X$ D) When $H(X) = H(Y)$
Correct: B)
- If you chose A: This is incorrect. Zero correlation only rules out linear dependence; $Y = X^2$ with symmetric $X$ has $\rho = 0$ but $I(X; Y) > 0$.
- If you chose B: $I(X; Y) = 0$ iff $p(x, y) = p(x)p(y)$, which is the definition of independence. Correct!
- If you chose C: This is incorrect. When $Y = X$, mutual information equals $H(X)$, its largest possible value for that variable.
- If you chose D: This is incorrect. Equal entropies say nothing about whether the variables depend on each other.
Q4: Which statement about the upper bound on mutual information is TRUE?
A) $I(X; Y)$ can exceed $H(X)$ when $Y$ has higher entropy than $X$ B) $I(X; Y)$ is always at most 1 bit C) $I(X; Y) \leq \min(H(X), H(Y))$ D) $I(X; Y)$ always equals $\min(H(X), H(Y))$
Correct: C)
- If you chose A: This is incorrect. $I(X; Y) = H(X) - H(X \mid Y) \leq H(X)$, whatever the entropy of $Y$.
- If you chose B: This is incorrect. The bound depends on the entropies of the variables, not on a fixed constant.
- If you chose C: Mutual information can't exceed the total information in either variable. Correct!
- If you chose D: This is incorrect. The bound is not always attained; in Example 1, $I(X; Y) \approx 0.125$ bits while both entropies are close to 1 bit.
Q5: By the chain rule used in the practice problems, which expression equals the joint entropy $H(X, Y)$?
A) $H(X) + H(Y)$ B) $H(X) - H(Y \mid X)$ C) $H(X) + H(Y \mid X)$ D) $H(Y) - H(Y \mid X)$
Correct: C)
- If you chose A: This is incorrect. $H(X) + H(Y)$ equals $H(X, Y)$ only when $X$ and $Y$ are independent; in general it overcounts by $I(X; Y)$.
- If you chose B: This is incorrect. The conditional entropy is added, not subtracted: $H(X, Y) = H(X) + H(Y \mid X)$.
- If you chose C: The chain rule gives $H(X, Y) = H(X) + H(Y \mid X)$. Correct!
- If you chose D: This is incorrect. $H(Y) - H(Y \mid X)$ is the mutual information $I(X; Y)$, not the joint entropy.
Q6: How does the upper bound $I(X; Y) \leq \min(H(X), H(Y))$ relate to the definition $I(X; Y) = H(Y) - H(Y \mid X)$?
A) The bound is a special case that holds only for independent variables B) The bound contradicts the definition C) The bound follows from the definition, because conditional entropy is non-negative D) The bound and the definition are unrelated
Correct: C)
- If you chose A: This is incorrect. The bound holds for every pair of variables; for independent variables it is trivially satisfied because $I(X; Y) = 0$.
- If you chose B: This is incorrect. The bound is a consequence of the definition, not in conflict with it.
- If you chose C: Since $H(Y \mid X) \geq 0$, $I(X; Y) = H(Y) - H(Y \mid X) \leq H(Y)$, and by symmetry $I(X; Y) \leq H(X)$. Correct!
- If you chose D: This is incorrect. The bound is derived directly from the definition together with symmetry.
Q7: Which of the following is a common pitfall when working with the properties of mutual information?
A) Treating MI as a distance metric even though it does not satisfy the triangle inequality B) Assuming $I(X; Y) \geq 0$ C) Assuming $I(X; Y) = I(Y; X)$ D) Using $I(X; X) = H(X)$ as a sanity check
Correct: A)
- If you chose A: MI is symmetric and non-negative, but it is not a metric. Correct!
- If you chose B: This is incorrect. Non-negativity is a genuine property of mutual information, not a pitfall.
- If you chose C: This is incorrect. Symmetry is a genuine property of mutual information, not a pitfall.
- If you chose D: This is incorrect. $I(X; X) = H(X)$ holds, and checking it is a useful test of an MI computation.
Q8: When is the entropy Venn diagram a reliable tool?
A) When relating the entropies of two variables, where it is exact B) For any number of variables C) Only when $X$ and $Y$ are independent D) Never; it is purely decorative
Correct: A)
- If you chose A: For two variables the diagram encodes $H(X, Y) = H(X) + H(Y) - I(X; Y)$ exactly. Correct!
- If you chose B: This is incorrect. For three or more variables the "overlap" intuition breaks down.
- If you chose C: This is incorrect. The diagram is exact for any two variables; independence just means the overlap is empty.
- If you chose D: This is incorrect. For two variables every region of the diagram corresponds to a well-defined entropy quantity.
Practice Problems
- Show that $I(X; Y) = H(X) + H(Y) - H(X, Y)$.
  Answer: From the chain rule, $H(X, Y) = H(X) + H(Y \mid X)$, so $H(Y \mid X) = H(X, Y) - H(X)$. Then $I(X; Y) = H(Y) - H(Y \mid X) = H(Y) - [H(X, Y) - H(X)] = H(X) + H(Y) - H(X, Y)$ ✓
- If $X$ and $Y$ are independent, what is $I(X; Y)$? Why?
  Answer: $I(X; Y) = 0$. For independent variables, $H(Y \mid X) = H(Y)$, so $I(X; Y) = H(Y) - H(Y) = 0$. Alternatively, $p(x, y) = p(x)p(y)$, so $\log \frac{p(x,y)}{p(x)p(y)} = \log 1 = 0$ for all $(x, y)$.
- For a deterministic function $Y = f(X)$, prove $I(X; Y) = H(Y)$.
  Answer: If $Y = f(X)$, then $H(Y \mid X) = 0$ (deterministic given $X$), so $I(X; Y) = H(Y) - H(Y \mid X) = H(Y) - 0 = H(Y)$. Also $I(X; Y) = H(X) - H(X \mid Y) \leq H(X)$, so $H(Y) \leq H(X)$: a deterministic function cannot increase entropy.
- If $H(X) = 5$, $H(Y) = 3$, and $I(X; Y) = 2$, find $H(X, Y)$ and $H(Y \mid X)$.
  Answer: $H(X, Y) = H(X) + H(Y) - I(X; Y) = 5 + 3 - 2 = 6$ bits, and $H(Y \mid X) = H(Y) - I(X; Y) = 3 - 2 = 1$ bit.
- Give an example where correlation is near zero but mutual information is high.
  Answer: $X \sim \text{Uniform}\{-2, -1, 0, 1, 2\}$, $Y = X^2$. $\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0 - 0 \cdot 2 = 0$, so $\rho = 0$. But $Y$ tells you the absolute value of $X$: $H(Y \mid X) = 0$ (deterministic), so $I(X; Y) = H(Y) > 0$. MI captures the perfect functional dependence that correlation misses.
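The last practice problem can be checked exactly. The sketch below uses `fractions.Fraction` for the moments so that the covariance comes out as an exact zero, then computes $I(X; Y) = H(X) + H(Y) - H(X, Y)$ in bits. Variable names are illustrative.

```python
import math
from collections import defaultdict
from fractions import Fraction

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(float(p) * math.log2(float(p)) for p in probs if p > 0)

support = [-2, -1, 0, 1, 2]
p = Fraction(1, len(support))  # uniform probability of each x

# Exact moments: Cov(X, Y) = E[XY] - E[X]E[Y] with Y = X^2.
e_x = sum(p * x for x in support)
e_y = sum(p * x**2 for x in support)
e_xy = sum(p * x**3 for x in support)
cov = e_xy - e_x * e_y

# Distribution of Y = X^2: the values 1 and 4 each collect two x's.
p_y = defaultdict(Fraction)
for x in support:
    p_y[x**2] += p

h_x = entropy([p] * len(support))
h_y = entropy(p_y.values())
h_xy = entropy([p] * len(support))  # each x yields exactly one pair (x, x^2)
mi = h_x + h_y - h_xy

print(f"Cov(X, Y) = {cov}")
print(f"H(Y) = {h_y:.4f} bits, I(X;Y) = {mi:.4f} bits")
```

Here $H(X, Y) = H(X)$ because $Y$ adds no new outcomes once $X$ is known, which is why the mutual information collapses to $H(Y)$.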
Summary
Key takeaways:
- Mutual information $I(X; Y)$ measures shared information: how much knowing one variable reduces uncertainty about the other
- $I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y)$
- Non-negative and symmetric: $I(X; Y) = I(Y; X) \geq 0$
- Zero iff $X$ and $Y$ are independent
- Upper bounded by $\min(H(X), H(Y))$
- Unlike correlation, MI captures any statistical dependence, including non-linear
Pitfalls
- Taking the entropy Venn diagram literally for more than two variables: The Venn diagram ($H(X) + H(Y) - I(X;Y) = H(X,Y)$) is exact for two variables but breaks down with three or more. For three variables, the pairwise overlaps don't fully describe the information structure: the triple-wise interaction information can be negative, so it has no Venn diagram analogue.
- Equating zero correlation with zero mutual information: Correlation only captures linear dependence. $Y = X^2$ for symmetric $X$ gives $\rho = 0$ but $I(X; Y) > 0$. Conversely, $I(X; Y) = 0$ implies full statistical independence, which is stronger than zero correlation. MI detects any form of dependence.
- Forgetting that $I(X; X) = H(X)$, not zero: A variable always shares its entire entropy with itself. This boundary case is a useful sanity check: if your MI computation gives $I(X; X) \neq H(X)$, something is wrong.
- Treating mutual information as a distance metric: MI is symmetric and non-negative, but it does NOT satisfy the triangle inequality. You cannot say $I(X; Z) \leq I(X; Y) + I(Y; Z)$ in general. The data processing inequality gives $I(X; Z) \leq I(X; Y)$ for Markov chains $X \to Y \to Z$, which is a different bound entirely.
- Confusing conditional MI with incremental value: $I(X; Y \mid Z)$ measures the information $X$ and $Y$ share after accounting for $Z$. A small $I(X; Y \mid Z)$ doesn't mean $X$ is useless for predicting $Y$; it means $Z$ already captures that information. Feature selection based on conditional MI must consider this subtlety.
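The failure of the triangle inequality mentioned in the pitfalls can be seen in a small counterexample. In this sketch (an illustrative construction, not taken from the lesson), $X$ is a fair bit, $Z$ is a copy of $X$, and $Y$ is an independent fair bit, so $I(X; Z)$ is large while both $I(X; Y)$ and $I(Y; Z)$ vanish.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginalize a dict keyed by tuples onto the coordinate indices in keep."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[k] for k in keep)] += p
    return out

def pairwise_mi(joint, a, b):
    """I(A;B) in bits between coordinates a and b of the joint."""
    return (entropy(marginal(joint, (a,))) + entropy(marginal(joint, (b,)))
            - entropy(marginal(joint, (a, b))))

# Outcomes are (x, y, z) with z = x; x and y are independent fair bits.
joint = {(x, y, x): 0.25 for x, y in product((0, 1), repeat=2)}

i_xz = pairwise_mi(joint, 0, 2)
i_xy = pairwise_mi(joint, 0, 1)
i_yz = pairwise_mi(joint, 1, 2)

print(f"I(X;Z) = {i_xz:.3f} bits")
print(f"I(X;Y) + I(Y;Z) = {i_xy + i_yz:.3f} bits")
```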
Next Steps
Next up: 13-04-kl-divergence.md