### 13.2 — Conditional Entropy and Chain Rule
Phase: Information Theory Prerequisites: 13-01-entropy, 10-02-conditional-probability
Learning Objectives
By the end of this subject, you will be able to:
- Define and compute conditional entropy $H(Y \mid X)$
- State and apply the chain rule $H(X, Y) = H(X) + H(Y \mid X)$
- Prove that conditioning reduces entropy: $H(Y \mid X) \leq H(Y)$
- Extend the chain rule to $n$ variables
- Interpret conditional entropy as remaining uncertainty after observing $X$
Core Content
Conditional Entropy
$H(Y \mid X)$ is the average uncertainty about $Y$ when you know $X$:
$$H(Y \mid X) = \sum_{x} p(x) \cdot H(Y \mid X = x)$$
$$= \sum_{x} p(x) \left[-\sum_{y} p(y \mid x) \log_2 p(y \mid x)\right]$$
$$= -\sum_{x, y} p(x, y) \log_2 p(y \mid x)$$
Interpretation: For each value $x$ of $X$, compute the entropy of $Y$ given that $X = x$. Then average over the distribution of $X$.
⚠️ CRITICAL: $H(Y \mid X = x)$ can be LARGER than $H(Y)$ for a specific $x$. The inequality $H(Y \mid X) \leq H(Y)$ only holds for the AVERAGE conditional entropy, not for individual $x$ values. For example, if $X$ takes a surprising value, $H(Y \mid X = x)$ might spike.
The Chain Rule
$$H(X, Y) = H(X) + H(Y \mid X)$$
This is the information-theoretic chain rule. It decomposes the total uncertainty about $(X, Y)$ into: 1. Uncertainty about $X$ alone 2. Remaining uncertainty about $Y$ after knowing $X$
Proof sketch: $H(X, Y) = -\sum p(x, y)\log p(x, y) = -\sum p(x, y)\log[p(x) \cdot p(y \mid x)]$
$= -\sum p(x, y)\log p(x) - \sum p(x, y)\log p(y \mid x)$
$= -\sum p(x)\log p(x) - \sum p(x, y)\log p(y \mid x) = H(X) + H(Y \mid X)$
⚠️ CRITICAL: Conditioning Reduces Entropy
$$H(Y \mid X) \leq H(Y)$$
Proof: $I(X; Y) = H(Y) - H(Y \mid X) \geq 0$ (mutual information is always non-negative; see 13-03). Therefore $H(Y \mid X) \leq H(Y)$.
Intuition: Knowing $X$ cannot make you MORE uncertain about $Y$ on average. It either provides information (reducing entropy) or is irrelevant (leaving entropy unchanged).
Equality holds iff $X$ and $Y$ are independent: knowing $X$ tells you nothing about $Y$.
Chain Rule for $n$ Variables
By repeated application:
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1})$$
This is the information-theoretic analogue of the probability chain rule.
Special case — independent variables: $H(X_1, \ldots, X_n) = \sum_i H(X_i)$
Symmetry of Joint Entropy
$H(X, Y) = H(Y, X)$ — joint entropy is symmetric.
Therefore: $H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$ → Bayes' rule for entropy.
Key Terms
- Chain rule
- Conditional entropy
- Conditioning reduces entropy
Worked Examples
Example 1: Computing conditional entropy
$X$ = weather (Sunny with $p=0.6$, Rainy with $p=0.4$) $Y$ = whether you carry an umbrella.
Given Sunny: P(umbrella) = 0.1, P(no umbrella) = 0.9 Given Rainy: P(umbrella) = 0.9, P(no umbrella) = 0.1
$H(Y \mid X=\text{Sunny}) = -0.1\log_2 0.1 - 0.9\log_2 0.9 = H_b(0.1) \approx 0.469$
$H(Y \mid X=\text{Rainy}) = -0.9\log_2 0.9 - 0.1\log_2 0.1 = H_b(0.1) \approx 0.469$
$H(Y \mid X) = 0.6 \cdot 0.469 + 0.4 \cdot 0.469 = 0.469$ bits
Note: Because the conditional distributions are symmetric, $H(Y \mid X) = H_b(0.1)$ regardless of $p(x)$.
Example 2: Chain rule in action
$X$ is a fair coin. If $X = \text{Heads}$, $Y$ is a fair die roll (1-6). If $X = \text{Tails}$, $Y$ is always 1.
$H(X) = 1$ bit (fair coin)
$H(Y \mid X=\text{Heads}) = \log_2 6 \approx 2.585$ bits $H(Y \mid X=\text{Tails}) = 0$ bits (deterministic)
$H(Y \mid X) = 0.5 \cdot 2.585 + 0.5 \cdot 0 = 1.293$ bits
$H(X, Y) = H(X) + H(Y \mid X) = 1 + 1.293 = 2.293$ bits
Example 3: Conditioning reduces entropy
$H(Y)$ from Example 2: $p(Y=1) = 0.5(1/6) + 0.5(1) = 0.083 + 0.5 = 0.583$, plus 5 other outcomes at $0.5/6 \approx 0.083$ each.
Without computing exactly: $H(Y)$ > $H(Y \mid X) = 1.293$ because knowing $X$ reduces our uncertainty about $Y$. The inequality $H(Y \mid X) \leq H(Y)$ holds.
Quiz
Q1: What does the concept of Chain rule primarily refer to in this subject?
A) A computational error related to Chain rule B) A historical anecdote about Chain rule C) A visual representation of Chain rule D) The definition and application of Chain rule
Correct: D)
- If you chose A: This is incorrect. Chain rule is defined as: the definition and application of chain rule. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Chain rule is defined as: the definition and application of chain rule. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. Chain rule is defined as: the definition and application of chain rule. The other options describe different aspects that are not the primary focus.
- If you chose D: Chain rule is defined as: the definition and application of chain rule. The other options describe different aspects that are not the primary focus. Correct!
Q2: Which of the following is the key formula discussed in this subject?
A) H(Y \mid X) B) The inverse operation of the formula in question C) An unrelated formula from a different topic D) A simplified version of H(Y \mid X)...
Correct: A)
- If you chose A: The formula H(Y \mid X) is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose B: This is incorrect. The formula H(Y \mid X) is central to this subject. The other options are either simplified versions or unrelated.
- If you chose C: This is incorrect. The formula H(Y \mid X) is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: This is incorrect. The formula H(Y \mid X) is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Conditional entropy?
A) It is used only in advanced research contexts B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It is used to conditional entropy in mathematical analysis
Correct: D)
- If you chose A: This is incorrect. Conditional entropy serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Conditional entropy serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Conditional entropy serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: Conditional entropy serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
Q4: Which statement about Conditioning reduces entropy is TRUE?
A) Conditioning reduces entropy is an advanced topic beyond this subject's scope B) Conditioning reduces entropy is not related to this subject C) Conditioning reduces entropy is a fundamental concept covered in this subject D) Conditioning reduces entropy is mentioned only as a historical footnote
Correct: C)
- If you chose A: This is incorrect. Conditioning reduces entropy is a fundamental concept covered in this subject. This subject covers Conditioning reduces entropy as part of its core content.
- If you chose B: This is incorrect. Conditioning reduces entropy is a fundamental concept covered in this subject. This subject covers Conditioning reduces entropy as part of its core content.
- If you chose C: Conditioning reduces entropy is a fundamental concept covered in this subject. This subject covers Conditioning reduces entropy as part of its core content. Correct!
- If you chose D: This is incorrect. Conditioning reduces entropy is a fundamental concept covered in this subject. This subject covers Conditioning reduces entropy as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) An unrelated numerical value B) The inverse of the correct answer C) H(Y) + H(X \mid Y)$ → Bayes' rule for entropy. D) A different result from a common mistake
Correct: C)
- If you chose A: This is incorrect. The worked examples show that the result is H(Y) + H(X \mid Y)$ → Bayes' rule for entropy.. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is H(Y) + H(X \mid Y)$ → Bayes' rule for entropy.. The other options represent common errors.
- If you chose C: The worked examples show that the result is H(Y) + H(X \mid Y)$ → Bayes' rule for entropy.. The other options represent common errors. Correct!
- If you chose D: This is incorrect. The worked examples show that the result is H(Y) + H(X \mid Y)$ → Bayes' rule for entropy.. The other options represent common errors.
Q6: How are Conditioning reduces entropy and The Chain Rule related?
A) Conditioning reduces entropy and The Chain Rule are closely related concepts B) Conditioning reduces entropy and The Chain Rule are completely unrelated topics C) Conditioning reduces entropy is a special case of The Chain Rule D) Conditioning reduces entropy is the inverse of The Chain Rule
Correct: A)
- If you chose A: Both Conditioning reduces entropy and The Chain Rule are covered in this subject as interconnected topics. Correct!
- If you chose B: This is incorrect. Both Conditioning reduces entropy and The Chain Rule are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Conditioning reduces entropy and The Chain Rule are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Conditioning reduces entropy and The Chain Rule are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with ⚠️ Critical: Conditioning Reduces Entropy?
A) ⚠️ Critical: Conditioning Reduces Entropy is always computed the same way in all contexts B) The main error with ⚠️ Critical: Conditioning Reduces Entropy is using it when it is not needed C) A common mistake is confusing ⚠️ Critical: Conditioning Reduces Entropy with a similar concept D) ⚠️ Critical: Conditioning Reduces Entropy has no common misconceptions
Correct: C)
- If you chose A: This is incorrect. Students often confuse ⚠️ Critical: Conditioning Reduces Entropy with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse ⚠️ Critical: Conditioning Reduces Entropy with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: Students often confuse ⚠️ Critical: Conditioning Reduces Entropy with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose D: This is incorrect. Students often confuse ⚠️ Critical: Conditioning Reduces Entropy with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Chain Rule For $N$ Variables?
A) Use Chain Rule For $N$ Variables only in pure mathematics contexts B) Avoid Chain Rule For $N$ Variables unless explicitly instructed C) Apply Chain Rule For $N$ Variables to solve problems in this subject's domain D) Chain Rule For $N$ Variables is not practically useful
Correct: C)
- If you chose A: This is incorrect. Chain Rule For $N$ Variables is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Chain Rule For $N$ Variables is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: Chain Rule For $N$ Variables is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose D: This is incorrect. Chain Rule For $N$ Variables is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
If $X$ and $Y$ are independent, prove $H(Y \mid X) = H(Y)$.
Click for answer
For independent variables: $p(y \mid x) = p(y)$ for all $x, y$. $H(Y \mid X) = -\sum_x p(x) \sum_y p(y \mid x) \log p(y \mid x)$ $= -\sum_x p(x) \sum_y p(y) \log p(y) = \sum_x p(x) \cdot H(Y) = H(Y)$ The $(x, y)$ term becomes $p(x)p(y)$, so $H(Y \mid X) = H(Y)$. Knowing $X$ gives no information about $Y$. -
For $H(Y \mid X) = 0$, what does this imply about the relationship between $X$ and $Y$?
Click for answer
$H(Y \mid X) = 0$ means that for every $x$ with $p(x) > 0$, $H(Y \mid X = x) = 0$ — the conditional distribution $p(y \mid x)$ is deterministic. In other words, $Y$ is a function of $X$: knowing $X$ completely determines $Y$. -
$H(X) = 3$ bits, $H(Y) = 2$ bits, $H(X, Y) = 4$ bits. Find $H(X \mid Y)$ and $H(Y \mid X)$.
Click for answer
Chain rule: $H(X, Y) = H(Y) + H(X \mid Y)$ → $H(X \mid Y) = 4 - 2 = 2$ bits Also: $H(X, Y) = H(X) + H(Y \mid X)$ → $H(Y \mid X) = 4 - 3 = 1$ bit Check: $H(X \mid Y) = 2 \leq H(X) = 3$ ✓, $H(Y \mid X) = 1 \leq H(Y) = 2$ ✓ -
Generalise the chain rule to 3 variables: $H(X, Y, Z) = ?$
Click for answer
$H(X, Y, Z) = H(X) + H(Y \mid X) + H(Z \mid X, Y)$ Or any permutation, e.g.: $H(Z) + H(Y \mid Z) + H(X \mid Y, Z)$ -
Can $H(Y \mid X) > H(Y)$? Explain.
Click for answer
**No.** $H(Y \mid X) \leq H(Y)$ always, because $I(X; Y) = H(Y) - H(Y \mid X) \geq 0$. However, $H(Y \mid X = x)$ CAN be larger than $H(Y)$ for a specific $x$ — the inequality only applies to the AVERAGE conditional entropy, not to individual conditional distributions.
Summary
Key takeaways:
- Conditional entropy $H(Y \mid X)$ = average remaining uncertainty about $Y$ after knowing $X$
- Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$ decomposes joint uncertainty
- Conditioning reduces entropy: $H(Y \mid X) \leq H(Y)$, with equality iff $X \perp!!!\perp Y$
- For $n$ variables: $H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})$
- $H(Y \mid X) = 0$ iff $Y$ is a deterministic function of $X$
- Symmetry gives: $H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$
Pitfalls
-
Thinking $H(Y \mid X = x) \leq H(Y)$ for every specific $x$: The inequality $H(Y \mid X) \leq H(Y)$ holds only for the average conditional entropy over all $x$, weighted by $p(x)$. For a specific value $x$, $H(Y \mid X = x)$ can easily exceed $H(Y)$ — knowing a surprising value of $X$ may increase your uncertainty about $Y$.
-
Applying $H(X,Y) = H(X) + H(Y)$ to dependent variables: The chain rule says $H(X,Y) = H(X) + H(Y \mid X)$. The simpler sum $H(X) + H(Y)$ is only valid when $X \perp!!!\perp Y$. Using it for dependent variables overcounts entropy — this is a common error when computing joint entropy from marginals without checking independence.
-
Getting the conditioning direction backwards: $H(Y \mid X)$ is generally not equal to $H(X \mid Y)$. The chain rule $H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$ makes this asymmetry clear. Mixing up the direction leads to wrong entropy decompositions.
-
Assuming $H(Y \mid X) = 0$ implies $H(X \mid Y) = 0$: $H(Y \mid X) = 0$ means $Y$ is a deterministic function of $X$, but $X$ could still be highly uncertain given $Y$ (e.g., $Y = X^2$ for symmetric $X$). The conditional entropy being zero in one direction says nothing about the reverse.
-
Forgetting that $H(Y \mid X)$ can be zero for non-trivial $Y$: $H(Y \mid X) = 0$ does not mean $Y$ is deterministic — it means $Y$ is completely determined by $X$, even if $Y$'s marginal entropy is large. This is the case for any deterministic function $Y = f(X)$.
Next Steps
Next up: 13-03-mutual-information.md