Math graphic
📐 Concept diagram

### 13.2 — Conditional Entropy and Chain Rule

Phase: Information Theory Prerequisites: 13-01-entropy, 10-02-conditional-probability

Learning Objectives

By the end of this subject, you will be able to:

  1. Define and compute conditional entropy $H(Y \mid X)$
  2. State and apply the chain rule $H(X, Y) = H(X) + H(Y \mid X)$
  3. Prove that conditioning reduces entropy: $H(Y \mid X) \leq H(Y)$
  4. Extend the chain rule to $n$ variables
  5. Interpret conditional entropy as remaining uncertainty after observing $X$

Core Content

Conditional Entropy

$H(Y \mid X)$ is the average uncertainty about $Y$ when you know $X$:

$$H(Y \mid X) = \sum_{x} p(x) \cdot H(Y \mid X = x)$$

$$= \sum_{x} p(x) \left[-\sum_{y} p(y \mid x) \log_2 p(y \mid x)\right]$$

$$= -\sum_{x, y} p(x, y) \log_2 p(y \mid x)$$

Interpretation: For each value $x$ of $X$, compute the entropy of $Y$ given that $X = x$. Then average over the distribution of $X$.

⚠️ CRITICAL: $H(Y \mid X = x)$ can be LARGER than $H(Y)$ for a specific $x$. The inequality $H(Y \mid X) \leq H(Y)$ only holds for the AVERAGE conditional entropy, not for individual $x$ values. For example, if $X$ takes a surprising value, $H(Y \mid X = x)$ might spike.

The Chain Rule

$$H(X, Y) = H(X) + H(Y \mid X)$$

This is the information-theoretic chain rule. It decomposes the total uncertainty about $(X, Y)$ into: 1. Uncertainty about $X$ alone 2. Remaining uncertainty about $Y$ after knowing $X$

Proof sketch: $H(X, Y) = -\sum p(x, y)\log p(x, y) = -\sum p(x, y)\log[p(x) \cdot p(y \mid x)]$

$= -\sum p(x, y)\log p(x) - \sum p(x, y)\log p(y \mid x)$

$= -\sum p(x)\log p(x) - \sum p(x, y)\log p(y \mid x) = H(X) + H(Y \mid X)$

⚠️ CRITICAL: Conditioning Reduces Entropy

$$H(Y \mid X) \leq H(Y)$$

Proof: $I(X; Y) = H(Y) - H(Y \mid X) \geq 0$ (mutual information is always non-negative; see 13-03). Therefore $H(Y \mid X) \leq H(Y)$.

Intuition: Knowing $X$ cannot make you MORE uncertain about $Y$ on average. It either provides information (reducing entropy) or is irrelevant (leaving entropy unchanged).

Equality holds iff $X$ and $Y$ are independent: knowing $X$ tells you nothing about $Y$.

Chain Rule for $n$ Variables

By repeated application:

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1})$$

This is the information-theoretic analogue of the probability chain rule.

Special case — independent variables: $H(X_1, \ldots, X_n) = \sum_i H(X_i)$

Symmetry of Joint Entropy

$H(X, Y) = H(Y, X)$ — joint entropy is symmetric.

Therefore: $H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$ → Bayes' rule for entropy.



Key Terms

Worked Examples

Example 1: Computing conditional entropy

$X$ = weather (Sunny with $p=0.6$, Rainy with $p=0.4$) $Y$ = whether you carry an umbrella.

Given Sunny: P(umbrella) = 0.1, P(no umbrella) = 0.9 Given Rainy: P(umbrella) = 0.9, P(no umbrella) = 0.1

$H(Y \mid X=\text{Sunny}) = -0.1\log_2 0.1 - 0.9\log_2 0.9 = H_b(0.1) \approx 0.469$

$H(Y \mid X=\text{Rainy}) = -0.9\log_2 0.9 - 0.1\log_2 0.1 = H_b(0.1) \approx 0.469$

$H(Y \mid X) = 0.6 \cdot 0.469 + 0.4 \cdot 0.469 = 0.469$ bits

Note: Because the conditional distributions are symmetric, $H(Y \mid X) = H_b(0.1)$ regardless of $p(x)$.

Example 2: Chain rule in action

$X$ is a fair coin. If $X = \text{Heads}$, $Y$ is a fair die roll (1-6). If $X = \text{Tails}$, $Y$ is always 1.

$H(X) = 1$ bit (fair coin)

$H(Y \mid X=\text{Heads}) = \log_2 6 \approx 2.585$ bits $H(Y \mid X=\text{Tails}) = 0$ bits (deterministic)

$H(Y \mid X) = 0.5 \cdot 2.585 + 0.5 \cdot 0 = 1.293$ bits

$H(X, Y) = H(X) + H(Y \mid X) = 1 + 1.293 = 2.293$ bits

Example 3: Conditioning reduces entropy

$H(Y)$ from Example 2: $p(Y=1) = 0.5(1/6) + 0.5(1) = 0.083 + 0.5 = 0.583$, plus 5 other outcomes at $0.5/6 \approx 0.083$ each.

Without computing exactly: $H(Y)$ > $H(Y \mid X) = 1.293$ because knowing $X$ reduces our uncertainty about $Y$. The inequality $H(Y \mid X) \leq H(Y)$ holds.



Quiz

Q1: What does the concept of Chain rule primarily refer to in this subject?

A) A computational error related to Chain rule B) A historical anecdote about Chain rule C) A visual representation of Chain rule D) The definition and application of Chain rule

Correct: D)

Q2: Which of the following is the key formula discussed in this subject?

A) H(Y \mid X) B) The inverse operation of the formula in question C) An unrelated formula from a different topic D) A simplified version of H(Y \mid X)...

Correct: A)

Q3: What is the primary purpose of Conditional entropy?

A) It is used only in advanced research contexts B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It is used to conditional entropy in mathematical analysis

Correct: D)

Q4: Which statement about Conditioning reduces entropy is TRUE?

A) Conditioning reduces entropy is an advanced topic beyond this subject's scope B) Conditioning reduces entropy is not related to this subject C) Conditioning reduces entropy is a fundamental concept covered in this subject D) Conditioning reduces entropy is mentioned only as a historical footnote

Correct: C)

Q5: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) The inverse of the correct answer C) H(Y) + H(X \mid Y)$ → Bayes' rule for entropy. D) A different result from a common mistake

Correct: C)

Q6: How are Conditioning reduces entropy and The Chain Rule related?

A) Conditioning reduces entropy and The Chain Rule are closely related concepts B) Conditioning reduces entropy and The Chain Rule are completely unrelated topics C) Conditioning reduces entropy is a special case of The Chain Rule D) Conditioning reduces entropy is the inverse of The Chain Rule

Correct: A)

Q7: What is a common pitfall when working with ⚠️ Critical: Conditioning Reduces Entropy?

A) ⚠️ Critical: Conditioning Reduces Entropy is always computed the same way in all contexts B) The main error with ⚠️ Critical: Conditioning Reduces Entropy is using it when it is not needed C) A common mistake is confusing ⚠️ Critical: Conditioning Reduces Entropy with a similar concept D) ⚠️ Critical: Conditioning Reduces Entropy has no common misconceptions

Correct: C)

Q8: When should you apply Chain Rule For $N$ Variables?

A) Use Chain Rule For $N$ Variables only in pure mathematics contexts B) Avoid Chain Rule For $N$ Variables unless explicitly instructed C) Apply Chain Rule For $N$ Variables to solve problems in this subject's domain D) Chain Rule For $N$ Variables is not practically useful

Correct: C)

Practice Problems

  1. If $X$ and $Y$ are independent, prove $H(Y \mid X) = H(Y)$.

    Click for answer For independent variables: $p(y \mid x) = p(y)$ for all $x, y$. $H(Y \mid X) = -\sum_x p(x) \sum_y p(y \mid x) \log p(y \mid x)$ $= -\sum_x p(x) \sum_y p(y) \log p(y) = \sum_x p(x) \cdot H(Y) = H(Y)$ The $(x, y)$ term becomes $p(x)p(y)$, so $H(Y \mid X) = H(Y)$. Knowing $X$ gives no information about $Y$.

  2. For $H(Y \mid X) = 0$, what does this imply about the relationship between $X$ and $Y$?

    Click for answer $H(Y \mid X) = 0$ means that for every $x$ with $p(x) > 0$, $H(Y \mid X = x) = 0$ — the conditional distribution $p(y \mid x)$ is deterministic. In other words, $Y$ is a function of $X$: knowing $X$ completely determines $Y$.

  3. $H(X) = 3$ bits, $H(Y) = 2$ bits, $H(X, Y) = 4$ bits. Find $H(X \mid Y)$ and $H(Y \mid X)$.

    Click for answer Chain rule: $H(X, Y) = H(Y) + H(X \mid Y)$ → $H(X \mid Y) = 4 - 2 = 2$ bits Also: $H(X, Y) = H(X) + H(Y \mid X)$ → $H(Y \mid X) = 4 - 3 = 1$ bit Check: $H(X \mid Y) = 2 \leq H(X) = 3$ ✓, $H(Y \mid X) = 1 \leq H(Y) = 2$ ✓

  4. Generalise the chain rule to 3 variables: $H(X, Y, Z) = ?$

    Click for answer $H(X, Y, Z) = H(X) + H(Y \mid X) + H(Z \mid X, Y)$ Or any permutation, e.g.: $H(Z) + H(Y \mid Z) + H(X \mid Y, Z)$

  5. Can $H(Y \mid X) > H(Y)$? Explain.

    Click for answer **No.** $H(Y \mid X) \leq H(Y)$ always, because $I(X; Y) = H(Y) - H(Y \mid X) \geq 0$. However, $H(Y \mid X = x)$ CAN be larger than $H(Y)$ for a specific $x$ — the inequality only applies to the AVERAGE conditional entropy, not to individual conditional distributions.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 13-03-mutual-information.md