### 13.1 — Entropy
Phase: Information Theory Prerequisites: 11-10-information-theory-connection, 10-04-discrete-random-variables
Learning Objectives
By the end of this subject, you will be able to:
- Define entropy $H(X) = -\sum p(x) \log_2 p(x)$ and explain what it measures
- Interpret entropy as average uncertainty, information content, or surprise
- Compute entropy for simple discrete distributions (Bernoulli, uniform, categorical)
- Express entropy in bits ($\log_2$) and nats ($\ln$)
- Prove that entropy is maximised by the uniform distribution
Core Content
⚠️ CRITICAL: What Entropy Actually Is
Entropy $H(X)$ measures the average uncertainty about the outcome of a random variable $X$, or equivalently, the average information gained by observing its value.
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$
By convention, $0 \log 0 = 0$ (continuity: $\lim_{p \to 0^+} p \log p = 0$).
Three equivalent interpretations: 1. Uncertainty: How uncertain are we about $X$ before observing it? 2. Information content: How much information do we gain on average when we observe $X$? 3. Average surprise: $-\log_2 p(x)$ is the "surprise" of observing $x$ (rare events are more surprising). Entropy is the expected surprise.
Units
| Log base | Unit | When to use |
|---|---|---|
| $\log_2$ | bits | Digital communication, computing |
| $\ln$ (base $e$) | nats | Theoretical derivations (nicer derivatives) |
| $\log_{10}$ | hartleys (dits) | Rare, historical |
Conversion: $1 \text{ bit} = \ln 2 \approx 0.693 \text{ nats}$
Properties of Entropy
-
Non-negativity: $H(X) \geq 0$, with equality iff $X$ is deterministic (one outcome has probability 1).
-
Maximum entropy principle: For a discrete random variable with $|\mathcal{X}| = K$ possible outcomes, $H(X) \leq \log_2 K$, with equality iff $X$ is uniform: $p(x) = 1/K$ for all $x$.
This is why a fair coin (1 bit) has more entropy than a biased coin (less than 1 bit) — uncertainty is maximised when all outcomes are equally likely.
- Additivity for independent variables: $H(X, Y) = H(X) + H(Y)$ if $X \perp!!!\perp Y$.
⚠️ CRITICAL: Self-Information vs Entropy
Self-information (or surprisal) of event $x$: $I(x) = -\log_2 p(x)$
- Rare events have HIGH self-information (more surprising)
- Certain events ($p=1$) have ZERO self-information
Entropy is the expected self-information: $H(X) = E[I(X)] = E[-\log_2 p(X)]$
🚩 Common Pitfall: Entropy is NOT the self-information of the most likely outcome. It's an AVERAGE over all outcomes, weighted by their probabilities.
Entropy of a Bernoulli Random Variable
$X \sim \text{Bernoulli}(p)$, so $P(X=1) = p$, $P(X=0) = 1-p$.
$$H(X) = -p\log_2 p - (1-p)\log_2(1-p)$$
This is called the binary entropy function, denoted $H_b(p)$.
- $H_b(0.5) = 1$ bit (maximum — fair coin)
- $H_b(0.1) \approx 0.469$ bits
- $H_b(0) = H_b(1) = 0$ bits (deterministic)
The binary entropy function is symmetric about $p = 0.5$ and concave.
Joint Entropy
For two random variables $X, Y$ with joint distribution $p(x, y)$:
$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$$
This measures the total uncertainty about the pair $(X, Y)$.
Key Terms
- Binary entropy
- Entropy
- Joint entropy
- Self-information
Worked Examples
Example 1: Entropy of a fair die
$X$ = outcome of a fair 6-sided die. $p(x) = 1/6$ for all $x$.
$H(X) = -\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} = -6 \cdot \frac{1}{6} \cdot \log_2 \frac{1}{6} = -\log_2 6^{-1} = \log_2 6 \approx 2.585$ bits
This is the maximum possible entropy for any distribution on 6 outcomes.
Example 2: Biased coin
$P(\text{Heads}) = 0.9$, $P(\text{Tails}) = 0.1$.
$H(X) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1$
$\log_2 0.9 \approx -0.152$, $\log_2 0.1 \approx -3.322$
$H(X) = -0.9(-0.152) - 0.1(-3.322) = 0.137 + 0.332 = 0.469$ bits
Much less than 1 bit — the coin is highly predictable.
Example 3: Joint entropy of independent variables
$X, Y$ are independent fair coin flips. $p(x,y) = 1/4$ for all 4 outcomes.
$H(X, Y) = -\sum 0.25 \log_2 0.25 = -4 \cdot 0.25 \cdot (-2) = 2$ bits
Since $X$ and $Y$ are independent: $H(X, Y) = H(X) + H(Y) = 1 + 1 = 2$ ✓
Quiz
Q1: What does the concept of Entropy primarily refer to in this subject?
A) A historical anecdote about Entropy B) A visual representation of Entropy C) The definition and application of Entropy D) A computational error related to Entropy
Correct: C)
- If you chose A: This is incorrect. Entropy is defined as: the definition and application of entropy. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Entropy is defined as: the definition and application of entropy. The other options describe different aspects that are not the primary focus.
- If you chose C: Entropy is defined as: the definition and application of entropy. The other options describe different aspects that are not the primary focus. Correct!
- If you chose D: This is incorrect. Entropy is defined as: the definition and application of entropy. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) A simplified version of H(X) = -\sum p(x) \log_2 p(x... B) H(X) = -\sum p(x) \log_2 p(x) C) The inverse operation of the formula in question D) An unrelated formula from a different topic
Correct: B)
- If you chose A: This is incorrect. The formula H(X) = -\sum p(x) \log_2 p(x) is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: The formula H(X) = -\sum p(x) \log_2 p(x) is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose C: This is incorrect. The formula H(X) = -\sum p(x) \log_2 p(x) is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: This is incorrect. The formula H(X) = -\sum p(x) \log_2 p(x) is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Self-information?
A) It is used to self-information in mathematical analysis B) It replaces all other methods in this domain C) It is primarily a historical notation system D) It is used only in advanced research contexts
Correct: A)
- If you chose A: Self-information serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose B: This is incorrect. Self-information serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Self-information serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Self-information serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Binary entropy is TRUE?
A) Binary entropy is an advanced topic beyond this subject's scope B) Binary entropy is a fundamental concept covered in this subject C) Binary entropy is not related to this subject D) Binary entropy is mentioned only as a historical footnote
Correct: B)
- If you chose A: This is incorrect. Binary entropy is a fundamental concept covered in this subject. This subject covers Binary entropy as part of its core content.
- If you chose B: Binary entropy is a fundamental concept covered in this subject. This subject covers Binary entropy as part of its core content. Correct!
- If you chose C: This is incorrect. Binary entropy is a fundamental concept covered in this subject. This subject covers Binary entropy as part of its core content.
- If you chose D: This is incorrect. Binary entropy is a fundamental concept covered in this subject. This subject covers Binary entropy as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) 5$ bits. B) The inverse of the correct answer C) A different result from a common mistake D) An unrelated numerical value
Correct: A)
- If you chose A: The worked examples show that the result is 5$ bits.. The other options represent common errors. Correct!
- If you chose B: This is incorrect. The worked examples show that the result is 5$ bits.. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is 5$ bits.. The other options represent common errors.
- If you chose D: This is incorrect. The worked examples show that the result is 5$ bits.. The other options represent common errors.
Q6: How are Binary entropy and Joint entropy related?
A) Binary entropy and Joint entropy are closely related concepts B) Binary entropy is the inverse of Joint entropy C) Binary entropy is a special case of Joint entropy D) Binary entropy and Joint entropy are completely unrelated topics
Correct: A)
- If you chose A: Both Binary entropy and Joint entropy are covered in this subject as interconnected topics. Correct!
- If you chose B: This is incorrect. Both Binary entropy and Joint entropy are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Binary entropy and Joint entropy are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Binary entropy and Joint entropy are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with ⚠️ Critical: What Entropy Actually Is?
A) ⚠️ Critical: What Entropy Actually Is is always computed the same way in all contexts B) A common mistake is confusing ⚠️ Critical: What Entropy Actually Is with a similar concept C) The main error with ⚠️ Critical: What Entropy Actually Is is using it when it is not needed D) ⚠️ Critical: What Entropy Actually Is has no common misconceptions
Correct: B)
- If you chose A: This is incorrect. Students often confuse ⚠️ Critical: What Entropy Actually Is with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: Students often confuse ⚠️ Critical: What Entropy Actually Is with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose C: This is incorrect. Students often confuse ⚠️ Critical: What Entropy Actually Is with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse ⚠️ Critical: What Entropy Actually Is with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Units?
A) Avoid Units unless explicitly instructed B) Use Units only in pure mathematics contexts C) Apply Units to solve problems in this subject's domain D) Units is not practically useful
Correct: C)
- If you chose A: This is incorrect. Units is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Units is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: Units is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose D: This is incorrect. Units is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
Compute $H_b(0.3)$, the binary entropy for $p = 0.3$.
Click for answer
$H_b(0.3) = -0.3\log_2 0.3 - 0.7\log_2 0.7$ $\log_2 0.3 \approx -1.737$, $\log_2 0.7 \approx -0.515$ $H_b(0.3) = -0.3(-1.737) - 0.7(-0.515) = 0.521 + 0.360 = 0.881$ bits -
A random variable takes values {a, b, c} with probabilities {0.5, 0.25, 0.25}. What is its entropy?
Click for answer
$H(X) = -0.5\log_2 0.5 - 0.25\log_2 0.25 - 0.25\log_2 0.25$ $= -0.5(-1) - 0.25(-2) - 0.25(-2) = 0.5 + 0.5 + 0.5 = 1.5$ bits -
Prove that for a uniform distribution over $K$ outcomes, $H(X) = \log_2 K$.
Click for answer
$p(x) = 1/K$ for all $x$. $H(X) = -\sum_{i=1}^{K} \frac{1}{K} \log_2 \frac{1}{K} = -K \cdot \frac{1}{K} \cdot \log_2 K^{-1}$ $= -1 \cdot (-\log_2 K) = \log_2 K$ ✓ -
A deterministic random variable always takes value 7. What is $H(X)$?
Click for answer
$p(7) = 1$, $p(\text{anything else}) = 0$. $H(X) = -1 \cdot \log_2 1 - 0 \cdot \log 0 = -1 \cdot 0 - 0 = 0$ bits. If there's no uncertainty, there's zero entropy. -
For $X$ and $Y$ independent, $H(X) = 2$ bits and $H(Y) = 3$ bits. What is $H(X, Y)$?
Click for answer
For independent variables: $H(X, Y) = H(X) + H(Y) = 2 + 3 = 5$ bits. Independence means knowing $X$ tells you nothing about $Y$, so total uncertainty is the sum.
Summary
Key takeaways:
- Entropy $H(X) = -\sum p(x) \log p(x)$ measures average uncertainty in bits (if $\log_2$)
- Self-information $-\log_2 p(x)$ is the surprise of a specific outcome; entropy is its expectation
- $0 \leq H(X) \leq \log_2 |\mathcal{X}|$ — minimum for deterministic, maximum for uniform
- Binary entropy $H_b(p)$ is the entropy of a Bernoulli($p$) and is concave, symmetric
- Joint entropy $H(X,Y)$ measures total uncertainty about the pair
- For independent $X, Y$: $H(X,Y) = H(X) + H(Y)$
Pitfalls
-
Confusing entropy with variance or spread: Entropy measures uncertainty in bits, not the spread of values. A distribution can have high entropy and low variance or low entropy and high variance. Always ask "how surprised am I on average?" not "how spread out are the values?"
-
Forgetting the $0 \log 0 = 0$ convention: Terms where $p(x) = 0$ contribute nothing to entropy, but the limit $\lim_{p \to 0^+} p \log p = 0$ must be applied. In numerical code, always filter zero-probability events before computing the sum — otherwise you may get NaN from $0 \cdot (-\infty)$.
-
Using the wrong log base without converting: $H(X)$ in bits uses $\log_2$, while theoretical derivations often use $\ln$. Results in nats must be divided by $\ln 2$ to get bits. Always check units when comparing entropy values from different sources.
-
Assuming joint entropy equals the sum of marginals: $H(X, Y) = H(X) + H(Y)$ only when $X \perp!!!\perp Y$. In general, $H(X, Y) = H(X) + H(Y \mid X) < H(X) + H(Y)$ when variables share information. Forgetting the conditional term overcounts information content.
-
Thinking the uniform distribution always maximises entropy: Uniform maximises entropy only when the sole constraint is the number of outcomes. Under a fixed mean, the exponential is max-entropy; under fixed mean and variance, the Gaussian is max-entropy (continuous case). Each constraint changes the maximiser.