### 13.1 — Entropy

Phase: Information Theory
Prerequisites: 11-10-information-theory-connection, 10-04-discrete-random-variables

Learning Objectives

By the end of this subject, you will be able to:

Define entropy $H(X) = -\sum p(x) \log_2 p(x)$ and explain what it measures
Interpret entropy as average uncertainty, information content, or surprise
Compute entropy for simple discrete distributions (Bernoulli, uniform, categorical)
Express entropy in bits ($\log_2$) and nats ($\ln$)
Prove that entropy is maximised by the uniform distribution

Core Content

⚠️ CRITICAL: What Entropy Actually Is

Entropy $H(X)$ measures the average uncertainty about the outcome of a random variable $X$, or equivalently, the average information gained by observing its value.

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

By convention, $0 \log 0 = 0$ (continuity: $\lim_{p \to 0^+} p \log p = 0$).

Three equivalent interpretations:

1. Uncertainty: How uncertain are we about $X$ before observing it?
2. Information content: How much information do we gain on average when we observe $X$?
3. Average surprise: $-\log_2 p(x)$ is the "surprise" of observing $x$ (rare events are more surprising). Entropy is the expected surprise.
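The definition and the $0 \log 0 = 0$ convention above can be turned into a few lines of Python. This is a minimal sketch, not a reference implementation: the function names `surprisal_bits` and `entropy_bits` are hypothetical and chosen only for illustration.

```python
import math

def surprisal_bits(p):
    """Surprise -log2 p(x) of an outcome with probability p > 0."""
    return -math.log2(p)

def entropy_bits(probs):
    """Entropy in bits: the probability-weighted average surprise."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Skipping p == 0 terms implements the 0 log 0 = 0 convention.
    return sum(p * surprisal_bits(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])        # maximally uncertain coin
deterministic = entropy_bits([1.0, 0.0])    # no uncertainty at all
print(fair_coin, deterministic)
```

Writing entropy as a weighted sum of surprisals keeps the "expected surprise" interpretation visible in the code itself.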

Units

| Log base | Unit | When to use |
| --- | --- | --- |
| $\log_2$ | bits | Digital communication, computing |
| $\ln$ (base $e$) | nats | Theoretical derivations (nicer derivatives) |
| $\log_{10}$ | hartleys (dits) | Rare, historical |

Conversion: $1 \text{ bit} = \ln 2 \approx 0.693 \text{ nats}$
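To make the unit conversion concrete, the sketch below computes the same entropy in three log bases and converts between them. The function name `entropy` and its `base` parameter are assumptions made for this illustration; only the standard library call `math.log(x, base)` is used.

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a discrete distribution; base 2 gives bits, base e gives nats."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

probs = [0.9, 0.1]
h_bits = entropy(probs, base=2.0)
h_nats = entropy(probs, base=math.e)
h_hartleys = entropy(probs, base=10.0)

# Changing the unit only rescales the value by a constant factor.
bits_from_nats = h_nats / math.log(2)   # nats -> bits
nats_from_bits = h_bits * math.log(2)   # bits -> nats
print(h_bits, h_nats, h_hartleys)
```

The distribution is unchanged across the three calls; only the number attached to it differs, which is why values from different sources should be compared in the same unit.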

Properties of Entropy

Non-negativity: $H(X) \geq 0$, with equality iff $X$ is deterministic (one outcome has probability 1).
Maximum entropy principle: For a discrete random variable with $|\mathcal{X}| = K$ possible outcomes, $H(X) \leq \log_2 K$, with equality iff $X$ is uniform: $p(x) = 1/K$ for all $x$.

This is why a fair coin (1 bit) has more entropy than a biased coin (less than 1 bit) — uncertainty is maximised when all outcomes are equally likely.
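One way to see why the bound $H(X) \leq \log_2 K$ holds is sketched here. It assumes Jensen's inequality for the concave function $\log_2$, which is not derived in this subject, and it writes $S = \{x : p(x) > 0\}$ for the support, a symbol introduced only for this sketch.

$$H(X) = \sum_{x \in S} p(x) \log_2 \frac{1}{p(x)} \leq \log_2 \left( \sum_{x \in S} p(x) \cdot \frac{1}{p(x)} \right) = \log_2 |S| \leq \log_2 K$$

The first inequality is Jensen's. Because $\log_2$ is strictly concave, it is tight only when $1/p(x)$ takes the same value for every $x \in S$. The second inequality is tight only when $S = \mathcal{X}$. Both hold with equality exactly when $p(x) = 1/K$ for all $x$.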

Additivity for independent variables: $H(X, Y) = H(X) + H(Y)$ if $X \perp\!\!\!\perp Y$.
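Additivity follows in a few steps from the joint entropy definition $H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$ once independence is written as $p(x, y) = p(x)\,p(y)$. A short derivation, using only that factorisation and $\sum_x p(x) = \sum_y p(y) = 1$:

$$\begin{aligned}
H(X, Y) &= -\sum_{x, y} p(x)\,p(y) \left[ \log_2 p(x) + \log_2 p(y) \right] \\
&= -\sum_{x} p(x) \log_2 p(x) \sum_{y} p(y) \; - \; \sum_{y} p(y) \log_2 p(y) \sum_{x} p(x) \\
&= H(X) + H(Y)
\end{aligned}$$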

⚠️ CRITICAL: Self-Information vs Entropy

Self-information (or surprisal) of event $x$: $I(x) = -\log_2 p(x)$

Rare events have HIGH self-information (more surprising)
Certain events ($p=1$) have ZERO self-information

Entropy is the expected self-information: $H(X) = E[I(X)] = E[-\log_2 p(X)]$

🚩 Common Pitfall: Entropy is NOT the self-information of the most likely outcome. It's an AVERAGE over all outcomes, weighted by their probabilities.
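The difference between the two quantities is easy to check numerically. The sketch below uses a 0.9/0.1 coin; the names `self_information_bits`, `dist`, and `most_likely` are hypothetical and exist only for this illustration.

```python
import math

def self_information_bits(p):
    """Self-information I(x) = -log2 p(x) of one outcome."""
    return -math.log2(p)

dist = {"heads": 0.9, "tails": 0.1}

# Self-information is a property of each individual outcome.
surprisal = {x: self_information_bits(p) for x, p in dist.items()}

# Entropy is the expectation of self-information under the distribution.
entropy = sum(p * surprisal[x] for x, p in dist.items())

most_likely = max(dist, key=dist.get)
print(surprisal, entropy, most_likely)
```

Here the self-information of the most likely outcome is smaller than the entropy, and the self-information of the rare outcome is much larger, so neither one can stand in for the average.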

Entropy of a Bernoulli Random Variable

$X \sim \text{Bernoulli}(p)$, so $P(X=1) = p$, $P(X=0) = 1-p$.

$$H(X) = -p\log_2 p - (1-p)\log_2(1-p)$$

This is called the binary entropy function, denoted $H_b(p)$.

$H_b(0.5) = 1$ bit (maximum — fair coin)
$H_b(0.1) \approx 0.469$ bits
$H_b(0) = H_b(1) = 0$ bits (deterministic)

The binary entropy function is symmetric about $p = 0.5$ and concave.
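The values and the symmetry claim above can be checked with a small sketch. The function name `binary_entropy` is an assumption for this example, and the concavity check is only a spot check at one pair of points, not a proof.

```python
import math

def binary_entropy(p):
    """H_b(p) in bits for a Bernoulli(p) variable."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0 or p == 1.0:
        return 0.0  # 0 log 0 = 0 convention: deterministic outcome
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

h_fair = binary_entropy(0.5)
h_biased = binary_entropy(0.1)

# Symmetry about p = 0.5: H_b(p) and H_b(1 - p) agree.
symmetry_gap = abs(binary_entropy(0.3) - binary_entropy(0.7))

# Concavity spot check: the curve at a midpoint lies above the chord.
a, b = 0.1, 0.5
midpoint_above_chord = binary_entropy((a + b) / 2) >= (binary_entropy(a) + binary_entropy(b)) / 2
print(h_fair, h_biased, symmetry_gap, midpoint_above_chord)
```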

Joint Entropy

For two random variables $X, Y$ with joint distribution $p(x, y)$:

$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$$

This measures the total uncertainty about the pair $(X, Y)$.
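As an illustration of the definition, the sketch below computes joint entropy from a table of joint probabilities and compares it with the sum of the marginal entropies. The joint table is made-up example data, and the helper names `entropy_bits` and `marginals` are hypothetical.

```python
import math
from collections import defaultdict

def entropy_bits(probs):
    """Entropy in bits, skipping zero-probability terms."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

def marginals(joint):
    """Sum a joint table {(x, y): p} down to the two marginal tables."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)

# Made-up joint distribution of two correlated binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

h_xy = entropy_bits(joint.values())
px, py = marginals(joint)
h_x, h_y = entropy_bits(px.values()), entropy_bits(py.values())
print(h_xy, h_x + h_y)
```

Because the two variables in this table tend to agree, the joint entropy comes out below $H(X) + H(Y)$; the two would coincide only for an independent pair.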

Key Terms

Binary entropy
Entropy
Joint entropy
Self-information

Worked Examples

Example 1: Entropy of a fair die

$X$ = outcome of a fair 6-sided die. $p(x) = 1/6$ for all $x$.

$H(X) = -\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} = -6 \cdot \frac{1}{6} \cdot \log_2 \frac{1}{6} = -\log_2 6^{-1} = \log_2 6 \approx 2.585$ bits

This is the maximum possible entropy for any distribution on 6 outcomes.

Example 2: Biased coin

$P(\text{Heads}) = 0.9$, $P(\text{Tails}) = 0.1$.

$H(X) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1$

$\log_2 0.9 \approx -0.152$, $\log_2 0.1 \approx -3.322$

$H(X) = -0.9(-0.152) - 0.1(-3.322) = 0.137 + 0.332 = 0.469$ bits

Much less than 1 bit — the coin is highly predictable.

Example 3: Joint entropy of independent variables

$X, Y$ are independent fair coin flips. $p(x,y) = 1/4$ for all 4 outcomes.

$H(X, Y) = -\sum 0.25 \log_2 0.25 = -4 \cdot 0.25 \cdot (-2) = 2$ bits

Since $X$ and $Y$ are independent: $H(X, Y) = H(X) + H(Y) = 1 + 1 = 2$ ✓
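The three worked examples can be reproduced numerically as a sanity check. This is a minimal sketch; `entropy_bits` is a hypothetical helper defined here, not part of any library.

```python
import math

def entropy_bits(probs):
    """Entropy in bits, skipping zero-probability terms."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

# Example 1: fair six-sided die.
h_die = entropy_bits([1 / 6] * 6)

# Example 2: biased coin with P(Heads) = 0.9.
h_biased_coin = entropy_bits([0.9, 0.1])

# Example 3: two independent fair coin flips, four equally likely pairs.
h_two_coins = entropy_bits([0.25] * 4)
h_one_coin = entropy_bits([0.5, 0.5])

print(h_die, h_biased_coin, h_two_coins, 2 * h_one_coin)
```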

Quiz

Q1: What does the concept of Entropy primarily refer to in this subject?

A) A historical anecdote about Entropy
B) A visual representation of Entropy
C) The definition and application of Entropy
D) A computational error related to Entropy

Correct: C)

If you chose A: This is incorrect. This subject focuses on the definition of entropy as average uncertainty and on its application to simple distributions. The other options describe different aspects that are not the primary focus.
If you chose B: This is incorrect. This subject focuses on the definition of entropy as average uncertainty and on its application to simple distributions. The other options describe different aspects that are not the primary focus.
If you chose C: This subject focuses on the definition of entropy as average uncertainty and on its application to simple distributions. The other options describe different aspects that are not the primary focus. Correct!
If you chose D: This is incorrect. This subject focuses on the definition of entropy as average uncertainty and on its application to simple distributions. The other options describe different aspects that are not the primary focus.

Q2: Which of the following is the key formula discussed in this subject?

A) A simplified version of H(X) = -\sum p(x) \log_2 p(x...
B) $H(X) = -\sum p(x) \log_2 p(x)$
C) The inverse operation of the formula in question
D) An unrelated formula from a different topic

Correct: B)

If you chose A: This is incorrect. The formula $H(X) = -\sum p(x) \log_2 p(x)$ is central to this subject. The other options are either simplified versions or unrelated.
If you chose B: The formula $H(X) = -\sum p(x) \log_2 p(x)$ is central to this subject. The other options are either simplified versions or unrelated. Correct!
If you chose C: This is incorrect. The formula $H(X) = -\sum p(x) \log_2 p(x)$ is central to this subject. The other options are either simplified versions or unrelated.
If you chose D: This is incorrect. The formula $H(X) = -\sum p(x) \log_2 p(x)$ is central to this subject. The other options are either simplified versions or unrelated.

Q3: What is the primary purpose of self-information?

A) It quantifies the surprise of observing a specific outcome $x$, as $I(x) = -\log_2 p(x)$
B) It replaces all other methods in this domain
C) It is primarily a historical notation system
D) It is used only in advanced research contexts

Correct: A)

If you chose A: Self-information measures how surprising a single outcome is, and entropy is its expected value. The other options misrepresent its role. Correct!
If you chose B: This is incorrect. Self-information measures how surprising a single outcome is, and entropy is its expected value. The other options misrepresent its role.
If you chose C: This is incorrect. Self-information measures how surprising a single outcome is, and entropy is its expected value. The other options misrepresent its role.
If you chose D: This is incorrect. Self-information measures how surprising a single outcome is, and entropy is its expected value. The other options misrepresent its role.

Q4: Which statement about Binary entropy is TRUE?

A) Binary entropy is an advanced topic beyond this subject's scope
B) Binary entropy is a fundamental concept covered in this subject
C) Binary entropy is not related to this subject
D) Binary entropy is mentioned only as a historical footnote

Correct: B)

If you chose A: This is incorrect. Binary entropy is a fundamental concept covered in this subject. This subject covers Binary entropy as part of its core content.
If you chose B: Binary entropy is a fundamental concept covered in this subject. This subject covers Binary entropy as part of its core content. Correct!
If you chose C: This is incorrect. Binary entropy is a fundamental concept covered in this subject. This subject covers Binary entropy as part of its core content.
If you chose D: This is incorrect. Binary entropy is a fundamental concept covered in this subject. This subject covers Binary entropy as part of its core content.

Q5: Based on the practice problems in this subject, what is $H(X, Y)$ when $X$ and $Y$ are independent with $H(X) = 2$ bits and $H(Y) = 3$ bits?

A) 5 bits
B) The inverse of the correct answer
C) A different result from a common mistake
D) An unrelated numerical value

Correct: A)

If you chose A: The practice problems show that the result is 5 bits, because the entropies of independent variables add. The other options represent common errors. Correct!
If you chose B: This is incorrect. The practice problems show that the result is 5 bits, because the entropies of independent variables add. The other options represent common errors.
If you chose C: This is incorrect. The practice problems show that the result is 5 bits, because the entropies of independent variables add. The other options represent common errors.
If you chose D: This is incorrect. The practice problems show that the result is 5 bits, because the entropies of independent variables add. The other options represent common errors.

Q6: How are Binary entropy and Joint entropy related?

A) Binary entropy and Joint entropy are closely related concepts
B) Binary entropy is the inverse of Joint entropy
C) Binary entropy is a special case of Joint entropy
D) Binary entropy and Joint entropy are completely unrelated topics

Correct: A)

If you chose A: Both Binary entropy and Joint entropy are covered in this subject as interconnected topics. Correct!
If you chose B: This is incorrect. Both Binary entropy and Joint entropy are covered in this subject as interconnected topics.
If you chose C: This is incorrect. Both Binary entropy and Joint entropy are covered in this subject as interconnected topics.
If you chose D: This is incorrect. Both Binary entropy and Joint entropy are covered in this subject as interconnected topics.

Q7: What is a common pitfall when working with entropy?

A) Entropy is always computed the same way in all contexts
B) A common mistake is confusing entropy with a similar concept
C) The main error with entropy is using it when it is not needed
D) Entropy has no common misconceptions

Correct: B)

If you chose A: This is incorrect. Students often confuse entropy with related concepts, such as the self-information of the most likely outcome. Pay attention to the precise definitions.
If you chose B: Students often confuse entropy with related concepts, such as the self-information of the most likely outcome. Pay attention to the precise definitions. Correct!
If you chose C: This is incorrect. Students often confuse entropy with related concepts, such as the self-information of the most likely outcome. Pay attention to the precise definitions.
If you chose D: This is incorrect. Students often confuse entropy with related concepts, such as the self-information of the most likely outcome. Pay attention to the precise definitions.

Q8: When should you pay attention to the units of entropy?

A) Avoid units unless explicitly instructed
B) Use units only in pure mathematics contexts
C) Apply units whenever you solve problems in this subject's domain
D) Units are not practically useful

Correct: C)

If you chose A: This is incorrect. Units (bits, nats, hartleys) are used throughout this subject whenever an entropy value is computed or compared.
If you chose B: This is incorrect. Units (bits, nats, hartleys) are used throughout this subject whenever an entropy value is computed or compared.
If you chose C: Units (bits, nats, hartleys) are used throughout this subject whenever an entropy value is computed or compared. Correct!
If you chose D: This is incorrect. Units (bits, nats, hartleys) are used throughout this subject whenever an entropy value is computed or compared.

Practice Problems

Compute $H_b(0.3)$, the binary entropy for $p = 0.3$.

Answer:
$H_b(0.3) = -0.3\log_2 0.3 - 0.7\log_2 0.7$
$\log_2 0.3 \approx -1.737$, $\log_2 0.7 \approx -0.515$
$H_b(0.3) = -0.3(-1.737) - 0.7(-0.515) = 0.521 + 0.360 = 0.881$ bits

A random variable takes values {a, b, c} with probabilities {0.5, 0.25, 0.25}. What is its entropy?

Answer:
$H(X) = -0.5\log_2 0.5 - 0.25\log_2 0.25 - 0.25\log_2 0.25$
$= -0.5(-1) - 0.25(-2) - 0.25(-2) = 0.5 + 0.5 + 0.5 = 1.5$ bits

Prove that for a uniform distribution over $K$ outcomes, $H(X) = \log_2 K$.

Answer:
$p(x) = 1/K$ for all $x$.
$H(X) = -\sum_{i=1}^{K} \frac{1}{K} \log_2 \frac{1}{K} = -K \cdot \frac{1}{K} \cdot \log_2 K^{-1}$
$= -1 \cdot (-\log_2 K) = \log_2 K$ ✓

A deterministic random variable always takes value 7. What is $H(X)$?

Answer:
$p(7) = 1$, $p(\text{anything else}) = 0$.
$H(X) = -1 \cdot \log_2 1 - 0 \cdot \log_2 0 = -1 \cdot 0 - 0 = 0$ bits.
If there's no uncertainty, there's zero entropy.

For $X$ and $Y$ independent, $H(X) = 2$ bits and $H(Y) = 3$ bits. What is $H(X, Y)$?

Answer:
For independent variables: $H(X, Y) = H(X) + H(Y) = 2 + 3 = 5$ bits.
Independence means knowing $X$ tells you nothing about $Y$, so total uncertainty is the sum.

Summary

Key takeaways:

Entropy $H(X) = -\sum p(x) \log p(x)$ measures average uncertainty in bits (if $\log_2$)
Self-information $-\log_2 p(x)$ is the surprise of a specific outcome; entropy is its expectation
$0 \leq H(X) \leq \log_2 |\mathcal{X}|$ — minimum for deterministic, maximum for uniform
Binary entropy $H_b(p)$ is the entropy of a Bernoulli($p$) variable; it is concave and symmetric about $p = 0.5$
Joint entropy $H(X,Y)$ measures total uncertainty about the pair
For independent $X, Y$: $H(X,Y) = H(X) + H(Y)$

Pitfalls

Confusing entropy with variance or spread: Entropy measures uncertainty in bits, not the spread of values. A distribution can have high entropy and low variance or low entropy and high variance. Always ask "how surprised am I on average?" not "how spread out are the values?"
Forgetting the $0 \log 0 = 0$ convention: Terms where $p(x) = 0$ contribute nothing to entropy, but the limit $\lim_{p \to 0^+} p \log p = 0$ must be applied. In numerical code, always filter zero-probability events before computing the sum — otherwise you may get NaN from $0 \cdot (-\infty)$.
Using the wrong log base without converting: $H(X)$ in bits uses $\log_2$, while theoretical derivations often use $\ln$. Results in nats must be divided by $\ln 2$ to get bits. Always check units when comparing entropy values from different sources.
Assuming joint entropy equals the sum of marginals: $H(X, Y) = H(X) + H(Y)$ only when $X \perp\!\!\!\perp Y$. In general, $H(X, Y) = H(X) + H(Y \mid X) \leq H(X) + H(Y)$, with strict inequality when the variables share information. Replacing the conditional term $H(Y \mid X)$ with $H(Y)$ overcounts information content.
Thinking the uniform distribution always maximises entropy: Uniform maximises entropy only when the sole constraint is the number of outcomes. For a continuous variable on $[0, \infty)$ with a fixed mean, the exponential distribution is max-entropy; for a continuous variable on the real line with fixed mean and variance, the Gaussian is max-entropy. Each constraint changes the maximiser.
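Two of the pitfalls above, the zero-probability terms and the confusion with variance, can be illustrated in a few lines of standard-library Python. This is a sketch under stated assumptions: the helper names and the two example distributions are made up for illustration.

```python
import math

# Pitfall: 0 * (-inf) is NaN in floating point, so zero terms must be skipped.
nan_term = 0.0 * float("-inf")

# In pure Python, math.log2(0.0) does not even return -inf; it raises.
try:
    math.log2(0.0)
    log_of_zero_raised = False
except ValueError:
    log_of_zero_raised = True

def entropy_bits(probs):
    """Entropy in bits; the `if p > 0` filter applies 0 log 0 = 0."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

def variance(values, probs):
    """Variance of a discrete random variable."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

# Pitfall: entropy ignores the values themselves, variance does not.
far_apart = ([0, 1000], [0.5, 0.5])          # low entropy, high variance
close_together = ([0, 1, 2, 3], [0.25] * 4)  # higher entropy, low variance

for values, probs in (far_apart, close_together):
    print(entropy_bits(probs), variance(values, probs))
```

The first distribution has 1 bit of entropy with a variance of 250000, while the second has 2 bits with a variance of 1.25, so neither quantity can be read off from the other.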

Next Steps

Next up: 13-02-conditional-entropy-chain-rule.md
