Math graphic
📐 Concept diagram

### 13.1 — Entropy

Phase: Information Theory Prerequisites: 11-10-information-theory-connection, 10-04-discrete-random-variables

Learning Objectives

By the end of this subject, you will be able to:

  1. Define entropy $H(X) = -\sum p(x) \log_2 p(x)$ and explain what it measures
  2. Interpret entropy as average uncertainty, information content, or surprise
  3. Compute entropy for simple discrete distributions (Bernoulli, uniform, categorical)
  4. Express entropy in bits ($\log_2$) and nats ($\ln$)
  5. Prove that entropy is maximised by the uniform distribution

Core Content

⚠️ CRITICAL: What Entropy Actually Is

Entropy $H(X)$ measures the average uncertainty about the outcome of a random variable $X$, or equivalently, the average information gained by observing its value.

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

By convention, $0 \log 0 = 0$ (continuity: $\lim_{p \to 0^+} p \log p = 0$).

Three equivalent interpretations: 1. Uncertainty: How uncertain are we about $X$ before observing it? 2. Information content: How much information do we gain on average when we observe $X$? 3. Average surprise: $-\log_2 p(x)$ is the "surprise" of observing $x$ (rare events are more surprising). Entropy is the expected surprise.

Units

Log base Unit When to use
$\log_2$ bits Digital communication, computing
$\ln$ (base $e$) nats Theoretical derivations (nicer derivatives)
$\log_{10}$ hartleys (dits) Rare, historical

Conversion: $1 \text{ bit} = \ln 2 \approx 0.693 \text{ nats}$

Properties of Entropy

  1. Non-negativity: $H(X) \geq 0$, with equality iff $X$ is deterministic (one outcome has probability 1).

  2. Maximum entropy principle: For a discrete random variable with $|\mathcal{X}| = K$ possible outcomes, $H(X) \leq \log_2 K$, with equality iff $X$ is uniform: $p(x) = 1/K$ for all $x$.

This is why a fair coin (1 bit) has more entropy than a biased coin (less than 1 bit) — uncertainty is maximised when all outcomes are equally likely.

  1. Additivity for independent variables: $H(X, Y) = H(X) + H(Y)$ if $X \perp!!!\perp Y$.

⚠️ CRITICAL: Self-Information vs Entropy

Self-information (or surprisal) of event $x$: $I(x) = -\log_2 p(x)$

Entropy is the expected self-information: $H(X) = E[I(X)] = E[-\log_2 p(X)]$

🚩 Common Pitfall: Entropy is NOT the self-information of the most likely outcome. It's an AVERAGE over all outcomes, weighted by their probabilities.

Entropy of a Bernoulli Random Variable

$X \sim \text{Bernoulli}(p)$, so $P(X=1) = p$, $P(X=0) = 1-p$.

$$H(X) = -p\log_2 p - (1-p)\log_2(1-p)$$

This is called the binary entropy function, denoted $H_b(p)$.

The binary entropy function is symmetric about $p = 0.5$ and concave.

Joint Entropy

For two random variables $X, Y$ with joint distribution $p(x, y)$:

$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$$

This measures the total uncertainty about the pair $(X, Y)$.



Key Terms

Worked Examples

Example 1: Entropy of a fair die

$X$ = outcome of a fair 6-sided die. $p(x) = 1/6$ for all $x$.

$H(X) = -\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} = -6 \cdot \frac{1}{6} \cdot \log_2 \frac{1}{6} = -\log_2 6^{-1} = \log_2 6 \approx 2.585$ bits

This is the maximum possible entropy for any distribution on 6 outcomes.

Example 2: Biased coin

$P(\text{Heads}) = 0.9$, $P(\text{Tails}) = 0.1$.

$H(X) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1$

$\log_2 0.9 \approx -0.152$, $\log_2 0.1 \approx -3.322$

$H(X) = -0.9(-0.152) - 0.1(-3.322) = 0.137 + 0.332 = 0.469$ bits

Much less than 1 bit — the coin is highly predictable.

Example 3: Joint entropy of independent variables

$X, Y$ are independent fair coin flips. $p(x,y) = 1/4$ for all 4 outcomes.

$H(X, Y) = -\sum 0.25 \log_2 0.25 = -4 \cdot 0.25 \cdot (-2) = 2$ bits

Since $X$ and $Y$ are independent: $H(X, Y) = H(X) + H(Y) = 1 + 1 = 2$ ✓



Quiz

Q1: What does the concept of Entropy primarily refer to in this subject?

A) A historical anecdote about Entropy B) A visual representation of Entropy C) The definition and application of Entropy D) A computational error related to Entropy

Correct: C)

Q2: Which of the following is the key formula discussed in this subject?

A) A simplified version of H(X) = -\sum p(x) \log_2 p(x... B) H(X) = -\sum p(x) \log_2 p(x) C) The inverse operation of the formula in question D) An unrelated formula from a different topic

Correct: B)

Q3: What is the primary purpose of Self-information?

A) It is used to self-information in mathematical analysis B) It replaces all other methods in this domain C) It is primarily a historical notation system D) It is used only in advanced research contexts

Correct: A)

Q4: Which statement about Binary entropy is TRUE?

A) Binary entropy is an advanced topic beyond this subject's scope B) Binary entropy is a fundamental concept covered in this subject C) Binary entropy is not related to this subject D) Binary entropy is mentioned only as a historical footnote

Correct: B)

Q5: Based on the worked examples in this subject, what is the correct result?

A) 5$ bits. B) The inverse of the correct answer C) A different result from a common mistake D) An unrelated numerical value

Correct: A)

Q6: How are Binary entropy and Joint entropy related?

A) Binary entropy and Joint entropy are closely related concepts B) Binary entropy is the inverse of Joint entropy C) Binary entropy is a special case of Joint entropy D) Binary entropy and Joint entropy are completely unrelated topics

Correct: A)

Q7: What is a common pitfall when working with ⚠️ Critical: What Entropy Actually Is?

A) ⚠️ Critical: What Entropy Actually Is is always computed the same way in all contexts B) A common mistake is confusing ⚠️ Critical: What Entropy Actually Is with a similar concept C) The main error with ⚠️ Critical: What Entropy Actually Is is using it when it is not needed D) ⚠️ Critical: What Entropy Actually Is has no common misconceptions

Correct: B)

Q8: When should you apply Units?

A) Avoid Units unless explicitly instructed B) Use Units only in pure mathematics contexts C) Apply Units to solve problems in this subject's domain D) Units is not practically useful

Correct: C)

Practice Problems

  1. Compute $H_b(0.3)$, the binary entropy for $p = 0.3$.

    Click for answer $H_b(0.3) = -0.3\log_2 0.3 - 0.7\log_2 0.7$ $\log_2 0.3 \approx -1.737$, $\log_2 0.7 \approx -0.515$ $H_b(0.3) = -0.3(-1.737) - 0.7(-0.515) = 0.521 + 0.360 = 0.881$ bits

  2. A random variable takes values {a, b, c} with probabilities {0.5, 0.25, 0.25}. What is its entropy?

    Click for answer $H(X) = -0.5\log_2 0.5 - 0.25\log_2 0.25 - 0.25\log_2 0.25$ $= -0.5(-1) - 0.25(-2) - 0.25(-2) = 0.5 + 0.5 + 0.5 = 1.5$ bits

  3. Prove that for a uniform distribution over $K$ outcomes, $H(X) = \log_2 K$.

    Click for answer $p(x) = 1/K$ for all $x$. $H(X) = -\sum_{i=1}^{K} \frac{1}{K} \log_2 \frac{1}{K} = -K \cdot \frac{1}{K} \cdot \log_2 K^{-1}$ $= -1 \cdot (-\log_2 K) = \log_2 K$ ✓

  4. A deterministic random variable always takes value 7. What is $H(X)$?

    Click for answer $p(7) = 1$, $p(\text{anything else}) = 0$. $H(X) = -1 \cdot \log_2 1 - 0 \cdot \log 0 = -1 \cdot 0 - 0 = 0$ bits. If there's no uncertainty, there's zero entropy.

  5. For $X$ and $Y$ independent, $H(X) = 2$ bits and $H(Y) = 3$ bits. What is $H(X, Y)$?

    Click for answer For independent variables: $H(X, Y) = H(X) + H(Y) = 2 + 3 = 5$ bits. Independence means knowing $X$ tells you nothing about $Y$, so total uncertainty is the sum.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 13-02-conditional-entropy-chain-rule.md