### 13.8 — Differential Entropy
Phase: Information Theory
Prerequisites: 13-01-entropy, 10-07-continuous-random-variables
Learning Objectives
By the end of this subject, you will be able to:
- Define differential entropy for continuous random variables
- Explain why differential entropy can be negative (unlike discrete entropy)
- Compute differential entropy for uniform, Gaussian, and exponential distributions
- State the maximum entropy principle for continuous distributions under constraints
- Relate differential entropy to discrete entropy via quantisation
Core Content
⚠️ CRITICAL: Differential vs Discrete Entropy
Differential entropy is the continuous analogue of discrete entropy:
$$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$$
where $f(x)$ is the probability density function (PDF).
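The integral can be approximated numerically as a sanity check on the definition. The sketch below is illustrative only: the helper name `differential_entropy`, the midpoint rule, and the truncation of the real line to $[-12, 12]$ are assumptions made for this example, not part of the lesson.

```python
import math

def differential_entropy(pdf, lo, hi, n=100_000, base=math.e):
    """Approximate -∫ f(x) log f(x) dx over [lo, hi] with the midpoint rule."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        fx = pdf(x)
        if fx > 0.0:  # convention: 0 * log 0 = 0
            total -= fx * math.log(fx) * dx
    return total / math.log(base)  # convert from nats to the requested base

def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# The tails beyond |x| = 12 contribute a negligible amount for a standard normal.
h_nats = differential_entropy(standard_normal_pdf, -12.0, 12.0)
h_bits = differential_entropy(standard_normal_pdf, -12.0, 12.0, base=2)
print(f"h(X) ≈ {h_nats:.4f} nats = {h_bits:.4f} bits")
```

The choice of logarithm base only rescales the result: dividing the value in nats by $\ln 2$ gives bits.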
CRUCIAL DIFFERENCES from discrete entropy:
| Property | Discrete $H(X)$ | Continuous $h(X)$ |
|---|---|---|
| Range | $[0, \infty)$ | $(-\infty, \infty)$ |
| Can be negative? | No | Yes! |
| Invariant to scaling? | Yes ($X$ categorical) | No: $h(aX) = h(X) + \log\lvert a\rvert$ |
| Meaning | Absolute info in bits | Relative to coordinate system |
🚩 Common Pitfall: A Gaussian with variance $\sigma^2 < 1/(2\pi e)$ has NEGATIVE differential entropy. This does NOT mean "negative information" — it means the distribution is more concentrated than a standard reference. Differential entropy is NOT an absolute measure like discrete entropy.
⚠️ Why Differential Entropy Can Be Negative
Discrete entropy measures bits of uncertainty. Continuous entropy measures log-volume of the "typical set" relative to the coordinate system.
For a uniform distribution on $[0, a]$: $h(X) = \log a$
If $a < 1$, $h(X) < 0$ — the distribution occupies less than 1 unit of the coordinate system. This makes sense: negative $h(X)$ means the typical set volume is less than 1.
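A few lines of code make the sign change concrete. This is a minimal sketch with hypothetical helper names; it evaluates $\log_2 a$ for several widths and computes the Gaussian variance at which $\frac{1}{2}\log(2\pi e \sigma^2)$ crosses zero.

```python
import math

def uniform_entropy_bits(a):
    """h(X) in bits for X ~ Uniform[0, a]; the density is 1/a on the support."""
    return math.log2(a)

def gaussian_entropy_bits(sigma2):
    """h(X) in bits for a Gaussian with variance sigma2."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * sigma2)

for a in (4.0, 1.0, 0.5, 0.1):
    print(f"width a = {a:>4}: h = {uniform_entropy_bits(a):+.3f} bits")

# Variance at which the Gaussian entropy is exactly zero.
threshold = 1.0 / (2.0 * math.pi * math.e)
print(f"Gaussian h(X) < 0 when sigma^2 < {threshold:.4f}")
```

Widths above 1 give positive values, width 1 gives zero, and widths below 1 give negative values, matching the log-volume reading.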
Differential Entropy of Common Distributions
Uniform distribution on $[a, b]$: $$h(X) = \log(b - a)$$
Gaussian (Normal) $N(\mu, \sigma^2)$: $$h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$$
This is the maximum entropy distribution for fixed variance. For $\sigma^2 = 1$: $h(X) = \frac{1}{2}\log(2\pi e) \approx 1.419$ nats ($\approx 2.047$ bits).
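Exponential with rate $\lambda$: $$h(X) = 1 - \ln \lambda \quad \text{(nats)}$$
The constant $1$ is $\ln e$, so this form holds only for the natural logarithm. In bits the same quantity is $\log_2(e/\lambda)$.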
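The three closed forms can be collected into small helpers with an explicit logarithm base. This is a sketch; the function names and the `base` parameter are choices made for this example.

```python
import math

def h_uniform(a, b, base=2.0):
    """Uniform on [a, b]: log(b - a)."""
    return math.log(b - a, base)

def h_gaussian(sigma2, base=2.0):
    """Gaussian with variance sigma2: (1/2) log(2*pi*e*sigma2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma2, base)

def h_exponential(lam, base=2.0):
    """Exponential with rate lam: log(e / lam), i.e. 1 - ln(lam) in nats."""
    return math.log(math.e / lam, base)

print(f"Uniform(0, 4):      {h_uniform(0.0, 4.0):.3f} bits")
print(f"Gaussian, var 1:    {h_gaussian(1.0):.3f} bits = {h_gaussian(1.0, math.e):.3f} nats")
print(f"Exponential, λ = 2: {h_exponential(2.0):.3f} bits = {h_exponential(2.0, math.e):.3f} nats")
```

Writing the exponential case as $\log(e/\lambda)$ keeps it valid in any base, whereas $1 - \log\lambda$ silently assumes nats.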
⚠️ Maximum Entropy Principle (Continuous)
The principle of maximum entropy: among all distributions satisfying given constraints, choose the one with maximum entropy (it adds the fewest assumptions).
| Constraint | Max-Entropy Distribution |
|---|---|
| Fixed support $[a, b]$ | Uniform |
| Fixed mean $\mu$ (on $[0, \infty)$) | Exponential |
| Fixed variance $\sigma^2$ | Gaussian (normal) |
| Fixed mean $\mu$ AND variance $\sigma^2$ | Gaussian |
| Fixed covariance matrix | Multivariate Gaussian |
This principle justifies the ubiquity of the Gaussian distribution — it's the LEAST ASSUMPTIVE distribution with a given mean and variance. Any other distribution implicitly assumes additional structure.
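The fixed-mean row of the table can be spot-checked by comparing the exponential with another distribution on $[0, \infty)$ that has the same mean. The sketch below uses Uniform$[0, 2\mu]$ as the competitor; that choice and the helper names are assumptions for illustration, and one comparison illustrates the principle without proving it.

```python
import math

def h_exponential_mean(mu):
    """Entropy in nats of an exponential with mean mu (rate 1/mu): 1 + ln(mu)."""
    return 1.0 + math.log(mu)

def h_uniform_mean(mu):
    """Entropy in nats of Uniform[0, 2*mu], which also has mean mu."""
    return math.log(2.0 * mu)

for mu in (0.1, 1.0, 5.0):
    gap = h_exponential_mean(mu) - h_uniform_mean(mu)
    print(f"mu = {mu:>4}: exponential - uniform = {gap:.4f} nats")
```

The gap equals $1 - \ln 2$ nats for every $\mu$, because both entropies shift by the same $\ln\mu$ when the mean is rescaled.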
Relationship to Discrete Entropy
Quantise a continuous variable into bins of width $\Delta$. As $\Delta \to 0$:
$$H(X^\Delta) \approx h(X) - \log \Delta$$
The discrete entropy diverges to $\infty$ (finer quantisation means more possible values), but the corrected quantity $H(X^\Delta) + \log\Delta$ converges to $h(X)$.
This explains why differential entropy is not an absolute measure — it's the "excess" entropy beyond the resolution-dependent baseline $-\log\Delta$.
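The quantisation relationship can be observed numerically. The sketch below bins a standard normal into cells of width $\Delta$ and compares $H(X^\Delta) + \log_2\Delta$ with the closed-form $h(X)$. The function names and the truncation to $[-10, 10]$ are assumptions for this example.

```python
import math

def normal_cdf(x):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantised_entropy_bits(delta, span=10.0):
    """Discrete entropy (bits) of a standard normal binned into cells of width delta."""
    n_bins = int(round(2.0 * span / delta))
    entropy = 0.0
    for i in range(n_bins):
        lo = -span + i * delta
        p = normal_cdf(lo + delta) - normal_cdf(lo)  # probability mass of this bin
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy

h_bits = 0.5 * math.log2(2.0 * math.pi * math.e)  # closed form for variance 1
for delta in (1.0, 0.5, 0.1, 0.01):
    H = quantised_entropy_bits(delta)
    print(f"delta = {delta:<5} H = {H:7.4f}  H + log2(delta) = {H + math.log2(delta):.4f}  h = {h_bits:.4f}")
```

As $\Delta$ shrinks, the discrete entropy `H` keeps growing while `H + log2(delta)` settles towards `h`.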
Key Terms
- Differential entropy
- Maximum entropy distributions
Worked Examples
Example 1: Differential entropy of Gaussian
$X \sim N(0, 4)$ (variance = 4, $\sigma = 2$).
$h(X) = \frac{1}{2}\log_2(2\pi e \cdot 4) = \frac{1}{2}\log_2(8\pi e)$
$= \frac{1}{2}(\log_2 8 + \log_2 \pi + \log_2 e) = \frac{1}{2}(3 + 1.651 + 1.443) = \frac{1}{2}(6.094) \approx 3.047$ bits
(Using nats: $h(X) = \frac{1}{2}\ln(8\pi e) \approx \frac{1}{2}(4.224) \approx 2.112$ nats)
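Because $h(X) = \mathbb{E}[-\log f(X)]$, the closed-form value can also be cross-checked by sampling. The following is a rough Monte Carlo sketch, not a recommended estimator; the sample size, seed, and function names are arbitrary choices for this example.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mc_entropy_bits(mu, sigma, n=200_000, seed=0):
    """Estimate h(X) = E[-log2 f(X)] by averaging over n Gaussian samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        total -= math.log2(gaussian_pdf(x, mu, sigma))
    return total / n

closed_form = 0.5 * math.log2(2.0 * math.pi * math.e * 4.0)  # X ~ N(0, 4)
estimate = mc_entropy_bits(0.0, 2.0)
print(f"closed form: {closed_form:.3f} bits, Monte Carlo: {estimate:.3f} bits")
```

The two numbers should agree to within sampling noise, which shrinks like $1/\sqrt{n}$.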
Example 2: Negative differential entropy
Uniform distribution on $[0, 0.5]$.
$h(X) = \log_2(0.5) = \log_2(1/2) = -1$ bit
This distribution is very concentrated — its typical set "volume" is 0.5 units. A narrower distribution would have even more negative entropy: uniform on $[0, 0.1]$ gives $h(X) = \log_2(0.1) \approx -3.32$ bits.
Example 3: Maximum entropy — verify Gaussian
Claim: For fixed variance $\sigma^2$, the Gaussian maximises $h(X)$.
Gaussian entropy: $h_{\text{Gauss}} = \frac{1}{2}\log(2\pi e \sigma^2)$
Uniform on $[-\sqrt{3}\sigma, \sqrt{3}\sigma]$ (also has variance $\sigma^2$):
$h_{\text{Unif}} = \log(2\sqrt{3}\sigma) = \frac{1}{2}\log(12\sigma^2)$
Difference: $h_{\text{Gauss}} - h_{\text{Unif}} = \frac{1}{2}\log(2\pi e \sigma^2) - \frac{1}{2}\log(12\sigma^2) = \frac{1}{2}\log\frac{2\pi e}{12} \approx \frac{1}{2}\ln(1.423) \approx 0.176$ nats ≈ 0.255 bits
The Gaussian has the higher entropy, consistent with it being the max-entropy distribution for fixed variance. A single comparison illustrates the claim; it does not prove it.
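The same comparison can be run for more than one competitor. The sketch below adds a Laplace distribution, which does not appear in this lesson and is included only as an extra illustration: with scale $b$ it has variance $2b^2$ and entropy $1 + \ln(2b)$ nats. All three distributions are matched to the same standard deviation.

```python
import math

def h_gaussian(sigma):
    """Gaussian with standard deviation sigma, entropy in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

def h_uniform(sigma):
    """Uniform on [-sqrt(3)*sigma, sqrt(3)*sigma], which has variance sigma^2."""
    return math.log(2.0 * math.sqrt(3.0) * sigma)

def h_laplace(sigma):
    """Laplace with scale b = sigma / sqrt(2), so its variance is sigma^2."""
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

for sigma in (0.1, 1.0, 3.0):
    g, u, l = h_gaussian(sigma), h_uniform(sigma), h_laplace(sigma)
    print(f"sigma = {sigma}: Gaussian {g:.4f}, Laplace {l:.4f}, uniform {u:.4f} nats")
```

The gaps between the three do not depend on $\sigma$, since each entropy contains the same $\ln\sigma$ term. The Gaussian exceeds the uniform by about 0.18 nats (0.25 bits).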
Quiz
Q1: What does differential entropy $h(X)$ measure?
A) The number of bits needed to encode $X$ exactly B) The log-volume of the typical set of $X$, relative to the coordinate system C) The probability that $X$ falls in its typical set D) The variance of $X$ on a logarithmic scale
Correct: B)
- If you chose A: This is incorrect. Encoding a continuous variable exactly needs unboundedly many bits: $H(X^\Delta) \to \infty$ as $\Delta \to 0$.
- If you chose B: $h(X) = -\int f(x)\log f(x)\,dx$ measures log-volume relative to the chosen coordinates, which is why it can be negative. Correct!
- If you chose C: This is incorrect. $h(X)$ is not a probability; it can be negative or greater than 1.
- If you chose D: This is incorrect. A uniform and a Gaussian with the same variance have different differential entropies, so $h(X)$ is not a function of variance alone.
Q2: Which formula defines the differential entropy of a continuous random variable with PDF $f(x)$?
A) $h(X) = \int_{-\infty}^{\infty} f(x) \log f(x) \, dx$ B) $h(X) = -\sum_x p(x) \log p(x)$ C) $h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$ D) $h(X) = -\int_{-\infty}^{\infty} \log f(x) \, dx$
Correct: C)
- If you chose A: This is incorrect. The leading minus sign is missing, so this is $-h(X)$.
- If you chose B: This is incorrect. This is the discrete entropy $H(X)$, defined with a probability mass function and a sum.
- If you chose C: The formula $h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$ is the definition used throughout this subject. Correct!
- If you chose D: This is incorrect. The weighting by $f(x)$ is missing; differential entropy is the expectation of $-\log f(X)$.
Q3: What is the purpose of choosing a maximum entropy distribution?
A) It always yields a Gaussian, whatever the constraints B) It selects, among all distributions satisfying the given constraints, the one that adds the fewest assumptions C) It minimises the variance of the distribution D) It guarantees that the differential entropy is non-negative
Correct: B)
- If you chose A: This is incorrect. The result depends on the constraint: uniform for fixed support, exponential for fixed mean on $[0, \infty)$, Gaussian for fixed variance.
- If you chose B: The maximum entropy principle picks the distribution that satisfies the constraints while adding the fewest assumptions. Correct!
- If you chose C: This is incorrect. Variance may be one of the constraints, but it is not what the principle optimises.
- If you chose D: This is incorrect. A max-entropy distribution can still have negative differential entropy, for example a Gaussian with very small variance.
Q4: Which statement about differential versus discrete entropy is TRUE?
A) Both are always non-negative B) Discrete entropy is non-negative, whereas differential entropy can take any real value C) Both are unchanged when $X$ is rescaled D) Differential entropy is the limit of $H(X^\Delta)$ as $\Delta \to 0$
Correct: B)
- If you chose A: This is incorrect. Differential entropy can be negative, for example $h(X) = -1$ bit for a uniform distribution on $[0, 0.5]$.
- If you chose B: $H(X)$ ranges over $[0, \infty)$ while $h(X)$ ranges over $(-\infty, \infty)$. Correct!
- If you chose C: This is incorrect. $h(aX) = h(X) + \log|a|$, so differential entropy is not scale-invariant.
- If you chose D: This is incorrect. $H(X^\Delta)$ diverges as $\Delta \to 0$; it is $H(X^\Delta) + \log\Delta$ that converges to $h(X)$.
Q5: Based on the worked examples, what is the differential entropy of a uniform distribution on $[0, 0.5]$?
A) $1$ bit B) $-1$ bit C) $0$ bits D) $0.5$ bits
Correct: B)
- If you chose A: This is incorrect. This is a sign error: $\log_2(0.5) = -1$, not $+1$.
- If you chose B: $h(X) = \log_2(0.5) = -1$ bit, as in Example 2. Correct!
- If you chose C: This is incorrect. $h(X) = 0$ corresponds to a uniform distribution of width 1.
- If you chose D: This is incorrect. $0.5$ is the width of the support; the entropy is its logarithm.
Q6: How does the scaling property $h(aX) = h(X) + \log|a|$ relate to the fact that differential entropy can be negative?
A) Shrinking $X$ by a factor $|a| < 1$ lowers $h$ by $\log(1/|a|)$, so a sufficiently concentrated distribution has $h < 0$ B) The two facts are unrelated C) Scaling only flips the sign of $h$ D) Negative entropy occurs only when $a$ is negative
Correct: A)
- If you chose A: Both facts express that $h(X)$ is a log-volume relative to the coordinate system, so compressing the variable can push it below zero. Correct!
- If you chose B: This is incorrect. Both facts follow from $h(X)$ being a log-volume relative to the coordinate system.
- If you chose C: This is incorrect. Scaling adds $\log|a|$ to $h$; it does not flip the sign.
- If you chose D: This is incorrect. Only $|a|$ matters, so the sign of $a$ has no effect on $h(aX)$.
Q7: Which of the following is a common pitfall when using the closed-form entropies of common distributions?
A) Using the Gaussian formula when the mean is non-zero B) Using $\log(b - a)$ for a uniform distribution with $b - a > 1$ C) Mixing logarithm bases, for example reporting a value computed with $\ln$ as bits D) Applying the formulas to distributions with finite variance
Correct: C)
- If you chose A: This is incorrect. The mean does not enter the Gaussian formula because $h(X + c) = h(X)$.
- If you chose B: This is incorrect. $h(X) = \log(b - a)$ holds for any $b > a$; it is simply positive when $b - a > 1$.
- If you chose C: The same quantity has different numerical values in different units: in Example 1, $h(X) \approx 3.047$ bits but $\approx 2.112$ nats. Correct!
- If you chose D: This is incorrect. The uniform, Gaussian, and exponential distributions all have finite variance.
Q8: When should you apply the maximum entropy principle?
A) Only when the data are already known to be Gaussian B) Only when no constraints are known at all C) When you know certain constraints (such as support, mean, or variance) and want the distribution that assumes nothing beyond them D) When you want the most concentrated distribution satisfying the constraints
Correct: C)
- If you chose A: This is incorrect. The Gaussian is the outcome for a variance constraint, not a precondition for using the principle.
- If you chose B: This is incorrect. The principle is stated for distributions satisfying given constraints, and the constraints determine the answer.
- If you chose C: Among all distributions satisfying the known constraints, the principle selects the one with maximum entropy, which adds the fewest assumptions. Correct!
- If you chose D: This is incorrect. Maximum entropy selects the least concentrated distribution compatible with the constraints, not the most concentrated.
Practice Problems
- Compute $h(X)$ for $X \sim \text{Uniform}(2, 8)$.
  Answer: $h(X) = \log_2(8 - 2) = \log_2 6 \approx 2.585$ bits
- Why is $h(X + c) = h(X)$ (translation-invariant) but $h(aX) = h(X) + \log|a|$ (not scale-invariant)?
  Answer: Translation just shifts the PDF: $f_{X+c}(y) = f_X(y-c)$. The integral $-\int f_{X+c}\log f_{X+c}$ is unchanged because the shape is the same. Scaling stretches/squeezes the PDF: $f_{aX}(y) = \frac{1}{|a|}f_X(y/a)$. The $\log|a|$ term comes from the Jacobian of the transformation: $h(aX) = h(X) + \log|a|$. Scaling changes the "volume" of the typical set.
- For a fixed mean $\mu$, which distribution maximises differential entropy on $[0, \infty)$?
  Answer: The exponential distribution. This is derived via calculus of variations: maximise $-\int_0^\infty f(x)\log f(x)dx$ subject to $\int_0^\infty f(x)dx = 1$ and $\int_0^\infty x f(x)dx = \mu$, which yields $f(x) = \frac{1}{\mu}e^{-x/\mu}$.
- A Gaussian with variance 0.01 has differential entropy equal to what?
  Answer: $h(X) = \frac{1}{2}\log_2(2\pi e \cdot 0.01) = \frac{1}{2}\log_2(0.1708) \approx \frac{1}{2}(-2.55) \approx -1.27$ bits. The differential entropy is negative because the distribution is very concentrated (standard deviation = 0.1).
- Explain why differential entropy can be negative but still meaningful.
  Answer: Differential entropy is not an absolute measure of information: it is relative to the coordinate system (like measuring height relative to an arbitrary zero). Negative $h(X)$ means the typical set has "volume" less than 1 in the chosen coordinates. The quantity that IS absolute is $h(X) - \log \Delta$, which relates to discrete entropy under quantisation. Differences of differential entropy (e.g., $h(X) - h(Y)$) are well-defined and meaningful.
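The scaling rule from the second problem can be derived with a change of variables. The following is a sketch of that derivation, assuming $a \neq 0$ and that $X$ has a density $f_X$ with finite $h(X)$; it uses $f_{aX}(y) = \frac{1}{|a|} f_X(y/a)$ and the substitution $x = y/a$, $dy = |a|\,dx$.

$$
\begin{aligned}
h(aX) &= -\int f_{aX}(y)\,\log f_{aX}(y)\,dy \\
&= -\int \frac{1}{|a|}\, f_X\!\left(\frac{y}{a}\right)\left[\log f_X\!\left(\frac{y}{a}\right) - \log|a|\right] dy \\
&= -\int f_X(x)\,\log f_X(x)\,dx \;+\; \log|a| \int f_X(x)\,dx \\
&= h(X) + \log|a|.
\end{aligned}
$$

Setting the stretch aside, a pure shift $y = x + c$ has unit Jacobian, so no extra term appears and $h(X + c) = h(X)$.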
Summary
Key takeaways:
- Differential entropy $h(X) = -\int f(x) \log f(x) dx$ extends entropy to continuous variables
- Unlike discrete entropy, $h(X)$ can be negative and is not scale-invariant
- $h(aX) = h(X) + \log|a|$ — scaling changes the coordinate system
- Maximum entropy distributions: uniform (fixed support), exponential (fixed mean on $[0, \infty)$), Gaussian (fixed variance)
- The Gaussian is the "least assumptive" distribution for given mean and variance
- $h(X) \approx H(X^\Delta) + \log\Delta$ for small quantisation $\Delta$
Pitfalls
- Interpreting negative differential entropy as "negative information": Negative $h(X)$ simply means the typical set has volume less than 1 in the chosen coordinate system. It does not mean the distribution has negative information content. A uniform distribution on $[0, 0.5]$ has $h(X) = -1$ bit because it occupies half a unit, not because it is somehow "anti-informative."
- Treating differential entropy as scale-invariant: $h(aX) = h(X) + \log|a|$, unlike discrete entropy, which does not change under relabeling of outcomes. Scaling stretches the PDF, changing the log-volume of the typical set. Comparing differential entropies of variables measured in different units or scales is meaningless. A difference such as $h(X) - h(Y)$ is unchanged only when both variables are rescaled by the same factor.
- Comparing differential entropies across different coordinate systems: Two variables with different units (e.g., metres vs centimetres) will have different differential entropies purely due to the scale term $\log|a|$. Always ensure variables are on comparable scales before interpreting differential entropy, or use relative measures like $h(X) - h(Y)$ computed in a common unit.
- Forgetting the max-entropy distribution depends on the constraint type: For fixed support, the uniform distribution maximises $h(X)$. For fixed variance, the Gaussian maximises $h(X)$. For a fixed mean on $[0, \infty)$, the exponential maximises $h(X)$. Applying the wrong max-entropy principle (e.g., assuming uniform when variance is fixed) gives the wrong distribution.
- Confusing differential entropy $h(X)$ with discrete entropy $H(X^\Delta)$: For a quantised continuous variable with bin width $\Delta$, $H(X^\Delta) \approx h(X) - \log\Delta$. As $\Delta \to 0$, $H(X^\Delta) \to \infty$ while $h(X)$ remains finite. The relationship explains why differential entropy is not an absolute measure: it captures the entropy beyond the resolution-dependent baseline.
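The unit-dependence pitfalls above can be seen in a short calculation. The sketch below uses two hypothetical Gaussian measurements whose standard deviations are made-up values chosen for illustration, and expresses each in metres and in centimetres.

```python
import math

CM_PER_M = 100.0

def h_gaussian_bits(sigma):
    """Differential entropy in bits of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * sigma**2)

# Hypothetical standard deviations of two measurements, in metres.
sigma_x_m, sigma_y_m = 0.02, 0.05

h_x_m, h_y_m = h_gaussian_bits(sigma_x_m), h_gaussian_bits(sigma_y_m)
h_x_cm = h_gaussian_bits(sigma_x_m * CM_PER_M)  # same variable, in centimetres
h_y_cm = h_gaussian_bits(sigma_y_m * CM_PER_M)

shift = h_x_cm - h_x_m  # equals log2(100): the log|a| term for a = 100
print(f"h(X): {h_x_m:+.3f} bits in m, {h_x_cm:+.3f} bits in cm (shift {shift:.3f})")
print(f"h(X) - h(Y): {h_x_m - h_y_m:+.3f} in m, {h_x_cm - h_y_cm:+.3f} in cm")
```

The individual entropies change sign between the two units, while the difference $h(X) - h(Y)$ is the same in both because each variable picks up the same $\log_2 100$ term.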
Next Steps
Next up: 13-09-rate-distortion-theory.md