### 13.8 — Differential Entropy

Phase: Information Theory

Prerequisites: 13-01-entropy, 10-07-continuous-random-variables

Learning Objectives

By the end of this subject, you will be able to:

Define differential entropy for continuous random variables
Explain why differential entropy can be negative (unlike discrete entropy)
Compute differential entropy for uniform, Gaussian, and exponential distributions
State the maximum entropy principle for continuous distributions under constraints
Relate differential entropy to discrete entropy via quantisation

Core Content

⚠️ CRITICAL: Differential vs Discrete Entropy

Differential entropy is the continuous analogue of discrete entropy:

$$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$$

where $f(x)$ is the probability density function (PDF).
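As a rough illustration of the definition, the integral can be approximated numerically. The sketch below uses a plain midpoint rule and natural logarithms (so the result is in nats); the helper names `differential_entropy` and `gaussian_pdf`, the integration limits, and the step count are illustrative choices, not part of the lesson. It compares the numeric value for a standard normal with the Gaussian closed form $\frac{1}{2}\log(2\pi e \sigma^2)$ given later in this lesson.

```python
import math

def differential_entropy(pdf, lo, hi, n=20_000):
    """Approximate h(X) = -integral of f(x) ln f(x) dx over [lo, hi], in nats."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx          # midpoint of the i-th sub-interval
        fx = pdf(x)
        if fx > 0.0:                     # convention: 0 * log 0 = 0
            total -= fx * math.log(fx) * dx
    return total

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Truncating to [-12, 12] is an assumption; the tail mass beyond it is negligible.
h_numeric = differential_entropy(gaussian_pdf, -12.0, 12.0)
h_closed = 0.5 * math.log(2.0 * math.pi * math.e)
print(h_numeric, h_closed)
```

Any density that can be evaluated pointwise can be passed as `pdf`, provided the chosen interval covers essentially all of its mass.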

CRUCIAL DIFFERENCES from discrete entropy:

| Property | Discrete $H(X)$ | Continuous $h(X)$ |
| --- | --- | --- |
| Range | $[0, \infty)$ | $(-\infty, \infty)$ |
| Can be negative? | No | Yes! |
| Invariant to scaling? | Yes (outcomes are just labels) | No: $h(aX) = h(X) + \log \lvert a \rvert$ |
| Meaning | Absolute info in bits | Relative to coordinate system |

🚩 Common Pitfall: A Gaussian with variance $\sigma^2 < 1/(2\pi e)$ has NEGATIVE differential entropy. This does NOT mean "negative information" — it means the distribution is more concentrated than a standard reference. Differential entropy is NOT an absolute measure like discrete entropy.
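The threshold in this pitfall can be checked directly. The sketch below assumes the Gaussian closed form $h(X) = \frac{1}{2}\log_2(2\pi e \sigma^2)$ stated later in this lesson; the function name `gaussian_entropy_bits` and the sample variances are illustrative.

```python
import math

def gaussian_entropy_bits(variance):
    """Closed form h(X) = 0.5 * log2(2 * pi * e * variance), in bits."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * variance)

# Variance at which the Gaussian's differential entropy crosses zero.
threshold = 1.0 / (2.0 * math.pi * math.e)

for variance in (1.0, threshold, 0.01):
    h = gaussian_entropy_bits(variance)
    print(f"variance={variance:.4f}  h={h:+.3f} bits")
```

Variances above the threshold (roughly 0.0585) give positive values, and variances below it give negative ones.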

⚠️ Why Differential Entropy Can Be Negative

Discrete entropy measures bits of uncertainty. Continuous entropy measures log-volume of the "typical set" relative to the coordinate system.

For a uniform distribution on $[0, a]$: $h(X) = \log a$

If $a < 1$, then $h(X) < 0$: the support is shorter than one unit of the coordinate system, so its log-length is negative. More generally, negative $h(X)$ means the typical set has volume less than 1.
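The value $\log a$ follows directly from the definition, since the density is the constant $f(x) = 1/a$ on $[0, a]$ and zero elsewhere:

$$h(X) = -\int_0^a \frac{1}{a} \log\frac{1}{a} \, dx = \log a \cdot \frac{1}{a}\int_0^a dx = \log a$$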

Differential Entropy of Common Distributions

Uniform distribution on $[a, b]$: $$h(X) = \log(b - a)$$

Gaussian (Normal) $N(\mu, \sigma^2)$: $$h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$$

This is the maximum entropy distribution for fixed variance. For $\sigma^2 = 1$: $h(X) = \frac{1}{2}\log(2\pi e) \approx 2.047$ bits ($\approx 1.419$ nats).

Exponential with rate $\lambda$ (natural logarithm, so the result is in nats): $$h(X) = 1 - \ln \lambda$$ In bits this is $\log_2(e/\lambda)$.
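The three closed forms can be collected into small helper functions. This is a minimal sketch that works in nats and converts to bits on request; the function names are illustrative and not part of the lesson.

```python
import math

LN2 = math.log(2.0)

def h_uniform(a, b):
    """Uniform on [a, b]: h = ln(b - a), in nats."""
    return math.log(b - a)

def h_gaussian(variance):
    """Gaussian N(mu, variance): h = 0.5 * ln(2 * pi * e * variance), in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def h_exponential(rate):
    """Exponential with the given rate: h = 1 - ln(rate), in nats."""
    return 1.0 - math.log(rate)

def nats_to_bits(h_nats):
    """Convert an entropy in nats to bits by dividing by ln 2."""
    return h_nats / LN2

print(h_gaussian(1.0), nats_to_bits(h_gaussian(1.0)))   # about 1.419 nats, 2.047 bits
print(nats_to_bits(h_uniform(0.0, 0.5)))                # -1 bit
print(h_exponential(1.0))                               # 1 nat
```

Keeping a single internal unit and converting at the edges avoids mixing a base-2 logarithm with the constant 1 in the exponential formula, which is only valid in nats.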

⚠️ Maximum Entropy Principle (Continuous)

The principle of maximum entropy: among all distributions satisfying given constraints, choose the one with maximum entropy (it adds the fewest assumptions).

| Constraint | Max-Entropy Distribution |
| --- | --- |
| Fixed support $[a, b]$ | Uniform |
| Fixed mean $\mu$ (on $[0, \infty)$) | Exponential |
| Fixed variance $\sigma^2$ | Gaussian (normal) |
| Fixed mean $\mu$ AND variance $\sigma^2$ | Gaussian |
| Fixed covariance matrix | Multivariate Gaussian |

This principle justifies the ubiquity of the Gaussian distribution — it's the LEAST ASSUMPTIVE distribution with a given mean and variance. Any other distribution implicitly assumes additional structure.
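One way to see the fixed-variance row of the table in action is to compare several distributions scaled to the same variance. The sketch below uses closed-form entropies in nats. The Laplace distribution is not covered in this lesson and is included only as an extra comparison point, an assumption of this example (its entropy is $1 + \ln(2b)$ with variance $2b^2$). All function names are illustrative.

```python
import math

def entropy_gaussian(sigma):
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

def entropy_uniform(sigma):
    # Uniform on [-sqrt(3)*sigma, sqrt(3)*sigma] has variance sigma^2.
    return math.log(2.0 * math.sqrt(3.0) * sigma)

def entropy_laplace(sigma):
    # Laplace with scale b has variance 2*b^2, so b = sigma / sqrt(2).
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

sigma = 1.0
candidates = {
    "gaussian": entropy_gaussian(sigma),
    "laplace": entropy_laplace(sigma),
    "uniform": entropy_uniform(sigma),
}
# Print from highest to lowest entropy.
for name, h in sorted(candidates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:8s} {h:.4f} nats")
```

The Gaussian comes out on top for every choice of `sigma`, because changing `sigma` shifts all three entropies by the same $\log\sigma$.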

Relationship to Discrete Entropy

Quantise a continuous variable into bins of width $\Delta$. As $\Delta \to 0$:

$$H(X^\Delta) \approx h(X) - \log \Delta$$

The discrete entropy diverges to $\infty$ (finer quantisation = more possible values), but the sum $H(X^\Delta) + \log\Delta$ converges to $h(X)$.

This explains why differential entropy is not an absolute measure — it's the "excess" entropy beyond the resolution-dependent baseline $-\log\Delta$.
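The quantisation relation can be observed numerically. The following sketch bins a standard normal into cells of width $\Delta$, computes the discrete entropy of the bin probabilities, and compares $H(X^\Delta) + \log\Delta$ with the closed-form $h(X)$, all in nats. The helper names, the truncation range of $\pm 10$, and the chosen bin widths are assumptions of this example.

```python
import math

def normal_cdf(x):
    """CDF of the standard normal via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantised_entropy(delta, span=10.0):
    """Discrete entropy (nats) of a standard normal binned at width delta on [-span, span]."""
    n_bins = int(round(2.0 * span / delta))
    H = 0.0
    for k in range(n_bins):
        left = -span + k * delta
        p = normal_cdf(left + delta) - normal_cdf(left)   # probability of bin k
        if p > 0.0:                                       # skip empty far-tail bins
            H -= p * math.log(p)
    return H

h_true = 0.5 * math.log(2.0 * math.pi * math.e)
for delta in (1.0, 0.1, 0.01):
    H = quantised_entropy(delta)
    print(f"delta={delta}: H={H:.4f}  H+log(delta)={H + math.log(delta):.4f}  h={h_true:.4f}")
```

As `delta` shrinks, `H` keeps growing while `H + log(delta)` settles near `h_true`.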

Key Terms

Differential entropy
Maximum entropy distributions

Worked Examples

Example 1: Differential entropy of Gaussian

$X \sim N(0, 4)$ (variance = 4, $\sigma = 2$).

$h(X) = \frac{1}{2}\log_2(2\pi e \cdot 4) = \frac{1}{2}\log_2(8\pi e)$

$= \frac{1}{2}(\log_2 8 + \log_2 \pi + \log_2 e) = \frac{1}{2}(3 + 1.651 + 1.443) = \frac{1}{2}(6.094) \approx 3.047$ bits

(Using nats: $h(X) = \frac{1}{2}\ln(8\pi e) \approx \frac{1}{2}(4.224) \approx 2.112$ nats)

Example 2: Negative differential entropy

Uniform distribution on $[0, 0.5]$.

$h(X) = \log_2(0.5) = \log_2(1/2) = -1$ bit

This distribution is very concentrated — its typical set "volume" is 0.5 units. A narrower distribution would have even more negative entropy: uniform on $[0, 0.1]$ gives $h(X) = \log_2(0.1) \approx -3.32$ bits.

Example 3: Maximum entropy — verify Gaussian

Claim: For fixed variance $\sigma^2$, the Gaussian maximises $h(X)$.

Gaussian entropy: $h_{\text{Gauss}} = \frac{1}{2}\log(2\pi e \sigma^2)$

Uniform on $[-\sqrt{3}\sigma, \sqrt{3}\sigma]$ (also has variance $\sigma^2$):

$h_{\text{Unif}} = \log(2\sqrt{3}\sigma) = \frac{1}{2}\log(12\sigma^2)$

Difference: $h_{\text{Gauss}} - h_{\text{Unif}} = \frac{1}{2}\log(2\pi e \sigma^2) - \frac{1}{2}\log(12\sigma^2) = \frac{1}{2}\log\frac{2\pi e}{12} \approx \frac{1}{2}\ln(1.423) \approx 0.176$ nats ≈ 0.255 bits

Gaussian has HIGHER entropy — it is indeed the max-entropy distribution for fixed variance.
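The gap in this example does not depend on $\sigma$, because $\sigma^2$ cancels inside the logarithm. A short check, with an illustrative function name `gap_nats` and arbitrary sample values of `sigma`:

```python
import math

def gap_nats(sigma):
    """h_Gauss - h_Unif in nats for a Gaussian and a uniform sharing variance sigma^2."""
    h_gauss = 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)
    h_unif = math.log(2.0 * math.sqrt(3.0) * sigma)
    return h_gauss - h_unif

for sigma in (0.1, 1.0, 5.0):
    g = gap_nats(sigma)
    print(f"sigma={sigma}: gap = {g:.4f} nats = {g / math.log(2.0):.4f} bits")
```

Each line reports the same gap, about 0.176 nats (0.255 bits).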

Quiz

Q1: What does the concept of Differential entropy primarily refer to in this subject?

A) A computational error related to Differential entropy B) The definition and application of Differential entropy C) A visual representation of Differential entropy D) A historical anecdote about Differential entropy

Correct: B)

If you chose A: This is incorrect. Differential entropy is defined as: the definition and application of differential entropy. The other options describe different aspects that are not the primary focus.
If you chose B: Differential entropy is defined as: the definition and application of differential entropy. The other options describe different aspects that are not the primary focus. Correct!
If you chose C: This is incorrect. Differential entropy is defined as: the definition and application of differential entropy. The other options describe different aspects that are not the primary focus.
If you chose D: This is incorrect. Differential entropy is defined as: the definition and application of differential entropy. The other options describe different aspects that are not the primary focus.

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question B) An unrelated formula from a different topic C) $h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$ D) A simplified version of h(X) = -\int_{-\infty}^{\in...

Correct: C)

If you chose A: This is incorrect. The formula $h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$ is central to this subject. The other options are either simplified versions or unrelated.
If you chose B: This is incorrect. The formula $h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$ is central to this subject. The other options are either simplified versions or unrelated.
If you chose C: The formula $h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$ is central to this subject. The other options are either simplified versions or unrelated. Correct!
If you chose D: This is incorrect. The formula $h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$ is central to this subject. The other options are either simplified versions or unrelated.

Q3: What is the primary purpose of maximum entropy distributions?

A) They are used only in advanced research contexts B) They single out, among all distributions satisfying given constraints, the one that adds the fewest assumptions C) They replace all other methods in this domain D) They are primarily a historical notation system

Correct: B)

If you chose A: This is incorrect. A maximum entropy distribution is the one with the largest entropy among all distributions satisfying the given constraints, so it adds the fewest assumptions. The other options misrepresent its role.
If you chose B: A maximum entropy distribution is the one with the largest entropy among all distributions satisfying the given constraints, so it adds the fewest assumptions. The other options misrepresent its role. Correct!
If you chose C: This is incorrect. A maximum entropy distribution is the one with the largest entropy among all distributions satisfying the given constraints, so it adds the fewest assumptions. The other options misrepresent its role.
If you chose D: This is incorrect. A maximum entropy distribution is the one with the largest entropy among all distributions satisfying the given constraints, so it adds the fewest assumptions. The other options misrepresent its role.

Q4: Which statement about the distinction between differential and discrete entropy is TRUE?

A) It is mentioned only as a historical footnote B) It is a fundamental concept covered in this subject C) It is an advanced topic beyond this subject's scope D) It is not related to this subject

Correct: B)

If you chose A: This is incorrect. The distinction between differential and discrete entropy is a fundamental concept, covered in this subject as part of its core content.
If you chose B: The distinction between differential and discrete entropy is a fundamental concept, covered in this subject as part of its core content. Correct!
If you chose C: This is incorrect. The distinction between differential and discrete entropy is a fundamental concept, covered in this subject as part of its core content.
If you chose D: This is incorrect. The distinction between differential and discrete entropy is a fundamental concept, covered in this subject as part of its core content.

Q5: Based on the worked examples in this subject, what is the correct result?

A) The inverse of the correct answer B) $h(X) = -\int f(x) \log f(x) \, dx$ extends entropy to continuous variables C) An unrelated numerical value D) A different result from a common mistake

Correct: B)

If you chose A: This is incorrect. The worked examples show that $h(X) = -\int f(x) \log f(x) \, dx$ extends entropy to continuous variables. The other options represent common errors.
If you chose B: The worked examples show that $h(X) = -\int f(x) \log f(x) \, dx$ extends entropy to continuous variables. The other options represent common errors. Correct!
If you chose C: This is incorrect. The worked examples show that $h(X) = -\int f(x) \log f(x) \, dx$ extends entropy to continuous variables. The other options represent common errors.
If you chose D: This is incorrect. The worked examples show that $h(X) = -\int f(x) \log f(x) \, dx$ extends entropy to continuous variables. The other options represent common errors.

Q6: How are the distinction between differential and discrete entropy and the fact that differential entropy can be negative related?

A) They are closely related concepts B) They are completely unrelated topics C) The first is the inverse of the second D) The first is a special case of the second

Correct: A)

If you chose A: Both are covered in this subject as interconnected topics: the possibility of negative values is one of the key ways differential entropy differs from discrete entropy. Correct!
If you chose B: This is incorrect. Both are covered in this subject as interconnected topics: the possibility of negative values is one of the key ways differential entropy differs from discrete entropy.
If you chose C: This is incorrect. Both are covered in this subject as interconnected topics: the possibility of negative values is one of the key ways differential entropy differs from discrete entropy.
If you chose D: This is incorrect. Both are covered in this subject as interconnected topics: the possibility of negative values is one of the key ways differential entropy differs from discrete entropy.

Q7: What is a common pitfall when working with Differential Entropy Of Common Distributions?

A) The main error with Differential Entropy Of Common Distributions is using it when it is not needed B) Differential Entropy Of Common Distributions is always computed the same way in all contexts C) A common mistake is confusing Differential Entropy Of Common Distributions with a similar concept D) Differential Entropy Of Common Distributions has no common misconceptions

Correct: C)

If you chose A: This is incorrect. Students often confuse Differential Entropy Of Common Distributions with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose B: This is incorrect. Students often confuse Differential Entropy Of Common Distributions with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose C: Students often confuse Differential Entropy Of Common Distributions with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose D: This is incorrect. Students often confuse Differential Entropy Of Common Distributions with similar-sounding or related concepts. Pay attention to the precise definitions.

Q8: When should you apply the maximum entropy principle for continuous distributions?

A) It is not practically useful B) Avoid it unless explicitly instructed C) Apply it to solve problems in this subject's domain D) Use it only in pure mathematics contexts

Correct: C)

If you chose A: This is incorrect. The maximum entropy principle is a practical tool used throughout this subject to solve relevant problems.
If you chose B: This is incorrect. The maximum entropy principle is a practical tool used throughout this subject to solve relevant problems.
If you chose C: The maximum entropy principle is a practical tool used throughout this subject to solve relevant problems. Correct!
If you chose D: This is incorrect. The maximum entropy principle is a practical tool used throughout this subject to solve relevant problems.

Practice Problems

Compute $h(X)$ for $X \sim \text{Uniform}(2, 8)$.

Answer: $h(X) = \log_2(8 - 2) = \log_2 6 \approx 2.585$ bits

Why is $h(X + c) = h(X)$ (translation-invariant) but $h(aX) = h(X) + \log|a|$ (not scale-invariant)?

Answer: Translation just shifts the PDF: $f_{X+c}(y) = f_X(y-c)$. The integral $-\int f_{X+c}\log f_{X+c}$ is unchanged because the shape is the same. Scaling stretches/squeezes the PDF: $f_{aX}(y) = \frac{1}{|a|}f_X(y/a)$. The factor $\frac{1}{|a|}$, which is the Jacobian of the transformation, sits inside the logarithm and contributes $+\log|a|$, so $h(aX) = h(X) + \log|a|$. Scaling changes the "volume" of the typical set.

For a fixed mean $\mu$, which distribution maximises differential entropy on $[0, \infty)$?

Answer: The exponential distribution. This is derived via calculus of variations: maximise $-\int_0^\infty f(x)\log f(x)dx$ subject to $\int_0^\infty f(x)dx = 1$ and $\int_0^\infty x f(x)dx = \mu$, which yields $f(x) = \frac{1}{\mu}e^{-x/\mu}$.

A Gaussian with variance 0.01 has differential entropy equal to what?

Answer: $h(X) = \frac{1}{2}\log_2(2\pi e \cdot 0.01) = \frac{1}{2}\log_2(0.1708) \approx \frac{1}{2}(-2.55) \approx -1.27$ bits. The differential entropy is negative because the distribution is very concentrated (standard deviation = 0.1).

Explain why differential entropy can be negative but still meaningful.

Answer: Differential entropy is not an absolute measure of information; it is relative to the coordinate system (like measuring height relative to an arbitrary zero). Negative $h(X)$ means the typical set has "volume" less than 1 in the chosen coordinates. The quantity that IS absolute is $h(X) - \log \Delta$, which approximates the discrete entropy $H(X^\Delta)$ of the variable quantised at resolution $\Delta$. Differences of differential entropy (e.g., $h(X) - h(Y)$) are well-defined and meaningful when $X$ and $Y$ are expressed in the same units.
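For the scaling question in the practice problems above, the identity can be worked out by a change of variables. With $Y = aX$ ($a \neq 0$), the density is $f_Y(y) = \frac{1}{|a|} f_X(y/a)$, and substituting $x = y/a$ (so $dy = |a| \, dx$ once the orientation of the limits is accounted for) gives:

$$h(aX) = -\int \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right) \left[\log f_X\!\left(\frac{y}{a}\right) - \log|a|\right] dy = -\int f_X(x) \log f_X(x) \, dx + \log|a| \int f_X(x) \, dx = h(X) + \log|a|$$

The last step uses $\int f_X(x) \, dx = 1$. Setting $a = 1$ and shifting instead of scaling leaves the density values unchanged, which is why translation adds no extra term.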

Summary

Key takeaways:

Differential entropy $h(X) = -\int f(x) \log f(x) dx$ extends entropy to continuous variables
Unlike discrete entropy, $h(X)$ can be negative and is not scale-invariant
$h(aX) = h(X) + \log|a|$ — scaling changes the coordinate system
Maximum entropy distributions: uniform (fixed support), exponential (fixed mean on $[0, \infty)$), Gaussian (fixed variance)
The Gaussian is the "least assumptive" distribution for given mean and variance
$h(X) \approx H(X^\Delta) + \log\Delta$ for small quantisation $\Delta$

Pitfalls

Interpreting negative differential entropy as "negative information": Negative $h(X)$ simply means the typical set has volume less than 1 in the chosen coordinate system — it does not mean the distribution has negative information content. A uniform distribution on $[0, 0.5]$ has $h(X) = -1$ bit because it occupies half a unit, not because it is somehow "anti-informative."
Treating differential entropy as scale-invariant: $h(aX) = h(X) + \log|a|$, unlike discrete entropy, which does not change under relabeling of outcomes. Scaling stretches the PDF, changing the log-volume of the typical set. Comparing differential entropies across variables with different units or scales is meaningless; only differences such as $h(X) - h(Y)$ are unchanged when both variables are rescaled by the same factor.
Comparing differential entropies across different coordinate systems: Two variables with different units (e.g., metres vs centimetres) will have different differential entropies purely due to the scale term $\log|a|$. Always ensure variables are on comparable scales before interpreting differential entropy, or use relative measures like $h(X) - h(Y)$.
Forgetting the max-entropy distribution depends on the constraint type: For fixed support, the uniform distribution maximises $h(X)$. For fixed variance, the Gaussian maximises $h(X)$. For a fixed mean on the nonnegative half-line $[0, \infty)$, the exponential maximises $h(X)$. Applying the wrong max-entropy principle (e.g., assuming uniform when variance is fixed) gives the wrong distribution.
Confusing differential entropy $h(X)$ with discrete entropy $H(X^\Delta)$: For a quantised continuous variable with bin width $\Delta$, $H(X^\Delta) \approx h(X) - \log\Delta$. As $\Delta \to 0$, $H(X^\Delta) \to \infty$ while $h(X)$ remains finite. The relationship explains why differential entropy is not an absolute measure — it captures the entropy beyond the resolution-dependent baseline.

Next Steps

Next up: 13-09-rate-distortion-theory.md
