
### 13.8 — Differential Entropy

Phase: Information Theory

Prerequisites: 13-01-entropy, 10-07-continuous-random-variables

Learning Objectives

By the end of this subject, you will be able to:

  1. Define differential entropy for continuous random variables
  2. Explain why differential entropy can be negative (unlike discrete entropy)
  3. Compute differential entropy for uniform, Gaussian, and exponential distributions
  4. State the maximum entropy principle for continuous distributions under constraints
  5. Relate differential entropy to discrete entropy via quantisation

Core Content

⚠️ CRITICAL: Differential vs Discrete Entropy

Differential entropy is the continuous analogue of discrete entropy:

$$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$$

where $f(x)$ is the probability density function (PDF).
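The integral can be checked numerically. The following is a minimal sketch, assuming natural logarithms (so the result is in nats) and a density whose mass lies almost entirely inside a finite interval; the names `differential_entropy` and `gaussian_pdf` are introduced here for illustration only.

```python
import math

def differential_entropy(pdf, lo, hi, n=100_000):
    """Approximate h(X) = -∫ f(x) ln f(x) dx (in nats) with the midpoint rule.

    `pdf` is a density function; [lo, hi] should cover almost all of its mass.
    """
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        fx = pdf(x)
        if fx > 0.0:  # convention: 0 * log 0 = 0
            total -= fx * math.log(fx) * dx
    return total

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Compare the numerical integral with the closed form 0.5 * ln(2*pi*e) for N(0, 1).
h_numeric = differential_entropy(gaussian_pdf, -12.0, 12.0)
h_closed = 0.5 * math.log(2.0 * math.pi * math.e)
print(h_numeric, h_closed)
```

Dividing the result by `math.log(2)` converts it from nats to bits.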

CRUCIAL DIFFERENCES from discrete entropy:

| Property | Discrete $H(X)$ | Continuous $h(X)$ |
|---|---|---|
| Range | $[0, \infty)$ | $(-\infty, \infty)$ |
| Can be negative? | No | Yes! |
| Invariant to scaling? | Yes ($X$ categorical) | No: $h(aX) = h(X) + \log\lvert a\rvert$ |
| Meaning | Absolute info in bits | Relative to coordinate system |
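The scaling behaviour of $h$ follows from a change of variables. A short derivation, assuming $a \neq 0$ and that $X$ has density $f_X$:

$$
\begin{aligned}
f_{aX}(y) &= \frac{1}{|a|}\, f_X\!\left(\frac{y}{a}\right), \\
h(aX) &= -\int f_{aX}(y)\,\log f_{aX}(y)\,dy \\
&= -\int f_X(x)\,\log\!\left(\frac{1}{|a|}\, f_X(x)\right) dx \qquad (x = y/a,\ dy = |a|\,dx) \\
&= -\int f_X(x)\,\log f_X(x)\,dx + \log|a| \int f_X(x)\,dx \\
&= h(X) + \log|a|.
\end{aligned}
$$

The last step uses $\int f_X(x)\,dx = 1$. A pure shift $X + c$ leaves the density values unchanged, which is why no extra term appears for translation.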

🚩 Common Pitfall: A Gaussian with variance $\sigma^2 < 1/(2\pi e)$ has NEGATIVE differential entropy. This does NOT mean "negative information" — it means the distribution is more concentrated than a standard reference. Differential entropy is NOT an absolute measure like discrete entropy.

⚠️ Why Differential Entropy Can Be Negative

Discrete entropy measures bits of uncertainty. Continuous entropy measures log-volume of the "typical set" relative to the coordinate system.

For a uniform distribution on $[0, a]$: $h(X) = \log a$

If $a < 1$, then $h(X) < 0$: the support is shorter than one unit of the coordinate system, so the log-volume of the typical set is negative.
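One way to read the log-volume interpretation is to exponentiate $h(X)$: the result is the length of the uniform interval with the same differential entropy. The sketch below assumes entropies in nats; `effective_width` is a name introduced here for illustration.

```python
import math

def effective_width(h_nats):
    """Length of a uniform interval that has the same differential entropy."""
    return math.exp(h_nats)

# Uniform on [0, a]: h = ln(a), so the effective width is exactly a.
for a in (2.0, 1.0, 0.5):
    h = math.log(a)
    print(f"a={a}: h={h:+.4f} nats, effective width={effective_width(h):.4f}")

# Standard Gaussian: h = 0.5 * ln(2*pi*e), so the width is sqrt(2*pi*e).
h_gauss = 0.5 * math.log(2.0 * math.pi * math.e)
print(f"N(0,1): h={h_gauss:+.4f} nats, effective width={effective_width(h_gauss):.4f}")
```

An effective width below 1 corresponds exactly to negative differential entropy.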

Differential Entropy of Common Distributions

Uniform distribution on $[a, b]$: $$h(X) = \log(b - a)$$

Gaussian (Normal) $N(\mu, \sigma^2)$: $$h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$$

This is the maximum entropy distribution for fixed variance. For $\sigma^2 = 1$: $h(X) = \frac{1}{2}\log(2\pi e) \approx 1.419$ nats ($\approx 2.047$ bits).

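Exponential with rate $\lambda$: $$h(X) = 1 - \ln \lambda \quad \text{(nats)}$$

In bits this is $\log_2(e/\lambda)$; the constant 1 holds only for the natural logarithm.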
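The three closed forms are easy to tabulate. A minimal sketch, assuming natural logarithms internally and an explicit conversion to bits; the function names are illustrative and not part of any library.

```python
import math

def h_uniform(a, b):
    """Differential entropy of Uniform(a, b) in nats: ln(b - a)."""
    return math.log(b - a)

def h_gaussian(sigma2):
    """Differential entropy of N(mu, sigma2) in nats: 0.5 * ln(2*pi*e*sigma2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma2)

def h_exponential(rate):
    """Differential entropy of Exponential(rate) in nats: 1 - ln(rate)."""
    return 1.0 - math.log(rate)

def nats_to_bits(h_nats):
    """Convert an entropy from nats to bits."""
    return h_nats / math.log(2.0)

cases = [
    ("Uniform(0, 0.5)", h_uniform(0.0, 0.5)),
    ("N(0, 1)", h_gaussian(1.0)),
    ("N(0, 4)", h_gaussian(4.0)),
    ("Exponential(2)", h_exponential(2.0)),
]
for name, h in cases:
    print(f"{name:16s} {h:8.4f} nats {nats_to_bits(h):8.4f} bits")
```

None of the Gaussian values depend on the mean, which reflects translation invariance.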

⚠️ Maximum Entropy Principle (Continuous)

The principle of maximum entropy: among all distributions satisfying given constraints, choose the one with maximum entropy (it adds the fewest assumptions).

| Constraint | Max-Entropy Distribution |
|---|---|
| Fixed support $[a, b]$ | Uniform |
| Fixed mean $\mu$ (on $[0, \infty)$) | Exponential |
| Fixed variance $\sigma^2$ | Gaussian (normal) |
| Fixed mean $\mu$ AND variance $\sigma^2$ | Gaussian |
| Fixed covariance matrix | Multivariate Gaussian |

This principle justifies the ubiquity of the Gaussian distribution — it's the LEAST ASSUMPTIVE distribution with a given mean and variance. Any other distribution implicitly assumes additional structure.
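The fixed-variance claim can be spot-checked by comparing a few distributions with the same variance. This is a sketch, not a proof: it covers only three families, and the Laplace distribution (variance $2b^2$, entropy $1 + \ln 2b$ nats) is added here purely for illustration.

```python
import math

def h_gaussian(sigma):
    """Entropy (nats) of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def h_uniform_same_variance(sigma):
    """Entropy (nats) of Uniform[-sqrt(3)*sigma, sqrt(3)*sigma], variance sigma^2."""
    return math.log(2.0 * math.sqrt(3.0) * sigma)

def h_laplace_same_variance(sigma):
    """Entropy (nats) of a Laplace distribution with scale b = sigma / sqrt(2)."""
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

sigma = 1.0
entropies = {
    "gaussian": h_gaussian(sigma),
    "laplace": h_laplace_same_variance(sigma),
    "uniform": h_uniform_same_variance(sigma),
}
# Print from highest to lowest entropy.
for name, h in sorted(entropies.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} {h:.4f} nats")
```

Changing `sigma` shifts all three entropies by the same $\ln\sigma$, so the ordering does not depend on the variance chosen.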

Relationship to Discrete Entropy

Quantise a continuous variable into bins of width $\Delta$. As $\Delta \to 0$:

$$H(X^\Delta) \approx h(X) - \log \Delta$$

The discrete entropy diverges to $\infty$ as $\Delta \to 0$ (finer quantisation means more possible values), but the corrected quantity $H(X^\Delta) + \log\Delta$ converges to $h(X)$.

This explains why differential entropy is not an absolute measure — it's the "excess" entropy beyond the resolution-dependent baseline $-\log\Delta$.
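The quantisation relation can be observed directly. The sketch below bins a standard Gaussian into cells of width $\Delta$, computes the discrete entropy of the bin probabilities, and compares $H(X^\Delta) + \ln\Delta$ with $h(X)$. It assumes nats, a grid aligned at $-10\sigma$, and truncation at $\pm 10\sigma$; the helper names are illustrative.

```python
import math

def gaussian_cdf(x, sigma=1.0):
    """CDF of N(0, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def quantised_entropy(delta, sigma=1.0, span=10.0):
    """Discrete entropy (nats) of N(0, sigma^2) binned into width-delta cells
    covering [-span*sigma, span*sigma]."""
    n_bins = int(round(2.0 * span * sigma / delta))
    lo = -span * sigma
    H = 0.0
    for i in range(n_bins):
        left = lo + i * delta
        p = gaussian_cdf(left + delta, sigma) - gaussian_cdf(left, sigma)
        if p > 0.0:  # skip empty bins (0 * log 0 = 0)
            H -= p * math.log(p)
    return H

h = 0.5 * math.log(2.0 * math.pi * math.e)  # h(X) for sigma = 1, in nats
for delta in (1.0, 0.5, 0.1, 0.01):
    H = quantised_entropy(delta)
    print(f"delta={delta:<5} H={H:.4f}  H+ln(delta)={H + math.log(delta):.4f}  h={h:.4f}")
```

As `delta` shrinks, `H` keeps growing while `H + ln(delta)` settles near `h`.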



Key Terms

Worked Examples

Example 1: Differential entropy of Gaussian

$X \sim N(0, 4)$ (variance = 4, $\sigma = 2$).

$h(X) = \frac{1}{2}\log_2(2\pi e \cdot 4) = \frac{1}{2}\log_2(8\pi e)$

$= \frac{1}{2}(\log_2 8 + \log_2 \pi + \log_2 e) = \frac{1}{2}(3 + 1.651 + 1.443) = \frac{1}{2}(6.094) \approx 3.047$ bits

(Using nats: $h(X) = \frac{1}{2}\ln(8\pi e) \approx \frac{1}{2}(4.224) \approx 2.112$ nats)

Example 2: Negative differential entropy

Uniform distribution on $[0, 0.5]$.

$h(X) = \log_2(0.5) = \log_2(1/2) = -1$ bit

This distribution is very concentrated — its typical set "volume" is 0.5 units. A narrower distribution would have even more negative entropy: uniform on $[0, 0.1]$ gives $h(X) = \log_2(0.1) \approx -3.32$ bits.

Example 3: Maximum entropy — verify Gaussian

Claim: For fixed variance $\sigma^2$, the Gaussian maximises $h(X)$.

Gaussian entropy: $h_{\text{Gauss}} = \frac{1}{2}\log(2\pi e \sigma^2)$

Uniform on $[-\sqrt{3}\sigma, \sqrt{3}\sigma]$ (also has variance $\sigma^2$):

$h_{\text{Unif}} = \log(2\sqrt{3}\sigma) = \frac{1}{2}\log(12\sigma^2)$

Difference: $h_{\text{Gauss}} - h_{\text{Unif}} = \frac{1}{2}\log(2\pi e \sigma^2) - \frac{1}{2}\log(12\sigma^2) = \frac{1}{2}\log\frac{2\pi e}{12} \approx \frac{1}{2}\ln(1.423) \approx 0.176$ nats ≈ 0.255 bits

Gaussian has HIGHER entropy — it is indeed the max-entropy distribution for fixed variance.



Quiz

Q1: What does the concept of Differential entropy primarily refer to in this subject?

A) A computational error related to Differential entropy
B) The definition and application of Differential entropy
C) A visual representation of Differential entropy
D) A historical anecdote about Differential entropy

Correct: B)

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question
B) An unrelated formula from a different topic
C) $h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$
D) A simplified version of h(X) = -\int_{-\infty}^{\in...

Correct: C)

Q3: What is the primary purpose of the maximum entropy principle?

A) It is used only in advanced research contexts
B) It selects, among all distributions satisfying given constraints, the one that adds the fewest assumptions
C) It replaces all other methods in this domain
D) It is primarily a historical notation system

Correct: B)

Q4: Which statement about Differential vs Discrete Entropy is TRUE?

A) Differential vs Discrete Entropy is mentioned only as a historical footnote
B) Differential vs Discrete Entropy is a fundamental concept covered in this subject
C) Differential vs Discrete Entropy is an advanced topic beyond this subject's scope
D) Differential vs Discrete Entropy is not related to this subject

Correct: B)

Q5: Based on the worked examples in this subject, what is the correct result?

A) The inverse of the correct answer
B) $-\int f(x) \log f(x) \, dx$ extends entropy to c
C) An unrelated numerical value
D) A different result from a common mistake

Correct: B)

Q6: How are Differential vs Discrete Entropy and Why Differential Entropy Can Be Negative related?

A) Differential vs Discrete Entropy and Why Differential Entropy Can Be Negative are closely related concepts
B) Differential vs Discrete Entropy and Why Differential Entropy Can Be Negative are completely unrelated topics
C) Differential vs Discrete Entropy is the inverse of Why Differential Entropy Can Be Negative
D) Differential vs Discrete Entropy is a special case of Why Differential Entropy Can Be Negative

Correct: A)

Q7: What is a common pitfall when working with Differential Entropy Of Common Distributions?

A) The main error with Differential Entropy Of Common Distributions is using it when it is not needed
B) Differential Entropy Of Common Distributions is always computed the same way in all contexts
C) A common mistake is confusing Differential Entropy Of Common Distributions with a similar concept
D) Differential Entropy Of Common Distributions has no common misconceptions

Correct: C)

Q8: When should you apply the Maximum Entropy Principle (Continuous)?

A) The Maximum Entropy Principle (Continuous) is not practically useful
B) Avoid the Maximum Entropy Principle (Continuous) unless explicitly instructed
C) Apply the Maximum Entropy Principle (Continuous) to solve problems in this subject's domain
D) Use the Maximum Entropy Principle (Continuous) only in pure mathematics contexts

Correct: C)

Practice Problems

  1. Compute $h(X)$ for $X \sim \text{Uniform}(2, 8)$.

    Answer: $h(X) = \log_2(8 - 2) = \log_2 6 \approx 2.585$ bits

  2. Why is $h(X + c) = h(X)$ (translation-invariant) but $h(aX) = h(X) + \log|a|$ (not scale-invariant)?

    Answer: Translation just shifts the PDF: $f_{X+c}(y) = f_X(y-c)$. The integral $-\int f_{X+c}\log f_{X+c}$ is unchanged because the shape is the same. Scaling stretches or squeezes the PDF: $f_{aX}(y) = \frac{1}{|a|}f_X(y/a)$. The $\log|a|$ term comes from the Jacobian of the transformation: $h(aX) = h(X) + \log|a|$. Scaling changes the "volume" of the typical set.

  3. For a fixed mean $\mu$, which distribution maximises differential entropy on $[0, \infty)$?

    Answer: The exponential distribution. This is derived via calculus of variations: maximise $-\int_0^\infty f(x)\log f(x)\,dx$ subject to $\int_0^\infty f(x)\,dx = 1$ and $\int_0^\infty x f(x)\,dx = \mu$, which yields $f(x) = \frac{1}{\mu}e^{-x/\mu}$.

  4. A Gaussian with variance 0.01 has differential entropy equal to what?

    Answer: $h(X) = \frac{1}{2}\log_2(2\pi e \cdot 0.01) = \frac{1}{2}\log_2(0.1708) \approx \frac{1}{2}(-2.55) \approx -1.27$ bits. The differential entropy is negative because the distribution is very concentrated (standard deviation = 0.1).

  5. Explain why differential entropy can be negative but still meaningful.

    Answer: Differential entropy is not an absolute measure of information; it is relative to the coordinate system (like measuring height relative to an arbitrary zero). Negative $h(X)$ means the typical set has "volume" less than 1 in the chosen coordinates. The quantity that IS absolute is $h(X) - \log \Delta$, which relates to discrete entropy under quantisation. Differences of differential entropy (e.g., $h(X) - h(Y)$) are well-defined and meaningful.
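For problem 3, the calculus-of-variations step can be sketched as follows. This is an outline under the usual regularity assumptions, using natural logarithms and Lagrange multipliers $\lambda_0, \lambda_1$ (symbols introduced here):

$$
\begin{aligned}
J[f] &= -\int_0^\infty f \ln f \,dx + \lambda_0\!\left(\int_0^\infty f\,dx - 1\right) + \lambda_1\!\left(\int_0^\infty x f\,dx - \mu\right), \\
\frac{\delta J}{\delta f} &= -\ln f(x) - 1 + \lambda_0 + \lambda_1 x = 0
\;\;\Longrightarrow\;\; f(x) = C e^{\lambda_1 x}, \quad C = e^{\lambda_0 - 1}, \\
\int_0^\infty C e^{\lambda_1 x}\,dx &= \frac{C}{-\lambda_1} = 1 \quad (\lambda_1 < 0), \qquad
\int_0^\infty x\, C e^{\lambda_1 x}\,dx = \frac{C}{\lambda_1^2} = \mu, \\
\lambda_1 &= -\frac{1}{\mu}, \quad C = \frac{1}{\mu}
\;\;\Longrightarrow\;\; f(x) = \frac{1}{\mu} e^{-x/\mu}.
\end{aligned}
$$

Substituting back gives $h(X) = 1 + \ln\mu$ nats, which matches $1 - \ln\lambda$ with rate $\lambda = 1/\mu$.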


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 13-09-rate-distortion-theory.md