Math graphic
📐 Concept diagram

### 13.9 — Rate-Distortion Theory (Conceptual)

Phase: Information Theory Prerequisites: 13-03-mutual-information, 13-08-differential-entropy, 13-07-channel-capacity

Learning Objectives

By the end of this subject, you will be able to:

  1. Define the rate-distortion function $R(D)$ and interpret it
  2. Explain the fundamental trade-off between compression rate and distortion
  3. Describe the information bottleneck principle
  4. Connect rate-distortion theory to lossy compression (JPEG, MP3)
  5. Recognise its role in representation learning and autoencoders

Core Content

⚠️ CRITICAL: Lossy Compression and the Rate-Distortion Trade-off

Lossless compression has a hard lower bound: you cannot compress below $H(X)$ bits per symbol on average (source coding theorem).

Lossy compression relaxes this: you accept SOME distortion in exchange for a LOWER bit rate. Rate-distortion theory characterises the optimal trade-off.

The Rate-Distortion Function

Given a source $X$ and a distortion measure $d(x, \hat{x})$ (e.g., squared error, Hamming distance), the rate-distortion function $R(D)$ is the minimum rate needed to represent $X$ with average distortion at most $D$:

$$R(D) = \min_{P(\hat{X} \mid X): E[d(X, \hat{X})] \leq D} I(X; \hat{X})$$

Interpretation: Among all compression schemes with average distortion $\leq D$, the minimum number of bits needed per symbol is $R(D)$.

Properties: - $R(0) = H(X)$: zero distortion requires lossless compression (at least entropy) - $R(D_{\max}) = 0$: at maximum acceptable distortion, you can just output a constant - $R(D)$ is convex and non-increasing in $D$ - Steep slope near $D = 0$: the first few bits of distortion reduction cost the most

🚩 Common Pitfall: Rate-distortion theory gives the THEORETICAL limit. Practical codecs (JPEG, MP3, H.264) approach this limit but don't achieve it — there's always a coding gap.

Example: Gaussian Source with Squared Error

For a memoryless Gaussian source $X \sim N(0, \sigma^2)$ and mean squared error distortion:

$$R(D) = \frac{1}{2}\log_2\frac{\sigma^2}{D}, \quad \text{for } 0 \leq D \leq \sigma^2$$

Or equivalently: $D(R) = \sigma^2 \cdot 2^{-2R}$

This means each additional bit halves the distortion (divides it by 4 in squared-error terms) — a 6 dB improvement per bit.

Practical Lossy Compression Systems

Format Source type Distortion measure Rate control
JPEG Images Perceptual (DCT quantisation) Quality factor
MP3 Audio Psychoacoustic masking Bitrate (kbps)
H.264/H.265 Video Perceptual + motion Bitrate, CRF
VQ-VAE Latent vectors MSE of reconstruction Codebook size

All these systems embody the rate-distortion trade-off: you choose a point on the $R$-$D$ curve by setting quality/bitrate.

The Information Bottleneck Principle

The information bottleneck (Tishby et al.) is rate-distortion theory applied to representation learning:

Given input $X$ and a target $Y$ we want to predict, find a compressed representation $Z$ that: 1. Compresses $X$: minimises $I(X; Z)$ (keeps rate low) 2. Preserves information about $Y$: maximises $I(Z; Y)$ (keeps distortion low)

$$\min_{P(Z \mid X)} I(X; Z) - \beta \cdot I(Z; Y)$$

Where $\beta$ controls the compression-prediction trade-off.

Connection to deep learning: The layers of a neural network form a "Markov chain" $X \to Z_1 \to Z_2 \to \cdots \to Y$. Each layer compresses and transforms. The information bottleneck theory suggests that networks learn by first "fitting" (memorising) and then "compressing" (generalising) — the two phases of deep learning.

Connection to Autoencoders and VAEs

Autoencoders: Encoder compresses $X \to Z$ (bottleneck), decoder reconstructs $Z \to \hat{X}$. The bottleneck size (latent dimension) directly controls the rate-distortion trade-off.

Variational Autoencoders (VAEs): Add a regularisation term that limits $I(X; Z)$ (KL divergence between encoder and prior), implementing the information bottleneck explicitly.

Rate-distortion perspective on VAEs: $$\mathcal{L}{\text{VAE}} = \underbrace{E{q(Z \mid X)}[-\log p(X \mid Z)]}{\text{Distortion}} + \beta \cdot \underbrace{D{KL}(q(Z \mid X) \parallel p(Z))}_{\text{Rate}}$$

This is EXACTLY the rate-distortion Lagrangian! $\beta$-VAE makes this explicit by weighting the KL term.



Key Terms

Worked Examples

Example 1: Gaussian rate-distortion

For a Gaussian source with $\sigma^2 = 16$, what is the minimum rate for distortion $D = 4$?

$R(4) = \frac{1}{2}\log_2\frac{16}{4} = \frac{1}{2}\log_2 4 = \frac{1}{2} \cdot 2 = 1$ bit per sample.

At $D = 1$: $R(1) = \frac{1}{2}\log_2 16 = \frac{1}{2} \cdot 4 = 2$ bits per sample.

Each factor-of-4 reduction in distortion costs 1 additional bit.

Example 2: Binary source with Hamming distortion

$X \sim \text{Bernoulli}(0.5)$, Hamming distortion: $d(x, \hat{x}) = 1$ if $x \neq \hat{x}$, 0 otherwise.

$D$ is the bit error probability.

$R(D) = 1 - H_b(D) = H_b(0.5) - H_b(D)$, for $0 \leq D \leq 0.5$

Note the symmetry with BSC capacity: $C = 1 - H_b(p)$. Rate-distortion and channel coding are dual problems!

Example 3: Information bottleneck in practice

In a classification network, the penultimate layer forms a bottleneck representation $Z$. Too small a $Z$ (low rate) → underfitting (high distortion). Too large a $Z$ (high rate) → overfitting (memorises noise instead of compressing). The optimal $Z$ size balances compression of $X$ with preservation of information about $Y$.

This is also why dropout and weight decay work — they implicitly limit $I(X; Z)$, forcing the network to learn compressed, generalisable representations.



Quiz

Q1: What does the concept of Lossless compression primarily refer to in this subject?

A) A computational error related to Lossless compression B) A visual representation of Lossless compression C) A historical anecdote about Lossless compression D) The definition and application of Lossless compression

Correct: D)

Q2: Which of the following is the key formula discussed in this subject?

A) d(x, \hat{x}) B) A simplified version of d(x, \hat{x})... C) The inverse operation of the formula in question D) An unrelated formula from a different topic

Correct: A)

Q3: What is the primary purpose of Lossy compression?

A) It replaces all other methods in this domain B) It is primarily a historical notation system C) It is used only in advanced research contexts D) It is used to lossy compression in mathematical analysis

Correct: D)

Q4: Which statement about Compresses is TRUE?

A) Compresses is mentioned only as a historical footnote B) Compresses is a fundamental concept covered in this subject C) Compresses is an advanced topic beyond this subject's scope D) Compresses is not related to this subject

Correct: B)

Q5: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) The inverse of the correct answer C) A different result from a common mistake D) $R(D) \to H(X)$ — we approach lossless compression

Correct: D)

Q6: How are Compresses and Preserves information about related?

A) Compresses and Preserves information about are completely unrelated topics B) Compresses is the inverse of Preserves information about C) Compresses and Preserves information about are closely related concepts D) Compresses is a special case of Preserves information about

Correct: C)

Q7: What is a common pitfall when working with Rate-distortion function?

A) A common mistake is confusing Rate-distortion function with a similar concept B) The main error with Rate-distortion function is using it when it is not needed C) Rate-distortion function has no common misconceptions D) Rate-distortion function is always computed the same way in all contexts

Correct: A)

Q8: When should you apply ⚠️ Critical: Lossy Compression And The Rate-Distortion Trade-Off?

A) ⚠️ Critical: Lossy Compression And The Rate-Distortion Trade-Off is not practically useful B) Apply ⚠️ Critical: Lossy Compression And The Rate-Distortion Trade-Off to solve problems in this subject's domain C) Avoid ⚠️ Critical: Lossy Compression And The Rate-Distortion Trade-Off unless explicitly instructed D) Use ⚠️ Critical: Lossy Compression And The Rate-Distortion Trade-Off only in pure mathematics contexts

Correct: B)

Practice Problems

  1. What happens to $R(D)$ as $D \to 0$? What about as $D \to D_{\max}$?

    Click for answer $D \to 0$: $R(D) \to H(X)$ — we approach lossless compression, requiring at least the entropy. $D \to D_{\max}$: $R(D) \to 0$ — if we accept maximal distortion, we can just output a constant guess with zero bits.

  2. For a Gaussian source ($\sigma^2 = 25$), how many bits per sample to achieve distortion $D = 1$?

    Click for answer $R(1) = \frac{1}{2}\log_2\frac{25}{1} = \frac{1}{2}\log_2 25 \approx \frac{1}{2}(4.644) = 2.322$ bits per sample

  3. Explain the symmetry between channel capacity and rate-distortion.

    Click for answer Channel capacity: $C = \max I(X; Y)$ — max information through a noisy channel. Rate-distortion: $R(D) = \min I(X; \hat{X})$ — min information for given distortion. For the BSC and binary source: $C(p) = 1 - H_b(p)$ and $R(D) = 1 - H_b(D)$ are identical functions. This is the information-theoretic duality: the problem of communicating over a noisy channel is mathematically dual to the problem of compressing with distortion.

  4. How does a VAE's $\beta$ parameter control the rate-distortion trade-off?

    Click for answer $\mathcal{L} = \text{Reconstruction Error} + \beta \cdot D_{KL}(q(Z \mid X) \parallel p(Z))$ High $\beta$: penalises the KL term heavily → forces $Z$ to be close to the prior → limits $I(X; Z)$ (low rate) → blurrier reconstructions (high distortion). Low $\beta$: weak regularisation → $Z$ can encode lots of information about $X$ (high rate) → sharper reconstructions (low distortion), but risk of overfitting. This is the rate-distortion Lagrangian with $\beta$ controlling the trade-off.

  5. Why does JPEG work well for photographs but poorly for line drawings?

    Click for answer JPEG's distortion measure (DCT coefficient quantisation) aligns with human perception of natural images: we're insensitive to high-frequency errors in smooth regions. Line drawings have sharp edges (high-frequency content) that JPEG's quantisation distorts visibly (ringing artefacts). This is a mismatch between the codec's implicit distortion measure and perceptual quality for that content type. PNG (lossless) or GIF (palette-based) are better for graphics.


Summary

Key takeaways:


Pitfalls



Next Steps

🎉 Congratulations! You've completed Phase 13 (Information Theory). Next: 14-01-optimization-fundamentals.md