Math graphic
📐 Concept diagram

### 13.6 — Source Coding Theorem

Phase: Information Theory Prerequisites: 13-01-entropy, 13-04-kl-divergence

Learning Objectives

By the end of this subject, you will be able to:

  1. State Shannon's source coding theorem and its operational significance
  2. Define prefix codes, uniquely decodable codes, and the Kraft inequality
  3. Prove that the expected code length $L \geq H(X)$ (entropy bound)
  4. Construct a Huffman code for a given distribution
  5. Explain why $H(X)$ is the fundamental compression limit

Core Content

⚠️ CRITICAL: Shannon's Source Coding Theorem

Theorem (Shannon, 1948): For any discrete memoryless source with entropy $H(X)$, there exists a uniquely decodable code with expected length $L$ satisfying:

$$H(X) \leq L < H(X) + 1$$

What this means operationally: - You CANNOT compress below $H(X)$ bits per symbol on average (no matter how clever your code) - You CAN achieve average length arbitrarily close to $H(X)$ by coding blocks of symbols together - For a block of $n$ symbols: $H(X) \leq \frac{L_n}{n} < H(X) + \frac{1}{n}$

As $n \to \infty$, the per-symbol rate approaches the entropy.

Entropy is the fundamental compression limit. No lossless compression algorithm can beat it in expectation.

Types of Codes

Uniquely decodable: Any encoded string corresponds to exactly one source message. No ambiguity.

Prefix-free (instantaneous) code: No codeword is a prefix of another. Can be decoded on-the-fly without lookahead. Example: {0, 10, 110, 111} is prefix-free; {0, 01, 011} is NOT (0 is prefix of 01, 01 is prefix of 011).

All prefix codes are uniquely decodable, but not vice versa. However, any uniquely decodable code can be converted to a prefix code with the same lengths.

Kraft Inequality

For a prefix code with codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$ over a binary alphabet:

$$\sum_{i=1}^{m} 2^{-\ell_i} \leq 1$$

Conversely, if lengths satisfy Kraft, a prefix code with those lengths EXISTS.

This is the fundamental constraint on code design — you cannot make all codewords short; the budget is limited to 1.

⚠️ CRITICAL: Entropy Bound Proof

For any uniquely decodable code with lengths ${\ell_i}$:

$$L - H(X) = \sum p_i \ell_i - \sum p_i \log_2 \frac{1}{p_i}$$

$$= \sum p_i \log_2 \frac{p_i}{2^{-\ell_i}} = \sum p_i \log_2 \frac{p_i}{q_i} = D_{KL}(P \parallel Q) \geq 0$$

where $q_i = 2^{-\ell_i} / \sum 2^{-\ell_j}$ (normalised from Kraft). The KL divergence is non-negative, so $L \geq H(X)$, with equality iff $p_i = q_i$, i.e., when $\ell_i = -\log_2 p_i$ (ideal codeword lengths).

Huffman Coding

Huffman's algorithm constructs the optimal prefix code for a given distribution:

  1. Take the two least probable symbols
  2. Combine them into a meta-symbol with probability = sum
  3. Repeat until one symbol remains
  4. Assign 0/1 at each merge to build the code tree

Properties: - Optimal among all prefix codes (minimum expected length) - $H(X) \leq L_{\text{Huffman}} < H(X) + 1$ - Can be far from optimal if one symbol has probability close to 1

🚩 Common Pitfall: Huffman coding is optimal for symbol-by-symbol encoding a KNOWN distribution. It is NOT optimal for unknown distributions, non-i.i.d. sources, or block coding.



Key Terms

Worked Examples

Example 1: Code design and Kraft inequality

Symbols A, B, C, D with codewords: A→0, B→10, C→110, D→111.

Check Kraft: $2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 0.5 + 0.25 + 0.125 + 0.125 = 1.0$ ✓

This is a complete prefix code (Kraft sum = 1). Any binary string can be uniquely decoded.

Decode "0101101110": 0 → A 10 → B 110 → C 111 → D 0 → A → "ABCDA"

Example 2: Huffman code construction

Distribution: P(A)=0.4, P(B)=0.3, P(C)=0.2, P(D)=0.1

Merge D(0.1) and C(0.2) → CD(0.3), assign 0→C, 1→D Merge CD(0.3) and B(0.3) → BCD(0.6), assign 0→B, 1→CD Merge BCD(0.6) and A(0.4) → root(1.0), assign 0→A, 1→BCD

Tracing back: A→0, B→10, C→110, D→111

Expected length: $L = 0.4(1) + 0.3(2) + 0.2(3) + 0.1(3) = 0.4 + 0.6 + 0.6 + 0.3 = 1.9$ bits

Entropy: $H = -0.4\log_2 0.4 - 0.3\log_2 0.3 - 0.2\log_2 0.2 - 0.1\log_2 0.1 \approx 1.846$ bits

$1.846 \leq 1.9 < 2.846$ ✓ (theorem satisfied)

Example 3: Block coding improves efficiency

Source outputs A (p=0.7) or B (p=0.3). $H(X) = H_b(0.3) \approx 0.881$ bits.

Single-symbol Huffman: A→0, B→1. $L = 1$ bit → gap of 0.119 bits/symbol.

Block of 2: code pairs of symbols. P(AA)=0.49, P(AB)=0.21, P(BA)=0.21, P(BB)=0.09

Huffman on pairs: yields $L_2 \approx 1.81$ bits per pair → 0.905 bits/symbol.

Gap shrinks from 0.119 to 0.024 bits/symbol. Block of 3 would get even closer!



Quiz

Q1: What does the concept of Block coding primarily refer to in this subject?

A) A visual representation of Block coding B) A computational error related to Block coding C) The definition and application of Block coding D) A historical anecdote about Block coding

Correct: C)

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question B) An unrelated formula from a different topic C) L \geq H(X) D) A simplified version of L \geq H(X)...

Correct: C)

Q3: What is the primary purpose of Huffman coding?

A) It replaces all other methods in this domain B) It is used to huffman coding in mathematical analysis C) It is used only in advanced research contexts D) It is primarily a historical notation system

Correct: B)

Q4: Which statement about Kraft inequality is TRUE?

A) Kraft inequality is a fundamental concept covered in this subject B) Kraft inequality is an advanced topic beyond this subject's scope C) Kraft inequality is not related to this subject D) Kraft inequality is mentioned only as a historical footnote

Correct: A)

Q5: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) The inverse of the correct answer C) 1.0 \leq 1$ ✓ D) A different result from a common mistake

Correct: C)

Q6: How are Kraft inequality and Prefix codes related?

A) Kraft inequality and Prefix codes are completely unrelated topics B) Kraft inequality is the inverse of Prefix codes C) Kraft inequality and Prefix codes are closely related concepts D) Kraft inequality is a special case of Prefix codes

Correct: C)

Q7: What is a common pitfall when working with ⚠️ Critical: Shannon'S Source Coding Theorem?

A) A common mistake is confusing ⚠️ Critical: Shannon'S Source Coding Theorem with a similar concept B) The main error with ⚠️ Critical: Shannon'S Source Coding Theorem is using it when it is not needed C) ⚠️ Critical: Shannon'S Source Coding Theorem has no common misconceptions D) ⚠️ Critical: Shannon'S Source Coding Theorem is always computed the same way in all contexts

Correct: A)

Q8: When should you apply Types Of Codes?

A) Use Types Of Codes only in pure mathematics contexts B) Types Of Codes is not practically useful C) Apply Types Of Codes to solve problems in this subject's domain D) Avoid Types Of Codes unless explicitly instructed

Correct: C)

Practice Problems

  1. Why is {0, 01, 10} not a prefix code? Can you decode "010" unambiguously?

    Click for answer 0 is a prefix of 01 — not prefix-free. Attempting to decode "010": first bit is 0 → A. Remaining "10" → C. So "010" → "AC". But is "01" at the start a codeword? Yes, 01 → B. Then "0" → A? But "0" on its own after that... The string "010" can be parsed as A-C or B-? (ambiguous — B leaves "0" which is A). So the decoding is NOT unique: "010" could be AC or BA. Not uniquely decodable.

  2. Given codeword lengths {1, 2, 3, 3}, do there exist prefix codes with these lengths? What about {1, 1, 2, 3}?

    Click for answer {1, 2, 3, 3}: $2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 0.5 + 0.25 + 0.125 + 0.125 = 1.0 \leq 1$ ✓ — exists. {1, 1, 2, 3}: $2^{-1} + 2^{-1} + 2^{-2} + 2^{-3} = 0.5 + 0.5 + 0.25 + 0.125 = 1.375 > 1$ ✗ — impossible. You can't have two 1-bit codewords.

  3. For P(A)=0.5, P(B)=0.25, P(C)=0.125, P(D)=0.125, what is the optimal expected code length?

    Click for answer $H(X) = 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 0.5 + 0.5 + 0.375 + 0.375 = 1.75$ bits. Since the probabilities are powers of 2, ideal lengths $\ell_i = -\log_2 p_i$ are integers: (1, 2, 3, 3). Kraft sum = 1.0. So $L_{\text{opt}} = H(X) = 1.75$ bits — the entropy bound is achieved exactly!

  4. Why do modern compressors (gzip, LZMA) outperform Huffman coding on real data?

    Click for answer Huffman assumes an i.i.d. source with a KNOWN, FIXED distribution. Real data has structure: patterns, repetitions, context dependence. LZ-based methods exploit repeated substrings. Context-modelling compressors (PPM, PAQ) use conditional probabilities $P(x_i \mid x_{i-1}, \ldots, x_{i-k})$. Block coding and adaptive coding also help close the gap to entropy.

  5. Prove that the expected length of any prefix code satisfies $L \geq H(X)$.

    Click for answer Define $q_i = 2^{-\ell_i} / c$ where $c = \sum 2^{-\ell_j} \leq 1$ (Kraft). So $q_i \geq 2^{-\ell_i}$. $L - H(X) = \sum p_i \ell_i + \sum p_i \log p_i = \sum p_i \log \frac{p_i}{2^{-\ell_i}}$ $\geq \sum p_i \log \frac{p_i}{q_i} = D_{KL}(P \parallel Q) \geq 0$ Therefore $L \geq H(X)$, with equality iff $p_i = q_i = 2^{-\ell_i}$ (integer-length constraint permitting).


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 13-07-channel-capacity.md