📐 Concept diagram

### 13.6 — Source Coding Theorem

Phase: Information Theory Prerequisites: 13-01-entropy, 13-04-kl-divergence

Learning Objectives

By the end of this subject, you will be able to:

State Shannon's source coding theorem and its operational significance
Define prefix codes, uniquely decodable codes, and the Kraft inequality
Prove that the expected code length $L \geq H(X)$ (entropy bound)
Construct a Huffman code for a given distribution
Explain why $H(X)$ is the fundamental compression limit

Core Content

⚠️ CRITICAL: Shannon's Source Coding Theorem

Theorem (Shannon, 1948): For any discrete memoryless source with entropy $H(X)$, there exists a uniquely decodable code with expected length $L$ satisfying:

$$H(X) \leq L < H(X) + 1$$

What this means operationally: - You CANNOT compress below $H(X)$ bits per symbol on average (no matter how clever your code) - You CAN achieve average length arbitrarily close to $H(X)$ by coding blocks of symbols together - For a block of $n$ symbols: $H(X) \leq \frac{L_n}{n} < H(X) + \frac{1}{n}$

As $n \to \infty$, the per-symbol rate approaches the entropy.

Entropy is the fundamental compression limit. No lossless compression algorithm can beat it in expectation.

Types of Codes

Uniquely decodable: Any encoded string corresponds to exactly one source message. No ambiguity.

Prefix-free (instantaneous) code: No codeword is a prefix of another. Can be decoded on-the-fly without lookahead. Example: {0, 10, 110, 111} is prefix-free; {0, 01, 011} is NOT (0 is prefix of 01, 01 is prefix of 011).

All prefix codes are uniquely decodable, but not vice versa. However, any uniquely decodable code can be converted to a prefix code with the same lengths.

Kraft Inequality

For a prefix code with codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$ over a binary alphabet:

$$\sum_{i=1}^{m} 2^{-\ell_i} \leq 1$$

Conversely, if lengths satisfy Kraft, a prefix code with those lengths EXISTS.

This is the fundamental constraint on code design — you cannot make all codewords short; the budget is limited to 1.

⚠️ CRITICAL: Entropy Bound Proof

For any uniquely decodable code with lengths ${\ell_i}$:

$$L - H(X) = \sum p_i \ell_i - \sum p_i \log_2 \frac{1}{p_i}$$

$$= \sum p_i \log_2 \frac{p_i}{2^{-\ell_i}} = \sum p_i \log_2 \frac{p_i}{q_i} = D_{KL}(P \parallel Q) \geq 0$$

where $q_i = 2^{-\ell_i} / \sum 2^{-\ell_j}$ (normalised from Kraft). The KL divergence is non-negative, so $L \geq H(X)$, with equality iff $p_i = q_i$, i.e., when $\ell_i = -\log_2 p_i$ (ideal codeword lengths).

Huffman Coding

Huffman's algorithm constructs the optimal prefix code for a given distribution:

Take the two least probable symbols
Combine them into a meta-symbol with probability = sum
Repeat until one symbol remains
Assign 0/1 at each merge to build the code tree

Properties: - Optimal among all prefix codes (minimum expected length) - $H(X) \leq L_{\text{Huffman}} < H(X) + 1$ - Can be far from optimal if one symbol has probability close to 1

🚩 Common Pitfall: Huffman coding is optimal for symbol-by-symbol encoding a KNOWN distribution. It is NOT optimal for unknown distributions, non-i.i.d. sources, or block coding.

Key Terms

Block coding
Huffman coding
Kraft inequality
Prefix codes

Worked Examples

Example 1: Code design and Kraft inequality

Symbols A, B, C, D with codewords: A→0, B→10, C→110, D→111.

Check Kraft: $2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 0.5 + 0.25 + 0.125 + 0.125 = 1.0$ ✓

This is a complete prefix code (Kraft sum = 1). Any binary string can be uniquely decoded.

Decode "0101101110": 0 → A 10 → B 110 → C 111 → D 0 → A → "ABCDA"

Example 2: Huffman code construction

Distribution: P(A)=0.4, P(B)=0.3, P(C)=0.2, P(D)=0.1

Merge D(0.1) and C(0.2) → CD(0.3), assign 0→C, 1→D Merge CD(0.3) and B(0.3) → BCD(0.6), assign 0→B, 1→CD Merge BCD(0.6) and A(0.4) → root(1.0), assign 0→A, 1→BCD

Tracing back: A→0, B→10, C→110, D→111

Expected length: $L = 0.4(1) + 0.3(2) + 0.2(3) + 0.1(3) = 0.4 + 0.6 + 0.6 + 0.3 = 1.9$ bits

Entropy: $H = -0.4\log_2 0.4 - 0.3\log_2 0.3 - 0.2\log_2 0.2 - 0.1\log_2 0.1 \approx 1.846$ bits

$1.846 \leq 1.9 < 2.846$ ✓ (theorem satisfied)

Example 3: Block coding improves efficiency

Source outputs A (p=0.7) or B (p=0.3). $H(X) = H_b(0.3) \approx 0.881$ bits.

Single-symbol Huffman: A→0, B→1. $L = 1$ bit → gap of 0.119 bits/symbol.

Block of 2: code pairs of symbols. P(AA)=0.49, P(AB)=0.21, P(BA)=0.21, P(BB)=0.09

Huffman on pairs: yields $L_2 \approx 1.81$ bits per pair → 0.905 bits/symbol.

Gap shrinks from 0.119 to 0.024 bits/symbol. Block of 3 would get even closer!

Quiz

Q1: What does the concept of Block coding primarily refer to in this subject?

A) A visual representation of Block coding B) A computational error related to Block coding C) The definition and application of Block coding D) A historical anecdote about Block coding

Correct: C)

If you chose A: This is incorrect. Block coding is defined as: the definition and application of block coding. The other options describe different aspects that are not the primary focus.
If you chose B: This is incorrect. Block coding is defined as: the definition and application of block coding. The other options describe different aspects that are not the primary focus.
If you chose C: Block coding is defined as: the definition and application of block coding. The other options describe different aspects that are not the primary focus. Correct!
If you chose D: This is incorrect. Block coding is defined as: the definition and application of block coding. The other options describe different aspects that are not the primary focus.

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question B) An unrelated formula from a different topic C) L \geq H(X) D) A simplified version of L \geq H(X)...

Correct: C)

If you chose A: This is incorrect. The formula L \geq H(X) is central to this subject. The other options are either simplified versions or unrelated.
If you chose B: This is incorrect. The formula L \geq H(X) is central to this subject. The other options are either simplified versions or unrelated.
If you chose C: The formula L \geq H(X) is central to this subject. The other options are either simplified versions or unrelated. Correct!
If you chose D: This is incorrect. The formula L \geq H(X) is central to this subject. The other options are either simplified versions or unrelated.

Q3: What is the primary purpose of Huffman coding?

A) It replaces all other methods in this domain B) It is used to huffman coding in mathematical analysis C) It is used only in advanced research contexts D) It is primarily a historical notation system

Correct: B)

If you chose A: This is incorrect. Huffman coding serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose B: Huffman coding serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
If you chose C: This is incorrect. Huffman coding serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose D: This is incorrect. Huffman coding serves the purpose described in the correct answer. The other options misrepresent its role.

Q4: Which statement about Kraft inequality is TRUE?

A) Kraft inequality is a fundamental concept covered in this subject B) Kraft inequality is an advanced topic beyond this subject's scope C) Kraft inequality is not related to this subject D) Kraft inequality is mentioned only as a historical footnote

Correct: A)

If you chose A: Kraft inequality is a fundamental concept covered in this subject. This subject covers Kraft inequality as part of its core content. Correct!
If you chose B: This is incorrect. Kraft inequality is a fundamental concept covered in this subject. This subject covers Kraft inequality as part of its core content.
If you chose C: This is incorrect. Kraft inequality is a fundamental concept covered in this subject. This subject covers Kraft inequality as part of its core content.
If you chose D: This is incorrect. Kraft inequality is a fundamental concept covered in this subject. This subject covers Kraft inequality as part of its core content.

Q5: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) The inverse of the correct answer C) 1.0 \leq 1$ ✓ D) A different result from a common mistake

Correct: C)

If you chose A: This is incorrect. The worked examples show that the result is 1.0 \leq 1$ ✓. The other options represent common errors.
If you chose B: This is incorrect. The worked examples show that the result is 1.0 \leq 1$ ✓. The other options represent common errors.
If you chose C: The worked examples show that the result is 1.0 \leq 1$ ✓. The other options represent common errors. Correct!
If you chose D: This is incorrect. The worked examples show that the result is 1.0 \leq 1$ ✓. The other options represent common errors.

Q6: How are Kraft inequality and Prefix codes related?

A) Kraft inequality and Prefix codes are completely unrelated topics B) Kraft inequality is the inverse of Prefix codes C) Kraft inequality and Prefix codes are closely related concepts D) Kraft inequality is a special case of Prefix codes

Correct: C)

If you chose A: This is incorrect. Both Kraft inequality and Prefix codes are covered in this subject as interconnected topics.
If you chose B: This is incorrect. Both Kraft inequality and Prefix codes are covered in this subject as interconnected topics.
If you chose C: Both Kraft inequality and Prefix codes are covered in this subject as interconnected topics. Correct!
If you chose D: This is incorrect. Both Kraft inequality and Prefix codes are covered in this subject as interconnected topics.

Q7: What is a common pitfall when working with ⚠️ Critical: Shannon'S Source Coding Theorem?

A) A common mistake is confusing ⚠️ Critical: Shannon'S Source Coding Theorem with a similar concept B) The main error with ⚠️ Critical: Shannon'S Source Coding Theorem is using it when it is not needed C) ⚠️ Critical: Shannon'S Source Coding Theorem has no common misconceptions D) ⚠️ Critical: Shannon'S Source Coding Theorem is always computed the same way in all contexts

Correct: A)

If you chose A: Students often confuse ⚠️ Critical: Shannon'S Source Coding Theorem with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose B: This is incorrect. Students often confuse ⚠️ Critical: Shannon'S Source Coding Theorem with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose C: This is incorrect. Students often confuse ⚠️ Critical: Shannon'S Source Coding Theorem with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose D: This is incorrect. Students often confuse ⚠️ Critical: Shannon'S Source Coding Theorem with similar-sounding or related concepts. Pay attention to the precise definitions.

Q8: When should you apply Types Of Codes?

A) Use Types Of Codes only in pure mathematics contexts B) Types Of Codes is not practically useful C) Apply Types Of Codes to solve problems in this subject's domain D) Avoid Types Of Codes unless explicitly instructed

Correct: C)

If you chose A: This is incorrect. Types Of Codes is a practical tool used throughout this subject to solve relevant problems.
If you chose B: This is incorrect. Types Of Codes is a practical tool used throughout this subject to solve relevant problems.
If you chose C: Types Of Codes is a practical tool used throughout this subject to solve relevant problems. Correct!
If you chose D: This is incorrect. Types Of Codes is a practical tool used throughout this subject to solve relevant problems.

Practice Problems

Why is {0, 01, 10} not a prefix code? Can you decode "010" unambiguously?

Click for answer
0 is a prefix of 01 — not prefix-free. Attempting to decode "010": first bit is 0 → A. Remaining "10" → C. So "010" → "AC". But is "01" at the start a codeword? Yes, 01 → B. Then "0" → A? But "0" on its own after that... The string "010" can be parsed as A-C or B-? (ambiguous — B leaves "0" which is A). So the decoding is NOT unique: "010" could be AC or BA. Not uniquely decodable.
Given codeword lengths {1, 2, 3, 3}, do there exist prefix codes with these lengths? What about {1, 1, 2, 3}?

Click for answer
{1, 2, 3, 3}: $2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 0.5 + 0.25 + 0.125 + 0.125 = 1.0 \leq 1$ ✓ — exists. {1, 1, 2, 3}: $2^{-1} + 2^{-1} + 2^{-2} + 2^{-3} = 0.5 + 0.5 + 0.25 + 0.125 = 1.375 > 1$ ✗ — impossible. You can't have two 1-bit codewords.
For P(A)=0.5, P(B)=0.25, P(C)=0.125, P(D)=0.125, what is the optimal expected code length?

Click for answer
$H(X) = 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 0.5 + 0.5 + 0.375 + 0.375 = 1.75$ bits. Since the probabilities are powers of 2, ideal lengths $\ell_i = -\log_2 p_i$ are integers: (1, 2, 3, 3). Kraft sum = 1.0. So $L_{\text{opt}} = H(X) = 1.75$ bits — the entropy bound is achieved exactly!
Why do modern compressors (gzip, LZMA) outperform Huffman coding on real data?

Click for answer
Huffman assumes an i.i.d. source with a KNOWN, FIXED distribution. Real data has structure: patterns, repetitions, context dependence. LZ-based methods exploit repeated substrings. Context-modelling compressors (PPM, PAQ) use conditional probabilities $P(x_i \mid x_{i-1}, \ldots, x_{i-k})$. Block coding and adaptive coding also help close the gap to entropy.
Prove that the expected length of any prefix code satisfies $L \geq H(X)$.

Click for answer
Define $q_i = 2^{-\ell_i} / c$ where $c = \sum 2^{-\ell_j} \leq 1$ (Kraft). So $q_i \geq 2^{-\ell_i}$. $L - H(X) = \sum p_i \ell_i + \sum p_i \log p_i = \sum p_i \log \frac{p_i}{2^{-\ell_i}}$ $\geq \sum p_i \log \frac{p_i}{q_i} = D_{KL}(P \parallel Q) \geq 0$ Therefore $L \geq H(X)$, with equality iff $p_i = q_i = 2^{-\ell_i}$ (integer-length constraint permitting).

Summary

Key takeaways:

Source coding theorem: $H(X) \leq L < H(X) + 1$ — entropy is the fundamental compression limit
Prefix codes allow instantaneous decoding; governed by the Kraft inequality
Huffman coding produces optimal prefix codes for known distributions
Block coding closes the gap to entropy: $L_n/n \to H(X)$ as $n \to \infty$
Real compressors go beyond Huffman by exploiting structure, context, and adaptation

Pitfalls

Expecting Huffman coding to always achieve the entropy bound: Huffman coding guarantees $H(X) \leq L < H(X) + 1$ per symbol, which leaves a gap of up to 1 bit. This gap is worst when one symbol has probability near 1 — single-symbol Huffman can be far from optimal. Block coding closes this gap by amortising the overhead over multiple symbols.
Using static Huffman codes for unknown or changing distributions: Huffman coding assumes a known, fixed distribution. If the distribution is estimated from data or changes over time, a static Huffman code is suboptimal. Adaptive Huffman coding or arithmetic coding should be used when the source statistics are unknown or non-stationary.
Designing codes that violate the Kraft inequality: For a prefix code with lengths ${\ell_i}$, the Kraft sum $\sum 2^{-\ell_i}$ must not exceed 1. A common mistake is assigning short codewords to too many symbols without checking this constraint. Any set of codeword lengths violating Kraft cannot correspond to any prefix code — the code tree would be impossible to construct.
Confusing prefix codes with all uniquely decodable codes: All prefix codes are uniquely decodable, but not all uniquely decodable codes are prefix-free. However, the Kraft inequality applies to any uniquely decodable code (McMillan's theorem), and any uniquely decodable code can be converted to a prefix code with the same codeword lengths. So restricting attention to prefix codes incurs no loss in achievable compression.
Assuming Huffman coding is optimal for all compression scenarios: Huffman is optimal for symbol-by-symbol encoding of a known i.i.d. source. It is not optimal for sources with memory, unknown distributions, or when block coding is allowed (arithmetic coding can approach entropy more closely on a per-symbol basis without needing very large blocks).

Next Steps

Next up: 13-07-channel-capacity.md

Progress

Phases

### 13.6 — Source Coding Theorem

Learning Objectives

Core Content

⚠️ CRITICAL: Shannon's Source Coding Theorem

Types of Codes

Kraft Inequality

⚠️ CRITICAL: Entropy Bound Proof

Huffman Coding

Key Terms

Worked Examples

Example 1: Code design and Kraft inequality

Example 2: Huffman code construction

Example 3: Block coding improves efficiency

Quiz

Practice Problems

Summary

Pitfalls

Next Steps