### 13.6 — Source Coding Theorem
Phase: Information Theory Prerequisites: 13-01-entropy, 13-04-kl-divergence
Learning Objectives
By the end of this subject, you will be able to:
- State Shannon's source coding theorem and its operational significance
- Define prefix codes, uniquely decodable codes, and the Kraft inequality
- Prove that the expected code length $L \geq H(X)$ (entropy bound)
- Construct a Huffman code for a given distribution
- Explain why $H(X)$ is the fundamental compression limit
Core Content
⚠️ CRITICAL: Shannon's Source Coding Theorem
Theorem (Shannon, 1948): For any discrete memoryless source with entropy $H(X)$, there exists a uniquely decodable code with expected length $L$ satisfying:
$$H(X) \leq L < H(X) + 1$$
What this means operationally: - You CANNOT compress below $H(X)$ bits per symbol on average (no matter how clever your code) - You CAN achieve average length arbitrarily close to $H(X)$ by coding blocks of symbols together - For a block of $n$ symbols: $H(X) \leq \frac{L_n}{n} < H(X) + \frac{1}{n}$
As $n \to \infty$, the per-symbol rate approaches the entropy.
Entropy is the fundamental compression limit. No lossless compression algorithm can beat it in expectation.
Types of Codes
Uniquely decodable: Any encoded string corresponds to exactly one source message. No ambiguity.
Prefix-free (instantaneous) code: No codeword is a prefix of another. Can be decoded on-the-fly without lookahead. Example: {0, 10, 110, 111} is prefix-free; {0, 01, 011} is NOT (0 is prefix of 01, 01 is prefix of 011).
All prefix codes are uniquely decodable, but not vice versa. However, any uniquely decodable code can be converted to a prefix code with the same lengths.
Kraft Inequality
For a prefix code with codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$ over a binary alphabet:
$$\sum_{i=1}^{m} 2^{-\ell_i} \leq 1$$
Conversely, if lengths satisfy Kraft, a prefix code with those lengths EXISTS.
This is the fundamental constraint on code design — you cannot make all codewords short; the budget is limited to 1.
⚠️ CRITICAL: Entropy Bound Proof
For any uniquely decodable code with lengths ${\ell_i}$:
$$L - H(X) = \sum p_i \ell_i - \sum p_i \log_2 \frac{1}{p_i}$$
$$= \sum p_i \log_2 \frac{p_i}{2^{-\ell_i}} = \sum p_i \log_2 \frac{p_i}{q_i} = D_{KL}(P \parallel Q) \geq 0$$
where $q_i = 2^{-\ell_i} / \sum 2^{-\ell_j}$ (normalised from Kraft). The KL divergence is non-negative, so $L \geq H(X)$, with equality iff $p_i = q_i$, i.e., when $\ell_i = -\log_2 p_i$ (ideal codeword lengths).
Huffman Coding
Huffman's algorithm constructs the optimal prefix code for a given distribution:
- Take the two least probable symbols
- Combine them into a meta-symbol with probability = sum
- Repeat until one symbol remains
- Assign 0/1 at each merge to build the code tree
Properties: - Optimal among all prefix codes (minimum expected length) - $H(X) \leq L_{\text{Huffman}} < H(X) + 1$ - Can be far from optimal if one symbol has probability close to 1
🚩 Common Pitfall: Huffman coding is optimal for symbol-by-symbol encoding a KNOWN distribution. It is NOT optimal for unknown distributions, non-i.i.d. sources, or block coding.
Key Terms
- Block coding
- Huffman coding
- Kraft inequality
- Prefix codes
Worked Examples
Example 1: Code design and Kraft inequality
Symbols A, B, C, D with codewords: A→0, B→10, C→110, D→111.
Check Kraft: $2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 0.5 + 0.25 + 0.125 + 0.125 = 1.0$ ✓
This is a complete prefix code (Kraft sum = 1). Any binary string can be uniquely decoded.
Decode "0101101110": 0 → A 10 → B 110 → C 111 → D 0 → A → "ABCDA"
Example 2: Huffman code construction
Distribution: P(A)=0.4, P(B)=0.3, P(C)=0.2, P(D)=0.1
Merge D(0.1) and C(0.2) → CD(0.3), assign 0→C, 1→D Merge CD(0.3) and B(0.3) → BCD(0.6), assign 0→B, 1→CD Merge BCD(0.6) and A(0.4) → root(1.0), assign 0→A, 1→BCD
Tracing back: A→0, B→10, C→110, D→111
Expected length: $L = 0.4(1) + 0.3(2) + 0.2(3) + 0.1(3) = 0.4 + 0.6 + 0.6 + 0.3 = 1.9$ bits
Entropy: $H = -0.4\log_2 0.4 - 0.3\log_2 0.3 - 0.2\log_2 0.2 - 0.1\log_2 0.1 \approx 1.846$ bits
$1.846 \leq 1.9 < 2.846$ ✓ (theorem satisfied)
Example 3: Block coding improves efficiency
Source outputs A (p=0.7) or B (p=0.3). $H(X) = H_b(0.3) \approx 0.881$ bits.
Single-symbol Huffman: A→0, B→1. $L = 1$ bit → gap of 0.119 bits/symbol.
Block of 2: code pairs of symbols. P(AA)=0.49, P(AB)=0.21, P(BA)=0.21, P(BB)=0.09
Huffman on pairs: yields $L_2 \approx 1.81$ bits per pair → 0.905 bits/symbol.
Gap shrinks from 0.119 to 0.024 bits/symbol. Block of 3 would get even closer!
Quiz
Q1: What does the concept of Block coding primarily refer to in this subject?
A) A visual representation of Block coding B) A computational error related to Block coding C) The definition and application of Block coding D) A historical anecdote about Block coding
Correct: C)
- If you chose A: This is incorrect. Block coding is defined as: the definition and application of block coding. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Block coding is defined as: the definition and application of block coding. The other options describe different aspects that are not the primary focus.
- If you chose C: Block coding is defined as: the definition and application of block coding. The other options describe different aspects that are not the primary focus. Correct!
- If you chose D: This is incorrect. Block coding is defined as: the definition and application of block coding. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) The inverse operation of the formula in question B) An unrelated formula from a different topic C) L \geq H(X) D) A simplified version of L \geq H(X)...
Correct: C)
- If you chose A: This is incorrect. The formula L \geq H(X) is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: This is incorrect. The formula L \geq H(X) is central to this subject. The other options are either simplified versions or unrelated.
- If you chose C: The formula L \geq H(X) is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose D: This is incorrect. The formula L \geq H(X) is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Huffman coding?
A) It replaces all other methods in this domain B) It is used to huffman coding in mathematical analysis C) It is used only in advanced research contexts D) It is primarily a historical notation system
Correct: B)
- If you chose A: This is incorrect. Huffman coding serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: Huffman coding serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose C: This is incorrect. Huffman coding serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Huffman coding serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Kraft inequality is TRUE?
A) Kraft inequality is a fundamental concept covered in this subject B) Kraft inequality is an advanced topic beyond this subject's scope C) Kraft inequality is not related to this subject D) Kraft inequality is mentioned only as a historical footnote
Correct: A)
- If you chose A: Kraft inequality is a fundamental concept covered in this subject. This subject covers Kraft inequality as part of its core content. Correct!
- If you chose B: This is incorrect. Kraft inequality is a fundamental concept covered in this subject. This subject covers Kraft inequality as part of its core content.
- If you chose C: This is incorrect. Kraft inequality is a fundamental concept covered in this subject. This subject covers Kraft inequality as part of its core content.
- If you chose D: This is incorrect. Kraft inequality is a fundamental concept covered in this subject. This subject covers Kraft inequality as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) An unrelated numerical value B) The inverse of the correct answer C) 1.0 \leq 1$ ✓ D) A different result from a common mistake
Correct: C)
- If you chose A: This is incorrect. The worked examples show that the result is 1.0 \leq 1$ ✓. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is 1.0 \leq 1$ ✓. The other options represent common errors.
- If you chose C: The worked examples show that the result is 1.0 \leq 1$ ✓. The other options represent common errors. Correct!
- If you chose D: This is incorrect. The worked examples show that the result is 1.0 \leq 1$ ✓. The other options represent common errors.
Q6: How are Kraft inequality and Prefix codes related?
A) Kraft inequality and Prefix codes are completely unrelated topics B) Kraft inequality is the inverse of Prefix codes C) Kraft inequality and Prefix codes are closely related concepts D) Kraft inequality is a special case of Prefix codes
Correct: C)
- If you chose A: This is incorrect. Both Kraft inequality and Prefix codes are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Kraft inequality and Prefix codes are covered in this subject as interconnected topics.
- If you chose C: Both Kraft inequality and Prefix codes are covered in this subject as interconnected topics. Correct!
- If you chose D: This is incorrect. Both Kraft inequality and Prefix codes are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with ⚠️ Critical: Shannon'S Source Coding Theorem?
A) A common mistake is confusing ⚠️ Critical: Shannon'S Source Coding Theorem with a similar concept B) The main error with ⚠️ Critical: Shannon'S Source Coding Theorem is using it when it is not needed C) ⚠️ Critical: Shannon'S Source Coding Theorem has no common misconceptions D) ⚠️ Critical: Shannon'S Source Coding Theorem is always computed the same way in all contexts
Correct: A)
- If you chose A: Students often confuse ⚠️ Critical: Shannon'S Source Coding Theorem with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose B: This is incorrect. Students often confuse ⚠️ Critical: Shannon'S Source Coding Theorem with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse ⚠️ Critical: Shannon'S Source Coding Theorem with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse ⚠️ Critical: Shannon'S Source Coding Theorem with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Types Of Codes?
A) Use Types Of Codes only in pure mathematics contexts B) Types Of Codes is not practically useful C) Apply Types Of Codes to solve problems in this subject's domain D) Avoid Types Of Codes unless explicitly instructed
Correct: C)
- If you chose A: This is incorrect. Types Of Codes is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Types Of Codes is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: Types Of Codes is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose D: This is incorrect. Types Of Codes is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
Why is {0, 01, 10} not a prefix code? Can you decode "010" unambiguously?
Click for answer
0 is a prefix of 01 — not prefix-free. Attempting to decode "010": first bit is 0 → A. Remaining "10" → C. So "010" → "AC". But is "01" at the start a codeword? Yes, 01 → B. Then "0" → A? But "0" on its own after that... The string "010" can be parsed as A-C or B-? (ambiguous — B leaves "0" which is A). So the decoding is NOT unique: "010" could be AC or BA. Not uniquely decodable. -
Given codeword lengths {1, 2, 3, 3}, do there exist prefix codes with these lengths? What about {1, 1, 2, 3}?
Click for answer
{1, 2, 3, 3}: $2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 0.5 + 0.25 + 0.125 + 0.125 = 1.0 \leq 1$ ✓ — exists. {1, 1, 2, 3}: $2^{-1} + 2^{-1} + 2^{-2} + 2^{-3} = 0.5 + 0.5 + 0.25 + 0.125 = 1.375 > 1$ ✗ — impossible. You can't have two 1-bit codewords. -
For P(A)=0.5, P(B)=0.25, P(C)=0.125, P(D)=0.125, what is the optimal expected code length?
Click for answer
$H(X) = 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 0.5 + 0.5 + 0.375 + 0.375 = 1.75$ bits. Since the probabilities are powers of 2, ideal lengths $\ell_i = -\log_2 p_i$ are integers: (1, 2, 3, 3). Kraft sum = 1.0. So $L_{\text{opt}} = H(X) = 1.75$ bits — the entropy bound is achieved exactly! -
Why do modern compressors (gzip, LZMA) outperform Huffman coding on real data?
Click for answer
Huffman assumes an i.i.d. source with a KNOWN, FIXED distribution. Real data has structure: patterns, repetitions, context dependence. LZ-based methods exploit repeated substrings. Context-modelling compressors (PPM, PAQ) use conditional probabilities $P(x_i \mid x_{i-1}, \ldots, x_{i-k})$. Block coding and adaptive coding also help close the gap to entropy. -
Prove that the expected length of any prefix code satisfies $L \geq H(X)$.
Click for answer
Define $q_i = 2^{-\ell_i} / c$ where $c = \sum 2^{-\ell_j} \leq 1$ (Kraft). So $q_i \geq 2^{-\ell_i}$. $L - H(X) = \sum p_i \ell_i + \sum p_i \log p_i = \sum p_i \log \frac{p_i}{2^{-\ell_i}}$ $\geq \sum p_i \log \frac{p_i}{q_i} = D_{KL}(P \parallel Q) \geq 0$ Therefore $L \geq H(X)$, with equality iff $p_i = q_i = 2^{-\ell_i}$ (integer-length constraint permitting).
Summary
Key takeaways:
- Source coding theorem: $H(X) \leq L < H(X) + 1$ — entropy is the fundamental compression limit
- Prefix codes allow instantaneous decoding; governed by the Kraft inequality
- Huffman coding produces optimal prefix codes for known distributions
- Block coding closes the gap to entropy: $L_n/n \to H(X)$ as $n \to \infty$
- Real compressors go beyond Huffman by exploiting structure, context, and adaptation
Pitfalls
-
Expecting Huffman coding to always achieve the entropy bound: Huffman coding guarantees $H(X) \leq L < H(X) + 1$ per symbol, which leaves a gap of up to 1 bit. This gap is worst when one symbol has probability near 1 — single-symbol Huffman can be far from optimal. Block coding closes this gap by amortising the overhead over multiple symbols.
-
Using static Huffman codes for unknown or changing distributions: Huffman coding assumes a known, fixed distribution. If the distribution is estimated from data or changes over time, a static Huffman code is suboptimal. Adaptive Huffman coding or arithmetic coding should be used when the source statistics are unknown or non-stationary.
-
Designing codes that violate the Kraft inequality: For a prefix code with lengths ${\ell_i}$, the Kraft sum $\sum 2^{-\ell_i}$ must not exceed 1. A common mistake is assigning short codewords to too many symbols without checking this constraint. Any set of codeword lengths violating Kraft cannot correspond to any prefix code — the code tree would be impossible to construct.
-
Confusing prefix codes with all uniquely decodable codes: All prefix codes are uniquely decodable, but not all uniquely decodable codes are prefix-free. However, the Kraft inequality applies to any uniquely decodable code (McMillan's theorem), and any uniquely decodable code can be converted to a prefix code with the same codeword lengths. So restricting attention to prefix codes incurs no loss in achievable compression.
-
Assuming Huffman coding is optimal for all compression scenarios: Huffman is optimal for symbol-by-symbol encoding of a known i.i.d. source. It is not optimal for sources with memory, unknown distributions, or when block coding is allowed (arithmetic coding can approach entropy more closely on a per-symbol basis without needing very large blocks).
Next Steps
Next up: 13-07-channel-capacity.md