18-01 — Tokenization Mathematics
Phase: 18 — Large Language Model Mathematics Subject: 18-01 Prerequisites: 13-01 (Entropy), 13-04 (KL Divergence), 10-04 (Discrete Random Variables), 10-01 (Probability Foundations) Next subject: 18-02 — Embedding Layers
Learning Objectives
By the end of this subject, you will be able to:
- Derive the Byte-Pair Encoding (BPE) algorithm from first principles, including the merge rule and vocabulary construction
- Compare BPE, WordPiece, and Unigram tokenization mathematically, stating each method's training objective and segmentation criterion
- Compute the vocabulary size tradeoff — how |V| affects sequence length, embedding parameters, and coverage of rare tokens
- Explain subword regularization using the Unigram model and the BPE-dropout technique
- Calculate compression ratios and token efficiency metrics for a given tokenizer on sample text
Core Content
1. Why Tokenization? The Mathematical Framing
An LLM operates on a finite vocabulary V of tokens. Given a text string s (a sequence of Unicode characters), tokenization is a function:
T: Σ → V
where Σ is the character set (Unicode) and V is the token vocabulary. The tokenizer must:
- Coverage: Every possible input string must be tokenizable (or produce
) - Efficiency: Shorter token sequences are better (faster inference, O(n²) attention cost)
- Reversibility: The original string should be recoverable (or nearly so — detokenization)
The fundamental tradeoff: Larger |V| means fewer tokens per sentence (faster), but larger embedding matrix E ∈ ℝ^(V×d) (more parameters) and more training data needed per token type (rarer tokens = poorly trained embeddings).
2. Byte-Pair Encoding (BPE)
⚠️ THIS IS CRITICAL — BPE's base vocabulary always includes individual bytes (or characters). This guarantees that ANY Unicode string can be tokenized: unseen characters become their UTF-8 byte sequence, and all 256 bytes are in the vocabulary. Without byte-level fallback, unseen characters produce
BPE is the most common LLM tokenizer (GPT-2/3/4, Llama, Mistral). It works bottom-up from characters.
Algorithm
Training:
1. Initialize vocabulary V = set of all unique characters in the training corpus
2. Tokenize the corpus: each word is split into characters, with a special end-of-word symbol (e.g., </w>)
3. Count all adjacent token pairs in the corpus
4. Find the most frequent pair (a,b)
5. Merge (a,b) → new token ab, add ab to V, and update all occurrences
6. Repeat steps 3-5 until |V| reaches the desired size
Encoding (tokenization): 1. Split the input into characters 2. Repeatedly apply merges in the order they were learned, using the most recently applicable merge at each step 3. Stop when no more merges apply
Formal Training Objective
Let M = {m₁, m₂, ..., m_K} be the sequence of merge operations learned. Each merge m_k replaces every occurrence of token pair (a,b) with a new single token ab.
The training objective is greedy frequency maximization:
m_k = argmax_{(a,b)} count(a,b in corpus)
This is equivalent to minimizing the total number of tokens in the encoded corpus after each merge:
ΔL_k = −count(a,b)
Each merge reduces the corpus representation by count(a,b) tokens. After K merges, the total compression (in tokens saved) is:
S = Σ_{k=1}^{K} count(m_k)
Worked Derivation: BPE Compression
Consider a corpus where "low" appears 5 times, "lower" appears 2 times, "newest" appears 6 times, and "wider" appears 3 times. All words are appended with </w>.
Initial vocabulary: {l, o, w, e, r, n, s, t, i, d, }
Initial tokenization: Each word is character-split:
- low</w> → l o w
- lower</w> → l o w e r
- newest</w> → n e w e s t
- wider</w> → w i d e r
Pair counts (most frequent):
- (l, o) appears in low (×5) + lower (×2) = 7
- (o, w) appears in low (×5) + lower (×2) = 7
- (e, r) appears in lower (×2) + wider (×3) = 5
- (e, s) appears in newest (×6) = 6
- (e, w) appears nowhere (they're in separate tokens after (o,w)... wait, careful with overlap)
After merges: the BPE order depends on tie-breaking. Typical implementations use frequency first, then lexical order. Let's track:
Merge 1: (l, o) → lo (count=7)
Merge 2: (lo, w) → low (count=7) — now low and lower both start with low
Merge 3: (e, r) → er (count=5)
...
After training, new words like "lowest" tokenize as: low + est (if est was merged during training) or low + e + s + t otherwise.
3. WordPiece
Used by BERT. The key difference from BPE is the merge criterion: instead of raw frequency, WordPiece maximizes the likelihood of the training data under a unigram language model.
Merge Criterion
For a candidate merge of tokens a and b:
score(a,b) = count(a,b) / (count(a) · count(b))
This is the pointwise mutual information (PMI) between a and b:
PMI(a,b) = log(p(a,b) / (p(a) · p(b)))
where p(a,b) = count(a,b) / total_bigrams, p(a) = count(a) / total_tokens, p(b) = count(b) / total_tokens.
Interpretation: BPE merges the most frequent pair. WordPiece merges the pair whose co-occurrence is most surprising relative to their individual frequencies — i.e., the pair that's most "glued together." This tends to produce more linguistically meaningful subwords.
Why PMI?
If tokens a and b appear independently, p(a,b) = p(a)·p(b), so PMI = 0. If they co-occur more than expected by chance, PMI > 0 — they "belong together." WordPiece merges the pair with highest PMI.
Contrast with BPE: In BPE, if "t" and "h" are very common individually, count(t,h) will dominate even if count(t) · count(h) is huge. WordPiece normalizes by individual frequency, avoiding merging tokens that are just individually common.
4. Unigram Language Model Tokenization
Used by SentencePiece (T5, Llama, Alpaca). This takes a fundamentally different approach.
Model
Assume tokens are generated independently from a unigram distribution over V:
p(x) = Π_{i=1}^{n} p(t_i)
where x is a sentence, tokenized as t₁, t₂, ..., t_n, and p(t) is learned.
Training objective: Maximize the marginal likelihood of the training data over all possible tokenizations:
L = Σ_{x∈corpus} log( Σ_{t∈S(x)} Π_{i} p(t_i) )
where S(x) is the set of all valid tokenizations of x under V.
EM Algorithm for Training
- Initialize: Start with a large seed vocabulary (e.g., all characters + frequent substrings from BPE)
- E-step: For each word, compute the posterior probability of each tokenization using the Viterbi algorithm:
P(tokenization | word) ∝ Π p(t_i)
- M-step: Re-estimate token probabilities:
p(t) = (expected count of t) / (Σ expected count of all tokens)
where expected count = Σ_{word} Σ_{tokenization containing t} P(tokenization | word)
- Prune: Remove tokens with lowest p(t) · |V| < threshold (typically 20-30% of tokens per iteration)
- Repeat until |V| reaches target size
Encoding (Viterbi)
At inference, the tokenizer finds the highest-probability tokenization:
t* = argmax_{t∈S(x)} Π p(t_i)
This is computed via dynamic programming (Viterbi algorithm). Let dp[i] = max log-probability of tokenizing prefix x[0:i]:
dp[0] = 0 dp[j] = max_{i < j, x[i:j]∈V} [dp[i] + log p(x[i:j])]
⚠️ THIS IS CRITICAL — The Viterbi recurrence dp[j] = max_{i<j, x[i:j]∈V} [dp[i] + log p(x[i:j])] is the foundation of Unigram tokenization. It finds the globally optimal segmentation under the unigram model. Greedy longest-match is NOT equivalent and gives different (worse) results. This DP is also used in BPE for encoding, but with a different objective (applying learned merges).
5. Vocabulary Size Tradeoffs
Let V be the vocabulary size, L be the average sequence length, and T be the total characters in the corpus.
Sequence length vs. vocabulary size:
L ≈ T / log_{|V|}(T)
(Larger vocabulary → fewer tokens needed. But the relationship is sublinear — doubling |V| doesn't halve L.)
Embedding parameters:
Parameters_emb = |V| · d_model
Doubling |V| doubles the embedding parameter count. For d_model = 4096 and |V| = 32,000: Parameters_emb = 131M. For |V| = 64,000: 262M. This is ~15-25% of total model parameters for typical LLMs.
Coverage: Smaller |V| means more rare words are broken into many subwords. Extremely rare or unseen character sequences produce long token sequences. The out-of-vocabulary (OOV) rate for trained tokenizers on in-distribution text should be near zero (everything maps to known subword units).
Typical choices: GPT-2: 50,257 tokens. GPT-3/4: ~100,000 tokens. Llama: 32,000 tokens. Mistral: 32,000 tokens.
6. Subword Regularization
During training, we want the model to be robust to different tokenizations of the same text. Subword regularization exposes the model to multiple valid tokenizations.
BPE-Dropout
During tokenizer training, randomly drop p% of merge operations (typically p = 0.1). This produces different tokenizations of the same word. Example: "lower" might tokenize as low + er (merges intact) or l + ow + e + r (some merges dropped).
Unigram Sampling
The Unigram model naturally supports sampling multiple tokenizations: instead of taking the Viterbi best, sample from the n-best list or from the full posterior. The probability of a specific segmentation t = (t₁, ..., t_n) is proportional to Π p(t_i).
Regularization loss: For a given input, sample K different tokenizations and average the loss:
L_reg = (1/K) Σ_{k=1}^{K} L(model, tokenization_k)
7. Compression Ratio
The compression ratio of a tokenizer is:
CR = #characters / #tokens
For English text with a 32K BPE tokenizer, CR ≈ 4 (each token covers ~4 characters). For GPT-4's 100K tokenizer, CR ≈ 3.0-3.5. For non-English or code (with more unique characters), CR is lower.
Pitfalls
⚠️ Pitfall 1: Confusing BPE merge order with greedy encoding. BPE training builds merges in order of decreasing frequency, but inference applies merges in THAT order, not re-ranking on-the-fly. If you re-sort merges by current corpus frequency, you get a different tokenizer.
⚠️ Pitfall 2: Thinking PMI alone is the WordPiece criterion. WordPiece uses count(a,b)/(count(a)·count(b)), which is proportional to PMI only for ranking (the constant total_tokens²/total_bigrams drops out in argmax). The raw score is not PMI.
⚠️ Pitfall 3: Forgetting that vocabulary size affects BOTH embedding parameters AND the softmax output. For |V|=32K, d=4096: 131M params JUST for input embeddings. The output projection (if not weight-tied) adds another 131M — a combined 262M, nearly 4% of a 7B model.
Key Terms
- Wait
Worked Examples
Example 1: BPE Merge Sequence
Problem: Given a corpus: "aaabdaaabac" (a single string, no spaces). The initial vocabulary is {a, b, c, d}. Run 3 BPE merges.
Solution:
Step 0 — Initial tokenization: a a a b d a a a b a c
Count pairs: - (a, a): positions (0,1), (1,2), (5,6), (6,7) = 4 - (a, b): positions (2,3), (7,8) = 2 - (b, d): position (3,4) = 1 - (d, a): position (4,5) = 1 - (b, a): position (8,9) = 1 - (a, c): position (9,10) = 1
Merge 1: (a, a) → aa (count=4)
New: aa a b d aa a b a c
Token sequence: aa, a, b, d, aa, a, b, a, c
Pairs: - (aa, a): positions 0,3 → 2 - (a, b): positions 1,5 → 2 - (b, d): 1 - (d, aa): 1 - (b, a): position 6 → 1 - (a, c): 1
Merge 2: Tie between (aa, a) and (a, b). Break by first seen. Merge (aa, a) → aaa (count=2).
New: aaa, b, d, aaa, b, a, c
Pairs: - (aaa, b): 2 - (b, d): 1 - (d, aaa): 1 - (b, a): 1 - (a, c): 1
Merge 3: (aaa, b) → aaab (count=2)
New: aaab, d, aaab, a, c
Vocabulary after 3 merges: {a, b, c, d, aa, aaa, aaab}
Example 2: WordPiece Merge Score
Problem: In a corpus, token a appears 100 times, token b appears 50 times. The pair (a, b) appears 20 times. Total bigrams = 1000. Compute the PMI and compare with BPE's criterion.
Solution:
PMI(a,b) = log₂(count(a,b) / (count(a) · count(b) / total_unigrams)) = log₂(20 / (100 · 50 / total_unigrams))
Need total_unigrams. Let's say total_unigrams = 5000.
p(a) = 100/5000 = 0.02 p(b) = 50/5000 = 0.01 p(a,b) = 20/1000 = 0.02
PMI = log₂(0.02 / (0.02 · 0.01)) = log₂(0.02 / 0.0002) = log₂(100) ≈ 6.64 bits
WordPiece score: count(a,b) / (count(a) · count(b)) = 20/5000 = 0.004 (unnormalized; the actual WordPiece score uses the normalized version but the argmax is equivalent).
BPE would consider: raw frequency 20. If another pair (c,d) appears 30 times, BPE merges (c,d). WordPiece would only merge (a,b) over (c,d) if its PMI is higher — i.e., (a,b) is more "surprising" as a co-occurrence.
Example 3: Viterbi Tokenization with Unigram Model
Problem: Vocabulary V = {a, b, c, ab, bc, abc} with probabilities: p(a) = 0.3, p(b) = 0.2, p(c) = 0.1, p(ab) = 0.15, p(bc) = 0.1, p(abc) = 0.15 Find the optimal tokenization of "abc".
Solution:
Let dp[i] = max log-probability for prefix of length i.
dp[0] = 0 (empty prefix)
i=1 (prefix "a"): - token "a" from position 0: dp[0] + log(0.3) = 0 + (-1.204) = -1.204 dp[1] = -1.204, best split: [0,"a"]
i=2 (prefix "ab"): - token "a" then "b": dp[1] + log(0.2) = -1.204 + (-1.609) = -2.813 - token "ab" directly: dp[0] + log(0.15) = 0 + (-1.897) = -1.897 dp[2] = -1.897, best split: [0,"ab"]
i=3 (prefix "abc"): - token "a" then optimal for "bc": dp[1] + (check "bc": dp[1] + log(0.1)) - Actually: from dp[1], can take "bc": -1.204 + log(0.1) = -1.204 + (-2.303) = -3.507 - token "ab" then "c": dp[2] + log(0.1) = -1.897 + (-2.303) = -4.200 - token "abc" directly: dp[0] + log(0.15) = 0 + (-1.897) = -1.897
Wait — we also need: "a" + "b" + "c": dp[1] + log p(b) + log p(c) or more precisely, the optimal for "bc" from position 1.
Let me redo i=3 more carefully. The recurrence is: dp[3] = max( dp[2] + log p("c") = -1.897 + (-2.303) = -4.200 [ab + c] dp[1] + log p("bc") = -1.204 + (-2.303) = -3.507 [a + bc] dp[0] + log p("abc") = 0 + (-1.897) = -1.897 [abc] )
dp[3] = -1.897, best: abc
So the optimal tokenization is the single token abc with probability 0.15.
Note: If p(abc) were much smaller (say 0.01), the optimal could be a + bc or ab + c.
Quiz
Q1: What does the concept of Why Tokenization? The Mathematical Framing primarily refer to in this subject?
A) A historical anecdote about Why Tokenization? The Mathematical Framing B) A computational error related to Why Tokenization? The Mathematical Framing C) The definition and application of Why Tokenization? The Mathematical Framing D) A visual representation of Why Tokenization? The Mathematical Framing
Correct: C)
- If you chose A: This is incorrect. Why Tokenization? The Mathematical Framing is defined as: the definition and application of why tokenization? the mathematical framing. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Why Tokenization? The Mathematical Framing is defined as: the definition and application of why tokenization? the mathematical framing. The other options describe different aspects that are not the primary focus.
- If you chose C: Why Tokenization? The Mathematical Framing is defined as: the definition and application of why tokenization? the mathematical framing. The other options describe different aspects that are not the primary focus. Correct!
- If you chose D: This is incorrect. Why Tokenization? The Mathematical Framing is defined as: the definition and application of why tokenization? the mathematical framing. The other options describe different aspects that are not the primary focus.
Q2: What is the primary purpose of Byte-Pair Encoding (Bpe)?
A) It replaces all other methods in this domain B) It is used only in advanced research contexts C) It is used to byte-pair encoding (bpe) in mathematical analysis D) It is primarily a historical notation system
Correct: C)
- If you chose A: This is incorrect. Byte-Pair Encoding (Bpe) serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Byte-Pair Encoding (Bpe) serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: Byte-Pair Encoding (Bpe) serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose D: This is incorrect. Byte-Pair Encoding (Bpe) serves the purpose described in the correct answer. The other options misrepresent its role.
Q3: Which statement about Wordpiece is TRUE?
A) Wordpiece is a fundamental concept covered in this subject B) Wordpiece is mentioned only as a historical footnote C) Wordpiece is an advanced topic beyond this subject's scope D) Wordpiece is not related to this subject
Correct: A)
- If you chose A: Wordpiece is a fundamental concept covered in this subject. This subject covers Wordpiece as part of its core content. Correct!
- If you chose B: This is incorrect. Wordpiece is a fundamental concept covered in this subject. This subject covers Wordpiece as part of its core content.
- If you chose C: This is incorrect. Wordpiece is a fundamental concept covered in this subject. This subject covers Wordpiece as part of its core content.
- If you chose D: This is incorrect. Wordpiece is a fundamental concept covered in this subject. This subject covers Wordpiece as part of its core content.
Q4: Based on the worked examples in this subject, what is the correct result?
A) b a n a n a B) A different result from a common mistake C) An unrelated numerical value D) The inverse of the correct answer
Correct: A)
- If you chose A: The worked examples show that the result is b a n a n a. The other options represent common errors. Correct!
- If you chose B: This is incorrect. The worked examples show that the result is b a n a n a. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is b a n a n a. The other options represent common errors.
- If you chose D: This is incorrect. The worked examples show that the result is b a n a n a. The other options represent common errors.
Q5: How are Wordpiece and Unigram Language Model Tokenization related?
A) Wordpiece is the inverse of Unigram Language Model Tokenization B) Wordpiece and Unigram Language Model Tokenization are completely unrelated topics C) Wordpiece is a special case of Unigram Language Model Tokenization D) Wordpiece and Unigram Language Model Tokenization are closely related concepts
Correct: D)
- If you chose A: This is incorrect. Both Wordpiece and Unigram Language Model Tokenization are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Wordpiece and Unigram Language Model Tokenization are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Wordpiece and Unigram Language Model Tokenization are covered in this subject as interconnected topics.
- If you chose D: Both Wordpiece and Unigram Language Model Tokenization are covered in this subject as interconnected topics. Correct!
Q6: What is a common pitfall when working with Vocabulary Size Tradeoffs?
A) Vocabulary Size Tradeoffs has no common misconceptions B) A common mistake is confusing Vocabulary Size Tradeoffs with a similar concept C) Vocabulary Size Tradeoffs is always computed the same way in all contexts D) The main error with Vocabulary Size Tradeoffs is using it when it is not needed
Correct: B)
- If you chose A: This is incorrect. Students often confuse Vocabulary Size Tradeoffs with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: Students often confuse Vocabulary Size Tradeoffs with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose C: This is incorrect. Students often confuse Vocabulary Size Tradeoffs with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse Vocabulary Size Tradeoffs with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply Subword Regularization?
A) Avoid Subword Regularization unless explicitly instructed B) Subword Regularization is not practically useful C) Use Subword Regularization only in pure mathematics contexts D) Apply Subword Regularization to solve problems in this subject's domain
Correct: D)
- If you chose A: This is incorrect. Subword Regularization is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Subword Regularization is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. Subword Regularization is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: Subword Regularization is a practical tool used throughout this subject to solve relevant problems. Correct!
Practice Problems
Problem 1
Given a training corpus string "banana" processed character-by-character, run 2 iterations of BPE. Count all pairs carefully.
Answer
Initial tokens after character split: b a n a n a Counts: - (b,a): 1 - (a,n): positions (1,2), (3,4) = 2 - (n,a): positions (2,3), (4,5) = 2 Merge 1: Tie between (a,n) and (n,a). Break arbitrarily → merge (a,n) → `an`. New: b an an a Counts: - (b, an): 1 - (an, an): 1 - (an, a): 1 Merge 2: All pairs have count 1. Merge any, e.g., (an, an) → `anan`. New: b anan a Vocabulary: {b, a, n, an, anan}Problem 2
Compute the PMI for the pair (lo, w) in BPE Example 1's initial state, assuming total unigrams = 50.
Answer
count(lo) = frequency of merged token — not yet merged in initial state. This question is better framed for the state after some merges. Let's use the state after Merge 1 in Example 1 (corpus has 10 characters, tokens after merge 1: aa, a, b, d, aa, a, b, a, c = 9 tokens). PMI(aa, a) = log₂(count(aa,a) / (count(aa)·count(a) / total_unigrams²)) = log₂(2 / (2·3 / 9²)) Wait — PMI uses bigram counts. Total bigrams = 8 (adjacent pairs). p(aa, a) = 2/8 = 0.25 p(aa) = 2/9 ≈ 0.2222 p(a) = 3/9 ≈ 0.3333 PMI = log₂(0.25 / (0.2222 · 0.3333)) = log₂(0.25 / 0.07407) = log₂(3.375) ≈ 1.75 bitsProblem 3
For a vocabulary size |V| = 32,000 and d_model = 4,096, compute the embedding parameter count. What fraction is this of a 7B parameter model?
Answer
Parameters_emb = |V| · d_model = 32,000 · 4,096 = 131,072,000 ≈ 131M Fraction = 131M / 7000M = 0.0187 ≈ 1.87% This is why LLMs can afford large vocabularies — the embedding matrix is small relative to the full model. The output projection (if not weight-tied) adds another 131M, making it ~3.7%.Problem 4
A tokenizer produces 250 tokens for a text with 1000 characters. What is the compression ratio? If the LLM has causal attention with O(n²) complexity, by what factor is the compute reduced compared to character-level tokenization?
Answer
CR = 1000 / 250 = 4.0 Character-level would produce ~1000 tokens. Attention complexity reduction: Complexity with tokenizer: O(250²) = O(62,500) Complexity character-level: O(1000²) = O(1,000,000) Reduction factor: 1,000,000 / 62,500 = 16× The O(n²) scaling makes tokenization efficiency extremely impactful.Problem 5
Explain why BPE can tokenize any arbitrary Unicode string, even one containing characters never seen during training.
Answer
BPE's base vocabulary always includes individual bytes (or characters). During training, the initial vocabulary contains every unique character in the training corpus. When encoding a new string with an unseen character, if the tokenizer uses byte-level BPE (tokenizing at the byte level), the unseen Unicode character is represented as its UTF-8 byte sequence, and each byte IS in the vocabulary (there are only 256 possible bytes). This guarantees coverage. Without byte-level fallback, unseen characters produceSummary
- Tokenization maps text to discrete tokens; BPE builds vocabulary bottom-up by greedily merging the most frequent adjacent token pairs
- WordPiece uses PMI = log(p(a,b) / (p(a)·p(b))) as its merge criterion, favoring pairs that co-occur more than expected by chance
- The Unigram model treats tokenization as a latent variable and uses EM to learn token probabilities, then Viterbi decoding for optimal segmentation
- Vocabulary size trades off sequence length (larger |V| → shorter sequences, less compute) against embedding parameters (|V| · d_model) and rare-token quality
- Subword regularization (BPE-dropout, Unigram sampling) exposes the model to diverse tokenizations, improving robustness
Next Steps
Continue to 18-02 — Embedding Layers to learn how token indices are mapped to dense vector representations that serve as input to the Transformer.