18-06 — Pre-training Objective Mathematics
Phase: 18 (Large Language Model Mathematics)
Subject: 18-06
Prerequisites: 18-05 (Decoder-Only Architecture), 13-05 (Cross Entropy and Applications), 13-01 (Entropy), 13-04 (KL Divergence), 12-03 (Point Estimation, for the MLE connection)
Next subject: 18-07 (Scaling Laws)
Learning Objectives
By the end of this subject, you will be able to:
- Derive the language modeling loss as maximum likelihood estimation and show its equivalence to minimizing cross-entropy
- Compute perplexity from cross-entropy loss and explain why PPL = exp(loss) is the natural metric
- Derive bits-per-byte (BPB) and bits-per-character (BPC) from the loss, accounting for tokenizer compression
- Analyze training loss curves mathematically: the power-law form L(N,D) and its implications
- Explain the connection between pre-training loss and downstream task performance
Core Content
1. The Pre-training Objective as Maximum Likelihood
Given a corpus of token sequences D = {x^(1), x^(2), ..., x^(M)}, the pre-training objective is to maximize the log-likelihood of the data under the model:
θ* = argmax_θ Σ_{k=1}^{M} log p_θ(x^(k))
For each sequence x = (x₁, x₂, ..., x_T), the autoregressive factorization gives:
log p_θ(x) = Σ_{t=1}^{T} log p_θ(x_t | x₁, ..., x_{t−1})
This is maximum likelihood estimation (MLE) for the autoregressive model. The negative log-likelihood (NLL) is the loss:
L(θ) = −(1/T_total) Σ_{all positions} log p_θ(x_t | x_{<t})
⚠️ THIS IS CRITICAL: The language modeling loss is the empirical (sample-average) estimate of the cross-entropy between the true data distribution p_data and the model distribution p_θ: H(p_data, p_θ) = −E_{x∼p_data}[log p_θ(x)], normalized per token. Because H(p_data, p_θ) = H(p_data) + D_KL(p_data || p_θ) and H(p_data) is constant w.r.t. θ, minimizing cross-entropy is equivalent to minimizing the KL divergence D_KL(p_data || p_θ).
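The per-token loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: the helper names `log_softmax` and `average_nll` and the toy logits are assumptions made for this example, and each row of logits is assumed to be already conditioned on the preceding tokens.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def average_nll(logits_per_position, targets):
    """Mean negative log-likelihood in nats per token.

    logits_per_position[t] holds the model scores for position t
    (assumed to be conditioned on x_{<t}); targets[t] is the index
    of the token that actually occurred at position t.
    """
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)

# Hypothetical toy data: vocabulary of 3 tokens, sequence of 2 positions.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 2]
loss = average_nll(toy_logits, toy_targets)
print(f"loss = {loss:.4f} nats/token")
```

At the second position the logits are uniform, so that position contributes exactly ln(3) nats, the loss of uniform guessing over three tokens.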
2. Perplexity
Perplexity (PPL) is the standard metric for language models:
PPL = exp(L)
where L is the average cross-entropy loss (using natural log).
Interpretation: Perplexity is the effective branching factor — if the model assigns uniform probability to k equally likely options, perplexity = k. A perplexity of 10 means the model is "as confused" as if it had to choose uniformly among 10 options at each step.
Derivation: PPL = 2^(H_bits), where H_bits = L / ln(2) is the cross-entropy in bits per token. Equivalently:
PPL = Π_{t} p(x_t | x_{<t})^(−1/T) = exp(−(1/T) Σ log p(x_t | x_{<t}))
PPL is the geometric mean of the inverse probabilities. Lower is better.
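The two forms of perplexity (exponentiated mean NLL and geometric mean of inverse probabilities) can be checked against each other numerically. The sketch below uses hypothetical probabilities chosen so the answer is easy to verify by hand; the function names are illustrative only.

```python
import math

def perplexity(correct_token_probs):
    """exp of the mean negative log-probability (natural log)."""
    n = len(correct_token_probs)
    nll = -sum(math.log(p) for p in correct_token_probs) / n
    return math.exp(nll)

def perplexity_geometric(correct_token_probs):
    """Same quantity written as the geometric mean of 1/p."""
    n = len(correct_token_probs)
    product = math.prod(1.0 / p for p in correct_token_probs)
    return product ** (1.0 / n)

# Uniform case: probability 1/10 at every position gives PPL = 10.
uniform_ppl = perplexity([0.1] * 7)

# Hypothetical mixed case: inverse probabilities 2, 4, 8 have geometric mean 4.
mixed = [0.5, 0.25, 0.125]
print(uniform_ppl, perplexity(mixed), perplexity_geometric(mixed))
```

For long sequences the log-sum form is the safer choice, since the raw product of probabilities underflows quickly.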
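Typical values (approximate; reported numbers depend on the tokenizer, context length, and evaluation protocol):
- GPT-3 (175B) on Penn Treebank, zero-shot: ~20 PPL
- Llama 2 (7B) on C4: ~8-10 PPL
- Llama 3 (8B) on C4: ~6-8 PPL
- Uniform guessing over a 50,000-token vocabulary: 50,000 PPL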
3. Bits-Per-Byte (BPB) and Bits-Per-Character (BPC)
Cross-entropy loss uses natural log (nats). To convert to bits:
L_bits = L / ln(2)
Bits-per-token: L_bits directly (loss in bits per token).
Bits-per-byte (BPB): Accounts for tokenizer compression.
BPB = L_bits / (bytes_per_token)
where bytes_per_token = total_bytes / total_tokens in the evaluation set.
For GPT-2's tokenizer on English text: ~4 characters/token and ~1.3 tokens/word. English text is mostly ASCII (1 byte per character), so this is ~4 bytes/token, and BPB ≈ L_bits / 4.
Bits-per-character (BPC):
BPC = L_bits / (chars_per_token)
where chars_per_token = total_chars / total_tokens.
Why BPB/BPC matters: Different tokenizers produce different numbers of tokens for the same text. BPB/BPC normalizes for tokenizer differences, enabling fair comparisons between models with different tokenizers.
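The conversions above are simple enough to sketch directly. The example below is illustrative only: the function names, the sample string, the token count of 4, and the loss of 2.0 nats/token are all assumptions. It also shows why BPB and BPC differ on non-ASCII text, where one character can occupy several UTF-8 bytes.

```python
import math

def bits_per_token(loss_nats):
    """Convert a natural-log loss to bits per token."""
    return loss_nats / math.log(2)

def bits_per_byte(loss_nats, total_bytes, total_tokens):
    """BPB = bits per token divided by bytes per token."""
    return bits_per_token(loss_nats) / (total_bytes / total_tokens)

def bits_per_char(loss_nats, total_chars, total_tokens):
    """BPC = bits per token divided by characters per token."""
    return bits_per_token(loss_nats) / (total_chars / total_tokens)

# Hypothetical evaluation text with three non-ASCII characters.
text = "Größe zählt"
n_chars = len(text)                    # Unicode code points
n_bytes = len(text.encode("utf-8"))    # UTF-8 bytes
n_tokens = 4                           # assumed tokenizer output length
loss = 2.0                             # assumed loss in nats/token

bpb = bits_per_byte(loss, n_bytes, n_tokens)
bpc = bits_per_char(loss, n_chars, n_tokens)
print(n_chars, n_bytes, round(bpb, 3), round(bpc, 3))
```

Because the byte count is at least the character count, BPB is never larger than BPC on the same text; the two coincide for pure ASCII.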
4. The Connection Between Pre-training Loss and Compression
The pre-training loss has a fundamental connection to data compression:
Shannon's source coding theorem: The optimal code length (in bits) for a symbol from distribution p is −log₂ p(x). The expected code length is the entropy H(p). For a model q, the expected code length is H(p, q) = H(p) + D_KL(p||q) ≥ H(p).
Language modeling as compression: A language model with cross-entropy L_bits can (in principle) compress text to L_bits bits per token. Better language models = better compressors.
The pre-training loss lower bound: No model can achieve loss below the true entropy of language H(p_data). This is an unobservable quantity (we don't know the "true" distribution of language), but it provides a theoretical floor.
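The compression view can be made concrete by summing −log₂ p over a sequence, which is the code length an ideal entropy coder (for example, arithmetic coding) would approach. The probabilities and the raw byte count below are hypothetical values chosen for easy arithmetic, and `ideal_code_length_bits` is an illustrative name.

```python
import math

def ideal_code_length_bits(correct_token_probs):
    """Total of -log2 p(x_t | x_{<t}) over the sequence, in bits."""
    return sum(-math.log2(p) for p in correct_token_probs)

# Hypothetical model probabilities for the 4 tokens that actually occurred.
probs = [0.5, 0.25, 0.25, 0.125]
code_bits = ideal_code_length_bits(probs)      # 1 + 2 + 2 + 3 bits

# Assumed raw size of the same 4 tokens as UTF-8 text.
raw_bytes = 16
compression_ratio = (raw_bytes * 8) / code_bits
bits_per_token = code_bits / len(probs)        # equals L_bits for this sequence

print(code_bits, bits_per_token, compression_ratio)
```

A model that assigns higher probability to the tokens that occur yields a shorter code, which is the sense in which a lower loss means better compression.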
5. Training Dynamics and Loss Curves
Training loss typically follows a power law (plus a constant floor):
L(t) ≈ L_min + C · t^(−α)
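where t is the training step, L_min is the irreducible loss (the entropy of language plus whatever the model's finite capacity cannot capture), C > 0 is a fitted scale constant, and α > 0 is the power-law decay exponent (not to be confused with the optimizer's learning rate).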
Log-log plot: log(L(t) − L_min) = log(C) − α·log(t). A straight line on a log-log plot indicates power-law behavior.
Three phases of training:
1. Rapid initial drop: The model quickly learns token frequencies and basic syntax. Loss drops from ~10-11 (random) to ~5-6.
2. Power-law decay: Gradual improvement as the model learns grammar, facts, and reasoning. Loss decreases as t^(−α) with α ≈ 0.05-0.1.
3. Plateau: Approach to L_min. Further training yields diminishing returns.
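The log-log relationship suggests a simple way to estimate C and α: subtract an assumed L_min, take logs, and fit a straight line. The sketch below does this with ordinary least squares on synthetic data. The function name `fit_power_law` and all constants are assumptions for illustration, and in practice L_min is itself unknown and has to be fitted or guessed.

```python
import math

def fit_power_law(steps, losses, l_min):
    """Least-squares line through (log t, log(L - l_min)).

    Returns (C, alpha) for the model L(t) = l_min + C * t**(-alpha).
    Assumes every loss is strictly greater than l_min.
    """
    xs = [math.log(t) for t in steps]
    ys = [math.log(loss - l_min) for loss in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free loss curve with hypothetical constants.
true_l_min, true_c, true_alpha = 2.0, 5.0, 0.07
steps = [10 ** k for k in range(2, 7)]
losses = [true_l_min + true_c * t ** (-true_alpha) for t in steps]

c_hat, alpha_hat = fit_power_law(steps, losses, true_l_min)
print(f"C = {c_hat:.3f}, alpha = {alpha_hat:.3f}")
```

With a wrong guess for L_min the points no longer fall on a straight line, which is one practical way to see that the irreducible loss is a fitted quantity.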
6. The Loss Decomposition
The expected loss can be decomposed:
L = H(p_data) + D_KL(p_data || p_θ)
where H(p_data) is the intrinsic entropy of language (unknown but fixed) and D_KL is the model's imperfection.
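The KL term can be attributed to three sources:
- Approximation error: model capacity limitation (finite parameters), so even the best achievable p_θ differs from p_data
- Estimation error: finite training data, so the optimum on the training set differs from the optimum on the true distribution
- Optimization error: training did not reach the optimum (finite steps, non-convex optimization)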
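The identity L = H(p_data) + D_KL(p_data || p_θ) can be verified numerically on a small discrete distribution. The two distributions below are hypothetical stand-ins for p_data and p_θ over a three-symbol alphabet.

```python
import math

def entropy(p):
    """H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in nats; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data = [0.7, 0.2, 0.1]    # hypothetical "true" distribution
p_model = [0.5, 0.3, 0.2]   # hypothetical model distribution

h = entropy(p_data)
ce = cross_entropy(p_data, p_model)
kl = kl_divergence(p_data, p_model)
print(f"H = {h:.4f}, KL = {kl:.4f}, H + KL = {h + kl:.4f}, CE = {ce:.4f}")
```

Setting p_model equal to p_data drives the KL term to zero, leaving only the entropy floor.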
7. Loss and Downstream Performance
A consistent empirical finding: lower pre-training loss → better downstream task performance. This is formalized in scaling laws (subject 18-07).
The relationship is often modeled empirically as:
downstream_metric ≈ A · log(L) + B
where A and B are constants fitted per task, and better metrics (higher accuracy, lower word error rate) correlate with lower loss. This is why so much of the field focuses on reducing pre-training loss: it is a useful proxy for model quality.
Pitfalls
⚠️ Pitfall 1: Using perplexity to compare models with different tokenizers. Per-token loss depends on how much text each token covers: a tokenizer with a higher compression ratio (more bytes per token) packs more information into each prediction, so its per-token loss and perplexity are higher even when the model compresses the underlying text equally well. Always use BPB (bits-per-byte) for tokenizer-independent comparisons.
⚠️ Pitfall 2: Computing PPL = 2^L. This only works when L is in bits (base-2 log). Standard implementations use natural log, so PPL = exp(L). Check which base your framework uses before converting.
⚠️ Pitfall 3: Treating the irreducible loss L_min (often written L_∞) as a known constant. The irreducible loss is a FITTED parameter, not a theoretical truth. Different papers estimate different values. The true entropy of language is unknown and context-dependent (textbooks have lower entropy than Twitter).
Key Terms
- Maximum likelihood estimation (MLE)
- Negative log-likelihood (NLL)
- Cross-entropy
- KL divergence
- Perplexity (PPL)
- Bits-per-byte (BPB) and bits-per-character (BPC)
- Source coding bound (language modeling as compression)
- Irreducible loss (L_min)
- Power-law loss curve
Worked Examples
Example 1: Perplexity from Probabilities
Problem: A model assigns the following probabilities to the correct next tokens at 5 positions: [0.5, 0.1, 0.25, 0.8, 0.05]. Compute the perplexity.
Solution:
NLL per position = −(1/5)[log(0.5) + log(0.1) + log(0.25) + log(0.8) + log(0.05)] = −(1/5)[−0.693 − 2.303 − 1.386 − 0.223 − 2.996] = −(1/5)[−7.601] = 1.520
PPL = exp(1.520) ≈ 4.57
Also directly: PPL = (0.5·0.1·0.25·0.8·0.05)^(−1/5) = (0.0005)^(−0.2) = (1/0.0005)^0.2 = 2000^0.2 ≈ 4.57
Example 2: BPB Conversion
Problem: A model achieves L = 2.0 nats/token on a test set with 1,000,000 bytes and 250,000 tokens. Compute PPL, bits/token, and BPB.
Solution:
PPL = exp(2.0) ≈ 7.39
Bits per token = L / ln(2) = 2.0 / 0.6931 = 2.885 bits/token
Bytes per token = 1,000,000 / 250,000 = 4.0 bytes/token
BPB = 2.885 / 4.0 = 0.721 bits/byte
Check: This is plausible. English text at ~4 bytes/token with 0.72 bits/byte gives 2.88 bits/token, an effective branching factor of ~7.4 tokens (the PPL). Shannon's estimates put the entropy of English at roughly 0.6 to 1.3 bits per character (≈ bits per byte for ASCII), so 0.72 BPB falls within that range.
Example 3: Loss Reduction from Doubling Data
Problem: A model follows the power law L(D) = 2.0 + 5.0·D^(−0.07) where D is in billions of tokens. Compute L at D=100B and D=200B. What is the reduction?
Solution:
L(100B) = 2.0 + 5.0·100^(−0.07) = 2.0 + 5.0·(100^(−0.07))
100^(−0.07) = e^(−0.07·ln(100)) = e^(−0.07·4.605) = e^(−0.3224) = 0.7244
L(100B) = 2.0 + 5.0·0.7244 = 2.0 + 3.622 = 5.622
L(200B) = 2.0 + 5.0·200^(−0.07) = 2.0 + 5.0·e^(−0.07·ln(200)) = 2.0 + 5.0·e^(−0.07·5.298) = 2.0 + 5.0·e^(−0.3709) = 2.0 + 5.0·0.6901 = 2.0 + 3.451 = 5.451
Reduction = 5.622 − 5.451 = 0.171 nats
PPL improvement: exp(5.622) → exp(5.451) ≈ 276.4 → 233.0 (a reduction of about 43 PPL)
Note the diminishing returns: doubling data from 100B to 200B reduces loss by only 0.17 nats. This is the power law in action: each doubling shrinks the reducible term by the same factor 2^(−0.07) ≈ 0.953, so the absolute improvement gets smaller every time.
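The arithmetic in this example can be reproduced, and extended to further doublings, with a short script. It uses the same illustrative power law L(D) = 2.0 + 5.0·D^(−0.07) from the problem statement; the function name `loss_at` is an assumption.

```python
def loss_at(d_billion, l_min=2.0, c=5.0, alpha=0.07):
    """Loss under the example power law, with D in billions of tokens."""
    return l_min + c * d_billion ** (-alpha)

# Loss reduction for each successive doubling, starting from 100B tokens.
reductions = []
d = 100.0
for _ in range(4):
    reductions.append(loss_at(d) - loss_at(2 * d))
    d *= 2

print(f"L(100B) = {loss_at(100):.3f}, L(200B) = {loss_at(200):.3f}")
print([round(r, 4) for r in reductions])
```

Each entry in `reductions` is smaller than the previous one by the constant factor 2^(−0.07), which is the diminishing-returns pattern described above.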
Quiz
Q1: Which of the following is a pitfall described in this subject?
A) Reporting the loss in nats rather than bits B) Normalizing the loss by the number of bytes in the evaluation set C) Comparing per-token perplexity across models that use different tokenizers D) Computing PPL = exp(L) when L uses the natural log
Correct: C)
- If you chose A: This is incorrect. Either unit is fine as long as it is stated and converted consistently (L_bits = L / ln(2)).
- If you chose B: This is incorrect. Normalizing by bytes gives BPB, which is the recommended tokenizer-independent metric.
- If you chose C: Per-token perplexity depends on how much text each token covers, so it is not comparable across tokenizers; use BPB instead. Correct!
- If you chose D: This is incorrect. PPL = exp(L) is the right formula when L is in nats; the pitfall is using 2^L with a natural-log loss.
Q2: Why is pre-training loss tracked so closely during LLM development?
A) It is the only quantity that can be measured during training B) It is numerically equal to downstream accuracy C) It is independent of the tokenizer D) Lower pre-training loss is empirically associated with better downstream task performance
Correct: D)
- If you chose A: This is incorrect. Many quantities can be measured; loss is emphasized because of its empirical link to downstream quality.
- If you chose B: This is incorrect. Loss and accuracy are different quantities; the relationship is an approximate empirical fit, not an identity.
- If you chose C: This is incorrect. Per-token loss depends on the tokenizer, which is why BPB is used for cross-tokenizer comparisons.
- If you chose D: Lower pre-training loss tends to go with better downstream metrics, which makes it a useful proxy for model quality. Correct!
Q3: Which statement about Perplexity is TRUE?
A) Perplexity is the arithmetic mean of the probabilities assigned to the correct tokens B) Perplexity can fall below 1 for a sufficiently good model C) PPL = 2^L when L is measured in nats D) Perplexity is exp(L), the geometric mean of the inverse probabilities assigned to the correct tokens
Correct: D)
- If you chose A: This is incorrect. Perplexity is a geometric mean of inverse probabilities, not an arithmetic mean of probabilities.
- If you chose B: This is incorrect. Since L ≥ 0, PPL = exp(L) ≥ 1.
- If you chose C: This is incorrect. PPL = 2^L only holds when L is in bits; for nats, PPL = exp(L).
- If you chose D: PPL = exp(L) = Π_t p(x_t | x_{<t})^(−1/T), the effective branching factor per token. Correct!
Q4: Two models have perplexities 10.0 and 12.0. What is the difference in their cross-entropy loss in bits per token?
A) 0.182 bits/token B) 0.263 bits/token C) 2.0 bits/token D) 0.126 bits/token
Correct: B)
- If you chose A: This is incorrect. 0.182 is the difference in nats (ln(12) − ln(10)); it still has to be divided by ln(2).
- If you chose B: ln(12) − ln(10) = 0.182 nats, and 0.182 / ln(2) = 0.263 bits/token. Correct!
- If you chose C: This is incorrect. 2.0 is the difference in perplexity, not in loss.
- If you chose D: This is incorrect. 0.126 comes from multiplying by ln(2) instead of dividing by it.
Q5: How is perplexity related to the maximum likelihood objective?
A) They are unrelated: perplexity is an evaluation metric and plays no role in the objective B) Perplexity is the reciprocal of the log-likelihood C) Perplexity is the exponential of the average negative log-likelihood, so maximizing likelihood minimizes perplexity D) Perplexity is maximized when the likelihood is maximized
Correct: C)
- If you chose A: This is incorrect. Perplexity is a monotonic function of the same average NLL that training minimizes.
- If you chose B: This is incorrect. Perplexity is exp of the average NLL, not 1 divided by the log-likelihood.
- If you chose C: PPL = exp(L), where L is the average NLL, so the MLE solution also minimizes perplexity on the training data. Correct!
- If you chose D: This is incorrect. Higher likelihood means lower NLL and therefore lower perplexity.
Q6: What is a common mistake when computing Bits-Per-Byte (BPB)?
A) Dividing bits per token by bytes per token B) Measuring bytes per token on the evaluation set C) Dividing the loss in nats by bytes per token without first converting it to bits D) Using BPB to compare models that have different tokenizers
Correct: C)
- If you chose A: This is incorrect. BPB = L_bits / bytes_per_token is the definition.
- If you chose B: This is incorrect. bytes_per_token = total_bytes / total_tokens should be measured on the evaluation set.
- If you chose C: The loss must first be converted with L_bits = L / ln(2); skipping this step understates BPB by a factor of ln(2). Correct!
- If you chose D: This is incorrect. Comparing across tokenizers is exactly what BPB is for.
Q7: What does the connection between pre-training loss and compression imply for a model with cross-entropy L_bits?
A) It can compress text below the true entropy H(p_data) B) It can, in principle, encode text using about L_bits bits per token, so a lower loss means better compression C) Its compression rate is independent of its loss D) It can compress text only if its perplexity equals 1
Correct: B)
- If you chose A: This is incorrect. H(p, q) = H(p) + D_KL(p||q) ≥ H(p), so no model can beat the true entropy in expectation.
- If you chose B: The expected code length under the model is the cross-entropy, so better language models are better compressors. Correct!
- If you chose C: This is incorrect. The achievable code length is the cross-entropy itself.
- If you chose D: This is incorrect. Any model with loss below the raw encoding cost achieves some compression; PPL = 1 would mean zero bits per token.
Practice Problems
Problem 1
A model achieves PPL = 15.0 on a test set. What is the cross-entropy loss in nats per token?
Answer
L = ln(PPL) = ln(15.0) ≈ 2.708 nats/token
Problem 2
Two models have perplexities 10.0 and 12.0. What is the difference in their cross-entropy loss in nats? In bits?
Answer
L₁ = ln(10.0) = 2.303 nats
L₂ = ln(12.0) = 2.485 nats
Difference = 0.182 nats
In bits: 0.182 / ln(2) = 0.182 / 0.693 = 0.263 bits/token
This shows that perplexity differences at low values are more significant than they appear: a PPL difference of 2 at PPL = 10 corresponds to ~0.18 nats, much larger than the same PPL difference at PPL = 100 (where ln(102) − ln(100) ≈ 4.625 − 4.605 = 0.020 nats).
Problem 3
A model trained on a corpus with tokenizer A (compression ratio = 4.2 chars/token) achieves L = 1.8 nats/token. What would the equivalent L be for tokenizer B (compression ratio = 3.5 chars/token) if the model has the same BPC?
Answer
BPC (same for both): BPC_A = 1.8 / (ln(2) · 4.2) = 1.8 / (0.693 · 4.2) = 1.8 / 2.911 = 0.618 bits/char
L_bits_B = BPC · chars_per_token_B = 0.618 · 3.5 = 2.164 bits/token
L_nats_B = 2.164 · ln(2) = 2.164 · 0.693 = 1.50 nats/token
So tokenizer A's loss of 1.8 nats/token is equivalent to tokenizer B's loss of 1.50 nats/token in terms of compression of the original text. Always normalize by BPC/BPB for fair comparisons!
Problem 4
Derive that minimizing cross-entropy is equivalent to minimizing KL divergence for the pre-training objective.
Answer
H(p_data, p_θ) = −Σ p_data(x) log p_θ(x)
D_KL(p_data || p_θ) = Σ p_data(x) log(p_data(x)/p_θ(x))
= Σ p_data(x) log p_data(x) − Σ p_data(x) log p_θ(x)
= −H(p_data) + H(p_data, p_θ)
So H(p_data, p_θ) = H(p_data) + D_KL(p_data || p_θ)
Since H(p_data) is independent of θ: argmin_θ H(p_data, p_θ) = argmin_θ D_KL(p_data || p_θ)
Minimizing cross-entropy = minimizing KL divergence from the true distribution to the model.
Problem 5
Explain why perplexity cannot go below 1.0. Under what circumstances would PPL = 1?
Answer
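PPL = exp(L) ≥ exp(0) = 1 (since L ≥ 0 for cross-entropy). PPL = 1 occurs when L = 0, which means p(x_t | x_{<t}) = 1 at every position: the model predicts every token in the evaluation set with certainty. This is only possible when each token is fully determined by its context.
Summary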
- Pre-training uses MLE: minimize L = −E[log p_θ(x_t|x_{<t})], which equals cross-entropy H(p_data, p_θ) between data and model distributions
- Perplexity PPL = exp(L) is the geometric mean of inverse probabilities — the effective branching factor per token
- BPB = L_bits / bytes_per_token normalizes loss across tokenizers for fair model comparison
- Training loss follows power-law decay: L(t) ≈ L_min + C·t^(−α), with diminishing returns from more data/compute
- Lower pre-training loss reliably predicts better downstream performance, making it the central metric in LLM development
Next Steps
Continue to 18-07 — Scaling Laws to understand how model performance depends on parameter count, data size, and compute, and how to optimally allocate resources.