18-08 โ Inference Mathematics
Phase: 18 โ Large Language Model Mathematics Subject: 18-08 Prerequisites: 18-05 (Decoder-Only Architecture), 13-05 (Cross Entropy), 10-07 (Continuous Distributions โ sampling), 14-03 (SGD โ temperature concept) Next subject: 18-09 โ KV Caching
Learning Objectives
By the end of this subject, you will be able to:
- Derive the autoregressive decoding algorithm and analyze its O(Lยท(Lยทdยฒ + Lยฒยทd)) time complexity
- Derive temperature sampling mathematically: p_i โ exp(z_i / T) and explain its effect on the output distribution
- Formulate top-k and nucleus (top-p) sampling as constrained categorical sampling problems
- Derive beam search as a breadth-first search with the scoring function score = ฮฃ log p(x_t|x_{<t}) / len^ฮฑ
- Compute sequence-level log-probabilities and perplexity at inference time
Core Content
1. The Autoregressive Generation Algorithm
At inference, the model generates tokens one by one:
Algorithm:
Input: prompt tokens x_0, ..., x_{k-1}, max_length L_max
Output: generated tokens x_k, ..., x_{L-1}
for t = k to L_max - 1:
h = decoder_forward(x_0, ..., x_{t-1})
z = h ยท W_out^T # logits โ โ^V
p = sampling_strategy(z) # distribution over next token
x_t ~ p # sample
if x_t == <eos>: break
โ ๏ธ THIS IS CRITICAL โ Each step requires a full forward pass through all L Transformer blocks over the growing sequence. Without KV caching (18-09), the complexity for generating T tokens from a prompt of length P is O(ฮฃ_{t=P}^{P+T-1} tยฒยทd + tยทdยฒ) โ O((P+T)ยณยทd), dominated by attention's quadratic scaling.
2. Greedy Decoding
The simplest strategy: always pick the most likely token.
x_t = argmax_j z_j
Properties: - Deterministic: same input always produces the same output - Fast: one argmax operation - Problem: gets stuck in repetitive loops ("I went to the store. I went to the store. I went...") - Optimal for: tasks with a single correct answer (factual QA, translation sometimes)
Mathematical justification: Greedy decoding maximizes the probability of each individual step, but does NOT maximize the joint probability of the sequence. The sequence-level optimum may require temporarily choosing a lower-probability token.
3. Temperature Sampling
Temperature modifies the softmax before sampling:
p_i = exp(z_i / T) / ฮฃ_j exp(z_j / T)
where T > 0 is the temperature parameter.
Effect of T:
| T | Effect | Use case |
|---|---|---|
| T โ 0 | p โ one-hot at argmax (greedy) | Deterministic tasks |
| T = 1 | Standard softmax (default) | Normal generation |
| T > 1 | Flatter distribution, more diverse | Creative writing |
| T โ โ | p โ uniform (random) | Nonsense output |
Logit scaling interpretation: T scales the logits. At T = 0.5, logits are doubled โ the model becomes more confident (peakier distribution). At T = 2, logits are halved โ the model becomes less confident (flatter).
Gradient connection: The gradient of cross-entropy with temperature:
p_i^T = exp(z_i/T) / ฮฃ_j exp(z_j/T) โL/โz_i = (1/T)(p_i^T โ 1_{i=y})
Higher T โ smaller gradients โ effectively smaller learning rate for those examples.
4. Top-k Sampling
Restrict sampling to the k most likely tokens, then renormalize:
V_k = {indices of top k values in z} p_i = exp(z_i) / ฮฃ_{jโV_k} exp(z_j) for i โ V_k p_i = 0 for i โ V_k
This prevents the model from sampling extremely unlikely tokens that would derail generation.
Typical k: 10-50. GPT-2 used k=40 or k=50.
Problem with fixed k: For a peaked distribution (model is confident), k=50 includes many bad options. For a flat distribution, k=50 might exclude reasonable options. This motivates...
5. Nucleus (Top-p) Sampling
Sample from the smallest set of tokens whose cumulative probability exceeds p:
Let z be sorted in descending order: z_(1) โฅ z_(2) โฅ ... โฅ z_(V) Find smallest m such that ฮฃ_{i=1}^{m} softmax(z_(i)) โฅ p Set V_p = {indices of the top m tokens} Sample from V_p with renormalized probabilities
Mathematical definition:
p_i^top-p = exp(z_i) / ฮฃ_{jโV_p} exp(z_j) ยท 1_{iโV_p}
where V_p = minimal set such that ฮฃ_{jโV_p} softmax_j(z) โฅ p.
Advantage over top-k: Adapts to the confidence of the distribution. When the model is very confident (peak at 0.95), top-p=0.9 only includes that one token. When uncertain, it includes more options.
Typical p: 0.9-0.95. Higher p = more diverse, lower p = more focused.
6. Beam Search
Beam search maintains k candidate sequences (the "beam") and expands all of them simultaneously.
Algorithm:
beam = { (prompt, score=0) } # beam_width = b
for t = prompt_len to max_len:
candidates = []
for (seq, score) in beam:
z = forward(seq) # logits for next token
log_probs = log_softmax(z)
for each token j:
candidates.append( (seq + [j], score + log_probs[j]) )
beam = top_b candidates by score # keep b best
Scoring function: The score accumulated is the sum of log-probabilities:
score(seq) = ฮฃ_{t} log p(x_t | x_{<t})
Length normalization: Without length penalty, beam search favors short sequences (each step adds negative log-prob, so shorter = higher score). Fix:
score_norm(seq) = score(seq) / |seq|^ฮฑ
Typical ฮฑ โ [0.6, 1.0]. ฮฑ = 1 is average log-prob per token.
Repetition penalty: Subtract a penalty for already-generated tokens:
z'j = z_j โ penalty ยท 1{j already generated}
This discourages the model from repeating itself.
7. Computing Log-Probabilities at Inference
For a given input-output pair, we can compute the model's assessment:
log p(output | input) = ฮฃ_{t} log p_ฮธ(y_t | input, y_1, ..., y_{tโ1})
This is useful for: - Scoring: Which of several completions does the model prefer? - Perplexity evaluation: Compute PPL on held-out text - Uncertainty estimation: Low log-prob โ model is uncertain
Implementation: Run one forward pass over the concatenated [input, output] sequence. For each output position, extract the log-prob of the actual token from the logits at the previous position.
Pitfalls
โ ๏ธ Pitfall 1: Confusing temperature with top-k/top-p. Temperature modifies the SOFTMAX distribution before any truncation. Top-k then truncates. The order matters: p_i โ exp(z_i/T) first, THEN top-k/top-p filtering, THEN sampling. Applying top-k before temperature would produce different results.
โ ๏ธ Pitfall 2: Using greedy decoding and expecting diverse output. Greedy is deterministic โ same input always produces the same output. For creative tasks, use T > 1 with top-p sampling. For factual tasks, T=0 (greedy) is usually best.
โ ๏ธ Pitfall 3: Forgetting length normalization in beam search. Without it, beam search always prefers the shortest sequence (every step adds negative log-prob). A model generating "Paris." will score higher than "Paris is the capital of France." simply because it's shorter.
Key Terms
- 18 08 Inference Mathematics
- Beam Search
- Common Pitfalls
- Computing Log-Probabilities at Inference
- Example 1: Temperature Effect on a Distribution
- Example 2: Top-p Sampling
- Example 3: Beam Search
- Greedy Decoding
- Nucleus (Top-p) Sampling
- Problem 1
- Problem 2
- Problem 3
Worked Examples
Example 1: Temperature Effect on a Distribution
Problem: Logits z = [3.0, 1.0, 0.5, 0.2]. Compute the sampling probabilities at T = 1.0, T = 0.5, and T = 2.0.
Solution:
T = 1.0 (standard): z/T = [3.0, 1.0, 0.5, 0.2] exp: [20.09, 2.718, 1.649, 1.221], sum = 25.68 p = [0.782, 0.106, 0.064, 0.048]
T = 0.5 (sharper): z/T = [6.0, 2.0, 1.0, 0.4] exp: [403.4, 7.389, 2.718, 1.492], sum = 415.0 p = [0.972, 0.018, 0.007, 0.004]
T = 2.0 (flatter): z/T = [1.5, 0.5, 0.25, 0.1] exp: [4.482, 1.649, 1.284, 1.105], sum = 8.520 p = [0.526, 0.194, 0.151, 0.130]
Observation: At T=0.5, the top token dominates (97.2%). At T=2.0, the distribution is much flatter โ the model's "confidence" is spread out. This is how T controls the diversity/vs-quality tradeoff.
Example 2: Top-p Sampling
Problem: For the T=1.0 probabilities from Example 1: p = [0.782, 0.106, 0.064, 0.048], with token indices 0,1,2,3. Find the nucleus for p = 0.9 and the renormalized probabilities.
Solution:
Sorted probabilities: 0.782, 0.106, 0.064, 0.048
Cumulative: m=1: 0.782 < 0.9 m=2: 0.782 + 0.106 = 0.888 < 0.9 m=3: 0.888 + 0.064 = 0.952 โฅ 0.9 โ
Nucleus = {0, 1, 2} (tokens with indices 0,1,2)
Renormalized: sum = 0.782 + 0.106 + 0.064 = 0.952 p'โ = 0.782/0.952 = 0.821 p'โ = 0.106/0.952 = 0.111 p'โ = 0.064/0.952 = 0.067 p'โ = 0
Token 3 is excluded from sampling despite having non-negligible (4.8%) probability. This prevents the model from ever choosing that token.
Example 3: Beam Search
Problem: Use beam search with b=2 to continue the sequence from vocabulary {A, B, C,
Step 1 (from prompt): p(A)=0.5, p(B)=0.3
Step 2 (from prompt+A): p(A)=0.6, p(B)=0.2
Step 2 (from prompt+B): p(C)=0.4, p(
Find the top 2 beams after Step 2.
Solution:
Step 1 candidates: - prompt+A: score = log(0.5) = โ0.693 - prompt+B: score = log(0.3) = โ1.204
Top 2 beams: [prompt+A (โ0.693), prompt+B (โ1.204)]
Step 2 (expand both beams):
From prompt+A: - prompt+A+A: score = โ0.693 + log(0.6) = โ0.693 โ 0.511 = โ1.204 - prompt+A+B: score = โ0.693 + log(0.2) = โ0.693 โ 1.609 = โ2.302
From prompt+B:
- prompt+B+C: score = โ1.204 + log(0.4) = โ1.204 โ 0.916 = โ2.120
- prompt+B+
Top 2 beams: 1. prompt+A+A: โ1.204 2. prompt+B+C: โ2.120
Note: The greedy path would be A (0.5) โ A (0.6), giving "A A". Beam search also found "B C" as the second-best. Without beam search, we'd never discover that "A A" might score better than "A B" or other alternatives explored by the wider beam.
Quiz
Q1: What does the concept of Beam Search primarily refer to in this subject?
A) A historical anecdote about Beam Search B) A computational error related to Beam Search C) A visual representation of Beam Search D) The definition and application of Beam Search
Correct: D)
- If you chose A: This is incorrect. Beam Search is defined as: the definition and application of beam search. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Beam Search is defined as: the definition and application of beam search. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. Beam Search is defined as: the definition and application of beam search. The other options describe different aspects that are not the primary focus.
- If you chose D: Beam Search is defined as: the definition and application of beam search. The other options describe different aspects that are not the primary focus. Correct!
Q2: What is the primary purpose of Common Pitfalls?
A) It is primarily a historical notation system B) It replaces all other methods in this domain C) It is used only in advanced research contexts D) It is used to common pitfalls in mathematical analysis
Correct: D)
- If you chose A: This is incorrect. Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
Q3: Which statement about Computing Log-Probabilities at Inference is TRUE?
A) Computing Log-Probabilities at Inference is a fundamental concept covered in this subject B) Computing Log-Probabilities at Inference is not related to this subject C) Computing Log-Probabilities at Inference is mentioned only as a historical footnote D) Computing Log-Probabilities at Inference is an advanced topic beyond this subject's scope
Correct: A)
- If you chose A: Computing Log-Probabilities at Inference is a fundamental concept covered in this subject. This subject covers Computing Log-Probabilities at Inference as part of its core content. Correct!
- If you chose B: This is incorrect. Computing Log-Probabilities at Inference is a fundamental concept covered in this subject. This subject covers Computing Log-Probabilities at Inference as part of its core content.
- If you chose C: This is incorrect. Computing Log-Probabilities at Inference is a fundamental concept covered in this subject. This subject covers Computing Log-Probabilities at Inference as part of its core content.
- If you chose D: This is incorrect. Computing Log-Probabilities at Inference is a fundamental concept covered in this subject. This subject covers Computing Log-Probabilities at Inference as part of its core content.
Q4: Based on the worked examples in this subject, what is the correct result?
A) 25.061 B) The inverse of the correct answer C) A different result from a common mistake D) An unrelated numerical value
Correct: A)
- If you chose A: The worked examples show that the result is 25.061. The other options represent common errors. Correct!
- If you chose B: This is incorrect. The worked examples show that the result is 25.061. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is 25.061. The other options represent common errors.
- If you chose D: This is incorrect. The worked examples show that the result is 25.061. The other options represent common errors.
Q5: How are Computing Log-Probabilities at Inference and Greedy Decoding related?
A) Computing Log-Probabilities at Inference is a special case of Greedy Decoding B) Computing Log-Probabilities at Inference and Greedy Decoding are closely related concepts C) Computing Log-Probabilities at Inference is the inverse of Greedy Decoding D) Computing Log-Probabilities at Inference and Greedy Decoding are completely unrelated topics
Correct: B)
- If you chose A: This is incorrect. Both Computing Log-Probabilities at Inference and Greedy Decoding are covered in this subject as interconnected topics.
- If you chose B: Both Computing Log-Probabilities at Inference and Greedy Decoding are covered in this subject as interconnected topics. Correct!
- If you chose C: This is incorrect. Both Computing Log-Probabilities at Inference and Greedy Decoding are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Computing Log-Probabilities at Inference and Greedy Decoding are covered in this subject as interconnected topics.
Q6: What is a common pitfall when working with Nucleus (Top-p) Sampling?
A) Nucleus (Top-p) Sampling is always computed the same way in all contexts B) A common mistake is confusing Nucleus (Top-p) Sampling with a similar concept C) The main error with Nucleus (Top-p) Sampling is using it when it is not needed D) Nucleus (Top-p) Sampling has no common misconceptions
Correct: B)
- If you chose A: This is incorrect. Students often confuse Nucleus (Top-p) Sampling with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: Students often confuse Nucleus (Top-p) Sampling with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose C: This is incorrect. Students often confuse Nucleus (Top-p) Sampling with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse Nucleus (Top-p) Sampling with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply The Autoregressive Generation Algorithm?
A) Avoid The Autoregressive Generation Algorithm unless explicitly instructed B) The Autoregressive Generation Algorithm is not practically useful C) Apply The Autoregressive Generation Algorithm to solve problems in this subject's domain D) Use The Autoregressive Generation Algorithm only in pure mathematics contexts
Correct: C)
- If you chose A: This is incorrect. The Autoregressive Generation Algorithm is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. The Autoregressive Generation Algorithm is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: The Autoregressive Generation Algorithm is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose D: This is incorrect. The Autoregressive Generation Algorithm is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
Problem 1
For logits z = [2.0, 1.5, 1.0, 0.5, 0.0], compute the probability of sampling each token at T = 0.8.
Answer
z/T = [2.5, 1.875, 1.25, 0.625, 0.0] exp: [12.182, 6.521, 3.490, 1.868, 1.0], sum = 25.061 p = [0.486, 0.260, 0.139, 0.075, 0.040]Problem 2
For a distribution p = [0.4, 0.3, 0.15, 0.10, 0.05], what is the smallest top-k that includes at least 90% of probability mass (nucleus for p=0.9)?
Answer
Sorted: 0.4, 0.3, 0.15, 0.10, 0.05 Cumulative: 0.4, 0.7, 0.85, 0.95, 1.0 m=4: 0.95 โฅ 0.9 โ V_p has size 4 Top-k with k=4 is equivalent for this distribution.Problem 3
A beam search with b=4 generates sequences with scores: [โ5.2, โ5.3, โ5.3, โ5.8, โ5.9, ...]. Apply length normalization with ฮฑ=0.6 for sequences of lengths 10, 12, 15, 8, 20 respectively. What's the new ranking?
Answer
Wait โ need lengths for each score. Let's assign: score โ5.2 (len 10), โ5.3 (len 12), โ5.3 (len 10), โ5.8 (len 15), โ5.9 (len 8) Normalized: โ5.2 / 10^0.6 = โ5.2 / 3.981 = โ1.306 โ5.3 / 12^0.6 = โ5.3 / 4.445 = โ1.192 โ best โ5.3 / 10^0.6 = โ5.3 / 3.981 = โ1.331 โ5.8 / 15^0.6 = โ5.8 / 5.098 = โ1.138 โ new best after normalization โ5.9 / 8^0.6 = โ5.9 / 3.482 = โ1.695 Ranking: โ5.8(len 15) โ โ5.3(len 12) โ โ5.2(len 10) โ โ5.3(len 10) โ โ5.9(len 8) Length normalization boosted the longer sequence (โ5.8 over 15 tokens) above shorter ones with better raw scores.Problem 4
Explain why greedy decoding can fail to find the globally optimal sequence.