📐 Concept diagram

18-08 — Inference Mathematics

Phase: 18 — Large Language Model Mathematics Subject: 18-08 Prerequisites: 18-05 (Decoder-Only Architecture), 13-05 (Cross Entropy), 10-07 (Continuous Distributions — sampling), 14-03 (SGD — temperature concept) Next subject: 18-09 — KV Caching

Learning Objectives

By the end of this subject, you will be able to:

Derive the autoregressive decoding algorithm and analyze its O(L·(L·d² + L²·d)) time complexity
Derive temperature sampling mathematically: p_i ∝ exp(z_i / T) and explain its effect on the output distribution
Formulate top-k and nucleus (top-p) sampling as constrained categorical sampling problems
Derive beam search as a breadth-first search with the scoring function score = Σ log p(x_t|x_{<t}) / len^α
Compute sequence-level log-probabilities and perplexity at inference time

Core Content

1. The Autoregressive Generation Algorithm

At inference, the model generates tokens one by one:

Algorithm:

Input: prompt tokens x_0, ..., x_{k-1}, max_length L_max
Output: generated tokens x_k, ..., x_{L-1}

for t = k to L_max - 1:
    h = decoder_forward(x_0, ..., x_{t-1})
    z = h · W_out^T                    # logits ∈ ℝ^V
    p = sampling_strategy(z)           # distribution over next token
    x_t ~ p                            # sample
    if x_t == <eos>: break

⚠️ THIS IS CRITICAL — Each step requires a full forward pass through all L Transformer blocks over the growing sequence. Without KV caching (18-09), the complexity for generating T tokens from a prompt of length P is O(Σ_{t=P}^{P+T-1} t²·d + t·d²) ≈ O((P+T)³·d), dominated by attention's quadratic scaling.

2. Greedy Decoding

The simplest strategy: always pick the most likely token.

x_t = argmax_j z_j

Properties: - Deterministic: same input always produces the same output - Fast: one argmax operation - Problem: gets stuck in repetitive loops ("I went to the store. I went to the store. I went...") - Optimal for: tasks with a single correct answer (factual QA, translation sometimes)

Mathematical justification: Greedy decoding maximizes the probability of each individual step, but does NOT maximize the joint probability of the sequence. The sequence-level optimum may require temporarily choosing a lower-probability token.

3. Temperature Sampling

Temperature modifies the softmax before sampling:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where T > 0 is the temperature parameter.

Effect of T:

T	Effect	Use case
T → 0	p → one-hot at argmax (greedy)	Deterministic tasks
T = 1	Standard softmax (default)	Normal generation
T > 1	Flatter distribution, more diverse	Creative writing
T → ∞	p → uniform (random)	Nonsense output

Logit scaling interpretation: T scales the logits. At T = 0.5, logits are doubled — the model becomes more confident (peakier distribution). At T = 2, logits are halved — the model becomes less confident (flatter).

Gradient connection: The gradient of cross-entropy with temperature:

p_i^T = exp(z_i/T) / Σ_j exp(z_j/T) ∂L/∂z_i = (1/T)(p_i^T − 1_{i=y})

Higher T → smaller gradients → effectively smaller learning rate for those examples.

4. Top-k Sampling

Restrict sampling to the k most likely tokens, then renormalize:

V_k = {indices of top k values in z} p_i = exp(z_i) / Σ_{j∈V_k} exp(z_j) for i ∈ V_k p_i = 0 for i ∉ V_k

This prevents the model from sampling extremely unlikely tokens that would derail generation.

Typical k: 10-50. GPT-2 used k=40 or k=50.

Problem with fixed k: For a peaked distribution (model is confident), k=50 includes many bad options. For a flat distribution, k=50 might exclude reasonable options. This motivates...

5. Nucleus (Top-p) Sampling

Sample from the smallest set of tokens whose cumulative probability exceeds p:

Let z be sorted in descending order: z_(1) ≥ z_(2) ≥ ... ≥ z_(V) Find smallest m such that Σ_{i=1}^{m} softmax(z_(i)) ≥ p Set V_p = {indices of the top m tokens} Sample from V_p with renormalized probabilities

Mathematical definition:

p_i^top-p = exp(z_i) / Σ_{j∈V_p} exp(z_j) · 1_{i∈V_p}

where V_p = minimal set such that Σ_{j∈V_p} softmax_j(z) ≥ p.

Advantage over top-k: Adapts to the confidence of the distribution. When the model is very confident (peak at 0.95), top-p=0.9 only includes that one token. When uncertain, it includes more options.

Typical p: 0.9-0.95. Higher p = more diverse, lower p = more focused.

6. Beam Search

Beam search maintains k candidate sequences (the "beam") and expands all of them simultaneously.

Algorithm:

beam = { (prompt, score=0) }    # beam_width = b

for t = prompt_len to max_len:
    candidates = []
    for (seq, score) in beam:
        z = forward(seq)              # logits for next token
        log_probs = log_softmax(z)
        for each token j:
            candidates.append( (seq + [j], score + log_probs[j]) )

    beam = top_b candidates by score   # keep b best

Scoring function: The score accumulated is the sum of log-probabilities:

score(seq) = Σ_{t} log p(x_t | x_{<t})

Length normalization: Without length penalty, beam search favors short sequences (each step adds negative log-prob, so shorter = higher score). Fix:

score_norm(seq) = score(seq) / |seq|^α

Typical α ∈ [0.6, 1.0]. α = 1 is average log-prob per token.

Repetition penalty: Subtract a penalty for already-generated tokens:

z'j = z_j − penalty · 1{j already generated}

This discourages the model from repeating itself.

7. Computing Log-Probabilities at Inference

For a given input-output pair, we can compute the model's assessment:

log p(output | input) = Σ_{t} log p_θ(y_t | input, y_1, ..., y_{t−1})

This is useful for: - Scoring: Which of several completions does the model prefer? - Perplexity evaluation: Compute PPL on held-out text - Uncertainty estimation: Low log-prob → model is uncertain

Implementation: Run one forward pass over the concatenated [input, output] sequence. For each output position, extract the log-prob of the actual token from the logits at the previous position.

Pitfalls

⚠️ Pitfall 1: Confusing temperature with top-k/top-p. Temperature modifies the SOFTMAX distribution before any truncation. Top-k then truncates. The order matters: p_i ∝ exp(z_i/T) first, THEN top-k/top-p filtering, THEN sampling. Applying top-k before temperature would produce different results.

⚠️ Pitfall 2: Using greedy decoding and expecting diverse output. Greedy is deterministic — same input always produces the same output. For creative tasks, use T > 1 with top-p sampling. For factual tasks, T=0 (greedy) is usually best.

⚠️ Pitfall 3: Forgetting length normalization in beam search. Without it, beam search always prefers the shortest sequence (every step adds negative log-prob). A model generating "Paris." will score higher than "Paris is the capital of France." simply because it's shorter.

Key Terms

18 08 Inference Mathematics
Beam Search
Common Pitfalls
Computing Log-Probabilities at Inference
Example 1: Temperature Effect on a Distribution
Example 2: Top-p Sampling
Example 3: Beam Search
Greedy Decoding
Nucleus (Top-p) Sampling
Problem 1
Problem 2
Problem 3

Worked Examples

Example 1: Temperature Effect on a Distribution

Problem: Logits z = [3.0, 1.0, 0.5, 0.2]. Compute the sampling probabilities at T = 1.0, T = 0.5, and T = 2.0.

Solution:

T = 1.0 (standard): z/T = [3.0, 1.0, 0.5, 0.2] exp: [20.09, 2.718, 1.649, 1.221], sum = 25.68 p = [0.782, 0.106, 0.064, 0.048]

T = 0.5 (sharper): z/T = [6.0, 2.0, 1.0, 0.4] exp: [403.4, 7.389, 2.718, 1.492], sum = 415.0 p = [0.972, 0.018, 0.007, 0.004]

T = 2.0 (flatter): z/T = [1.5, 0.5, 0.25, 0.1] exp: [4.482, 1.649, 1.284, 1.105], sum = 8.520 p = [0.526, 0.194, 0.151, 0.130]

Observation: At T=0.5, the top token dominates (97.2%). At T=2.0, the distribution is much flatter — the model's "confidence" is spread out. This is how T controls the diversity/vs-quality tradeoff.

Example 2: Top-p Sampling

Problem: For the T=1.0 probabilities from Example 1: p = [0.782, 0.106, 0.064, 0.048], with token indices 0,1,2,3. Find the nucleus for p = 0.9 and the renormalized probabilities.

Solution:

Sorted probabilities: 0.782, 0.106, 0.064, 0.048

Cumulative: m=1: 0.782 < 0.9 m=2: 0.782 + 0.106 = 0.888 < 0.9 m=3: 0.888 + 0.064 = 0.952 ≥ 0.9 ✓

Nucleus = {0, 1, 2} (tokens with indices 0,1,2)

Renormalized: sum = 0.782 + 0.106 + 0.064 = 0.952 p'₀ = 0.782/0.952 = 0.821 p'₁ = 0.106/0.952 = 0.111 p'₂ = 0.064/0.952 = 0.067 p'₃ = 0

Token 3 is excluded from sampling despite having non-negligible (4.8%) probability. This prevents the model from ever choosing that token.

Example 3: Beam Search

Problem: Use beam search with b=2 to continue the sequence from vocabulary {A, B, C, }. The model's top-2 probabilities at each step are:

Step 1 (from prompt): p(A)=0.5, p(B)=0.3 Step 2 (from prompt+A): p(A)=0.6, p(B)=0.2 Step 2 (from prompt+B): p(C)=0.4, p()=0.3

Find the top 2 beams after Step 2.

Solution:

Step 1 candidates: - prompt+A: score = log(0.5) = −0.693 - prompt+B: score = log(0.3) = −1.204

Top 2 beams: [prompt+A (−0.693), prompt+B (−1.204)]

Step 2 (expand both beams):

From prompt+A: - prompt+A+A: score = −0.693 + log(0.6) = −0.693 − 0.511 = −1.204 - prompt+A+B: score = −0.693 + log(0.2) = −0.693 − 1.609 = −2.302

From prompt+B: - prompt+B+C: score = −1.204 + log(0.4) = −1.204 − 0.916 = −2.120 - prompt+B+: score = −1.204 + log(0.3) = −1.204 − 1.204 = −2.408

Top 2 beams: 1. prompt+A+A: −1.204 2. prompt+B+C: −2.120

Note: The greedy path would be A (0.5) → A (0.6), giving "A A". Beam search also found "B C" as the second-best. Without beam search, we'd never discover that "A A" might score better than "A B" or other alternatives explored by the wider beam.

Quiz

Q1: What does the concept of Beam Search primarily refer to in this subject?

A) A historical anecdote about Beam Search B) A computational error related to Beam Search C) A visual representation of Beam Search D) The definition and application of Beam Search

Correct: D)

If you chose A: This is incorrect. Beam Search is defined as: the definition and application of beam search. The other options describe different aspects that are not the primary focus.
If you chose B: This is incorrect. Beam Search is defined as: the definition and application of beam search. The other options describe different aspects that are not the primary focus.
If you chose C: This is incorrect. Beam Search is defined as: the definition and application of beam search. The other options describe different aspects that are not the primary focus.
If you chose D: Beam Search is defined as: the definition and application of beam search. The other options describe different aspects that are not the primary focus. Correct!

Q2: What is the primary purpose of Common Pitfalls?

A) It is primarily a historical notation system B) It replaces all other methods in this domain C) It is used only in advanced research contexts D) It is used to common pitfalls in mathematical analysis

Correct: D)

If you chose A: This is incorrect. Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose B: This is incorrect. Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose C: This is incorrect. Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose D: Common Pitfalls serves the purpose described in the correct answer. The other options misrepresent its role. Correct!

Q3: Which statement about Computing Log-Probabilities at Inference is TRUE?

A) Computing Log-Probabilities at Inference is a fundamental concept covered in this subject B) Computing Log-Probabilities at Inference is not related to this subject C) Computing Log-Probabilities at Inference is mentioned only as a historical footnote D) Computing Log-Probabilities at Inference is an advanced topic beyond this subject's scope

Correct: A)

If you chose A: Computing Log-Probabilities at Inference is a fundamental concept covered in this subject. This subject covers Computing Log-Probabilities at Inference as part of its core content. Correct!
If you chose B: This is incorrect. Computing Log-Probabilities at Inference is a fundamental concept covered in this subject. This subject covers Computing Log-Probabilities at Inference as part of its core content.
If you chose C: This is incorrect. Computing Log-Probabilities at Inference is a fundamental concept covered in this subject. This subject covers Computing Log-Probabilities at Inference as part of its core content.
If you chose D: This is incorrect. Computing Log-Probabilities at Inference is a fundamental concept covered in this subject. This subject covers Computing Log-Probabilities at Inference as part of its core content.

Q4: Based on the worked examples in this subject, what is the correct result?

A) 25.061 B) The inverse of the correct answer C) A different result from a common mistake D) An unrelated numerical value

Correct: A)

If you chose A: The worked examples show that the result is 25.061. The other options represent common errors. Correct!
If you chose B: This is incorrect. The worked examples show that the result is 25.061. The other options represent common errors.
If you chose C: This is incorrect. The worked examples show that the result is 25.061. The other options represent common errors.
If you chose D: This is incorrect. The worked examples show that the result is 25.061. The other options represent common errors.

Q5: How are Computing Log-Probabilities at Inference and Greedy Decoding related?

A) Computing Log-Probabilities at Inference is a special case of Greedy Decoding B) Computing Log-Probabilities at Inference and Greedy Decoding are closely related concepts C) Computing Log-Probabilities at Inference is the inverse of Greedy Decoding D) Computing Log-Probabilities at Inference and Greedy Decoding are completely unrelated topics

Correct: B)

If you chose A: This is incorrect. Both Computing Log-Probabilities at Inference and Greedy Decoding are covered in this subject as interconnected topics.
If you chose B: Both Computing Log-Probabilities at Inference and Greedy Decoding are covered in this subject as interconnected topics. Correct!
If you chose C: This is incorrect. Both Computing Log-Probabilities at Inference and Greedy Decoding are covered in this subject as interconnected topics.
If you chose D: This is incorrect. Both Computing Log-Probabilities at Inference and Greedy Decoding are covered in this subject as interconnected topics.

Q6: What is a common pitfall when working with Nucleus (Top-p) Sampling?

A) Nucleus (Top-p) Sampling is always computed the same way in all contexts B) A common mistake is confusing Nucleus (Top-p) Sampling with a similar concept C) The main error with Nucleus (Top-p) Sampling is using it when it is not needed D) Nucleus (Top-p) Sampling has no common misconceptions

Correct: B)

If you chose A: This is incorrect. Students often confuse Nucleus (Top-p) Sampling with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose B: Students often confuse Nucleus (Top-p) Sampling with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose C: This is incorrect. Students often confuse Nucleus (Top-p) Sampling with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose D: This is incorrect. Students often confuse Nucleus (Top-p) Sampling with similar-sounding or related concepts. Pay attention to the precise definitions.

Q7: When should you apply The Autoregressive Generation Algorithm?

A) Avoid The Autoregressive Generation Algorithm unless explicitly instructed B) The Autoregressive Generation Algorithm is not practically useful C) Apply The Autoregressive Generation Algorithm to solve problems in this subject's domain D) Use The Autoregressive Generation Algorithm only in pure mathematics contexts

Correct: C)

If you chose A: This is incorrect. The Autoregressive Generation Algorithm is a practical tool used throughout this subject to solve relevant problems.
If you chose B: This is incorrect. The Autoregressive Generation Algorithm is a practical tool used throughout this subject to solve relevant problems.
If you chose C: The Autoregressive Generation Algorithm is a practical tool used throughout this subject to solve relevant problems. Correct!
If you chose D: This is incorrect. The Autoregressive Generation Algorithm is a practical tool used throughout this subject to solve relevant problems.

Practice Problems

Problem 1

For logits z = [2.0, 1.5, 1.0, 0.5, 0.0], compute the probability of sampling each token at T = 0.8.

Answer

z/T = [2.5, 1.875, 1.25, 0.625, 0.0] exp: [12.182, 6.521, 3.490, 1.868, 1.0], sum = 25.061 p = [0.486, 0.260, 0.139, 0.075, 0.040]

Problem 2

For a distribution p = [0.4, 0.3, 0.15, 0.10, 0.05], what is the smallest top-k that includes at least 90% of probability mass (nucleus for p=0.9)?

Answer

Sorted: 0.4, 0.3, 0.15, 0.10, 0.05 Cumulative: 0.4, 0.7, 0.85, 0.95, 1.0 m=4: 0.95 ≥ 0.9 → V_p has size 4 Top-k with k=4 is equivalent for this distribution.

Problem 3

A beam search with b=4 generates sequences with scores: [−5.2, −5.3, −5.3, −5.8, −5.9, ...]. Apply length normalization with α=0.6 for sequences of lengths 10, 12, 15, 8, 20 respectively. What's the new ranking?

Answer

Wait — need lengths for each score. Let's assign: score −5.2 (len 10), −5.3 (len 12), −5.3 (len 10), −5.8 (len 15), −5.9 (len 8) Normalized: −5.2 / 10^0.6 = −5.2 / 3.981 = −1.306 −5.3 / 12^0.6 = −5.3 / 4.445 = −1.192 ← best −5.3 / 10^0.6 = −5.3 / 3.981 = −1.331 −5.8 / 15^0.6 = −5.8 / 5.098 = −1.138 ← new best after normalization −5.9 / 8^0.6 = −5.9 / 3.482 = −1.695 Ranking: −5.8(len 15) → −5.3(len 12) → −5.2(len 10) → −5.3(len 10) → −5.9(len 8) Length normalization boosted the longer sequence (−5.8 over 15 tokens) above shorter ones with better raw scores.

Problem 4

Explain why greedy decoding can fail to find the globally optimal sequence.

Answer

Greedy decoding maximizes p(x_t | x_{ ### Problem 5 At inference time, you run the model on a test set of 1000 tokens and compute sum of log-probabilities = −2500. What is the perplexity?

Answer

Average NLL = 2500/1000 = 2.5 nats/token PPL = exp(2.5) ≈ 12.18 Note: inference-time perplexity matches training-time if the evaluation settings (temperature=1, no top-k/top-p) are used.

--- ## Summary 1. Autoregressive generation produces tokens sequentially: each forward pass over the growing context produces logits, which are converted to a distribution via softmax and sampled 2. Temperature T scales logits: p_i ∝ exp(z_i/T); T < 1 sharpens (more deterministic), T > 1 flattens (more random) 3. Top-k restricts to the k most probable tokens; nucleus (top-p) restricts to the minimal set covering probability p — nucleus adapts to distribution shape 4. Beam search maintains b parallel hypotheses, using accumulated log-probability with length normalization to find high-probability sequences 5. Sequence log-probability = Σ log p(y_t|input, y_{

Progress

Phases

18-08 — Inference Mathematics

Learning Objectives

Core Content

1. The Autoregressive Generation Algorithm

2. Greedy Decoding

3. Temperature Sampling

4. Top-k Sampling

5. Nucleus (Top-p) Sampling

6. Beam Search

7. Computing Log-Probabilities at Inference

Pitfalls

Key Terms

Worked Examples

Example 1: Temperature Effect on a Distribution

Example 2: Top-p Sampling

Example 3: Beam Search

Quiz

Practice Problems

Problem 1

Problem 2

Problem 3

Problem 4