Math graphic
๐Ÿ“ Concept diagram

18-08 โ€” Inference Mathematics

Phase: 18 โ€” Large Language Model Mathematics Subject: 18-08 Prerequisites: 18-05 (Decoder-Only Architecture), 13-05 (Cross Entropy), 10-07 (Continuous Distributions โ€” sampling), 14-03 (SGD โ€” temperature concept) Next subject: 18-09 โ€” KV Caching


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the autoregressive decoding algorithm and analyze its O(Lยท(Lยทdยฒ + Lยฒยทd)) time complexity
  2. Derive temperature sampling mathematically: p_i โˆ exp(z_i / T) and explain its effect on the output distribution
  3. Formulate top-k and nucleus (top-p) sampling as constrained categorical sampling problems
  4. Derive beam search as a breadth-first search with the scoring function score = ฮฃ log p(x_t|x_{<t}) / len^ฮฑ
  5. Compute sequence-level log-probabilities and perplexity at inference time

Core Content

1. The Autoregressive Generation Algorithm

At inference, the model generates tokens one by one:

Algorithm:

Input: prompt tokens x_0, ..., x_{k-1}, max_length L_max
Output: generated tokens x_k, ..., x_{L-1}

for t = k to L_max - 1:
    h = decoder_forward(x_0, ..., x_{t-1})
    z = h ยท W_out^T                    # logits โˆˆ โ„^V
    p = sampling_strategy(z)           # distribution over next token
    x_t ~ p                            # sample
    if x_t == <eos>: break

โš ๏ธ THIS IS CRITICAL โ€” Each step requires a full forward pass through all L Transformer blocks over the growing sequence. Without KV caching (18-09), the complexity for generating T tokens from a prompt of length P is O(ฮฃ_{t=P}^{P+T-1} tยฒยทd + tยทdยฒ) โ‰ˆ O((P+T)ยณยทd), dominated by attention's quadratic scaling.

2. Greedy Decoding

The simplest strategy: always pick the most likely token.

x_t = argmax_j z_j

Properties: - Deterministic: same input always produces the same output - Fast: one argmax operation - Problem: gets stuck in repetitive loops ("I went to the store. I went to the store. I went...") - Optimal for: tasks with a single correct answer (factual QA, translation sometimes)

Mathematical justification: Greedy decoding maximizes the probability of each individual step, but does NOT maximize the joint probability of the sequence. The sequence-level optimum may require temporarily choosing a lower-probability token.

3. Temperature Sampling

Temperature modifies the softmax before sampling:

p_i = exp(z_i / T) / ฮฃ_j exp(z_j / T)

where T > 0 is the temperature parameter.

Effect of T:

T Effect Use case
T โ†’ 0 p โ†’ one-hot at argmax (greedy) Deterministic tasks
T = 1 Standard softmax (default) Normal generation
T > 1 Flatter distribution, more diverse Creative writing
T โ†’ โˆž p โ†’ uniform (random) Nonsense output

Logit scaling interpretation: T scales the logits. At T = 0.5, logits are doubled โ€” the model becomes more confident (peakier distribution). At T = 2, logits are halved โ€” the model becomes less confident (flatter).

Gradient connection: The gradient of cross-entropy with temperature:

p_i^T = exp(z_i/T) / ฮฃ_j exp(z_j/T) โˆ‚L/โˆ‚z_i = (1/T)(p_i^T โˆ’ 1_{i=y})

Higher T โ†’ smaller gradients โ†’ effectively smaller learning rate for those examples.

4. Top-k Sampling

Restrict sampling to the k most likely tokens, then renormalize:

V_k = {indices of top k values in z} p_i = exp(z_i) / ฮฃ_{jโˆˆV_k} exp(z_j) for i โˆˆ V_k p_i = 0 for i โˆ‰ V_k

This prevents the model from sampling extremely unlikely tokens that would derail generation.

Typical k: 10-50. GPT-2 used k=40 or k=50.

Problem with fixed k: For a peaked distribution (model is confident), k=50 includes many bad options. For a flat distribution, k=50 might exclude reasonable options. This motivates...

5. Nucleus (Top-p) Sampling

Sample from the smallest set of tokens whose cumulative probability exceeds p:

Let z be sorted in descending order: z_(1) โ‰ฅ z_(2) โ‰ฅ ... โ‰ฅ z_(V) Find smallest m such that ฮฃ_{i=1}^{m} softmax(z_(i)) โ‰ฅ p Set V_p = {indices of the top m tokens} Sample from V_p with renormalized probabilities

Mathematical definition:

p_i^top-p = exp(z_i) / ฮฃ_{jโˆˆV_p} exp(z_j) ยท 1_{iโˆˆV_p}

where V_p = minimal set such that ฮฃ_{jโˆˆV_p} softmax_j(z) โ‰ฅ p.

Advantage over top-k: Adapts to the confidence of the distribution. When the model is very confident (peak at 0.95), top-p=0.9 only includes that one token. When uncertain, it includes more options.

Typical p: 0.9-0.95. Higher p = more diverse, lower p = more focused.

Beam search maintains k candidate sequences (the "beam") and expands all of them simultaneously.

Algorithm:

beam = { (prompt, score=0) }    # beam_width = b

for t = prompt_len to max_len:
    candidates = []
    for (seq, score) in beam:
        z = forward(seq)              # logits for next token
        log_probs = log_softmax(z)
        for each token j:
            candidates.append( (seq + [j], score + log_probs[j]) )

    beam = top_b candidates by score   # keep b best

Scoring function: The score accumulated is the sum of log-probabilities:

score(seq) = ฮฃ_{t} log p(x_t | x_{<t})

Length normalization: Without length penalty, beam search favors short sequences (each step adds negative log-prob, so shorter = higher score). Fix:

score_norm(seq) = score(seq) / |seq|^ฮฑ

Typical ฮฑ โˆˆ [0.6, 1.0]. ฮฑ = 1 is average log-prob per token.

Repetition penalty: Subtract a penalty for already-generated tokens:

z'j = z_j โˆ’ penalty ยท 1{j already generated}

This discourages the model from repeating itself.

7. Computing Log-Probabilities at Inference

For a given input-output pair, we can compute the model's assessment:

log p(output | input) = ฮฃ_{t} log p_ฮธ(y_t | input, y_1, ..., y_{tโˆ’1})

This is useful for: - Scoring: Which of several completions does the model prefer? - Perplexity evaluation: Compute PPL on held-out text - Uncertainty estimation: Low log-prob โ†’ model is uncertain

Implementation: Run one forward pass over the concatenated [input, output] sequence. For each output position, extract the log-prob of the actual token from the logits at the previous position.



Pitfalls

โš ๏ธ Pitfall 1: Confusing temperature with top-k/top-p. Temperature modifies the SOFTMAX distribution before any truncation. Top-k then truncates. The order matters: p_i โˆ exp(z_i/T) first, THEN top-k/top-p filtering, THEN sampling. Applying top-k before temperature would produce different results.

โš ๏ธ Pitfall 2: Using greedy decoding and expecting diverse output. Greedy is deterministic โ€” same input always produces the same output. For creative tasks, use T > 1 with top-p sampling. For factual tasks, T=0 (greedy) is usually best.

โš ๏ธ Pitfall 3: Forgetting length normalization in beam search. Without it, beam search always prefers the shortest sequence (every step adds negative log-prob). A model generating "Paris." will score higher than "Paris is the capital of France." simply because it's shorter.


Key Terms

Worked Examples

Example 1: Temperature Effect on a Distribution

Problem: Logits z = [3.0, 1.0, 0.5, 0.2]. Compute the sampling probabilities at T = 1.0, T = 0.5, and T = 2.0.

Solution:

T = 1.0 (standard): z/T = [3.0, 1.0, 0.5, 0.2] exp: [20.09, 2.718, 1.649, 1.221], sum = 25.68 p = [0.782, 0.106, 0.064, 0.048]

T = 0.5 (sharper): z/T = [6.0, 2.0, 1.0, 0.4] exp: [403.4, 7.389, 2.718, 1.492], sum = 415.0 p = [0.972, 0.018, 0.007, 0.004]

T = 2.0 (flatter): z/T = [1.5, 0.5, 0.25, 0.1] exp: [4.482, 1.649, 1.284, 1.105], sum = 8.520 p = [0.526, 0.194, 0.151, 0.130]

Observation: At T=0.5, the top token dominates (97.2%). At T=2.0, the distribution is much flatter โ€” the model's "confidence" is spread out. This is how T controls the diversity/vs-quality tradeoff.

Example 2: Top-p Sampling

Problem: For the T=1.0 probabilities from Example 1: p = [0.782, 0.106, 0.064, 0.048], with token indices 0,1,2,3. Find the nucleus for p = 0.9 and the renormalized probabilities.

Solution:

Sorted probabilities: 0.782, 0.106, 0.064, 0.048

Cumulative: m=1: 0.782 < 0.9 m=2: 0.782 + 0.106 = 0.888 < 0.9 m=3: 0.888 + 0.064 = 0.952 โ‰ฅ 0.9 โœ“

Nucleus = {0, 1, 2} (tokens with indices 0,1,2)

Renormalized: sum = 0.782 + 0.106 + 0.064 = 0.952 p'โ‚€ = 0.782/0.952 = 0.821 p'โ‚ = 0.106/0.952 = 0.111 p'โ‚‚ = 0.064/0.952 = 0.067 p'โ‚ƒ = 0

Token 3 is excluded from sampling despite having non-negligible (4.8%) probability. This prevents the model from ever choosing that token.

Problem: Use beam search with b=2 to continue the sequence from vocabulary {A, B, C, }. The model's top-2 probabilities at each step are:

Step 1 (from prompt): p(A)=0.5, p(B)=0.3 Step 2 (from prompt+A): p(A)=0.6, p(B)=0.2 Step 2 (from prompt+B): p(C)=0.4, p()=0.3

Find the top 2 beams after Step 2.

Solution:

Step 1 candidates: - prompt+A: score = log(0.5) = โˆ’0.693 - prompt+B: score = log(0.3) = โˆ’1.204

Top 2 beams: [prompt+A (โˆ’0.693), prompt+B (โˆ’1.204)]

Step 2 (expand both beams):

From prompt+A: - prompt+A+A: score = โˆ’0.693 + log(0.6) = โˆ’0.693 โˆ’ 0.511 = โˆ’1.204 - prompt+A+B: score = โˆ’0.693 + log(0.2) = โˆ’0.693 โˆ’ 1.609 = โˆ’2.302

From prompt+B: - prompt+B+C: score = โˆ’1.204 + log(0.4) = โˆ’1.204 โˆ’ 0.916 = โˆ’2.120 - prompt+B+: score = โˆ’1.204 + log(0.3) = โˆ’1.204 โˆ’ 1.204 = โˆ’2.408

Top 2 beams: 1. prompt+A+A: โˆ’1.204 2. prompt+B+C: โˆ’2.120

Note: The greedy path would be A (0.5) โ†’ A (0.6), giving "A A". Beam search also found "B C" as the second-best. Without beam search, we'd never discover that "A A" might score better than "A B" or other alternatives explored by the wider beam.



Quiz

Q1: What does the concept of Beam Search primarily refer to in this subject?

A) A historical anecdote about Beam Search B) A computational error related to Beam Search C) A visual representation of Beam Search D) The definition and application of Beam Search

Correct: D)

Q2: What is the primary purpose of Common Pitfalls?

A) It is primarily a historical notation system B) It replaces all other methods in this domain C) It is used only in advanced research contexts D) It is used to common pitfalls in mathematical analysis

Correct: D)

Q3: Which statement about Computing Log-Probabilities at Inference is TRUE?

A) Computing Log-Probabilities at Inference is a fundamental concept covered in this subject B) Computing Log-Probabilities at Inference is not related to this subject C) Computing Log-Probabilities at Inference is mentioned only as a historical footnote D) Computing Log-Probabilities at Inference is an advanced topic beyond this subject's scope

Correct: A)

Q4: Based on the worked examples in this subject, what is the correct result?

A) 25.061 B) The inverse of the correct answer C) A different result from a common mistake D) An unrelated numerical value

Correct: A)

Q5: How are Computing Log-Probabilities at Inference and Greedy Decoding related?

A) Computing Log-Probabilities at Inference is a special case of Greedy Decoding B) Computing Log-Probabilities at Inference and Greedy Decoding are closely related concepts C) Computing Log-Probabilities at Inference is the inverse of Greedy Decoding D) Computing Log-Probabilities at Inference and Greedy Decoding are completely unrelated topics

Correct: B)

Q6: What is a common pitfall when working with Nucleus (Top-p) Sampling?

A) Nucleus (Top-p) Sampling is always computed the same way in all contexts B) A common mistake is confusing Nucleus (Top-p) Sampling with a similar concept C) The main error with Nucleus (Top-p) Sampling is using it when it is not needed D) Nucleus (Top-p) Sampling has no common misconceptions

Correct: B)

Q7: When should you apply The Autoregressive Generation Algorithm?

A) Avoid The Autoregressive Generation Algorithm unless explicitly instructed B) The Autoregressive Generation Algorithm is not practically useful C) Apply The Autoregressive Generation Algorithm to solve problems in this subject's domain D) Use The Autoregressive Generation Algorithm only in pure mathematics contexts

Correct: C)

Practice Problems

Problem 1

For logits z = [2.0, 1.5, 1.0, 0.5, 0.0], compute the probability of sampling each token at T = 0.8.

Answer z/T = [2.5, 1.875, 1.25, 0.625, 0.0] exp: [12.182, 6.521, 3.490, 1.868, 1.0], sum = 25.061 p = [0.486, 0.260, 0.139, 0.075, 0.040]

Problem 2

For a distribution p = [0.4, 0.3, 0.15, 0.10, 0.05], what is the smallest top-k that includes at least 90% of probability mass (nucleus for p=0.9)?

Answer Sorted: 0.4, 0.3, 0.15, 0.10, 0.05 Cumulative: 0.4, 0.7, 0.85, 0.95, 1.0 m=4: 0.95 โ‰ฅ 0.9 โ†’ V_p has size 4 Top-k with k=4 is equivalent for this distribution.

Problem 3

A beam search with b=4 generates sequences with scores: [โˆ’5.2, โˆ’5.3, โˆ’5.3, โˆ’5.8, โˆ’5.9, ...]. Apply length normalization with ฮฑ=0.6 for sequences of lengths 10, 12, 15, 8, 20 respectively. What's the new ranking?

Answer Wait โ€” need lengths for each score. Let's assign: score โˆ’5.2 (len 10), โˆ’5.3 (len 12), โˆ’5.3 (len 10), โˆ’5.8 (len 15), โˆ’5.9 (len 8) Normalized: โˆ’5.2 / 10^0.6 = โˆ’5.2 / 3.981 = โˆ’1.306 โˆ’5.3 / 12^0.6 = โˆ’5.3 / 4.445 = โˆ’1.192 โ† best โˆ’5.3 / 10^0.6 = โˆ’5.3 / 3.981 = โˆ’1.331 โˆ’5.8 / 15^0.6 = โˆ’5.8 / 5.098 = โˆ’1.138 โ† new best after normalization โˆ’5.9 / 8^0.6 = โˆ’5.9 / 3.482 = โˆ’1.695 Ranking: โˆ’5.8(len 15) โ†’ โˆ’5.3(len 12) โ†’ โˆ’5.2(len 10) โ†’ โˆ’5.3(len 10) โ†’ โˆ’5.9(len 8) Length normalization boosted the longer sequence (โˆ’5.8 over 15 tokens) above shorter ones with better raw scores.

Problem 4

Explain why greedy decoding can fail to find the globally optimal sequence.

Answer Greedy decoding maximizes p(x_t | x_{ ### Problem 5 At inference time, you run the model on a test set of 1000 tokens and compute sum of log-probabilities = โˆ’2500. What is the perplexity?
Answer Average NLL = 2500/1000 = 2.5 nats/token PPL = exp(2.5) โ‰ˆ 12.18 Note: inference-time perplexity matches training-time if the evaluation settings (temperature=1, no top-k/top-p) are used.
--- ## Summary 1. Autoregressive generation produces tokens sequentially: each forward pass over the growing context produces logits, which are converted to a distribution via softmax and sampled 2. Temperature T scales logits: p_i โˆ exp(z_i/T); T < 1 sharpens (more deterministic), T > 1 flattens (more random) 3. Top-k restricts to the k most probable tokens; nucleus (top-p) restricts to the minimal set covering probability p โ€” nucleus adapts to distribution shape 4. Beam search maintains b parallel hypotheses, using accumulated log-probability with length normalization to find high-probability sequences 5. Sequence log-probability = ฮฃ log p(y_t|input, y_{