18-05: Decoder-Only Architecture (GPT-style)

Phase: 18 (Large Language Model Mathematics)
Subject: 18-05
Prerequisites: 18-04 (RoPE), 17-09 (Transformer Architecture), 17-07 (Scaled Dot-Product Attention), 17-05 (Residual Connections), 13-05 (Cross Entropy)
Next subject: 18-06 (Pre-training Objective Mathematics)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the autoregressive generation equation p(x_t | x_{<t}) and explain why it factorizes the joint probability
  2. Construct the causal (masked) self-attention mechanism mathematically, including the mask matrix and its role in the softmax
  3. Trace the complete forward pass through a decoder-only Transformer from token indices to next-token logits
  4. Compute the cross-entropy loss for next-token prediction and derive its gradient with respect to the logits
  5. Explain how teacher forcing enables parallel training despite the sequential nature of autoregressive generation

Core Content

1. The Autoregressive Language Modeling Objective

A language model defines a probability distribution over sequences of tokens:

p(x₁, x₂, ..., x_T) = Π_{t=1}^{T} p(x_t | x₁, ..., x_{t−1})

This is the chain rule of probability applied to tokens. It is an exact identity, not an approximation. "Autoregressive" means that the model parameterizes each factor, so every token is predicted from the tokens that precede it.
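The factorization can be checked numerically on a toy case. The sketch below is illustrative only: the "model" is a hypothetical table of random logits, one vector per prefix, and the sizes `V` and `T` are arbitrary. It multiplies the conditionals to get each joint probability and confirms that the joint sums to 1 over all sequences.

```python
import itertools
import numpy as np

V, T = 3, 3                      # toy vocabulary size and sequence length (arbitrary)
rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical stand-in for f_theta: one random logit vector per distinct prefix.
_logit_table = {}

def next_token_probs(prefix):
    key = tuple(prefix)
    if key not in _logit_table:
        _logit_table[key] = rng.normal(size=V)
    return softmax(_logit_table[key])

def joint_prob(seq):
    # p(x_1, ..., x_T) = prod_t p(x_t | x_<t)
    p = 1.0
    for t, tok in enumerate(seq):
        p *= next_token_probs(seq[:t])[tok]
    return p

# Sum the joint over all V**T sequences; a valid distribution sums to 1.
total = sum(joint_prob(seq) for seq in itertools.product(range(V), repeat=T))
print(np.isclose(total, 1.0))
```

Any choice of per-prefix conditionals yields a valid joint distribution, which is why the model only ever needs to output a next-token distribution.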

The model is a function f_θ that, given a prefix x_{<t} = (x₁, ..., x_{t−1}), outputs a probability distribution over the next token:

p_θ(· | x_{<t}) = softmax(f_θ(x_{<t}))

where f_θ(x_{<t}) ∈ ℝ^V are the logits.

The decoder-only architecture processes ALL prefix tokens simultaneously and produces logits for EVERY position's next token in one forward pass:

[z₁, z₂, ..., z_T] = Decoder([x₁, x₂, ..., x_T])
p(x_t | x_{<t}) = softmax(z_{t−1}) for t ≥ 2
p(x₁) = softmax(z₀) (z₀ comes from a special start token)

⚠️ THIS IS CRITICAL: One forward pass produces predictions for ALL positions simultaneously. z_i predicts token x_{i+1} because position i has seen only the tokens up to and including position i. This is the power of causal masking combined with teacher forcing.

2. The Causal (Masked) Self-Attention

The key difference from encoder self-attention: position i can only attend to positions j ≤ i.

The causal mask is a matrix with −∞ strictly above the diagonal (positions where j > i) and 0 elsewhere:

M_{ij} = 0 if j ≤ i, −∞ if j > i

The attention computation:

S = (QKᵀ) / √d_k
S_masked = S + M
A = softmax(S_masked) [applied row-wise]
Output = A · V

In matrix form for sequence length n:

S = [[s₀₀, s₀₁, s₀₂, ..., s_{0,n−1}],
     [s₁₀, s₁₁, s₁₂, ..., s_{1,n−1}],
     ...
     [s_{n−1,0}, s_{n−1,1}, ..., s_{n−1,n−1}]]

M = [[0,   −∞, −∞, ..., −∞],
     [0,   0,  −∞, ..., −∞],
     [0,   0,  0,  ..., −∞],
     ...
     [0,   0,  0,  ..., 0]]

S_masked = [[s₀₀,   −∞,   −∞, ...,   −∞],
            [s₁₀, s₁₁,   −∞, ...,   −∞],
            ...
            [s_{n−1,0}, ..., s_{n−1,n−1}]]

After softmax (row-wise): exp(−∞) = 0, so any masked position gets zero attention weight. Position i only attends to positions 0...i.

Implementation note: The mask is often applied by filling masked scores with a large negative number (such as −1e9) rather than a literal −∞, so that the corresponding weights underflow to zero after softmax.
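The masked attention above can be sketched in a few lines of NumPy. This is a minimal single-head version for one sequence. The function name `causal_attention` and the random inputs are assumptions for illustration, and the sketch uses a literal −∞ rather than a large negative constant.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head causal attention. Q, K: (n, d_k); V: (n, d_v)."""
    n, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)                          # scores, shape (n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
    M = np.where(future, -np.inf, 0.0)                  # causal mask
    S_masked = S + M
    # Subtract the row maximum for numerical stability; the diagonal is never
    # masked, so every row has a finite maximum.
    S_masked = S_masked - S_masked.max(axis=-1, keepdims=True)
    A = np.exp(S_masked)                                # exp(-inf) = 0
    A = A / A.sum(axis=-1, keepdims=True)               # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
Vals = rng.normal(size=(5, 3))
out, A = causal_attention(Q, K, Vals)
print(np.round(A, 3))   # lower-triangular, each row sums to 1
```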

3. Complete Forward Pass

Starting from token indices t ∈ {1..V}^n:

Step 1: Token Embedding

X₀ = E[t] ∈ ℝ^(n×d) [lookup]

Step 2: Add Positional Encoding (absolute PE only; RoPE is applied inside attention, not here)

X₀ += PE[0:n] [if using absolute PE; skip for RoPE]

Step 3: L Transformer Blocks

For each block ℓ ∈ {1..L}:

Pre-norm variant (modern):

X_norm = RMSNorm(X_{ℓ−1})
A = CausalMultiHeadAttention(X_norm) [with RoPE applied to Q and K]
X_mid = X_{ℓ−1} + A
X_norm2 = RMSNorm(X_mid)
F = FFN(X_norm2)
X_ℓ = X_mid + F

Step 4: Final Layer Norm

H = RMSNorm(X_L)

Step 5: Output Projection (LM Head)

Z = H · W_outᵀ + b_out ∈ ℝ^(n×V)

Row i of Z contains the logits for predicting token at position i+1.
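Steps 1 to 5 can be traced end to end with a small NumPy sketch. It is deliberately simplified and rests on several assumptions: a single attention head, a ReLU feed-forward layer standing in for the usual GELU or SwiGLU, no RoPE or absolute positional encoding, and randomly initialized weights. All function and parameter names are illustrative.

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps) * g

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, p):
    n, d = x.shape
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    s = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    s = np.where(future, -np.inf, s)              # mask positions j > i
    return softmax(s) @ v @ p["Wo"]

def ffn(x, p):
    return np.maximum(x @ p["W1"], 0.0) @ p["W2"]  # ReLU stand-in (assumption)

def init_params(V, d, L, rng):
    def w(*shape):
        return rng.normal(scale=0.02, size=shape)
    blocks = [dict(Wq=w(d, d), Wk=w(d, d), Wv=w(d, d), Wo=w(d, d),
                   W1=w(d, 4 * d), W2=w(4 * d, d),
                   g1=np.ones(d), g2=np.ones(d)) for _ in range(L)]
    return dict(E=w(V, d), blocks=blocks, g_final=np.ones(d),
                W_out=w(V, d), b_out=np.zeros(V))

def forward(tokens, params):
    x = params["E"][tokens]                          # Step 1: embedding lookup, (n, d)
    for p in params["blocks"]:                       # Step 3: pre-norm blocks
        x = x + causal_self_attention(rms_norm(x, p["g1"]), p)
        x = x + ffn(rms_norm(x, p["g2"]), p)
    h = rms_norm(x, params["g_final"])               # Step 4: final norm
    return h @ params["W_out"].T + params["b_out"]   # Step 5: logits, (n, V)

rng = np.random.default_rng(0)
params = init_params(V=11, d=8, L=2, rng=rng)
tokens = np.array([3, 1, 4, 1, 5])
Z = forward(tokens, params)
print(Z.shape)  # (5, 11)
```

Changing the last token changes only the last row of `Z`, which is a quick way to confirm that no position sees the future.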

4. Training: Cross-Entropy Loss

For a sequence of length n, the model predicts tokens 1..n from prefix tokens 0..n−1.

The prediction for position t (predicting token x_{t+1}):

p_t = softmax(Z[t, :]) ∈ ℝ^V

The loss is the mean cross-entropy over all positions:

L = −(1/n) Σ_{t=0}^{n−1} log p_t(x_{t+1}) = −(1/n) Σ_{t=0}^{n−1} log(softmax(Z[t, :])[x_{t+1}])

Expanding:

L = −(1/n) Σ_{t=0}^{n−1} [Z[t, x_{t+1}] − logsumexp(Z[t, :])]

where logsumexp(z) = log(Σ_j exp(z_j)).

Gradient of loss with respect to logits:

For a single position t with target token y = x_{t+1}:

∂L_t/∂z_j = p_t(j) − 1_{j=y}

This is the fundamental gradient of cross-entropy with softmax: predicted probability minus one-hot target. Because L averages over n positions, the gradient of the mean loss with respect to Z[t, j] is this quantity divided by n. When the model is confidently correct (p_t(y) ≈ 1), gradients are near zero. When it is wrong, the gradient pushes the correct logit up and the other logits down in proportion to their predicted probabilities.
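The loss and its gradient can be written directly from the logsumexp form and checked against finite differences. The sketch below uses random logits and arbitrary targets, and the helper names are illustrative. The analytic gradient is (softmax(Z) − one-hot)/n because the loss is a mean over n positions.

```python
import numpy as np

def logsumexp(Z):
    m = Z.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(Z - m).sum(axis=-1, keepdims=True))).squeeze(-1)

def next_token_loss(Z, targets):
    """Mean cross-entropy. Z: (n, V) logits; targets[t] is the index of x_{t+1}."""
    n = Z.shape[0]
    return np.mean(logsumexp(Z) - Z[np.arange(n), targets])

def loss_grad(Z, targets):
    """Analytic dL/dZ for the mean loss: (softmax(Z) - onehot) / n."""
    n = Z.shape[0]
    G = np.exp(Z - logsumexp(Z)[:, None])    # softmax probabilities
    G[np.arange(n), targets] -= 1.0          # subtract the one-hot target
    return G / n

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 6))
targets = np.array([2, 0, 5, 1])

# Central finite differences, one logit at a time.
eps = 1e-6
G_num = np.zeros_like(Z)
for t in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[t, j] += eps
        Zm[t, j] -= eps
        G_num[t, j] = (next_token_loss(Zp, targets) - next_token_loss(Zm, targets)) / (2 * eps)

G = loss_grad(Z, targets)
print(np.abs(G - G_num).max())   # should be tiny
```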

5. Teacher Forcing

During training, we feed the GROUND TRUTH tokens as input, not the model's own predictions. This is called teacher forcing.

Why it works: Consider predicting x₃ given x₀, x₁, x₂. With teacher forcing:
- The model receives the real x₀, x₁, x₂ as input
- It produces predictions for x₁ (from x₀), x₂ (from x₀, x₁), x₃ (from x₀, x₁, x₂)
- Loss is computed at ALL positions simultaneously
- Gradients flow in parallel through all positions

Without teacher forcing (free-running mode):
- The model receives x₀ only
- It generates x̂₁, feeds x̂₁ back, generates x̂₂, etc.
- Generation is sequential and slow, and errors compound
- Gradients cannot flow through the discrete sampling step (it is non-differentiable)

Teacher forcing enables parallel training: one forward pass computes all losses. The causal mask ensures causality: position i never sees future tokens, even though the whole sequence is processed at once.
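A small experiment makes the parallelism claim concrete. The sketch below uses a hypothetical one-layer causal model with random weights (an assumption, not a full decoder). It computes logits for a whole sequence in one pass, then recomputes them one prefix at a time, and the two results agree.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 7, 4                                    # toy sizes (arbitrary)
E = rng.normal(size=(V, d))                    # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W_out = rng.normal(size=(V, d))

def logits(tokens):
    """Toy one-layer causal model: returns (n, V) next-token logits."""
    x = E[tokens]
    n = len(tokens)
    s = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
    s = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, s)
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return (x + a @ (x @ Wv)) @ W_out.T        # residual connection + LM head

tokens = np.array([2, 5, 1, 6, 3])

# Teacher forcing: one pass over the ground-truth sequence gives every row.
Z_parallel = logits(tokens)

# Reference: run the model separately on each prefix and keep the last row.
Z_sequential = np.stack([logits(tokens[: i + 1])[-1] for i in range(len(tokens))])

print(np.allclose(Z_parallel, Z_sequential))   # True
```

In a training loop the same idea appears as a shift: the inputs are `tokens[:-1]` and the targets are `tokens[1:]`.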

Loss masking for variable-length sequences: For a batch of sequences with different lengths, create a boolean mask that zeros out loss for padding positions:

L = −(1/Σ_t m_t) Σ_{t=0}^{n−1} m_t · log p_t(x_{t+1})

where m_t = 1 for valid prediction positions, 0 for padding.
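A batched version of the masked loss might look like the following sketch. The function name, the shapes, and the use of 0 as the padding id are assumptions for illustration. The padded positions still need some target index for the gather, but the mask removes them from both the numerator and the denominator.

```python
import numpy as np

def masked_next_token_loss(Z, targets, mask):
    """Z: (B, n, V) logits; targets: (B, n) token ids; mask: (B, n), 1 = valid, 0 = padding."""
    m = Z.max(axis=-1, keepdims=True)
    log_probs = Z - m - np.log(np.exp(Z - m).sum(axis=-1, keepdims=True))  # log-softmax
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)
    return -(picked * mask).sum() / mask.sum()   # mean over valid positions only

rng = np.random.default_rng(0)
B, n, V = 2, 4, 5
Z = rng.normal(size=(B, n, V))
targets = np.array([[1, 3, 0, 2],
                    [4, 2, 0, 0]])               # last two entries of row 1 are padding
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]], dtype=float)

loss = masked_next_token_loss(Z, targets, mask)
print(loss)
```

Dividing by `mask.sum()` rather than `B * n` is the point: the second choice would shrink the loss whenever a batch contains more padding.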

6. Inference: Autoregressive Generation

At inference, generation is sequential:

  1. Start with a prompt: tokens x₀, ..., x_{k−1}
  2. Forward pass → logits for position k−1: Z[k−1, :]
  3. Sample or argmax: x_k ~ p(· | Z[k−1, :])
  4. Append x_k to the sequence
  5. Repeat from step 2 with the extended sequence

KV caching (covered in 18-09) makes this efficient by avoiding recomputation of K and V for previous tokens.
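The five steps above can be sketched as a short loop. This version has no KV cache, so it recomputes the full forward pass at every step. `generate`, `toy_logits`, and the lookup table are hypothetical stand-ins chosen so that the greedy output is easy to predict.

```python
import numpy as np

def generate(logits_fn, prompt, max_new_tokens, temperature=0.0, rng=None):
    """Autoregressive decoding. logits_fn(tokens) must return (len(tokens), V) logits.
    temperature == 0 selects argmax (greedy); otherwise sample from the softmax."""
    rng = rng if rng is not None else np.random.default_rng()
    tokens = list(prompt)                        # step 1: start from the prompt
    for _ in range(max_new_tokens):
        z = logits_fn(np.array(tokens))[-1]      # step 2: logits at the last position
        if temperature == 0.0:
            nxt = int(np.argmax(z))              # step 3 (greedy)
        else:
            p = np.exp((z - z.max()) / temperature)
            p /= p.sum()
            nxt = int(rng.choice(len(p), p=p))   # step 3 (sampling)
        tokens.append(nxt)                       # step 4: extend the sequence
    return tokens                                # step 5 is the loop itself

# Hypothetical stand-in model: logits depend only on the current token and
# strongly favour (token + 1) mod V. A real decoder would go here instead.
V = 5
table = np.zeros((V, V))
for i in range(V):
    table[i, (i + 1) % V] = 5.0

def toy_logits(tokens):
    return table[tokens]

print(generate(toy_logits, prompt=[0, 1], max_new_tokens=4))  # [0, 1, 2, 3, 4, 0]
```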

7. Parameter Count (GPT-style)

For a decoder-only Transformer:
- Embedding: |V|·d
- Per block: attention (4d² for Q, K, V, O projections) + FFN (8d² typically) + 2 RMSNorm (2d) ≈ 12d² per block
- Output: d·|V| (or tied with embedding: 0 additional)
- Total: |V|·d + L·12d² (+ d·|V| if not tied)

For GPT-3 (175B): d = 12288, L = 96, |V| = 50257
Total ≈ 50K·12K + 96·12·(12K)² ≈ 0.6B + 96·12·151M ≈ 0.6B + 174B ≈ 175B ✓
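The estimate is easy to turn into a helper. The sketch below follows the formula above and ignores biases and any learned positional embeddings. The function name and the `ffn_mult` parameter (FFN hidden width as a multiple of d) are assumptions.

```python
def decoder_param_count(d, n_layers, vocab_size, tied=True, ffn_mult=4):
    """Approximate parameter count of a GPT-style decoder (weights only, no biases)."""
    embedding = vocab_size * d
    attention = 4 * d * d                  # Q, K, V, O projections
    ffn = 2 * ffn_mult * d * d             # two matrices of size d x (ffn_mult * d)
    norms = 2 * d                          # two RMSNorm gain vectors
    per_block = attention + ffn + norms
    output = 0 if tied else d * vocab_size
    return embedding + n_layers * per_block + output

gpt3 = decoder_param_count(d=12288, n_layers=96, vocab_size=50257)
print(f"{gpt3 / 1e9:.1f}B")   # roughly 174.6B
```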



Pitfalls

⚠️ Pitfall 1: Thinking teacher forcing leaks future information. Teacher forcing feeds ground-truth tokens, but the CAUSAL MASK still prevents position i from attending to positions > i. The model sees the ground-truth prefix, not the future. One forward pass computes predictions for ALL positions simultaneously while respecting causality.

⚠️ Pitfall 2: Forgetting the n in the loss denominator. L = −(1/n) Σ log p_t(x_{t+1}). For variable-length sequences, the mean should be over VALID tokens only, not including padding. Averaging over padded positions as well biases the loss.

⚠️ Pitfall 3: Confusing inference with training. At inference, generation is SEQUENTIAL (each step feeds the previous output back as input). Training is PARALLEL (all positions at once). The model architecture is the same; teacher forcing is the trick that makes parallel training possible.


Key Terms

Worked Examples

Example 1: Causal Mask for n=4

Problem: Write out the full S, M, S_masked, and A matrices for a 4-token sequence with causal masking. Use example scores.

Solution:

Let scores S = QKᵀ/√d_k:

S = [[2.0,  1.0,  0.5, −0.3],
     [1.5,  3.0,  1.2,  0.1],
     [0.8,  2.0,  2.5,  1.0],
     [0.2,  0.5,  1.8,  3.0]]

Mask M (0 = attend, −∞ = masked):

M = [[0,   −∞, −∞, −∞],
     [0,   0,  −∞, −∞],
     [0,   0,  0,  −∞],
     [0,   0,  0,  0]]

S_masked = S + M:

S_masked = [[2.0,  −∞,  −∞,  −∞],
            [1.5, 3.0,  −∞,  −∞],
            [0.8, 2.0, 2.5,  −∞],
            [0.2, 0.5, 1.8, 3.0]]

A = softmax(S_masked) [row-wise]:

Row 0: softmax([2.0]) = [1.0]
Row 1: softmax([1.5, 3.0]) = [0.182, 0.818]
Row 2: softmax([0.8, 2.0, 2.5]) = [0.102, 0.339, 0.559]
Row 3: softmax([0.2, 0.5, 1.8, 3.0]) = [0.042, 0.057, 0.209, 0.692]

A = [[1.0,   0,     0,     0    ],
     [0.182, 0.818, 0,     0    ],
     [0.102, 0.339, 0.559, 0    ],
     [0.042, 0.057, 0.209, 0.692]]

Notice: upper triangle is all zeros. Row i sums to 1 over columns 0..i only.

Example 2: Cross-Entropy Loss for a Sequence

Problem: For vocabulary V = 4 (tokens a, b, c, d), a sequence [a, b, c, a], and model logits:

Z = [[2.0, 1.0, 0.5, 0.3],   # predict token 1 from token 0
     [0.5, 3.0, 1.0, 0.2],   # predict token 2 from tokens 0,1
     [0.3, 0.8, 2.5, 1.0],   # predict token 3 from tokens 0,1,2
     [2.5, 0.5, 0.3, 0.1]]   # predict token 4 from tokens 0,1,2,3

Compute the average cross-entropy loss. Token indices: a=0, b=1, c=2, d=3.

The model takes [a, b, c, a] and predicts the next token at each position: z₀ → x₁ = b, z₁ → x₂ = c, z₂ → x₃ = a. Row z₃ would predict x₄, which is not in the sequence, so there are only 3 targets and the loss is computed over positions 0, 1, 2.

Solution:

Position 0 (target = b = index 1):
softmax([2.0, 1.0, 0.5, 0.3]): exp: [7.389, 2.718, 1.649, 1.350], sum = 13.106
p₀ = [0.564, 0.207, 0.126, 0.103]
L₀ = −log p₀(b) = −log(0.2074) = 1.573

Position 1 (target = c = index 2):
softmax([0.5, 3.0, 1.0, 0.2]): exp: [1.649, 20.086, 2.718, 1.221], sum = 25.674
p₁ = [0.064, 0.782, 0.106, 0.048]
L₁ = −log p₁(c) = −log(0.1059) = 2.245

Position 2 (target = a = index 0):
softmax([0.3, 0.8, 2.5, 1.0]): exp: [1.350, 2.226, 12.182, 2.718], sum = 18.476
p₂ = [0.073, 0.120, 0.659, 0.147]
L₂ = −log p₂(a) = −log(0.0731) = 2.616

Average loss: L = (1.573 + 2.245 + 2.616) / 3 = 6.434/3 = 2.145

Example 3: Gradient Derivation Check

Problem: Verify that ∂L/∂z_j = p_j − 1_{j=y} for a single position. Use z = [1, 2], y = index 1.

Solution:

z = [1, 2], y = 1
softmax: exp(z) = [2.718, 7.389], sum = 10.107
p = [0.269, 0.731]

L = −log(p₁) = −log(0.731) = 0.313

∂L/∂z₀ = ? Chain rule:
∂L/∂z₀ = ∂L/∂p₀ · ∂p₀/∂z₀ + ∂L/∂p₁ · ∂p₁/∂z₀

∂L/∂p₀ = 0 (loss only depends on p_y)
∂L/∂p₁ = −1/p₁

Softmax derivatives:
∂p₀/∂z₀ = p₀(1−p₀) = 0.269·0.731 = 0.197
∂p₁/∂z₀ = −p₀p₁ = −0.269·0.731 = −0.197

∂L/∂z₀ = 0·0.197 + (−1/0.731)·(−0.197) = 0.269

∂L/∂z₁:
∂p₀/∂z₁ = −p₀p₁ = −0.197
∂p₁/∂z₁ = p₁(1−p₁) = 0.197

∂L/∂z₁ = 0·(−0.197) + (−1/0.731)·(0.197) = −0.269

Check: p − 1_{y=1} = [0.269, 0.731−1] = [0.269, −0.269] ✓



Quiz

Q1: What does the decoder-only architecture refer to in this subject?

A) A Transformer with separate encoder and decoder stacks joined by cross-attention
B) A visual representation of attention weights
C) A Transformer whose self-attention is bidirectional over the whole sequence
D) A stack of causally masked Transformer blocks that maps token indices to next-token logits

Correct: D)

Q2: What is the primary purpose of the causal mask?

A) To exclude padding tokens from the loss
B) To add positional information to the token embeddings
C) To replace the softmax normalization in attention
D) To prevent position i from attending to positions j > i, so each prediction depends only on its prefix

Correct: D)

Q3: Which statement about the output logits Z is TRUE?

A) Z has shape n×d, one hidden vector per position
B) Only the last row of Z is used during training
C) Z is computed before the final RMSNorm
D) Row i of Z contains the logits for predicting the token at position i+1

Correct: D)

Q4: In Practice Problem 1, which expression gives the predicted next-token distribution at position 0?

A) softmax([0.3, 2.0, 0.8])
B) softmax([1.0, 0.3, −0.5])
C) softmax([1.0, 0.5, −0.2])
D) [1.0, 0.5, −0.2], used directly without normalization

Correct: C)

Q5: How are autoregressive generation and KV caching related?

A) KV caching speeds up generation by avoiding recomputation of K and V for previous tokens
B) KV caching replaces the causal mask during generation
C) KV caching is a training-time substitute for teacher forcing
D) They are completely unrelated topics

Correct: A)

Q6: What is a common pitfall when working with the autoregressive language modeling objective?

A) Assuming that teacher forcing leaks future tokens, or averaging the loss over padding positions
B) Using the chain-rule factorization, which only holds approximately
C) Applying the causal mask during training
D) Computing the loss at every position instead of only the last one

Correct: A)

Q7: When should you apply causal (masked) self-attention?

A) Only at inference, since training sees the whole sequence
B) Whenever each position must predict the next token from its prefix alone, during both training and generation
C) Only when sequences in a batch have different lengths
D) Only in the first Transformer block

Correct: B)

Practice Problems

Problem 1

For a 3-token sequence, the logits matrix Z is: [[1.0, 0.5, −0.2], [0.3, 2.0, 0.8], [−0.5, 0.2, 1.5]]. Target next tokens: [token_1, token_2, token_0]. Compute the average cross-entropy loss.

Answer:
Position 0 (target idx 1): softmax([1.0, 0.5, −0.2]); exp: [2.718, 1.649, 0.819], sum = 5.186; p = [0.524, 0.318, 0.158]; L₀ = −log(0.3179) = 1.146
Position 1 (target idx 2): softmax([0.3, 2.0, 0.8]); exp: [1.350, 7.389, 2.226], sum = 10.964; p = [0.123, 0.674, 0.203]; L₁ = −log(0.2030) = 1.595
Position 2 (target idx 0): softmax([−0.5, 0.2, 1.5]); exp: [0.607, 1.221, 4.482], sum = 6.310; p = [0.096, 0.194, 0.710]; L₂ = −log(0.0961) = 2.342
Average L = (1.146 + 1.595 + 2.342)/3 = 1.694

Problem 2

Prove that with causal masking, the attention output at position i does not depend on tokens at positions j > i.

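Answer: The attention output at position i is

o_i = Σ_{j=0}^{n−1} A_{ij} · v_j

With causal masking, A_{ij} = 0 for all j > i (since exp(−∞) = 0). Therefore

o_i = Σ_{j=0}^{i} A_{ij} · v_j

The values v_j in this sum come only from positions 0..i. The weights A_{ij} for j ≤ i are a softmax over the scores s_{i0}, ..., s_{ii}, which involve only q_i and k_0, ..., k_i, so they are also independent of positions j > i. Hence o_i does not depend on any token after position i. Applying the same argument layer by layer extends this to the whole stack: the causal mask enforces the autoregressive property.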

Problem 3

A GPT-style model has d = 2048, L = 24, |V| = 50257. Estimate the total parameter count (without weight tying).

Answer:
Embedding: |V|·d = 50257·2048 ≈ 103M
Per block: ~12d² = 12·2048² = 12·4,194,304 ≈ 50.3M
Total blocks: 24·50.3M ≈ 1,208M
Output: d·|V| ≈ 103M
Total: 103 + 1208 + 103 ≈ 1,414M ≈ 1.4B parameters
This is close in size to GPT-2 XL (1.5B parameters), although GPT-2 XL reaches that size with a different configuration (d = 1600, L = 48).

Problem 4

Explain why teacher forcing allows parallel training of an autoregressive model. What would happen if you tried to backpropagate through free-running generation?

Answer: Teacher forcing feeds ground-truth tokens as input, so the model processes all positions simultaneously. The causal mask handles causality. Loss at all positions is computed in one pass and gradients flow in parallel.

Free-running (feeding the model's own predictions back as input) creates two problems:
  1. Sequential dependency: tokens must be generated one at a time, so there is no parallelism
  2. Non-differentiability: sampling or argmax from the softmax is non-differentiable. Even if you feed the softmax probabilities directly ("soft" forcing), errors compound: a mistake at position 5 corrupts all subsequent predictions.

Teacher forcing is the standard training method; free-running is only used at inference.

Problem 5

For the causal mask, why is it sufficient to add −∞ rather than explicitly zeroing out attention weights?

Answer: Adding −∞ to the pre-softmax scores means exp(−∞) = 0. After softmax normalization, these positions get exactly zero weight, and the remaining weights sum to 1. This is mathematically cleaner than post-softmax masking (which would require re-normalization) and is efficiently implemented as a single addition before softmax.

Summary

  1. The decoder-only Transformer factorizes the joint token probability via the chain rule: p(x₁,...,x_T) = Π p(x_t|x_{<t}), predicting one token at a time
  2. Causal masking adds −∞ to the strictly upper-triangular positions of QKᵀ/√d_k, ensuring position i attends only to positions ≤ i
  3. Training uses teacher forcing with cross-entropy loss L = −(1/n)Σ log p_t(x_{t+1}), computable in parallel due to the causal mask
  4. One forward pass produces predictions for all positions; the gradient ∂L/∂z_j = p_j − 1_{j=y} has the same elegant form as standard cross-entropy
  5. Parameter count: ~12d² per block dominates for large models; GPT-3's 175B ≈ 96·12·(12288)²


Next Steps

Continue to 18-06 (Pre-training Objective Mathematics) for a deeper analysis of the loss function, perplexity, bits-per-byte, and training dynamics.