18-05: Decoder-Only Architecture (GPT-style)
Phase: 18 (Large Language Model Mathematics)
Subject: 18-05
Prerequisites: 18-04 (RoPE), 17-09 (Transformer Architecture), 17-07 (Scaled Dot-Product Attention), 17-05 (Residual Connections), 13-05 (Cross Entropy)
Next subject: 18-06 (Pre-training Objective Mathematics)
Learning Objectives
By the end of this subject, you will be able to:
- Derive the autoregressive generation equation p(x_t | x_{<t}) and explain why it factorizes the joint probability
- Construct the causal (masked) self-attention mechanism mathematically, including the mask matrix and its role in the softmax
- Trace the complete forward pass through a decoder-only Transformer from token indices to next-token logits
- Compute the cross-entropy loss for next-token prediction and derive its gradient with respect to the logits
- Explain how teacher forcing enables parallel training despite the sequential nature of autoregressive generation
Core Content
1. The Autoregressive Language Modeling Objective
A language model defines a probability distribution over sequences of tokens:
p(x₁, x₂, ..., x_T) = Π_{t=1}^{T} p(x_t | x₁, ..., x_{t−1})
This is the chain rule of probability applied to tokens. The factorization is exact and holds for any joint distribution; the autoregressive modeling choice is to parameterize every conditional p(x_t | x_{<t}) with one shared network that sees only the preceding tokens.
The model is a function f_θ that, given a prefix x_{<t} = (x₁, ..., x_{t−1}), outputs a probability distribution over the next token:
p_θ(· | x_{<t}) = softmax(f_θ(x_{<t}))
where f_θ(x_{<t}) ∈ ℝ^V are the logits and V is the vocabulary size.
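The factorization can be checked numerically on a toy case. The sketch below is a minimal illustration, not a model: the logits are made-up numbers standing in for f_θ(x_{<t}), and the helper names `softmax` and `sequence_log_prob` are chosen here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_logits, tokens):
    """Chain rule in log space: log p(x_1..x_T) = sum_t log p(x_t | x_<t)."""
    total = 0.0
    for logits, token in zip(step_logits, tokens):
        total += math.log(softmax(logits)[token])
    return total

# Made-up logits for T = 3 steps over a vocabulary of V = 3 tokens.
step_logits = [[2.0, 0.0, -1.0],
               [0.5, 1.5, 0.0],
               [0.0, 0.0, 3.0]]
tokens = [0, 1, 2]

log_p = sequence_log_prob(step_logits, tokens)
joint_p = math.exp(log_p)
print(f"log p = {log_p:.4f}, p = {joint_p:.4f}")
```

Summing log-probabilities rather than multiplying probabilities is the usual choice, since the product underflows quickly for long sequences.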
The decoder-only architecture processes ALL prefix tokens simultaneously and produces logits for EVERY position's next token in one forward pass:
[z₁, z₂, ..., z_T] = Decoder([x₁, x₂, ..., x_T])
p(· | x_{<t}) = softmax(z_{t−1}) for t ≥ 2
p(x₁) = softmax(z₀) (or from a special start token)
⚠️ THIS IS CRITICAL: One forward pass produces predictions for ALL positions simultaneously. z_i predicts token x_{i+1} because position i has seen only the tokens up to and including position i. This is the power of causal masking combined with teacher forcing.
2. The Causal (Masked) Self-Attention
The key difference from encoder self-attention: position i can only attend to positions j ≤ i.
The causal mask is an upper-triangular matrix with −∞ in positions where j > i:
M_{ij} = 0 if j ≤ i, −∞ if j > i
The attention computation:
S = (QKᵀ) / √d_k
S_masked = S + M
A = softmax(S_masked) [applied row-wise]
Output = A · V
In matrix form for sequence length n:
$S = [[s₀₀, s₀₁, s₀₂, ..., s_{0,n−1}],
[s₁₀, s₁₁, s₁₂, ..., s_{1,n−1}],
...
[s_{n−1,0}, s_{n−1,1}, ..., s_{n−1,n−1}]]
M = [[0, −∞, −∞, ..., −∞],
[0, 0, −∞, ..., −∞],
[0, 0, 0, ..., −∞],
...
[0, 0, 0, ..., 0]]
S_masked = [[s₀₀, −∞, −∞, ..., −∞],
[s₁₀, s₁₁, −∞, ..., −∞],
...
[s_{n−1,0}, ..., s_{n−1,n−1}]]
$
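After softmax (row-wise): exp(−∞) = 0, so any masked position gets zero attention weight. Position i only attends to positions 0...i.
Implementation note: the mask is typically applied by filling masked scores with a large negative number (like −1e9) rather than a literal −∞. This avoids NaNs from −∞ − (−∞) inside a numerically stable softmax if a row ever ends up fully masked, for example when a padding mask is combined with the causal mask.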
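As a concrete illustration of the masking step, here is a minimal pure-Python sketch. It uses a literal −∞, which is safe in this setting because the diagonal is never masked, so every row keeps at least one finite score. The function names and the example scores are illustrative assumptions.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """n x n additive mask: 0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax_row(row):
    """Stable softmax; math.exp(-inf) is 0.0, so masked entries vanish."""
    m = max(row)  # finite, because the diagonal entry is never masked
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """A = softmax(S + M), applied row-wise."""
    n = len(scores)
    mask = causal_mask(n)
    masked = [[scores[i][j] + mask[i][j] for j in range(n)] for i in range(n)]
    return [softmax_row(row) for row in masked]

# Example scores S = QK^T / sqrt(d_k) for n = 3 (made-up numbers).
scores = [[1.0, 2.0, 3.0],
          [0.0, 0.0, 5.0],
          [1.0, 1.0, 1.0]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The large scores above the diagonal (3.0 and 5.0) have no effect on the result: row 0 puts all its weight on position 0, and row 1 splits evenly between positions 0 and 1.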
3. Complete Forward Pass
Starting from token indices t ∈ {1..V}^n:
Step 1: Token Embedding
X₀ = E[t] ∈ ℝ^(n×d) [lookup]
Step 2: Add Positional Encoding (absolute PE only; RoPE is applied inside attention instead, so this step is skipped)
X₀ += PE[0:n] [if using absolute PE; skip for RoPE]
Step 3: L Transformer Blocks
For each block ℓ ∈ {1..L}:
Pre-norm variant (modern):
X_norm = RMSNorm(X_{ℓ−1})
A = CausalMultiHeadAttention(X_norm) [with RoPE applied to Q and K]
X_mid = X_{ℓ−1} + A
X_norm2 = RMSNorm(X_mid)
F = FFN(X_norm2)
X_ℓ = X_mid + F
Step 4: Final Layer Norm
H = RMSNorm(X_L)
Step 5: Output Projection (LM Head)
Z = H · W_outᵀ + b_out ∈ ℝ^(n×V)
Row i of Z contains the logits for predicting the token at position i+1.
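The five steps can be sketched end to end in plain Python. This is a toy under several stated assumptions, not a reference implementation: a single attention head instead of multi-head, no RoPE or other positional encoding, RMSNorm without a learned gain, a ReLU feed-forward layer, no output bias, and random weights. All function and parameter names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Matrix product for matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rmsnorm(x, eps=1e-6):
    """RMSNorm per row, without a learned gain (simplification)."""
    out = []
    for row in x:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out

def causal_attention(x, blk):
    """Single-head causal self-attention; RoPE is omitted in this sketch."""
    q, k, v = matmul(x, blk["wq"]), matmul(x, blk["wk"]), matmul(x, blk["wv"])
    scores = matmul(q, transpose(k))
    scale = math.sqrt(len(q[0]))
    n = len(x)
    weights = []
    for i in range(n):
        # Keeping only columns j <= i is equivalent to adding the -inf mask.
        row = [scores[i][j] / scale for j in range(i + 1)]
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return matmul(matmul(weights, v), blk["wo"])

def ffn(x, blk):
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, blk["w1"])]  # ReLU (assumed)
    return matmul(hidden, blk["w2"])

def init_params(vocab, d, n_layers, seed=0):
    rng = random.Random(seed)
    def mat(rows, cols, scale=0.1):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    blocks = [{"wq": mat(d, d), "wk": mat(d, d), "wv": mat(d, d), "wo": mat(d, d),
               "w1": mat(d, 4 * d), "w2": mat(4 * d, d)} for _ in range(n_layers)]
    return {"embed": mat(vocab, d, 1.0), "blocks": blocks, "w_out": mat(vocab, d)}

def forward(params, token_ids):
    x = [list(params["embed"][t]) for t in token_ids]    # Step 1: embedding lookup
    for blk in params["blocks"]:                          # Step 3: pre-norm blocks
        x = add(x, causal_attention(rmsnorm(x), blk))
        x = add(x, ffn(rmsnorm(x), blk))
    h = rmsnorm(x)                                        # Step 4: final norm
    return matmul(h, transpose(params["w_out"]))          # Step 5: Z = H W_out^T (no bias)

params = init_params(vocab=10, d=8, n_layers=2)
logits = forward(params, [3, 1, 4, 1])
logits_changed = forward(params, [3, 1, 4, 9])  # only the last token differs
print(len(logits), "rows of", len(logits[0]), "logits")
```

Comparing `logits` with `logits_changed` gives a direct causality check: changing the last token leaves rows 0 to 2 unchanged and alters only row 3.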
4. Training: Cross-Entropy Loss
For n input positions holding tokens x₀, ..., x_{n−1}, the model predicts the targets x₁, ..., x_n: position t sees the prefix x₀, ..., x_t and predicts x_{t+1}.
The prediction for position t (predicting token x_{t+1}):
p_t = softmax(Z[t, :]) ∈ ℝ^V
The loss is the mean cross-entropy over all positions:
L = −(1/n) Σ_{t=0}^{n−1} log p_t(x_{t+1}) = −(1/n) Σ_{t=0}^{n−1} log(softmax(Z[t, :])[x_{t+1}])
Expanding:
L = −(1/n) Σ_{t=0}^{n−1} [Z[t, x_{t+1}] − logsumexp(Z[t, :])]
where logsumexp(z) = log(Σ_j exp(z_j)).
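The expanded form translates almost line for line into code. The sketch below is one possible way to write it; the max-shift inside `logsumexp` is a standard stability trick added here, and the logits are made-up values chosen to exercise three easy-to-check cases.

```python
import math

def logsumexp(z):
    """log(sum_j exp(z_j)), shifted by max(z) so large logits do not overflow."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def next_token_loss(Z, targets):
    """Mean over positions of logsumexp(Z[t]) - Z[t][target_t]."""
    per_position = [logsumexp(row) - row[y] for row, y in zip(Z, targets)]
    return sum(per_position) / len(per_position), per_position

V = 4
Z = [[0.0, 0.0, 0.0, 0.0],      # uniform logits: loss should equal log(V)
     [10.0, 0.0, 0.0, 0.0],     # confident and correct: loss near 0
     [1000.0, 0.0, 0.0, 0.0]]   # huge logit: a naive exp(1000) would overflow
targets = [2, 0, 0]

loss, per_position = next_token_loss(Z, targets)
print(per_position, loss)
```

The first row is a useful sanity check in practice: a model with uniform predictions has a per-token loss of log V.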
Gradient of loss with respect to logits:
For a single position t with target token y = x_{t+1}, write L_t = −log p_t(y) and z = Z[t, :]:
∂L_t/∂z_j = p_t(j) − 1_{j=y}
This is the fundamental gradient of cross-entropy with softmax: predicted probability minus one-hot target. Because L averages over n positions, the gradient of the full loss is ∂L/∂Z[t, j] = (1/n)(p_t(j) − 1_{j=y}). When the model is confidently correct (p_t(y) ≈ 1), gradients are near zero. When wrong, strong gradients push up the correct logit and push down others.
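One way to build confidence in this formula is to compare it against a finite-difference estimate. The following sketch does that for a single position; the logits, the target index, and the step size `h` are arbitrary choices for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Single-position cross-entropy: -log softmax(z)[y]."""
    return -math.log(softmax(z)[y])

def analytic_grad(z, y):
    """p_j - 1[j == y] for every logit index j."""
    p = softmax(z)
    return [p_j - (1.0 if j == y else 0.0) for j, p_j in enumerate(p)]

def numeric_grad(z, y, h=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for j in range(len(z)):
        up, down = list(z), list(z)
        up[j] += h
        down[j] -= h
        grads.append((loss(up, y) - loss(down, y)) / (2 * h))
    return grads

z = [0.5, -1.0, 2.0]
y = 2
ga = analytic_grad(z, y)
gn = numeric_grad(z, y)
max_diff = max(abs(a - b) for a, b in zip(ga, gn))
print(ga, gn, max_diff)
```

The gradient components always sum to zero, because the probabilities sum to one and the one-hot target does too; only the target component is negative.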
5. Teacher Forcing
During training, we feed the GROUND TRUTH tokens as input, not the model's own predictions. This is called teacher forcing.
Why it works: Consider predicting x₄ given x₁, x₂, x₃. With teacher forcing:
- The model receives the real x₁, x₂, x₃ as input
- It produces predictions for x₂ (from x₁), x₃ (from x₁,x₂), x₄ (from x₁,x₂,x₃)
- Loss is computed at ALL positions simultaneously
- Gradients flow in parallel through all positions
Without teacher forcing (free-running mode):
- The model receives x₁ only
- Generates x̂₂, feeds x̂₂ back, generates x̂₃, etc.
- Sequential, slow, and errors compound
- Gradients must flow through the sampling step (non-differentiable)
Teacher forcing enables parallel training: one forward pass computes all losses. The causal mask ensures causality: position i never sees future tokens, even though the whole sequence is processed at once.
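In code, teacher forcing usually amounts to shifting the token sequence by one position to obtain the targets. The sketch below shows that alignment and lists what each position is allowed to see under the causal mask; the helper names and token ids are illustrative.

```python
def teacher_forcing_pairs(tokens):
    """Inputs are all tokens but the last; targets are the same tokens shifted left."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets

def visible_prefixes(inputs):
    """What each position may attend to under the causal mask."""
    return [inputs[: i + 1] for i in range(len(inputs))]

tokens = [5, 2, 7, 7, 1]
inputs, targets = teacher_forcing_pairs(tokens)
prefixes = visible_prefixes(inputs)
for prefix, target in zip(prefixes, targets):
    print(f"sees {prefix} -> predicts {target}")
```

Every (prefix, target) pair comes from the same ground-truth sequence, so all of them can be scored in a single forward pass.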
Loss masking for variable-length sequences: For a batch of sequences with different lengths, create a boolean mask that zeros out loss for padding positions:
L = −(1/Σ_t m_t) Σ_{t=0}^{n−1} m_t · log p_t(x_{t+1})
where m_t = 1 for valid prediction positions, 0 for padding.
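A minimal sketch of the masked mean follows, for a single sequence whose last position is padding. The logits and targets are made up, and `masked_next_token_loss` is a hypothetical helper name.

```python
import math

def logsumexp(z):
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def masked_next_token_loss(Z, targets, mask):
    """Mean cross-entropy over positions where mask is 1; padding is ignored."""
    total, count = 0.0, 0
    for row, y, m in zip(Z, targets, mask):
        if m:
            total += logsumexp(row) - row[y]
            count += 1
    if count == 0:
        raise ValueError("mask selects no positions")
    return total / count

Z = [[0.0, 0.0, 0.0],    # valid position, uniform logits
     [0.0, 0.0, 0.0],    # valid position, uniform logits
     [9.0, -9.0, 0.0]]   # padding position: its logits must not affect the loss
targets = [1, 2, 0]
mask = [1, 1, 0]

loss = masked_next_token_loss(Z, targets, mask)
unmasked = masked_next_token_loss(Z, targets, [1, 1, 1])
print(loss, unmasked)
```

With the mask applied the result is log 3, the loss of the two valid uniform positions. Without it, the padding row drags the average down and misreports how well the model is doing.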
6. Inference: Autoregressive Generation
At inference, generation is sequential:
1. Start with a prompt: tokens x₀, ..., x_{k−1}
2. Forward pass → logits for position k−1: Z[k−1, :]
3. Sample or argmax: x_k ~ softmax(Z[k−1, :]), or x_k = argmax_j Z[k−1, j]
4. Append x_k to the sequence
5. Repeat from step 2 with the extended sequence
KV caching (covered in 18-09) makes this efficient by avoiding recomputation of K and V for previous tokens.
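The loop itself is short. The sketch below uses greedy (argmax) decoding and, in place of a trained decoder, a hypothetical stand-in `toy_logits` that always favors the token id one higher than the current one. It recomputes the full forward pass at every step, which is exactly the redundancy KV caching removes.

```python
def toy_logits(tokens, vocab_size=5):
    """Hypothetical stand-in for a decoder forward pass: one logit row per position."""
    rows = []
    for tok in tokens:
        row = [0.0] * vocab_size
        row[(tok + 1) % vocab_size] = 5.0  # favor "current token + 1"
        rows.append(row)
    return rows

def generate_greedy(prompt, num_new_tokens, forward=toy_logits):
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        logits = forward(tokens)   # forward pass over the whole sequence so far
        last = logits[-1]          # Z[k-1, :], the only row needed for the next token
        next_token = max(range(len(last)), key=last.__getitem__)  # argmax
        tokens.append(next_token)
    return tokens

generated = generate_greedy([3], 4)
print(generated)  # [3, 4, 0, 1, 2]
```

Replacing the argmax with a draw from softmax(Z[k−1, :]) turns this into sampling; everything else in the loop stays the same.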
7. Parameter Count (GPT-style)
For a decoder-only Transformer:
- Embedding: |V|·d
- Per block: attention (4d² for Q,K,V,O projections) + FFN (8d² typically) + 2 RMSNorm (2d) ≈ 12d² per block
- Output: d·|V| (or tied with embedding: 0 additional)
- Total: |V|·d + L·12d² (+ d·|V| if not tied)
For GPT-3 (175B): d = 12288, L = 96, |V| = 50257 → 50K·12K + 96·12·(12K)² ≈ 0.6B + 96·12·151M ≈ 0.6B + 174B ≈ 175B ✓
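The estimate is easy to wrap in a small helper. The function below is a sketch of the approximation above, not an exact count: it ignores norm parameters, biases, and any positional-embedding table, and `decoder_param_estimate` is a name invented here.

```python
def decoder_param_estimate(d, n_layers, vocab_size, tied=True):
    """Approximate parameter count: |V|*d + L*12*d^2 (+ d*|V| if not tied)."""
    embedding = vocab_size * d
    per_block = 12 * d * d  # 4d^2 attention + 8d^2 FFN; norm parameters ignored
    output = 0 if tied else d * vocab_size
    return embedding + n_layers * per_block + output

gpt3 = decoder_param_estimate(d=12288, n_layers=96, vocab_size=50257)
print(f"{gpt3 / 1e9:.1f}B")  # 174.6B
```

At this scale the blocks account for almost everything: the embedding table contributes about 0.6B of the roughly 175B total.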
Pitfalls
⚠️ Pitfall 1: Thinking teacher forcing leaks future information. Teacher forcing feeds ground-truth tokens, but the CAUSAL MASK still prevents position i from attending to positions > i. The model sees the ground-truth prefix, not the future. One forward pass computes predictions for ALL positions simultaneously while respecting causality.
⚠️ Pitfall 2: Forgetting the n in the loss denominator. L = −(1/n) Σ log p_t(x_{t+1}). For variable-length sequences, the mean should be over VALID tokens only, not including padding. Dividing by the padded length instead scales the loss and its gradients down for short sequences.
⚠️ Pitfall 3: Confusing inference with training. At inference, generation is SEQUENTIAL (each step feeds the previous output back as input). Training is PARALLEL (all positions at once). The model architecture is the same; teacher forcing is the trick that makes parallel training possible.
Key Terms
- KV caching
- Output
- The causal mask
- The decoder-only architecture
Worked Examples
Example 1: Causal Mask for n=4
Problem: Write out the full S, M, S_masked, and A matrices for a 4-token sequence with causal masking. Use example scores.
Solution:
Let scores S = QKแต/โd_k:
$S = [[2.0, 1.0, 0.5, −0.3],
[1.5, 3.0, 1.2, 0.1],
[0.8, 2.0, 2.5, 1.0],
[0.2, 0.5, 1.8, 3.0]]
$
Mask M (0 = attend, −∞ = masked):
$M = [[0, −∞, −∞, −∞],
[0, 0, −∞, −∞],
[0, 0, 0, −∞],
[0, 0, 0, 0]]
$
S_masked = S + M:
$S_masked = [[2.0, −∞, −∞, −∞],
[1.5, 3.0, −∞, −∞],
[0.8, 2.0, 2.5, −∞],
[0.2, 0.5, 1.8, 3.0]]
$
A = softmax(S_masked) [row-wise]:
Row 0: softmax([2.0]) = [1.0]
Row 1: softmax([1.5, 3.0]) = [0.182, 0.818]
Row 2: softmax([0.8, 2.0, 2.5]) = [0.102, 0.339, 0.559]
Row 3: softmax([0.2, 0.5, 1.8, 3.0]) = [0.042, 0.057, 0.209, 0.692]
$A = [[1.0, 0, 0, 0 ],
[0.182, 0.818, 0, 0 ],
[0.102, 0.339, 0.559, 0 ],
[0.042, 0.057, 0.209, 0.692]]
$
Notice: upper triangle is all zeros. Row i sums to 1 over columns 0..i only.
Example 2: Cross-Entropy Loss for a Sequence
Problem: For vocabulary V = 4 (tokens a, b, c, d), a sequence [a, b, c, a], and model logits:
Z = [[2.0, 1.0, 0.5, 0.3], # predict token 1 from token 0
[0.5, 3.0, 1.0, 0.2], # predict token 2 from tokens 0,1
[0.3, 0.8, 2.5, 1.0], # predict token 3 from tokens 0,1,2
[2.5, 0.5, 0.3, 0.1]] # predict token 4 from tokens 0,1,2,3
Compute the average cross-entropy loss. Token indices: a=0, b=1, c=2, d=3.
The model takes [a, b, c, a] and predicts the next token at each position: z₀→x₁=b, z₁→x₂=c, z₂→x₃=a. Row z₃ would predict x₄, but the sequence ends there, so it has no target. The loss is therefore computed over positions 0, 1, 2 only.
Solution:
Position 0 (target = b = index 1):
softmax([2.0, 1.0, 0.5, 0.3]): exp: [7.389, 2.718, 1.649, 1.350], sum = 13.106
p₀ = [0.564, 0.207, 0.126, 0.103]
L₀ = −log(0.2074) = 1.573
Position 1 (target = c = index 2):
softmax([0.5, 3.0, 1.0, 0.2]): exp: [1.649, 20.086, 2.718, 1.221], sum = 25.674
p₁ = [0.064, 0.782, 0.106, 0.048]
L₁ = −log(0.1059) = 2.245
Position 2 (target = a = index 0):
softmax([0.3, 0.8, 2.5, 1.0]): exp: [1.350, 2.226, 12.182, 2.718], sum = 18.476
p₂ = [0.073, 0.120, 0.659, 0.147]
L₂ = −log(0.0731) = 2.616
Average loss: L = (1.573 + 2.245 + 2.616) / 3 = 6.434/3 = 2.145
(The logarithms use the target probability to four decimals; taking the log of the three-decimal values shifts the last digit.)
Example 3: Gradient Derivation Check
Problem: Verify that ∂L/∂z_j = p_j − 1_{j=y} for a single position. Use z = [1, 2], y = index 1.
Solution:
z = [1, 2], y = 1
softmax: exp(z) = [2.718, 7.389], sum = 10.107
p = [0.269, 0.731]
L = −log(p₁) = −log(0.731) = 0.313
∂L/∂z₀ = ? Chain rule: ∂L/∂z₀ = ∂L/∂p₀ · ∂p₀/∂z₀ + ∂L/∂p₁ · ∂p₁/∂z₀
∂L/∂p₀ = 0 (loss only depends on p_y)
∂L/∂p₁ = −1/p₁
Softmax derivatives:
∂p₀/∂z₀ = p₀(1−p₀) = 0.269·0.731 = 0.197
∂p₁/∂z₀ = −p₀p₁ = −0.269·0.731 = −0.197
∂L/∂z₀ = 0·0.197 + (−1/0.731)·(−0.197) = 0.269
∂L/∂z₁:
∂p₀/∂z₁ = −p₀p₁ = −0.197
∂p₁/∂z₁ = p₁(1−p₁) = 0.197
∂L/∂z₁ = 0·(−0.197) + (−1/0.731)·(0.197) = −0.269
Check: p − 1_{y=1} = [0.269, 0.731−1] = [0.269, −0.269] ✓
Quiz
Q1: What does the concept of The decoder-only architecture primarily refer to in this subject?
A) A historical anecdote about The decoder-only architecture B) A visual representation of The decoder-only architecture C) A computational error related to The decoder-only architecture D) The definition and application of The decoder-only architecture
Correct: D)
- If you chose A: This is incorrect. The decoder-only architecture is defined as: the definition and application of the decoder-only architecture. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. The decoder-only architecture is defined as: the definition and application of the decoder-only architecture. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. The decoder-only architecture is defined as: the definition and application of the decoder-only architecture. The other options describe different aspects that are not the primary focus.
- If you chose D: The decoder-only architecture is defined as: the definition and application of the decoder-only architecture. The other options describe different aspects that are not the primary focus. Correct!
Q2: What is the primary purpose of The causal mask?
A) It is used only in advanced research contexts B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It prevents each position from attending to later positions, so every prediction depends only on the prefix
Correct: D)
- If you chose A: This is incorrect. The causal mask serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. The causal mask serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. The causal mask serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: The causal mask serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
Q3: Which statement about Output is TRUE?
A) Output is an advanced topic beyond this subject's scope B) Output is mentioned only as a historical footnote C) Output is not related to this subject D) Output is a fundamental concept covered in this subject
Correct: D)
- If you chose A: This is incorrect. Output is a fundamental concept covered in this subject. This subject covers Output as part of its core content.
- If you chose B: This is incorrect. Output is a fundamental concept covered in this subject. This subject covers Output as part of its core content.
- If you chose C: This is incorrect. Output is a fundamental concept covered in this subject. This subject covers Output as part of its core content.
- If you chose D: Output is a fundamental concept covered in this subject. This subject covers Output as part of its core content. Correct!
Q4: Based on the worked examples in this subject, what is the correct result?
A) The inverse of the correct answer B) An unrelated numerical value C) softmax([1.0, 0.5, −0.2]) D) A different result from a common mistake
Correct: C)
- If you chose A: This is incorrect. The worked examples show that the result is softmax([1.0, 0.5, −0.2]). The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is softmax([1.0, 0.5, −0.2]). The other options represent common errors.
- If you chose C: The worked examples show that the result is softmax([1.0, 0.5, −0.2]). The other options represent common errors. Correct!
- If you chose D: This is incorrect. The worked examples show that the result is softmax([1.0, 0.5, −0.2]). The other options represent common errors.
Q5: How are Output and KV caching related?
A) Output and KV caching are closely related concepts B) Output is a special case of KV caching C) Output is the inverse of KV caching D) Output and KV caching are completely unrelated topics
Correct: A)
- If you chose A: Both Output and KV caching are covered in this subject as interconnected topics. Correct!
- If you chose B: This is incorrect. Both Output and KV caching are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Output and KV caching are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Output and KV caching are covered in this subject as interconnected topics.
Q6: What is a common pitfall when working with The Autoregressive Language Modeling Objective?
A) A common mistake is confusing The Autoregressive Language Modeling Objective with a similar concept B) The main error with The Autoregressive Language Modeling Objective is using it when it is not needed C) The Autoregressive Language Modeling Objective has no common misconceptions D) The Autoregressive Language Modeling Objective is always computed the same way in all contexts
Correct: A)
- If you chose A: Students often confuse The Autoregressive Language Modeling Objective with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose B: This is incorrect. Students often confuse The Autoregressive Language Modeling Objective with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse The Autoregressive Language Modeling Objective with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse The Autoregressive Language Modeling Objective with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply The Causal (Masked) Self-Attention?
A) Avoid The Causal (Masked) Self-Attention unless explicitly instructed B) Apply The Causal (Masked) Self-Attention to solve problems in this subject's domain C) The Causal (Masked) Self-Attention is not practically useful D) Use The Causal (Masked) Self-Attention only in pure mathematics contexts
Correct: B)
- If you chose A: This is incorrect. The Causal (Masked) Self-Attention is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: The Causal (Masked) Self-Attention is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose C: This is incorrect. The Causal (Masked) Self-Attention is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. The Causal (Masked) Self-Attention is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
Problem 1
For a 3-token sequence, the logits matrix Z is: [[1.0, 0.5, −0.2], [0.3, 2.0, 0.8], [−0.5, 0.2, 1.5]] Target next tokens: [token_1, token_2, token_0]. Compute the average cross-entropy loss.
Answer
Position 0 (target idx 1): softmax([1.0, 0.5, −0.2])
exp: [2.718, 1.649, 0.819], sum = 5.186
p = [0.524, 0.318, 0.158]
L₀ = −log(0.318) = 1.146
Position 1 (target idx 2): softmax([0.3, 2.0, 0.8])
exp: [1.350, 7.389, 2.226], sum = 10.965
p = [0.123, 0.674, 0.203]
L₁ = −log(0.203) = 1.595
Position 2 (target idx 0): softmax([−0.5, 0.2, 1.5])
exp: [0.607, 1.221, 4.482], sum = 6.310
p = [0.096, 0.194, 0.710]
L₂ = −log(0.0961) = 2.342
Average L = (1.146 + 1.595 + 2.342)/3 = 1.694
Problem 2
Prove that with causal masking, the attention output at position i does not depend on tokens at positions j > i.
Answer
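The attention output at position i is: o_i = Σ_{j=0}^{n−1} A_{ij} · v_j
With causal masking, A_{ij} = 0 for all j > i (since exp(−∞) = 0). Therefore: o_i = Σ_{j=0}^{i} A_{ij} · v_j
This sum involves only positions 0..i. The remaining weights A_{ij} (j ≤ i) are a softmax over the scores q_i·k_j/√d_k with j ≤ i, and each of q_j, k_j, v_j is computed from position j's input alone, so nothing in the sum depends on tokens at positions j > i. Applying the same argument layer by layer extends the result to the whole stack. The causal mask enforces the autoregressive property.
Problem 3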
A GPT-style model has d = 2048, L = 24, |V| = 50257. Estimate the total parameter count (without weight tying).
Answer
Embedding: |V|·d = 50257·2048 ≈ 103M
Per block: ~12d² = 12·2048² = 12·4,194,304 ≈ 50.3M
Total blocks: 24·50.3M ≈ 1,208M
Output: d·|V| ≈ 103M
Total: 103 + 1208 + 103 ≈ 1,414M = 1.4B parameters
This is comparable in size to GPT-2 XL (1.5B parameters), although GPT-2 XL itself uses different dimensions (d = 1600, L = 48).
Problem 4
Explain why teacher forcing allows parallel training of an autoregressive model. What would happen if you tried to backpropagate through free-running generation?
Answer
Teacher forcing feeds ground-truth tokens as input, so the model processes all positions simultaneously. The causal mask handles causality. Loss at all positions is computed in one pass and gradients flow in parallel.
Free-running (feeding the model's own predictions back as input) creates two problems:
1. Sequential dependency: must generate tokens one at a time, no parallelism
2. Non-differentiability: sampling or argmax from softmax is non-differentiable. Even if you feed the softmax probabilities forward directly ("soft" forcing) to keep the computation differentiable, errors compound: a mistake at position 5 corrupts all subsequent predictions.
Teacher forcing is the standard training method; free-running is only used at inference.
Problem 5
For the causal mask, why is it sufficient to add −∞ rather than explicitly zeroing out attention weights?
Answer
Adding −∞ to the pre-softmax scores means exp(−∞) = 0. After softmax normalization, these positions get exactly zero weight, and the remaining weights sum to 1. This is mathematically cleaner than post-softmax masking (which would require re-normalization) and is efficiently implemented as a single addition before softmax.
Summary
- The decoder-only Transformer factorizes the joint token probability via the chain rule: p(x₁,...,x_T) = Π p(x_t|x_{<t}), predicting one token at a time
- Causal masking adds −∞ to upper-triangular positions of QKᵀ, ensuring position i attends only to positions ≤ i
- Training uses teacher forcing with cross-entropy loss L = −(1/n)Σ log p_t(x_{t+1}), computable in parallel due to the causal mask
- One forward pass produces predictions for all positions; the gradient ∂L/∂z_j = p_j − 1_{j=y} has the same elegant form as standard cross-entropy
- Parameter count: ~12d² per block dominates for large models; GPT-3's 175B ≈ 96·12·(12288)²
Next Steps
Continue to 18-06 (Pre-training Objective Mathematics) for a deeper analysis of the loss function, perplexity, bits-per-byte, and training dynamics.