
18-05 — Decoder-Only Architecture (GPT-style)

Phase: 18 — Large Language Model Mathematics
Subject: 18-05
Prerequisites: 18-04 (RoPE), 17-09 (Transformer Architecture), 17-07 (Scaled Dot-Product Attention), 17-05 (Residual Connections), 13-05 (Cross Entropy)
Next subject: 18-06 — Pre-training Objective Mathematics

Learning Objectives

By the end of this subject, you will be able to:

Derive the autoregressive generation equation p(x_t | x_{<t}) and explain why it factorizes the joint probability
Construct the causal (masked) self-attention mechanism mathematically, including the mask matrix and its role in the softmax
Trace the complete forward pass through a decoder-only Transformer from token indices to next-token logits
Compute the cross-entropy loss for next-token prediction and derive its gradient with respect to the logits
Explain how teacher forcing enables parallel training despite the sequential nature of autoregressive generation

Core Content

1. The Autoregressive Language Modeling Objective

A language model defines a probability distribution over sequences of tokens:

p(x₁, x₂, ..., x_T) = Π_{t=1}^{T} p(x_t | x₁, ..., x_{t−1})

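This is the chain rule of probability applied to tokens. It is an exact identity, not an approximation. The autoregressive modeling choice is to parameterize every conditional p(x_t | x_{<t}) with one shared network that reads only the preceding tokens.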

The model is a function f_θ that, given a prefix x_{<t} = (x₁, ..., x_{t−1}), outputs a probability distribution over the next token:

p_θ(· | x_{<t}) = softmax(f_θ(x_{<t}))

where f_θ(x_{<t}) ∈ ℝ^V are the logits.

The decoder-only architecture processes ALL prefix tokens simultaneously and produces logits for EVERY position's next token in one forward pass:

[z₁, z₂, ..., z_T] = Decoder([x₁, x₂, ..., x_T])
p(x_t | x_{<t}) = softmax(z_{t−1}) for t ≥ 2
p(x₁) = softmax(z₀), where z₀ is the output at a special start token x₀ prepended to the sequence (without a start token, x₁ is treated as given)

⚠️ THIS IS CRITICAL: One forward pass produces predictions for ALL positions simultaneously. z_i predicts token x_{i+1} because position i has seen only the tokens up to and including x_i. This is the power of causal masking combined with teacher forcing.
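
The factorization can be checked numerically. The sketch below is illustrative only: it uses 0-indexed tokens (as the later sections do), the helper names `log_softmax` and `sequence_log_prob` are made up for this example, and the logits are toy values rather than the output of a trained model. It sums log p(x_t | x_{<t}) over the sequence, treating the first token as given.

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax of a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def sequence_log_prob(logits, tokens):
    """Sum of log p(x_t | x_<t) for t >= 1, with 0-indexed tokens.

    logits[i] is the row produced at position i; it scores the token
    at position i + 1. The first token is treated as given.
    """
    total = 0.0
    for i in range(len(tokens) - 1):
        total += log_softmax(logits[i])[tokens[i + 1]]
    return total

# Toy values: vocabulary of 3, sequence of 3 tokens.
logits = [[2.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0]]  # last row would score a 4th token; unused here
tokens = [0, 0, 1]
print(sequence_log_prob(logits, tokens))
```

Working in log space turns the product over positions into a sum, which is the same quantity (up to sign and averaging) that the training loss uses.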

2. The Causal (Masked) Self-Attention

The key difference from encoder self-attention: position i can only attend to positions j ≤ i.

The causal mask M has 0 on and below the diagonal and −∞ strictly above it (the positions where j > i):

M_{ij} = 0 if j ≤ i, −∞ if j > i

The attention computation:

S = (QKᵀ) / √d_k
S_masked = S + M
A = softmax(S_masked) [applied row-wise]
Output = A · V

In matrix form for sequence length n:

S = [[s_{0,0},   s_{0,1},   s_{0,2}, ..., s_{0,n−1}],
     [s_{1,0},   s_{1,1},   s_{1,2}, ..., s_{1,n−1}],
     ...
     [s_{n−1,0}, s_{n−1,1}, ...,          s_{n−1,n−1}]]

M = [[0,   −∞, −∞, ..., −∞],
     [0,   0,  −∞, ..., −∞],
     [0,   0,  0,  ..., −∞],
     ...
     [0,   0,  0,  ..., 0]]

S_masked = [[s_{0,0},   −∞,      −∞, ..., −∞],
            [s_{1,0},   s_{1,1}, −∞, ..., −∞],
            ...
            [s_{n−1,0}, s_{n−1,1}, ..., s_{n−1,n−1}]]

After softmax (row-wise): exp(−∞) = 0, so any masked position gets zero attention weight. Position i only attends to positions 0...i.

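Implementation note: Many implementations fill masked scores with a large negative finite number (like −1e9, or the most negative finite value of the score dtype) rather than actual −∞. After the softmax the masked weights still underflow to 0, and a finite fill avoids NaN if a row is ever fully masked (for example when padding is masked as well). Note that −1e9 is not representable in float16, whose largest finite magnitude is 65504.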
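
The masking steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production kernel: the function names are made up for this example, it uses a literal −∞ (safe here because the diagonal is never masked), and the score matrix is the one used in Worked Example 1.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 if j <= i, -inf if j > i."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)  # finite, because the diagonal entry is never masked
    exps = [math.exp(v - m) for v in row]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(S):
    """Row-wise softmax of S + M."""
    n = len(S)
    M = causal_mask(n)
    return [softmax([S[i][j] + M[i][j] for j in range(n)]) for i in range(n)]

S = [[2.0, 1.0, 0.5, -0.3],
     [1.5, 3.0, 1.2, 0.1],
     [0.8, 2.0, 2.5, 1.0],
     [0.2, 0.5, 1.8, 3.0]]
A = causal_attention_weights(S)
for row in A:
    print([round(a, 3) for a in row])
```

Each printed row sums to 1 and has zeros to the right of the diagonal, which is the property the rest of this subject relies on.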

3. Complete Forward Pass

Starting from token indices t ∈ {0..V−1}^n (0-indexed, matching the worked examples):

Step 1: Token Embedding

X₀ = E[t] ∈ ℝ^(n×d) [lookup]

Step 2: Add Positional Encoding (absolute PE only; with RoPE, position is injected inside attention and this step is skipped)

X₀ += PE[0:n] [if using absolute PE; skip for RoPE]

Step 3: L Transformer Blocks

For each block ℓ ∈ {1..L}:

Pre-norm variant (modern):

X_norm = RMSNorm(X_{ℓ−1})
A = CausalMultiHeadAttention(X_norm) [with RoPE inside Q,K projections]
X_mid = X_{ℓ−1} + A
X_norm2 = RMSNorm(X_mid)
F = FFN(X_norm2)
X_ℓ = X_mid + F

Step 4: Final Layer Norm

H = RMSNorm(X_L)

Step 5: Output Projection (LM Head)

Z = H · W_outᵀ + b_out ∈ ℝ^(n×V)

Row i of Z contains the logits for predicting the token at position i+1.
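
Steps 1 to 5 can be traced end to end with a toy model. The sketch below makes several simplifying assumptions that are not part of the text above: a single attention head, no positional encoding (neither RoPE nor absolute), RMSNorm without a learned gain, a ReLU feed-forward layer with hidden width 4d, no output bias, and small random weights. All function and parameter names are hypothetical.

```python
import math
import random

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def rmsnorm(X, eps=1e-6):
    # RMSNorm with the learned gain omitted (assumption, for brevity).
    out = []
    for row in X:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(X, blk):
    # Single head, no RoPE (both are simplifications).
    Q, K, V = matmul(X, blk["Wq"]), matmul(X, blk["Wk"]), matmul(X, blk["Wv"])
    scale = math.sqrt(len(Q[0]))
    S = matmul(Q, transpose(K))
    n = len(X)
    A = [softmax([S[i][j] / scale if j <= i else float("-inf") for j in range(n)])
         for i in range(n)]
    return matmul(matmul(A, V), blk["Wo"])

def ffn(X, blk):
    hidden = [[max(0.0, v) for v in row] for row in matmul(X, blk["W1"])]  # ReLU
    return matmul(hidden, blk["W2"])

def init_params(vocab_size, d, n_layers, seed=0):
    rng = random.Random(seed)
    def rand(r, c):
        return [[rng.gauss(0.0, 0.1) for _ in range(c)] for _ in range(r)]
    blocks = [{"Wq": rand(d, d), "Wk": rand(d, d), "Wv": rand(d, d), "Wo": rand(d, d),
               "W1": rand(d, 4 * d), "W2": rand(4 * d, d)} for _ in range(n_layers)]
    return {"E": rand(vocab_size, d), "blocks": blocks, "W_out": rand(vocab_size, d)}

def forward(tokens, params):
    X = [list(params["E"][t]) for t in tokens]      # Step 1: embedding lookup
    for blk in params["blocks"]:                    # Step 3: pre-norm blocks
        X = add(X, causal_attention(rmsnorm(X), blk))
        X = add(X, ffn(rmsnorm(X), blk))
    H = rmsnorm(X)                                  # Step 4: final norm
    return matmul(H, transpose(params["W_out"]))    # Step 5: Z = H · W_outᵀ

params = init_params(vocab_size=7, d=8, n_layers=2)
Z = forward([3, 1, 4, 1], params)
print(len(Z), len(Z[0]))  # n rows, V columns
```

A useful sanity check on any implementation of this kind is to change only the last input token and confirm that every earlier row of Z is unchanged.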

4. Training: Cross-Entropy Loss

For an input of n tokens x₀, ..., x_{n−1}, the model predicts the shifted targets x₁, ..., x_n, so each training sequence supplies n+1 tokens.

The prediction for position t (predicting token x_{t+1}):

p_t = softmax(Z[t, :]) ∈ ℝ^V

The loss is the mean cross-entropy over all positions:

L = −(1/n) Σ_{t=0}^{n−1} log p_t(x_{t+1}) = −(1/n) Σ_{t=0}^{n−1} log(softmax(Z[t, :])[x_{t+1}])

Expanding:

L = −(1/n) Σ_{t=0}^{n−1} [Z[t, x_{t+1}] − logsumexp(Z[t, :])]

where logsumexp(z) = log(Σ_j exp(z_j)).
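
The expanded form translates directly into code. The following is a small sketch with hypothetical helper names; it subtracts the row maximum inside `logsumexp` for numerical stability, and it reuses the first three logit rows and targets from Worked Example 2.

```python
import math

def logsumexp(z):
    """log(sum_j exp(z_j)), computed stably by factoring out max(z)."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def next_token_loss(Z, targets):
    """Mean over positions of logsumexp(Z[t]) - Z[t][target_t]."""
    losses = [logsumexp(row) - row[y] for row, y in zip(Z, targets)]
    return sum(losses) / len(losses), losses

Z = [[2.0, 1.0, 0.5, 0.3],
     [0.5, 3.0, 1.0, 0.2],
     [0.3, 0.8, 2.5, 1.0]]
targets = [1, 2, 0]  # b, c, a
mean_loss, per_position = next_token_loss(Z, targets)
print(round(mean_loss, 3), [round(l, 3) for l in per_position])
```

Computing the loss from logits this way avoids forming the probabilities explicitly, so no precision is lost to an intermediate rounding or underflow of p_t.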

Gradient of loss with respect to logits:

For a single position t with target token y = x_{t+1} and per-position loss L_t = −log p_t(y) (the mean loss L scales each of these gradients by 1/n):

∂L_t/∂z_j = p_t(j) − 1_{j=y}

This is the fundamental gradient of cross-entropy with softmax: predicted probability minus one-hot target. When the model is confidently correct (p_t(y) ≈ 1), gradients are near zero. When wrong, strong gradients push up the correct logit and push down others.
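
A finite-difference check is a quick way to confirm the formula. This sketch assumes nothing beyond the standard library; the function names and the step size h are choices made for this example.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Per-position cross-entropy: -log softmax(z)[y]."""
    return -math.log(softmax(z)[y])

def analytic_grad(z, y):
    """p_j - 1_{j=y}, the closed form stated above."""
    p = softmax(z)
    return [p[j] - (1.0 if j == y else 0.0) for j in range(len(z))]

def numeric_grad(z, y, h=1e-6):
    """Central differences, one logit at a time."""
    grad = []
    for j in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        grad.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * h))
    return grad

z, y = [1.0, 2.0], 1
print([round(g, 3) for g in analytic_grad(z, y)])
print([round(g, 3) for g in numeric_grad(z, y)])
```

The two printed vectors should agree to the displayed precision, and the entries of the gradient always sum to zero because the probabilities sum to one.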

5. Teacher Forcing

During training, we feed the GROUND TRUTH tokens as input, not the model's own predictions. This is called teacher forcing.

Why it works: Consider predicting x₃ given x₀, x₁, x₂. With teacher forcing:
- The model receives the real x₀, x₁, x₂ as input
- It produces predictions for x₁ (from x₀), x₂ (from x₀,x₁), x₃ (from x₀,x₁,x₂)
- Loss is computed at ALL positions simultaneously
- Gradients flow in parallel through all positions

Without teacher forcing (free-running mode):
- The model receives x₀ only
- Generates x̂₁, feeds x̂₁ back, generates x̂₂, etc.
- Sequential, slow, and errors compound
- Gradients must flow through the sampling step (non-differentiable)

Teacher forcing enables parallel training: one forward pass computes all losses. The causal mask ensures causality — position i never sees future tokens, even though the whole sequence is processed at once.
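
In code, teacher forcing amounts to shifting one ground-truth sequence by a single position. The sketch below uses hypothetical helper names and string placeholders for tokens; `prefixes_seen` only illustrates what the causal mask lets each position read, it is not something a real model computes explicitly.

```python
def teacher_forcing_pairs(tokens):
    """Split one ground-truth sequence into model inputs and next-token targets."""
    inputs = tokens[:-1]   # x_0 .. x_{n-1}
    targets = tokens[1:]   # x_1 .. x_n
    return inputs, targets

def prefixes_seen(inputs):
    """With a causal mask, position i can read inputs[0..i] and nothing later."""
    return [inputs[: i + 1] for i in range(len(inputs))]

tokens = ["x0", "x1", "x2", "x3"]
inputs, targets = teacher_forcing_pairs(tokens)
for prefix, target in zip(prefixes_seen(inputs), targets):
    print(prefix, "->", target)
```

All three (prefix, target) pairs come from a single forward pass over `inputs`, which is what makes training parallel across positions.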

Loss masking for variable-length sequences: For a batch of sequences with different lengths, create a boolean mask that zeros out loss for padding positions:

L = −(1/Σ m_t) Σ_{t=0}^{n−1} m_t · log p_t(x_{t+1})

where m_t = 1 for valid prediction positions, 0 for padding.
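
A minimal sketch of the masked mean follows, assuming a single padded sequence, a mask given as a list of 0/1 flags, and an arbitrary target id at the padded position. The function names are hypothetical, and raising an error on an all-zero mask is a design choice made for this example.

```python
import math

def logsumexp(z):
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def masked_loss(Z, targets, mask):
    """Mean cross-entropy over positions where mask is 1."""
    total, count = 0.0, 0
    for row, y, m in zip(Z, targets, mask):
        if m:
            total += logsumexp(row) - row[y]
            count += 1
    if count == 0:
        raise ValueError("mask selects no positions")
    return total / count

Z = [[2.0, 1.0, 0.5, 0.3],
     [0.5, 3.0, 1.0, 0.2],
     [0.0, 0.0, 0.0, 0.0]]   # last position is padding
targets = [1, 2, 0]           # target at the padded position is ignored
mask = [1, 1, 0]
print(masked_loss(Z, targets, mask))
```

Dividing by Σ m_t rather than by n keeps the loss comparable between sequences that carry different amounts of padding.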

6. Inference: Autoregressive Generation

At inference, generation is sequential:

1. Start with a prompt: tokens x₀, ..., x_{k−1}
2. Forward pass → logits for position k−1: Z[k−1, :]
3. Sample or argmax: x_k ~ softmax(Z[k−1, :]), or x_k = argmax_j Z[k−1, j]
4. Append x_k to the sequence
5. Repeat from step 2 with the extended sequence

KV caching (covered in 18-09) makes this efficient by avoiding recomputation of K and V for previous tokens.
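
The generation loop can be sketched with greedy (argmax) decoding. Everything here is illustrative: `greedy_generate` and `toy_logits_fn` are hypothetical names, and the toy function is a stand-in for a trained model that simply prefers the token id one higher than the current one. No KV cache is used, so each step recomputes the full forward pass.

```python
def greedy_generate(logits_fn, prompt, max_new_tokens):
    """Repeatedly run the model and append the argmax of the last logit row."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        Z = logits_fn(tokens)   # one logit row per input position
        last = Z[-1]            # row k-1 scores the next token x_k
        next_id = max(range(len(last)), key=last.__getitem__)  # argmax
        tokens.append(next_id)
    return tokens

def toy_logits_fn(tokens, vocab_size=5):
    """Hypothetical stand-in for a model: favours (token + 1) mod V."""
    rows = []
    for t in tokens:
        row = [0.0] * vocab_size
        row[(t + 1) % vocab_size] = 1.0
        rows.append(row)
    return rows

print(greedy_generate(toy_logits_fn, [0, 1], 4))  # [0, 1, 2, 3, 4, 0]
```

Replacing the argmax with a draw from softmax(Z[k−1, :]) gives sampling-based generation; the loop structure stays the same.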

7. Parameter Count (GPT-style)

For a decoder-only Transformer:
- Embedding: |V|·d
- Per block: attention (4d² for Q,K,V,O projections) + FFN (8d² typically) + 2 RMSNorm (2d) ≈ 12d² per block
- Output: d·|V| (or tied with embedding: 0 additional)
- Total: |V|·d + L·12d² (+ d·|V| if not tied)

For GPT-3 (175B): d = 12288, L = 96, |V| = 50257
Total ≈ 50K·12K + 96·12·(12K)² ≈ 0.6B + 96·12·151M ≈ 0.6B + 174B ≈ 175B ✓
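
The estimate is easy to script. This sketch follows the formula above and makes the same simplifications: biases, positional embeddings and the final norm are ignored, and the FFN hidden width is assumed to be 4d (the `ffn_mult` parameter is an assumption of this example, as is the function name).

```python
def estimate_params(d, n_layers, vocab_size, tied=True, ffn_mult=4):
    """Rough parameter count for a GPT-style decoder-only Transformer."""
    embedding = vocab_size * d
    attention = 4 * d * d              # Q, K, V, O projections
    ffn = 2 * ffn_mult * d * d         # two d x (ffn_mult*d) matrices: 8d^2 for ffn_mult=4
    norms = 2 * d                      # two RMSNorm gains per block
    per_block = attention + ffn + norms
    output = 0 if tied else d * vocab_size
    return embedding + n_layers * per_block + output

gpt3 = estimate_params(d=12288, n_layers=96, vocab_size=50257)
print(f"{gpt3 / 1e9:.1f}B")
```

For large d the 12d² term per block dominates, so the embedding and norm terms change the total by well under one percent at this scale.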

Pitfalls

⚠️ Pitfall 1: Thinking teacher forcing leaks future information. Teacher forcing feeds ground-truth tokens, but the CAUSAL MASK still prevents position i from attending to positions > i. The model sees the ground-truth prefix, not the future. The magic is that one forward pass computes predictions for ALL positions simultaneously while respecting causality.

⚠️ Pitfall 2: Forgetting the n in the loss denominator. L = -(1/n) Σ log p_t(x_{t+1}). For variable-length sequences, the mean should be over VALID tokens only, not including padding. Getting this wrong biases the loss.

⚠️ Pitfall 3: Confusing inference with training. At inference, generation is SEQUENTIAL (each step feeds the previous output back as input). Training is PARALLEL (all positions at once). The model architecture is the same — teacher forcing is the trick that makes parallel training possible.

Key Terms

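KV caching: storing the K and V vectors of already-processed tokens so they are not recomputed at each generation step (covered in 18-09)
Output: the attention output A · V; row i is a weighted average of the value vectors at positions 0..i
The causal mask: the matrix M with M_{ij} = 0 for j ≤ i and −∞ for j > i, added to the scores before the softmax
The decoder-only architecture: a stack of Transformer blocks with causal self-attention that maps token indices to next-token logits at every position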

Worked Examples

Example 1: Causal Mask for n=4

Problem: Write out the full S, M, S_masked, and A matrices for a 4-token sequence with causal masking. Use example scores.

Solution:

Let scores S = QKᵀ/√d_k:

S = [[2.0,  1.0,  0.5, −0.3],
     [1.5,  3.0,  1.2,  0.1],
     [0.8,  2.0,  2.5,  1.0],
     [0.2,  0.5,  1.8,  3.0]]

Mask M (0 = attend, −∞ = masked):

M = [[0,   −∞, −∞, −∞],
     [0,   0,  −∞, −∞],
     [0,   0,  0,  −∞],
     [0,   0,  0,  0]]

S_masked = S + M:

S_masked = [[2.0,  −∞,  −∞,  −∞],
            [1.5, 3.0,  −∞,  −∞],
            [0.8, 2.0, 2.5,  −∞],
            [0.2, 0.5, 1.8, 3.0]]

A = softmax(S_masked) [row-wise]:

Row 0: softmax([2.0]) = [1.0]
Row 1: softmax([1.5, 3.0]) = [0.182, 0.818]
Row 2: softmax([0.8, 2.0, 2.5]) = [0.102, 0.339, 0.559]
Row 3: softmax([0.2, 0.5, 1.8, 3.0]) = [0.042, 0.057, 0.209, 0.692]

A = [[1.0,   0,     0,     0    ],
     [0.182, 0.818, 0,     0    ],
     [0.102, 0.339, 0.559, 0    ],
     [0.042, 0.057, 0.209, 0.692]]

Notice: upper triangle is all zeros. Row i sums to 1 over columns 0..i only.

Example 2: Cross-Entropy Loss for a Sequence

Problem: For vocabulary V = 4 (tokens a, b, c, d), a sequence [a, b, c, a], and model logits:

Z = [[2.0, 1.0, 0.5, 0.3],   # predict token 1 from token 0
     [0.5, 3.0, 1.0, 0.2],   # predict token 2 from tokens 0,1
     [0.3, 0.8, 2.5, 1.0],   # predict token 3 from tokens 0,1,2
     [2.5, 0.5, 0.3, 0.1]]   # predict token 4 from tokens 0,1,2,3

Compute the average cross-entropy loss. Token indices: a=0, b=1, c=2, d=3.

The model takes [a, b, c, a] and predicts the next tokens: z₀→x₁=b, z₁→x₂=c, z₂→x₃=a. Row z₃ would predict x₄, which is not in the sequence, so it has no target. The loss is therefore averaged over positions 0, 1, 2 only.

Solution:

Position 0 (target = b = index 1):
softmax([2.0, 1.0, 0.5, 0.3]): exp: [7.389, 2.718, 1.649, 1.350], sum = 13.106
p₀ = [0.564, 0.207, 0.126, 0.103]
L₀ = −log p₀(b) = log(13.106) − 1.0 = 1.573

Position 1 (target = c = index 2):
softmax([0.5, 3.0, 1.0, 0.2]): exp: [1.649, 20.086, 2.718, 1.221], sum = 25.674
p₁ = [0.064, 0.782, 0.106, 0.048]
L₁ = −log p₁(c) = log(25.674) − 1.0 = 2.245

Position 2 (target = a = index 0):
softmax([0.3, 0.8, 2.5, 1.0]): exp: [1.350, 2.226, 12.182, 2.718], sum = 18.476
p₂ = [0.073, 0.120, 0.659, 0.147]
L₂ = −log p₂(a) = log(18.476) − 0.3 = 2.616

Each loss is evaluated as logsumexp(z) minus the target logit, which avoids the rounding error of taking −log of a 3-digit probability.

Average loss: L = (1.573 + 2.245 + 2.616) / 3 = 6.434/3 = 2.145

Example 3: Gradient Derivation Check

Problem: Verify that ∂L/∂z_j = p_j − 1_{j=y} for a single position. Use z = [1, 2], y = index 1.

Solution:

z = [1, 2], y = 1
softmax: exp(z) = [2.718, 7.389], sum = 10.107
p = [0.269, 0.731]

L = −log(p₁) = −log(0.731) = 0.313

∂L/∂z₀ = ?
Chain rule: ∂L/∂z₀ = ∂L/∂p₀ · ∂p₀/∂z₀ + ∂L/∂p₁ · ∂p₁/∂z₀

∂L/∂p₀ = 0 (loss only depends on p_y)
∂L/∂p₁ = −1/p₁

Softmax derivatives:
∂p₀/∂z₀ = p₀(1−p₀) = 0.269·0.731 = 0.197
∂p₁/∂z₀ = −p₀p₁ = −0.269·0.731 = −0.197

∂L/∂z₀ = 0·0.197 + (−1/0.731)·(−0.197) = 0.269

∂L/∂z₁:
∂p₀/∂z₁ = −p₀p₁ = −0.197
∂p₁/∂z₁ = p₁(1−p₁) = 0.197

∂L/∂z₁ = 0·(−0.197) + (−1/0.731)·(0.197) = −0.269

Check: p − 1_{y=1} = [0.269, 0.731−1] = [0.269, −0.269] ✓

Quiz

Q1: What does the decoder-only architecture refer to in this subject?

A) A model in which every position attends to every other position, past and future
B) A model that can only be trained by feeding back its own sampled tokens
C) A model that generates all tokens of a continuation in a single step at inference
D) A stack of Transformer blocks with causal self-attention that maps token indices to next-token logits at every position

Correct: D)

If you chose A: This is incorrect. The causal mask restricts position i to positions j ≤ i.
If you chose B: This is incorrect. Training uses teacher forcing, so the ground-truth tokens are the input.
If you chose C: This is incorrect. Inference is sequential; only training is parallel across positions.
If you chose D: Correct! One forward pass through the causally masked blocks yields logits Z ∈ ℝ^(n×V), with row i predicting the token at position i+1.

Q2: What is the primary purpose of the causal mask?

A) To zero out the loss at padding positions
B) To inject positional information into the token embeddings
C) To re-normalize the attention weights after the softmax
D) To add −∞ to the scores where j > i, so that position i attends only to positions j ≤ i

Correct: D)

If you chose A: This is incorrect. That describes the loss mask m_t, which is applied to the per-position losses, not to the attention scores.
If you chose B: This is incorrect. Position is handled by the positional encoding (RoPE or absolute PE), not by the mask.
If you chose C: This is incorrect. The mask is added before the softmax, so no re-normalization is needed.
If you chose D: Correct! Because exp(−∞) = 0, masked positions receive zero attention weight.

Q3: Which statement about the attention Output = A · V is TRUE?

A) Row i depends on the value vectors at all n positions
B) It is computed before the softmax is applied
C) It is an n×n matrix of attention weights
D) Row i is a weighted average of the value vectors v_j for j ≤ i, with weights that sum to 1

Correct: D)

If you chose A: This is incorrect. A_{ij} = 0 for j > i, so later positions do not contribute to row i.
If you chose B: This is incorrect. A is the result of the softmax, and the output is A · V.
If you chose C: This is incorrect. The n×n matrix is A; the output has one row per position and one column per value dimension.
If you chose D: Correct! o_i = Σ_{j=0}^{i} A_{ij} · v_j, and row i of A sums to 1 over columns 0..i.

Q4: In Worked Example 1, what is row 1 of A, that is, softmax([1.5, 3.0, −∞, −∞])?

A) [0.818, 0.182, 0, 0]
B) [0.25, 0.25, 0.25, 0.25]
C) [0.182, 0.818, 0, 0]
D) [0.333, 0.667, 0, 0]

Correct: C)

If you chose A: This is incorrect. The larger score (3.0, at column 1) must receive the larger weight.
If you chose B: This is incorrect. A uniform row ignores both the scores and the mask.
If you chose C: Correct! exp(1.5) = 4.482 and exp(3.0) = 20.086, so the weights are 4.482/24.568 = 0.182 and 20.086/24.568 = 0.818, and the masked columns are 0.
If you chose D: This is incorrect. Those values come from normalizing the raw scores 1.5 and 3.0 without exponentiating.

Q5: How are the attention Output and KV caching related?

A) KV caching stores the K and V of earlier tokens, so the attention output for the newest position can be computed without recomputing them
B) KV caching stores the attention outputs of future positions
C) KV caching changes the values of the attention output
D) The attention output and KV caching are completely unrelated

Correct: A)

If you chose A: Correct! During generation, each step only needs the new token's query against the cached keys and values of the previous tokens.
If you chose B: This is incorrect. Future positions do not exist yet during generation, and the causal mask would exclude them anyway.
If you chose C: This is incorrect. KV caching avoids recomputation; it is an efficiency technique.
If you chose D: This is incorrect. The attention output A · V is built from exactly the K and V vectors that the cache stores.

Q6: What is a common pitfall when working with the autoregressive language modeling objective?

A) Believing that teacher forcing leaks future tokens to the model
B) Averaging the loss over valid (non-padding) tokens only
C) Using the same architecture for training and inference
D) Adding the causal mask before the softmax

Correct: A)

If you chose A: Correct! Teacher forcing supplies the ground-truth prefix, while the causal mask still prevents position i from attending to positions > i.
If you chose B: This is incorrect. Averaging over valid tokens only is the recommended practice; including padding biases the loss.
If you chose C: This is incorrect. The architecture is the same in both settings; only the way inputs are supplied differs.
If you chose D: This is incorrect. Adding the mask before the softmax is the standard construction.

Q7: When should causal (masked) self-attention be applied?

A) Only at inference, when tokens are generated one at a time
B) Whenever the model predicts the next token, in both training and inference, so that position i never sees positions > i
C) Only when the batch contains padded sequences
D) Only when absolute positional encodings are used

Correct: B)

If you chose A: This is incorrect. The mask is essential during training, where the whole sequence is processed at once.
If you chose B: Correct! The mask is what makes each row of Z a valid prediction of the following token.
If you chose C: This is incorrect. Padding is handled by the loss mask m_t; causality is a separate concern.
If you chose D: This is incorrect. The causal mask is independent of the choice between RoPE and absolute PE.

Practice Problems

Problem 1

For a 3-token sequence, the logits matrix Z is: [[1.0, 0.5, −0.2], [0.3, 2.0, 0.8], [−0.5, 0.2, 1.5]] Target next tokens: [token_1, token_2, token_0]. Compute the average cross-entropy loss.

Answer

Position 0 (target idx 1):
softmax([1.0, 0.5, −0.2]) exp: [2.718, 1.649, 0.819], sum = 5.186
p = [0.524, 0.318, 0.158]
L₀ = log(5.186) − 0.5 = 1.146

Position 1 (target idx 2):
softmax([0.3, 2.0, 0.8]) exp: [1.350, 7.389, 2.226], sum = 10.965
p = [0.123, 0.674, 0.203]
L₁ = log(10.965) − 0.8 = 1.595

Position 2 (target idx 0):
softmax([−0.5, 0.2, 1.5]) exp: [0.607, 1.221, 4.482], sum = 6.310
p = [0.096, 0.194, 0.710]
L₂ = log(6.310) − (−0.5) = 2.342

Average L = (1.146 + 1.595 + 2.342)/3 = 1.694

Problem 2

Prove that with causal masking, the attention output at position i does not depend on tokens at positions j > i.

Answer

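The attention output at position i is:
o_i = Σ_{j=0}^{n−1} A_{ij} · v_j
With causal masking, A_{ij} = 0 for all j > i (since exp(−∞) = 0). Therefore:
o_i = Σ_{j=0}^{i} A_{ij} · v_j
The surviving weights A_{ij} (j ≤ i) are a softmax over the scores s_{i0}, ..., s_{ii} only, and each s_{ij} = q_i · k_j / √d_k involves only positions i and j ≤ i. So neither the weights nor the values in the sum involve positions j > i. Every other operation in a block (normalization, FFN, residual addition) acts on each position separately, so applying the same argument layer by layer shows that q_i, k_j and v_j are themselves free of future tokens. The causal mask therefore enforces the autoregressive property.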

Problem 3

A GPT-style model has d = 2048, L = 24, |V| = 50257. Estimate the total parameter count (without weight tying).

Answer

Embedding: |V|·d = 50257·2048 ≈ 103M
Per block: ~12d² = 12·2048² = 12·4,194,304 ≈ 50.3M
Total blocks: 24·50.3M ≈ 1,208M
Output: d·|V| ≈ 103M
Total: 103 + 1208 + 103 ≈ 1,414M = 1.4B parameters
This is close in size to GPT-2 XL (1.5B parameters), although GPT-2 XL reaches that count with a different shape (d = 1600, L = 48).

Problem 4

Explain why teacher forcing allows parallel training of an autoregressive model. What would happen if you tried to backpropagate through free-running generation?

Answer

Teacher forcing feeds ground-truth tokens as input, so the model processes all positions simultaneously. The causal mask handles causality. Loss at all positions is computed in one pass and gradients flow in parallel.

Free-running (feeding the model's own predictions back as input) creates two problems:
1. Sequential dependency: tokens must be generated one at a time, so there is no parallelism
2. Non-differentiability: sampling or argmax from the softmax is non-differentiable

Even if you use the softmax probabilities directly ("soft" forcing), errors compound: a mistake at position 5 corrupts all subsequent predictions. Teacher forcing is the standard training method; free-running is only used at inference.

Problem 5

For the causal mask, why is it sufficient to add −∞ rather than explicitly zeroing out attention weights?

Answer

Adding −∞ to the pre-softmax scores means exp(−∞) = 0. After softmax normalization, these positions get exactly zero weight, and the remaining weights sum to 1. This is mathematically cleaner than post-softmax masking (which would require re-normalization) and is efficiently implemented as a single addition before softmax.

Summary

The decoder-only Transformer factorizes the joint token probability via the chain rule: p(x₁,...,x_T) = Π p(x_t|x_{<t}), predicting one token at a time
Causal masking adds −∞ to upper-triangular positions of QKᵀ, ensuring position i attends only to positions ≤ i
Training uses teacher forcing with cross-entropy loss L = −(1/n)Σ log p_t(x_{t+1}), computable in parallel due to the causal mask
One forward pass produces predictions for all positions; the gradient ∂L/∂z_j = p_j − 1_{j=y} has the same elegant form as standard cross-entropy
Parameter count: ~12d² per block dominates for large models; GPT-3's 175B ≈ 96·12·(12288)²

Next Steps

Continue to 18-06 — Pre-training Objective Mathematics for a deeper analysis of the loss function, perplexity, bits-per-byte, and training dynamics.
