17-09 - Transformer Architecture
Phase: 17 - Deep Learning Architectures (Math)
Subject: 17-09
Prerequisites: 17-08 (Multi-Head Attention), 17-05 (Residual Connections), 16-10 (Other Normalization Methods: LayerNorm), 16-02 (Activation Functions)
Next subject: 17-10 - The Transformer Block (Detailed)
Learning Objectives
By the end of this subject, you will be able to:
- Diagram the complete Transformer encoder-decoder architecture with mathematical precision
- Derive the forward pass through an encoder layer: self-attention → add & norm → FFN → add & norm
- Derive the forward pass through a decoder layer including masked self-attention and cross-attention
- Explain the position-wise feedforward network (FFN) and why it's applied independently per position
- Compare the information flow in encoder self-attention vs. decoder masked self-attention vs. cross-attention
Core Content
1. The Big Picture
The Transformer (Vaswani et al., 2017) replaces recurrence with attention. The architecture has two main stacks:
Encoder: Processes the input sequence into contextualized representations. Each position can attend to ALL other positions (bidirectional).
Decoder: Generates output autoregressively. Each position can attend to previous decoder positions (causal) AND all encoder positions (cross-attention).
Both stacks are built from repeated identical layers (typically N=6 in the original).
2. Encoder Layer
Each encoder layer has two sub-layers:
Sub-layer 1: Multi-Head Self-Attention
Z = MultiHead(X, X, X)        [Q = K = V = X: self-attention]
X_out = LayerNorm(X + Z)      [residual + LayerNorm]
Sub-layer 2: Position-wise Feedforward Network
F = FFN(X_out) = ReLU(X_out · W₁ + b₁) · W₂ + b₂
X_final = LayerNorm(X_out + F)      [residual + LayerNorm]
Where:
- X is the input to the layer, shape (n × d_model)
- W₁ ∈ ℝ^(d_model × d_ff), W₂ ∈ ℝ^(d_ff × d_model)
- d_ff is typically 4·d_model (e.g., 2048 for d_model=512)
⚠️ THIS IS CRITICAL. The FFN is position-wise: the same weights are applied independently to every position. The attention sub-layer handles mixing information across positions; the FFN handles transforming the representation at each position.
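The two sub-layers above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the parameter names (`W_q`, `W_k`, `W_v`, `W_o`), the small dimensions, the random initialization, and the omission of the learnable LayerNorm parameters (γ = 1, β = 0) are all choices made for this sketch.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)      # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Assumption for brevity: gamma = 1 and beta = 0 (no learnable parameters).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head(q_in, k_in, v_in, p, h):
    n_q, d_model = q_in.shape
    d_k = d_model // h

    def split(t):
        # (length, d_model) -> (h, length, d_k)
        return t.reshape(t.shape[0], h, d_k).transpose(1, 0, 2)

    Q = split(q_in @ p["W_q"])
    K = split(k_in @ p["W_k"])
    V = split(v_in @ p["W_v"])
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (h, n_q, n_k)
    heads = softmax(scores) @ V                               # (h, n_q, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n_q, d_model)   # (n_q, d_model)
    return concat @ p["W_o"]

def encoder_layer(X, p, h):
    # Sub-layer 1: self-attention, then Add & Norm (post-norm).
    Z = multi_head(X, X, X, p, h)
    X_out = layer_norm(X + Z)
    # Sub-layer 2: position-wise FFN, then Add & Norm.
    F = np.maximum(0, X_out @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return layer_norm(X_out + F)

rng = np.random.default_rng(0)
n, d_model, d_ff, h = 10, 64, 256, 8
p = {name: rng.normal(0, 0.1, (d_model, d_model))
     for name in ("W_q", "W_k", "W_v", "W_o")}
p.update(W1=rng.normal(0, 0.1, (d_model, d_ff)), b1=np.zeros(d_ff),
         W2=rng.normal(0, 0.1, (d_ff, d_model)), b2=np.zeros(d_model))

X = rng.normal(size=(n, d_model))
X_final = encoder_layer(X, p, h)
print(X_final.shape)   # (10, 64)
```

The output has the same shape as the input, which is what allows several such layers to be stacked.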
3. Decoder Layer
Each decoder layer has THREE sub-layers:
Sub-layer 1: Masked Multi-Head Self-Attention
Z₁ = MultiHead(Y, Y, Y, mask=causal_mask)
Y₁ = LayerNorm(Y + Z₁)
Sub-layer 2: Multi-Head Cross-Attention
Z₂ = MultiHead(Y₁, X_enc, X_enc)      [Q from decoder, K,V from encoder]
Y₂ = LayerNorm(Y₁ + Z₂)
Sub-layer 3: Position-wise FFN
F = FFN(Y₂)
Y_out = LayerNorm(Y₂ + F)
Each decoder layer attends to the FULL encoder output. In the original Transformer this is the output of the last encoder layer; some variants instead use a weighted combination of all encoder layers. This is how the decoder "looks at" the source sequence.
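A minimal NumPy sketch of the three decoder sub-layers follows. To keep it short it assumes single-head attention and LayerNorm without learnable parameters; the weight dictionary layout (`"self"`, `"cross"`, keys `q`, `k`, `v`, `o`) is a naming choice for this sketch only. The final lines check causality: perturbing later target positions should leave earlier outputs unchanged.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Assumption for brevity: gamma = 1 and beta = 0.
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def attention(q_in, kv_in, W, mask=None):
    """Single-head attention: queries from q_in, keys and values from kv_in."""
    Q, K, V = q_in @ W["q"], kv_in @ W["k"], kv_in @ W["v"]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (len_q, len_kv)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # blocked entries get zero weight
    return softmax(scores) @ V @ W["o"]

def decoder_layer(Y, X_enc, p):
    m = Y.shape[0]
    causal_mask = np.tril(np.ones((m, m), dtype=bool))   # position t sees 0..t
    # Sub-layer 1: masked self-attention (Q, K, V all from the decoder).
    Y1 = layer_norm(Y + attention(Y, Y, p["self"], causal_mask))
    # Sub-layer 2: cross-attention (Q from decoder, K and V from encoder, no causal mask).
    Y2 = layer_norm(Y1 + attention(Y1, X_enc, p["cross"]))
    # Sub-layer 3: position-wise FFN.
    F = np.maximum(0, Y2 @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return layer_norm(Y2 + F)

rng = np.random.default_rng(0)
m, n, d_model, d_ff = 7, 15, 16, 64

def attn_weights():
    return {k: rng.normal(0, 0.1, (d_model, d_model)) for k in "qkvo"}

p = {"self": attn_weights(), "cross": attn_weights(),
     "W1": rng.normal(0, 0.1, (d_model, d_ff)), "b1": np.zeros(d_ff),
     "W2": rng.normal(0, 0.1, (d_ff, d_model)), "b2": np.zeros(d_model)}

Y = rng.normal(size=(m, d_model))
X_enc = rng.normal(size=(n, d_model))
Y_out = decoder_layer(Y, X_enc, p)

# Causality check: perturb target positions 4..6 and compare outputs 0..3.
Y_changed = Y.copy()
Y_changed[4:] += rng.normal(size=(m - 4, d_model))
Y_out_changed = decoder_layer(Y_changed, X_enc, p)
print(np.allclose(Y_out[:4], Y_out_changed[:4]))
```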
4. Position-wise Feedforward Network (FFN)
The FFN is a two-layer MLP applied independently to each position:
FFN(x) = W₂ · σ(W₁ · x + b₁) + b₂
Where:
- x ∈ ℝ^d_model (a single position's vector)
- W₁ ∈ ℝ^(d_ff × d_model): expands from d_model to d_ff (typically 4×)
- W₂ ∈ ℝ^(d_model × d_ff): projects back to d_model
- σ = ReLU (original) or GELU/SiLU (modern variants)
- Applied in parallel across all n positions via batched matrix multiply
Note: this is the column-vector form. Section 2 uses the equivalent row-vector form X · W₁, in which W₁ and W₂ have the transposed shapes.
Why position-wise? The attention mechanism handles inter-position interactions. The FFN handles per-position non-linear transformation. Separating these concerns is a key design principle:
Information MIXING (attention) → Information TRANSFORMATION (FFN)
Why 4× expansion? The expansion to d_ff = 4·d_model provides capacity for the FFN to learn complex transformations. The ReLU creates sparsity (roughly half of the hidden units are zero when the pre-activations are centered around zero), so the effective capacity is lower than the parameter count suggests.
Parameter count per FFN: 2·d_model·d_ff = 2·d·4d = 8d² (plus d_ff + d_model bias terms)
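The position-wise property and the 8d² weight count can be checked numerically. The sketch below uses the column-vector form FFN(x) = W₂ · ReLU(W₁ · x + b₁) + b₂ with small, arbitrarily chosen dimensions and random weights (both are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model = 5, 8
d_ff = 4 * d_model

W1 = rng.normal(size=(d_ff, d_model))
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_model, d_ff))
b2 = rng.normal(size=d_model)

def ffn(x):
    """FFN for one position x in R^d_model (column-vector convention)."""
    return W2 @ np.maximum(0, W1 @ x + b1) + b2

X = rng.normal(size=(n, d_model))

# Apply the FFN one position at a time ...
per_position = np.stack([ffn(x) for x in X])
# ... and all positions at once with a batched matrix multiply.
batched = np.maximum(0, X @ W1.T + b1) @ W2.T + b2
same_result = np.allclose(per_position, batched)

# Shuffling the positions only shuffles the outputs: no cross-position mixing.
perm = rng.permutation(n)
shuffled = np.maximum(0, X[perm] @ W1.T + b1) @ W2.T + b2
permutation_equivariant = np.allclose(shuffled, batched[perm])

n_weights = W1.size + W2.size     # 2 * d_model * d_ff = 8 * d_model**2
n_biases = b1.size + b2.size      # d_ff + d_model
print(same_result, permutation_equivariant, n_weights, n_biases)
```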
5. Residual Connections and Layer Normalization
Every sub-layer is wrapped in Add & Norm:
output = LayerNorm(x + Sublayer(x))
The residual connection (x + ...) provides the gradient highway (17-05). The LayerNorm (LN) stabilizes training by normalizing each position's representation.
LayerNorm (review from 16-10): For a vector x β β^d:
LN(x) = γ ⊙ (x − μ)/σ + β
Where μ = (1/d)Σx_i, σ² = (1/d)Σ(x_i − μ)², and γ, β ∈ ℝ^d are learnable parameters. Implementations typically divide by √(σ² + ε) with a small ε for numerical stability.
The normalization order matters: The original Transformer uses post-norm (sublayer → add → norm). Modern Transformers often use pre-norm (norm → sublayer → add) for better training stability (see 17-10 for detailed comparison).
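The LayerNorm formula and the two orderings can be written side by side. This is a sketch: the `eps` value and the helper names `post_norm` and `pre_norm` are assumptions for illustration, and `sublayer` stands for any function mapping (n, d) to (n, d).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (position) of x over its feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def post_norm(x, sublayer, gamma, beta):
    # Original Transformer: sublayer -> add -> norm
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_norm(x, sublayer, gamma, beta):
    # Common modern variant: norm -> sublayer -> add
    return x + sublayer(layer_norm(x, gamma, beta))

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))
gamma, beta = np.ones(d), np.zeros(d)

post = post_norm(x, lambda t: 2.0 * t, gamma, beta)
pre_identity = pre_norm(x, lambda t: np.zeros_like(t), gamma, beta)

print(post.mean(axis=-1))            # each row has mean close to 0
print(np.allclose(pre_identity, x))  # residual path passes x through unchanged
```

In the pre-norm form the residual path is an unnormalized identity, whereas in the post-norm form every residual sum is passed through LayerNorm.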
6. Input and Output Processing
Encoder Input:
1. Token embedding lookup: input tokens → E_in ∈ ℝ^(n×d_model)
2. Add positional encoding: X = E_in + PE (see 18-03 for details)
3. Feed into encoder stack
Decoder Input and Output:
1. Token embedding lookup: output tokens → E_out ∈ ℝ^(m×d_model)
2. Add positional encoding: Y = E_out + PE
3. Feed into decoder stack
4. Apply the final linear projection + softmax to the decoder stack output Y_final: P = softmax(Y_final · W_out)
Where W_out ∈ ℝ^(d_model × V) maps to vocabulary size V, and the softmax gives a probability distribution over tokens at each target position.
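The input and output steps can be sketched as below. Several things here are assumptions for illustration: the sinusoidal positional encoding of the original paper is used for PE (18-03 covers the details), the embedding tables and token ids are random toy values, and the decoder stack itself is replaced by a stand-in so that only the shapes of the embedding and projection steps are shown.

```python
import numpy as np

def sinusoidal_pe(length, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    pos = np.arange(length)[:, None]               # (length, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even feature indices
    pe[:, 1::2] = np.cos(angles)                   # odd feature indices
    return pe

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
V, d_model = 50, 16
E_in_table = rng.normal(0, 0.1, (V, d_model))     # source embedding table (toy values)
E_out_table = rng.normal(0, 0.1, (V, d_model))    # target embedding table (toy values)
W_out = rng.normal(0, 0.1, (d_model, V))

input_ids = np.array([3, 14, 15, 9, 2])           # n = 5 source tokens
target_ids = np.array([1, 7, 7, 4])               # m = 4 target tokens

X = E_in_table[input_ids] + sinusoidal_pe(len(input_ids), d_model)     # (n, d_model)
Y = E_out_table[target_ids] + sinusoidal_pe(len(target_ids), d_model)  # (m, d_model)

Y_final = Y   # stand-in only: a real model passes Y and X through the decoder stack
P = softmax(Y_final @ W_out)                                           # (m, V)
print(X.shape, Y.shape, P.shape)
```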
7. Information Flow Summary
| Attention Type | Q source | K, V source | Masking | Purpose |
|---|---|---|---|---|
| Encoder Self-Attn | Encoder input | Encoder input | None (bidirectional) | Contextualize each token |
| Decoder Masked Self-Attn | Decoder input | Decoder input | Causal | Autoregressive context |
| Decoder Cross-Attn | Decoder state | Encoder output | None | Attend to source sequence |
The encoder processes the entire input at once (non-autoregressive). The decoder generates one token at a time (autoregressive), but during training it processes all target tokens in parallel using teacher forcing with causal masking.
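The causal mask and the teacher-forcing setup can be made concrete with a small example. The token ids and the BOS/EOS convention (decoder input is the target shifted right by one) are assumptions for this sketch; the scores are random stand-ins for QKᵀ/√d_k.

```python
import numpy as np

# Teacher forcing: the decoder reads the ground-truth prefix and predicts the next token.
BOS, EOS = 0, 1                                 # hypothetical special-token ids
target = np.array([BOS, 5, 8, 6, EOS])
decoder_input = target[:-1]                     # [BOS, 5, 8, 6]
labels = target[1:]                             # [5, 8, 6, EOS]

m = len(decoder_input)
causal_mask = np.tril(np.ones((m, m), dtype=bool))   # row t allows columns 0..t

rng = np.random.default_rng(0)
scores = rng.normal(size=(m, m))                # stand-in for Q K^T / sqrt(d_k)
masked = np.where(causal_mask, scores, -np.inf)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(causal_mask.astype(int))
print(np.round(weights, 2))   # zeros above the diagonal, each row sums to 1
```

All m rows are computed in one pass, yet row t only uses positions 0..t, which is why training can be parallel while still matching the autoregressive factorization.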
8. Full Forward Pass (Algorithmic View)
Encoder(input_ids, input_mask):
    X = Embed(input_ids) + PositionalEncoding
    for each encoder_layer:
        X = encoder_layer(X, mask=input_mask)
    return X  # shape: (n, d_model)
Decoder(target_ids, enc_output, target_mask, enc_mask):
    Y = Embed(target_ids) + PositionalEncoding
    for each decoder_layer:
        Y = decoder_layer(Y, enc_output,
                          self_mask=target_mask,
                          cross_mask=enc_mask)
    logits = Y @ W_out + b_out  # (m, V)
    return softmax(logits)
9. Training Objective
The Transformer is typically trained with teacher forcing:
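Loss = −(1/N) Σ_{t=1}^m Σ_{v=1}^V y_{t,v} · log(p_{t,v})
Here y_{t,v} is 1 if the target token at position t is v (0 otherwise), p_{t,v} is the predicted probability of token v at position t, and N is the number of target tokens averaged over (N = m for a single sequence; this is not the layer count N from Section 1).
This is the categorical cross-entropy loss (see 16-04). With teacher forcing, the model predicts each token given the ground-truth previous target tokens, and the causal mask prevents any position from seeing future tokens.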
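The loss can be computed directly from logits. The sketch below assumes a single sequence (so the average is over its m target positions) and uses random toy logits; the function names are chosen for this example.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, target_ids):
    """Mean over positions of -sum_v y[t, v] * log p[t, v], with one-hot y."""
    m, V = logits.shape
    log_p = log_softmax(logits)
    y = np.zeros((m, V))
    y[np.arange(m), target_ids] = 1.0                 # one-hot targets
    return -(y * log_p).sum() / m

rng = np.random.default_rng(0)
m, V = 4, 6
logits = rng.normal(size=(m, V))
target_ids = np.array([2, 0, 5, 1])

loss = cross_entropy(logits, target_ids)
uniform_loss = cross_entropy(np.zeros((m, V)), target_ids)
print(loss, uniform_loss, np.log(V))   # a uniform prediction gives log(V)
```

Because y is one-hot, the inner sum over v just picks out the log-probability of the correct token at each position.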
Key Terms
- Add & Norm
- Causal mask
- Cross-attention
- Decoder layer
- Encoder layer
- Position-wise feedforward network (FFN)
- Post-norm and pre-norm
- Teacher forcing
Worked Examples
Example 1: Dimension Tracking Through Encoder Layer
Problem: Input sequence of length n=10, d_model=512, d_ff=2048. Track the shapes through one encoder layer.
Solution:
Input X: (10, 512)
Multi-Head Self-Attention:
- Q, K, V projections: each (10, 512) @ (512, 512) = (10, 512)
- (For h=8: internally reshape to (8, 10, 64))
- Attention output Z: (10, 512)
- Add & Norm: X_out = LN(X + Z) → (10, 512)
FFN:
- W₁: (10, 512) @ (512, 2048) = (10, 2048)
- ReLU: (10, 2048) [approx. half zero]
- W₂: (10, 2048) @ (2048, 512) = (10, 512)
- Add & Norm: X_final = LN(X_out + F) → (10, 512)
Output: (10, 512), the same shape as the input, which enables stacking.
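The shapes in this example can be confirmed with a few array operations. Only shapes matter here, so the sketch uses zero-filled arrays, reuses Q in place of K and V (they have the same shape), and leaves out softmax and LayerNorm because they do not change shapes; these simplifications are assumptions of the sketch.

```python
import numpy as np

n, d_model, d_ff, h = 10, 512, 2048, 8
d_k = d_model // h                                   # 64

X = np.zeros((n, d_model))
W_q = np.zeros((d_model, d_model))
W1 = np.zeros((d_model, d_ff))
W2 = np.zeros((d_ff, d_model))

shapes = {}
Q = X @ W_q
shapes["Q"] = Q.shape                                # (10, 512)
Q_heads = Q.reshape(n, h, d_k).transpose(1, 0, 2)
shapes["Q per head"] = Q_heads.shape                 # (8, 10, 64)
scores = Q_heads @ Q_heads.transpose(0, 2, 1)        # Q stands in for K (same shape)
shapes["scores"] = scores.shape                      # (8, 10, 10)
heads = scores @ Q_heads                             # Q stands in for V (same shape)
Z = heads.transpose(1, 0, 2).reshape(n, d_model)
shapes["Z"] = Z.shape                                # (10, 512)
hidden = np.maximum(0, Z @ W1)
shapes["FFN hidden"] = hidden.shape                  # (10, 2048)
F = hidden @ W2
shapes["F"] = F.shape                                # (10, 512)

for name, shape in shapes.items():
    print(f"{name}: {shape}")
```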
Example 2: Parameter Count for One Transformer Layer
Problem: Compute the total parameter count for one encoder layer with d_model=512, d_ff=2048, h=8.
Solution:
Multi-Head Attention: 4d² = 4·512² = 1,048,576 (W_Q, W_K, W_V, W_O; biases omitted)
FFN: W₁ (512×2048 = 1,048,576) + b₁ (2048) + W₂ (2048×512 = 1,048,576) + b₂ (512) = 2,099,712
LayerNorm × 2: each has γ (512) + β (512) = 1,024. Two norms = 2,048
Total per encoder layer: 1,048,576 + 2,099,712 + 2,048 = 3,150,336 ≈ 3.15M parameters
For N=6 encoder layers: ~18.9M. Each decoder layer adds a cross-attention block and a third LayerNorm, so it has ~4.2M parameters (~25.2M for 6 layers), plus embedding + output projection.
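The same arithmetic as a small script, which also covers a decoder layer (one extra attention block for cross-attention and a third LayerNorm). The function names are chosen for this sketch, and attention biases are omitted to match the 4d² count used in the example.

```python
def attention_params(d_model):
    # W_Q, W_K, W_V, W_O, each d_model x d_model (biases omitted, as in the example)
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff):
    # W1 + b1 + W2 + b2
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

def layer_norm_params(d_model):
    # gamma and beta
    return 2 * d_model

def encoder_layer_params(d_model, d_ff):
    return (attention_params(d_model) + ffn_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))

def decoder_layer_params(d_model, d_ff):
    # masked self-attention + cross-attention + FFN + three LayerNorms
    return (2 * attention_params(d_model) + ffn_params(d_model, d_ff)
            + 3 * layer_norm_params(d_model))

d_model, d_ff, n_layers = 512, 2048, 6
enc = encoder_layer_params(d_model, d_ff)
dec = decoder_layer_params(d_model, d_ff)
print(f"encoder layer: {enc:,}")                  # 3,150,336
print(f"decoder layer: {dec:,}")                  # 4,199,936
print(f"6 encoder layers: {n_layers * enc:,}")    # 18,902,016
print(f"6 decoder layers: {n_layers * dec:,}")    # 25,199,616
```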
Example 3: Decoder Cross-Attention Shapes
Problem: A decoder processes target sequence of length m=7. Encoder output has length n=15. d_model=512. What are the shapes of Q, K, V in the decoder's cross-attention?
Solution:
Q from decoder self-attention output: (7, 512)
K from encoder output: (15, 512)
V from encoder output: (15, 512)
Attention scores S = QKᵀ/√d_k: (7, 15). Each of the 7 target positions computes attention over all 15 source positions.
Output: (7, 512), the same shape as the input Q.
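A quick numerical check of these shapes, assuming a single attention head (d_k = d_model) and random stand-ins for the decoder state, the encoder output, and the projection matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_model = 7, 15, 512
d_k = d_model                                   # single head, as in the example

Y1 = rng.normal(size=(m, d_model))              # decoder state after masked self-attention
X_enc = rng.normal(size=(n, d_model))           # encoder output
W_q, W_k, W_v = (rng.normal(0, d_model ** -0.5, (d_model, d_k)) for _ in range(3))

Q = Y1 @ W_q                                    # (7, 512)   from the decoder
K = X_enc @ W_k                                 # (15, 512)  from the encoder
V = X_enc @ W_v                                 # (15, 512)  from the encoder

S = Q @ K.T / np.sqrt(d_k)                      # (7, 15)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)              # each target row sums to 1 over source positions
out = A @ V                                     # (7, 512)
print(Q.shape, K.shape, V.shape, S.shape, out.shape)
```

The score matrix is rectangular (m × n) because queries and keys come from sequences of different lengths; the output length always follows the queries.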
Quiz
Q1: In the Information Flow Summary table, what distinguishes the three attention types from one another?
A) The number of attention heads
B) Where Q and K, V come from, and which mask is applied
C) The activation function used in the FFN
D) The value of d_model
Correct: B)
- If you chose A: This is incorrect. The table does not vary the number of heads; the three types differ in their Q and K, V sources and in their masking.
- If you chose B: Encoder self-attention takes Q, K, V from the encoder input with no mask; decoder masked self-attention takes them from the decoder input with a causal mask; cross-attention takes Q from the decoder state and K, V from the encoder output with no mask. Correct!
- If you chose C: This is incorrect. The FFN activation belongs to a different sub-layer and does not define an attention type.
- If you chose D: This is incorrect. d_model is the same for all three attention types.
Q2: What is the primary purpose of the cross-attention sub-layer in a decoder layer?
A) To prevent target positions from attending to future target tokens
B) To apply a non-linear transformation independently at each position
C) To let each target position attend to all positions of the encoder output
D) To normalize each position's representation
Correct: C)
- If you chose A: This is incorrect. Blocking future target tokens is the job of the causal mask in the masked self-attention sub-layer.
- If you chose B: This is incorrect. The per-position non-linear transformation is the job of the position-wise FFN.
- If you chose C: Cross-attention takes Q from the decoder and K, V from the encoder, which is how the decoder "looks at" the source sequence. Correct!
- If you chose D: This is incorrect. Normalization is done by LayerNorm in each Add & Norm step.
Q3: Which statement about the encoder layer is TRUE?
A) It has three sub-layers
B) Its self-attention uses a causal mask
C) Its output has the same shape as its input, which allows layers to be stacked
D) Its FFN mixes information across positions
Correct: C)
- If you chose A: This is incorrect. The encoder layer has two sub-layers (self-attention and FFN); the decoder layer has three.
- If you chose B: This is incorrect. Encoder self-attention is unmasked (bidirectional); the causal mask is used in decoder self-attention.
- If you chose C: An input of shape (n, d_model) produces an output of shape (n, d_model), as traced in Example 1. Correct!
- If you chose D: This is incorrect. The FFN is position-wise; only attention mixes information across positions.
Q4: Based on the worked examples and practice problems in this subject, how do attention and the FFN divide the work?
A) The FFN mixes positions, attention transforms features
B) Both attention and the FFN mix positions
C) Attention mixes positions, FFN transforms features
D) Neither attention nor the FFN mixes positions
Correct: C)
- If you chose A: This is incorrect. The roles are reversed: attention mixes positions and the FFN transforms features.
- If you chose B: This is incorrect. The FFN applies the same weights independently to each position and does not mix positions.
- If you chose C: Attention handles inter-position interactions, and the position-wise FFN handles the per-position non-linear transformation. Correct!
- If you chose D: This is incorrect. Attention computes each output position as a weighted combination over positions, so it does mix them.
Q5: In the original Transformer, how large is the FFN hidden dimension d_ff relative to d_model?
A) 1×
B) 2×
C) 4×
D) 8×
Correct: C)
- If you chose A: This is incorrect. The FFN expands the representation; d_ff = 4·d_model (2048 for d_model=512).
- If you chose B: This is incorrect. The expansion factor is 4, not 2: d_ff = 2048 for d_model=512.
- If you chose C: d_ff = 4·d_model, e.g., 2048 for d_model=512, which gives 8d² FFN weights per layer. Correct!
- If you chose D: This is incorrect. 8d² is the FFN weight count, not the expansion factor; d_ff = 4·d_model.
Q6: What is a common pitfall when implementing the full forward pass?
A) Using unmasked self-attention in the encoder
B) Adding positional encodings to the token embeddings
C) Sharing the FFN weights across positions
D) Applying a causal mask to cross-attention, or taking K and V from the decoder in cross-attention
Correct: D)
- If you chose A: This is incorrect. Unmasked (bidirectional) self-attention is the intended behavior of the encoder.
- If you chose B: This is incorrect. Adding positional encodings to the embeddings is the standard input processing step.
- If you chose C: This is incorrect. The FFN is position-wise by design: the same weights are applied at every position.
- If you chose D: Cross-attention must see the full encoder output, with Q from the decoder and K, V from the encoder. Correct!
Q7: According to the Information Flow Summary, where is the causal mask applied?
A) In encoder self-attention
B) In decoder self-attention
C) In decoder cross-attention
D) In the position-wise FFN
Correct: B)
- If you chose A: This is incorrect. Encoder self-attention is bidirectional and uses no causal mask.
- If you chose B: The causal mask restricts each target position to itself and earlier target positions, which gives the decoder its autoregressive context. Correct!
- If you chose C: This is incorrect. Cross-attention attends to all source positions and uses no causal mask.
- If you chose D: This is incorrect. The FFN operates on each position independently, so there is nothing to mask.
Practice Problems
Problem 1
List all sub-layers of one decoder layer in order.
Answer
1. Masked Multi-Head Self-Attention → Add & Norm
2. Multi-Head Cross-Attention (Q from decoder, K,V from encoder) → Add & Norm
3. Position-wise Feedforward Network → Add & Norm
Problem 2
Why is the FFN called "position-wise"? What would happen if it weren't position-wise?
Answer
The FFN applies the same weights independently to each position; it is equivalent to a 1×1 convolution (kernel size 1) over the sequence. If it mixed positions (like a fully connected layer over the flattened sequence), positions would interact through the FFN in addition to attention, which would be redundant and computationally wasteful. Separation of concerns: attention mixes positions, FFN transforms features.
Problem 3
What is the purpose of residual connections in the Transformer?
Answer
Residual connections provide gradient highways through each sub-layer. Without them, the Transformer's depth (6+ layers) would cause vanishing gradients through the attention and FFN operations. They also help with signal propagation: the input can bypass a sub-layer if the sub-layer's transformation isn't useful initially.
Problem 4
Calculate the total parameter count for a 6-layer encoder + 6-layer decoder Transformer with d_model=512, d_ff=2048, h=8, vocabulary V=37000. Include embeddings and output projection.
Answer
Per encoder layer: ~3.15M (from Example 2). 6 layers = 18.9M.
Per decoder layer: self-attention (1,048,576) + cross-attention (1,048,576) + FFN (2,099,712) + 3 LayerNorms (3,072) = 4,199,936 ≈ 4.2M. 6 layers = 25.2M.
Embedding: V·d_model = 37000·512 = 18.94M.
Output projection (if not weight-tied): d_model·V = 18.94M.
Total: 18.9 + 25.2 + 18.94 + 18.94 ≈ 82M parameters (approximate; this counts one embedding matrix and depends on weight tying).
Problem 5
Why can the encoder process all tokens in parallel during training, but the decoder cannot during inference?
Answer
The encoder uses unmasked self-attention: every token attends to every other token, so there is no sequential dependency. The decoder uses masked (causal) self-attention: token t can only attend to tokens 0...t. During inference, token t depends on token t−1 (which depends on t−2, etc.), creating a sequential chain. During training, teacher forcing provides all target tokens so they can be processed in parallel with causal masking.
Summary
- The Transformer consists of an encoder (bidirectional self-attention + FFN) and a decoder (masked self-attention + cross-attention + FFN), each built from stacked identical layers
- Each layer wraps sub-layers with residual connections and LayerNorm: output = LN(x + Sublayer(x))
- The position-wise FFN (two linear layers with activation) is applied independently to each position, separating feature transformation from inter-position mixing
- Cross-attention in the decoder allows each target position to attend to all source positions via Q (decoder) and K,V (encoder)
- Teacher forcing with causal masking enables parallel training of the decoder despite its autoregressive nature
Pitfalls
- Confusing post-norm and pre-norm placement. The original Transformer used post-norm (Sublayer → Add → Norm), but most modern LLMs use pre-norm (Norm → Sublayer → Add). In post-norm, even the residual path passes through LayerNorm, so its gradient is scaled by LayerNorm's derivative at every layer, which can cause vanishing gradients in deep networks. If implementing from scratch, use pre-norm unless you have a specific reason to replicate the 2017 paper exactly.
- Thinking the FFN mixes information across sequence positions. The FFN is position-wise: the same weights are applied independently to each position. Only the attention sub-layers mix information across positions. Adding position-mixing to the FFN (e.g., a 1D convolution with kernel size greater than 1) would duplicate attention's role and waste parameters.
- Forgetting that decoder cross-attention Q comes from the decoder, but K and V come from the encoder. A common slip: setting K=V=decoder_states in cross-attention, turning it into another self-attention. The whole point of cross-attention is for the decoder to query the encoder's output. Q from decoder, K,V from encoder.
- Applying a causal mask to cross-attention. Cross-attention should see the FULL encoder output, because the decoder needs access to all source positions simultaneously. Applying a causal mask to cross-attention would prevent the decoder from attending to later source positions, breaking translation and similar tasks where word order differs between source and target. (A padding mask over source positions, such as enc_mask in the algorithmic view, is a separate matter and does not have this problem.)
- Treating the encoder and decoder as interchangeable during inference. The encoder processes the entire input in parallel with unmasked attention, which is fine during both training and inference. But the decoder is autoregressive: during inference it must run sequentially (token by token), even though teacher forcing enabled parallel processing during training. Designing a system that assumes parallel decoder inference will fail at serving time.
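The sequential nature of decoder inference shows up directly in a greedy decoding loop. In this sketch `toy_step` is a hypothetical stand-in for a full decoder forward pass (it just favors "last token + 1"), and `bos_id`, `eos_id`, and `max_len` are illustrative parameters; only the loop structure is the point.

```python
import numpy as np

def toy_step(prefix, enc_output, vocab_size=5):
    """Hypothetical stand-in for a decoder forward pass: returns next-token logits."""
    logits = np.zeros(vocab_size)
    logits[(prefix[-1] + 1) % vocab_size] = 1.0    # toy rule: favor last token + 1
    return logits

def greedy_decode(step_fn, enc_output, bos_id, eos_id, max_len):
    tokens = [bos_id]
    for _ in range(max_len):
        # Each step needs the tokens generated so far, so steps cannot run in parallel.
        logits = step_fn(tokens, enc_output)
        next_id = int(np.argmax(logits))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

enc_output = np.zeros((3, 8))      # the encoder runs once; its output is reused at every step
decoded = greedy_decode(toy_step, enc_output, bos_id=0, eos_id=3, max_len=10)
print(decoded)   # [0, 1, 2, 3]
```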
Next Steps
Continue to 17-10 - The Transformer Block (Detailed) for a deep mathematical dive into gradient flow, pre-norm vs. post-norm, and the full forward/backward pass of one block.