17-09 — Transformer Architecture

Phase: 17 — Deep Learning Architectures (Math)
Subject: 17-09
Prerequisites: 17-08 (Multi-Head Attention), 17-05 (Residual Connections), 16-10 (Other Normalization Methods — LayerNorm), 16-02 (Activation Functions)
Next subject: 17-10 — The Transformer Block (Detailed)

Learning Objectives

By the end of this subject, you will be able to:

Diagram the complete Transformer encoder-decoder architecture with mathematical precision
Derive the forward pass through an encoder layer: self-attention → add & norm → FFN → add & norm
Derive the forward pass through a decoder layer including masked self-attention and cross-attention
Explain the position-wise feedforward network (FFN) and why it's applied independently per position
Compare the information flow in encoder self-attention vs. decoder masked self-attention vs. cross-attention

Core Content

1. The Big Picture

The Transformer (Vaswani et al., 2017) replaces recurrence with attention. The architecture has two main stacks:

Encoder: Processes the input sequence into contextualized representations. Each position can attend to ALL other positions (bidirectional).

Decoder: Generates output autoregressively. Each position can attend to previous decoder positions (causal) AND all encoder positions (cross-attention).

Both stacks are built from repeated identical layers (N=6 per stack in the original).

2. Encoder Layer

Each encoder layer has two sub-layers:

Sub-layer 1: Multi-Head Self-Attention

Z = MultiHead(X, X, X)        [Q=K=V=X: self-attention]
X_out = LayerNorm(X + Z)      [residual + LayerNorm]

Sub-layer 2: Position-wise Feedforward Network

F = FFN(X_out) = ReLU(X_out · W₁ + b₁) · W₂ + b₂
X_final = LayerNorm(X_out + F)    [residual + LayerNorm]

Where:
- X is the input to the layer, shape (n × d_model)
- W₁ ∈ ℝ^(d_model × d_ff), W₂ ∈ ℝ^(d_ff × d_model)
- d_ff is typically 4·d_model (e.g., 2048 for d_model=512)

⚠️ THIS IS CRITICAL — The FFN is position-wise: the same weights are applied independently to every position. The attention sub-layer handles mixing information across positions; the FFN handles transforming the representation at each position.
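The two sub-layers above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names (`multi_head`, `encoder_layer`), the parameter dictionary `p`, the small dimensions, and the random initialization are all assumptions made for the example, and LayerNorm is used without its learnable γ and β.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalizes each position (row) separately; gamma=1, beta=0 for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head(q_in, kv_in, Wq, Wk, Wv, Wo, h, mask=None):
    n, d = q_in.shape
    m = kv_in.shape[0]
    d_k = d // h
    # Project, then split d_model into h heads of width d_k.
    Q = (q_in @ Wq).reshape(n, h, d_k).transpose(1, 0, 2)   # (h, n, d_k)
    K = (kv_in @ Wk).reshape(m, h, d_k).transpose(1, 0, 2)  # (h, m, d_k)
    V = (kv_in @ Wv).reshape(m, h, d_k).transpose(1, 0, 2)  # (h, m, d_k)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (h, n, m)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    heads = softmax(scores) @ V                              # (h, n, d_k)
    # Concatenate the heads back to (n, d_model) and apply the output projection.
    return heads.transpose(1, 0, 2).reshape(n, d) @ Wo

def encoder_layer(X, p, h):
    Z = multi_head(X, X, p["Wq"], p["Wk"], p["Wv"], p["Wo"], h)       # self-attention
    X_out = layer_norm(X + Z)                                         # add & norm
    F = np.maximum(0, X_out @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]  # position-wise FFN
    return layer_norm(X_out + F)                                      # add & norm

rng = np.random.default_rng(0)
n, d_model, d_ff, h = 6, 32, 128, 4  # small illustrative sizes
shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
          "Wv": (d_model, d_model), "Wo": (d_model, d_model),
          "W1": (d_model, d_ff), "b1": (d_ff,),
          "W2": (d_ff, d_model), "b2": (d_model,)}
p = {name: rng.normal(0.0, 0.02, size=shape) for name, shape in shapes.items()}
X = rng.normal(size=(n, d_model))
out = encoder_layer(X, p, h)
print(out.shape)
```

The output has the same shape as the input, which is what allows N such layers to be stacked.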

3. Decoder Layer

Each decoder layer has THREE sub-layers:

Sub-layer 1: Masked Multi-Head Self-Attention

Z₁ = MultiHead(Y, Y, Y, mask=causal_mask)
Y₁ = LayerNorm(Y + Z₁)

Sub-layer 2: Multi-Head Cross-Attention

Z₂ = MultiHead(Y₁, X_enc, X_enc)    [Q from decoder, K,V from encoder]
Y₂ = LayerNorm(Y₁ + Z₂)

Sub-layer 3: Position-wise FFN

F = FFN(Y₂)
Y_out = LayerNorm(Y₂ + F)

Every decoder layer attends to the FULL encoder output X_enc. In the original Transformer this is the output of the last encoder layer; some later variants instead use a weighted combination of all encoder layers. This is how the decoder "looks at" the source sequence.
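A compact NumPy sketch of the three decoder sub-layers follows. To keep it short it uses single-head attention and a LayerNorm without learnable parameters; those simplifications, the helper names, and the random weights are assumptions of the example rather than part of the architecture. The final lines check the causal property: changing the last target token leaves the outputs at earlier positions unchanged.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q_in, kv_in, W, mask=None):
    # Single-head attention for brevity (assumption; the real layer is multi-head).
    Wq, Wk, Wv, Wo = W
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked entries get ~zero weight
    return softmax(scores) @ V @ Wo

def decoder_layer(Y, X_enc, p):
    m = Y.shape[0]
    causal = np.tril(np.ones((m, m), dtype=bool))  # position t sees positions <= t
    Y1 = layer_norm(Y + attention(Y, Y, p["self"], mask=causal))  # masked self-attention
    Y2 = layer_norm(Y1 + attention(Y1, X_enc, p["cross"]))        # cross-attention, no mask
    F = np.maximum(0, Y2 @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]  # position-wise FFN
    return layer_norm(Y2 + F)

rng = np.random.default_rng(0)
m, n, d_model, d_ff = 7, 15, 16, 64

def rand_attn():
    return [rng.normal(0.0, 0.1, (d_model, d_model)) for _ in range(4)]

p = {"self": rand_attn(), "cross": rand_attn(),
     "W1": rng.normal(0.0, 0.1, (d_model, d_ff)), "b1": np.zeros(d_ff),
     "W2": rng.normal(0.0, 0.1, (d_ff, d_model)), "b2": np.zeros(d_model)}
Y = rng.normal(size=(m, d_model))
X_enc = rng.normal(size=(n, d_model))
out = decoder_layer(Y, X_enc, p)

# Replace the last target token: earlier positions must not be affected.
Y_changed = Y.copy()
Y_changed[-1] = rng.normal(size=d_model)
out_changed = decoder_layer(Y_changed, X_enc, p)
print(out.shape, np.allclose(out[:-1], out_changed[:-1]))
```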

4. Position-wise Feedforward Network (FFN)

The FFN is a two-layer MLP applied independently to each position:

FFN(x) = σ(x · W₁ + b₁) · W₂ + b₂

Where:
- x ∈ ℝ^d_model (a single position's vector, written as a row vector to match Section 2)
- W₁ ∈ ℝ^(d_model × d_ff): expands from d_model to d_ff (typically 4×)
- W₂ ∈ ℝ^(d_ff × d_model): projects back to d_model
- σ = ReLU (original) or GELU/SiLU (modern variants)
- Applied in parallel across all n positions via batched matrix multiply

Why position-wise? The attention mechanism handles inter-position interactions. The FFN handles per-position non-linear transformation. Separating these concerns is a key design principle:

Information MIXING (attention) → Information TRANSFORMATION (FFN)

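Why 4× expansion? Expanding to d_ff = 4·d_model gives the FFN a wide hidden layer in which to learn complex per-position transformations. ReLU zeroes every hidden unit whose pre-activation is negative (roughly half of them if pre-activations are symmetric around zero; the exact fraction depends on the trained weights), so the effective capacity is lower than the parameter count suggests.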

Parameter count per FFN: 2·d_model·d_ff = 2·d·4d = 8d² (plus bias terms)
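The position-wise property can be checked numerically. The sketch below uses the row-vector convention of Section 2 and small, arbitrary dimensions (an assumption for illustration). It confirms that applying the FFN to the whole (n × d_model) matrix equals applying it to each row separately, and that permuting the positions simply permutes the outputs.

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    # Works for a single position (d_model,) or a whole sequence (n, d_model).
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d_model = 5, 8
d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
X = rng.normal(size=(n, d_model))

batched = ffn(X, W1, b1, W2, b2)                           # all positions at once
per_row = np.stack([ffn(x, W1, b1, W2, b2) for x in X])    # one position at a time

perm = rng.permutation(n)
permuted = ffn(X[perm], W1, b1, W2, b2)                    # shuffle positions first

n_params = W1.size + b1.size + W2.size + b2.size           # 2*d_model*d_ff + d_ff + d_model
print(np.allclose(batched, per_row), np.allclose(permuted, batched[perm]), n_params)
```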

5. Residual Connections and Layer Normalization

Every sub-layer is wrapped in Add & Norm:

output = LayerNorm(x + Sublayer(x))

The residual connection (x + ...) provides the gradient highway (17-05). The LayerNorm (LN) stabilizes training by normalizing each position's representation.

LayerNorm (review from 16-10): For a vector x ∈ ℝ^d:

LN(x) = γ ⊙ (x − μ)/σ + β

Where μ = (1/d)Σx_i, σ² = (1/d)Σ(x_i − μ)², and γ, β ∈ ℝ^d are learnable parameters. Implementations typically divide by √(σ² + ε) with a small ε to avoid division by zero.

The normalization order matters: The original Transformer uses post-norm (sublayer → add → norm). Modern Transformers often use pre-norm (norm → sublayer → add) for better training stability (see 17-10 for detailed comparison).
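The two orderings differ only in where LayerNorm sits relative to the residual addition. A minimal sketch, assuming a toy `sublayer` (a simple scaling, standing in for attention or the FFN) and an `eps` of 1e-5:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # population variance, 1/d
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def post_norm(x, sublayer, gamma, beta):
    # Original Transformer: sublayer -> add -> norm
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_norm(x, sublayer, gamma, beta):
    # Pre-norm variant: norm -> sublayer -> add (the residual path skips the norm)
    return x + sublayer(layer_norm(x, gamma, beta))

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
gamma, beta = np.ones(d), np.zeros(d)
sublayer = lambda t: 0.5 * t  # hypothetical stand-in for attention or the FFN

post = post_norm(x, sublayer, gamma, beta)
pre = pre_norm(x, sublayer, gamma, beta)
print(post.std(axis=-1))  # close to 1 for every position
print(pre.std(axis=-1))   # not normalized: the raw x passes through unchanged
```

In post-norm the output of every sub-layer is normalized, including the residual contribution. In pre-norm the input x reaches the output untouched, which is the property 17-10 analyzes in terms of gradient flow.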

6. Input and Output Processing

Encoder Input:
1. Token embedding lookup: input tokens → E_in ∈ ℝ^(n×d_model)
2. Add positional encoding: X = E_in + PE (see 18-03 for details)
3. Feed into encoder stack

Decoder Input:
1. Token embedding lookup: output tokens → E_out ∈ ℝ^(m×d_model)
2. Add positional encoding: Y = E_out + PE
3. Feed into decoder stack

Decoder Output: apply the final linear projection and softmax to the decoder stack's output Y_final: P = softmax(Y_final · W_out)

Where W_out ∈ ℝ^(d_model × V) maps to vocabulary size V, and softmax gives probability distribution over tokens.
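The input and output steps can be sketched as below. The sinusoidal positional encoding is the one proposed in the original paper and is included here only as an assumption, since this subject defers PE details to 18-03. The embedding table `E`, the vocabulary size, and the use of the embedded input as a stand-in for the decoder stack's output are likewise illustrative.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def sinusoidal_pe(length, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(length)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
V, d_model, m = 50, 16, 7
E = rng.normal(size=(V, d_model))        # token embedding table
W_out = rng.normal(size=(d_model, V))    # output projection

target_ids = rng.integers(0, V, size=m)
Y = E[target_ids] + sinusoidal_pe(m, d_model)  # embedding lookup + positional encoding

Y_final = Y                               # stand-in for the decoder stack's output
P = softmax(Y_final @ W_out)              # (m, V): one distribution per position
print(Y.shape, P.shape)
```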

7. Information Flow Summary

Attention Type	Q source	K, V source	Masking	Purpose
Encoder Self-Attn	Encoder input	Encoder input	None (bidirectional)	Contextualize each token
Decoder Masked Self-Attn	Decoder input	Decoder input	Causal	Autoregressive context
Decoder Cross-Attn	Decoder state	Encoder output	None	Attend to source sequence

The encoder processes the entire input at once (non-autoregressive). The decoder generates one token at a time (autoregressive), but during training it processes all target tokens in parallel using teacher forcing with causal masking.
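The effect of the causal mask is easiest to see with uniform attention scores. In this small sketch (sequence length 4 and all-zero scores are arbitrary choices for readability), position t ends up spreading its attention evenly over positions 0..t and assigns exactly zero weight to later positions.

```python
import numpy as np

m = 4
causal_mask = np.tril(np.ones((m, m), dtype=bool))  # True where attention is allowed

scores = np.zeros((m, m))                           # uniform scores, for readability
masked = np.where(causal_mask, scores, -np.inf)     # future positions get -inf

# Row-wise softmax; exp(-inf) evaluates to exactly 0.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights)
```

Because all m rows are computed in one matrix operation, the decoder can be trained on every target position in parallel while each row still sees only its own past.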

8. Full Forward Pass (Algorithmic View)

Encoder(input_ids, input_mask):
    X = Embed(input_ids) + PositionalEncoding
    for each encoder_layer:
        X = encoder_layer(X, mask=input_mask)
    return X  # shape: (n, d_model)

Decoder(target_ids, enc_output, target_mask, enc_mask):
    Y = Embed(target_ids) + PositionalEncoding
    for each decoder_layer:
        Y = decoder_layer(Y, enc_output, 
                          self_mask=target_mask,
                          cross_mask=enc_mask)
    logits = Y @ W_out  # (m, V); no bias, matching P = softmax(Y_final · W_out) in Section 6
    return softmax(logits)

9. Training Objective

The Transformer is typically trained with teacher forcing:

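Loss = −(1/m) Σ_{t=1}^m Σ_{v=1}^V y_{t,v} · log(p_{t,v})

This is categorical cross-entropy loss (see 16-04), averaged over the m target positions, where y_{t,v} is the one-hot target and p_{t,v} is the predicted probability of token v at position t. Under teacher forcing the decoder is fed the ground-truth previous tokens rather than its own predictions, and the causal mask ensures that the prediction at position t depends only on the tokens before it.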
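A small sketch of the loss computation under teacher forcing is shown below. The token ids and the uniform "model output" are made up for illustration. The point is the shift: the decoder input is the target sequence without its last token, and the labels are the same sequence without its first token.

```python
import numpy as np

def cross_entropy(probs, label_ids):
    # Because y is one-hot, the inner sum over v reduces to picking p[t, label_t].
    m = len(label_ids)
    picked = probs[np.arange(m), label_ids]
    return -np.mean(np.log(picked))

# Hypothetical target sequence; id 0 marks the start and id 1 the end (assumption).
target = np.array([0, 5, 2, 7, 3, 1])
decoder_input = target[:-1]   # what the decoder is fed:   [0, 5, 2, 7, 3]
labels = target[1:]           # what it should predict:    [5, 2, 7, 3, 1]

V = 10
probs = np.full((len(labels), V), 1.0 / V)  # stand-in for model output: uniform over V
loss = cross_entropy(probs, labels)
print(loss)  # equals log(V) for a uniform prediction
```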

Key Terms

Encoder layer
Decoder layer
Position-wise feedforward network (FFN)
Masked (causal) self-attention
Cross-attention
Add & Norm (residual connection + LayerNorm)
Post-norm vs. pre-norm
Teacher forcing

Worked Examples

Example 1: Dimension Tracking Through Encoder Layer

Problem: Input sequence of length n=10, d_model=512, d_ff=2048. Track the shapes through one encoder layer.

Solution: Input X: (10, 512)

Multi-Head Self-Attention:
- Q, K, V projections: each (10, 512) @ (512, 512) = (10, 512)
- (For h=8: internally reshape to (8, 10, 64))
- Attention output Z: (10, 512)
- Add & Norm: X_out = LN(X + Z) → (10, 512)

FFN:
- W₁: (10, 512) @ (512, 2048) = (10, 2048)
- ReLU: (10, 2048) [approx half zero]
- W₂: (10, 2048) @ (2048, 512) = (10, 512)
- Add & Norm: X_final = LN(X_out + F) → (10, 512)

Output: (10, 512) — same shape as input, enabling stacking.
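The same shape bookkeeping can be reproduced with zero-filled arrays, which makes it easy to experiment with other values of n, d_model, d_ff, and h. The variable names are illustrative.

```python
import numpy as np

n, d_model, d_ff, h = 10, 512, 2048, 8
X = np.zeros((n, d_model))

# Attention projection, then the split into h heads of width d_model // h.
Q = X @ np.zeros((d_model, d_model))                         # (10, 512)
Q_heads = Q.reshape(n, h, d_model // h).transpose(1, 0, 2)   # (8, 10, 64)

# FFN: expand to d_ff, then project back to d_model.
hidden = X @ np.zeros((d_model, d_ff))                       # (10, 2048)
out = hidden @ np.zeros((d_ff, d_model))                     # (10, 512)

for name, arr in [("Q", Q), ("Q_heads", Q_heads), ("hidden", hidden), ("out", out)]:
    print(name, arr.shape)
```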

Example 2: Parameter Count for One Transformer Layer

Problem: Compute the total parameter count for one encoder layer with d_model=512, d_ff=2048, h=8.

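Solution: Multi-Head Attention: 4d² = 4·512² = 1,048,576 (W_Q, W_K, W_V, W_O, each 512×512, biases omitted; the number of heads h does not change the count)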

FFN: W₁ (512×2048=1,048,576) + b₁ (2048) + W₂ (2048×512=1,048,576) + b₂ (512) = 2,099,712

LayerNorm × 2: each has γ (512) + β (512) = 1,024. Two norms = 2,048

Total per encoder layer: 1,048,576 + 2,099,712 + 2,048 ≈ 3.15M parameters

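For N=6 encoder layers: ~18.9M. A decoder layer has a second attention block (cross-attention) and a third LayerNorm, so it has ~4.2M parameters rather than ~3.15M; the full model adds 6 × ~4.2M for the decoder, plus embedding and output projection.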
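The encoder-layer arithmetic can be written as a few small functions. This sketch follows the counting convention of this example (attention projections without biases, FFN with biases); the function names are illustrative.

```python
def attention_params(d_model):
    # W_Q, W_K, W_V, W_O, each d_model x d_model; biases omitted as in the example.
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff):
    # W1 + b1 + W2 + b2
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

def layer_norm_params(d_model):
    # gamma and beta
    return 2 * d_model

def encoder_layer_params(d_model, d_ff):
    return (attention_params(d_model)
            + ffn_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))

per_layer = encoder_layer_params(512, 2048)
print(per_layer)       # 3150336
print(6 * per_layer)   # 18902016
```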

Example 3: Decoder Cross-Attention Shapes

Problem: A decoder processes target sequence of length m=7. Encoder output has length n=15. d_model=512. What are the shapes of Q, K, V in the decoder's cross-attention?

Solution:
Q from decoder self-attention output: (7, 512)
K from encoder output: (15, 512)
V from encoder output: (15, 512)

Attention scores S = QKᵀ/√d_k: (7, 15) — each of 7 target positions computes attention over all 15 source positions.

Output: (7, 512) — same shape as input Q.
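These shapes can be verified with random matrices. The sketch uses a single head, so d_k = d_model = 512 here; that simplification and the random inputs are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_model = 7, 15, 512

Q = rng.normal(size=(m, d_model))   # from the decoder
K = rng.normal(size=(n, d_model))   # from the encoder output
V = rng.normal(size=(n, d_model))   # from the encoder output

S = Q @ K.T / np.sqrt(d_model)      # (7, 15): one row of scores per target position
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)  # row-wise softmax over the 15 source positions
out = A @ V                         # (7, 512)
print(S.shape, out.shape)
```

The sequence length of the output follows Q (the decoder side), while the softmax runs over the encoder side, which is why the source and target lengths are free to differ.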

Quiz

Q1: Which attention type in the Transformer applies a causal mask?

A) Encoder self-attention B) Decoder masked self-attention C) Decoder cross-attention D) All three attention types

Correct: B)

If you chose A: This is incorrect. Encoder self-attention is bidirectional: every position attends to all other positions.
If you chose B: Decoder self-attention is causally masked so that each target position attends only to itself and earlier target positions. Correct!
If you chose C: This is incorrect. Cross-attention sees the full encoder output; no causal mask is applied.
If you chose D: This is incorrect. Only the decoder's self-attention is causally masked.

Q2: What is the primary purpose of the cross-attention sub-layer in a decoder layer?

A) To mix information among target positions only B) To apply a per-position non-linear transformation C) To let each target position attend to the encoder output D) To normalize each position's representation

Correct: C)

If you chose A: This is incorrect. Mixing among target positions is the job of masked self-attention.
If you chose B: This is incorrect. The per-position non-linear transformation is the job of the FFN.
If you chose C: Cross-attention takes Q from the decoder and K, V from the encoder, so each target position can attend to all source positions. Correct!
If you chose D: This is incorrect. Normalization is done by LayerNorm in each Add & Norm step.

Q3: Which statement about the encoder layer is TRUE?

A) It has three sub-layers B) Its self-attention is causally masked C) Its output has the same shape as its input, (n × d_model) D) Its FFN mixes information across positions

Correct: C)

If you chose A: This is incorrect. The encoder layer has two sub-layers; the decoder layer has three.
If you chose B: This is incorrect. Encoder self-attention is bidirectional and unmasked.
If you chose C: The output shape equals the input shape, which is what allows identical layers to be stacked. Correct!
If you chose D: This is incorrect. The FFN is position-wise; only attention mixes information across positions.

Q4: How do the attention and FFN sub-layers divide the work within a layer?

A) The FFN mixes positions, attention transforms features B) Both sub-layers mix positions C) Attention mixes positions, FFN transforms features D) Both sub-layers act independently on each position

Correct: C)

If you chose A: This is incorrect. The roles are reversed: attention mixes positions, FFN transforms features.
If you chose B: This is incorrect. The FFN is applied independently to each position and does not mix them.
If you chose C: Attention mixes positions, FFN transforms features. Correct!
If you chose D: This is incorrect. Attention explicitly combines information from different positions.

Q5: In the original Transformer, how large is the FFN hidden dimension d_ff relative to d_model?

A) 1× B) 2× C) 4× D) 8×

Correct: C)

If you chose A: This is incorrect. The FFN expands the representation; d_ff = 4·d_model (2048 for d_model=512).
If you chose B: This is incorrect. The expansion factor is 4×: d_ff = 2048 for d_model=512.
If you chose C: d_ff = 4·d_model, e.g., 2048 for d_model=512. Correct!
If you chose D: This is incorrect. The expansion factor is 4×: d_ff = 2048 for d_model=512.

Q6: Which of the following is a common pitfall when implementing the full forward pass?

A) Adding positional encodings to the token embeddings B) Wrapping each sub-layer in a residual connection C) Passing the encoder output to every decoder layer D) Applying the causal mask to cross-attention

Correct: D)

If you chose A: This is incorrect. Adding positional encodings to the embeddings is the intended input processing.
If you chose B: This is incorrect. Every sub-layer is wrapped in Add & Norm by design.
If you chose C: This is incorrect. Each decoder layer is meant to attend to the full encoder output.
If you chose D: Cross-attention must see all source positions; a causal mask there would hide later source tokens from the decoder. Correct!

Q7: During training, how does the decoder process the target sequence?

A) One token at a time, feeding back its own predictions B) All target positions in parallel, using teacher forcing with a causal mask C) All target positions in parallel, with no mask D) It does not process the target sequence during training

Correct: B)

If you chose A: This is incorrect. Token-by-token generation describes inference, not training.
If you chose B: Teacher forcing supplies the ground-truth target tokens, and the causal mask keeps each position from seeing later ones, so all positions can be processed at once. Correct!
If you chose C: This is incorrect. Without the causal mask, each position could see the tokens it is supposed to predict.
If you chose D: This is incorrect. The decoder receives the target tokens as input during training.

Practice Problems

Problem 1

List all sub-layers of one decoder layer in order.

Answer

1. Masked Multi-Head Self-Attention → Add & Norm
2. Multi-Head Cross-Attention (Q from decoder, K,V from encoder) → Add & Norm
3. Position-wise Feedforward Network → Add & Norm

Problem 2

Why is the FFN called "position-wise"? What would happen if it weren't position-wise?

Answer

The FFN applies the same weights independently to each position, which makes it equivalent to a convolution with kernel size 1 along the sequence. If it mixed positions (for example, a fully connected layer over the flattened sequence), positions would interact through the FFN in addition to attention, which would be redundant and computationally wasteful, and the layer would be tied to a fixed sequence length. Separation of concerns: attention mixes positions, FFN transforms features.

Problem 3

What is the purpose of residual connections in the Transformer?

Answer

Residual connections provide gradient highways through each sub-layer. Without them, the Transformer's depth (6+ layers) would make it prone to vanishing gradients through the attention and FFN operations. They also help with signal propagation: the input can bypass a sub-layer if the sub-layer's transformation isn't useful initially.

Problem 4

Calculate the total parameter count for a 6-layer encoder + 6-layer decoder Transformer with d_model=512, d_ff=2048, h=8, vocabulary V=37000. Include embeddings and output projection.

Answer

Per encoder layer: ~3.15M (from Example 2). 6 layers = 18.9M.
Per decoder layer: self-attention (1,048,576) + cross-attention (1,048,576) + FFN (2,099,712) + 3 LayerNorms (3,072) ≈ 4.2M. 6 layers = 25.2M.
Embedding: V·d_model = 37000·512 = 18.94M.
Output projection (if not weight-tied): d_model·V = 18.94M.
Total: 18.9 + 25.2 + 18.94 + 18.94 ≈ 82M parameters (approximate; depends on weight tying).
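The totals can be checked with a short script. It follows the same counting convention as the answer (attention without biases, a single embedding table, and an optional separate output projection); the variable names are illustrative.

```python
d_model, d_ff, V, N = 512, 2048, 37000, 6

attn = 4 * d_model * d_model                 # W_Q, W_K, W_V, W_O (no biases)
ffn = 2 * d_model * d_ff + d_ff + d_model    # W1, W2 plus both biases
ln = 2 * d_model                             # gamma and beta

enc_layer = attn + ffn + 2 * ln              # 1 attention block, 2 LayerNorms
dec_layer = 2 * attn + ffn + 3 * ln          # self- and cross-attention, 3 LayerNorms
embedding = V * d_model

layers = N * enc_layer + N * dec_layer
total_untied = layers + 2 * embedding        # separate embedding and output projection
total_tied = layers + embedding              # output projection shares the embedding weights

print(enc_layer, dec_layer)      # 3150336 4199936
print(total_untied, total_tied)  # 81989632 63045632
```

With weight tying the total drops to roughly 63M, which shows how much of the budget at this scale sits in the vocabulary-sized matrices rather than in the layers themselves.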

Problem 5

Why can the encoder process all tokens in parallel during training, but the decoder cannot during inference?

Answer

The encoder uses unmasked self-attention — every token attends to every other token. No sequential dependency. The decoder uses masked (causal) self-attention — token t can only attend to tokens 0...t. During inference, token t depends on token t−1 (which depends on t−2, etc.), creating a sequential chain. During training, teacher forcing provides all target tokens so they can be processed in parallel with causal masking.
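The sequential chain at inference time shows up directly in the decoding loop. In the sketch below, `toy_next_token_logits` is a hypothetical stand-in for the decoder stack plus output projection (it simply favors the id after the last token), and greedy argmax selection is an assumption; the point is only that each step needs the token chosen in the previous step.

```python
import numpy as np

def toy_next_token_logits(enc_output, prefix, V):
    # Hypothetical stand-in for the decoder: prefers (last token id + 1) mod V.
    logits = np.zeros(V)
    logits[(prefix[-1] + 1) % V] = 1.0
    return logits

def greedy_decode(enc_output, bos_id, eos_id, V, max_len=20):
    prefix = [bos_id]
    for _ in range(max_len):
        # The whole prefix generated so far is the decoder input for this step.
        logits = toy_next_token_logits(enc_output, prefix, V)
        next_id = int(np.argmax(logits))
        prefix.append(next_id)      # the next step depends on this choice
        if next_id == eos_id:
            break
    return prefix

tokens = greedy_decode(enc_output=None, bos_id=0, eos_id=4, V=6)
print(tokens)  # [0, 1, 2, 3, 4]
```

The encoder output is computed once before the loop and reused at every step; only the decoder side is repeated.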

Summary

The Transformer consists of an encoder (bidirectional self-attention + FFN) and a decoder (masked self-attention + cross-attention + FFN), each built from stacked identical layers
Each layer wraps sub-layers with residual connections and LayerNorm: output = LN(x + Sublayer(x))
The position-wise FFN (two linear layers with activation) is applied independently to each position, separating feature transformation from inter-position mixing
Cross-attention in the decoder allows each target position to attend to all source positions via Q (decoder) and K,V (encoder)
Teacher forcing with causal masking enables parallel training of the decoder despite its autoregressive nature

Pitfalls

Confusing post-norm and pre-norm placement. The original Transformer used post-norm (Sublayer → Add → Norm), but most modern LLMs use pre-norm (Norm → Sublayer → Add). In post-norm the residual path itself passes through LayerNorm, so gradients are rescaled by LayerNorm's Jacobian at every layer, which can lead to vanishing gradients in deep networks. If implementing from scratch, use pre-norm unless you have a specific reason to replicate the 2017 paper exactly.
Thinking the FFN mixes information across sequence positions. The FFN is position-wise: the same weights are applied independently to each position. Only the attention sub-layers mix information across positions. Adding position-mixing to the FFN (e.g., a 1D convolution with kernel width greater than 1) would duplicate attention's role and waste parameters.
Forgetting that decoder cross-attention Q comes from the decoder, but K and V come from the encoder. A common slip: setting K=V=decoder_states in cross-attention, turning it into another self-attention. The whole point of cross-attention is for the decoder to query the encoder's output. Q from decoder, K,V from encoder.
Applying a causal mask to cross-attention. Cross-attention should see the FULL encoder output — the decoder needs access to all source positions simultaneously. Applying a causal mask to cross-attention would prevent the decoder from attending to later source positions, breaking translation and similar tasks where word order differs between source and target.
Treating the encoder and decoder as interchangeable during inference. The encoder processes the entire input in parallel with unmasked attention — fine during both training and inference. But the decoder is autoregressive: during inference it must run sequentially (token by token), even though teacher forcing enabled parallel processing during training. Designing a system that assumes parallel decoder inference will fail at serving time.

Next Steps

Continue to 17-10 — The Transformer Block (Detailed) for a deep mathematical dive into gradient flow, pre-norm vs. post-norm, and the full forward/backward pass of one block.
