17-09: Transformer Architecture

Phase: 17 - Deep Learning Architectures (Math)
Subject: 17-09
Prerequisites: 17-08 (Multi-Head Attention), 17-05 (Residual Connections), 16-10 (Other Normalization Methods: LayerNorm), 16-02 (Activation Functions)
Next subject: 17-10: The Transformer Block (Detailed)


Learning Objectives

By the end of this subject, you will be able to:

  1. Diagram the complete Transformer encoder-decoder architecture with mathematical precision
  2. Derive the forward pass through an encoder layer: self-attention → add & norm → FFN → add & norm
  3. Derive the forward pass through a decoder layer including masked self-attention and cross-attention
  4. Explain the position-wise feedforward network (FFN) and why it's applied independently per position
  5. Compare the information flow in encoder self-attention vs. decoder masked self-attention vs. cross-attention

Core Content

1. The Big Picture

The Transformer (Vaswani et al., 2017) replaces recurrence with attention. The architecture has two main stacks:

Encoder: Processes the input sequence into contextualized representations. Each position can attend to ALL other positions (bidirectional).

Decoder: Generates output autoregressively. Each position can attend to itself and earlier decoder positions (causal) AND to all encoder positions (cross-attention).

Both stacks are built from repeated identical layers (N = 6 per stack in the original paper).

2. Encoder Layer

Each encoder layer has two sub-layers:

Sub-layer 1: Multi-Head Self-Attention

Z = MultiHead(X, X, X)          [Q = K = V = X: self-attention]
X_out = LayerNorm(X + Z)        [residual + LayerNorm]

Sub-layer 2: Position-wise Feedforward Network

F = FFN(X_out) = ReLU(X_out · W₁ + b₁) · W₂ + b₂
X_final = LayerNorm(X_out + F)  [residual + LayerNorm]

Where:
- X is the input to the layer, shape (n × d_model)
- W₁ ∈ ℝ^(d_model × d_ff), W₂ ∈ ℝ^(d_ff × d_model)
- d_ff is typically 4·d_model (e.g., 2048 for d_model = 512)

⚠️ THIS IS CRITICAL β€” The FFN is position-wise: the same weights are applied independently to every position. The attention sub-layer handles mixing information across positions; the FFN handles transforming the representation at each position.

3. Decoder Layer

Each decoder layer has THREE sub-layers:

Sub-layer 1: Masked Multi-Head Self-Attention

Z₁ = MultiHead(Y, Y, Y, mask=causal_mask) Y₁ = LayerNorm(Y + Z₁)

Sub-layer 2: Multi-Head Cross-Attention

Z₂ = MultiHead(Y₁, X_enc, X_enc)   [Q from decoder; K, V from encoder]
Y₂ = LayerNorm(Y₁ + Z₂)

Sub-layer 3: Position-wise FFN

F = FFN(Y₂)
Y_out = LayerNorm(Y₂ + F)

Each decoder layer attends to the FULL encoder output. In the original Transformer this is the output of the last encoder layer; some variants instead use a weighted combination of all encoder layers. This is how the decoder "looks at" the source sequence.
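A minimal NumPy sketch of the three decoder sub-layers follows, assuming random toy weights, γ = 1 and β = 0 in LayerNorm, and a large negative constant to implement the mask. The helper names (`multi_head`, `decoder_layer`) and the layout of `p` are illustrative choices, not part of the original text. The check at the end perturbs the last target token and confirms that earlier positions are unaffected.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def multi_head(Xq, Xkv, W, h, mask=None):
    """Queries come from Xq (m, d); keys and values come from Xkv (n, d)."""
    Wq, Wk, Wv, Wo = W
    d_k = Xq.shape[1] // h
    def split(M):                                        # (len, d) -> (h, len, d_k)
        return M.reshape(M.shape[0], h, d_k).transpose(1, 0, 2)
    Q, K, V = split(Xq @ Wq), split(Xkv @ Wk), split(Xkv @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, m, n)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)            # block disallowed pairs
    out = softmax(scores) @ V                            # (h, m, d_k)
    return out.transpose(1, 0, 2).reshape(Xq.shape[0], -1) @ Wo

def decoder_layer(Y, X_enc, p, h):
    m = Y.shape[0]
    causal = np.tril(np.ones((m, m), dtype=bool))        # position t sees 0..t
    Y1 = layer_norm(Y + multi_head(Y, Y, p["self"], h, mask=causal))
    Y2 = layer_norm(Y1 + multi_head(Y1, X_enc, p["cross"], h))   # no mask here
    F = np.maximum(0.0, Y2 @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return layer_norm(Y2 + F)

rng = np.random.default_rng(0)
m, n, d, d_ff, h = 3, 5, 8, 32, 2                        # toy sizes (assumed)
def attn_weights():
    return [rng.normal(0.0, 0.3, (d, d)) for _ in range(4)]
p = {"self": attn_weights(), "cross": attn_weights(),
     "W1": rng.normal(0.0, 0.3, (d, d_ff)), "b1": np.zeros(d_ff),
     "W2": rng.normal(0.0, 0.3, (d_ff, d)), "b2": np.zeros(d)}
Y, X_enc = rng.normal(size=(m, d)), rng.normal(size=(n, d))
Y_out = decoder_layer(Y, X_enc, p, h)

# Changing the last target token must not change earlier positions.
Y_changed = Y.copy()
Y_changed[-1] += 1.0
Y_out_changed = decoder_layer(Y_changed, X_enc, p, h)
print(Y_out.shape, np.allclose(Y_out[:-1], Y_out_changed[:-1]))   # (3, 8) True
```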

4. Position-wise Feedforward Network (FFN)

The FFN is a two-layer MLP applied independently to each position:

FFN(x) = W₂ · σ(W₁ · x + b₁) + b₂

Where:
- x ∈ ℝ^d_model is a single position's vector, treated as a column. Section 2 writes the same map in row form (X · W₁), so the weight shapes there are the transposes of the ones here.
- W₁ ∈ ℝ^(d_ff × d_model) expands from d_model to d_ff (typically 4×)
- W₂ ∈ ℝ^(d_model × d_ff) projects back to d_model
- σ = ReLU (original) or GELU/SiLU (modern variants)
- The same map is applied in parallel across all n positions via a batched matrix multiply

Why position-wise? The attention mechanism handles inter-position interactions. The FFN handles per-position non-linear transformation. Separating these concerns is a key design principle:

Information MIXING (attention) → Information TRANSFORMATION (FFN)

Why 4× expansion? The expansion to d_ff = 4·d_model gives the FFN capacity to learn complex transformations. ReLU zeroes a large share of the hidden units for any given input (roughly half when pre-activations are centered near zero), so the effective capacity is lower than the parameter count suggests.

Parameter count per FFN, with d = d_model: 2·d_model·d_ff = 2·d·4d = 8d² (plus d_ff + d_model bias terms)
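The position-wise property and the 8d² weight count can be checked numerically. The sketch below uses the column-vector form FFN(x) = W₂ · σ(W₁ · x + b₁) + b₂ with ReLU; the toy sizes and random weights are assumptions made for illustration.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # One position at a time: W1 is (d_ff, d_model), W2 is (d_model, d_ff).
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

def ffn_batched(X, W1, b1, W2, b2):
    # All n positions at once; rows of X are positions.
    return np.maximum(0.0, X @ W1.T + b1) @ W2.T + b2

rng = np.random.default_rng(0)
n, d_model = 5, 8                    # toy sizes (assumed)
d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_model)
X = rng.normal(size=(n, d_model))

# 1. Batched application equals applying the FFN to each position separately.
per_position = np.stack([ffn(x, W1, b1, W2, b2) for x in X])
batched = ffn_batched(X, W1, b1, W2, b2)
same = np.allclose(per_position, batched)

# 2. Shuffling positions just shuffles the outputs: no cross-position mixing.
perm = rng.permutation(n)
equivariant = np.allclose(ffn_batched(X[perm], W1, b1, W2, b2), batched[perm])

# 3. Weight count is 2 * d_model * d_ff = 8 * d_model**2 (biases counted separately).
n_weights = W1.size + W2.size
n_params = n_weights + b1.size + b2.size
print(same, equivariant, n_weights == 8 * d_model**2)   # True True True
```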

5. Residual Connections and Layer Normalization

Every sub-layer is wrapped in Add & Norm:

output = LayerNorm(x + Sublayer(x))

The residual connection (x + ...) provides the gradient highway (17-05). The LayerNorm (LN) stabilizes training by normalizing each position's representation.

LayerNorm (review from 16-10): For a vector x ∈ ℝ^d:

LN(x) = γ ⊙ (x − μ)/σ + β

Where μ = (1/d)Σx_i, σ² = (1/d)Σ(x_i − μ)², and γ, β ∈ ℝ^d are learnable parameters. In practice a small constant ε is added under the square root for numerical stability.

The normalization order matters: The original Transformer uses post-norm (sublayer → add → norm). Modern Transformers often use pre-norm (norm → sublayer → add) for better training stability (see 17-10 for a detailed comparison).
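The LayerNorm formula and the two orderings can be written out directly. In this sketch the `eps` term, the toy input, and the use of `np.tanh` as a stand-in sub-layer are assumptions for illustration only.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the feature axis, separately for each position.
    mu = x.mean(axis=-1, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def post_norm(x, sublayer, gamma, beta):
    # Original Transformer: sublayer -> add -> norm
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_norm(x, sublayer, gamma, beta):
    # Common modern variant: norm -> sublayer -> add
    return x + sublayer(layer_norm(x, gamma, beta))

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))                 # 3 positions, d features (toy sizes)
gamma, beta = np.ones(d), np.zeros(d)
toy_sublayer = np.tanh                      # stand-in for attention or the FFN

post = post_norm(x, toy_sublayer, gamma, beta)
pre = pre_norm(x, toy_sublayer, gamma, beta)
```

With post-norm every output row is normalized, so the residual stream itself is rescaled at each sub-layer. With pre-norm the input x passes through the addition unchanged, which is one reason it is often reported to train more stably.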

6. Input and Output Processing

Encoder Input:
1. Token embedding lookup: input tokens → E_in ∈ ℝ^(n×d_model)
2. Add positional encoding: X = E_in + PE (see 18-03 for details)
3. Feed into encoder stack

Decoder Input:
1. Token embedding lookup: output tokens → E_out ∈ ℝ^(m×d_model)
2. Add positional encoding: Y = E_out + PE
3. Feed into decoder stack

Decoder Output: final linear projection + softmax: P = softmax(Y_final · W_out)

Where Y_final is the output of the last decoder layer, W_out ∈ ℝ^(d_model × V) maps to vocabulary size V, and the softmax gives a probability distribution over tokens at each target position.
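The input and output steps can be sketched as below. The sinusoidal form of PE is the one used in the original paper; the vocabulary size, token ids, random embedding table, and the shortcut of skipping the decoder stack are assumptions made to keep the example short.

```python
import numpy as np

def sinusoidal_pe(length, d_model):
    # PE[pos, 2k] = sin(pos / 10000^(2k/d_model)), PE[pos, 2k+1] = cos(same angle).
    # Assumes d_model is even.
    pos = np.arange(length)[:, None]
    two_k = np.arange(0, d_model, 2)[None, :]
    angle = pos / (10000.0 ** (two_k / d_model))
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
V, d_model = 50, 16                         # toy vocabulary and width (assumed)
E = rng.normal(size=(V, d_model))           # embedding table, one row per token
W_out = rng.normal(size=(d_model, V))       # output projection

target_ids = np.array([3, 17, 42, 8])       # hypothetical target tokens
m = len(target_ids)
Y = E[target_ids] + sinusoidal_pe(m, d_model)    # (m, d_model)

Y_final = Y                                 # stand-in: the decoder stack is skipped here
P = softmax(Y_final @ W_out)                # (m, V), one distribution per position
print(Y.shape, P.shape)                     # (4, 16) (4, 50)
```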

7. Information Flow Summary

Attention Type            | Q source       | K, V source     | Masking              | Purpose
Encoder Self-Attn         | Encoder input  | Encoder input   | None (bidirectional) | Contextualize each token
Decoder Masked Self-Attn  | Decoder input  | Decoder input   | Causal               | Autoregressive context
Decoder Cross-Attn        | Decoder state  | Encoder output  | None                 | Attend to source sequence

The encoder processes the entire input at once (non-autoregressive). The decoder generates one token at a time (autoregressive), but during training it processes all target tokens in parallel using teacher forcing with causal masking.
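To see why causal masking makes parallel training equivalent to step-by-step generation, the sketch below compares one masked pass over the full target sequence with separate passes over each prefix. It is a simplified, single-head illustration: the learned projections are omitted (Q = K = V = Y), and the sizes are arbitrary.

```python
import numpy as np

def causal_self_attention(Y):
    # Single head, no learned projections: a deliberate simplification.
    m, d = Y.shape
    scores = Y @ Y.T / np.sqrt(d)                       # (m, m)
    mask = np.tril(np.ones((m, m), dtype=bool))         # row t may see columns 0..t
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ Y, w

rng = np.random.default_rng(0)
m, d = 5, 8
Y = rng.normal(size=(m, d))

# Training view: all m positions in one pass.
parallel, weights = causal_self_attention(Y)

# Inference view: at step t only tokens 0..t exist; keep the newest position.
stepwise = np.stack([causal_self_attention(Y[:t + 1])[0][-1] for t in range(m)])

print(np.allclose(parallel, stepwise))   # True
```

The attention weights above the diagonal are exactly zero, so row t of the parallel pass never sees tokens after t.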

8. Full Forward Pass (Algorithmic View)

Encoder(input_ids, input_mask):
    X = Embed(input_ids) + PositionalEncoding
    for each encoder_layer:
        X = encoder_layer(X, mask=input_mask)
    return X  # shape: (n, d_model)

Decoder(target_ids, enc_output, target_mask, enc_mask):
    Y = Embed(target_ids) + PositionalEncoding
    for each decoder_layer:
        Y = decoder_layer(Y, enc_output, 
                          self_mask=target_mask,
                          cross_mask=enc_mask)
    logits = Y @ W_out + b_out  # (m, V)
    return softmax(logits)

9. Training Objective

The Transformer is typically trained with teacher forcing:

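Loss = −(1/m) Σ_{t=1}^m Σ_{v=1}^V y_{t,v} · log(p_{t,v})

This is categorical cross-entropy loss (see 16-04), averaged over the m target positions. Here y_{t,v} is 1 if the correct token at position t is v and 0 otherwise, and p_{t,v} is the predicted probability of token v at position t. With teacher forcing, the decoder receives the ground-truth previous tokens as input rather than its own predictions. The causal mask ensures that the prediction at position t depends only on earlier target tokens, never on future ones.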
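A small sketch of the loss, averaged over target positions, follows. The token sequence, the vocabulary size, and the uniform predictions are made up for illustration; a uniform model should give a loss of log V, which makes the result easy to check.

```python
import numpy as np

def cross_entropy(P, labels):
    # P: (m, V) predicted probabilities; labels: (m,) correct token ids.
    # With one-hot y, the double sum keeps only log p[t, labels[t]].
    m = len(labels)
    return -np.mean(np.log(P[np.arange(m), labels]))

V = 5
tokens = np.array([0, 3, 1, 4, 2])      # hypothetical target sequence; 0 acts as a start token

# Teacher forcing: the decoder input is the target shifted right by one.
decoder_input, labels = tokens[:-1], tokens[1:]
m = len(labels)

P = np.full((m, V), 1.0 / V)            # a model that predicts uniformly
loss_uniform = cross_entropy(P, labels)

# Same value via the explicit one-hot double sum.
one_hot = np.eye(V)[labels]
loss_formula = -(one_hot * np.log(P)).sum() / m

print(np.isclose(loss_uniform, np.log(V)), np.isclose(loss_uniform, loss_formula))   # True True
```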



Key Terms

Worked Examples

Example 1: Dimension Tracking Through Encoder Layer

Problem: Input sequence of length n=10, d_model=512, d_ff=2048. Track the shapes through one encoder layer.

Solution: Input X: (10, 512)

Multi-Head Self-Attention:
- Q, K, V projections: each (10, 512) @ (512, 512) = (10, 512)
- (For h=8: internally reshape to (8, 10, 64))
- Attention output Z: (10, 512)
- Add & Norm: X_out = LN(X + Z) → (10, 512)

FFN:
- W₁: (10, 512) @ (512, 2048) = (10, 2048)
- ReLU: (10, 2048) [approx half zero]
- W₂: (10, 2048) @ (2048, 512) = (10, 512)
- Add & Norm: X_final = LN(X_out + F) → (10, 512)

Output: (10, 512), the same shape as the input, which is what allows layers to be stacked.

Example 2: Parameter Count for One Transformer Layer

Problem: Compute the total parameter count for one encoder layer with d_model=512, d_ff=2048, h=8.

Solution: Multi-Head Attention (W_Q, W_K, W_V, W_O; biases omitted): 4d² = 4·512² = 1,048,576

FFN: W₁ (512×2048 = 1,048,576) + b₁ (2048) + W₂ (2048×512 = 1,048,576) + b₂ (512) = 2,099,712

LayerNorm × 2: each has γ (512) + β (512) = 1,024. Two norms = 2,048

Total per encoder layer: 1,048,576 + 2,099,712 + 2,048 = 3,150,336 ≈ 3.15M parameters

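For N=6 encoder layers: ~18.9M. A decoder layer is larger, about 4.2M, because it adds a cross-attention block (1,048,576) and a third LayerNorm (1,024); 6 decoder layers come to ~25.2M. Embeddings and the output projection are counted on top of that.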
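The per-layer counts can be reproduced by listing every weight tensor's shape and summing the sizes. The tensor names are illustrative labels, and attention biases are omitted to match the example above.

```python
from math import prod

d_model, d_ff = 512, 2048

def attention_shapes(prefix):
    # Four square projection matrices per attention block (no biases, as in the example).
    return {f"{prefix}_{name}": (d_model, d_model) for name in ("W_Q", "W_K", "W_V", "W_O")}

def layer_norm_shapes(prefix):
    return {f"{prefix}_gamma": (d_model,), f"{prefix}_beta": (d_model,)}

ffn_shapes = {"W1": (d_model, d_ff), "b1": (d_ff,), "W2": (d_ff, d_model), "b2": (d_model,)}

encoder_shapes = {**attention_shapes("self"), **ffn_shapes,
                  **layer_norm_shapes("ln1"), **layer_norm_shapes("ln2")}

# A decoder layer adds a cross-attention block and a third LayerNorm.
decoder_shapes = {**encoder_shapes, **attention_shapes("cross"), **layer_norm_shapes("ln3")}

encoder_total = sum(prod(shape) for shape in encoder_shapes.values())
decoder_total = sum(prod(shape) for shape in decoder_shapes.values())
print(f"{encoder_total:,} {decoder_total:,}")   # 3,150,336 4,199,936
```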

Example 3: Decoder Cross-Attention Shapes

Problem: A decoder processes target sequence of length m=7. Encoder output has length n=15. d_model=512. What are the shapes of Q, K, V in the decoder's cross-attention?

Solution:
- Q from decoder self-attention output: (7, 512)
- K from encoder output: (15, 512)
- V from encoder output: (15, 512)

Attention scores S = QKᵀ/√d_k: (7, 15) per head. Each of the 7 target positions computes attention over all 15 source positions.

Output: (7, 512), the same shape as the input Q.
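These shapes can be confirmed with a short NumPy run. The random inputs and weights are placeholders, and h = 8 heads is an assumption carried over from the earlier examples; with multiple heads the (7, 15) score matrix exists once per head.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_model, h = 7, 15, 512, 8
d_k = d_model // h                                   # 64 per head

Y1 = rng.normal(size=(m, d_model))                   # decoder state after masked self-attention
X_enc = rng.normal(size=(n, d_model))                # encoder output
Wq, Wk, Wv = (rng.normal(0.0, 0.02, (d_model, d_model)) for _ in range(3))

Q, K, V = Y1 @ Wq, X_enc @ Wk, X_enc @ Wv            # (7, 512), (15, 512), (15, 512)

def split(M):                                        # (len, d_model) -> (h, len, d_k)
    return M.reshape(M.shape[0], h, d_k).transpose(1, 0, 2)

S = split(Q) @ split(K).transpose(0, 2, 1) / np.sqrt(d_k)    # (8, 7, 15)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)                   # each row sums to 1 over the 15 source positions
Z2 = (A @ split(V)).transpose(1, 0, 2).reshape(m, d_model)   # (7, 512), before W_O

print(Q.shape, K.shape, V.shape, S.shape, Z2.shape)
# (7, 512) (15, 512) (15, 512) (8, 7, 15) (7, 512)
```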


Quiz

Q1: In the information flow summary, which attention type takes its queries from the decoder and its keys and values from the encoder output?

A) Encoder self-attention
B) Decoder cross-attention
C) Decoder masked self-attention
D) None of the attention types

Correct: B)

Q2: What is the primary purpose of the cross-attention sub-layer in a decoder layer?

A) To prevent target positions from attending to future tokens
B) To add positional information to the target embeddings
C) To let each target position attend to all positions of the encoder output
D) To expand each position's representation from d_model to d_ff

Correct: C)

Q3: Which statement about the encoder layer is TRUE?

A) It has three sub-layers
B) Its self-attention uses a causal mask
C) Its output has the same shape as its input, which allows layers to be stacked
D) Its FFN mixes information across positions

Correct: C)

Q4: Based on the worked examples and practice problems in this subject, how do attention and the FFN divide the work?

A) The FFN mixes positions, attention transforms features
B) Both attention and the FFN mix positions
C) Attention mixes positions, the FFN transforms features
D) Both attention and the FFN act on each position independently

Correct: C)

Q5: How are an encoder layer and a decoder layer related?

A) They are identical
B) A decoder layer has no FFN
C) A decoder layer has the same self-attention and FFN sub-layers (with the self-attention causally masked) plus a cross-attention sub-layer between them
D) A decoder layer has no residual connections or LayerNorm

Correct: C)

Q6: Which of the following is a mistake when implementing the full forward pass?

A) Adding positional encodings to the token embeddings
B) Passing the encoder output to every decoder layer
C) Applying a causal mask in decoder self-attention
D) Applying the causal mask in cross-attention, so target positions cannot see all source positions

Correct: D)

Q7: According to the information flow summary, where is causal masking applied?

A) In encoder self-attention
B) In decoder self-attention
C) In decoder cross-attention
D) In the position-wise FFN

Correct: B)

Practice Problems

Problem 1

List all sub-layers of one decoder layer in order.

Answer
1. Masked Multi-Head Self-Attention → Add & Norm
2. Multi-Head Cross-Attention (Q from decoder; K, V from encoder) → Add & Norm
3. Position-wise Feedforward Network → Add & Norm

Problem 2

Why is the FFN called "position-wise"? What would happen if it weren't position-wise?

Answer The FFN applies the same weights independently to each position; it is equivalent to a convolution with kernel size 1 along the sequence. If it mixed positions (like a regular feedforward layer over the flattened sequence), positions would interact through the FFN in addition to attention, which is redundant and computationally wasteful, and the layer would also be tied to a fixed sequence length. Separation of concerns: attention mixes positions, FFN transforms features.

Problem 3

What is the purpose of residual connections in the Transformer?

Answer Residual connections provide gradient highways through each sub-layer. Without them, the Transformer's depth (6+ layers, each with several sub-layers) would make vanishing gradients through the attention and FFN operations much more likely. They also help with signal propagation: the input can bypass a sub-layer if the sub-layer's transformation isn't useful initially.

Problem 4

Calculate the total parameter count for a 6-layer encoder + 6-layer decoder Transformer with d_model=512, d_ff=2048, h=8, vocabulary V=37000. Include embeddings and output projection.

Answer
Per encoder layer: ~3.15M (from Example 2). 6 layers ≈ 18.9M.
Per decoder layer: self-attention (1,048,576) + cross-attention (1,048,576) + FFN (2,099,712) + 3 LayerNorms (3,072) ≈ 4.2M. 6 layers ≈ 25.2M.
Embedding: V·d_model = 37000·512 ≈ 18.94M.
Output projection (if not weight-tied): d_model·V ≈ 18.94M.
Total: 18.9 + 25.2 + 18.94 + 18.94 ≈ 82M parameters (approximate; with the output projection tied to the embedding, about 63M).
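The totals can be double-checked with a few lines of arithmetic. As in the answer, this sketch counts a single embedding table, omits attention biases, and reports both the tied and untied cases; the dictionary keys are just labels.

```python
d_model, d_ff, V, N = 512, 2048, 37000, 6

attn = 4 * d_model ** 2                        # W_Q, W_K, W_V, W_O
ffn = 2 * d_model * d_ff + d_ff + d_model      # two weight matrices plus biases
ln = 2 * d_model                               # gamma and beta

counts = {
    "encoder": N * (attn + ffn + 2 * ln),
    "decoder": N * (2 * attn + ffn + 3 * ln),  # self-attn + cross-attn, three norms
    "embedding": V * d_model,
    "output_projection": d_model * V,
}
total_untied = sum(counts.values())
total_tied = total_untied - counts["output_projection"]

for name, count in counts.items():
    print(f"{name:>18}: {count:>12,}")
print(f"{'total (untied)':>18}: {total_untied:>12,}")   # 81,989,632
print(f"{'total (tied)':>18}: {total_tied:>12,}")       # 63,045,632
```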

Problem 5

Why can the encoder process all tokens in parallel during training, but the decoder cannot during inference?

Answer The encoder uses unmasked self-attention: every token attends to every other token, so there is no sequential dependency. The decoder uses masked (causal) self-attention: token t can only attend to tokens 0...t. During inference, token t depends on token t−1 (which depends on t−2, etc.), creating a sequential chain. During training, teacher forcing provides all target tokens so they can be processed in parallel with causal masking.

Summary

  1. The Transformer consists of an encoder (bidirectional self-attention + FFN) and a decoder (masked self-attention + cross-attention + FFN), each built from stacked identical layers
  2. Each layer wraps sub-layers with residual connections and LayerNorm: output = LN(x + Sublayer(x))
  3. The position-wise FFN (two linear layers with activation) is applied independently to each position, separating feature transformation from inter-position mixing
  4. Cross-attention in the decoder allows each target position to attend to all source positions via Q (decoder) and K,V (encoder)
  5. Teacher forcing with causal masking enables parallel training of the decoder despite its autoregressive nature

Pitfalls



Next Steps

Continue to 17-10: The Transformer Block (Detailed) for a deep mathematical dive into gradient flow, pre-norm vs. post-norm, and the full forward/backward pass of one block.