17-10 — The Transformer Block (Detailed)

Phase: 17 — Deep Learning Architectures (Math)
Subject: 17-10
Prerequisites: 17-09 (Transformer Architecture), 17-08 (Multi-Head Attention), 17-05 (Residual Connections), 16-10 (LayerNorm/RMSNorm), 16-05 (Backpropagation)
Next subject: 18-01 — Tokenization Mathematics

Learning Objectives

By the end of this subject, you will be able to:

Derive the complete forward pass of one Transformer block as a system of matrix equations
Trace the gradient through the attention mechanism, residual connections, and FFN
Mathematically prove why pre-norm is more stable than post-norm for deep Transformers
Compute the gradient norm scaling through repeated blocks under both norm placements
Explain how RMSNorm differs from LayerNorm and why modern LLMs prefer it

Core Content

1. The Transformer Block as a System of Equations

Let x ∈ ℝ^(n×d) be the input to a block (after positional encoding + previous blocks). A modern pre-norm Transformer block computes:

Step 1: Attention sub-block

x₁ = RMSNorm(x)
a = MultiHead(x₁, x₁, x₁) [self-attention]
x₂ = x + a [residual]

Step 2: FFN sub-block

x₃ = RMSNorm(x₂)
f = FFN(x₃) = W₂ · σ(W₁ · x₃ + b₁) + b₂
x₄ = x₂ + f [residual]

⚠️ THIS IS CRITICAL: Note the order: NORM → SUBLAYER → ADD. This is pre-norm. The normalization is applied BEFORE each sub-layer, and the residual connection adds the un-normalized input (x, not Norm(x)) to the sub-layer output. This means the residual stream carries the "raw" signal, and normalization only affects what the sub-layer sees.
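The two steps above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: a single attention head stands in for MultiHead, σ is taken to be ReLU, tokens are rows (so weights multiply on the right), and the function names, shapes, and random weights are all hypothetical.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Normalize each row (token) by its root mean square, then scale by gamma.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv, Wo):
    # Single-head stand-in for MultiHead(x, x, x); an assumption of this sketch.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ Wo

def ffn(x, W1, b1, W2, b2):
    # Row-vector form of W2 · σ(W1 · x + b1) + b2 with σ = ReLU.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def pre_norm_block(x, p):
    x1 = rms_norm(x, p["g1"])
    a = self_attention(x1, p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    x2 = x + a                       # residual adds the un-normalized x
    x3 = rms_norm(x2, p["g2"])
    f = ffn(x3, p["W1"], p["b1"], p["W2"], p["b2"])
    return x2 + f                    # x4

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32               # hypothetical sizes
p = {
    "g1": np.ones(d), "g2": np.ones(d),
    "Wq": rng.normal(0, 0.1, (d, d)), "Wk": rng.normal(0, 0.1, (d, d)),
    "Wv": rng.normal(0, 0.1, (d, d)), "Wo": rng.normal(0, 0.1, (d, d)),
    "W1": rng.normal(0, 0.1, (d, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(0, 0.1, (d_ff, d)), "b2": np.zeros(d),
}
x = rng.normal(size=(n, d))
x4 = pre_norm_block(x, p)
print(x4.shape)
```

If the output projections `Wo` and `W2` are set to zero, both sub-layers contribute nothing and the block returns x unchanged, which is the residual stream carrying the raw signal.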

2. Post-Norm vs. Pre-Norm: The Critical Difference

Post-norm (original Transformer):

x₂ = LayerNorm(x + Sublayer(x))

The normalization is applied AFTER the residual addition. The residual stream passes through LayerNorm at every block.

Pre-norm (modern Transformers, GPT, Llama, etc.):

x₂ = x + Sublayer(Norm(x))

The normalization is applied BEFORE the sub-layer. The residual stream accumulates WITHOUT normalization gating.
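The difference between the two placements can be made concrete with a small NumPy sketch. It assumes a LayerNorm without learnable γ and β and uses a fixed random linear map as a hypothetical stand-in for the sub-layer; the names and sizes are illustrative only.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm without learnable gamma/beta (a simplification for this sketch).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_step(x, sublayer):
    return layer_norm(x + sublayer(x))      # Add -> Norm

def pre_norm_step(x, sublayer):
    return x + sublayer(layer_norm(x))      # Norm -> Sublayer -> Add

rng = np.random.default_rng(0)
d = 16
W = rng.normal(0, 0.5, (d, d))

def sublayer(h):
    # Hypothetical stand-in for attention or the FFN.
    return h @ W

x_post = x_pre = rng.normal(size=(3, d))
for _ in range(8):
    x_post = post_norm_step(x_post, sublayer)
    x_pre = pre_norm_step(x_pre, sublayer)

# Post-norm: every row is re-normalized, so its standard deviation is pinned at 1.
print(x_post.std(axis=-1))
# Pre-norm: the residual stream is never normalized, so its scale is free to drift.
print(x_pre.std(axis=-1))
```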

3. Why Pre-Norm Wins: The Gradient Analysis

Let's analyze gradient propagation through L consecutive blocks.

Post-norm (simplified):

Block ℓ: x_{ℓ+1} = LN(x_ℓ + F_ℓ(x_ℓ))

The Jacobian involves LN derivatives:

∂x_{ℓ+1}/∂x_ℓ = ∂LN/∂(·) · (I + ∂F_ℓ/∂x_ℓ)

The LayerNorm derivative is: ∂LN(x)/∂x = (γ/σ)(I − 11ᵀ/d − (x̂x̂ᵀ)/d) where x̂ is the normalized vector. This has eigenvalues bounded by γ/σ.
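The closed form above can be checked numerically. The sketch below assumes a scalar γ, no β, and no ε term, and compares the formula against central finite differences for one hypothetical input vector.

```python
import numpy as np

def layer_norm(x, gamma=1.0):
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())
    return gamma * (x - mu) / sigma

def layer_norm_jacobian(x, gamma=1.0):
    # (gamma / sigma) * (I - 11^T/d - x_hat x_hat^T / d), with scalar gamma.
    d = x.size
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())
    x_hat = (x - mu) / sigma
    ones = np.ones(d)
    proj = np.eye(d) - np.outer(ones, ones) / d - np.outer(x_hat, x_hat) / d
    return (gamma / sigma) * proj

def numerical_jacobian(f, x, h=1e-6):
    # Central finite differences, one input coordinate at a time.
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([0.5, -1.0, 2.0, 3.5])
J = layer_norm_jacobian(x)
J_num = numerical_jacobian(layer_norm, x)
sigma = x.std()

print(np.allclose(J, J_num, atol=1e-6))
# Spectral norm of the Jacobian next to the gamma/sigma bound (gamma = 1 here).
print(np.linalg.norm(J, 2), 1.0 / sigma)
```

The bracketed matrix is an orthogonal projection (it removes the all-ones direction and the x̂ direction), so its eigenvalues are 0 or 1 and the Jacobian's spectral norm cannot exceed γ/σ.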

If γ/σ < 1, for example when the pre-normalization activations have a large standard deviation σ (γ is typically initialized to 1), the LN derivative attenuates gradients. After L blocks:

||∂x_L/∂x₀|| ≤ (||∂LN|| · (1 + ||J_F||))^L

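Each block's gradient is multiplied by ||∂LN||, which can be less than 1, so the bound can shrink exponentially with depth.

Pre-norm (simplified):

Block ℓ: x_{ℓ+1} = x_ℓ + F_ℓ(LN(x_ℓ))

The Jacobian is:

∂x_{ℓ+1}/∂x_ℓ = I + ∂F_ℓ/∂(LN(x_ℓ)) · ∂LN/∂x_ℓ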

The identity term I is UNTOUCHED. No matter what LN does inside F_ℓ, the residual path has gradient multiplier I.

After L blocks:

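∂x_L/∂x₀ = ∏_{ℓ=0}^{L−1} (I + J_ℓ) = I + (terms involving F derivatives), where J_ℓ = ∂F_ℓ/∂(LN(x_ℓ)) · ∂LN/∂x_ℓ

The identity term gives the gradient a direct path from x_L back to x₀ that is never multiplied by a normalization Jacobian. As long as the F-derivative terms do not cancel it, ||∂x_L/∂x₀|| stays on the order of 1 or larger, so gradients cannot vanish through the residual path.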

Empirical confirmation: Post-norm Transformers are typically limited to ~12-24 layers. Pre-norm Transformers can scale to 100+ layers (GPT-3: 96 layers, Llama 70B: 80 layers).
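The pre-norm Jacobian structure I + ∂F/∂(Norm(x)) · ∂Norm/∂x can be verified numerically on a toy stack. The sketch below makes several simplifying assumptions: a single token vector, RMSNorm without γ as the norm, and hypothetical linear sub-layers F_ℓ(u) = W_ℓ u. It multiplies the per-block Jacobians by the chain rule and compares the product with finite differences through the whole stack.

```python
import numpy as np

def rms_norm(x):
    return x / np.sqrt(np.mean(x ** 2))

def rms_norm_jacobian(x):
    # (1 / rms) * (I - x_hat x_hat^T / d), with x_hat = x / rms.
    d = x.size
    rms = np.sqrt(np.mean(x ** 2))
    x_hat = x / rms
    return (np.eye(d) - np.outer(x_hat, x_hat) / d) / rms

def numerical_jacobian(f, x, h=1e-6):
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

rng = np.random.default_rng(0)
d, L = 6, 4
Ws = [rng.normal(0, 0.3, (d, d)) for _ in range(L)]  # hypothetical linear sub-layers

def pre_norm_stack(x):
    for W in Ws:
        x = x + W @ rms_norm(x)      # x_{l+1} = x_l + F_l(Norm(x_l))
    return x

# Chain rule: multiply the per-block Jacobians I + W_l · dNorm/dx along the forward pass.
x0 = rng.normal(size=d)
x, J_total = x0.copy(), np.eye(d)
for W in Ws:
    J_block = np.eye(d) + W @ rms_norm_jacobian(x)
    J_total = J_block @ J_total
    x = x + W @ rms_norm(x)

J_num = numerical_jacobian(pre_norm_stack, x0)
print(np.allclose(J_total, J_num, atol=1e-5))
print(np.linalg.norm(J_total, 2))    # spectral norm of the end-to-end Jacobian
```

Setting every `W` to zero makes each block Jacobian exactly I, so the end-to-end Jacobian is the identity no matter how many blocks are stacked.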

4. Gradient Through Attention (Pre-Norm Block)

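Let's trace the full gradient from the loss back through x₄. In this section and the next, L denotes the loss rather than the number of blocks. We need ∂L/∂x and ∂L/∂(attention params).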

Backward through the attention sub-block:

x₂ = x + MultiHead(Norm(x), Norm(x), Norm(x))

Given ∂L/∂x₂:

∂L/∂x = ∂L/∂x₂ (identity path from residual)

Plus gradient through the attention path:

∂L/∂x += ∂L/∂x₂ · ∂Attn/∂(Norm(x)) · ∂Norm/∂x

The identity term guarantees ∂L/∂x doesn't vanish. The attention path contributes additional gradient that depends on the attention computation itself.

Gradient through attention parameters:

∂L/∂W_i^Q = ∂L/∂x₂ · (∂Attn/∂Q_i) · (∂Q_i/∂W_i^Q)

Where ∂Attn/∂Q_i involves the softmax gradient from 17-06 — potentially small if the attention is saturated, but the FFN gradient provides a complementary signal.
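The saturation effect can be seen directly in the softmax Jacobian, ∂softmax_i/∂z_j = p_i(δ_ij − p_j). The following sketch compares a uniform attention row with a saturated one; the logit values are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # d softmax_i / d z_j = p_i * (delta_ij - p_j)
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

uniform = softmax_jacobian(np.array([0.0, 0.0, 0.0]))     # attention spread evenly
saturated = softmax_jacobian(np.array([20.0, 0.0, 0.0]))  # attention locked onto one key

print(np.linalg.norm(uniform))    # Frobenius norm, about 0.47
print(np.linalg.norm(saturated))  # many orders of magnitude smaller
```

When one logit dominates, every entry of the Jacobian is close to zero, so very little gradient reaches Q and K through that attention row.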

5. Gradient Through FFN (Pre-Norm Block)

x₄ = x₂ + FFN(Norm(x₂))

Backward:

∂L/∂x₂ = ∂L/∂x₄ (identity)
∂L/∂x₂ += ∂L/∂x₄ · ∂FFN/∂(Norm(x₂)) · ∂Norm/∂x₂

The FFN gradient decomposes as:

∂FFN/∂x = W₂ · diag(σ'(W₁x + b₁)) · W₁

For ReLU: diag entries are 0 or 1. For GELU: entries range from about −0.13 to about 1.13 (the derivative dips slightly below zero for moderately negative inputs).

Key insight: The FFN gradient can vanish if too many neurons are inactive (dying ReLU), but the residual identity path always provides a baseline gradient.
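A small NumPy sketch of the FFN Jacobian W₂ · diag(σ'(W₁x + b₁)) · W₁ with ReLU, checked against finite differences, followed by the dying-ReLU case. The sizes, random weights, and the bias value used to switch units off are hypothetical.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Column-vector convention with ReLU as the activation.
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def ffn_jacobian(x, W1, b1, W2):
    # W2 · diag(σ'(W1 x + b1)) · W1; for ReLU, σ' is 1 on active units and 0 otherwise.
    active = (W1 @ x + b1 > 0).astype(float)
    return W2 @ (active[:, None] * W1)

rng = np.random.default_rng(0)
d, d_ff = 4, 16
W1, b1 = rng.normal(size=(d_ff, d)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), rng.normal(size=d)
x = rng.normal(size=d)

J = ffn_jacobian(x, W1, b1, W2)

# Finite-difference check (valid as long as no pre-activation sits exactly at 0).
h = 1e-6
J_num = np.stack(
    [(ffn(x + h * e, W1, b1, W2, b2) - ffn(x - h * e, W1, b1, W2, b2)) / (2 * h)
     for e in np.eye(d)],
    axis=1,
)
print(np.allclose(J, J_num, atol=1e-5))

# Dying ReLU: a large negative bias switches every unit off, so the FFN Jacobian
# is zero and only the residual identity term survives in I + dFFN/dx.
b_dead = np.full(d_ff, -100.0)
J_dead = ffn_jacobian(x, W1, b_dead, W2)
print(np.abs(J_dead).max())
```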

6. Effect on Training Dynamics

Post-norm: The output of each block is ALWAYS normalized → all positions have the same scale → gradient magnitude controlled but potentially attenuated.

Pre-norm: The residual stream accumulates → later layers can have larger activation norms → larger gradients → faster learning. This is beneficial early in training but can make later layers have larger updates than early layers.

Modern practice: Pre-norm + proper initialization ensures stable training. The identity gradient path does not amplify anything; it acts as a direct route (a "gradient highway") that prevents any layer from being starved of learning signal.

7. RMSNorm vs LayerNorm

Modern LLMs (Llama, Mistral, Gemma) use RMSNorm instead of LayerNorm:

LayerNorm:

μ = (1/d)Σx_i
σ² = (1/d)Σ(x_i − μ)²
LN(x) = γ ⊙ (x − μ)/σ + β

RMSNorm:

rms = √((1/d)Σx_i²)
RMSNorm(x) = γ ⊙ x / rms (no mean subtraction, no β shift)

Why RMSNorm?
1. Fewer operations (no mean computation)
2. Slightly faster (one less pass over data)
3. Empirically equivalent or better for Transformers
4. Mean subtraction discards the per-token mean, which might itself carry useful information
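Both normalizations fit in a few lines of NumPy. The sketch below follows the formulas above, with an ε added under the square root for numerical safety (an assumption; the formulas in this section omit it) and a hypothetical example input.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

d = 4
gamma, beta = np.ones(d), np.zeros(d)
x = np.array([[1.0, 2.0, 3.0, 6.0]])

print(layer_norm(x, gamma, beta))  # centered: the row has zero mean
print(rms_norm(x, gamma))          # not centered: the row keeps a nonzero mean

# On an input that already has zero mean, the two coincide (for beta = 0).
x_centered = x - x.mean(axis=-1, keepdims=True)
print(np.allclose(layer_norm(x_centered, gamma, beta), rms_norm(x_centered, gamma)))
```

The last check shows that RMSNorm is LayerNorm with μ fixed at 0 and β dropped: the only work saved is the mean computation and the shift.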

Gradient of RMSNorm:

∂RMSNorm(x)/∂x = (γ/rms) · (I − x̂x̂ᵀ/d)

Where x̂ = x/rms. The structure is similar to LayerNorm without the 11ᵀ term.
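Two consequences of this Jacobian can be checked numerically, assuming a scalar γ and no ε: RMSNorm ignores the scale of its input, so the Jacobian sends x itself to zero, and the all-ones direction is not projected out because the 11ᵀ term is absent. The input vector is a hypothetical example.

```python
import numpy as np

def rms_norm(x, gamma=1.0):
    return gamma * x / np.sqrt(np.mean(x ** 2))

def rms_norm_jacobian(x, gamma=1.0):
    # (gamma / rms) * (I - x_hat x_hat^T / d), with scalar gamma and x_hat = x / rms.
    d = x.size
    rms = np.sqrt(np.mean(x ** 2))
    x_hat = x / rms
    return (gamma / rms) * (np.eye(d) - np.outer(x_hat, x_hat) / d)

x = np.array([1.0, -2.0, 0.5, 3.0])
J = rms_norm_jacobian(x)

# RMSNorm(c·x) = RMSNorm(x) for any c > 0 ...
print(np.allclose(rms_norm(3.0 * x), rms_norm(x)))
# ... so moving along x changes nothing, and the Jacobian maps x to zero.
print(np.allclose(J @ x, 0.0))
# Unlike LayerNorm, a uniform shift of all coordinates does change the output.
print(np.linalg.norm(J @ np.ones(x.size)))
```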

8. Activation Checkpointing (Gradient Checkpointing)

Since the O(n²) attention matrices dominate activation memory, Transformers often use activation checkpointing: store only block inputs and recompute intermediate activations during the backward pass.

Forward: store only x₀, x₁, ..., x_L (block inputs, O(L·n·d) memory)
Backward: recompute each block's forward pass to get intermediate activations, then backprop through the block.

This trades compute (2× forward passes) for memory (O(n²) attention matrices not stored).
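The store-inputs-then-recompute pattern can be sketched with a toy residual block. Everything here is a simplifying assumption: the block is x + tanh(Wx) rather than a real Transformer block, the backward pass is written by hand, and the function names are hypothetical. In the toy the discarded intermediate is as small as x; in a real block it would include the n×n attention matrices.

```python
import numpy as np

def block_forward(x, W):
    # Toy residual block x + tanh(W x); returns the output and the
    # intermediate activation that the backward pass needs.
    h = np.tanh(W @ x)
    return x + h, h

def block_backward(grad_out, h, W):
    # Backprop through x + tanh(W x): identity path plus the path through tanh.
    return grad_out + W.T @ (grad_out * (1.0 - h ** 2))

def backward_stored(x0, Ws, grad_out):
    # Baseline: keep every intermediate h from the forward pass.
    x, hs = x0, []
    for W in Ws:
        x, h = block_forward(x, W)
        hs.append(h)
    g = grad_out
    for W, h in zip(reversed(Ws), reversed(hs)):
        g = block_backward(g, h, W)
    return g

def backward_checkpointed(x0, Ws, grad_out):
    # Checkpointing: keep only block inputs and recompute h during backward.
    inputs, x = [], x0
    for W in Ws:
        inputs.append(x)
        x, _ = block_forward(x, W)       # intermediate h is discarded here
    g = grad_out
    for W, x_in in zip(reversed(Ws), reversed(inputs)):
        _, h = block_forward(x_in, W)    # second forward pass for this block
        g = block_backward(g, h, W)
    return g

rng = np.random.default_rng(0)
d, L = 5, 6
Ws = [rng.normal(0, 0.5, (d, d)) for _ in range(L)]
x0, grad_out = rng.normal(size=d), rng.normal(size=d)

g_stored = backward_stored(x0, Ws, grad_out)
g_ckpt = backward_checkpointed(x0, Ws, grad_out)
print(np.allclose(g_stored, g_ckpt))
```

Both routines return the same input gradient; the checkpointed one runs each block's forward pass twice in exchange for not holding the intermediates between the forward and backward passes.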

9. Full Block Parameter Update

For one training step, the parameter gradients for a pre-norm block are:

Attention parameters:

∂L/∂W_i^{Q,K,V} = accumulated via attention backward (17-08)
∂L/∂W^O = accumulated via output projection backward

FFN parameters:

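With h_pre = W₁ · x₃ + b₁ and h = σ(h_pre) at each position (column-vector convention, summed over positions):
∂L/∂W₂ = (∂L/∂x₄) · hᵀ
∂L/∂h = W₂ᵀ · (∂L/∂x₄)
∂L/∂W₁ = ((∂L/∂h) ⊙ σ'(h_pre)) · x₃ᵀ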

RMSNorm parameters:

∂L/∂γ₁ = Σ over positions: ∂L/∂x₂ · (∂x₂/∂γ₁)
∂L/∂γ₂ = Σ over positions: ∂L/∂x₄ · (∂x₄/∂γ₂)
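The FFN parameter gradients can be checked against finite differences. This sketch works on a single position in column-vector convention, omits the biases, uses tanh as a smooth stand-in for σ, and writes h_pre = W₁x₃ and h = σ(h_pre) for the hidden pre- and post-activations; all of these are assumptions made for the illustration, as are the variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 3, 5
W1, W2 = rng.normal(size=(d_ff, d)), rng.normal(size=(d, d_ff))
x3 = rng.normal(size=d)   # normalized input to the FFN (one position)
g = rng.normal(size=d)    # upstream gradient dL/dx4, which equals dL/df

def loss(W1, W2):
    # Scalar loss whose gradient with respect to the FFN output f is exactly g.
    return g @ (W2 @ np.tanh(W1 @ x3))

# Analytic gradients.
h_pre = W1 @ x3
h = np.tanh(h_pre)
dW2 = np.outer(g, h)                       # dL/dW2 = (dL/df) h^T
dh = W2.T @ g                              # dL/dh  = W2^T (dL/df)
dW1 = np.outer(dh * (1.0 - h ** 2), x3)    # dL/dW1 = (dL/dh * σ'(h_pre)) x3^T

def numerical_grad(f, W, eps=1e-6):
    # Central finite differences, one weight at a time.
    G = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        G[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return G

print(np.allclose(dW1, numerical_grad(lambda W: loss(W, W2), W1), atol=1e-6))
print(np.allclose(dW2, numerical_grad(lambda W: loss(W1, W), W2), atol=1e-6))
```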

Worked Examples

Example 1: Pre-Norm Forward Pass (Single Block, Simplified)

Problem: Input x = [[1,2],[3,4]] (n=2, d=2). RMSNorm with γ=[1,1]. FFN: W₁=[[1,0],[0,1]], W₂=[[1,0],[0,1]], no biases, ReLU activation. Skip attention (a=0). Compute x₄.

Solution:

RMSNorm(x):
Row 0: rms = √((1+4)/2) = √2.5 ≈ 1.581. x̂₀ = [1/1.581, 2/1.581] = [0.632, 1.265]
Row 1: rms = √((9+16)/2) = √12.5 ≈ 3.536. x̂₁ = [3/3.536, 4/3.536] = [0.848, 1.131]

With γ=[1,1]: norm output x₃ = [[0.632, 1.265],[0.848, 1.131]]

FFN (since W₁=W₂=I): f = ReLU(x₃) = [[0.632, 1.265],[0.848, 1.131]]

x₄ = x₂ + f = x + f (a=0) = [[1+0.632, 2+1.265],[3+0.848, 4+1.131]] = [[1.632, 3.265],[3.848, 5.131]]
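The hand computation can be reproduced in NumPy. The sketch below follows the problem statement (identity weights, no biases, ReLU, attention skipped) and agrees with the numbers above to about three decimal places; small differences in the last digit come from rounding the rms before dividing in the hand computation.

```python
import numpy as np

def rms_norm(x, gamma):
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True))
    return gamma * x / rms

x = np.array([[1.0, 2.0], [3.0, 4.0]])
gamma = np.ones(2)
W1 = W2 = np.eye(2)

x2 = x                                 # attention skipped: a = 0
x3 = rms_norm(x2, gamma)
f = np.maximum(x3 @ W1, 0.0) @ W2      # ReLU FFN without biases
x4 = x2 + f
print(np.round(x4, 3))
```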

Example 2: Gradient Identity Path Verification

Problem: For a pre-norm block: x₂ = x + F(Norm(x)). Show that if ∂F/∂x = 0 (F doesn't depend on x), the gradient ∂x₂/∂x = I.

Solution: ∂x₂/∂x = I + ∂F(Norm(x))/∂x = I + ∂F/∂(Norm(x)) · ∂Norm/∂x = I + 0 = I

The identity gradient is always present, regardless of what F does. If F learned to be the zero function, gradients still flow perfectly through the block.

Example 3: Post-Norm Gradient Degradation

Problem: A post-norm block: x₂ = LN(x + F(x)). Assume LayerNorm acts as L2 normalization (simplified: divides by ||·||). If ||x + F(x)|| grows by factor 2 each block, how does the gradient scale after 10 blocks?

Solution: After 1 block: ∂x₁/∂x₀ ≈ (1/||x₀+F(x₀)||)·(I + J_F)

If norm grows 2× per block: after block ℓ, ||x_ℓ|| ≈ 2^ℓ·||x₀||.

The LN derivative scales as 1/||x|| ≈ 2^{-ℓ} (taking ||x₀|| = 1). Treating the (1 + ||J_F||) factors as roughly 1:

||∂x_10/∂x₀|| ≈ ∏_{ℓ=1}^{10} 2^{-ℓ} = 2^{-(1+2+…+10)} = 2^{-55} ≈ 2.8×10^{-17}

Catastrophic gradient vanishing! Even with residual connections, the post-norm placement destroys gradients in deep networks.

The √d_k scaling doesn't fix this — it's a separate normalization problem specific to post-norm placement.
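The arithmetic in this example is quick to confirm in Python, under the example's own assumption that block ℓ scales the gradient by 2^{-ℓ} and that the (1 + ||J||) factors are close to 1.

```python
import math

# Per-block attenuation under the example's assumption: block l scales by 2^-l.
factors = [2.0 ** -l for l in range(1, 11)]
total = math.prod(factors)

print(total)                 # about 2.8e-17
print(total == 2.0 ** -55)   # the exponents 1 + 2 + ... + 10 sum to 55
```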

Quiz

Q1: What is the key mathematical advantage of pre-norm over post-norm in Transformer blocks?

A) Pre-norm is faster to compute
B) Pre-norm places normalization BEFORE the sub-layer, so the residual skip's gradient is I, never attenuated by normalization
C) Pre-norm uses fewer parameters
D) Pre-norm eliminates the need for residual connections

Answer & Explanation

**B** — Pre-norm: ∂(x + Sublayer(Norm(x)))/∂x = I + term. The I is never multiplied by Norm's derivative. Post-norm passes the residual through Norm, whose derivative < 1 causes catastrophic gradient decay.

Q2: How does RMSNorm differ from LayerNorm?

A) RMSNorm uses batch statistics
B) RMSNorm removes mean subtraction and only normalizes by the root mean square
C) RMSNorm adds an extra residual connection
D) RMSNorm uses a different activation function

Answer & Explanation

**B** — RMSNorm(x) = γ · x/rms(x) where rms = √(mean(x²)). LayerNorm additionally subtracts the mean: LN(x) = γ · (x − μ)/σ + β. Modern LLMs (Llama, Mistral, Gemma) use RMSNorm.

Q3: In the backward pass of a pre-norm block, how does gradient flow from output x₄ to input x?

A) Only through the FFN path
B) Through I (identity, from both residuals) + paths through attention and FFN
C) Only through attention
D) It does not flow at all

Answer & Explanation

**B** — x₂ = x + Attn(Norm(x)) gives ∂x₂/∂x = I + ∂Attn/∂x. x₄ = x₂ + FFN(Norm(x₂)) gives ∂x₄/∂x₂ = I + ∂FFN/∂x₂. The I terms guarantee baseline gradient flow.

Q4: What is the purpose of activation checkpointing?

A) To make training faster
B) To reduce memory by recomputing activations during backward instead of storing them
C) To check for NaN gradients
D) To reduce the number of parameters

Answer & Explanation

**B** — Attention matrices are O(n²) per layer, dominating memory. Checkpointing stores only block inputs and recomputes attention during backward, trading 2× forward compute for much lower peak memory.

Q5: Why can post-norm not scale to 100+ layers while pre-norm can?

A) Post-norm uses more parameters
B) In post-norm, the LayerNorm Jacobian multiplies the gradient at every block: ||∂x_L/∂x₀|| ≤ (||∂LN|| · (1 + ||J_F||))^L, which can decay exponentially when ||∂LN|| is small. Pre-norm's I term is never multiplied by a normalization Jacobian.
C) Post-norm requires more FLOPs
D) Post-norm uses a different activation function

Answer & Explanation

**B** — Post-norm: x_{ℓ+1} = LN(x_ℓ + F(x_ℓ)). ∂LN/∂(·) has eigenvalues bounded by γ/σ. After L blocks, gradients can decay exponentially. Pre-norm's identity term is untouched by normalization, enabling GPT-3's 96 layers.

Practice Problems

Problem 1

Write the forward equations for a pre-norm Transformer block. Identify where the identity gradient path enters.

Answer

x₁ = Norm(x); a = Attn(x₁); x₂ = x + a; x₃ = Norm(x₂); f = FFN(x₃); x₄ = x₂ + f. Identity paths: x → x₂ (first residual) and x₂ → x₄ (second residual). Both contribute I to the Jacobian.

Problem 2

Why does the identity term I in the pre-norm Jacobian not get attenuated by normalization?

Answer

In pre-norm: x_{ℓ+1} = x_ℓ + F(Norm(x_ℓ)). The derivative is I + ∂F(Norm(x_ℓ))/∂x_ℓ. I comes from the skip connection x_ℓ, which is NOT passed through Norm. In post-norm: x_{ℓ+1} = Norm(x_ℓ + F(x_ℓ)), the entire sum goes through Norm.

Problem 3

How does RMSNorm differ from LayerNorm mathematically?

Answer

LayerNorm: x̂ = (x − μ)/σ, using both first (μ) and second (σ) moments. RMSNorm: x̂ = x/rms, using only the root mean square. Fewer operations, similar performance in Transformers.

Problem 4

Compute the gradient of RMSNorm(x) for x ∈ ℝ² with γ = 1.

Answer

rms = √((x₁² + x₂²)/2). RMSNorm(x)_i = x_i/rms. ∂RMSNorm_i/∂x_j = (1/rms)(δ_ij − x̂_i·x̂_j/2) where x̂_i = x_i/rms. In matrix form: (1/rms)(I − x̂x̂ᵀ/2).

Problem 5

Why does activation checkpointing reduce memory at the cost of compute?

Answer

Without checkpointing: all intermediate activations (including O(n²) attention matrices from every layer) are stored. Checkpointing stores only block inputs (O(n·d)). During backward, each block's forward is recomputed to obtain activations for its backward pass. This doubles forward compute but reduces memory from O(L·n²) to O(L·n·d + n²).

Summary

Modern Transformers use pre-norm: Norm → Sublayer → Add, giving the residual stream an un-normalized identity gradient path
Post-norm (original Transformer: Add → Norm) causes normalization to gate even the residual gradient, preventing scaling to deep networks
The identity Jacobian term I in pre-norm gives every block a direct gradient path that normalization never rescales, so gradients cannot vanish completely through the residual stream
RMSNorm (normalize by RMS only, no mean subtraction) replaces LayerNorm in modern LLMs for efficiency with equal performance
Activation checkpointing trades compute for memory by storing only block inputs and recomputing intermediates during backward pass

Pitfalls

Using post-norm for deep Transformers (50+ layers). Post-norm passes the entire residual stream through LayerNorm, whose Jacobian has eigenvalues bounded by γ/σ. After L blocks, gradient magnitude can decay like (||∂LN||)^L, which makes post-norm hard to train beyond ~12-24 layers. If you're building a deep Transformer from scratch, use pre-norm: its identity gradient path is never rescaled by a normalization Jacobian, regardless of depth.
Treating RMSNorm as "LayerNorm without mean subtraction" and adding a bias term. RMSNorm = γ·x/rms(x), with no β (bias) parameter. LayerNorm = γ·(x−μ)/σ + β includes both γ and β. If you follow a LayerNorm implementation pattern and add a learnable bias to RMSNorm, you've created a non-standard normalization that may behave differently than what the literature reports.
Thinking activation checkpointing is "free." Checkpointing reduces peak memory from O(L·n²) to O(L·n·d + n²) by storing only block inputs, but it requires recomputing every block's forward pass during the backward pass — effectively doubling the forward compute cost. For large models, this trade is almost always worth it, but it's not free and must be factored into throughput calculations.
Forgetting that the identity gradient path only exists in pre-norm. In post-norm (x_{ℓ+1} = LN(x_ℓ + F(x_ℓ))), the derivative is ∂LN/∂(·) · (I + ∂F/∂x_ℓ) — even the identity term is gated through LN's derivative. In pre-norm (x_{ℓ+1} = x_ℓ + F(LN(x_ℓ))), the derivative is I + ∂F(LN(x_ℓ))/∂x_ℓ — the I is untouched. This is the fundamental mathematical reason pre-norm scales to arbitrary depth.
Not accounting for dying neurons in the FFN with ReLU activation. If the FFN uses ReLU and a large fraction of neurons have negative pre-activations (zero output with zero gradient), those pathways contribute nothing to learning. The 4× expansion ratio (d_ff = 4·d_model) partially compensates, but modern LLMs prefer GELU or SiLU activations precisely because they provide non-zero gradients for negative inputs, preventing dead neurons.

Next Steps

Continue to Phase 18 with 18-01 — Tokenization Mathematics to learn how text is converted into the numerical tokens that feed into the Transformer.
