16-10 — Other Normalization Methods

Phase: 16 (Neural Network Mathematics)
Subject: 16-10
Prerequisites: 16-09 (Batch Normalization), 16-02 (Activation Functions), Phase 6 (partial derivatives), Phase 11 (expectation and variance)
Next subject: 17-01 (Convolutional Neural Networks)


Learning Objectives

By the end of this subject, you will be able to:

  1. Explain why Layer Normalization works for sequence models and transformers where Batch Normalization fails
  2. Derive the LayerNorm forward and backward passes for a single vector
  3. Distinguish Instance Normalization, Group Normalization, and their relationships to LayerNorm and BatchNorm
  4. Explain RMS Normalization mathematically and why modern LLMs use it instead of LayerNorm
  5. Choose the appropriate normalization method for a given architecture and task

Core Content

1. Beyond Batch Normalization

Batch Normalization (16-09) normalizes across the batch dimension. This works well for computer vision where batch sizes are large (64+) and batch statistics are stable. But two critical use cases break BatchNorm:

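Sequence models (RNNs, Transformers): The batch dimension can be 1 during autoregressive decoding. Statistics computed from a single example are degenerate: μ_B equals the example itself and σ²_B = 0, so (x − μ_B)/√(σ²_B + ε) is 0 for every input and the layer outputs only β. Running averages collected during training avoid the degenerate statistics at inference, but they still pool unrelated sequences and positions.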

Small batch sizes: When memory constraints force batch sizes of 2-4, batch statistics become too noisy. The variance estimates fluctuate wildly, destabilizing training.
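
The degenerate batch-of-one case can be checked numerically. The following is a minimal NumPy sketch, not a full BatchNorm layer: the helper name `batch_stat_normalize` is hypothetical, and it applies only the normalization step (no γ, β, or running averages).

```python
import numpy as np

def batch_stat_normalize(x, eps=1e-5):
    """Normalize each feature with statistics taken over the batch axis (axis 0)."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

single = np.array([[2.0, 4.0, 6.0, 8.0]])   # a batch of one example, four features
out = batch_stat_normalize(single)
print(out)  # every entry is 0: x - mu vanishes before the division happens
```

With ε in the denominator there is no literal division by zero; the failure is that the input is erased.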

⚠️ THIS IS CRITICAL: Batch Normalization, Layer Normalization, Instance Normalization, and Group Normalization are all the SAME mathematical operation (normalize, then affine-transform). The ONLY difference is which axes/dimensions the mean and variance are computed over. RMS Normalization fits the same template with the mean fixed at 0 and no shift β. Understanding this unification is more important than memorizing each variant separately.

2. Layer Normalization (LN)

LayerNorm normalizes across the feature dimension for each sample independently:

For a vector x ∈ ℝ^d (a single example's features):

μ = (1/d) Σ_{j=1}^d x_j
σ² = (1/d) Σ_{j=1}^d (x_j − μ)²
y_i = γ_i · (x_i − μ) / √(σ² + ε) + β_i

Where γ, β ∈ ℝ^d are learnable scale and shift parameters (initialized to 1 and 0), and ε ≈ 10⁻⁵ prevents division by zero.

For a sequence model with shape (batch, seq_len, d_model), LayerNorm operates on the last dimension d_model. Each token at each batch position gets its own normalization statistics, computed purely from its own features — no cross-token or cross-batch averaging.
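
The forward pass above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not a library implementation: the function name `layer_norm` and the random test tensor are hypothetical, and the variance is the biased one (divides by d) to match the formula for σ².

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last axis; x has shape (..., d)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)      # biased variance, divides by d
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8))               # (batch, seq_len, d_model)
y = layer_norm(x, np.ones(8), np.zeros(8))
print(y.shape)                               # (2, 3, 8)
print(np.abs(y.mean(axis=-1)).max())         # close to 0 for every token
```

Each of the 2 × 3 tokens is normalized with its own statistics, so changing one token leaves the outputs of all others untouched.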

Why LayerNorm for Transformers?

  - The batch dimension can be 1 during autoregressive decoding; with batch statistics, σ² = 0 and the normalized output collapses to 0
  - Averaging across the batch mixes unrelated sequences, destroying semantic information
  - RNNs process sequences one time step at a time; per-time-step BatchNorm statistics are unstable

3. Backpropagation Through LayerNorm

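The gradient computation follows the same structure as BatchNorm (16-09) but operates over the feature dimension instead of the batch dimension. Write x̂_i = (x_i − μ)/√(σ² + ε). Given ∂L/∂y_i for a single vector:

∂L/∂γ_i = ∂L/∂y_i · x̂_i
∂L/∂β_i = ∂L/∂y_i

Because γ and β are shared by every vector the layer normalizes, these per-vector terms are summed over all tokens and batch positions, not over the feature index i.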

The gradient ∂L/∂x_i is derived identically to the BN case (see 16-09), with sums over the feature dimension rather than the batch dimension.
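
For a single vector, applying the chain rule through μ and σ² gives one closed form: ∂L/∂x_i = (1/√(σ² + ε)) · (g_i − mean(g) − x̂_i · mean(g ⊙ x̂)), where g_i = γ_i · ∂L/∂y_i. The sketch below implements that form and compares it with central finite differences. The function names and the test loss L = Σ_i w_i · y_i are assumptions chosen for illustration only.

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(), x.var()
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std
    return gamma * x_hat + beta, (x_hat, inv_std)

def layer_norm_backward(dy, gamma, cache):
    x_hat, inv_std = cache
    dgamma = dy * x_hat      # per feature; sum over tokens when there are several
    dbeta = dy
    g = dy * gamma           # gradient with respect to x_hat
    dx = inv_std * (g - g.mean() - x_hat * (g * x_hat).mean())
    return dx, dgamma, dbeta

rng = np.random.default_rng(0)
x, gamma, beta, w = (rng.normal(size=5) for _ in range(4))
y, cache = layer_norm_forward(x, gamma, beta)
dx, dgamma, dbeta = layer_norm_backward(w, gamma, cache)   # dL/dy = w

# Central finite differences on L = w . y as an independent check
h = 1e-6
dx_num = np.zeros_like(x)
for j in range(x.size):
    e = np.zeros_like(x)
    e[j] = h
    dx_num[j] = (w @ layer_norm_forward(x + e, gamma, beta)[0]
                 - w @ layer_norm_forward(x - e, gamma, beta)[0]) / (2 * h)
print(np.max(np.abs(dx - dx_num)))   # small: the two gradients agree
```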

4. Instance Normalization (IN)

Instance Norm is LayerNorm applied independently to each channel of each sample:

For each sample i and channel c:

μ_{i,c} = mean of channel c values (over spatial dimensions)
σ²_{i,c} = variance of channel c values (over spatial dimensions)
y_{i,c} = γ_c · (x_{i,c} − μ_{i,c}) / √(σ²_{i,c} + ε) + β_c

Origins: Proposed by Ulyanov et al. (2016) for style transfer. The key insight: contrast and color style are encoded in per-channel statistics. Normalizing each channel removes style while preserving content structure.

Practical effect: In vision, IN treats each channel as its own "style" dimension. After normalization, all channels have the same mean/variance regardless of input style, enabling content-style disentanglement.
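
The per-sample, per-channel statistics can be sketched in NumPy as follows. The function name `instance_norm` and the (N, C, H, W) layout are assumptions of this sketch, and the input is random data used only to show that differing input means and scales are removed.

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); statistics over (H, W) for every (sample, channel) pair."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 3, 4, 4))
y = instance_norm(x, np.ones(3), np.zeros(3))
print(y.mean(axis=(2, 3)))   # approximately 0 for each of the 2 x 3 feature maps
```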

5. Group Normalization (GN)

GroupNorm generalizes both LayerNorm and InstanceNorm by splitting channels into G groups and normalizing within each group:

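For group g of sample i (containing C/G channels):

μ_{i,g} = mean over the channels in group g and all spatial positions
σ²_{i,g} = variance over the channels in group g and all spatial positions
y_{i,c} = γ_c · (x_{i,c} − μ_{i,g}) / √(σ²_{i,g} + ε) + β_c for every channel c in group g

The statistics are shared within a group, while the scale γ_c and shift β_c are learned per channel, as in BatchNorm.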

Unification property:

  - When G = 1: GN = LayerNorm (all channels in one group)
  - When G = C (number of channels): GN = InstanceNorm (each channel its own group)
  - GN is independent of batch size, making it ideal when memory constraints force small batches.
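
The two limiting cases can be checked numerically. The sketch below is a hedged illustration: `group_norm` is a hypothetical helper that performs only the normalization (γ = 1, β = 0), assumes an (N, C, H, W) layout, and takes the G = 1 limit to mean statistics over (C, H, W).

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """x: (N, C, H, W); normalization only (gamma = 1, beta = 0)."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "C must be divisible by G"
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 4, 4))
eps = 1e-5

# G = 1 limit: statistics over (C, H, W) for each sample
ln = (x - x.mean(axis=(1, 2, 3), keepdims=True)) / np.sqrt(
    x.var(axis=(1, 2, 3), keepdims=True) + eps)
# G = C limit: statistics over (H, W) for each channel of each sample
inn = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(
    x.var(axis=(2, 3), keepdims=True) + eps)

same_as_ln = np.allclose(group_norm(x, 1), ln)
same_as_in = np.allclose(group_norm(x, 6), inn)
print(same_as_ln, same_as_in)   # True True
```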

6. RMS Normalization (RMSNorm)

RMSNorm drops the mean-centering step entirely and only rescales by the root-mean-square:

rms(x) = √((1/d) Σ_{j=1}^d x_j²)
y_i = γ_i · x_i / rms(x)

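No β parameter, no mean subtraction. This is simpler and, in reported experiments, performs comparably to LayerNorm for Transformers. In practice a small ε is usually added under the square root for numerical stability.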
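
A minimal NumPy sketch of the two formulas follows. The name `rms_norm` is hypothetical, and the `eps` argument is an assumption of the sketch (it defaults to 0 so the code matches the formula exactly as written).

```python
import numpy as np

def rms_norm(x, gamma, eps=0.0):
    """Rescale by the root-mean-square over the last axis; no centering, no beta."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([3.0, 4.0])
y = rms_norm(x, np.ones(2))       # rms(x) = sqrt((9 + 16) / 2) = sqrt(12.5)
print(y)
print(np.sqrt(np.mean(y ** 2)))   # the output has RMS 1, up to rounding
```

Note that the output mean is not forced to zero: only the overall scale is fixed.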

Why does removing the mean help?

  1. In attention and feed-forward layers, activations are often close to zero-mean already, so centering changes little (an empirical tendency, not a guarantee)
  2. Skipping the mean subtraction removes one reduction and one source of numerical noise
  3. Fewer operations per normalization step, which is meaningful at scale (billions of parameters, trillions of tokens)
  4. Zhang & Sennrich (2019) reported that RMSNorm achieves quality comparable to LayerNorm, including on machine translation, with lower running time

⚠️ THIS IS CRITICAL: Many current LLMs (Llama, Mistral, Gemma, Qwen) use RMSNorm in place of LayerNorm. Understanding it is essential for working with current-generation language models.

7. The Unified View of Normalization

All normalization methods can be expressed in a common framework:

ŷ = γ ⊙ (x − μ) / √(σ² + ε) + β

The only difference is which dimensions μ and σ² are computed over (RMSNorm additionally fixes μ = 0 and drops β, so σ² becomes the mean square):

| Method | Normalization axes | Batch-independent? | Typical use case |
|---|---|---|---|
| BatchNorm | (N, H, W) per channel | No | CNNs (large batches) |
| LayerNorm | (C) per example | Yes | Transformers, RNNs |
| InstanceNorm | (H, W) per channel per example | Yes | Style transfer |
| GroupNorm | (C/G, H, W) per group per example | Yes | CNNs (small batches) |
| RMSNorm | (C) per example (no mean) | Yes | Modern LLMs |
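
The axis-only view can be expressed as one function with an `axes` argument. This is a sketch under stated assumptions: the helper `normalize` is hypothetical, it omits γ and β, the tensor is a feature map of shape (N, C, H, W), and the LayerNorm and RMSNorm rows are taken over (C, H, W) for that layout. GroupNorm is left out because it additionally needs a reshape of the channel axis.

```python
import numpy as np

def normalize(x, axes, center=True, eps=1e-5):
    """Statistics over `axes`; center=False gives the RMS-style variant."""
    mu = x.mean(axis=axes, keepdims=True) if center else 0.0
    second_moment = ((x - mu) ** 2).mean(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(second_moment + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=(8, 6, 4, 4))              # (N, C, H, W)

batch_out    = normalize(x, axes=(0, 2, 3))             # per channel
layer_out    = normalize(x, axes=(1, 2, 3))             # per example
instance_out = normalize(x, axes=(2, 3))                # per channel per example
rms_out      = normalize(x, axes=(1, 2, 3), center=False)

print(np.abs(batch_out.mean(axis=(0, 2, 3))).max())     # close to 0
print((rms_out ** 2).mean(axis=(1, 2, 3)))              # close to 1 per example
```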


Key Terms

Worked Examples

Example 1: LayerNorm Forward Pass

Problem: Given x = [2, 4, 6, 8], ε = 10⁻⁵, γ = [1, 1, 1, 1], β = [0, 0, 0, 0], compute the LayerNorm output.

Solution:

μ = (2 + 4 + 6 + 8) / 4 = 5
σ² = ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4 = (9 + 1 + 1 + 9) / 4 = 5

Since ε is negligible next to σ² = 5, √(σ² + ε) ≈ √5 ≈ 2.236:

ŷ = (x − 5) / √5 = [−3/2.236, −1/2.236, 1/2.236, 3/2.236] ≈ [−1.3416, −0.4472, 0.4472, 1.3416]

Verify: mean ≈ 0, variance ≈ 1 ✓
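
The arithmetic in this example can be reproduced with a short NumPy check (a sketch with γ = 1 and β = 0, so the affine step is omitted):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
mu, var = x.mean(), x.var()            # 5.0 and 5.0
y = (x - mu) / np.sqrt(var + 1e-5)
print(np.round(y, 4))                  # [-1.3416 -0.4472  0.4472  1.3416]
print(y.mean(), y.var())               # about 0 and about 1
```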

Example 2: GroupNorm with 2 Groups

Problem: Input x = [1, 2, 3, 4, 5, 6] with 6 channels, G = 2 groups. Channels split as [1,2,3] and [4,5,6]. Compute the normalized values (γ = 1, β = 0, ε = 0).

Solution:

Group 1 (channels 1-3): μ₁ = (1+2+3)/3 = 2, σ²₁ = ((−1)²+0²+1²)/3 = 2/3
Group 2 (channels 4-6): μ₂ = (4+5+6)/3 = 5, σ²₂ = ((−1)²+0²+1²)/3 = 2/3

Normalized:
Group 1: [−1/√(2/3), 0, 1/√(2/3)] ≈ [−1.225, 0, 1.225]
Group 2: [−1/√(2/3), 0, 1/√(2/3)] ≈ [−1.225, 0, 1.225]

Each group is independently zero-mean, unit-variance.
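
The same computation can be written as a reshape followed by per-row statistics. This is a small sketch for the one-dimensional input of this example only (no spatial axes, γ = 1, β = 0, ε = 0):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
groups = x.reshape(2, 3)                          # G = 2 groups of 3 channels
mu = groups.mean(axis=1, keepdims=True)           # [[2.], [5.]]
var = groups.var(axis=1, keepdims=True)           # 2/3 for both groups
y = ((groups - mu) / np.sqrt(var)).reshape(-1)    # eps = 0, as in the example
print(np.round(y, 3))
```

Both groups produce the same normalized values even though their raw means differ by 3, which is exactly what removing per-group statistics means.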

Example 3: RMSNorm vs LayerNorm — Same Input

Problem: Compute both LayerNorm and RMSNorm for x = [2, 4, 6] (γ = 1, β = 0, ε = 0). Compare the results numerically and conceptually.

Solution:

LayerNorm:
μ = (2+4+6)/3 = 4
σ² = ((−2)²+0²+2²)/3 = 8/3 ≈ 2.667, so σ ≈ 1.633
ŷ_LN = [−2/1.633, 0/1.633, 2/1.633] ≈ [−1.225, 0, 1.225]

RMSNorm:
rms = √((4+16+36)/3) = √(56/3) = √18.667 ≈ 4.320
ŷ_RMS = [2/4.320, 4/4.320, 6/4.320] ≈ [0.463, 0.926, 1.389]

Comparison: LayerNorm centers the data first (mean becomes 0), so values are symmetric around zero. RMSNorm does not center: the output keeps the signs and relative proportions of the input (here 1 : 2 : 3), and its mean stays nonzero (≈ 0.926). Keeping this offset may be beneficial in Transformers, where the residual stream accumulates information.
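
A short NumPy check of both results, with γ = 1, β = 0, ε = 0 as in the problem statement (variable names are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0])
y_ln = (x - x.mean()) / np.sqrt(x.var())     # center, then scale by sigma
y_rms = x / np.sqrt(np.mean(x ** 2))         # scale by rms only
print(np.round(y_ln, 3), np.round(y_rms, 3))
print(y_ln.mean(), y_rms.mean())             # 0 for LayerNorm, nonzero for RMSNorm
```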

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: Compute LayerNorm for x = [3, 6, 9] with γ = [2, 2, 2], β = [1, 1, 1]. Verify the mean and variance of the normalized output (before γ, β).

Problem 2: Explain mathematically why BatchNorm fails when its statistics are computed from a batch of size 1.

Problem 3: When is GroupNorm strictly better than InstanceNorm? When is it strictly better than LayerNorm?

Problem 4: Compute RMSNorm for x = [1, 0, 3] with γ = 1. Compare the result to what LayerNorm would produce.

Problem 5: In a transformer with input shape (32, 128, 768), where batch=32, seq_len=128, d_model=768:

  - What dimensions does LayerNorm operate over?
  - What dimensions would BatchNorm operate over?
  - Explain why BatchNorm would be problematic here.

Problem 6: Show that RMSNorm has gradient ∂y_i/∂x_j = (γ_i/rms) · (δ_{ij} − x̂_i·x̂_j/d) where x̂_i = x_i/rms. Compare this to the LayerNorm gradient structure.

Answers

Problem 1:
μ = 6, σ² = ((−3)²+0²+3²)/3 = 18/3 = 6
z = [−3/√6, 0, 3/√6] ≈ [−1.225, 0, 1.225]
y = 2·z + 1 ≈ [−1.449, 1, 3.449]
Before γ, β: mean(z) = 0, var(z) = 1 ✓

Problem 2: With batch size 1, μ_B = x and σ²_B = 0 (a single value has no variance). The normalized value is (x − μ_B)/√(0 + ε) = 0/√ε = 0 for every input, so the layer outputs β regardless of the data. Running averages of μ and σ² collected during training avoid the degenerate statistics at inference, but the batch statistics themselves are unusable. This is a fundamental limitation: normalization requires a population.

Problem 3:
GN > IN: When you want to normalize in larger semantic groups than individual channels, preserving intra-group statistics. GN with G = 32 splits the C channels into 32 groups of C/32 channels each, keeping within-group structure that IN would erase.
GN > LN: When operating on images where LayerNorm (normalizing all channels together) removes too much channel-specific information. GN with intermediate G preserves some inter-channel variation.

Problem 4:
RMSNorm: rms = √((1+0+9)/3) = √(10/3) ≈ 1.826. y ≈ [0.548, 0, 1.643]
LayerNorm: μ = 4/3, σ² = ((−1/3)²+(−4/3)²+(5/3)²)/3 = (1+16+25)/27 = 42/27 ≈ 1.556. y ≈ [−0.267, −1.069, 1.336]
RMSNorm preserves the sign structure and relative magnitudes more directly.

Problem 5:
LayerNorm: normalizes over d_model=768 (last dimension). Per-token stats: 32×128 = 4096 independent normalizations.
BatchNorm: normalizes each feature over the batch dimension (32), averaging across all sequences at each token position. This mixes unrelated sentences and destroys the autoregressive property.
LayerNorm is correct for NLP because each token should be normalized independently of other tokens and other sequences.

Problem 6:
For RMSNorm: y_i = γ_i·x_i/rms where rms = √(Σx_j²/d), so ∂rms/∂x_j = x_j/(d·rms).
∂y_i/∂x_j = γ_i·(δ_{ij}·rms − x_i·(x_j/(d·rms)))/rms² = (γ_i/rms)·(δ_{ij} − x̂_i·x̂_j/d).
For LayerNorm: ∂LN_i/∂x_j = (γ_i/σ)·(δ_{ij} − 1/d − x̂_i·x̂_j/d).
The RMSNorm gradient lacks the −1/d term from mean subtraction, so it is structurally simpler, which may contribute to its training stability.
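
The RMSNorm Jacobian from Problem 6 can be checked against central finite differences. The sketch below uses hypothetical helper names and random test vectors, with ε = 0 as in the derivation.

```python
import numpy as np

def rms_norm(x, gamma):
    return gamma * x / np.sqrt(np.mean(x ** 2))

def rms_norm_jacobian(x, gamma):
    d = x.size
    rms = np.sqrt(np.mean(x ** 2))
    x_hat = x / rms
    # J[i, j] = (gamma_i / rms) * (delta_ij - x_hat_i * x_hat_j / d)
    return (gamma[:, None] / rms) * (np.eye(d) - np.outer(x_hat, x_hat) / d)

rng = np.random.default_rng(0)
x, gamma = rng.normal(size=4), rng.normal(size=4)

# Column j of the numerical Jacobian is d y / d x_j
h = 1e-6
J_num = np.stack(
    [(rms_norm(x + h * e, gamma) - rms_norm(x - h * e, gamma)) / (2 * h)
     for e in np.eye(4)],
    axis=1,
)
J = rms_norm_jacobian(x, gamma)
print(np.max(np.abs(J - J_num)))   # small: analytic and numerical Jacobians agree
```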

Summary

  1. LayerNorm normalizes per-sample across features, making it batch-size independent — the standard for Transformers and RNNs
  2. InstanceNorm is per-sample per-channel normalization, originally developed for style transfer to remove instance-specific contrast
  3. GroupNorm generalizes LN and IN by splitting channels into groups — with G=1 it equals LN, with G=C it equals IN; ideal for small-batch CNN training
  4. RMSNorm drops mean-centering, normalizing only by RMS. Fewer operations, empirically comparable to LN, and used by many modern LLMs (Llama, Mistral, Gemma)
  5. All normalization methods are the same mathematical operation applied over different axes (RMSNorm with the mean fixed at 0); the choice of axes is the architectural decision

Quiz

Q1: Why does BatchNorm fail when its statistics are computed from a batch of size 1?

A) Because gradients vanish
B) Because the batch variance is 0, so the normalized output collapses to 0 regardless of the input
C) Because the learning rate must be adjusted
D) Because activations saturate

Answer & Explanation: B. A single sample has variance 0 and equals its own mean. The normalization becomes (x − μ)/√(0 + ε) = (x − x)/√ε = 0. Every input produces the same (zero) output. Running averages of the mean and variance sidestep this at inference, but the batch statistics themselves carry no information: normalization requires a population.

Q2: What does RMSNorm remove compared to LayerNorm?

A) The learnable γ parameter
B) The mean-centering step (and typically the β shift parameter)
C) The variance computation
D) The ε stabilization term

Answer & Explanation: B. RMSNorm computes y = γ · x / rms(x): no mean subtraction, no β shift. LayerNorm computes y = γ · (x − μ)/σ + β. RMSNorm is simpler: fewer operations, and the mean is often already close to zero in Transformer activations, so centering changes little.

Q3: Which normalization method was specifically designed for style transfer?

A) Batch Normalization
B) Layer Normalization
C) Instance Normalization
D) Group Normalization

Answer & Explanation: C. InstanceNorm normalizes each channel independently per sample. Contrast and color style are encoded in per-channel statistics; removing them via IN strips style while preserving spatial content structure. This is the key insight from Ulyanov et al. (2016).

Q4: In a Transformer with input shape (batch=32, seq=128, d_model=768), LayerNorm operates over which dimension?

A) The batch dimension (32)
B) The sequence dimension (128)
C) The feature dimension (768)
D) All dimensions simultaneously

Answer & Explanation: C. LayerNorm computes μ and σ² across the d_model=768 features for each token independently, producing 32×128=4096 separate normalizations. A describes BatchNorm (mixing sequences, problematic for NLP). B would make each token's statistics depend on other positions, including future ones, which would break autoregressive decoding.

Q5: If GroupNorm has G = C (where C is the number of channels), which normalization does it reduce to?

A) BatchNorm
B) LayerNorm
C) InstanceNorm
D) RMSNorm

Answer & Explanation: C. With G = C, each group contains exactly one channel. Per-channel normalization is InstanceNorm. With G = 1 (all channels in one group), GroupNorm = LayerNorm. GroupNorm provides a spectrum of choices between IN and LN, one for each G that divides C.

Pitfalls



Next Steps

Move on to 17-01 — Convolutional Neural Networks to learn about the convolution operation, stride, padding, and how CNNs process images mathematically.