16-07 — Weight Initialization

Phase: 16 — Neural Network Mathematics
Subject: 16-07
Prerequisites: 16-06 (Gradient Flow), Phase 10-11 (variance, random variables)
Next subject: 16-08 — Regularization

Learning Objectives

By the end of this subject, you will be able to:

Explain why naive initialization (all zeros, all identical, very small random, very large random) fails
Derive the Xavier/Glorot initialization from variance preservation requirements
Derive the He/Kaiming initialization and explain why it differs from Xavier for ReLU networks
Compute the variance of activations and gradients at initialization for a given scheme
Understand orthogonal initialization and when it's preferred

Core Content

1. Why Initialization Matters

At the start of training, the network produces random outputs. Backprop computes random gradients. If the initial weights are chosen poorly, either:
- Activations explode or vanish as data flows forward
- Gradients explode or vanish as errors flow backward

Once either happens, training may never recover — the network starts in a regime where no useful gradient signal exists.

⚠️ THIS IS CRITICAL: Weight initialization is not a minor detail. It can determine whether your network trains at all. Poor initialization is a common silent cause of training failure in deep networks.

2. What Naive Initialization Gets Wrong

All zeros: Every neuron computes the same function. All gradients are identical. The network collapses to effectively one neuron per layer — no symmetry breaking, no learning.

All identical (non-zero): Same problem — all neurons in a layer have identical weights and receive identical gradients, so they stay identical forever. The network never develops diverse features.

Very small random weights (e.g., w ∼ N(0, 0.01²)): In a deep network, activations shrink exponentially with depth: Var(a^(L)) ≈ Var(x) · ∏ (n_in · Var(w)), where the product runs over the layers and n_in is each layer's number of inputs. If n_in · Var(w) < 1, forward signals vanish. Backward gradients vanish by the analogous product with n_out in place of n_in.

Very large random weights (e.g., w ∼ N(0, 100²)): Activations explode. For sigmoid/tanh, neurons saturate (outputs stuck near ±1 or 0/1). Gradients are then close to zero, because the derivative of the activation vanishes in the saturated region. For ReLU, activations grow unboundedly, causing numerical overflow.
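The two random-weight failure modes can be observed directly. The following is a minimal sketch, assuming NumPy and a plain stack of square tanh layers; the function name `activation_std_per_layer` and the width, depth, and batch values are illustrative choices, not part of the lesson.

```python
import numpy as np

def activation_std_per_layer(weight_std, depth=10, width=256, batch=512, seed=0):
    """Std of tanh activations after each layer of a random fully-connected net."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))  # unit-variance inputs
    stds = []
    for _ in range(depth):
        # i.i.d. zero-mean Gaussian weights with the requested std
        W = rng.standard_normal((width, width)) * weight_std
        a = np.tanh(a @ W)
        stds.append(float(a.std()))
    return stds

small = activation_std_per_layer(weight_std=0.01)   # 256 * 0.01**2 < 1: signal shrinks
large = activation_std_per_layer(weight_std=100.0)  # tanh is pushed into saturation

print("small weights:", [f"{s:.1e}" for s in small])
print("large weights:", [f"{s:.3f}" for s in large])
```

With small weights the printed values should fall by a roughly constant factor per layer; with large weights they should sit near 1 because almost every unit outputs approximately ±1.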

3. The Variance Analysis Framework

Consider a linear layer: z = Wx, where W ∈ ℝ^{d_out × d_in} and entries of W and x are i.i.d. with zero mean.

The variance of one output unit:

Var(zᵢ) = Var(Σⱼ W_{ij} xⱼ)

Assuming W_{ij} and xⱼ are independent with zero mean:

Var(zᵢ) = Σⱼ Var(W_{ij} xⱼ) = Σⱼ Var(W_{ij}) · Var(xⱼ) = d_in · Var(W) · Var(x)

For the variance to be preserved (Var(z) = Var(x)), we need:

d_in · Var(W) = 1 → Var(W) = 1/d_in
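The relation Var(z) = d_in · Var(W) · Var(x) can be checked by sampling. This is a small sketch assuming NumPy and Gaussian entries for both W and x (the derivation only needs zero mean and independence); `output_variance` is a hypothetical helper name.

```python
import numpy as np

d_in, d_out, n_samples = 512, 256, 2000

def output_variance(var_w, rng, var_x=1.0):
    """Sample W and x, then return the empirical variance of z = W x."""
    W = rng.normal(0.0, np.sqrt(var_w), size=(d_out, d_in))
    x = rng.normal(0.0, np.sqrt(var_x), size=(d_in, n_samples))
    return float((W @ x).var())

rng = np.random.default_rng(0)
preserved = output_variance(1.0 / d_in, rng)  # d_in * Var(W) = 1
shrunk = output_variance(0.1 / d_in, rng)     # d_in * Var(W) = 0.1

print(f"Var(W) = 1/d_in   -> Var(z) ~ {preserved:.3f}")
print(f"Var(W) = 0.1/d_in -> Var(z) ~ {shrunk:.3f}")
```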

4. Xavier/Glorot Initialization

Xavier initialization considers BOTH the forward and backward passes. In the forward pass, variance is preserved if Var(W) = 1/d_in. In the backward pass (treating gradients as flowing through Wᵀ), variance is preserved if Var(W) = 1/d_out.

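Xavier takes the harmonic mean of these two targets, 1/d_in and 1/d_out, which is the same as replacing the fan-in by the average of fan-in and fan-out: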

Var(W) = 2/(d_in + d_out)

For a uniform distribution U(−a, a), the variance is a²/3. Setting a²/3 = 2/(d_in + d_out):

a = √(6/(d_in + d_out))

So: W ∼ U(−√(6/(d_in + d_out)), √(6/(d_in + d_out)))

For a normal distribution:

W ∼ N(0, 2/(d_in + d_out))

This is the standard initialization for tanh/sigmoid networks. It balances forward activation variance and backward gradient variance.
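The two Xavier variants translate into a few lines of code. The sketch below assumes NumPy and a weight shape of (d_out, d_in), matching z = Wx; the function names `xavier_uniform` and `xavier_normal` are illustrative and are not tied to any particular library.

```python
import numpy as np

def xavier_uniform(d_in, d_out, rng):
    """W ~ U(-a, a) with a = sqrt(6 / (d_in + d_out))."""
    a = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-a, a, size=(d_out, d_in))

def xavier_normal(d_in, d_out, rng):
    """W ~ N(0, 2 / (d_in + d_out)); the second argument of N is the variance."""
    std = np.sqrt(2.0 / (d_in + d_out))
    return rng.normal(0.0, std, size=(d_out, d_in))

rng = np.random.default_rng(0)
W_u = xavier_uniform(300, 500, rng)
W_n = xavier_normal(300, 500, rng)
target = 2.0 / (300 + 500)

print(f"target variance: {target:.6f}")
print(f"uniform sample:  {W_u.var():.6f}")
print(f"normal sample:   {W_n.var():.6f}")
```

Both samples should land close to the same target variance; only the shape of the distribution differs.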

5. He/Kaiming Initialization

He initialization accounts for ReLU's property that it zeros out half the activations (in expectation, assuming symmetric input distribution).

With ReLU, the forward pass variance analysis changes. For z ∼ N(0, σ²):

E[ReLU(z)²] = E[max(0, z)²]

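By symmetry, only the positive half of the distribution contributes, so the second moment is E[ReLU(z)²] = σ²/2. The variance is smaller, because the output has a non-zero mean:

Var(ReLU(z)) = σ² · (1/2 − 1/(2π)) ≈ 0.341 · σ² (for zero-mean inputs)

The exact calculation: E[ReLU(z)] = σ/√(2π) and E[ReLU(z)²] = σ²/2, so Var(ReLU(z)) = σ²(1/2 − 1/(2π)).

The He analysis uses the second moment rather than the variance. Because the weights have zero mean, Var(W_{ij} aⱼ) = Var(W) · E[aⱼ²] even when aⱼ has a non-zero mean, so the next layer's pre-activation variance is d_in · Var(W) · E[ReLU(z)²] = d_in · Var(W) · Var(z)/2. This is the sense in which ReLU "loses half".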

For variance preservation through a ReLU layer:

d_in · Var(W) · (1/2) = 1 → Var(W) = 2/d_in

This scheme is known as "Kaiming" or "He" initialization (after Kaiming He):

W ∼ N(0, 2/d_in) or W ∼ U(−√(6/d_in), √(6/d_in))

Key difference from Xavier: He uses 2/d_in (ReLU halves the second moment that feeds the next layer), while Xavier uses 2/(d_in + d_out) (balances forward/backward for tanh). He initialization is the standard choice for ReLU-based networks.
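A single-layer check makes the factor of 2 tangible. This is a sketch under stated assumptions: NumPy, weights of shape (d_out, d_in), and an input that stands in for the output of an earlier ReLU layer (generated here as ReLU of N(0, 2) samples, so its second moment is about 1). The names `he_normal` and `he_uniform` are illustrative.

```python
import numpy as np

def he_normal(d_in, d_out, rng):
    """W ~ N(0, 2 / d_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))

def he_uniform(d_in, d_out, rng):
    """W ~ U(-a, a) with a = sqrt(6 / d_in), which has variance 2 / d_in."""
    a = np.sqrt(6.0 / d_in)
    return rng.uniform(-a, a, size=(d_out, d_in))

rng = np.random.default_rng(0)
d_in, d_out, n_samples = 512, 512, 2000

# Stand-in for the previous layer's ReLU output (non-negative, non-zero mean).
a_prev = np.maximum(rng.normal(0.0, np.sqrt(2.0), size=(d_in, n_samples)), 0.0)

z = he_normal(d_in, d_out, rng) @ a_prev
a_next = np.maximum(z, 0.0)

second_in = float((a_prev ** 2).mean())
second_out = float((a_next ** 2).mean())
uniform_var = float(he_uniform(d_in, d_out, rng).var())

print(f"E[a^2] before layer: {second_in:.3f}")
print(f"E[a^2] after layer:  {second_out:.3f}")
```

The second moment going in and coming out should be close to each other, which is the quantity He initialization is designed to hold steady.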

6. Derivation of Var(ReLU(z)) for Gaussian Inputs

Let z ∼ N(0, σ²). The ReLU output is a = max(0, z).

E[a] = ∫₀^∞ z · (1/(σ√(2π))) · exp(−z²/(2σ²)) dz

Let u = z²/(2σ²), du = z/σ² dz: E[a] = σ/√(2π) · ∫₀^∞ e⁻ᵘ du = σ/√(2π)

E[a²] = ∫₀^∞ z² · (1/(σ√(2π))) · exp(−z²/(2σ²)) dz

Since E[z²] = σ² and by symmetry of the normal, half the second moment is from positive z: E[a²] = σ²/2

Therefore: Var(a) = E[a²] − (E[a])² = σ²/2 − σ²/(2π) = σ² · (π − 1)/(2π)

For π ≈ 3.1416: Var(a) ≈ 0.3408·σ².

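The "lose half" figure of 0.5·σ² is the second moment E[a²], not the variance. It is the relevant quantity for initialization: with zero-mean weights, the next layer's pre-activation variance is d_in · Var(W) · E[a²], so setting Var(W) = 2/d_in keeps the pre-activation variance constant from layer to layer.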

Practical He initialization uses 2/d_in — the factor of 2 accounts for ReLU discarding the negative half of the distribution.
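The three closed-form moments derived in this section can be compared against a Monte Carlo estimate. The sketch assumes NumPy; the value σ = 1.5 and the sample size are arbitrary choices for illustration.

```python
import numpy as np

sigma = 1.5
rng = np.random.default_rng(0)
z = rng.normal(0.0, sigma, size=2_000_000)
a = np.maximum(z, 0.0)  # ReLU

# Empirical moments
mean_mc = float(a.mean())
second_mc = float((a ** 2).mean())
var_mc = float(a.var())

# Closed forms from the derivation
mean_th = sigma / np.sqrt(2 * np.pi)
second_th = sigma ** 2 / 2
var_th = sigma ** 2 * (np.pi - 1) / (2 * np.pi)

print(f"E[a]:   sampled {mean_mc:.4f}, formula {mean_th:.4f}")
print(f"E[a^2]: sampled {second_mc:.4f}, formula {second_th:.4f}")
print(f"Var(a): sampled {var_mc:.4f}, formula {var_th:.4f}")
```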

7. Orthogonal Initialization

Instead of random i.i.d. weights, initialize W as a (scaled) orthogonal matrix: W = scale · Q with QᵀQ = I, so WᵀW = scale² · I.

Properties:
- All singular values are exactly 1 (or the scale factor)
- Preserves norm: ||Wx|| = ||x|| (for scale=1)
- Perfect forward and backward variance preservation in linear networks
- No vanishing/exploding at initialization

How to generate: Take a random Gaussian matrix M and compute its QR decomposition M = QR. Set W = Q (the orthogonal factor). For rectangular matrices, generate a random matrix and use its SVD, M = UΣVᵀ: set W = UVᵀ, using the reduced factors so that the shapes match.
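The QR recipe can be sketched as follows, assuming NumPy. Two choices here are assumptions rather than part of the lesson: the rectangular case is handled with a reduced QR of a tall matrix (followed by a transpose when d_out < d_in) instead of the SVD route, and the columns of Q are multiplied by the signs of R's diagonal, a commonly used correction so the result does not depend on the sign convention of the QR routine. The function name `orthogonal` and the `gain` parameter are illustrative.

```python
import numpy as np

def orthogonal(d_out, d_in, gain=1.0, rng=None):
    """Return a (d_out, d_in) matrix with orthonormal rows or columns, times gain."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = max(d_out, d_in), min(d_out, d_in)
    M = rng.standard_normal((rows, cols))
    Q, R = np.linalg.qr(M)        # reduced QR: Q is rows x cols, orthonormal columns
    Q = Q * np.sign(np.diag(R))   # flip column signs to remove the QR sign ambiguity
    if d_out < d_in:
        Q = Q.T                   # wide case: orthonormal rows instead of columns
    return gain * Q

rng = np.random.default_rng(0)
W = orthogonal(64, 64, rng=rng)
x = rng.standard_normal(64)
print("||x||  =", np.linalg.norm(x))
print("||Wx|| =", np.linalg.norm(W @ x))
```

For a square matrix the two printed norms should agree up to floating-point error. For a wide matrix (d_out < d_in) only W Wᵀ = I holds, so norms are preserved for gradients flowing backward but not for every input.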

Use cases: RNNs (where repeated multiplication by the same W compounds any deviation of its singular values from 1), and very deep networks, where the spread in the singular values of i.i.d. random matrices compounds from layer to layer.

8. Bias Initialization

Biases are typically initialized to 0 (or a small positive constant like 0.01 for ReLU to encourage initial activity). Since biases are added (not multiplied), they don't cause variance explosion/vanishing.

For ReLU networks specifically, a small positive bias (e.g., 0.01) nudges more neurons to start active, which can reduce the chance of dead neurons at initialization. The size of the effect depends on how large the bias is relative to the spread of the pre-activations.
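How much a positive bias changes the fraction of initially active ReLU units can be estimated with the normal CDF. This sketch assumes the pre-activation without bias is Gaussian with zero mean and standard deviation `preact_std` (an idealization); `fraction_active` is a hypothetical helper.

```python
import math

def fraction_active(bias, preact_std):
    """P(z + bias > 0) for z ~ N(0, preact_std**2), via the normal CDF."""
    return 0.5 * (1.0 + math.erf(bias / (preact_std * math.sqrt(2.0))))

for bias, std in [(0.0, 1.0), (0.01, 1.0), (0.01, 0.05), (0.1, 1.0)]:
    print(f"bias={bias:<5} preact_std={std:<5} active={fraction_active(bias, std):.3f}")
```

Under this assumption, a bias of 0.01 barely moves the active fraction when pre-activations have unit scale, and matters more when the pre-activations themselves are small.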

Key Terms

Biases
Orthogonal initialization
Poor initialization
Very large random weights
Very small random weights

Worked Examples

Example 1: Choosing Initialization by Layer Size

Problem: A fully-connected layer maps from 256 inputs to 512 outputs. Compute the appropriate initialization variance for (a) tanh activation and (b) ReLU activation.

Solution:

(a) Xavier/Glorot (for tanh):
Var(W) = 2/(256 + 512) = 2/768 ≈ 0.002604
Std dev = √0.002604 ≈ 0.0510
Using uniform: a = √(6/768) = √0.0078125 ≈ 0.0884
W ∼ U(−0.0884, 0.0884)

(b) He/Kaiming (for ReLU):
Var(W) = 2/256 = 0.0078125
Std dev = √0.0078125 ≈ 0.0884
Using uniform: a = √(6/256) = √0.02344 ≈ 0.1531
W ∼ U(−0.1531, 0.1531)

He initialization uses larger weights here: the standard deviation is about 73% larger, a factor of √3. A factor of √2 compensates for ReLU zeroing the negative pre-activations; the remaining √1.5 comes from Xavier averaging d_in = 256 with the larger d_out = 512.
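The arithmetic in this example is easy to reproduce. A short sketch using only the Python standard library; the variable names are illustrative.

```python
import math

d_in, d_out = 256, 512

# (a) Xavier/Glorot
xavier_var = 2.0 / (d_in + d_out)
xavier_std = math.sqrt(xavier_var)
xavier_bound = math.sqrt(6.0 / (d_in + d_out))

# (b) He/Kaiming
he_var = 2.0 / d_in
he_std = math.sqrt(he_var)
he_bound = math.sqrt(6.0 / d_in)

print(f"Xavier: var={xavier_var:.6f} std={xavier_std:.4f} uniform bound={xavier_bound:.4f}")
print(f"He:     var={he_var:.6f} std={he_std:.4f} uniform bound={he_bound:.4f}")
print(f"std ratio He/Xavier: {he_std / xavier_std:.4f}")
```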

Example 2: Verifying Variance Preservation

Problem: A 10-layer ReLU network has 100 neurons per layer. Weights initialized with He normal: W ∼ N(0, 2/100). Inputs x ∼ N(0, 1). Walk through the first three layers and compute the expected output variance at each. Assume pre-activations are roughly Gaussian (reasonable for large layer widths).

Solution:

Layer 0 (input): Var(x) = 1

Layer 1 forward: z₁ = W₁x, Var(z₁) = 100 · (2/100) · E[x²] = 2
a₁ = ReLU(z₁): E[a₁²] = Var(z₁)/2 = 1 and Var(a₁) = 2 · (π−1)/(2π) ≈ 0.6816

Layer 2 forward: z₂ = W₂a₁. The weights have zero mean, so the second moment of a₁ enters, not its variance: Var(z₂) = 100 · (2/100) · E[a₁²] = 2
a₂ = ReLU(z₂): E[a₂²] = 1 and Var(a₂) ≈ 0.6816

Layer 3 forward: z₃ = W₃a₂, Var(z₃) = 100 · (2/100) · E[a₂²] = 2
a₃ = ReLU(z₃): E[a₃²] = 1 and Var(a₃) ≈ 0.6816

The pre-activation variance settles at 2 and stays there: He initialization preserves it from layer to layer. The only change is the one-time doubling at layer 1, which happens because the raw input has not passed through a ReLU, so its second moment is not halved. A tempting mistake is to propagate Var(a) instead of E[a²]; that would give Var(z₂) = 2 · 0.6816 ≈ 1.36 and an apparent decay by a factor of 0.6816 per layer, which does not occur. ReLU outputs have a non-zero mean, and for zero-mean weights Var(W_{ij} aⱼ) = Var(W) · E[aⱼ²], so no factor beyond 2/d_in is needed.
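A quick simulation is a useful check on hand calculations like this one. The sketch below assumes NumPy and draws a single set of He-normal weights for three layers of width 100, then measures Var(z), E[a²], and Var(a) at each layer over many inputs. With only 100 units per layer the measured values fluctuate by a few percent around their expectations.

```python
import numpy as np

rng = np.random.default_rng(0)
width, n_samples, depth = 100, 20000, 3

a = rng.standard_normal((width, n_samples))  # inputs x ~ N(0, 1)
stats = []
for layer in range(1, depth + 1):
    W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))  # He normal
    z = W @ a
    a = np.maximum(z, 0.0)  # ReLU
    stats.append((float(z.var()), float((a ** 2).mean()), float(a.var())))
    print(f"layer {layer}: Var(z)={stats[-1][0]:.3f}  "
          f"E[a^2]={stats[-1][1]:.3f}  Var(a)={stats[-1][2]:.3f}")
```

The quantity to watch is Var(z): since the weights have zero mean, its expected value at each layer is d_in · Var(W) times the second moment E[a²] of the previous layer's output.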

Example 3: Orthogonal vs Random Initialization for RNN

Problem: An RNN applies h_t = tanh(W h_{t−1} + ...). W is 128×128. Compare the expected norm of h_T given h₀ over T=50 steps for (a) Xavier initialization Var(W) = 1/128 and (b) orthogonal initialization WᵀW = I.

Solution:

(a) Xavier: Each entry of W has variance 1/128, so E[||Wh||²] = ||h||² for a fixed h: the norm is preserved on average. The individual singular values are not all 1, though. For a large square Gaussian matrix they spread roughly between 0 and 2, so W stretches some directions and shrinks others, and repeated multiplication compounds that distortion. On top of this, the non-linearity tanh contracts the norm (||tanh(z)|| < ||z|| for non-zero z). After 50 steps, the hidden state norm has likely decayed significantly.

(b) Orthogonal: ||Wh|| = ||h|| exactly, for every h. Although tanh still contracts, the linear transformation itself doesn't amplify or attenuate any direction. The linear part contributes no decay or growth of its own, which helps the RNN maintain information over longer sequences.

This is why orthogonal initialization is preferred for RNNs: it eliminates one source of norm decay/explosion, leaving the activation function as the only source of contraction.
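One way to see the difference between the two schemes is to compare the singular values of the matrices they produce. This sketch assumes NumPy; the orthogonal matrix is built with a QR decomposition of a Gaussian matrix, which is an assumption since the example does not say how W is generated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128

# (a) i.i.d. Gaussian entries with variance 1/n
W_gauss = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))

# (b) orthogonal matrix from the QR decomposition of a Gaussian matrix
Q, R = np.linalg.qr(rng.standard_normal((n, n)))
W_orth = Q * np.sign(np.diag(R))

sv_gauss = np.linalg.svd(W_gauss, compute_uv=False)
sv_orth = np.linalg.svd(W_orth, compute_uv=False)

print(f"Gaussian:   min={sv_gauss.min():.3f} max={sv_gauss.max():.3f} "
      f"mean of squares={np.mean(sv_gauss ** 2):.3f}")
print(f"Orthogonal: min={sv_orth.min():.3f} max={sv_orth.max():.3f}")
```

The Gaussian matrix preserves squared norm only on average (mean of squared singular values near 1), while its individual singular values range from near 0 to about 2. The orthogonal matrix has every singular value equal to 1.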

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: Derive the variance of a ReLU output when the input is drawn from N(0, 1) and explain why it's not 0.5 (the simplified "half the variance" claim).

Problem 2: A layer has d_in = 1024, d_out = 256. Compute the Xavier uniform bounds and He normal standard deviation.

Problem 3: Show that if all weights are initialized to the same non-zero value (and biases to zero), all neurons in a layer will have identical gradients throughout training, regardless of the data. Prove that the network effectively has only one neuron per layer.

Problem 4: For a deep linear network with orthogonal weight matrices and d_in = d_out for all layers, prove that ||∂L/∂x|| = ||∂L/∂ŷ|| at initialization.

Problem 5: A common mistake: initializing biases to large random values. Explain why this is problematic and why random biases are generally unnecessary.

Answers

**Problem 1:** z ∼ N(0,1). E[ReLU(z)] = 1/√(2π) ≈ 0.3989. E[ReLU(z)²] = 1/2 = 0.5. Var = 0.5 − (0.3989)² = 0.5 − 0.1592 = 0.3408. It's not 0.5 because the non-zero mean (0.3989) eats into the variance. The 0.5 figure is the second moment E[ReLU(z)²], and that is the quantity the next layer's pre-activation variance depends on when the weights have zero mean, which is why He initialization uses the factor 1/2.

**Problem 2:** Xavier uniform: a = √(6/(1024+256)) = √(6/1280) = √0.0046875 ≈ 0.0685. W ∼ U(−0.0685, 0.0685). He normal: σ = √(2/1024) = √0.001953 ≈ 0.0442. W ∼ N(0, 0.0442²).

**Problem 3:** Let all entries of W^(ℓ) equal c and b^(ℓ) = 0. Then z^(ℓ) = c · 1 · (sum of a^(ℓ−1)). All neurons in layer ℓ have identical pre-activations → identical activations. Backward: δ^(ℓ) = (c · sum of δ^(ℓ+1)) · f'(same z) → identical across neurons. ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ → all rows of the gradient are identical. After the update, the rows of W^(ℓ) are still identical to one another, so the neurons remain copies of each other. By induction, symmetry is preserved forever. The effective capacity = 1 neuron per layer.

**Problem 4:** ŷ = W^(L) W^(L−1) ... W^(1) x. Product of orthogonal matrices is orthogonal. So ||ŷ|| = ||x||. ∂L/∂x = (W^(1))ᵀ ... (W^(L))ᵀ ∂L/∂ŷ. Product of orthogonal transposes is orthogonal. So ||∂L/∂x|| = ||∂L/∂ŷ||. Perfect gradient preservation. ✓

**Problem 5:** Biases add to pre-activations: z = Wx + b. If b is random (e.g., large magnitude), it can push all neurons into saturated regions of the activation function BEFORE any training, causing immediate vanishing gradients. Also, zero initialization is already unbiased. Random biases offer little benefit, since random weights already break symmetry, and large ones carry a real risk of harm. Initialize biases to 0, or to a small positive constant for ReLU.

Summary

Poor initialization (all zeros, too small, too large) prevents training: activations/gradients vanish or explode, or symmetry prevents learning diverse features.
Xavier/Glorot sets Var(W) = 2/(d_in + d_out), balancing forward activation variance and backward gradient variance for tanh/sigmoid networks.
He/Kaiming sets Var(W) = 2/d_in, accounting for ReLU zeroing the negative half of the pre-activations, which halves the second moment passed to the next layer. Standard for ReLU networks.
Orthogonal initialization (WᵀW = I) gives exact norm preservation, preferred for RNNs and very deep networks.
Biases should be initialized to zero (or small positive for ReLU). Random biases are generally unnecessary, and large ones are harmful.

Pitfalls

Using Xavier/Glorot initialization for ReLU networks. Xavier assumes symmetric activations (like tanh/sigmoid) and balances forward and backward variance. ReLU zeros roughly half the activations, so the forward variance needs a compensating factor of 2 — always use He/Kaiming initialization (2/n_in) for ReLU-based networks.
Applying He initialization blindly to non-ReLU activations. GELU and SiLU (also called Swish) have different variance properties than ReLU. The zeroing rate and expected squared output differ. Test activation variance empirically for non-standard activations rather than assuming He is optimal.
Initializing biases to random values. Biases add (not multiply), so random initialization can push activations into saturated regimes before any training begins. Always initialize biases to zero, or a small positive constant (e.g., 0.01) for ReLU to encourage initial neuron activity.
Forgetting the LSTM forget gate bias. Standard zero initialization for the LSTM forget gate bias b_f gives σ(0) = 0.5, so the cell state is roughly halved at every step and stored information decays quickly. Initializing b_f to 1 (or another positive value) gives an initial "remember" prior, which helps when learning long-range dependencies.
Assuming orthogonal initialization is always better. Orthogonal initialization helps RNNs by preventing repeated eigenvalue amplification, but provides negligible benefit for standard feedforward CNNs. It's also expensive to compute for wide layers. Use it for RNNs and extremely deep networks; stick with He/Xavier for standard feedforward architectures.

Quiz

Q1: Xavier initialization sets Var(W) = 2/(n_in + n_out). Why does it use BOTH n_in and n_out?

A) It's an arbitrary choice; either n_in or n_out alone would work
B) To balance between forward activation variance (depends on n_in) and backward gradient variance (depends on n_out)
C) Because weights connect n_in inputs to n_out outputs, so both dimensions matter for numerical stability
D) To maximize the learning rate

Answer and Explanations

**Correct: B) To balance between forward activation variance (depends on n_in) and backward gradient variance (depends on n_out)**

Forward pass: Var(z) = n_in · Var(W) · Var(x) → needs Var(W) = 1/n_in. Backward pass: Var(δ^(ℓ−1)) = n_out · Var(W) · Var(δ^(ℓ)) → needs Var(W) = 1/n_out. Xavier's harmonic mean 2/(n_in+n_out) is a compromise.

- A) Incorrect. Using only n_in would cause backward gradient explosion when n_out ≫ n_in.
- B) ✓ Correct. The harmonic mean balances both directions.
- C) While both matter, this isn't specific enough; the variance argument is the real reason.
- D) Incorrect. Learning rate is a separate hyperparameter.

Q2: Why does He initialization use Var(W) = 2/n_in (instead of Xavier's 2/(n_in+n_out))?

A) ReLU networks are always wider, so n_out doesn't matter
B) ReLU zeros out roughly half the activations, so the forward variance needs a compensating factor of 2
C) He initialization was designed for convolutional layers where n_in and n_out differ greatly
D) It's just a different convention; Xavier and He are equivalent in practice

Answer and Explanations

**Correct: B) ReLU zeros out roughly half the activations, so the forward variance needs a compensating factor of 2**

With ReLU, E[output²] = Var(input)/2 for a zero-mean symmetric input. To maintain variance: n_in · Var(W) · (1/2) = 1 → Var(W) = 2/n_in. The factor of 2 compensates for ReLU discarding negative activations.

- A) Incorrect. Width doesn't determine the initialization formula.
- B) ✓ Correct. The factor of 2 directly compensates for ReLU's zeroing of negative values.
- C) Incorrect. The derivation doesn't depend on layer type.
- D) Incorrect. They give different variances and He works noticeably better for ReLU networks.

Q3: A 100-layer network has all weights initialized to exactly 0.01 (same value for every weight). Biases are 0. What happens during training?

A) The network trains normally
B) The network cannot break symmetry: all neurons in each layer learn identical features
C) Gradients explode immediately
D) The network converges but more slowly

Answer and Explanations

**Correct: B) The network cannot break symmetry: all neurons in each layer learn identical features**

With all-equal weights, every neuron in a layer computes the identical function on identical inputs. Their gradients are identical. After any update, the neurons in a layer still share the same weights. The network collapses to effectively one feature per layer, regardless of width.

- A) Incorrect. Symmetry prevents diverse feature learning.
- B) ✓ Correct. Identical initialization means identical gradients means identical updates, so symmetry is preserved forever.
- C) Incorrect. Small identical weights don't explode; the problem is symmetry, not magnitude.
- D) Incorrect. It converges to a degenerate solution with no width benefit.

Q4: What is the key property of orthogonal initialization?

A) All weights are positive
B) The weight matrix preserves the norm of input vectors: ||Wx|| = ||x||
C) Weights are drawn from a uniform distribution
D) The determinant of W is 1

Answer and Explanations

**Correct: B) The weight matrix preserves the norm of input vectors: ||Wx|| = ||x||**

When WᵀW = I (orthogonal matrix), ||Wx||² = xᵀWᵀWx = xᵀIx = ||x||². This gives perfect forward variance preservation and (for square matrices) perfect backward gradient preservation.

- A) Incorrect. Orthogonal matrices can have negative entries.
- B) ✓ Correct. Norm preservation is the defining property; all singular values are exactly 1.
- C) Incorrect. Orthogonal initialization is about structure (WᵀW=I), not distribution.
- D) det(W) = ±1 for orthogonal matrices, but norm preservation (not determinant) is the key property.

Q5: For a ReLU network, why might you initialize biases to a small positive value (e.g., 0.01) instead of zero?

A) To help break symmetry between neurons
B) To encourage most neurons to be initially active (positive pre-activations), reducing dead neurons at the start
C) To increase the learning rate
D) Biases must be non-zero for ReLU to work

Answer and Explanations

**Correct: B) To encourage most neurons to be initially active (positive pre-activations), reducing dead neurons at the start**

If biases are 0 and the input distribution is centered at 0, roughly half the neurons will have negative pre-activations and output 0. A small positive bias shifts the distribution so that somewhat more than half of the neurons start active; how many more depends on the size of the bias relative to the standard deviation of the pre-activations. This gives more gradient paths for early learning.

- A) Biases don't break symmetry; random WEIGHTS do that.
- B) ✓ Correct. Small positive bias increases the fraction of initially active ReLU neurons.
- C) Incorrect. Bias magnitude is unrelated to learning rate.
- D) Incorrect. ReLU works fine with zero bias.

Next Steps

Move on to 16-08 — Regularization to learn L1/L2 regularization, dropout, data augmentation, and other techniques for preventing overfitting.
