16-06: Gradient Flow in Deep Networks

Phase 16: Neural Network Mathematics
Subject: 16-06
Prerequisites: 16-05 (Backpropagation), 16-02 (Activation Functions), Phase 9 (matrix norms)
Next subject: 16-07 (Weight Initialization)


Learning Objectives

By the end of this subject, you will be able to:

  1. Prove mathematically why sigmoid and tanh activations cause vanishing gradients in deep networks
  2. Explain how ReLU mitigates vanishing gradients and quantify the improvement
  3. Derive the conditions for exploding gradients and explain gradient clipping
  4. Analyze gradient magnitude through a network using the product-of-Jacobians decomposition
  5. Connect initialization strategies to stable gradient flow at initialization time

Core Content

1. The Gradient as a Product of Jacobians

Recall from 16-05 that the error signal propagates backward:

δ^(ℓ) = ((W^(ℓ+1))ᵀ δ^(ℓ+1)) ⊙ f'_ℓ(z^(ℓ))

Unrolling this from the output back to layer β„“:

δ^(ℓ) = (∏_{k=ℓ}^{L−1} diag(f'_k(z^(k))) (W^(k+1))ᵀ) δ^(L)

This is a long product of matrices. The norm of this product determines whether gradients survive or die.

⚠️ THIS IS CRITICAL: Gradient flow is determined by the product of Jacobians along the backward path. If the singular values of these Jacobians are consistently < 1, gradients VANISH. If they're consistently > 1, gradients EXPLODE. Stable training requires singular values ≈ 1 on average.
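To make the unrolled product concrete, here is a minimal sketch for a width-1 (scalar) sigmoid network, where every Jacobian is a single number. The function names and the sample weights and pre-activations are made up for illustration; the only point is that stepping the recurrence backward and multiplying out the product give the same δ.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def delta_by_recurrence(layers, delta_out):
    """Step delta_l = (w_{l+1} * delta_{l+1}) * f'(z_l) from the output backward.

    layers[k] = (w_next, z): the weight leaving layer k and that layer's
    pre-activation (scalar stand-ins for W^(k+1) and z^(k)).
    """
    delta = delta_out
    for w_next, z in reversed(layers):
        delta = (w_next * delta) * sigmoid_prime(z)
    return delta

def delta_by_product(layers, delta_out):
    """Multiply all per-layer Jacobians first, then apply them to delta_out."""
    factor = 1.0
    for w_next, z in layers:
        factor *= sigmoid_prime(z) * w_next
    return factor * delta_out

# Hypothetical 4-layer chain: (outgoing weight, pre-activation) per layer.
layers = [(1.2, 0.3), (-0.8, -1.0), (1.5, 0.0), (0.9, 2.0)]
d_rec = delta_by_recurrence(layers, 1.0)
d_prod = delta_by_product(layers, 1.0)
print(d_rec, d_prod)
```

In the matrix case the scalars become `diag(f'(z))` and `Wᵀ`, and the size of the product is governed by their singular values rather than by plain absolute values.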

2. Vanishing Gradients: The Sigmoid/Tanh Problem

For a deep network using sigmoid activations, let's analyze the gradient magnitude for early layers.

Consider the simplified case where all layers have the same width and weights are initialized with small values (a common early practice). The backward recurrence is:

δ^(ℓ) = (Wᵀ δ^(ℓ+1)) ⊙ σ'(z^(ℓ))

Taking norms and bounding each factor by its largest value:

||δ^(ℓ)|| ≤ ||W|| · σ'_max · ||δ^(ℓ+1)||

where σ'_max = max_z σ'(z) = σ'(0) = 0.25.

Even if ||W|| = 1, each layer multiplies the gradient by at most 0.25. After L layers:

||δ^(0)|| ≤ (0.25)^L · ||δ^(L)||

For a 10-layer network: (0.25)¹⁰ ≈ 9.5 × 10⁻⁷, so the gradient reaching the first layer is at most about a MILLIONTH of the output gradient. The early layers effectively receive zero learning signal.
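The bound is easy to tabulate. A small sketch (the name `sigmoid_gradient_bound` is only for illustration) that evaluates (||W|| · 0.25)^L for a few depths:

```python
SIGMOID_PRIME_MAX = 0.25  # largest value of sigma'(z), reached at z = 0

def sigmoid_gradient_bound(depth, weight_norm=1.0):
    """Upper bound on ||delta^(0)|| / ||delta^(L)|| for a sigmoid network."""
    return (weight_norm * SIGMOID_PRIME_MAX) ** depth

for depth in (5, 10, 20, 50):
    print(f"L = {depth:>2}: bound = {sigmoid_gradient_bound(depth):.2e}")
```

Passing `weight_norm=4.0` makes the bound equal to 1 at every depth, which shows how large the weights would have to be just to cancel the sigmoid's best-case attenuation.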

Why tanh helps only slightly: tanh'(z) ≤ 1 (max at z = 0), but it saturates quickly. For random inputs z ∼ N(0, 1), the expected tanh'(z) is about 0.6, still less than 1. After many layers, gradients still vanish, just more slowly than with sigmoid.
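The average derivative under z ∼ N(0, 1) can be checked numerically. The sketch below uses a plain trapezoid rule over [−8, 8]; the integration range, the step count, and all function names are choices made for this illustration rather than anything prescribed by the lesson.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

def expected_under_standard_normal(f, lo=-8.0, hi=8.0, steps=16000):
    """Approximate E[f(z)] for z ~ N(0, 1) with the trapezoid rule."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) * normal_pdf(lo) + f(hi) * normal_pdf(hi))
    for i in range(1, steps):
        z = lo + i * h
        total += f(z) * normal_pdf(z)
    return total * h

mean_sigmoid_prime = expected_under_standard_normal(sigmoid_prime)
mean_tanh_prime = expected_under_standard_normal(tanh_prime)
print(f"E[sigmoid'(z)] = {mean_sigmoid_prime:.3f}")
print(f"E[tanh'(z)]    = {mean_tanh_prime:.3f}")
```

Raising either result to the power L gives a rough average-case attenuation over L layers, as opposed to the best-case bound that uses the maximum derivative.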

3. ReLU to the Rescue

ReLU has derivative 1 for ALL positive inputs:

ReLU'(z) = 1 for z > 0

If the network is initialized so that roughly half the neurons are active (z > 0), the active half of each layer passes the backward signal through without shrinking it:

δ^(ℓ) = (Wᵀ δ^(ℓ+1)) ⊙ ReLU'(z^(ℓ))

The ReLU'(z) factor is 1 for active neurons and 0 for inactive ones. For the active paths, gradients propagate with NO multiplicative attenuation from the activation.
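As a small illustration of this masking, the sketch below applies ReLU'(z) elementwise to an upstream gradient. The helper name `backward_through_relu` and the sample numbers are hypothetical.

```python
def relu_prime(z):
    """Derivative of ReLU: 1 for active units (z > 0), 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def backward_through_relu(upstream, preacts):
    """Elementwise product of the upstream gradient with ReLU'(z)."""
    return [g * relu_prime(z) for g, z in zip(upstream, preacts)]

upstream = [0.5, -2.0, 3.0]   # W^T delta arriving from the layer above
preacts = [1.0, -0.1, 0.2]    # units 0 and 2 are active, unit 1 is not
masked = backward_through_relu(upstream, preacts)
print(masked)
```

The active entries come through with exactly their original values, while the inactive entry is zeroed no matter how large its upstream gradient was.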

The catch (dying ReLUs): If initialization or learning pushes too many neurons permanently into the negative regime, the gradients along those paths are permanently 0. The effective depth is reduced, but with proper initialization (16-07), this is manageable.

Quantitative comparison: For a 100-layer network:
- Sigmoid: gradient to layer 1 is at most (0.25)¹⁰⁰ ≈ 6 × 10⁻⁶¹ of the output gradient, which is completely gone
- ReLU (50% active): along any path whose units are all active, the activation factor is 1.0 at every layer, so the activation itself causes no attenuation on that path

4. Exploding Gradients

The opposite problem: if weight matrices have large singular values, gradients GROW exponentially:

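||δ^(ℓ)|| ≈ (σ_max(W))^(L−ℓ) · ||δ^(L)||

Here σ_max(W) is the largest singular value of W (not the sigmoid function), and the approximation describes the worst case, where the gradient lines up with the top singular direction at every layer. If σ_max(W) > 1, gradients can explode. This is especially problematic in RNNs (unrolled through time) and very deep networks.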

Symptoms of exploding gradients:
- Loss suddenly jumps to NaN
- Weight updates become enormous
- Training diverges catastrophically

Solution: Gradient Clipping

g ← g · min(1, C / ||g||)

If the gradient norm ||g|| exceeds threshold C, scale it down to norm C while preserving direction. This prevents any single update from being destructively large.

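Alternative: value clipping

gᵢ ← clip(gᵢ, −C, C)

Clips each gradient component individually to [−C, C]. Unlike norm clipping, this can change the direction of the gradient, because large components are shrunk while small ones are left alone.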
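Both clipping rules fit in a few lines. This is a minimal pure-Python sketch on a flat list of gradient components; the function names `clip_by_norm` and `clip_by_value` are illustrative and not taken from any particular library.

```python
import math

def clip_by_norm(grad, max_norm):
    """g <- g * min(1, C / ||g||): rescale the whole vector, keep its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(grad)
    scale = min(1.0, max_norm / norm)
    return [g * scale for g in grad]

def clip_by_value(grad, limit):
    """g_i <- clip(g_i, -C, C): clamp every component independently."""
    return [max(-limit, min(limit, g)) for g in grad]

grad = [30.0, 40.0]                       # norm 50, direction (0.6, 0.8)
norm_clipped = clip_by_norm(grad, 5.0)    # norm 5, same direction
value_clipped = clip_by_value(grad, 5.0)  # both components clamped to 5
print(norm_clipped, value_clipped)
```

In this example norm clipping keeps the 3:4 ratio between the components, whereas value clipping turns the gradient into a vector along the diagonal, which is a different direction.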

5. The Gradient Norm Across Layers

A useful diagnostic: plot ||∂L/∂W^(ℓ)|| for each layer ℓ during training.

Expected healthy behavior: Gradient norms are roughly similar across layers (within an order of magnitude).

Vanishing gradient signature: Early layers have dramatically smaller gradient norms than later layers, sometimes by a factor of 10⁻¹⁰.

Exploding gradient signature: Early layers have dramatically larger gradient norms than later layers.
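These three signatures can be turned into a rough automatic check. The sketch below is a hypothetical helper, not a standard API: it compares only the first and last layer, and its default `tolerance=10.0` encodes the "within an order of magnitude" rule of thumb as an assumption.

```python
def diagnose_gradient_norms(layer_norms, tolerance=10.0):
    """Classify gradient flow from per-layer norms, ordered first layer to last.

    Returns "vanishing" if the first layer's norm is more than `tolerance`
    times smaller than the last layer's, "exploding" if it is more than
    `tolerance` times larger, and "healthy" otherwise.
    """
    first, last = layer_norms[0], layer_norms[-1]
    if last == 0.0:
        return "no gradient at output"
    ratio = first / last
    if ratio < 1.0 / tolerance:
        return "vanishing"
    if ratio > tolerance:
        return "exploding"
    return "healthy"

# Made-up norms for three 4-layer networks.
print(diagnose_gradient_norms([1e-9, 1e-6, 1e-3, 1.0]))
print(diagnose_gradient_norms([0.8, 1.1, 0.9, 1.0]))
print(diagnose_gradient_norms([500.0, 40.0, 5.0, 1.0]))
```

In a real training loop the list would be filled with the measured ||∂L/∂W^(ℓ)|| for each layer at regular intervals, and plotting the full curve remains more informative than a single label.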

6. Mathematical Analysis of Gradient Flow at Initialization

At initialization (before training), we can analyze gradient flow analytically. Consider a linear network with orthogonal weight matrices (WᵀW = I) of equal dimension:

a^(ℓ) = W^(ℓ) a^(ℓ−1)

The gradient: ∂L/∂a^(0) = (W^(1))ᵀ (W^(2))ᵀ ... (W^(L))ᵀ ∂L/∂a^(L)

Since each W^(ℓ) is orthogonal, each (W^(ℓ))ᵀ preserves norms, so ||∂L/∂a^(0)|| = ||∂L/∂a^(L)||: perfect gradient preservation!
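A quick numerical check of this claim, using 2×2 rotation matrices as a convenient family of orthogonal matrices. The rotation angles, the depth of 100, and the helper names are arbitrary choices for this sketch.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, a simple example of an orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def transpose_matvec(matrix, vec):
    """Return matrix^T @ vec for a square matrix stored as nested lists."""
    n = len(vec)
    return [sum(matrix[i][j] * vec[i] for i in range(n)) for j in range(n)]

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def backprop_linear(grad_out, weights, scale=1.0):
    """Push grad_out back through linear layers a = (scale * W) a_prev.

    weights is ordered W^(1), ..., W^(L); the last layer is applied first.
    """
    grad = list(grad_out)
    for w in reversed(weights):
        grad = [scale * g for g in transpose_matvec(w, grad)]
    return grad

weights = [rotation(0.37 * k) for k in range(1, 101)]  # W^(1) .. W^(100)
grad_out = [3.0, 4.0]                                  # norm 5
preserved = norm(backprop_linear(grad_out, weights))
shrunk = norm(backprop_linear(grad_out, weights, scale=0.9))
print(preserved, shrunk)
```

With exactly orthogonal weights the norm stays at 5 after 100 layers; scaling every matrix by 0.9 multiplies the norm by 0.9 per layer, so the same gradient shrinks by a factor of 0.9¹⁰⁰.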

With non-linear activations: Even with orthogonal weights, activations introduce scaling. The Jacobian of the activation f at layer ℓ is diagonal, with eigenvalues f'(zᵢ^(ℓ)). If the backward signal is spread roughly evenly over the d units of each layer, each layer scales its norm by the root-mean-square of those values:

||∂L/∂a^(0)|| ≈ ∏_{ℓ=1}^{L} ((1/d) Σ_{i=1}^{d} f'(zᵢ^(ℓ))²)^{1/2} · ||∂L/∂a^(L)||

If f = ReLU and half the neurons are active, f'(zᵢ)² is 1 for half the units and 0 for the rest, so each layer contributes a factor (1/2)^{1/2}. The gradient magnitude at the input is approximately (1/2)^{L/2} · ||∂L/∂a^(L)||: decaying, but MUCH more slowly than with sigmoid.
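To put "much more slowly" into numbers, the sketch below counts how many layers it takes for each per-layer factor to push the gradient below a chosen threshold. The threshold of 10⁻⁶ and the function name are arbitrary choices for illustration; the per-layer factors are the sigmoid bound 0.25 and the half-active ReLU estimate (1/2)^{1/2}.

```python
import math

def depth_until_below(per_layer_factor, threshold):
    """Smallest depth L at which per_layer_factor**L drops below threshold.

    Assumes 0 < per_layer_factor < 1, otherwise the loop would not terminate.
    """
    magnitude, depth = 1.0, 0
    while magnitude >= threshold:
        magnitude *= per_layer_factor
        depth += 1
    return depth

sigmoid_depth = depth_until_below(0.25, 1e-6)
relu_depth = depth_until_below(math.sqrt(0.5), 1e-6)
print(sigmoid_depth, relu_depth)
```

Because 0.25 = ((1/2)^{1/2})⁴, a half-active ReLU network under this estimate can be four times as deep as a sigmoid network before reaching the same attenuation.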

7. Residual Connections and Gradient Highways

Residual connections (formalized in Phase 17) provide a direct gradient path:

a^(ℓ+1) = F(a^(ℓ)) + a^(ℓ)

The backward gradient:

∂L/∂a^(ℓ) = ∂L/∂a^(ℓ+1) · (∂F/∂a^(ℓ) + I) = ∂L/∂a^(ℓ+1) + ∂L/∂a^(ℓ+1) · ∂F/∂a^(ℓ)

The identity term (I) provides a "gradient highway": even if ∂F/∂a^(ℓ) has small singular values, the +I term ensures at least that portion of the gradient reaches earlier layers unchanged. This is why ResNets can be trained at depths of 1000+ layers, while plain networks typically become hard to train beyond a few tens of layers.
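A scalar caricature makes the effect of the +I term visible. In the sketch below each block's Jacobian ∂F/∂a is replaced by a single small number (0.01, chosen arbitrarily), and the backward factor is accumulated over 20 blocks with and without the skip connection. The function names are made up for this example.

```python
def plain_chain(block_jacobians):
    """Backward factor without skip connections: the product of the J_k."""
    factor = 1.0
    for j in block_jacobians:
        factor *= j
    return factor

def residual_chain(block_jacobians):
    """Backward factor with skip connections: the product of (J_k + 1)."""
    factor = 1.0
    for j in block_jacobians:
        factor *= (j + 1.0)
    return factor

block_jacobians = [0.01] * 20   # every residual function is nearly "dead"
plain = plain_chain(block_jacobians)
residual = residual_chain(block_jacobians)
print(f"plain: {plain:.1e}, residual: {residual:.3f}")
```

Without the skip connection the factor collapses to about 10⁻⁴⁰; with it the factor stays close to 1, because every block passes the incoming gradient through and adds only a small correction.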



Key Terms

Worked Examples

Example 1: Computing Gradient Attenuation

Problem: A 5-layer network with sigmoid activations has all pre-activations z ∼ N(0, 1) i.i.d. Assume orthogonal W matrices with ||W|| = 1. What fraction of the output gradient reaches the input?

Solution:

For z ∼ N(0, 1), we need E[σ'(z)], where σ'(z) = σ(z)(1 − σ(z)).

Using numerical approximation: for z ∼ N(0, 1), the expected value of σ'(z) is approximately 0.207 (can be computed via integration or sampling).

Per layer: treating this mean as the per-layer factor, the gradient is multiplied by ∼0.207 on average. After 5 layers: 0.207⁵ ≈ 0.00038.

Only 0.038% of the gradient survives. The first layer receives a learning signal roughly 2600× weaker than the last.

With ReLU instead (half active): per-layer factor is 0.5 (only active paths survive). After 5 layers with ReLU: 0.5⁵ = 0.03125. About 3% survives, 82× more than with sigmoid.
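The arithmetic in this example can be reproduced in a few lines. The per-layer factors 0.207 and 0.5 are taken directly from the solution above; the variable names are only illustrative.

```python
layers = 5
sigmoid_factor = 0.207   # E[sigma'(z)] for z ~ N(0, 1), from the solution
relu_factor = 0.5        # fraction of active units

sigmoid_fraction = sigmoid_factor ** layers
relu_fraction = relu_factor ** layers
advantage = relu_fraction / sigmoid_fraction

print(f"sigmoid: {sigmoid_fraction:.5f} (about 1/{1 / sigmoid_fraction:.0f})")
print(f"ReLU:    {relu_fraction:.5f}")
print(f"ReLU keeps about {advantage:.0f}x more gradient")
```

Changing `layers` shows how quickly the gap widens: the ratio between the two fractions grows by a factor of 0.5 / 0.207 ≈ 2.4 with every added layer.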

Example 2: Exploding Gradients in an RNN

Problem: An RNN with scalar hidden state h_t = w·h_{t−1} + ... is unrolled for T = 100 steps. The recurrent weight w = 1.1. The loss depends on the final state h_T. What is ∂L/∂h₀?

Solution:

∂L/∂h₀ = ∂L/∂h_T · ∂h_T/∂h₀

∂h_T/∂h₀ = ∏_{t=1}^{T} ∂h_t/∂h_{t−1} = ∏_{t=1}^{T} w = w^T

With w = 1.1, T = 100: 1.1¹⁰⁰ ≈ 13,780.

The gradient is amplified by a factor of nearly 14,000! This causes wildly unstable training. Gradient clipping with threshold C would rescale such a gradient by C/||g||, capping its norm at C.

If w = 0.9 instead: 0.9¹⁰⁰ ≈ 2.66 × 10⁻⁵, which is nearly zero. The RNN forgets the distant past.

This is the fundamental challenge of training RNNs: the weight magnitude must be very close to 1 for long-range dependencies, which is why LSTMs/GRUs (Phase 17) use gating mechanisms to learn when to remember/forget rather than relying on a fixed scalar multiplier.
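How narrow the stable window around w = 1 is becomes clear when the gain w^T is tabulated for a few weights. A short sketch for the scalar linear recurrence of this example (the function name `recurrent_gain` is illustrative):

```python
def recurrent_gain(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1} + ..."""
    return w ** steps

for w in (0.9, 0.99, 1.0, 1.01, 1.1):
    print(f"w = {w:<4}  gain over 100 steps = {recurrent_gain(w, 100):.3e}")
```

Even a 1% deviation from w = 1 changes the gradient by a factor of roughly e or 1/e over 100 steps, and the effect compounds further for longer sequences.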

Example 3: Gradient Clipping Threshold Selection

Problem: During training, you observe that gradient norms fluctuate between 0.1 and 50. You want to clip to prevent the largest updates from destabilizing training. The average healthy gradient norm is around 1. What threshold C should you choose, and what would be the effective update step for a gradient with ||g|| = 50 when C = 5?

Solution:

Choose C around the 90th-95th percentile of gradient norms, not the average. If norms range from 0.1 to 50 with median near 1, try C = 5.

For ||g|| = 50 with C = 5: g_clipped = g · (5/50) = 0.1 · g

The update direction is preserved, but the step size is reduced by 10×. This prevents a single outlier batch from destroying progress while still allowing learning from it.

Typical values in practice: C = 1.0, C = 5.0, or C = 10.0, tuned based on monitoring.
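The percentile rule from this example might be sketched as follows. The list of observed norms is made-up data, and `percentile` is a small nearest-rank helper written for this illustration rather than a library function.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for an integer q with 0 < q <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical gradient norms logged over ten training steps.
observed_norms = [0.1, 0.4, 0.7, 0.9, 1.0, 1.1, 1.3, 2.0, 4.8, 50.0]

threshold = percentile(observed_norms, 90)   # candidate clipping threshold C
outlier_scale = min(1.0, threshold / 50.0)   # factor applied to the ||g|| = 50 step
print(threshold, outlier_scale)
```

On this sample the 90th percentile lands close to the C = 5 suggested above, so only the single outlier step would be rescaled and the other nine would pass through unchanged.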

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: For a 20-layer ReLU network where exactly 40% of neurons are active at each layer (all weights orthogonal, ||W||=1), compute the expected fraction of gradient paths that survive from the output to the first layer.

Problem 2: A deep linear network has weight matrices with singular values all equal to σ. Derive the condition on σ for neither vanishing nor exploding gradients.

Problem 3: In a network with tanh activation, the pre-activations are distributed as z ∼ Uniform(−2, 2) at each layer. Estimate the per-layer gradient attenuation factor E[tanh'(z)].

Problem 4: Gradient clipping with threshold C scales g to have norm C when ||g|| > C. Prove that this operation is a projection onto the ℓ₂ ball of radius C.

Problem 5: Show that in a residual network with update a^(ℓ+1) = F_ℓ(a^(ℓ)) + a^(ℓ), the gradient ∂L/∂a^(ℓ) contains an additive term ∂L/∂a^(ℓ+1) that does NOT pass through the Jacobian of F_ℓ. Explain why this prevents vanishing gradients even if F_ℓ has very small Jacobian.

Answers

**Problem 1:** Each active neuron contributes factor 1; the 60% inactive neurons dead-end. A gradient path survives only if all 20 neurons along that path are active. P(survive one layer) = 0.4. P(survive all 20) = 0.4²⁰ ≈ 1.1 × 10⁻⁸. Surprisingly, nearly ALL paths die! But in practice, the network adapts during training: weights shift so that important paths stay active. Also, each neuron connects to MANY neurons in adjacent layers, so multiple paths connect any two nodes, and at least some survive.

**Problem 2:** For a linear network with L identical layers: ∂L/∂a^(0) = (Wᵀ)^L ∂L/∂a^(L). The singular values of (Wᵀ)^L are σ^L. For stable gradient flow, we need σ^L ≈ 1 for all L, which requires σ = 1 exactly (or very close). Any σ ≠ 1 leads to exponential growth or decay in depth.

**Problem 3:** tanh'(z) = 1 − tanh²(z), whose antiderivative is tanh(z). For z ∼ Uniform(−2, 2):

E[tanh'(z)] = (1/4) ∫_{−2}^{2} (1 − tanh²(z)) dz = (1/4) [tanh(z)]_{−2}^{2} = (1/4)[tanh(2) − tanh(−2)] = (1/4)[0.964 + 0.964] = (1/4)[1.928] ≈ 0.482

Each layer attenuates gradients by about half on average. After 10 layers: 0.482¹⁰ ≈ 0.00068. Still vanishing, though slower than sigmoid.

**Problem 4:** The ℓ₂ ball of radius C is B_C = {x : ||x|| ≤ C}. If ||g|| ≤ C, then g already lies in B_C and is its own projection, matching min(1, C/||g||) = 1. For a point g with ||g|| > C, the projection onto B_C is argmin_{||x||≤C} ||x − g||. For any x in B_C, the reverse triangle inequality gives ||x − g|| ≥ ||g|| − ||x|| ≥ ||g|| − C, with equality only at the scaled point x = g·(C/||g||). This is exactly the gradient clipping formula. ✓

**Problem 5:** ∂L/∂a^(ℓ) = ∂L/∂a^(ℓ+1) · ∂a^(ℓ+1)/∂a^(ℓ) = ∂L/∂a^(ℓ+1) · (∂F_ℓ/∂a^(ℓ) + I) = ∂L/∂a^(ℓ+1) · ∂F_ℓ/∂a^(ℓ) + ∂L/∂a^(ℓ+1)

The second term ∂L/∂a^(ℓ+1) is the gradient signal sent directly backward through the skip connection WITHOUT multiplication by any Jacobian. Even if ||∂F_ℓ/∂a^(ℓ)|| ≈ 0 (meaning F_ℓ is effectively dead), the gradient ∂L/∂a^(ℓ+1) still flows through verbatim. This additive identity path guarantees that some gradient always reaches all layers, preventing complete vanishing.

Summary

  1. The gradient norm after L layers is bounded by the product of the per-layer Jacobian norms. Sigmoid (max derivative 0.25) causes exponential decay; tanh is slightly better but still problematic beyond ~10 layers.
  2. ReLU mitigates vanishing gradients because its derivative is exactly 1 for active neurons, but "dying ReLUs" can block paths entirely.
  3. Exploding gradients occur when the largest singular value of W exceeds 1 (||W|| > 1); they are mitigated by gradient clipping: scale down any gradient whose norm exceeds threshold C.
  4. Residual connections add an identity term to the gradient path, ensuring ∂L/∂a^(ℓ) receives ∂L/∂a^(ℓ+1) directly: a gradient highway that prevents vanishing even in very deep networks.
  5. Monitoring gradient norms per layer is the primary diagnostic: healthy networks show similar magnitudes across all layers.

Pitfalls


Quiz

Q1: In a 50-layer sigmoid network with random weight matrices (assume ||W|| ≈ 1), what is the approximate gradient magnitude at layer 1 relative to layer 50?

A) About the same
B) About 0.25⁴⁹ ≈ 0
C) About 4⁴⁹ (exploded)
D) It depends only on the loss function

Answer and Explanations

**Correct: B) About 0.25⁴⁹ ≈ 0**

Each sigmoid layer multiplies the gradient by a factor σ'(z) ≤ 0.25. Over 49 backward steps: 0.25⁴⁹ ≈ 3.2 × 10⁻³⁰, which is practically zero. Early layers receive essentially no learning signal.

- A) Incorrect. Sigmoid's max derivative of 0.25 guarantees gradient attenuation at every layer.
- B) ✓ Correct. Exponential decay at rate ≤ 0.25 per layer makes gradients vanish.
- C) Incorrect. Exploding gradients require σ_max(W) > 4 to overcome sigmoid attenuation.
- D) Incorrect. While loss matters, the activation function's derivative is the dominant factor here.

Q2: What is the primary advantage of ReLU over sigmoid for gradient flow?

A) ReLU is faster to compute
B) ReLU's derivative is exactly 1 for positive inputs, eliminating multiplicative attenuation
C) ReLU guarantees all gradients are positive
D) ReLU has a smaller output range

Answer and Explanations

**Correct: B) ReLU's derivative is exactly 1 for positive inputs, eliminating multiplicative attenuation**

When a ReLU neuron is active, d(ReLU(z))/dz = 1. The gradient passes through unchanged, with no decay. The product of many 1s is still 1, so gradients can survive arbitrarily deep networks along active paths.

- A) True but not the primary advantage for gradient flow.
- B) ✓ Correct. The constant-1 derivative is the key property for deep network training.
- C) Incorrect and false: ReLU gradients are 0 or 1, not all positive in a meaningful sense.
- D) Irrelevant to gradient flow.

Q3: Gradient clipping with threshold C:

A) Changes the direction of the gradient
B) Scales the gradient to have norm C if ||g|| > C
C) Sets all gradients to exactly C
D) Only clips positive gradients

Answer and Explanations

**Correct: B) Scales the gradient to have norm C if ||g|| > C**

g_clipped = g · min(1, C/||g||). If ||g|| ≤ C, the gradient is unchanged. If ||g|| > C, it's scaled down to norm C while preserving direction. This prevents destructively large updates while maintaining the correct descent direction.

- A) Incorrect. The direction is preserved; only the magnitude is capped.
- B) ✓ Correct. Uniform scaling preserves direction while bounding magnitude.
- C) Incorrect. Only gradients exceeding C are affected; smaller ones are untouched.
- D) Incorrect. Clipping applies to the norm, which is always non-negative.

Q4: Why do residual connections (skip connections) help with vanishing gradients?

A) They add more parameters to learn
B) They provide an additive identity path in the gradient, bypassing the Jacobian of the residual function
C) They make the network deeper
D) They replace ReLU with a better activation

Answer and Explanations

**Correct: B) They provide an additive identity path in the gradient, bypassing the Jacobian of the residual function**

With a^(ℓ+1) = F(a^(ℓ)) + a^(ℓ), the gradient is ∂L/∂a^(ℓ+1) · (∂F/∂a^(ℓ) + I). The +I term means ∂L/∂a^(ℓ) always includes ∂L/∂a^(ℓ+1) unchanged. Even if ∂F/∂a^(ℓ) → 0, the gradient still flows.

- A) Incorrect. Skip connections don't add parameters.
- B) ✓ Correct. The identity mapping creates a gradient highway.
- C) Incorrect. They ENABLE greater depth by solving the vanishing gradient problem.
- D) Incorrect. Residual connections are orthogonal to activation choice.

Q5: An RNN uses the recurrence h_t = σ(w·h_{t−1} + U·x_t). For the gradient ∂L/∂h₀ to neither vanish nor explode over T=1000 steps, what must the recurrent weight w approximately satisfy?

A) w ≈ 0
B) w ≈ 1
C) w > 4 to overcome sigmoid saturation
D) Any value works as long as the network is trained long enough

Answer and Explanations

**Correct: C) w > 4 to overcome sigmoid saturation**

The effective per-step Jacobian is w·σ'(·). Since σ'(·) ≤ 0.25, even with w = 1 the factor is ≤ 0.25 at every step, which vanishes exponentially. The break-even point is w = 4 at zero pre-activation, where w·σ'(0) = 4·0.25 = 1; any pre-activation away from 0 has σ' < 0.25 and needs an even larger w. This is why RNNs are hard to train: the weight must be precisely tuned, and saturation regions still cause problems.

- A) w ≈ 0: Gradients vanish immediately.
- B) w ≈ 1: w·σ'(0) = 0.25, still vanishing exponentially.
- C) ✓ Correct. Must compensate for σ' ≤ 0.25 with larger w. LSTMs solve this via gating instead.
- D) Incorrect. Training time doesn't fix exponential vanishing, because the gradients are numerically zero.

Next Steps

Move on to 16-07 (Weight Initialization) to learn how proper initialization (Xavier/Glorot, He/Kaiming) sets up the initial gradient flow to be stable from the very first training step.