16-06 - Gradient Flow in Deep Networks
Phase: 16 - Neural Network Mathematics
Subject: 16-06
Prerequisites: 16-05 (Backpropagation), 16-02 (Activation Functions), Phase 9 (matrix norms)
Next subject: 16-07 - Weight Initialization
Learning Objectives
By the end of this subject, you will be able to:
- Prove mathematically why sigmoid and tanh activations cause vanishing gradients in deep networks
- Explain how ReLU mitigates vanishing gradients and quantify the improvement
- Derive the conditions for exploding gradients and explain gradient clipping
- Analyze gradient magnitude through a network using the product-of-Jacobians decomposition
- Connect initialization strategies to stable gradient flow at initialization time
Core Content
1. The Gradient as a Product of Jacobians
Recall from 16-05 that the error signal propagates backward:
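δ^(ℓ) = ((W^(ℓ+1))ᵀ δ^(ℓ+1)) ⊙ f'_ℓ(z^(ℓ))
Unrolling this from the output back to layer ℓ, and writing the elementwise product ⊙ as multiplication by a diagonal matrix:
δ^(ℓ) = (∏_{k=ℓ}^{L−1} diag(f'_k(z^(k))) (W^(k+1))ᵀ) δ^(L)
The factors are ordered with k increasing from left to right, so the k = ℓ factor is applied last.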
This is a long product of matrices. The norm of this product determines whether gradients survive or die.
⚠️ THIS IS CRITICAL: Gradient flow is determined by the product of Jacobians along the backward path. If the singular values of these Jacobians are consistently < 1, gradients VANISH. If they are consistently > 1, gradients EXPLODE. Stable training requires singular values ≈ 1 on average.
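The effect of the Jacobian product can be seen in a minimal NumPy sketch. It assumes random Gaussian weight matrices rescaled to a chosen spectral norm and sigmoid derivatives evaluated at random pre-activations; the helper name `backward_norms`, the width, the depth, and the seed are illustrative choices, not values from this subject.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backward_norms(depth, width, spectral_norm, seed=0):
    """Track ||delta|| while repeatedly applying diag(f'(z)) W^T (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(width)
    delta /= np.linalg.norm(delta)          # start from a unit-norm error signal
    norms = [1.0]
    for _ in range(depth):
        W = rng.standard_normal((width, width))
        W *= spectral_norm / np.linalg.norm(W, 2)   # set the largest singular value
        z = rng.standard_normal(width)              # assumed random pre-activations
        delta = sigmoid_prime(z) * (W.T @ delta)    # one backward step
        norms.append(np.linalg.norm(delta))
    return np.array(norms)

norms = backward_norms(depth=10, width=64, spectral_norm=1.0)
print(norms)  # shrinks by at least a factor of 4 per layer
```

Raising `spectral_norm` well above 4 in this sketch lets the weight factor outweigh the sigmoid derivative, which is the regime where the product can grow instead of shrink.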
2. Vanishing Gradients: The Sigmoid/Tanh Problem
For a deep network using sigmoid activations, let's analyze the gradient magnitude for early layers.
Consider the simplified case where all layers have the same width and weights are initialized with small values (a common early practice). The backward recurrence is:
δ^(ℓ) = Wᵀ δ^(ℓ+1) ⊙ σ'(z^(ℓ))
Taking norms and bounding each factor (an upper bound needs no independence assumption):
||δ^(ℓ)|| ≤ ||W|| · σ'_max · ||δ^(ℓ+1)||
where σ'_max = max_z σ'(z) = 0.25 and ||W|| is the spectral norm of W.
Even if ||W|| = 1, each layer multiplies the gradient norm by at most 0.25. After L layers:
||δ^(0)|| ≤ (0.25)^L · ||δ^(L)||
For a 10-layer network: (0.25)¹⁰ ≈ 9.5 × 10⁻⁷, so the gradient reaching the first layer is at most about a MILLIONTH of the output gradient. The early layers effectively receive zero learning signal.
Why tanh helps only slightly: tanh'(z) ≤ 1 (max at z=0), but tanh saturates quickly. For random inputs z ∼ N(0, 1), the expected tanh'(z) is about 0.6, and it falls further as the pre-activation variance grows; either way it is still less than 1. After many layers, gradients still vanish, just more slowly than with sigmoid.
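The expected derivatives for sigmoid and tanh can be estimated by sampling. The sketch below assumes z ∼ N(0, 1) as in the text; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)       # assumed pre-activation distribution N(0, 1)

sig = 1.0 / (1.0 + np.exp(-z))
mean_sigmoid_prime = np.mean(sig * (1.0 - sig))     # Monte Carlo estimate of E[sigma'(z)]
mean_tanh_prime = np.mean(1.0 - np.tanh(z) ** 2)    # Monte Carlo estimate of E[tanh'(z)]

print(f"E[sigmoid'(z)] ~ {mean_sigmoid_prime:.3f}")
print(f"E[tanh'(z)]    ~ {mean_tanh_prime:.3f}")
```

Both estimates come out below 1, so repeated multiplication by either factor shrinks the signal; the tanh factor is simply closer to 1.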
3. ReLU to the Rescue
ReLU has derivative 1 for ALL positive inputs:
ReLU'(z) = 1 for z > 0
If the network is initialized so that roughly half the neurons are active (z > 0), the expected Jacobian norm product doesn't decay:
δ^(ℓ) = Wᵀ δ^(ℓ+1) ⊙ ReLU'(z^(ℓ))
The ReLU'(z) factor is 1 for active neurons and 0 for inactive ones. For the active paths, gradients propagate with NO multiplicative attenuation from the activation.
The catch (dying ReLUs): If initialization or learning pushes too many neurons permanently into the negative regime, those path gradients are permanently 0. The effective depth is reduced, but with proper initialization (16-07), this is manageable.
Quantitative comparison: For a 100-layer network:
- Sigmoid: gradient to layer 1 ≤ (0.25)¹⁰⁰ ≈ 6 × 10⁻⁶¹ → completely gone
- ReLU (50% active): gradient to layer 1 passes through ∼50 active × 1.0 multiplications → full magnitude preserved along active paths!
4. Exploding Gradients
The opposite problem: if weight matrices have large singular values, gradients GROW exponentially:
||δ^(ℓ)|| ≈ (σ_max(W))^{L−ℓ} · ||δ^(L)||
This is the worst case, reached when the backpropagated signal aligns with the top singular direction at every layer. If σ_max(W) > 1, gradients can explode. This is especially problematic in RNNs (unrolled through time) and very deep networks.
Symptoms of exploding gradients:
- Loss suddenly jumps to NaN
- Weight updates become enormous
- Training diverges catastrophically
Solution (gradient clipping):
g ← g · min(1, C / ||g||)
If the gradient norm ||g|| exceeds threshold C, scale it down to norm C while preserving direction. This prevents any single update from being destructively large.
Alternative (value clipping):
gᵢ ← clip(gᵢ, −C, C)
This clips each gradient component individually to [−C, C], which can change the direction of g.
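The two clipping rules above can be sketched in a few lines of NumPy. The function names `clip_by_norm` and `clip_by_value` are illustrative, not a specific library's API, and the example vector is chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def clip_by_norm(g, C):
    """Rescale g to L2 norm C if its norm exceeds C; otherwise return it unchanged."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g
    return g * min(1.0, C / norm)

def clip_by_value(g, C):
    """Clip each component to [-C, C]; the direction of g may change."""
    return np.clip(g, -C, C)

g = np.array([30.0, 40.0])          # ||g|| = 50
g_norm = clip_by_norm(g, 5.0)       # [3, 4]: same direction, norm 5
g_val = clip_by_value(g, 5.0)       # [5, 5]: direction changed
print(g_norm, g_val)
```

Norm clipping keeps the 3:4 ratio between components, while value clipping flattens it to 1:1, which is the practical difference between the two rules.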
5. The Gradient Norm Across Layers
A useful diagnostic: plot ||∂L/∂W^(ℓ)|| for each layer ℓ during training.
Expected healthy behavior: Gradient norms are roughly similar across layers (within an order of magnitude).
Vanishing gradient signature: Early layers have dramatically smaller gradient norms than later layers, sometimes smaller by a factor of 10¹⁰.
Exploding gradient signature: Early layers have dramatically larger gradient norms than later layers.
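The per-layer diagnostic can be sketched with a small hand-written backward pass. This is a toy setup under stated assumptions: a bias-free sigmoid MLP, Gaussian weights with standard deviation 1/√width, a squared-error loss, and random input and target. The helper `layer_grad_norms` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_grad_norms(depth=10, width=32, seed=0):
    """Return ||dLoss/dW|| for each layer of a toy sigmoid MLP (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    Ws = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
    x = rng.standard_normal(width)
    y = rng.standard_normal(width)

    # Forward pass, caching activations a^(0), ..., a^(L).
    activations = [x]
    for W in Ws:
        activations.append(sigmoid(W @ activations[-1]))

    # Backward pass for loss = 0.5 * ||a^(L) - y||^2.
    a_out = activations[-1]
    delta = (a_out - y) * a_out * (1.0 - a_out)
    norms = [0.0] * depth
    for l in range(depth - 1, -1, -1):
        grad_W = np.outer(delta, activations[l])   # gradient for the weights Ws[l]
        norms[l] = np.linalg.norm(grad_W)
        if l > 0:
            a_prev = activations[l]                # sigmoid output, so sigma' = a(1 - a)
            delta = (Ws[l].T @ delta) * a_prev * (1.0 - a_prev)
    return norms

grad_norms = layer_grad_norms()
for i, n in enumerate(grad_norms, start=1):
    print(f"layer {i:2d}: ||dL/dW|| = {n:.3e}")
```

In this toy run the norms fall steadily toward layer 1, which is the vanishing-gradient signature described above.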
6. Mathematical Analysis of Gradient Flow at Initialization
At initialization (before training), we can analyze gradient flow analytically. Consider a linear network with orthogonal weight matrices (WᵀW = I) of equal dimension:
a^(ℓ) = W^(ℓ) a^(ℓ−1)
The gradient: ∂L/∂a^(0) = (W^(1))ᵀ (W^(2))ᵀ ... (W^(L))ᵀ ∂L/∂a^(L)
Since each W^(ℓ) is orthogonal and orthogonal matrices preserve the Euclidean norm, ||∂L/∂a^(0)|| = ||∂L/∂a^(L)||: perfect gradient preservation!
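Norm preservation under orthogonal weights is easy to check numerically. The sketch below assumes random orthogonal matrices obtained from a QR decomposition of Gaussian matrices; the dimension, depth, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 50

grad = rng.standard_normal(d)            # stands in for dLoss/da^(L)
norm_out = np.linalg.norm(grad)

for _ in range(L):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix
    grad = Q.T @ grad                                   # backward step through one linear layer

norm_in = np.linalg.norm(grad)
print(norm_out, norm_in)   # equal up to floating-point rounding
```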
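With non-linear activations: Even with orthogonal weights, activations introduce scaling. The Jacobian of the activation f at layer ℓ is diagonal with entries f'(zᵢ^(ℓ)). If the backpropagated signal is spread roughly evenly across the d neurons of a layer, that layer scales the gradient norm by the root-mean-square of these entries:
||∂L/∂a^(0)|| ≈ (∏_{ℓ=1}^{L} ((1/d) Σ_{i} f'(zᵢ^(ℓ))²)^{1/2}) · ||∂L/∂a^(L)||
If f = ReLU and half the neurons are active: each active neuron contributes factor 1 and each inactive neuron contributes 0, so the per-layer factor is (1/2)^{1/2}. The gradient magnitude at the input is approximately (1/2)^{L/2} · ||∂L/∂a^(L)||, which decays, but MUCH more slowly than with sigmoid.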
7. Residual Connections and Gradient Highways
Residual connections (formalized in Phase 17) provide a direct gradient path:
a^(ℓ+1) = F(a^(ℓ)) + a^(ℓ)
The backward gradient:
∂L/∂a^(ℓ) = ∂L/∂a^(ℓ+1) · (∂F/∂a^(ℓ) + I) = ∂L/∂a^(ℓ+1) + ∂L/∂a^(ℓ+1) · ∂F/∂a^(ℓ)
The identity term (I) provides a "gradient highway": even if ∂F/∂a^(ℓ) has small singular values, the +I term ensures at least that portion of the gradient reaches earlier layers unchanged. This is why ResNets can be trained at depths of 1000+ layers, while plain networks typically become hard to train beyond ~20 layers.
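A short numerical sketch contrasts the two backward recurrences. It assumes each block's Jacobian ∂F/∂a is a small random matrix (entries of scale 0.01, an illustrative choice) and uses the column-vector form, where the gradient is multiplied by the transposed Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 50

g_plain = rng.standard_normal(d)
g_plain /= np.linalg.norm(g_plain)       # unit-norm gradient at the output
g_res = g_plain.copy()

for _ in range(L):
    J = 0.01 * rng.standard_normal((d, d))   # assumed small Jacobian of F
    g_plain = J.T @ g_plain                  # plain layer: gradient times J^T
    g_res = g_res + J.T @ g_res              # residual layer: gradient times (I + J)^T

plain_norm = np.linalg.norm(g_plain)
res_norm = np.linalg.norm(g_res)
print(f"plain: {plain_norm:.3e}   residual: {res_norm:.3f}")
```

The plain product collapses toward zero, while the residual product stays near its starting norm because the identity term dominates each factor.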
Key Terms
- Exploding gradients
- Gradient norm after L layers
- Monitoring gradient norms per layer
- ReLU mitigates vanishing gradients
- Residual connections
Worked Examples
Example 1: Computing Gradient Attenuation
Problem: A 5-layer network with sigmoid activations has all pre-activations z ∼ N(0, 1) i.i.d. Assume orthogonal W matrices with ||W|| = 1. What fraction of the output gradient reaches the input?
Solution:
For z ∼ N(0,1), we need E[σ'(z)], where σ'(z) = σ(z)(1−σ(z)).
Using numerical approximation: for z ∼ N(0,1), the expected value of σ'(z) is approximately 0.207 (can be computed via integration or sampling).
Per layer: gradient multiplied by ∼0.207 on average. After 5 layers: 0.207⁵ ≈ 0.00038.
Only 0.038% of the gradient survives. The first layer learns about 2600× slower than the last.
With ReLU instead (half active): per-layer factor is 0.5 (only active paths survive). After 5 layers with ReLU: 0.5⁵ = 0.03125. About 3% survives, roughly 82× more than with sigmoid.
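The integration route mentioned in the solution can be sketched with Gauss-Hermite quadrature. This uses NumPy's `hermegauss`, whose nodes and weights integrate against exp(−x²/2), so dividing by √(2π) gives an expectation under N(0, 1). The choice of 60 nodes is an assumption that is more than enough for this smooth integrand.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

nodes, weights = hermegauss(60)                    # quadrature rule for weight exp(-x^2 / 2)
s = 1.0 / (1.0 + np.exp(-nodes))
per_layer = np.sum(weights * s * (1.0 - s)) / np.sqrt(2.0 * np.pi)   # E[sigma'(z)]

sigmoid_fraction = per_layer ** 5                  # surviving fraction after 5 sigmoid layers
relu_fraction = 0.5 ** 5                           # same, using the example's 0.5 per-layer factor

print(f"E[sigma'(z)] = {per_layer:.4f}")
print(f"sigmoid: {sigmoid_fraction:.2e}, ReLU: {relu_fraction:.5f}")
print(f"ratio: {relu_fraction / sigmoid_fraction:.0f}x")
```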
Example 2: Exploding Gradients in an RNN
Problem: An RNN with scalar hidden state h_t = w·h_{t−1} + ... is unrolled for T = 100 steps. The recurrent weight w = 1.1. The loss depends on the final state h_T. What is ∂L/∂h₀?
Solution:
∂L/∂h₀ = ∂L/∂h_T · ∂h_T/∂h₀
∂h_T/∂h₀ = ∏_{t=1}^{T} ∂h_t/∂h_{t−1} = ∏_{t=1}^{T} w = w^T
With w = 1.1, T = 100: 1.1¹⁰⁰ ≈ 13,780.
The gradient is amplified by a factor of nearly 14,000! This causes wildly unstable training. Gradient clipping would rescale such a gradient by C/||g||, capping its norm at C.
If w = 0.9 instead: 0.9¹⁰⁰ ≈ 2.66 × 10⁻⁵, which is nearly zero. The RNN forgets the distant past.
This is the fundamental challenge of training RNNs: the weight magnitude must be very close to 1 for long-range dependencies, which is why LSTMs/GRUs (Phase 17) use gating mechanisms to learn when to remember/forget rather than relying on a fixed scalar multiplier.
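The arithmetic in this example can be reproduced by multiplying the per-step Jacobians one at a time. The helper name `rnn_gradient_factor` is illustrative, and the sketch covers only the linear scalar recurrence from the problem statement.

```python
def rnn_gradient_factor(w, T):
    """dh_T/dh_0 for the scalar recurrence h_t = w * h_{t-1} + (input terms)."""
    factor = 1.0
    for _ in range(T):
        factor *= w      # each time step contributes dh_t/dh_{t-1} = w
    return factor

exploding = rnn_gradient_factor(1.1, 100)
vanishing = rnn_gradient_factor(0.9, 100)
print(f"w = 1.1: {exploding:.1f}")
print(f"w = 0.9: {vanishing:.3e}")
```

Moving w by 0.1 on either side of 1 changes the gradient by roughly nine orders of magnitude over 100 steps, which is why a fixed multiplier is so fragile.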
Example 3: Gradient Clipping Threshold Selection
Problem: During training, you observe that gradient norms fluctuate between 0.1 and 50. You want to clip to prevent the largest updates from destabilizing training. The average healthy gradient norm is around 1. What threshold C should you choose, and what would be the effective update step for a gradient with ||g|| = 50 when C = 5?
Solution:
Choose C around the 90th-95th percentile of gradient norms, not the average. If norms range from 0.1 to 50 with median near 1, try C = 5.
For ||g|| = 50 with C = 5: g_clipped = g · (5/50) = 0.1 · g
The update direction is preserved, but the step size is reduced by 10×. This prevents a single outlier batch from destroying progress while still allowing learning from it.
Typical values in practice: C = 1.0, C = 5.0, or C = 10.0, tuned based on monitoring.
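Picking C from a percentile of logged gradient norms could look like the sketch below. The log here is synthetic: a log-normal sample with median 1 stands in for norms recorded during real training, and is purely a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for gradient norms logged over 500 training steps.
grad_norm_log = rng.lognormal(mean=0.0, sigma=1.0, size=500)

C = np.percentile(grad_norm_log, 95)             # threshold at the 95th percentile
clipped_fraction = np.mean(grad_norm_log > C)    # share of steps that would be clipped

print(f"median norm: {np.median(grad_norm_log):.2f}")
print(f"C (95th percentile): {C:.2f}")
print(f"fraction of steps clipped: {clipped_fraction:.2%}")
```

With a percentile-based threshold only the rare outlier steps are rescaled, whereas a threshold at the average would clip a large share of ordinary updates.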
Practice Problems
(Answers are below. Try each problem before checking.)
Problem 1: For a 20-layer ReLU network where exactly 40% of neurons are active at each layer (all weights orthogonal, ||W||=1), compute the expected fraction of gradient paths that survive from the output to the first layer.
Problem 2: A deep linear network has weight matrices with singular values all equal to σ. Derive the condition on σ for neither vanishing nor exploding gradients.
Problem 3: In a network with tanh activation, the pre-activations are distributed as z ∼ Uniform(−2, 2) at each layer. Estimate the per-layer gradient attenuation factor E[tanh'(z)].
Problem 4: Gradient clipping with threshold C scales g to have norm C when ||g|| > C. Prove that this operation is a projection onto the ℓ₂ ball of radius C.
Problem 5: Show that in a residual network with update a^(ℓ+1) = F_ℓ(a^(ℓ)) + a^(ℓ), the gradient ∂L/∂a^(ℓ) contains an additive term ∂L/∂a^(ℓ+1) that does NOT pass through the Jacobian of F_ℓ. Explain why this prevents vanishing gradients even if F_ℓ has very small Jacobian.
Answers
**Problem 1:** Each active neuron contributes factor 1; the 60% inactive neurons dead-end. A gradient path survives only if all 20 neurons along that path are active. P(survive one layer) = 0.4. P(survive all 20) = 0.4²⁰ ≈ 1.1 × 10⁻⁸. Surprisingly, nearly ALL paths die! But in practice, the network adapts during training: weights shift so that important paths stay active. Also, each neuron connects to MANY neurons in adjacent layers, so multiple paths connect any two nodes, and at least some survive.
**Problem 2:** For a linear network with L identical layers: ∂L/∂a^(0) = (Wᵀ)^L ∂L/∂a^(L). The singular values of (Wᵀ)^L are σ^L. For stable gradient flow, we need σ^L ≈ 1 for all L, which requires σ = 1 exactly (or very close). Any σ ≠ 1 leads to exponential growth or decay in depth.
**Problem 3:** tanh'(z) = 1 − tanh²(z), so the antiderivative of the integrand is tanh(z) itself. For z ∼ Uniform(−2, 2): E[tanh'(z)] = (1/4) ∫_{−2}^{2} (1 − tanh²(z)) dz = (1/4) [tanh(z)]_{−2}^{2} = (1/4)[tanh(2) − tanh(−2)] = (1/4)[0.964 + 0.964] = (1/4)[1.928] ≈ 0.482. Each layer attenuates gradients by about half on average. After 10 layers: 0.482¹⁰ ≈ 6.8 × 10⁻⁴. Still vanishing, though slower than sigmoid.
**Problem 4:** The ℓ₂ ball of radius C is B_C = {x : ||x|| ≤ C}. If ||g|| ≤ C, then g is already in B_C, so the projection is g itself, matching min(1, C/||g||) = 1. For a point g with ||g|| > C, the projection onto B_C is argmin_{||x||≤C} ||x − g||. For any x in B_C, the triangle inequality gives ||x − g|| ≥ ||g|| − ||x|| ≥ ||g|| − C, and this lower bound is attained by the scaled point g·(C/||g||). This is exactly the gradient clipping formula. ∎
**Problem 5:** ∂L/∂a^(ℓ) = ∂L/∂a^(ℓ+1) · ∂a^(ℓ+1)/∂a^(ℓ) = ∂L/∂a^(ℓ+1) · (∂F_ℓ/∂a^(ℓ) + I) = ∂L/∂a^(ℓ+1) · ∂F_ℓ/∂a^(ℓ) + ∂L/∂a^(ℓ+1). The second term ∂L/∂a^(ℓ+1) is the gradient signal sent directly backward through the skip connection WITHOUT multiplication by any Jacobian. Even if ||∂F_ℓ/∂a^(ℓ)|| ≈ 0 (meaning F_ℓ is effectively dead), the gradient ∂L/∂a^(ℓ+1) still flows through verbatim. This additive identity path guarantees that some gradient always reaches all layers, preventing complete vanishing.
Summary
- Gradient norm after L layers is bounded by the product of the Jacobian norms. Sigmoid (max derivative 0.25) causes exponential decay; tanh is slightly better but still problematic beyond ~10 layers.
- ReLU mitigates vanishing gradients because its derivative is exactly 1 for active neurons, but "dying ReLUs" can block paths entirely.
- Exploding gradients can occur when σ_max(W) > 1; they are mitigated by gradient clipping: scale down any gradient whose norm exceeds threshold C.
- Residual connections add an identity term to the gradient path, ensuring ∂L/∂a^(ℓ) receives ∂L/∂a^(ℓ+1) directly. This gradient highway prevents vanishing even in very deep networks.
- Monitoring gradient norms per layer is the primary diagnostic: healthy networks show similar magnitudes across all layers.
Pitfalls
- Confusing dying ReLUs with vanishing gradients. A dying ReLU (permanent zero output) is a forward-pass problem caused by poor initialization or large negative biases: the neuron stops contributing entirely. Vanishing gradients are a backward-pass attenuation problem where the signal shrinks exponentially with depth. They have different causes and different fixes.
- Assuming residual connections eliminate all gradient issues. Residuals provide an additive identity path that prevents gradient vanishing, but they don't fix exploding gradients. If weight matrix norms exceed 1, gradient clipping is still essential even in residual networks.
- Setting gradient clipping threshold C too low. If C is below the typical healthy gradient norm, every step is clipped, crippling learning. Monitor gradient norms during the first few hundred training steps and set C around the 90th-95th percentile, not the average.
- Overestimating tanh's help with vanishing gradients. tanh'(z) ≤ 1 (max at z=0) with expected value ~0.3-0.5 for typical inputs. This is better than sigmoid's 0.25, but still causes exponential decay over many layers. A 50-layer tanh network still vanishes; ReLU or residuals are needed for depth.
- Analyzing gradient flow only at initialization. Gradient flow evolves during training as weights change. A network stable at initialization can develop vanishing or exploding gradients mid-training. Monitor per-layer gradient norms throughout training, not just at step 0.
Quiz
Q1: In a 50-layer sigmoid network with random weight matrices (assume ||W|| ≈ 1), what is the approximate gradient magnitude at layer 1 relative to layer 50?
A) About the same
B) About 0.25⁴⁹ ≈ 0
C) About 4⁴⁹ (exploded)
D) It depends only on the loss function
Answer and Explanations
**Correct: B) About 0.25⁴⁹ ≈ 0**
Each sigmoid layer multiplies the gradient by at most σ'(z) ≤ 0.25. Over 49 backward steps: 0.25⁴⁹ ≈ 3.2 × 10⁻³⁰, which is practically zero. Early layers receive essentially no learning signal.
- A) Incorrect. Sigmoid's max derivative of 0.25 guarantees gradient attenuation at every layer.
- B) ✓ Correct. Exponential decay at rate ≤ 0.25 per layer makes gradients vanish.
- C) Incorrect. Exploding gradients require σ_max(W) > 4 to overcome sigmoid attenuation.
- D) Incorrect. While loss matters, the activation function's derivative is the dominant factor here.
Q2: What is the primary advantage of ReLU over sigmoid for gradient flow?
A) ReLU is faster to compute
B) ReLU's derivative is exactly 1 for positive inputs, eliminating multiplicative attenuation
C) ReLU guarantees all gradients are positive
D) ReLU has a smaller output range
Answer and Explanations
**Correct: B) ReLU's derivative is exactly 1 for positive inputs, eliminating multiplicative attenuation**
When a ReLU neuron is active, d(ReLU(z))/dz = 1. The gradient passes through unchanged, with no decay. The product of many 1s is still 1, so gradients can survive arbitrarily deep networks along active paths.
- A) True but not the primary advantage for gradient flow.
- B) ✓ Correct. The constant-1 derivative is the key property for deep network training.
- C) Incorrect. ReLU's derivative is 0 or 1, and the backpropagated gradients themselves can have either sign.
- D) Irrelevant to gradient flow.
Q3: Gradient clipping with threshold C:
A) Changes the direction of the gradient
B) Scales the gradient to have norm C if ||g|| > C
C) Sets all gradients to exactly C
D) Only clips positive gradients
Answer and Explanations
**Correct: B) Scales the gradient to have norm C if ||g|| > C**
g_clipped = g · min(1, C/||g||). If ||g|| ≤ C, the gradient is unchanged. If ||g|| > C, it's scaled down to norm C while preserving direction. This prevents destructively large updates while maintaining the correct descent direction.
- A) Incorrect. The direction is preserved; only the magnitude is capped.
- B) ✓ Correct. Uniform scaling preserves direction while bounding magnitude.
- C) Incorrect. Only gradients exceeding C are affected; smaller ones are untouched.
- D) Incorrect. Clipping applies to the norm, which is always non-negative.
Q4: Why do residual connections (skip connections) help with vanishing gradients?
A) They add more parameters to learn
B) They provide an additive identity path in the gradient, bypassing the Jacobian of the residual function
C) They make the network deeper
D) They replace ReLU with a better activation
Answer and Explanations
**Correct: B) They provide an additive identity path in the gradient, bypassing the Jacobian of the residual function**
With a^(ℓ+1) = F(a^(ℓ)) + a^(ℓ), the gradient is ∂L/∂a^(ℓ+1) · (∂F/∂a^(ℓ) + I). The +I term means ∂L/∂a^(ℓ) always includes ∂L/∂a^(ℓ+1) unchanged. Even if ∂F/∂a^(ℓ) ≈ 0, the gradient still flows.
- A) Incorrect. Skip connections don't add parameters.
- B) ✓ Correct. The identity mapping creates a gradient highway.
- C) Incorrect. They ENABLE greater depth by solving the vanishing gradient problem.
- D) Incorrect. Residual connections are orthogonal to activation choice.
Q5: An RNN uses the recurrence h_t = σ(w·h_{t−1} + U·x_t). For the gradient ∂L/∂h₀ to neither vanish nor explode over T=1000 steps, what must the recurrent weight w approximately satisfy?
A) w ≈ 0
B) w ≈ 1
C) w > 4 to overcome sigmoid saturation
D) Any value works as long as the network is trained long enough
Answer and Explanations
**Correct: C) w > 4 to overcome sigmoid saturation**
The effective per-step Jacobian is w·σ'(·). Since σ'(·) ≤ 0.25, even with w=1, the product is ≤ 0.25 → vanishing. To get an effective multiplier near 1, w must be at least 4: w·σ'(0) = 4·0.25 = 1 holds only at the point of maximum slope, and any pre-activation away from 0 needs a larger w. This is why RNNs are hard to train: the weight must be precisely tuned, and saturation regions still cause problems.
- A) w≈0: Gradients vanish immediately.
- B) w≈1: w·σ'(0) = 0.25, still vanishing exponentially.
- C) ✓ Correct. Must compensate for σ' ≤ 0.25 with larger w. LSTMs solve this via gating instead.
- D) Incorrect. Training time doesn't fix exponential vanishing, because the gradients are numerically zero.
Next Steps
Move on to 16-07 - Weight Initialization to learn how proper initialization (Xavier/Glorot, He/Kaiming) sets up the initial gradient flow to be stable from the very first training step.