17-05 - Residual Connections
Phase: 17 - Deep Learning Architectures (Math)
Subject: 17-05
Prerequisites: 16-05 (Backpropagation), 16-06 (Gradient Flow), 16-08 (Regularization, optional), 17-03/17-04 (gating concepts)
Next subject: 17-06 - Attention Mechanism (General)
Learning Objectives
By the end of this subject, you will be able to:
- Derive the residual block equation and the gradient identity that enables training of very deep networks
- Explain why ∂y/∂x = I + ∂F(x)/∂x prevents vanishing gradients regardless of depth
- Contrast pre-activation and post-activation residual blocks with mathematical justification
- Prove that a residual network of depth L has at least one gradient path of length 0 (the pure identity path)
- Compute the number of gradient paths of each length in a residual network
Core Content
1. The Problem: Why Deep Networks Stall
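From 16-06, we know that standard deep networks suffer vanishing gradients. Even with ReLU, gradients can be attenuated by weight matrices. For an L-layer network (for example L = 100), the Jacobian from layer 1 to layer L is a product of per-layer Jacobians:
∂h_L/∂h₁ = ∏_{ℓ=2}^{L} ∂h_ℓ/∂h_{ℓ-1} = ∏_{ℓ=2}^{L} diag(ReLU'(z_ℓ)) · W_ℓ
The loss gradient at layer 1 is the gradient at layer L multiplied by this product. If the singular values of the W_ℓ are less than 1 on average, the product shrinks exponentially with depth.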
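To see the decay numerically, the sketch below multiplies 100 random layer matrices, each rescaled so its largest singular value is 0.9. The width, depth, and scale are illustrative assumptions, and the ReLU masks are left out (their entries are 0 or 1, so they cannot increase any factor's norm).

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, scale = 16, 100, 0.9   # illustrative values (assumptions)

def random_layer(rng, width, scale):
    """Random square matrix rescaled so its largest singular value equals `scale`."""
    w = rng.standard_normal((width, width))
    return scale * w / np.linalg.norm(w, 2)

# Accumulate the Jacobian product layer by layer and record its spectral norm.
product = np.eye(width)
norms = []
for _ in range(depth):
    product = random_layer(rng, width, scale) @ product
    norms.append(np.linalg.norm(product, 2))

print(f"after 10 layers:  {norms[9]:.3e}")
print(f"after 100 layers: {norms[-1]:.3e}")
print(f"upper bound 0.9^100 = {scale ** depth:.3e}")
```

The bound 0.9^100 is only the worst case; random matrices are not aligned with each other, so the observed norm typically falls well below it.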
Residual connections solve this by providing an unobstructed "gradient highway" through the network: a direct identity path alongside the transformed path.
2. The Residual Block
A residual block computes:
y = F(x, {W_i}) + x
Where F is a small neural network (typically 2-3 layers) and x is the input, added back via a "skip connection" or "identity shortcut."
Shapes must match: If F changes the dimension, a learned linear projection W_s is used:
y = F(x, {W_i}) + W_s · x
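A minimal sketch of both cases, assuming a two-layer branch F(x) = W₂·ReLU(W₁·x) on plain vectors. The function name `residual_block` and all shapes are hypothetical choices for illustration.

```python
import numpy as np

def residual_block(x, w1, w2, w_s=None):
    """y = F(x) + skip(x), with F(x) = w2 @ relu(w1 @ x).

    w_s is the optional projection W_s; if None, the skip is the identity.
    """
    f = w2 @ np.maximum(w1 @ x, 0.0)
    skip = x if w_s is None else w_s @ x
    if f.shape != skip.shape:
        raise ValueError(f"shape mismatch: F(x) {f.shape} vs skip {skip.shape}")
    return f + skip

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# Input and output dimension both 4: the identity skip works as is.
y_same = residual_block(x, rng.standard_normal((8, 4)), rng.standard_normal((4, 8)))

# Output dimension 6 differs from input dimension 4: a projection of shape (6, 4) is needed.
w1, w2 = rng.standard_normal((8, 4)), rng.standard_normal((6, 8))
y_proj = residual_block(x, w1, w2, w_s=rng.standard_normal((6, 4)))

print(y_same.shape, y_proj.shape)
```

Calling `residual_block(x, w1, w2)` without `w_s` in the second case raises the `ValueError`, which is the shape mismatch the projection exists to avoid.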
⚠️ THIS IS CRITICAL: The residual connection isn't just a trick. It fundamentally changes the gradient dynamics. The derivative ∂y/∂x = I + ∂F/∂x guarantees that the block Jacobian has a term equal to the IDENTITY MATRIX, which doesn't decay no matter how small ∂F/∂x is.
3. The Gradient Identity
Consider a network built from residual blocks. For block β:
h_ℓ = F_ℓ(h_{ℓ-1}) + h_{ℓ-1}
The Jacobian:
∂h_ℓ/∂h_{ℓ-1} = I + ∂F_ℓ/∂h_{ℓ-1}
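The identity can be checked with finite differences. This sketch assumes a smooth branch F(x) = W₂·tanh(W₁·x) (tanh instead of ReLU so that central differences are well behaved); the weights are random and purely illustrative.

```python
import numpy as np

def F(x, w1, w2):
    """Residual branch F(x) = w2 @ tanh(w1 @ x)."""
    return w2 @ np.tanh(w1 @ x)

def numerical_jacobian(fn, x, eps=1e-6):
    """Central-difference Jacobian of fn at x; column j holds d fn / d x_j."""
    n = x.size
    jac = np.zeros((fn(x).size, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return jac

rng = np.random.default_rng(0)
n = 5
w1, w2 = 0.3 * rng.standard_normal((n, n)), 0.3 * rng.standard_normal((n, n))
x = rng.standard_normal(n)

jac_branch = numerical_jacobian(lambda v: F(v, w1, w2), x)       # J_F
jac_block = numerical_jacobian(lambda v: F(v, w1, w2) + v, x)    # Jacobian of the whole block

# Analytic form: I + w2 @ diag(tanh'(w1 @ x)) @ w1, with tanh' = 1 - tanh^2.
analytic = np.eye(n) + w2 @ np.diag(1.0 - np.tanh(w1 @ x) ** 2) @ w1

print(np.max(np.abs(jac_block - (np.eye(n) + jac_branch))))
print(np.max(np.abs(jac_block - analytic)))
```

Both printed differences should be at the level of finite-difference noise.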
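Now the full gradient from layer L back to layer 0. Write gradients as row vectors and let J_ℓ = ∂F_ℓ/∂h_{ℓ-1} (in ∂L/∂h the L is the loss; as a subscript it is the number of blocks):
∂L/∂h₀ = ∂L/∂h_L · (I + J_L) · (I + J_{L-1}) ⋯ (I + J₁)
Expand the product:
(I + J_L) ⋯ (I + J₁) = I + Σ_ℓ J_ℓ + Σ_{i<j} J_j·J_i + ... + J_L ⋯ J₁
so that
∂L/∂h₀ = ∂L/∂h_L · Σ_{S ⊆ {1,...,L}} ∏_{ℓ∈S} J_ℓ
Where S is any subset of blocks where we take the J_ℓ path (taking I at the others), and the factors of each product are ordered by decreasing ℓ.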
This means there are 2^L distinct gradient paths:
- 1 path of "length 0" (all identity: I)
- L paths of "length 1" (one J, rest I)
- C(L,2) paths of "length 2"
- ...
- 1 path of "length L" (all J: the vanilla deep network path)
The gradient is the SUM over all paths. Even if long paths vanish, the short paths (especially the length-0 identity path) ensure gradients flow.
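The path expansion can be verified directly for a small network. The sketch below assumes L = 4 blocks with random stand-in Jacobians (not derived from any real network) and compares the product of the (I + J) factors with the sum over all subsets of blocks.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
L, n = 4, 3
# Stand-in block Jacobians J_1..J_L (random, purely illustrative).
J = [0.2 * rng.standard_normal((n, n)) for _ in range(L)]
eye = np.eye(n)

# Left-hand side: (I + J_L) ... (I + J_1), last block leftmost.
lhs = eye.copy()
for Jl in J:
    lhs = (eye + Jl) @ lhs

# Right-hand side: one term per subset S of blocks, J factors kept in the same order.
rhs = np.zeros((n, n))
paths_by_length = [0] * (L + 1)
for k in range(L + 1):
    for S in itertools.combinations(range(L), k):
        term = eye.copy()
        for idx in S:               # increasing block index, multiplied on the left
            term = J[idx] @ term
        rhs += term
        paths_by_length[k] += 1

print("paths by length:", paths_by_length)
print("total paths:", sum(paths_by_length), "= 2^L =", 2 ** L)
print("max |lhs - rhs|:", np.max(np.abs(lhs - rhs)))
```

The count at each length k is the binomial coefficient C(L, k), which `math.comb(L, k)` reproduces.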
5. Why This Enables 1000+ Layer Networks
In a vanilla 1000-layer network, there is exactly ONE gradient path, and every layer must successfully propagate the gradient. If each layer shrinks the gradient by even a small constant factor, the end-to-end gradient decays exponentially in L.
In a residual network, there are 2^1000 ≈ 10^301 paths. Path lengths follow a binomial distribution centered at 500 blocks, so most paths are not short. But when ||J_ℓ|| < 1, a path's contribution shrinks geometrically with the number of J blocks it passes through, so the paths with only a few J blocks carry most of the gradient. The network can learn even if only the short paths carry useful gradients, and as training progresses the longer paths can gradually be recruited.
This is the "effective depth" interpretation: early in training, the network behaves like a shallow ensemble of shorter paths; as training progresses, deeper paths become useful.
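A rough way to quantify which path lengths matter: if every block Jacobian satisfies ||J_ℓ|| ≤ ε, the summed norm of all length-k paths is at most C(L, k)·ε^k. The sketch below compares the path count with this bound for L = 1000. The value ε = 0.01 is an assumption chosen for illustration, and everything is computed in log10 to avoid overflow.

```python
import math

L = 1000          # number of residual blocks, as in the text
eps = 0.01        # assumed bound on every ||J_l|| (illustrative)

def log10_path_count(k):
    """log10 of the number of paths that use exactly k residual branches."""
    return math.log10(math.comb(L, k))

def log10_contribution_bound(k):
    """log10 of C(L, k) * eps**k, an upper bound on the summed norm of all length-k paths."""
    return log10_path_count(k) + k * math.log10(eps)

most_common_length = max(range(L + 1), key=log10_path_count)
dominant_length = max(range(L + 1), key=log10_contribution_bound)

print("most common path length:", most_common_length)
print("length with the largest contribution bound:", dominant_length)
print("log10 of the bound at length 500:", round(log10_contribution_bound(500), 1))
```

Under these assumptions the most numerous paths have length L/2, while the bound on their total contribution is many orders of magnitude below that of the short paths.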
6. Pre-Activation vs Post-Activation Residual
Post-activation (original ResNet):
y = ReLU(F(x) + x)
The ReLU is applied AFTER the addition. This means the identity path also passes through ReLU, which zeros out negative values, so the gradient through the identity path is gated by ReLU'.
Pre-activation (ResNet v2):
y = F(ReLU(x)) + x
The ReLU is applied BEFORE F, and the skip connection bypasses it entirely. This gives a TRUE identity path: the gradient is ∂y/∂x = I + ∂F(ReLU(x))/∂x, and the I term has no ReLU' gating.
Mathematical advantage of pre-activation: The identity path is truly unobstructed. In post-activation, if the sum F(x) + x is negative, ReLU' = 0 and even the identity path is blocked.
7. The Training Dynamics Insight
A residual network with block function F(x) = W₂·σ(W₁·x) can be seen as a dynamical system:
h_{t+1} = h_t + F(h_t)
This is a forward Euler discretization of the ODE dh/dt = F(h). Very deep residual networks approximate continuous-depth models (Neural ODEs), which gives them a principled mathematical foundation.
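The Euler view can be illustrated with a toy branch whose ODE has a known solution. This sketch assumes F(h) = −h and scales each residual update by a step size dt = T / (number of blocks); that scaling is an assumption added here so that more blocks means finer steps, whereas the update in the text uses step size 1.

```python
import math

def residual_stack(h0, branch, num_blocks, total_time=1.0):
    """Apply num_blocks residual updates h <- h + dt * branch(h), dt = total_time / num_blocks."""
    dt = total_time / num_blocks
    h = h0
    for _ in range(num_blocks):
        h = h + dt * branch(h)
    return h

def decay(h):
    """Toy residual branch F(h) = -h; the ODE dh/dt = -h has solution h0 * exp(-t)."""
    return -h

exact = math.exp(-1.0)   # h(1) for h0 = 1
errors = {n: abs(residual_stack(1.0, decay, n) - exact) for n in (1, 10, 100, 1000)}
for n, err in errors.items():
    print(f"{n:5d} blocks: |error| = {err:.2e}")
```

The error against the continuous solution shrinks as the number of blocks grows, which is the sense in which a deep residual stack approximates a continuous-depth model.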
8. Gradient Flow Quantification
For a residual block with F(x) = W₂·ReLU(W₁·x):
∂y/∂x = I + W₂·diag(ReLU'(W₁·x))·W₁
The eigenvalues of this Jacobian are 1 + λ_i, where λ_i are eigenvalues of J_F = ∂F/∂x.
If J_F is "small" (||J_F|| ≪ 1, which is typical with proper initialization), then |λ_i| ≤ ||J_F|| and every eigenvalue lies within ||J_F|| of 1. The eigenvalues are clustered near 1: neither vanishing nor exploding. This is the sweet spot for gradient propagation.
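A small numerical check of the eigenvalue shift, assuming random weights with a 0.1 scale so that the spectral norm of J_F stays well below 1. The sizes and the scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Small random weights (the 0.1 scale is an assumption that keeps ||J_F|| well below 1).
w1 = 0.1 * rng.standard_normal((n, n))
w2 = 0.1 * rng.standard_normal((n, n))
x = rng.standard_normal(n)

# J_F = W2 · diag(ReLU'(W1·x)) · W1
relu_mask = (w1 @ x > 0).astype(float)
jac_f = w2 @ np.diag(relu_mask) @ w1
jac_block = np.eye(n) + jac_f

eig_f = np.linalg.eigvals(jac_f)
eig_block = np.linalg.eigvals(jac_block)
norm_f = np.linalg.norm(jac_f, 2)

print("||J_F||                =", norm_f)
print("max |eig(I + J_F) - 1| =", np.max(np.abs(eig_block - 1.0)))
```

The second number can never exceed the first, because the spectral radius of J_F is bounded by its spectral norm.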
Key Terms
- Residual connections
Worked Examples
Example 1: Two-Block Gradient Decomposition
Problem: A residual network with 2 blocks: h₁ = x + F₁(x), h₂ = h₁ + F₂(h₁), loss L applied at h₂. Write ∂L/∂x as a sum of path contributions.
Solution:
∂h₁/∂x = I + J₁ where J₁ = ∂F₁/∂x
∂h₂/∂h₁ = I + J₂ where J₂ = ∂F₂/∂h₁
∂L/∂x = ∂L/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂x = ∂L/∂h₂ · (I + J₂) · (I + J₁) = ∂L/∂h₂ · (I + J₁ + J₂ + J₂J₁)
Four paths:
- I: gradient flows through both identity connections (∂L/∂h₂ · I)
- J₁: through F₁ only (∂L/∂h₂ · J₁)
- J₂: through F₂ only (∂L/∂h₂ · J₂)
- J₂J₁: through both (∂L/∂h₂ · J₂J₁, the "vanilla" deep path)
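The four-path decomposition can be checked on a concrete two-block network. This sketch assumes tanh branches F_k(h) = B_k·tanh(A_k·h) with hypothetical random weights A1, B1, A2, B2 and a quadratic loss L = ½||h₂||², none of which come from the problem statement, and compares the sum of the four path terms with a finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
# Hypothetical weights for the branches F_k(h) = B_k @ tanh(A_k @ h), k = 1, 2.
A1, B1, A2, B2 = (0.5 * rng.standard_normal((n, n)) for _ in range(4))

def branch(h, A, B):
    return B @ np.tanh(A @ h)

def branch_jacobian(h, A, B):
    """Jacobian of branch: B @ diag(1 - tanh^2(A @ h)) @ A."""
    return B @ np.diag(1.0 - np.tanh(A @ h) ** 2) @ A

def loss(x):
    h1 = x + branch(x, A1, B1)
    h2 = h1 + branch(h1, A2, B2)
    return 0.5 * np.sum(h2 ** 2)           # assumed loss, so dL/dh2 = h2

x = rng.standard_normal(n)
h1 = x + branch(x, A1, B1)
h2 = h1 + branch(h1, A2, B2)
g2 = h2                                     # dL/dh2 as a row vector
J1, J2 = branch_jacobian(x, A1, B1), branch_jacobian(h1, A2, B2)

# The four path contributions from the expansion of (I + J2)(I + J1).
paths = {"I": g2, "J1": g2 @ J1, "J2": g2 @ J2, "J2 J1": g2 @ J2 @ J1}
grad_from_paths = sum(paths.values())

# Independent check: central finite differences on the loss.
eps = 1e-6
grad_fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps) for e in np.eye(n)])

for name, contribution in paths.items():
    print(f"{name:6s} norm = {np.linalg.norm(contribution):.4f}")
print("max |paths - finite diff| =", np.max(np.abs(grad_from_paths - grad_fd)))
```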
Example 2: Gradient Magnitude Bound
Problem: If ||J_ℓ|| ≤ 0.1 for all ℓ in a 100-block residual network, what is a lower bound on ||∂L/∂h₀|| relative to ||∂L/∂h₁₀₀||?
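Solution: With gradients as row vectors, ∂L/∂h₀ = ∂L/∂h₁₀₀ · (I + J₁₀₀) ⋯ (I + J₁).
The identity path contributes ∂L/∂h₁₀₀ directly, but the other paths can partially cancel it, so the I term alone does not prove ||∂L/∂h₀|| ≥ ||∂L/∂h₁₀₀||.
A valid bound comes from singular values: every singular value of I + J_ℓ lies in [1 − ||J_ℓ||, 1 + ||J_ℓ||], which is contained in [0.9, 1.1]. Multiplying over 100 blocks:
(0.9)^100 ||∂L/∂h₁₀₀|| ≤ ||∂L/∂h₀|| ≤ (1.1)^100 ||∂L/∂h₁₀₀||, with (0.9)^100 ≈ 2.7 × 10^-5 and (1.1)^100 ≈ 1.4 × 10^4.
This is radically different from vanilla nets with the same Jacobian norms, where ||∂L/∂h₀|| ≤ (0.1)^100 ||∂L/∂h₁₀₀|| = 10^-100 ||∂L/∂h₁₀₀||. The residual lower bound is a worst case, and it approaches 1 as the ||J_ℓ|| shrink.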
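A numerical sketch of the residual case, assuming random block Jacobians rescaled to ||J_ℓ|| = 0.1 exactly (the width 8 is arbitrary). Each factor I + J_ℓ has singular values in [0.9, 1.1], so the singular values of the 100-block product, which bracket the ratio ||∂L/∂h₀|| / ||∂L/∂h₁₀₀||, must lie between 0.9^100 and 1.1^100.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, bound = 8, 100, 0.1

product = np.eye(n)
for _ in range(depth):
    J = rng.standard_normal((n, n))
    J *= bound / np.linalg.norm(J, 2)        # enforce ||J_l|| = 0.1 exactly
    product = (np.eye(n) + J) @ product

singular_values = np.linalg.svd(product, compute_uv=False)
lower, upper = (1 - bound) ** depth, (1 + bound) ** depth

print(f"worst-case lower bound 0.9^100 = {lower:.3e}")
print(f"worst-case upper bound 1.1^100 = {upper:.3e}")
print(f"observed singular values in [{singular_values.min():.3e}, {singular_values.max():.3e}]")
print(f"vanilla-network bound 0.1^100  = {bound ** depth:.3e}")
```

With random (unaligned) Jacobians the observed range is usually far narrower than the worst-case bounds.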
Example 3: Pre-Activation vs Post-Activation Gradient
Problem: Compute the gradient ∂y/∂x for both pre-act and post-act residual blocks. Use F(x) = W₂·σ(W₁·x) with W₁ = W₂ = I (identity) and x = [−1, 1]ᵀ. Use ReLU for σ.
Solution:
Post-activation: y = ReLU(F(x) + x)
F(x) = σ(x) = ReLU([−1,1]) = [0,1]
F(x) + x = [0,1] + [−1,1] = [−1,2]
y = ReLU([−1,2]) = [0,2]
Gradient: ∂y/∂x = diag(ReLU'(F(x)+x)) · (I + ∂F/∂x)
diag(ReLU'([−1,2])) = diag([0,1])
∂F/∂x = diag(ReLU'([−1,1])) · I = diag([0,1]) · I = diag([0,1])
I + ∂F/∂x = diag([1,1]) + diag([0,1]) = diag([1,2])
∂y/∂x = diag([0,1]) · diag([1,2]) = diag([0,2])

Pre-activation: y = F(ReLU(x)) + x
ReLU(x) = [0,1]
F(ReLU(x)) = ReLU(I·[0,1]) = [0,1]
y = [0,1] + [−1,1] = [−1,2]
∂y/∂x = ∂F(ReLU(x))/∂x + I = I · diag(ReLU'(ReLU(x))) · I · diag(ReLU'(x)) + I = diag([0,1]) · diag([0,1]) + I = diag([0,1]) + I = diag([1,2]) (using the convention ReLU'(0) = 0)

The pre-activation gradient for the first dimension (x₁ = −1): diag entry = 1 (pure identity). The post-activation: 0 (identity path blocked). Pre-activation wins for gradient flow.
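The two Jacobians in this example can be reproduced in a few lines. This sketch assumes F(x) = W₂·ReLU(W₁·x) and the convention ReLU'(0) = 0; the function names are hypothetical.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def relu_grad(v):
    """ReLU derivative, using the convention ReLU'(0) = 0."""
    return (v > 0).astype(float)

def post_activation_jacobian(x, w1, w2):
    """Jacobian of y = ReLU(F(x) + x) with F(x) = w2 @ ReLU(w1 @ x)."""
    jac_f = w2 @ np.diag(relu_grad(w1 @ x)) @ w1
    pre_sum = w2 @ relu(w1 @ x) + x
    return np.diag(relu_grad(pre_sum)) @ (np.eye(x.size) + jac_f)

def pre_activation_jacobian(x, w1, w2):
    """Jacobian of y = F(ReLU(x)) + x with the same F."""
    a = relu(x)
    jac_branch = w2 @ np.diag(relu_grad(w1 @ a)) @ w1 @ np.diag(relu_grad(x))
    return np.eye(x.size) + jac_branch

x = np.array([-1.0, 1.0])
w = np.eye(2)
post = post_activation_jacobian(x, w, w)
pre = pre_activation_jacobian(x, w, w)
print("post-activation:\n", post)
print("pre-activation:\n", pre)
```

The first diagonal entry is the one to compare: it is gated to 0 in the post-activation block and stays at 1 in the pre-activation block.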
Quiz
Q1: What is the Jacobian ∂y/∂x for a residual block y = F(x) + x?
A) ∂F/∂x B) I C) I + ∂F/∂x D) ∂F/∂x · I
Answer & Explanation
**C**: The derivative of F(x) + x is ∂F/∂x + I. The skip connection adds the identity matrix. This I term prevents gradient vanishing: even if ||∂F/∂x|| ≪ 1, the gradient always has an unobstructed identity path.
Q2: In a residual network of depth L, how many distinct gradient paths exist from output to input?
A) L B) L² C) 2^L D) L!
Answer & Explanation
**C) 2^L**: At each block, choose the identity (I) or the transformed (J_ℓ) path. Two choices at each of L blocks gives 2 × 2 × ... × 2 = 2^L paths, from the length-0 identity path to the length-L vanilla deep network path.
Q3: What is the key advantage of pre-activation over post-activation residual blocks?
A) Fewer parameters B) The identity path bypasses the activation function entirely, so the gradient I is never gated by ReLU' C) Faster computation D) Better regularization
Answer & Explanation
**B**: Pre-activation: y = F(σ(x)) + x. The skip adds x directly, so ∂y/∂x = I + ∂F(σ(x))/∂x and I is untouched. Post-activation: y = σ(F(x) + x). Wherever a component of F(x) + x is negative, ReLU' = 0 blocks even the identity path.
Q4: Why do residual connections enable 1000+ layer networks?
A) They use better activation functions B) The identity path adds a term to ∂x_L/∂x₀ that does not decay with depth, so gradients do not vanish exponentially C) They reduce the number of parameters D) They eliminate the need for batch normalization
Answer & Explanation
**B**: A vanilla 1000-layer network has one gradient path that decays exponentially. A residual network has 2^1000 paths; the pure identity path contributes a term of norm 1 that does not decay with depth, and when the block Jacobians are small the remaining paths only perturb it.
Q5: What must be done when a residual block changes the channel dimension?
A) The skip connection cannot be used B) A learned linear projection W_s · x must replace the identity skip connection C) The network must be restructured D) Zero-padding is applied to the skip
Answer & Explanation
**B**: y = F(x) + x requires dim(F(x)) = dim(x). Dimension changes (e.g., stride-2 downsampling) need W_s · x to match. Forgetting this causes a shape mismatch error.
Practice Problems
Problem 1
Write the forward and backward equations for a residual block with F(x) = W₂·tanh(W₁·x). Include the skip connection.
Answer
Forward: y = W₂·tanh(W₁·x) + x
Backward: ∂y/∂x = I + W₂·diag(tanh'(W₁·x))·W₁
The I term is the skip connection gradient.
Problem 2
How many gradient paths exist in a 3-block residual network? List them.
Answer
2³ = 8 paths: III, J₁II, IJ₂I, IIJ₃, J₁J₂I, J₁IJ₃, IJ₂J₃, J₁J₂J₃. At each block you choose I or J_ℓ.
Problem 3
Why does pre-activation residual design give a "truer" identity path than post-activation?
Answer
Pre-activation: y = F(ReLU(x)) + x, so ∂y/∂x = I + ∂F(ReLU(x))/∂x. I is always present. Post-activation: y = ReLU(F(x)+x), so ∂y/∂x = diag(ReLU'(F(x)+x))·(I+J). Wherever a component of F(x)+x is ≤ 0, even the identity path is gated to zero.
Problem 4
A residual network has 50 blocks. If ||J_ℓ|| < 1 for all ℓ, what happens to the longest gradient path? What about the shortest?
Answer
Longest path: ||J₁J₂...J₅₀|| ≤ (max||J||)^50 → exponential decay. Shortest path: ||I|| = 1 → no decay at all. The network learns via short paths.
Problem 5
Explain the connection between residual networks and Neural ODEs.
Answer
h_{t+1} = h_t + F(h_t) is a forward Euler step of dh/dt = F(h) with step size 1. As the number of blocks → ∞ and F becomes smoother, the residual network approximates a continuous-depth model. ResNets are discrete approximations of continuous dynamical systems.
Summary
- A residual block computes y = F(x) + x, creating a skip connection that adds I to the Jacobian: ∂y/∂x = I + ∂F/∂x
- The gradient decomposes into 2^L paths, including a pure identity path (length 0) that never attenuates
- Even if ||∂F/∂x|| ≪ 1 for all blocks, the identity term keeps gradients from vanishing exponentially with depth
- Pre-activation design (y = F(σ(x)) + x) provides a cleaner identity path than post-activation (y = σ(F(x)+x))
- Deep residual networks can be viewed as discrete ODE solvers, connecting them to continuous-depth models
Pitfalls
- Forgetting the projection shortcut when dimensions change. The identity skip connection requires matching shapes: y = F(x) + x only works when dim(F(x)) = dim(x). When F changes the channel count or spatial dimensions (e.g., stride-2 downsampling), you must add a learned linear projection W_s·x. Forgetting this causes an immediate shape mismatch error.
- Using post-activation residual blocks for very deep networks (50+ layers). Post-activation (ReLU after addition) gates the identity gradient path through ReLU', meaning negative F(x) + x values block even the skip connection's gradient. Pre-activation design (ReLU before F, skip bypasses it) provides a truly unobstructed identity path and is strongly preferred beyond ~50 layers.
- Thinking residual connections eliminate the need for careful initialization. Residuals prevent complete gradient vanishing, but if F(x) dominates the skip connection (||F(x)|| ≫ ||x||), the network behaves like a vanilla deep net with all the associated gradient problems. Proper initialization ensures the residual block starts near-identity: F(x) ≈ 0 at initialization.
- Ignoring BatchNorm placement within residual blocks. The standard pre-activation order is BN → ReLU → Conv → BN → ReLU → Conv → Add. Deviating from this (e.g., placing BN after addition, or omitting the second BN) changes the signal distribution entering each block and can degrade performance measurably.
- Adding skip connections between arbitrary layers without considering the "near-identity" prior. Residual connections work because the network can learn small corrections to an identity mapping. If the skip connection spans a large transformation (e.g., across a bottleneck with major dimension reduction), the identity interpretation breaks down and the gradient highway benefit is lost.
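The initialization pitfall above can be illustrated with a toy stack. One common way to get F(x) ≈ 0 at initialization is to zero the last layer of each branch; the sketch below assumes that choice, a two-layer ReLU branch, and He-style scaling for the other weights. All of these are illustrative assumptions, not a prescription from this subject.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 8, 50

def make_block(zero_init_last):
    """Weights (W1, W2) for F(x) = W2 @ relu(W1 @ x); optionally zero the last layer."""
    he_scale = np.sqrt(2.0 / n)                       # He-style scale (assumed)
    w1 = he_scale * rng.standard_normal((n, n))
    if zero_init_last:
        w2 = np.zeros((n, n))
    else:
        w2 = he_scale * rng.standard_normal((n, n))
    return w1, w2

def forward(x, blocks):
    """Run x through a stack of residual blocks x <- x + F(x)."""
    for w1, w2 in blocks:
        x = x + w2 @ np.maximum(w1 @ x, 0.0)
    return x

x = rng.standard_normal(n)
out_zero = forward(x, [make_block(True) for _ in range(depth)])
out_rand = forward(x, [make_block(False) for _ in range(depth)])

print("||x||                            =", np.linalg.norm(x))
print("output norm, zero-init last layer =", np.linalg.norm(out_zero))
print("output norm, random last layer    =", np.linalg.norm(out_rand))
```

With the zeroed last layer every block starts as an exact identity map, so the 50-block stack returns its input unchanged. With a random last layer the branch output is comparable in size to x and the signal norm grows from block to block.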
Next Steps
Continue to 17-06 - Attention Mechanism (General) to learn about the Query-Key-Value abstraction that powers Transformers.