
17-05: Residual Connections

Phase: 17 (Deep Learning Architectures, Math)
Subject: 17-05
Prerequisites: 16-05 (Backpropagation), 16-06 (Gradient Flow), 16-08 (Regularization, optional), 17-03/17-04 (gating concepts)
Next subject: 17-06: Attention Mechanism (General)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the residual block equation and the gradient identity that enables training of very deep networks
  2. Explain why ∂y/∂x = I + ∂F(x)/∂x prevents vanishing gradients regardless of depth
  3. Contrast pre-activation and post-activation residual blocks with mathematical justification
  4. Prove that a residual network of depth L has exactly one gradient path of length 0 (the pure identity path)
  5. Compute the number of gradient paths of each length in a residual network

Core Content

1. The Problem: Why Deep Networks Stall

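From 16-06, we know that standard deep networks suffer from vanishing gradients. Even with ReLU, gradients can be attenuated by the weight matrices. For an L-layer plain network (for example, L = 100) with pre-activations z_ℓ = W_ℓ · h_{ℓ-1} and h_ℓ = ReLU(z_ℓ), the Jacobian of the last hidden state with respect to the first is:

∂h_L/∂h₁ = ∏_{ℓ=2}^{L} ∂h_ℓ/∂h_{ℓ-1} = ∏_{ℓ=2}^{L} diag(ReLU'(z_ℓ)) · W_ℓ

with later layers on the left of the product. The loss gradient is ∂L/∂h₁ = ∂L/∂h_L · ∂h_L/∂h₁ (in ∂L the symbol L is the loss; elsewhere L is the depth).

If the largest singular value of every W_ℓ is below 1, the norm of this product shrinks exponentially with depth.

Residual connections address this by providing an unobstructed "gradient highway" through the network: a direct identity path alongside the transformed path.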
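The exponential shrinkage of the plain-network Jacobian product can be seen numerically. The following is a minimal sketch under stated assumptions: the layer width, depth, random ReLU masks, and the choice to rescale every weight matrix to spectral norm 0.9 are all illustrative, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 16, 100  # hypothetical sizes chosen for illustration


def scaled_weight(target_norm):
    """Random square matrix rescaled so its largest singular value is target_norm."""
    w = rng.standard_normal((dim, dim))
    return w * (target_norm / np.linalg.norm(w, 2))


# Accumulate the product of per-layer Jacobians diag(ReLU'(z)) @ W,
# with later layers multiplied on the left.
jac_product = np.eye(dim)
for _ in range(depth):
    relu_mask = (rng.standard_normal(dim) > 0).astype(float)  # stand-in for ReLU'(z)
    layer_jac = np.diag(relu_mask) @ scaled_weight(0.9)
    jac_product = layer_jac @ jac_product

product_norm = np.linalg.norm(jac_product, 2)
print(f"spectral norm after {depth} layers: {product_norm:.3e}")
print(f"upper bound 0.9**{depth}: {0.9**depth:.3e}")
```

Because each factor has spectral norm at most 0.9, the product norm can be no larger than 0.9^100, and in practice it is usually far smaller.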

2. The Residual Block

A residual block computes:

y = F(x, {W_i}) + x

Where F is a small neural network (typically 2-3 layers) and x is the input, added back via a "skip connection" or "identity shortcut."

Shapes must match: If F changes the dimension, a learned linear projection W_s is used:

y = F(x, {W_i}) + W_s · x
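The two forms of the block, identity skip and projection skip, can be sketched in a few lines. This is an illustrative example only: the function name `residual_block`, the two-layer ReLU branch, and all shapes are assumptions made for the sketch.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(x, w1, w2, w_s=None):
    """Compute y = F(x) + skip(x) with F(x) = w2 @ relu(w1 @ x).

    If w_s is None the skip is the identity, which requires F(x) and x to
    have the same shape; otherwise the skip is the projection w_s @ x.
    """
    f_x = w2 @ relu(w1 @ x)
    skip = x if w_s is None else w_s @ x
    if f_x.shape != skip.shape:
        raise ValueError(f"shape mismatch: F(x) {f_x.shape} vs skip {skip.shape}")
    return f_x + skip


rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# Same-dimension block: identity skip.
y_same = residual_block(x, rng.standard_normal((8, 4)), rng.standard_normal((4, 8)))

# Dimension-changing block (4 -> 6): the skip needs a projection W_s.
w1, w2 = rng.standard_normal((8, 4)), rng.standard_normal((6, 8))
w_s = rng.standard_normal((6, 4))
y_proj = residual_block(x, w1, w2, w_s)
print(y_same.shape, y_proj.shape)
```

Calling `residual_block(x, w1, w2)` without `w_s` in the second case raises the shape-mismatch error, which is the failure the projection is there to avoid.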

⚠️ THIS IS CRITICAL: The residual connection isn't just a trick. It fundamentally changes the gradient dynamics. The derivative ∂y/∂x = I + ∂F/∂x guarantees that the Jacobian has a term equal to the IDENTITY MATRIX, which doesn't decay no matter how small ∂F/∂x is. (With a projection shortcut, the I term is replaced by W_s.)

3. The Gradient Identity

Consider a network built from residual blocks. For block ℓ:

h_ℓ = F_ℓ(h_{ℓ-1}) + h_{ℓ-1}

The Jacobian:

∂h_ℓ/∂h_{ℓ-1} = I + ∂F_ℓ/∂h_{ℓ-1}

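Now the full gradient from block L back to the input h₀. Writing J_ℓ = ∂F_ℓ/∂h_{ℓ-1} and treating gradients as row vectors:

∂L/∂h₀ = ∂L/∂h_L · (I + J_L) ⋯ (I + J₂) · (I + J₁)

Expand the product:

∏(I + J_ℓ) = I + Σ_ℓ J_ℓ + Σ_{i<j} J_j·J_i + ... + J_L ⋯ J₂·J₁

so that

∂L/∂h₀ = ∂L/∂h_L · Σ_{S ⊆ {1,...,L}} ∏_{ℓ∈S} J_ℓ

Where S is any subset of blocks where we take the J_ℓ path (taking I at the others). The factors in each product are ordered with later blocks on the left, and the empty subset contributes I.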

This means there are 2^L distinct gradient paths:

  - 1 path of "length 0" (all identity: I)
  - L paths of "length 1" (one J, rest I)
  - C(L,2) paths of "length 2"
  - ...
  - 1 path of "length L" (all J: the vanilla deep network path)

The gradient is the SUM over all paths. Even if long paths vanish, the short paths (especially the length-0 identity path) ensure gradients flow.
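The path expansion can be checked numerically for a small network. The sketch below assumes four blocks with random 3×3 branch Jacobians (hypothetical values); it compares the product of the (I + J_ℓ) factors with the explicit sum over all subsets and counts the paths of each length.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
num_blocks, dim = 4, 3
# Hypothetical branch Jacobians J_1..J_L (list index 0 holds J_1).
jacs = [0.5 * rng.standard_normal((dim, dim)) for _ in range(num_blocks)]

# Left-hand side: (I + J_L) ... (I + J_1), later blocks on the left.
lhs = np.eye(dim)
for jac in jacs:
    lhs = (np.eye(dim) + jac) @ lhs

# Right-hand side: one term per subset S, factors in decreasing block order.
rhs = np.zeros((dim, dim))
paths_by_length = [0] * (num_blocks + 1)
for k in range(num_blocks + 1):
    for subset in itertools.combinations(range(num_blocks), k):
        term = np.eye(dim)
        for idx in subset:  # increasing index; left-multiplying puts later blocks leftmost
            term = jacs[idx] @ term
        rhs += term
        paths_by_length[k] += 1

print(np.allclose(lhs, rhs))
print(paths_by_length)  # [1, 4, 6, 4, 1]
```

The counts are the binomial coefficients C(4, k), and they sum to 2^4 = 16 paths.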

5. Why This Enables 1000+ Layer Networks

In a vanilla 1000-layer network, there is exactly ONE gradient path, and every layer must successfully propagate the gradient. If each layer shrinks the gradient even slightly, the shrinkage compounds exponentially in L.

In a residual network, there are 2^1000 ≈ 10^301 paths. By count, most of them are not short: path lengths follow a binomial distribution centered at L/2. But a path of length k multiplies k branch Jacobians, so when ||J_ℓ|| < 1 its contribution shrinks roughly like ||J||^k, and the short paths carry most of the gradient. The network can learn even if only the short paths carry useful gradients, and as training progresses, the longer paths can be recruited.

This is the "effective depth" interpretation: early in training, the network behaves like a shallow ensemble of shorter paths; as training progresses, deeper paths become useful.
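The difference between how many paths have a given length and how much those paths can contribute is easy to tabulate. The sketch below assumes L = 100 blocks and a uniform bound ||J_ℓ|| ≤ 0.1 (both hypothetical); it uses C(L, k) · r^k as an upper bound on the combined norm of all length-k paths, which is a bound and not the actual contribution.

```python
import math

num_blocks = 100   # L, hypothetical
jac_norm = 0.1     # assumed uniform bound r on ||J_l||

# Number of paths of each length k (binomial coefficients).
counts = [math.comb(num_blocks, k) for k in range(num_blocks + 1)]
# Upper bound on the total norm of all length-k paths: C(L, k) * r^k.
bounds = [c * jac_norm**k for k, c in enumerate(counts)]

most_common_length = max(range(num_blocks + 1), key=lambda k: counts[k])
largest_bound_length = max(range(num_blocks + 1), key=lambda k: bounds[k])
total_bound = sum(bounds)  # equals (1 + r)^L by the binomial theorem

print("most common path length:", most_common_length)
print("length with largest norm bound:", largest_bound_length)
print("sum of bounds:", total_bound, "vs (1 + r)^L:", (1 + jac_norm) ** num_blocks)
```

Under these assumptions the path count peaks at length 50 while the norm bound peaks at length 9, which is the sense in which short paths dominate even though long paths are far more numerous.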

6. Pre-Activation vs Post-Activation Residual

Post-activation (original ResNet):

y = ReLU(F(x) + x)

The ReLU is applied AFTER the addition. This means the identity path also passes through ReLU, which zeros out negative values: the gradient through the identity path is gated by ReLU'.

Pre-activation (ResNet v2):

y = F(ReLU(x)) + x

The ReLU is applied BEFORE F, and the skip connection bypasses it entirely. This gives a TRUE identity path: the gradient ∂y/∂x = I + ∂F(ReLU(x))/∂x, and the I term has no ReLU' gating.

Mathematical advantage of pre-activation: The identity path is truly unobstructed. In post-activation, if the sum F(x) + x is negative, ReLU' = 0 and even the identity path is blocked.

7. The Training Dynamics Insight

A residual network with block function F(x) = W₂·σ(W₁·x) can be seen as a dynamical system:

h_{t+1} = h_t + F(h_t)

This is a forward Euler discretization of the ODE dh/dt = F(h). Very deep residual networks approximate continuous-depth models (Neural ODEs), which gives them a principled mathematical foundation.
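The Euler interpretation can be illustrated with a scalar toy problem. This sketch assumes the dynamics dh/dt = −h (chosen because the exact solution h(1) = e^(−1) is known) and scales each block's branch by a step size dt = 1/N, a generalization of the step size 1 in the update above.

```python
import math


def vector_field(h):
    """Hypothetical dynamics dh/dt = -h, with exact solution h(t) = h(0) * exp(-t)."""
    return -h


def residual_stack(h0, num_blocks, total_time=1.0):
    """Apply num_blocks residual updates h <- h + dt * f(h), i.e. forward Euler steps."""
    dt = total_time / num_blocks
    h = h0
    for _ in range(num_blocks):
        h = h + dt * vector_field(h)  # residual block with branch F(h) = dt * f(h)
    return h


exact = math.exp(-1.0)
for n in (1, 10, 100, 1000):
    approx = residual_stack(1.0, n)
    print(f"{n:5d} blocks: h(1) ~ {approx:.6f}, error {abs(approx - exact):.2e}")
```

The error shrinks as the number of blocks grows, which is the discrete-to-continuous limit that continuous-depth models build on.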

8. Gradient Flow Quantification

For a residual block with F(x) = W₂·ReLU(W₁·x):

∂y/∂x = I + W₂·diag(ReLU'(W₁·x))·W₁

The eigenvalues of this Jacobian are 1 + λ_i, where λ_i are eigenvalues of J_F = ∂F/∂x.

If J_F is "small" (||J_F|| < 1, which is common with suitable initialization), then every eigenvalue lies within distance ||J_F|| of 1, and every singular value of the block Jacobian lies in [1 − ||J_F||, 1 + ||J_F||]: a single block can neither zero out nor blow up the gradient. The smaller ||J_F|| is, the more tightly these values cluster around 1, which is the sweet spot for gradient propagation.
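Both statements, the eigenvalue shift and the range of the singular values, can be checked on a random block. The sketch below is illustrative: the sizes are arbitrary, and rescaling J_F to spectral norm 0.3 is an assumption made to satisfy ||J_F|| < 1, not an initialization scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 6, 12
w1 = rng.standard_normal((hidden, dim))
w2 = rng.standard_normal((dim, hidden))
x = rng.standard_normal(dim)

# Branch Jacobian J_F = W2 diag(ReLU'(W1 x)) W1, rescaled so that ||J_F||_2 = 0.3
# (equivalent to rescaling W2; an illustrative choice).
relu_mask = (w1 @ x > 0).astype(float)
jac_f = w2 @ np.diag(relu_mask) @ w1
jac_f *= 0.3 / np.linalg.norm(jac_f, 2)
jac_block = np.eye(dim) + jac_f

eig_f = np.linalg.eigvals(jac_f)
eig_block = np.linalg.eigvals(jac_block)
# Each eigenvalue of I + J_F should coincide with some 1 + lambda_i.
shift_error = max(np.min(np.abs(mu - (1.0 + eig_f))) for mu in eig_block)

singular_values = np.linalg.svd(jac_block, compute_uv=False)
print("max eigenvalue mismatch:", shift_error)
print("singular values of I + J_F:", np.round(singular_values, 3))
```

With ||J_F|| = 0.3, all singular values should fall between 0.7 and 1.3.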



Key Terms

Worked Examples

Example 1: Two-Block Gradient Decomposition

Problem: A residual network with 2 blocks: h₁ = x + F₁(x), h₂ = h₁ + F₂(h₁), loss L applied at h₂. Write ∂L/∂x as a sum of path contributions.

Solution:

∂h₁/∂x = I + J₁ where J₁ = ∂F₁/∂x
∂h₂/∂h₁ = I + J₂ where J₂ = ∂F₂/∂h₁

∂L/∂x = ∂L/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂x
      = ∂L/∂h₂ · (I + J₂) · (I + J₁)
      = ∂L/∂h₂ · (I + J₁ + J₂ + J₂J₁)

Four paths:

  - I: gradient flows through both identity connections (∂L/∂h₂ · I)
  - J₁: through F₁ only (∂L/∂h₂ · J₁)
  - J₂: through F₂ only (∂L/∂h₂ · J₂)
  - J₂J₁: through both (∂L/∂h₂ · J₂J₁, the "vanilla" deep path)

Example 2: Gradient Magnitude Bound

Problem: If ||J_ℓ|| ≤ 0.1 for all ℓ in a 100-block residual network, what is a lower bound on ||∂L/∂h₀|| relative to ||∂L/∂h₁₀₀||?

Solution: ∂L/∂h₀ = ∂L/∂h₁₀₀ · ∏(I + J_ℓ)

The identity path contributes ∂L/∂h₁₀₀ directly, but the other paths are added to it and can partly cancel it, so the I term alone does not give ||∂L/∂h₀|| ≥ ||∂L/∂h₁₀₀||.

A guaranteed bound comes from treating one block at a time. Since ||J_ℓ|| ≤ 0.1, every singular value of I + J_ℓ lies in [0.9, 1.1], so each block changes the gradient norm by a factor between 0.9 and 1.1:

(0.9)^100 ||∂L/∂h₁₀₀|| ≤ ||∂L/∂h₀|| ≤ (1.1)^100 ||∂L/∂h₁₀₀||

with (0.9)^100 ≈ 2.7 × 10^-5 and (1.1)^100 ≈ 1.4 × 10^4. The lower bound is a worst case in which every block shrinks the gradient as much as it can.

This is radically different from vanilla nets where ||∂L/∂h₀|| ≤ (0.1)^100 ||∂L/∂h₁₀₀|| = 10^-100 ||∂L/∂h₁₀₀||.
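A quick numerical experiment makes the contrast concrete. The sketch below assumes random 8×8 branch Jacobians rescaled to spectral norm exactly 0.1 (hypothetical data) and backpropagates a row-vector gradient through 100 blocks, once with the residual factor I + J_ℓ and once with the plain factor J_ℓ.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_blocks, jac_norm = 8, 100, 0.1


def random_jacobian():
    """Hypothetical branch Jacobian with spectral norm exactly jac_norm."""
    j = rng.standard_normal((dim, dim))
    return j * (jac_norm / np.linalg.norm(j, 2))


grad_top = rng.standard_normal(dim)   # stands in for dL/dh_100 (row vector)

grad_res = grad_top.copy()            # residual net: multiply by (I + J_l)
grad_plain = grad_top.copy()          # plain net: multiply by J_l only
for _ in range(num_blocks):
    j = random_jacobian()
    grad_res = grad_res @ (np.eye(dim) + j)
    grad_plain = grad_plain @ j

ratio_res = np.linalg.norm(grad_res) / np.linalg.norm(grad_top)
ratio_plain = np.linalg.norm(grad_plain) / np.linalg.norm(grad_top)
print(f"residual ratio: {ratio_res:.3e}  (bounds {0.9**100:.3e} .. {1.1**100:.3e})")
print(f"plain ratio:    {ratio_plain:.3e}  (bound  {0.1**100:.3e})")
```

The residual ratio must lie between the two worst-case bounds; for random Jacobians it typically stays within a few orders of magnitude of 1, while the plain ratio is vanishingly small.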

Example 3: Pre-Activation vs Post-Activation Gradient

Problem: Compute the gradient ∂y/∂x for both pre-act and post-act residual blocks. Use F(x) = W₂·σ(W₁·x) with W₁ = W₂ = I (identity) and x = [−1, 1]ᵀ. Use ReLU for σ.

Solution:

Post-activation: y = ReLU(F(x) + x)

  F(x) = σ(x) = ReLU([−1,1]) = [0,1]
  F(x) + x = [0,1] + [−1,1] = [−1,2]
  y = ReLU([−1,2]) = [0,2]
  Gradient: ∂y/∂x = diag(ReLU'(F(x)+x)) · (I + ∂F/∂x)
  diag(ReLU'([−1,2])) = diag([0,1])
  ∂F/∂x = W₂ · diag(ReLU'(W₁·x)) · W₁ = I · diag(ReLU'([−1,1])) · I = diag([0,1])
  I + ∂F/∂x = diag([1,1]) + diag([0,1]) = diag([1,2])
  ∂y/∂x = diag([0,1]) · diag([1,2]) = diag([0,2])

Pre-activation: y = F(ReLU(x)) + x

  ReLU(x) = [0,1]
  F(ReLU(x)) = ReLU(I·[0,1]) = [0,1]
  y = [0,1] + [−1,1] = [−1,2]
  ∂y/∂x = ∂F(ReLU(x))/∂x + I = W₂ · diag(ReLU'(W₁·ReLU(x))) · W₁ · diag(ReLU'(x)) + I
  diag(ReLU'(x)) = diag([0,1]), so the first diagonal entry of the branch term is 0 whatever value is assigned to ReLU'(0), and the second is 1 · 1 = 1
  ∂y/∂x = diag([0,1]) + I = diag([1,2])

The pre-activation gradient for the first dimension (x₁ = −1): diag entry = 1 (pure identity). The post-activation: 0 (identity path blocked). Pre-activation wins for gradient flow.
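The two Jacobians from this example can be confirmed with central finite differences. The sketch below hard-codes W₁ = W₂ = I as in the problem statement; the helper `numerical_jacobian` is a hypothetical utility written for this check.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def branch(x):
    """F(x) = W2 @ relu(W1 @ x) with W1 = W2 = I, as in Example 3."""
    return relu(x)


def post_activation(x):
    return relu(branch(x) + x)


def pre_activation(x):
    return branch(relu(x)) + x


def numerical_jacobian(fn, x, eps=1e-6):
    """Central differences; column j approximates d fn / d x_j."""
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        cols.append((fn(x + step) - fn(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)


x = np.array([-1.0, 1.0])
jac_post = numerical_jacobian(post_activation, x)
jac_pre = numerical_jacobian(pre_activation, x)
print(np.round(jac_post, 6))  # matches diag([0, 2])
print(np.round(jac_pre, 6))   # matches diag([1, 2])
```

Finite differences are reliable here because x = [−1, 1] is away from the ReLU kink at 0, so small perturbations do not change which units are active.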


Quiz

Q1: What is the Jacobian ∂y/∂x for a residual block y = F(x) + x?

A) ∂F/∂x
B) I
C) I + ∂F/∂x
D) ∂F/∂x · I

Answer & Explanation: **C**. The derivative of F(x) + x is ∂F/∂x + I. The skip connection adds the identity matrix. This I term counteracts gradient vanishing: even if ||∂F/∂x|| ≪ 1, the Jacobian keeps an identity term that does not decay.

Q2: In a residual network of depth L, how many distinct gradient paths exist from output to input?

A) L
B) L²
C) 2^L
D) L!

Answer & Explanation: **C) 2^L**. At each block, choose the identity (I) or the transformed (J_ℓ) path. Two independent choices at each of L blocks give 2^L paths, from the length-0 identity path to the length-L vanilla deep network path.

Q3: What is the key advantage of pre-activation over post-activation residual blocks?

A) Fewer parameters
B) The identity path bypasses the activation function entirely, so the gradient I is never gated by ReLU'
C) Faster computation
D) Better regularization

Answer & Explanation: **B**. Pre-activation: y = F(σ(x)) + x. The skip adds x directly, so ∂y/∂x = I + ∂F(σ(x))/∂x and I is untouched. Post-activation: y = σ(F(x) + x). Wherever F(x) + x is negative, ReLU' = 0 blocks even the identity path.

Q4: Why do residual connections enable 1000+ layer networks?

A) They use better activation functions
B) The identity path adds an I term to ∂x_L/∂x₀ that does not decay with depth, so short paths keep carrying gradient
C) They reduce the number of parameters
D) They eliminate the need for batch normalization

Answer & Explanation: **B**. A vanilla 1000-layer network has one gradient path, which can decay exponentially. A residual network has 2^1000 paths; the pure identity path contributes I, which has norm 1 regardless of depth, and the short paths around it keep the gradient usable even when the long paths decay.

Q5: What must be done when a residual block changes the channel dimension?

A) The skip connection cannot be used
B) A learned linear projection W_s · x must replace the identity skip connection
C) The network must be restructured
D) Zero-padding is applied to the skip

Answer & Explanation: **B**. y = F(x) + x requires dim(F(x)) = dim(x). Dimension changes (for example, stride-2 downsampling) need W_s · x to match. Forgetting this causes a shape mismatch error.

Practice Problems

Problem 1

Write the forward and backward equations for a residual block with F(x) = W₂·tanh(W₁·x). Include the skip connection.

Answer:
Forward: y = W₂·tanh(W₁·x) + x
Backward: ∂y/∂x = I + W₂·diag(tanh'(W₁·x))·W₁, so ∂L/∂x = ∂L/∂y · ∂y/∂x
The I term is the skip connection gradient.
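The forward and backward equations in this answer can be checked against finite differences. The sketch below assumes small random weights and hypothetical sizes; `upstream` stands in for ∂L/∂y, and the backward pass is written as a row-vector times Jacobian product.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 5, 7   # hypothetical sizes
w1 = 0.3 * rng.standard_normal((hidden, dim))
w2 = 0.3 * rng.standard_normal((dim, hidden))


def forward(x):
    """y = W2 tanh(W1 x) + x."""
    return w2 @ np.tanh(w1 @ x) + x


def analytic_jacobian(x):
    """dy/dx = I + W2 diag(tanh'(W1 x)) W1, using tanh'(z) = 1 - tanh(z)^2."""
    return np.eye(dim) + w2 @ np.diag(1.0 - np.tanh(w1 @ x) ** 2) @ w1


def numerical_jacobian(fn, x, eps=1e-6):
    """Central differences; column j approximates d fn / d x_j."""
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        cols.append((fn(x + step) - fn(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)


x = rng.standard_normal(dim)
upstream = rng.standard_normal(dim)        # stands in for dL/dy (row vector)
grad_x = upstream @ analytic_jacobian(x)   # backward pass: dL/dx = dL/dy . dy/dx

jac_error = np.max(np.abs(analytic_jacobian(x) - numerical_jacobian(forward, x)))
print("max Jacobian error:", jac_error)
print("dL/dx:", grad_x)
```

Note that `grad_x` always contains `upstream` itself as one summand, which is the skip connection's contribution.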

Problem 2

How many gradient paths exist in a 3-block residual network? List them.

Answer: 2³ = 8 paths: III, J₁II, IJ₂I, IIJ₃, J₁J₂I, J₁IJ₃, IJ₂J₃, J₁J₂J₃. At each block you choose I or J_ℓ.

Problem 3

Why does pre-activation residual design give a "truer" identity path than post-activation?

Answer: Pre-activation: y = F(ReLU(x)) + x, so ∂y/∂x = I + ∂F(ReLU(x))/∂x. I is always present. Post-activation: y = ReLU(F(x)+x), so ∂y/∂x = diag(ReLU'(F(x)+x))·(I+J). In every coordinate where F(x)+x ≤ 0, even the identity path is gated to zero.

Problem 4

A residual network has 50 blocks. If ||J_ℓ|| < 1 for all ℓ, what happens to the longest gradient path? What about the shortest?

Answer: Longest path: ||J₅₀ ⋯ J₂J₁|| ≤ (max_ℓ ||J_ℓ||)^50, which decays exponentially in the depth. Shortest path: ||I|| = 1, no decay at all. The network learns via short paths.

Problem 5

Explain the connection between residual networks and Neural ODEs.

Answer: h_{t+1} = h_t + F(h_t) is a forward Euler step of dh/dt = F(h) with step size 1. As the number of blocks grows and each block's update becomes smaller and varies smoothly from block to block, the residual network approximates a continuous-depth model. ResNets are discrete approximations of continuous dynamical systems.

Summary

  1. A residual block computes y = F(x) + x, creating a skip connection that adds I to the Jacobian: ∂y/∂x = I + ∂F/∂x
  2. The gradient decomposes into 2^L paths, including a pure identity path (length 0) that never attenuates
  3. Even if ||∂F/∂x|| ≪ 1 for all blocks, the identity path keeps an undecayed term in the gradient
  4. Pre-activation design (y = F(σ(x)) + x) provides a cleaner identity path than post-activation (y = σ(F(x)+x))
  5. Deep residual networks can be viewed as discrete ODE solvers, connecting them to continuous-depth models

Pitfalls


Next Steps

Continue to 17-06: Attention Mechanism (General) to learn about the Query-Key-Value abstraction that powers Transformers.