16-05 — Backpropagation (Mathematics)

Phase: 16 — Neural Network Mathematics
Subject: 16-05
Prerequisites: 16-01 to 16-04 (perceptron, activations, softmax, losses), Phase 4 (chain rule), Phase 6 (partial derivatives), Phase 9 (matrix calculus basics)
Next subject: 16-06 — Gradient Flow in Deep Networks

Learning Objectives

By the end of this subject, you will be able to:

Derive the backpropagation equations for a feedforward neural network using the chain rule
Compute ∂L/∂W^(ℓ) and ∂L/∂b^(ℓ) for any layer ℓ of a multi-layer perceptron
Explain the "local gradient" concept and how backprop stores intermediate results to avoid recomputation
Trace gradient flow through common layer types: linear, activation, and loss
Understand why backpropagation is reverse-mode automatic differentiation and compute its computational complexity

Core Content

1. The Problem: How Do We Train a Deep Network?

A feedforward neural network with L layers computes:

a^(0) = x   (input)
z^(ℓ) = W^(ℓ) a^(ℓ−1) + b^(ℓ)   (pre-activation, layer ℓ)
a^(ℓ) = f_ℓ(z^(ℓ))   (activation, layer ℓ)

The final output is a^(L) = ŷ. We compute a loss L(ŷ, y).
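The forward computation above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the layer sizes, the random weights, and the helper names `relu`, `identity`, and `forward` are assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(x, weights, biases, activations):
    """Return pre-activations zs and activations acts, with acts[0] = x."""
    acts = [x]
    zs = []
    for W, b, f in zip(weights, biases, activations):
        z = W @ acts[-1] + b      # z^(l) = W^(l) a^(l-1) + b^(l)
        zs.append(z)
        acts.append(f(z))         # a^(l) = f_l(z^(l))
    return zs, acts

# Hypothetical network: 3 inputs -> 4 hidden units (ReLU) -> 2 outputs (linear)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
x = np.array([1.0, -2.0, 0.5])

zs, acts = forward(x, weights, biases, [relu, identity])
y_hat = acts[-1]                  # a^(L), the network output
```

Keeping every `z` and `a` in a list is deliberate: the backward pass needs these stored values.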

To train the network, we need the gradient of the loss with respect to EVERY parameter (W^(ℓ) and b^(ℓ) for all ℓ). With millions of parameters, computing each gradient individually via finite differences would be impossibly slow.

⚠️ THIS IS CRITICAL — Backpropagation is the algorithm that makes training deep networks computationally feasible. It computes all gradients in ONE forward pass and ONE backward pass, with the same computational cost (up to a constant factor) as the forward pass itself.

2. The Chain Rule on Computational Graphs

The key insight: the loss L is a composition of many functions. By the multivariate chain rule:

∂L/∂W^(ℓ) = ∂L/∂z^(ℓ) · ∂z^(ℓ)/∂W^(ℓ)

The first factor (∂L/∂z^(ℓ)) depends on ALL layers after ℓ. Backprop computes these "errors" working backward from the output.

Local gradients: Each layer only needs to know:
1. Its own local gradients: ∂z^(ℓ)/∂a^(ℓ−1), ∂z^(ℓ)/∂W^(ℓ), ∂z^(ℓ)/∂b^(ℓ)
2. The "error signal" coming from later layers: ∂L/∂z^(ℓ)

We define δ^(ℓ) = ∂L/∂z^(ℓ) (the error at layer ℓ's pre-activation). These are the quantities we propagate backward.

3. The Backpropagation Equations

For a network with L layers, loss L, and activation function f:

Output layer error (ℓ = L):

δ^(L) = ∂L/∂z^(L) = ∂L/∂a^(L) ⊙ f'_L(z^(L))

where ⊙ denotes element-wise multiplication.

For CCE + softmax: ∂L/∂z^(L) = ŷ − y (the clean gradient we derived in 16-04).
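As a quick sanity check of the softmax + CCE result, the sketch below compares ŷ − y against central finite differences. The logits `z`, the one-hot target `y`, and the step `eps` are arbitrary values chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # shift for numerical stability
    return e / e.sum()

def cce(z, y):
    """Categorical cross-entropy of softmax(z) against one-hot y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])    # hypothetical logits
y = np.array([0.0, 1.0, 0.0])     # hypothetical one-hot target

analytic = softmax(z) - y         # claimed gradient dL/dz

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (cce(z + dz, y) - cce(z - dz, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
```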

Backward recurrence (ℓ = L−1, ..., 1):

δ^(ℓ) = ((W^(ℓ+1))ᵀ δ^(ℓ+1)) ⊙ f'_ℓ(z^(ℓ))

Parameter gradients:

∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ
∂L/∂b^(ℓ) = δ^(ℓ)

4. Deriving the Backward Recurrence

Let's derive δ^(ℓ) from δ^(ℓ+1) step by step.

Step 1: Apply the chain rule through the activation.

a^(ℓ) = f_ℓ(z^(ℓ))
∂L/∂z^(ℓ) = ∂L/∂a^(ℓ) · ∂a^(ℓ)/∂z^(ℓ)

Because f_ℓ acts element-wise, the Jacobian of the activation is diagonal:
∂aᵢ^(ℓ)/∂zⱼ^(ℓ) = f'_ℓ(zᵢ^(ℓ)) · δ_{ij}   (here δ_{ij} is the Kronecker delta, not the error signal)
So:
∂L/∂z^(ℓ) = ∂L/∂a^(ℓ) ⊙ f'_ℓ(z^(ℓ))

Step 2: Relate ∂L/∂a^(ℓ) to δ^(ℓ+1).

z^(ℓ+1) = W^(ℓ+1) a^(ℓ) + b^(ℓ+1)
∂zᵢ^(ℓ+1)/∂aⱼ^(ℓ) = W_{ij}^(ℓ+1)

By the chain rule:
∂L/∂aⱼ^(ℓ) = Σᵢ ∂L/∂zᵢ^(ℓ+1) · ∂zᵢ^(ℓ+1)/∂aⱼ^(ℓ)
           = Σᵢ δᵢ^(ℓ+1) · W_{ij}^(ℓ+1)
           = ((W^(ℓ+1))ᵀ δ^(ℓ+1))_j

Putting it together:

δ^(ℓ) = ((W^(ℓ+1))ᵀ δ^(ℓ+1)) ⊙ f'_ℓ(z^(ℓ)) ✓
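One way to see that the recurrence is just a transposed-Jacobian product is to build the Jacobian ∂z^(ℓ+1)/∂z^(ℓ) = W^(ℓ+1) diag(f'(z^(ℓ))) explicitly and compare. The sketch below assumes a tanh activation and random values; `delta_next` stands in for a δ^(ℓ+1) that would normally come from later layers.

```python
import numpy as np

rng = np.random.default_rng(1)
W_next = rng.normal(size=(3, 4))    # W^(l+1): 3 outputs, 4 inputs
z = rng.normal(size=4)              # z^(l)
delta_next = rng.normal(size=3)     # delta^(l+1), assumed given

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

# Explicit Jacobian dz^(l+1)/dz^(l) = W^(l+1) diag(f'(z^(l)))
J = W_next @ np.diag(tanh_prime(z))
delta_via_jacobian = J.T @ delta_next

# Recurrence form: never materializes the diagonal matrix
delta_via_recurrence = (W_next.T @ delta_next) * tanh_prime(z)
```

Both vectors agree; the recurrence form is what implementations use because it avoids building the diagonal matrix.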

5. Deriving Parameter Gradients

Weight gradient:

∂zᵢ^(ℓ)/∂W_{jk}^(ℓ) = aₖ^(ℓ−1) if i = j, and 0 if i ≠ j

Each weight W_{ij}^(ℓ) affects only zᵢ^(ℓ). By the chain rule:

∂L/∂W_{ij}^(ℓ) = ∂L/∂zᵢ^(ℓ) · ∂zᵢ^(ℓ)/∂W_{ij}^(ℓ) = δᵢ^(ℓ) · aⱼ^(ℓ−1)

In matrix form: ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ ✓

Bias gradient:

zᵢ^(ℓ) = ... + bᵢ^(ℓ), so ∂zᵢ^(ℓ)/∂bᵢ^(ℓ) = 1.
∂L/∂bᵢ^(ℓ) = δᵢ^(ℓ) · 1 = δᵢ^(ℓ)
∂L/∂b^(ℓ) = δ^(ℓ) ✓
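The outer-product and bias formulas can be checked numerically on a single linear layer. The sketch below assumes a hypothetical loss L = ½‖z − y‖² placed directly on the pre-activation, so that δ = z − y; the sizes and random values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
a_prev = rng.normal(size=2)         # a^(l-1)
y = rng.normal(size=3)

def loss(W, b):
    z = W @ a_prev + b
    return 0.5 * np.sum((z - y) ** 2)

delta = (W @ a_prev + b) - y        # dL/dz for this particular loss
grad_W = np.outer(delta, a_prev)    # delta (a^(l-1))^T, same shape as W
grad_b = delta

# Central finite differences, one parameter at a time
eps = 1e-6
num_W = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        num_W[i, j] = (loss(W + E, b) - loss(W - E, b)) / (2 * eps)

num_b = np.zeros_like(b)
for i in range(b.size):
    e = np.zeros_like(b)
    e[i] = eps
    num_b[i] = (loss(W, b + e) - loss(W, b - e)) / (2 * eps)
```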

6. Backpropagation Algorithm (Step by Step)

Forward pass:
1. Set a^(0) = x
2. For ℓ = 1 to L:
   - z^(ℓ) = W^(ℓ) a^(ℓ−1) + b^(ℓ)
   - a^(ℓ) = f_ℓ(z^(ℓ))
3. Compute L = loss(a^(L), y)

Backward pass:
4. Compute δ^(L) = ∂L/∂z^(L) (using loss and activation derivatives)
5. For ℓ = L down to 1:
   - ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ
   - ∂L/∂b^(ℓ) = δ^(ℓ)
   - If ℓ > 1: δ^(ℓ−1) = ((W^(ℓ))ᵀ δ^(ℓ)) ⊙ f'_{ℓ−1}(z^(ℓ−1))

Parameter update:
6. For ℓ = 1 to L:
   - W^(ℓ) ← W^(ℓ) − η · ∂L/∂W^(ℓ)
   - b^(ℓ) ← b^(ℓ) − η · ∂L/∂b^(ℓ)
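The six steps can be sketched end to end in NumPy. Several choices here are assumptions for illustration: ReLU hidden layers, a linear output, the loss L = ½‖a^(L) − y‖² (so δ^(L) = ŷ − y), a single training example, and the layer sizes and learning rate. Python lists are 0-indexed, so `Ws[l]` holds W^(l+1).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)      # convention: ReLU'(0) = 0

def forward(x, Ws, bs):
    """Steps 1-2: ReLU hidden layers, linear output. acts[0] = x."""
    acts, zs = [x], []
    n_layers = len(Ws)
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ acts[-1] + b
        zs.append(z)
        acts.append(z if l == n_layers - 1 else relu(z))
    return zs, acts

def loss(x, y, Ws, bs):
    """Step 3: L = 0.5 * ||a^(L) - y||^2."""
    return 0.5 * np.sum((forward(x, Ws, bs)[1][-1] - y) ** 2)

def backward(y, Ws, zs, acts):
    """Steps 4-5: gradients for every layer, from output to input."""
    n_layers = len(Ws)
    grads_W, grads_b = [None] * n_layers, [None] * n_layers
    delta = acts[-1] - y                         # delta^(L) for linear output
    for l in range(n_layers - 1, -1, -1):
        grads_W[l] = np.outer(delta, acts[l])    # acts[l] is the layer's input
        grads_b[l] = delta
        if l > 0:                                # propagate to the previous layer
            delta = (Ws[l].T @ delta) * relu_prime(zs[l - 1])
    return grads_W, grads_b

# Hypothetical network and data
rng = np.random.default_rng(3)
sizes = [3, 5, 4, 2]
Ws = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x, y = rng.normal(size=3), rng.normal(size=2)

loss_before = loss(x, y, Ws, bs)
zs, acts = forward(x, Ws, bs)
gW, gb = backward(y, Ws, zs, acts)

# Step 6: one gradient-descent update with a small, arbitrary learning rate
eta = 0.01
Ws = [W - eta * g for W, g in zip(Ws, gW)]
bs = [b - eta * g for b, g in zip(bs, gb)]
loss_after = loss(x, y, Ws, bs)
```

With a sufficiently small step, `loss_after` should come out below `loss_before`, which is a cheap first check that the gradients point in a descent direction.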

7. Why "Back"-propagation?

The algorithm computes gradients from the OUTPUT backward toward the INPUT. This is reverse-mode automatic differentiation.

Consider the computation graph where each node is an operation. Forward-mode AD (starting from inputs) computes ∂(output)/∂(input) efficiently when there are few inputs. Reverse-mode AD computes ∂(output)/∂(ALL intermediate values) efficiently when there are few outputs — perfect for neural networks where we have ONE scalar output (the loss) and MANY parameters.

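Computational complexity: The backward pass costs a small constant multiple of the forward pass. For a dense layer, the forward pass needs one matrix product (W^(ℓ) a^(ℓ−1)), while the backward pass needs two products of the same size ((W^(ℓ))ᵀ δ^(ℓ) and δ^(ℓ) (a^(ℓ−1))ᵀ). The backward pass therefore costs roughly 2× the forward pass, and a full forward-plus-backward step roughly 3×. More generally, each operation in the forward graph has a corresponding backward operation of similar complexity.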
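A rough operation count makes the constant factor concrete. The sketch below counts only the matrix products of dense layers (2 FLOPs per multiply-add) and ignores biases, activations, and the loss; the layer sizes and batch size are hypothetical.

```python
def dense_flops(d_in, d_out, batch=1):
    """Approximate FLOPs for one dense layer: (forward, backward)."""
    forward = 2 * d_out * d_in * batch            # Z = W A
    backward_input = 2 * d_in * d_out * batch     # W^T Delta (error for previous layer)
    backward_weight = 2 * d_out * d_in * batch    # Delta A^T (weight gradient)
    return forward, backward_input + backward_weight

layer_sizes = [784, 256, 128, 10]                 # hypothetical MLP
fwd_total = 0
bwd_total = 0
for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    f, b = dense_flops(d_in, d_out, batch=32)
    fwd_total += f
    bwd_total += b

backward_to_forward = bwd_total / fwd_total               # 2.0 under this count
step_to_forward = (fwd_total + bwd_total) / fwd_total     # 3.0 under this count
```

In practice the first layer does not need W^T Delta (there is no earlier layer to send the error to), so the true ratio is slightly below these figures.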

8. Handling Batches

In practice, we process mini-batches of M examples simultaneously. The forward pass processes a matrix X ∈ ℝ^{n×M}:

Z^(ℓ) = W^(ℓ) A^(ℓ−1) + b^(ℓ) (broadcast)

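With the batch loss defined as the average L = (1/M) Σᵢ Lᵢ, let Δ^(ℓ) be the matrix whose M columns are the per-example errors δ^(ℓ) = ∂Lᵢ/∂z^(ℓ). The parameter gradients become averages:

∂L/∂W^(ℓ) = (1/M) Δ^(ℓ) (A^(ℓ−1))ᵀ
∂L/∂b^(ℓ) = (1/M) Σ_{examples} δ^(ℓ)   (sum across the batch dimension)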
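The batched matrix product is simply the average of per-example outer products. The sketch below checks this with random stand-in values; `Delta` and `A_prev` are illustrative and hold one example per column.

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_out, M = 3, 2, 5
A_prev = rng.normal(size=(d_in, M))     # columns: a^(l-1) for each example
Delta = rng.normal(size=(d_out, M))     # columns: per-example delta^(l)

# Batched form
grad_W_batched = (Delta @ A_prev.T) / M
grad_b_batched = Delta.sum(axis=1) / M

# Same quantities, one example at a time
grad_W_loop = sum(np.outer(Delta[:, m], A_prev[:, m]) for m in range(M)) / M
grad_b_loop = sum(Delta[:, m] for m in range(M)) / M
```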

9. Gradient Checking

How do we verify our backprop implementation is correct? Gradient checking via finite differences:

∂L/∂W_{ij} ≈ [L(W_{ij} + ε) − L(W_{ij} − ε)] / (2ε)

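With ε ≈ 10⁻⁴ in double precision, the relative error between the analytical and numerical gradients should be below about 10⁻⁷; at that level your backprop is very likely correct. An error around 10⁻³ or larger in double precision likely indicates a bug. In float32, round-off limits the attainable agreement, so only a much looser threshold (around 10⁻³) can be expected; run the check in double precision when possible.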
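A gradient check along these lines might look like the sketch below. The helper names `numerical_grad` and `relative_error`, the particular relative-error definition, and the tiny tanh layer with fixed values are all assumptions for illustration. A deliberately buggy gradient (missing the activation derivative) is included to show how the check reacts.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-4):
    """Central-difference gradient of a scalar function f at theta."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        old = theta[idx]
        theta[idx] = old + eps
        f_plus = f(theta)
        theta[idx] = old - eps
        f_minus = f(theta)
        theta[idx] = old                      # restore the parameter
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def relative_error(a, b):
    """One common definition; others normalize by max(|a|, |b|) instead."""
    return np.linalg.norm(a - b) / max(np.linalg.norm(a) + np.linalg.norm(b), 1e-12)

# Hypothetical layer: a = tanh(W x + b), L = 0.5 * ||a - y||^2 (float64 throughout)
x = np.array([1.0, -2.0])
W = np.array([[0.5, -0.3], [0.1, 0.4], [-0.2, 0.2]])
b = np.array([0.1, -0.1, 0.2])
y = np.array([0.0, 1.0, 0.0])

def loss(W_):
    return 0.5 * np.sum((np.tanh(W_ @ x + b) - y) ** 2)

a = np.tanh(W @ x + b)
delta = (a - y) * (1.0 - a ** 2)              # includes tanh'(z) = 1 - a^2
grad_correct = np.outer(delta, x)
grad_buggy = np.outer(a - y, x)               # bug: activation derivative omitted

grad_numeric = numerical_grad(loss, W)
err_correct = relative_error(grad_correct, grad_numeric)
err_buggy = relative_error(grad_buggy, grad_numeric)
```

The correct gradient should give a very small `err_correct`, while `err_buggy` should be orders of magnitude larger.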

Key Terms

Backpropagation
Gradient checking
Parameter gradients

Worked Examples

Example 1: 2-Layer Network Backprop by Hand

Problem: A tiny network: input x = [1] (scalar), one hidden neuron, one output neuron. Parameters: w₁ = 2, b₁ = 0, w₂ = 3, b₂ = 1. Hidden activation: ReLU. Output activation: linear. Loss: MSE. Target: y = 10.

Compute all gradients by hand.

Solution:

Forward pass:
z₁ = w₁·x + b₁ = 2·1 + 0 = 2
a₁ = ReLU(2) = 2
z₂ = w₂·a₁ + b₂ = 3·2 + 1 = 7
ŷ = z₂ = 7
L = (7 − 10)² = 9

Backward pass:
∂L/∂z₂ = ∂L/∂ŷ · ∂ŷ/∂z₂ = 2(7−10)·1 = −6

∂L/∂w₂ = ∂L/∂z₂ · a₁ = −6·2 = −12
∂L/∂b₂ = ∂L/∂z₂ = −6

δ₁ = ∂L/∂z₁ = ∂L/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ = −6 · w₂ · ReLU'(2) = −6 · 3 · 1 = −18

∂L/∂w₁ = δ₁ · x = −18·1 = −18
∂L/∂b₁ = δ₁ = −18

Verification: If we increase w₁ slightly, z₁ → a₁ → z₂ all increase, and ŷ gets closer to 10, so L decreases. The negative gradient ∂L/∂w₁ = −18 is correct — gradient descent will ADD to w₁ (w₁ ← w₁ − η·(−18) = w₁ + 18η).
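The hand computation in Example 1 can be replayed in plain Python, with a finite-difference check on w₁ added as an independent confirmation. The variable names and the step `eps` are choices made for this sketch.

```python
# Values from Example 1
x, y = 1.0, 10.0
w1, b1, w2, b2 = 2.0, 0.0, 3.0, 1.0

# Forward pass
z1 = w1 * x + b1
a1 = max(0.0, z1)                  # ReLU
z2 = w2 * a1 + b2
y_hat = z2                         # linear output
L = (y_hat - y) ** 2               # MSE on a single example

# Backward pass
dL_dz2 = 2.0 * (y_hat - y)
dL_dw2 = dL_dz2 * a1
dL_db2 = dL_dz2
delta1 = dL_dz2 * w2 * (1.0 if z1 > 0 else 0.0)
dL_dw1 = delta1 * x
dL_db1 = delta1

# Independent check of dL/dw1 by central finite differences
def loss_of_w1(w):
    return (w2 * max(0.0, w * x + b1) + b2 - y) ** 2

eps = 1e-6
dL_dw1_numeric = (loss_of_w1(w1 + eps) - loss_of_w1(w1 - eps)) / (2 * eps)
```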

Example 2: Backprop Through Sigmoid + BCE

Problem: For a single neuron with sigmoid activation and BCE loss, x = [2], w = 1, b = −1, y = 0, compute ∂L/∂w.

Solution:

Forward:
z = w·x + b = 1·2 − 1 = 1
ŷ = σ(1) ≈ 0.7311
L = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)] = −[0·log(0.7311) + 1·log(0.2689)] = −log(0.2689) ≈ 1.313

Backward (using the clean BCE + sigmoid gradient):
∂L/∂z = ŷ − y = 0.7311 − 0 = 0.7311
∂L/∂w = ∂L/∂z · ∂z/∂w = 0.7311 · 2 = 1.4622
∂L/∂b = ∂L/∂z · 1 = 0.7311
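The numbers in Example 2 can be reproduced with the standard library. The sketch below also multiplies out the long-form chain rule (∂L/∂ŷ · ∂ŷ/∂z) to confirm that it collapses to ŷ − y; the variable names are illustrative.

```python
import math

# Values from Example 2
x, w, b, y = 2.0, 1.0, -1.0, 0.0

# Forward pass
z = w * x + b
y_hat = 1.0 / (1.0 + math.exp(-z))                      # sigmoid
L = -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

# Shortcut gradient for sigmoid + BCE
dL_dz = y_hat - y
dL_dw = dL_dz * x
dL_db = dL_dz

# Long form: dL/dy_hat times dy_hat/dz
dL_dyhat = -y / y_hat + (1.0 - y) / (1.0 - y_hat)
dyhat_dz = y_hat * (1.0 - y_hat)
dL_dz_long = dL_dyhat * dyhat_dz
```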

Example 3: Matrix Shape Analysis

Problem: A layer has 64 inputs and 128 outputs. Batch size is 32. Determine the shapes of W, a^(ℓ−1), z^(ℓ), δ^(ℓ), and ∂L/∂W^(ℓ).

Solution:

W^(ℓ): 128 × 64 (output_dim × input_dim)
a^(ℓ−1): 64 × 32 (input_dim × batch_size)
z^(ℓ) = Wa: 128 × 32
δ^(ℓ) = ∂L/∂z^(ℓ): 128 × 32 (same shape as z)
∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ: (128×32) × (32×64) = 128 × 64 ✓ (same shape as W)
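The same shape bookkeeping can be confirmed with placeholder arrays. In this sketch the contents are zeros and ones because only the shapes matter; storing the bias as a column vector is an assumption made so that it broadcasts across the batch.

```python
import numpy as np

d_in, d_out, B = 64, 128, 32

W = np.zeros((d_out, d_in))
b = np.zeros((d_out, 1))                 # column vector, broadcast over the batch
A_prev = np.zeros((d_in, B))

Z = W @ A_prev + b                       # (128, 32)
Delta = np.ones_like(Z)                  # stand-in for dL/dZ, same shape as Z

grad_W = Delta @ A_prev.T                # (128, 32) @ (32, 64) -> (128, 64)
grad_b = Delta.sum(axis=1, keepdims=True)    # (128, 1), summed over the batch
back_to_prev = W.T @ Delta               # (64, 32), before multiplying by f'
```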

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: For the network x → z₁=w₁x+b₁ → a₁=ReLU(z₁) → z₂=w₂a₁+b₂ → ŷ=σ(z₂), derive the full expression for ∂L/∂w₁ when using BCE loss.

Problem 2: A linear layer has W ∈ ℝ^{d_out × d_in}. Show that the backward computation Wᵀδ has complexity O(d_out · d_in · batch_size). Express this in terms of the forward pass complexity.

Problem 3: In backprop, we compute ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ. Derive this from the element-wise chain rule: ∂L/∂W_{ij}^(ℓ) = Σₖ ∂L/∂zₖ^(ℓ) · ∂zₖ^(ℓ)/∂W_{ij}^(ℓ).

Problem 4: A network has 3 linear+ReLU layers. The gradient ∂L/∂z^(3) = [0.5, −0.3]ᵀ, and z^(2) = [1, −2, 0]ᵀ. W^(3) = [[1,2,3],[−1,0,1]] (2×3). Compute δ^(2) = ∂L/∂z^(2).

Problem 5: Prove that if all activations are linear (f_ℓ(x) = x) and the loss is MSE, a deep network is equivalent to a single linear layer — the gradient of L w.r.t. the effective weight matrix can be computed without backprop through intermediate layers.

Answers

Problem 1:
∂L/∂w₁ = ∂L/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂w₁ = (ŷ − y) · w₂ · ReLU'(z₁) · x
Substituting ReLU', this is (ŷ − y)·w₂·x if z₁ > 0, and 0 if z₁ ≤ 0.

Problem 2:
Forward pass Wa: (d_out × d_in) × (d_in × B) costs O(d_out·d_in·B)
Backward pass Wᵀδ: (d_in × d_out) × (d_out × B) costs O(d_in·d_out·B)
Same asymptotic complexity. This is the key property of reverse-mode AD: the gradients of all parameters can be computed with the same asymptotic cost as the forward pass.

Problem 3:
Write zₖ^(ℓ) = Σₘ W_{km}^(ℓ) aₘ^(ℓ−1) + bₖ^(ℓ). The weight W_{ij}^(ℓ) appears only in the term with k = i and m = j, so
∂zₖ^(ℓ)/∂W_{ij}^(ℓ) = aⱼ^(ℓ−1) if k = i, else 0.
So ∂L/∂W_{ij}^(ℓ) = Σₖ δₖ^(ℓ) · (aⱼ^(ℓ−1) if k = i, else 0) = δᵢ^(ℓ) · aⱼ^(ℓ−1).
In matrix form: ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ ✓

Problem 4:
δ^(2) = ((W^(3))ᵀ δ^(3)) ⊙ ReLU'(z^(2))
      = ([[1,−1], [2,0], [3,1]] · [0.5, −0.3]ᵀ) ⊙ [ReLU'(1), ReLU'(−2), ReLU'(0)]ᵀ
      = [0.5+0.3, 1+0, 1.5−0.3]ᵀ ⊙ [1, 0, 0]ᵀ
      = [0.8, 1.0, 1.2]ᵀ ⊙ [1, 0, 0]ᵀ
      = [0.8, 0, 0]ᵀ
(using the convention ReLU'(0) = 0)

Problem 5:
With linear activations: a^(ℓ) = z^(ℓ) = W^(ℓ) a^(ℓ−1) + b^(ℓ). The entire network is
ŷ = W^(L)(...W^(2)(W^(1)x + b^(1)) + b^(2)...) + b^(L) = W_eff x + b_eff.
The gradient w.r.t. W_eff is 2(ŷ − y)xᵀ for L = ‖ŷ − y‖² (or (ŷ − y)xᵀ with the ½‖ŷ − y‖² convention). No backprop through individual layers is needed because every activation Jacobian is the identity, so the chain of layer Jacobians collapses into the single product W_eff = W^(L)···W^(1). The deep architecture adds no expressive power: a single linear layer can represent every function the deep linear network can.
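The arithmetic in the Problem 4 answer can be double-checked in a few lines of NumPy. The sketch uses the values given in the problem and the convention ReLU'(0) = 0.

```python
import numpy as np

W3 = np.array([[1.0, 2.0, 3.0],
               [-1.0, 0.0, 1.0]])        # W^(3), shape 2 x 3
delta3 = np.array([0.5, -0.3])           # dL/dz^(3)
z2 = np.array([1.0, -2.0, 0.0])          # z^(2)

relu_prime = (z2 > 0).astype(float)      # [1, 0, 0] with ReLU'(0) = 0
backprop_signal = W3.T @ delta3          # (W^(3))^T delta^(3)
delta2 = backprop_signal * relu_prime    # delta^(2)
```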

Summary

Backpropagation applies the chain rule through a computational graph from output to input, computing ∂L/∂W^(ℓ) and ∂L/∂b^(ℓ) for all layers in one backward pass.
The error signal δ^(ℓ) = ∂L/∂z^(ℓ) propagates via δ^(ℓ) = ((W^(ℓ+1))ᵀδ^(ℓ+1)) ⊙ f'(z^(ℓ)).
Parameter gradients are outer products: ∂L/∂W^(ℓ) = δ^(ℓ)(a^(ℓ−1))ᵀ and ∂L/∂b^(ℓ) = δ^(ℓ).
Backprop is reverse-mode automatic differentiation, computing gradients with ~2-3× the forward pass cost regardless of parameter count.
Gradient checking (finite differences) should be used to verify implementations: relative error < 10⁻⁷ is correct for double precision.

Pitfalls

Forgetting to apply the activation derivative in the backward recurrence: The full recurrence is δ^(ℓ) = ((W^(ℓ+1))ᵀ δ^(ℓ+1)) ⊙ f'(z^(ℓ)). Omitting f'(z^(ℓ)) treats every activation as linear, producing gradients that are too large and in the wrong direction. For ReLU, this means gradients flow through dead neurons. For sigmoid/tanh, this means gradients ignore saturation. The ⊙ with f' is NOT optional.
Misordering the chain rule when deriving parameter gradients: The gradient ∂L/∂W^(ℓ) = δ^(ℓ)(a^(ℓ−1))ᵀ is an outer product, not a(δ)ᵀ or element-wise multiplication. Getting the order wrong produces a matrix of the wrong shape (d_in × d_out instead of d_out × d_in) or completely wrong numerical values. Always verify shapes: ∂L/∂W^(ℓ) must have the same shape as W^(ℓ).
Computing bias gradients incorrectly for mini-batches: In batch mode, Z = WA + b broadcasts b across the batch dimension. The bias gradient must therefore sum the incoming δ values across the batch: ∂L/∂b = Σ_b δ_b (with the 1/M factor either applied here or already folded into δ by a batch-averaged loss, but not both). Using only one example's δ, or leaving the d_out × M matrix unreduced, gives a gradient with the wrong scale or the wrong shape. Most frameworks handle this automatically, but in custom implementations this is a common bug.
Confusing batch-averaged loss gradients with per-example gradients: If your loss is averaged over the batch (L = (1/M) Σ L_i), the gradient flowing into the network is already divided by M. When computing ∂L/∂W = (1/M) δ (a)ᵀ, don't divide by M again. A telltale sign: your gradients are smaller than expected by a factor of M and training is M× too slow.
Not gradient-checking after implementing custom backward passes: Even experienced practitioners make mistakes in hand-derived backward rules. For any custom autograd.Function or manual backprop implementation, compare against finite differences: compute (f(w+ε) − f(w−ε))/(2ε) for a few random parameter values with ε ≈ 10⁻⁵. A relative error < 10⁻⁶ indicates correct implementation; errors around 10⁻³ indicate a bug. Gradient checking is cheap insurance — it takes minutes and can save days of debugging.

Quiz

Q1: In backpropagation, what does δ^(ℓ) represent?

A) The activation of layer ℓ B) The weights of layer ℓ C) The gradient of the loss with respect to the pre-activation of layer ℓ D) The gradient of the loss with respect to the weights of layer ℓ

Answer and Explanations

Correct: C) The gradient of the loss with respect to the pre-activation of layer ℓ

δ^(ℓ) = ∂L/∂z^(ℓ). This is the "error signal" that is propagated backward through the network. It's the central quantity in backprop.
- A) That's a^(ℓ), not δ.
- B) That's W^(ℓ).
- C) ✓ Correct. δ^(ℓ) = ∂L/∂z^(ℓ) is the error at the pre-activation.
- D) That's ∂L/∂W^(ℓ) = δ^(ℓ) (a^(ℓ−1))ᵀ, derived FROM δ^(ℓ).

Q2: What is the gradient ∂L/∂W^(ℓ) in terms of δ^(ℓ) and a^(ℓ−1)?

A) (W^(ℓ))ᵀ δ^(ℓ) B) δ^(ℓ) (a^(ℓ−1))ᵀ C) (a^(ℓ−1))ᵀ δ^(ℓ) D) δ^(ℓ) ⊙ a^(ℓ−1)

Answer and Explanations

Correct: B) δ^(ℓ) (a^(ℓ−1))ᵀ

This is an outer product: (d_out × 1) × (1 × d_in) = (d_out × d_in), matching the shape of W^(ℓ). Each element is δᵢ^(ℓ) · aⱼ^(ℓ−1): how much changing weight W_{ij} affects the loss depends on the error at output i and the activation of input j.
- A) That's part of the backward recurrence for δ^(ℓ−1), not ∂L/∂W.
- B) ✓ Correct. Outer product of error and input activation.
- C) Wrong shape: this would give d_in × d_out instead of d_out × d_in.
- D) Element-wise product, not the correct gradient for a linear layer.

Q3: Why is backpropagation more efficient than computing each gradient independently via finite differences?

A) Backprop uses fewer bits of precision B) Backprop reuses intermediate results via the chain rule, computing all gradients in O(forward_cost) time C) Backprop doesn't require the forward pass D) Backprop only computes gradients for parameters that matter

Answer and Explanations

Correct: B) Backprop reuses intermediate results via the chain rule, computing all gradients in O(forward_cost) time

With P parameters, finite differences requires at least P+1 forward passes (one baseline plus one per parameter; the central differences used for gradient checking need 2P). Backprop computes ALL P gradients in ~2-3 forward passes' worth of computation. For a network with millions of parameters, this is the difference between feasible and impossible.
- A) Incorrect. Precision is unrelated to efficiency.
- B) ✓ Correct. Reverse-mode AD computes all gradients with cost proportional to one forward pass.
- C) Incorrect. Backprop REQUIRES the forward pass (to store activations).
- D) Incorrect. Backprop computes ALL gradients, not a subset.

Q4: For a linear layer z = Wa + b, what is ∂zᵢ/∂aⱼ?

A) δ_{ij} (Kronecker delta) B) W_{ij} C) W_{ji} D) It depends on the activation function

Answer and Explanations

Correct: B) W_{ij}

zᵢ = Σₖ W_{ik} aₖ + bᵢ, so ∂zᵢ/∂aⱼ = W_{ij} (only the j-th term in the sum depends on aⱼ). The Jacobian ∂z/∂a is exactly the weight matrix W.
- A) That would be true if zᵢ = aᵢ, which it isn't.
- B) ✓ Correct. The Jacobian of a linear transformation is the weight matrix itself.
- C) This is ∂zⱼ/∂aᵢ, which equals W_{ji}. Not the same as ∂zᵢ/∂aⱼ.
- D) The linear layer comes BEFORE the activation, so it doesn't depend on f.

Q5: When computing backprop through a ReLU activation a = max(0, z), what is the local gradient ∂a/∂z used in the backward pass?

A) Always 1 B) Always 0 C) 1 if z > 0, 0 if z < 0 D) σ(z)(1 − σ(z))

Answer and Explanations

Correct: C) 1 if z > 0, 0 if z < 0

ReLU'(z) = 1 for z > 0 and 0 for z < 0. At z = 0 ReLU is not differentiable; any value in [0, 1] is a valid subgradient, and in practice implementations simply pick one value, typically 0. This binary behavior is what causes "dying ReLU" when z ≤ 0 for all inputs.
- A) That would be linear activation, not ReLU.
- B) Always 0 means no gradient flows, so the network can't learn.
- C) ✓ Correct. The gradient is gated by the sign of the pre-activation.
- D) That's the derivative of sigmoid, not ReLU.

Next Steps

Move on to 16-06 — Gradient Flow in Deep Networks to understand why gradients vanish or explode in deep networks, and how to diagnose and mitigate these problems.
