16-05 β Backpropagation (Mathematics)
Phase: 16 β Neural Network Mathematics Subject: 16-05 Prerequisites: 16-01 to 16-04 (perceptron, activations, softmax, losses), Phase 4 (chain rule), Phase 6 (partial derivatives), Phase 9 (matrix calculus basics) Next subject: 16-06 β Gradient Flow in Deep Networks
Learning Objectives
By the end of this subject, you will be able to:
- Derive the backpropagation equations for a feedforward neural network using the chain rule
- Compute βL/βW^(β) and βL/βb^(β) for any layer β of a multi-layer perceptron
- Explain the "local gradient" concept and how backprop stores intermediate results to avoid recomputation
- Trace gradient flow through common layer types: linear, activation, and loss
- Understand why backpropagation is reverse-mode automatic differentiation and compute its computational complexity
Core Content
1. The Problem: How Do We Train a Deep Network?
A feedforward neural network with L layers computes:
a^(0) = x (input) z^(β) = W^(β)a^(ββ1) + b^(β) (pre-activation, layer β) a^(β) = f_β(z^(β)) (activation, layer β)
The final output is a^(L) = yΜ. We compute a loss L(yΜ, y).
To train the network, we need the gradient of the loss with respect to EVERY parameter (W^(β) and b^(β) for all β). With millions of parameters, computing each gradient individually via finite differences would be impossibly slow.
β οΈ THIS IS CRITICAL β Backpropagation is the algorithm that makes training deep networks computationally feasible. It computes all gradients in ONE forward pass and ONE backward pass, with the same computational cost (up to a constant factor) as the forward pass itself.
2. The Chain Rule on Computational Graphs
The key insight: the loss L is a composition of many functions. By the multivariate chain rule:
βL/βW^(β) = βL/βz^(β) Β· βz^(β)/βW^(β)
The first factor (βL/βz^(β)) depends on ALL layers after β. Backprop computes these "errors" working backward from the output.
Local gradients: Each layer only needs to know: 1. Its own local gradient: βz^(β)/βa^(ββ1), βz^(β)/βW^(β), βz^(β)/βb^(β) 2. The "error signal" coming from later layers: βL/βz^(β)
We define Ξ΄^(β) = βL/βz^(β) (the error at layer β's pre-activation). These are the quantities we propagate backward.
3. The Backpropagation Equations
For a network with L layers, loss L, and activation function f:
Output layer error (β = L):
Ξ΄^(L) = βL/βz^(L) = βL/βa^(L) β f'_L(z^(L))
where β denotes element-wise multiplication.
For CCE + softmax: βL/βz^(L) = yΜ β y (the clean gradient we derived in 16-04).
Backward recurrence (β = Lβ1, ..., 1):
Ξ΄^(β) = ((W^(β+1))α΅ Ξ΄^(β+1)) β f'_β(z^(β))
Parameter gradients:
βL/βW^(β) = Ξ΄^(β) (a^(ββ1))α΅ βL/βb^(β) = Ξ΄^(β)
4. Deriving the Backward Recurrence
Let's derive Ξ΄^(β) from Ξ΄^(β+1) step by step.
Step 1: Apply the chain rule through the activation.
a^(β) = f_β(z^(β)) βL/βz^(β) = βL/βa^(β) Β· βa^(β)/βz^(β)
The diagonal Jacobian of the activation: βaα΅’^(β)/βzβ±Ό^(β) = f'β(zα΅’^(β)) Β· Ξ΄{ij} So: βL/βz^(β) = βL/βa^(β) β f'_β(z^(β))
Step 2: Relate βL/βa^(β) to Ξ΄^(β+1).
z^(β+1) = W^(β+1) a^(β) + b^(β+1) βzα΅’^(β+1)/βaβ±Ό^(β) = W_{ij}^(β+1)
By the chain rule: βL/βaβ±Ό^(β) = Ξ£α΅’ βL/βzα΅’^(β+1) Β· βzα΅’^(β+1)/βaβ±Ό^(β) = Ξ£α΅’ Ξ΄α΅’^(β+1) Β· W_{ij}^(β+1) = ((W^(β+1))α΅ Ξ΄^(β+1))_j
Putting it together:
Ξ΄^(β) = ((W^(β+1))α΅ Ξ΄^(β+1)) β f'_β(z^(β)) β
5. Deriving Parameter Gradients
Weight gradient:
βzα΅’^(β)/βW_{jk}^(β) = { aβ^(ββ1) if i = j { 0 if i β j
Each weight W_{ij}^(β) affects only zα΅’^(β). By the chain rule:
βL/βW_{ij}^(β) = βL/βzα΅’^(β) Β· βzα΅’^(β)/βW_{ij}^(β) = Ξ΄α΅’^(β) Β· aβ±Ό^(ββ1)
In matrix form: βL/βW^(β) = Ξ΄^(β) (a^(ββ1))α΅ β
Bias gradient:
zα΅’^(β) = ... + bα΅’^(β), so βzα΅’^(β)/βbα΅’^(β) = 1. βL/βbα΅’^(β) = Ξ΄α΅’^(β) Β· 1 = Ξ΄α΅’^(β) βL/βb^(β) = Ξ΄^(β) β
6. Backpropagation Algorithm (Step by Step)
Forward pass: 1. Set a^(0) = x 2. For β = 1 to L: - z^(β) = W^(β) a^(ββ1) + b^(β) - a^(β) = f_β(z^(β)) 3. Compute L = loss(a^(L), y)
Backward pass: 4. Compute Ξ΄^(L) = βL/βz^(L) (using loss and activation derivatives) 5. For β = L down to 1: - βL/βW^(β) = Ξ΄^(β) (a^(ββ1))α΅ - βL/βb^(β) = Ξ΄^(β) - If β > 1: Ξ΄^(ββ1) = ((W^(β))α΅ Ξ΄^(β)) β f'_{ββ1}(z^(ββ1))
Parameter update: 6. For β = 1 to L: - W^(β) β W^(β) β Ξ· Β· βL/βW^(β) - b^(β) β b^(β) β Ξ· Β· βL/βb^(β)
7. Why "Back"-propagation?
The algorithm computes gradients from the OUTPUT backward toward the INPUT. This is reverse-mode automatic differentiation.
Consider the computation graph where each node is an operation. Forward-mode AD (starting from inputs) computes β(output)/β(input) efficiently when there are few inputs. Reverse-mode AD computes β(output)/β(ALL intermediate values) efficiently when there are few outputs β perfect for neural networks where we have ONE scalar output (the loss) and MANY parameters.
Computational complexity: Each backward pass has approximately the same number of operations as the forward pass (about 2-3Γ the forward pass cost). This is proven by the fact that each operation in the forward graph has a corresponding backward operation of similar complexity.
8. Handling Batches
In practice, we process mini-batches of M examples simultaneously. The forward pass processes a matrix X β β^{nΓM}:
Z^(β) = W^(β) A^(ββ1) + b^(β) (broadcast)
The parameter gradients become averages:
βL/βW^(β) = (1/M) Ξ^(β) (A^(ββ1))α΅ βL/βb^(β) = (1/M) Ξ£_{examples} Ξ΄^(β) (sum across batch dimension)
9. Gradient Checking
How do we verify our backprop implementation is correct? Gradient checking via finite differences:
βL/βW_{ij} β [L(W_{ij} + Ξ΅) β L(W_{ij} β Ξ΅)] / (2Ξ΅)
For Ξ΅ β 10β»β΄, the relative error between analytical and numerical gradients should be < 10β»β· for double precision (or < 10β»Β³ for float32). If the error is βΌ10β»Β³, you likely have a bug. If it's βΌ10β»β·, your backprop is correct.
Key Terms
- Backpropagation
- Gradient checking
- Parameter gradients
Worked Examples
Example 1: 2-Layer Network Backprop by Hand
Problem: A tiny network: input x = [1] (scalar), one hidden neuron, one output neuron. Parameters: wβ = 2, bβ = 0, wβ = 3, bβ = 1. Hidden activation: ReLU. Output activation: linear. Loss: MSE. Target: y = 10.
Compute all gradients by hand.
Solution:
Forward pass: zβ = wβΒ·x + bβ = 2Β·1 + 0 = 2 aβ = ReLU(2) = 2 zβ = wβΒ·aβ + bβ = 3Β·2 + 1 = 7 yΜ = zβ = 7 L = (7 β 10)Β² = 9
Backward pass: βL/βzβ = βL/βyΜ Β· βyΜ/βzβ = 2(7β10)Β·1 = β6
βL/βwβ = βL/βzβ Β· aβ = β6Β·2 = β12 βL/βbβ = βL/βzβ = β6
Ξ΄β = βL/βzβ = βL/βzβ Β· βzβ/βaβ Β· βaβ/βzβ = β6 Β· wβ Β· ReLU'(2) = β6 Β· 3 Β· 1 = β18
βL/βwβ = Ξ΄β Β· x = β18Β·1 = β18 βL/βbβ = Ξ΄β = β18
Verification: If we increase wβ slightly, zβ β aβ β zβ all increase, and yΜ gets closer to 10, so L decreases. The negative gradient βL/βwβ = β18 is correct β gradient descent will ADD to wβ (wβ β wβ β Ξ·Β·(β18) = wβ + 18Ξ·).
Example 2: Backprop Through Sigmoid + BCE
Problem: For a single neuron with sigmoid activation and BCE loss, x = [2], w = 1, b = β1, y = 0, compute βL/βw.
Solution:
Forward: z = wΒ·x + b = 1Β·2 β 1 = 1 yΜ = Ο(1) β 0.7311 L = β[0Β·log(0.7311) + 1Β·log(0.2689)] = βlog(0.2689) β 1.313
Backward (using clean BCE+sigmoid gradient): βL/βz = yΜ β y = 0.7311 β 0 = 0.7311 βL/βw = βL/βz Β· βz/βw = 0.7311 Β· 2 = 1.4622 βL/βb = βL/βz Β· 1 = 0.7311
Example 3: Matrix Shape Analysis
Problem: A layer has 64 inputs and 128 outputs. Batch size is 32. Determine the shapes of W, a^(ββ1), z^(β), Ξ΄^(β), and βL/βW^(β).
Solution:
W^(β): 128 Γ 64 (output_dim Γ input_dim) a^(ββ1): 64 Γ 32 (input_dim Γ batch_size) z^(β) = Wa: 128 Γ 32 Ξ΄^(β) = βL/βz^(β): 128 Γ 32 (same shape as z) βL/βW^(β) = Ξ΄^(β) (a^(ββ1))α΅: (128Γ32) Γ (32Γ64) = 128 Γ 64 β (same shape as W)
Practice Problems
(Answers are below. Try each problem before checking.)
Problem 1: For the network x β zβ=wβx+bβ β aβ=ReLU(zβ) β zβ=wβaβ+bβ β yΜ=Ο(zβ), derive the full expression for βL/βwβ when using BCE loss.
Problem 2: A linear layer has W β β^{d_{\text{out}}Γd_{\text{in}}}. Show that the backward computation (W)α΅ Ξ΄ has complexity O(d_{\text{out}}Β·d_{\text{in}}Β·batch_size). Express this in terms of the forward pass complexity.
Problem 3: In backprop, we compute βL/βW^(β) = Ξ΄^(β) (a^(ββ1))α΅. Derive this from the element-wise chain rule: βL/βW_{ij}^(β) = Ξ£β βL/βzβ^(β) Β· βzβ^(β)/βW_{ij}^(β).
Problem 4: A network has 3 linear+ReLU layers. The gradient βL/βz^(3) = [0.5, β0.3]α΅, and z^(2) = [1, β2, 0]α΅. W^(3) = [[1,2,3],[β1,0,1]] (2Γ3). Compute Ξ΄^(2) = βL/βz^(2).
Problem 5: Prove that if all activations are linear (f_β(x) = x) and the loss is MSE, a deep network is equivalent to a single linear layer β the gradient of L w.r.t. the effective weight matrix can be computed without backprop through intermediate layers.
Answers (click to expand)
**Problem 1:** βL/βwβ = βL/βzβ Β· βzβ/βaβ Β· βaβ/βzβ Β· βzβ/βwβ = (yΜ β y) Β· wβ Β· ReLU'(zβ) Β· x With ReLU', this is (yΜ β y)Β·wβΒ·x if zβ > 0, and 0 if zβ β€ 0. **Problem 2:** Forward pass **W****a**: (d_out Γ d_in) Γ (d_in Γ B) = O(d_outΒ·d_inΒ·B) Backward pass **W**α΅**Ξ΄**: (d_in Γ d_out) Γ (d_out Γ B) = O(d_inΒ·d_outΒ·B) Same asymptotic complexity! This is the key property of reverse-mode AD: gradient of any parameter can be computed with the same asymptotic cost as the forward pass. **Problem 3:** zβ^(β) = Ξ£β±Ό W_{kj}^(β) aβ±Ό^(ββ1) + bβ^(β) βzβ^(β)/βW_{ij}^(β) = β/βW_{ij}^(β)[Ξ£β±Ό W_{kj}^(β) aβ±Ό^(ββ1)] = { aβ±Ό^(ββ1) if k=i, j=j; 0 otherwise Actually more precisely: βzβ^(β)/βW_{ij}^(β) = aβ±Ό^(ββ1) if k=i, else 0. So βL/βW_{ij}^(β) = Ξ£β Ξ΄β Β· (aβ±Ό if k=i, else 0) = Ξ΄α΅’ Β· aβ±Ό. In matrix form: βL/β**W** = **Ξ΄** (**a**)α΅ β **Problem 4:** **Ξ΄**^(2) = ((**W**^(3))α΅ **Ξ΄**^(3)) β ReLU'(**z**^(2)) = ([ [1,β1], [2,0], [3,1] ] Β· [0.5, β0.3]α΅) β [ReLU'(1), ReLU'(β2), ReLU'(0)] = ([0.5+0.3, 1+0, 1.5β0.3]α΅) β [1, 0, 0] = [0.8, 1.0, 1.2] β [1, 0, 0] = [0.8, 0, 0]α΅ **Problem 5:** With linear activations: **a**^(β) = **z**^(β) = **W**^(β)**a**^(ββ1) + **b**^(β). The entire network: **yΜ** = **W**^(L)(...**W**^(2)(**W**^(1)**x**+**b**^(1))+**b**^(2)...) + **b**^(L) = **W**_eff **x** + **b**_eff. The gradient w.r.t. **W**_eff can be computed as (yΜβy)**x**α΅ (for MSE). No backprop through individual layers is needed because the chain rule simplifies β all intermediate Jacobians are identity. The deep architecture is redundant; a single linear layer is equally expressive.Summary
- Backpropagation applies the chain rule through a computational graph from output to input, computing βL/βW^(β) and βL/βb^(β) for all layers in one backward pass.
- The error signal Ξ΄^(β) = βL/βz^(β) propagates via Ξ΄^(β) = ((W^(β+1))α΅Ξ΄^(β+1)) β f'(z^(β)).
- Parameter gradients are outer products: βL/βW^(β) = Ξ΄^(β)(a^(ββ1))α΅ and βL/βb^(β) = Ξ΄^(β).
- Backprop is reverse-mode automatic differentiation, computing gradients with ~2-3Γ the forward pass cost regardless of parameter count.
- Gradient checking (finite differences) should be used to verify implementations: relative error < 10β»β· is correct for double precision.
Pitfalls
- Forgetting to apply the activation derivative in the backward recurrence: The full recurrence is Ξ΄^(β) = ((W^(β+1))α΅ Ξ΄^(β+1)) β f'(z^(β)). Omitting f'(z^(β)) treats every activation as linear, producing gradients that are too large and in the wrong direction. For ReLU, this means gradients flow through dead neurons. For sigmoid/tanh, this means gradients ignore saturation. The β with f' is NOT optional.
- Misordering the chain rule when deriving parameter gradients: The gradient βL/βW^(β) = Ξ΄^(β)(a^(ββ1))α΅ is an outer product, not a(Ξ΄)α΅ or element-wise multiplication. Getting the order wrong produces a matrix of the wrong shape (d_in Γ d_out instead of d_out Γ d_in) or completely wrong numerical values. Always verify shapes: βL/βW^(β) must have the same shape as W^(β).
- Computing bias gradients incorrectly for mini-batches: In batch mode, Z = WA + b broadcasts b across the batch dimension. The bias gradient must sum incoming Ξ΄ values across the batch: βL/βb = Ξ£_b Ξ΄_b. Simply setting it to Ξ΄^(β) without summation averages over only one example, giving a gradient that's too small by a factor of batch_size. Most frameworks handle this automatically, but in custom implementations this is a common bug.
- Confusing batch-averaged loss gradients with per-example gradients: If your loss is averaged over the batch (L = (1/M) Ξ£ L_i), the gradient flowing into the network is already divided by M. When computing βL/βW = (1/M) Ξ΄ (a)α΅, don't divide by M again. A telltale sign: your gradients are smaller than expected by a factor of M and training is MΓ too slow.
- Not gradient-checking after implementing custom backward passes: Even experienced practitioners make mistakes in hand-derived backward rules. For any custom
autograd.Functionor manual backprop implementation, compare against finite differences: compute (f(w+Ξ΅) β f(wβΞ΅))/(2Ξ΅) for a few random parameter values with Ξ΅ β 10β»β΅. A relative error < 10β»βΆ indicates correct implementation; errors around 10β»Β³ indicate a bug. Gradient checking is cheap insurance β it takes minutes and can save days of debugging.
Quiz
Q1: In backpropagation, what does Ξ΄^(β) represent?
A) The activation of layer β B) The weights of layer β C) The gradient of the loss with respect to the pre-activation of layer β D) The gradient of the loss with respect to the weights of layer β
Answer and Explanations
**Correct: C) The gradient of the loss with respect to the pre-activation of layer β** **Ξ΄**^(β) = βL/β**z**^(β). This is the "error signal" that is propagated backward through the network. It's the central quantity in backprop. - A) That's **a**^(β), not Ξ΄. - B) That's **W**^(β). - C) β Correct. Ξ΄^(β) = βL/βz^(β) is the error at the pre-activation. - D) That's βL/βW^(β) = Ξ΄^(β) (a^(ββ1))α΅, derived FROM Ξ΄^(β).Q2: What is the gradient βL/βW^(β) in terms of Ξ΄^(β) and a^(ββ1)?
A) (W^(β))α΅ Ξ΄^(β) B) Ξ΄^(β) (a^(ββ1))α΅ C) (a^(ββ1))α΅ Ξ΄^(β) D) Ξ΄^(β) β a^(ββ1)
Answer and Explanations
**Correct: B) Ξ΄^(β) (a^(ββ1))α΅** This is an outer product: (d_out Γ 1) Γ (1 Γ d_in) = (d_out Γ d_in), matching the shape of **W**^(β). Each element is Ξ΄α΅’^(β) Β· aβ±Ό^(ββ1) β how much changing weight W_{ij} affects the loss depends on the error at output i and the activation of input j. - A) That's part of the backward recurrence for Ξ΄^(ββ1), not βL/βW. - B) β Correct. Outer product of error and input activation. - C) Wrong shape β would give d_in Γ d_out instead of d_out Γ d_in. - D) Element-wise product, not the correct gradient for a linear layer.Q3: Why is backpropagation more efficient than computing each gradient independently via finite differences?
A) Backprop uses fewer bits of precision B) Backprop reuses intermediate results via the chain rule, computing all gradients in O(forward_cost) time C) Backprop doesn't require the forward pass D) Backprop only computes gradients for parameters that matter
Answer and Explanations
**Correct: B) Backprop reuses intermediate results via the chain rule, computing all gradients in O(forward_cost) time** With P parameters, finite differences requires P+1 forward passes. Backprop computes ALL P gradients in ~2-3 forward passes worth of computation. For a network with millions of parameters, this is the difference between feasible and impossible. - A) Incorrect. Precision is unrelated to efficiency. - B) β Correct. Reverse-mode AD computes all gradients with cost proportional to one forward pass. - C) Incorrect. Backprop REQUIRES the forward pass (to store activations). - D) Incorrect. Backprop computes ALL gradients, not a subset.Q4: For a linear layer z = Wa + b, what is βzα΅’/βaβ±Ό?
A) Ξ΄_{ij} (Kronecker delta) B) W_{ij} C) W_{ji} D) It depends on the activation function
Answer and Explanations
**Correct: B) W_{ij}** zα΅’ = Ξ£β W_{iβ} aβ + bα΅’. βzα΅’/βaβ±Ό = W_{ij} (only the j-th term in the sum depends on aβ±Ό). The Jacobian β**z**/β**a** is exactly the weight matrix **W**. - A) That would be true if zα΅’ = aα΅’, which it isn't. - B) β Correct. The Jacobian of a linear transformation is the weight matrix itself. - C) This is βzβ±Ό/βaα΅’, which equals W_{ji}. Not the same as βzα΅’/βaβ±Ό. - D) The linear layer comes BEFORE the activation, so it doesn't depend on f.Q5: When computing backprop through a ReLU activation a = max(0, z), what is the local gradient βa/βz used in the backward pass?
A) Always 1 B) Always 0 C) 1 if z > 0, 0 if z < 0 D) Ο(z)(1 β Ο(z))
Answer and Explanations
**Correct: C) 1 if z > 0, 0 if z < 0** ReLU'(z) = 1 for z > 0, 0 for z < 0. At z = 0, the subgradient is [0, 1] β in practice, frameworks use either 0 or define it as a non-linearity. This binary behavior is what causes "dying ReLU" when z β€ 0 for all inputs. - A) That would be linear activation, not ReLU. - B) Always 0 means no gradient flows β the network can't learn. - C) β Correct. The gradient is gated by the sign of the pre-activation. - D) That's the derivative of sigmoid, not ReLU.Next Steps
Move on to 16-06 β Gradient Flow in Deep Networks to understand why gradients vanish or explode in deep networks, and how to diagnose and mitigate these problems.